Optimizers

After constructing your deep learning model, you adjust its weight parameters to improve its performance.

Optimizers take the role of deciding "when", "how", and "how much" to update the weight parameters.

Available optimizers

  • SGD
  • Adagrad
  • RMSprop
  • Adadelta
  • Adam
  • Adamax
  • Nadam

learning rate

All Kerasy optimizers take a learning_rate (lr) argument, a hyperparameter with a small positive value, typically in the range between 0.0 and 1.0.

The learning rate controls how quickly the model learns. Let's see an example below (the optimizer is the simplest one, gradient descent).

In [1]:
import numpy as np
import matplotlib.pyplot as plt
from kerasy.optimizers import GradientDescent
In [2]:
# Parameters
rnd = np.random.RandomState(0)
xmin, xmax = (0,1)
num_samples = 100
num_val_samples = 1000
epochs = 100
$$ y_{\text{train}} = 2x_{\text{train}} + 1 + \underset{\sim\mathcal{N}\left(0,\,0.1^2\right)}{\text{noise}} $$
In [3]:
# Training Data / Initial weights.
X_train = np.c_[
    rnd.uniform(xmin, xmax, size=(num_samples,1)),
    np.ones(shape=(num_samples,1))
]
coeff = np.asarray([2,1]).reshape(-1,1)
y_train = X_train.dot(coeff) + rnd.normal(loc=0.0, scale=1/10, size=(num_samples,1))
Xs = np.c_[
    np.linspace(xmin,xmax,num_val_samples).reshape(-1,1),
    np.ones(shape=(num_val_samples,1))
]
w = rnd.randn(2,1) # shape=(2,1)
In [4]:
fig = plt.figure(figsize=(14,4))
for i,lr in enumerate([0.01, 0.1, 0.5]):
    ax = fig.add_subplot(1, 3, i+1)
    ax.scatter(X_train[:,0], y_train, label="data")

    # Initialize the optimizer & weights.
    opt = GradientDescent(learning_rate=lr)
    w_ = w.copy()
    for epoch in range(epochs):
        grad = (1/num_samples) * 2 * X_train.T.dot(X_train.dot(w_) - y_train)
        w_ = opt.get_updates(grad, w_, f"sample{i}")

        if (epoch<10):
            y_pred = Xs.dot(w_)
            ax.plot(Xs[:,0],y_pred,color="blue",alpha=0.1*(epoch+1))

    # After 100 epochs.
    y_pred = Xs.dot(w_)
    ax.plot(Xs[:,0],y_pred,color="red",label="prediction")
    ax.set_xlabel("x"), ax.set_ylabel("y"), ax.set_title(f"learning_rate={lr}")
    ax.legend()

plt.tight_layout()
plt.show()

From the figure above, we can see that it is important to find a good learning rate for your model and training dataset.

Notation

$$\text{(gradient) }\ g_{i}^{t} :=\frac{\partial E}{\partial w_{i}}\left(\mathbf{w}^{t-1}\right)$$

SGD

class kerasy.optimizers.SGD(
    learning_rate=0.01, momentum=0., nesterov=False, **kwargs
)

SGD (Stochastic Gradient Descent) is the basic gradient descent method, with optional (Nesterov) momentum.

$$ \begin{aligned} v_{i}^{t} &= \text{momentum}\ast v_{i}^{t-1} - \text{lr}\ast g_{i}^{t} \quad \left(v_i^0=0\right)\\ w_{i}^{t+1} &= \begin{cases} w_{i}^{t} + v_{i}^{t} & (\text{vanilla momentum})\\ w_{i}^{t} + \text{momentum}\ast v_{i}^{t} - \text{lr}\ast g_{i}^{t} & (\text{nesterov momentum})\\ \end{cases} \end{aligned} $$
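
A minimal NumPy-style sketch of this update rule (an illustration of the equations above, not kerasy's actual implementation; the function name and state handling are assumptions):

def sgd_update(w, grad, v, learning_rate=0.01, momentum=0.0, nesterov=False):
    # Velocity update (v starts as zeros with the same shape as w).
    v = momentum * v - learning_rate * grad
    if nesterov:
        # Nesterov momentum: step with the updated velocity plus a gradient correction.
        w = w + momentum * v - learning_rate * grad
    else:
        # Vanilla momentum.
        w = w + v
    return w, v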

Arguments

  • learning_rate : The learning rate.
  • momentum : Accelerates gradient descent in the relevant direction and dampens oscillations.
  • nesterov : Whether to apply Nesterov momentum.

Adagrad

class kerasy.optimizers.Adagrad(
    learning_rate=0.01, **kwargs
)

Adagrad is an optimizer with parameter-specific learning rates, which are adapted relative to how frequently a parameter gets updated during training. "The more updates a parameter receives, the smaller the updates."

This idea sounds great, but it has a drawback: once a parameter moves through a steep slope, the effective learning rate along that axis shrinks for good.

$$ \begin{aligned} v_{i}^{t} &=v_{i}^{t-1}+\left(g_{i}^{t}\right)^{2} \quad \left(v_i^0=\mathbf{0}\right)\\ w_{i}^{t+1} &=w_{i}^{t}-\frac{\eta}{\sqrt{v_{i}^{t}+\varepsilon}} \left(g_{i}^{t}\right) \end{aligned} $$
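
A minimal NumPy sketch of the Adagrad rule above (illustrative only; the accumulator v starts at zero and only ever grows):

import numpy as np

def adagrad_update(w, grad, v, learning_rate=0.01, eps=1e-7):
    # Accumulate squared gradients (v is monotonically non-decreasing).
    v = v + grad**2
    # Per-parameter step: frequently-updated parameters get smaller steps.
    w = w - learning_rate * grad / np.sqrt(v + eps)
    return w, v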

Arguments

  • learning_rate : The learning rate.

RMSprop

class kerasy.optimizers.RMSprop(
    learning_rate=0.001, rho=0.9, **kwargs
)

With Adagrad, as training runs for many iterations the effective learning rate becomes smaller and smaller, and eventually the parameters are hardly updated at all.

Therefore, RMSprop does not accumulate all past gradients; it forgets them gradually through an exponential moving average.

$$ \begin{aligned} v_{i}^{t} &=\gamma v_{i}^{t-1}+(1-\gamma)\left(g_{i}^{t}\right)^{2} & \left(v_i^0=0\right)\\ w_{i}^{t+1} &=w_{i}^{t}-\frac{\eta}{\sqrt{v_{i}^{t}+\varepsilon}} \left(g_{i}^{t}\right) \end{aligned} $$

Expanding the $v_i^t$ update rules of Adagrad and RMSprop makes the difference clear:

$$ \begin{cases} \begin{aligned} v_{i}^{t}&=\left(g_{i}^{t}\right)^{2}+\left(g_{i}^{t-1}\right)^{2}+\left(g_{i}^{t-2}\right)^{2}+\cdots & \text{(Adagrad)}\\ v_{i}^{t}&=(1-\gamma) \sum_{l=1}^{t} \gamma^{t-l}\left(g_{i}^{l}\right)^{2} & \text{(RMSprop)} \end{aligned} \end{cases} $$
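
A minimal NumPy sketch of the RMSprop rule above (illustrative; rho plays the role of $\gamma$):

import numpy as np

def rmsprop_update(w, grad, v, learning_rate=0.001, rho=0.9, eps=1e-7):
    # Exponential moving average of squared gradients: old gradients fade out.
    v = rho * v + (1 - rho) * grad**2
    w = w - learning_rate * grad / np.sqrt(v + eps)
    return w, v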

Arguments

  • learning_rate : The learning rate.
  • rho : Discounting factor for the moving average of squared gradients.

Adadelta

class kerasy.optimizers.Adadelta(
    learning_rate=1.0, rho=0.95, **kwargs
)

Adadelta is a more robust extension of Adagrad that adapts learning rates based on a moving window of gradient updates, instead of accumulating all past gradients.

$$ \begin{aligned} v_{i}^{t} &=\gamma v_{i}^{t-1}+(1-\gamma)\left(g_{i}^{t}\right)^{2} & \left(v_i^0=0\right) \\ s_{i}^{t} &=\gamma s_{i}^{t-1}+(1-\gamma)\left(\Delta w_{i}^{t-1}\right)^{2} & \left(s_i^0=0\right) \\ \Delta w_{i}^{t} &=-\frac{\sqrt{s_{i}^{t}+\epsilon}}{\sqrt{v_{i}^{t}+\epsilon}} g_{i}^{t} \\ w_{i}^{t+1} &=w_{i}^{t}+\Delta w_{i}^{t} \end{aligned} $$
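
A minimal NumPy sketch of the Adadelta rule above (illustrative; s is updated after the step, so the s passed in already contains the previous $\Delta w$, and scaling by learning_rate is an assumption based on the default of 1.0):

import numpy as np

def adadelta_update(w, grad, v, s, learning_rate=1.0, rho=0.95, eps=1e-6):
    # Running average of squared gradients.
    v = rho * v + (1 - rho) * grad**2
    # Step size is the ratio of the two RMS values.
    delta = -np.sqrt(s + eps) / np.sqrt(v + eps) * grad
    # Running average of squared updates, used at the next step.
    s = rho * s + (1 - rho) * delta**2
    return w + learning_rate * delta, v, s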

Arguments

  • learning_rate : The learning rate.
  • rho : The decay rate.

Adam

class kerasy.optimizers.Adam(
    learning_rate=0.001, beta_1=0.9, beta_2=0.999,
    amsgrad=False, **kwargs
)

Adam is a combination of the following two methods.

  • Momentum, which uses the first moment of the gradient ($m$)
  • RMSprop-like accumulation of the squared gradient ($v$)
$$ \begin{aligned} m_{i}^{t} &=\beta_{1} m_{i}^{t-1}+\left(1-\beta_{1}\right) g_{i}^{t} & \left(m_i^0=0\right)\\ v_{i}^{t} &=\beta_{2} v_{i}^{t-1}+\left(1-\beta_{2}\right)\left(g_{i}^{t}\right)^{2} & \left(v_i^0=0\right)\\ \hat{m}_{i}^{t} &=\frac{m_{i}^{t}}{1-\beta_{1}^{t}} \\ \hat{v}_{i}^{t} &=\frac{v_{i}^{t}}{1-\beta_{2}^{t}} \\ w_{i}^{t+1} &=w_{i}^{t}-\frac{\eta}{\sqrt{\hat{v}_{i}^{t}}+\varepsilon}\hat{m}_{i}^{t} \end{aligned} $$
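
A minimal NumPy sketch of the Adam rule above (illustrative; t is the 1-based step count used for bias correction, and the AMSGrad variant is omitted):

import numpy as np

def adam_update(w, grad, m, v, t, learning_rate=0.001,
                beta_1=0.9, beta_2=0.999, eps=1e-7):
    # First moment (momentum-like) and second moment (RMSprop-like) estimates.
    m = beta_1 * m + (1 - beta_1) * grad
    v = beta_2 * v + (1 - beta_2) * grad**2
    # Bias correction compensates for the zero initialization of m and v.
    m_hat = m / (1 - beta_1**t)
    v_hat = v / (1 - beta_2**t)
    w = w - learning_rate * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v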

Arguments

  • learning_rate : The learning rate.
  • beta_1 : The exponential decay rate for the 1st moment estimates.
  • beta_2 : The exponential decay rate for the 2nd moment estimates.
  • amsgrad : Whether to apply AMSGrad variant of this algorithm.

Adamax

class kerasy.optimizers.Adamax(
    learning_rate=0.002, beta_1=0.9, beta_2=0.999, **kwargs
)

Adamax is a variant of Adam based on the infinity norm ($\beta_2\rightarrow\beta_2^p, p\rightarrow\infty$).

Since norms with a large $p$ become numerically unstable, the $L_1$ and $L_2$ norms ($p=1,2$) are generally preferred; however, the $L_\infty$ norm is also known to show stable behavior.

$$ \begin{aligned} v_i^{t} &=\beta_{2}^{\infty} v_i^{t-1}+\left(1-\beta_{2}^{\infty}\right)\left|g_i^{t}\right|^{\infty} \\ &=\max \left(\beta_{2} \cdot v_i^{t-1},\left|g_i^{t}\right|\right)\\ w_{i}^{t+1} &=w_{i}^{t}-\frac{\eta}{v_i^t}\hat{m}_{i}^{t} \end{aligned} $$
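
A minimal NumPy sketch of the Adamax rule above (illustrative; the max replaces the squared-gradient average, $\hat{m}_i^t$ is the bias-corrected first moment from Adam, and a small eps is added for numerical safety):

import numpy as np

def adamax_update(w, grad, m, u, t, learning_rate=0.002,
                  beta_1=0.9, beta_2=0.999, eps=1e-7):
    # Bias-corrected first moment, as in Adam.
    m = beta_1 * m + (1 - beta_1) * grad
    m_hat = m / (1 - beta_1**t)
    # Infinity-norm accumulator: keep the larger of the decayed history and |grad|.
    u = np.maximum(beta_2 * u, np.abs(grad))
    w = w - learning_rate * m_hat / (u + eps)
    return w, m, u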

Arguments

  • learning_rate : The learning rate.
  • beta_1 : The exponential decay rate for the 1st moment estimates.
  • beta_2 : The exponential decay rate for the 2nd moment estimates.

Nadam

class kerasy.optimizers.Nadam(
    learning_rate=0.001, beta_1=0.9, beta_2=0.999, **kwargs
)

Nadam is Adam with Nesterov momentum (Nesterov's Accelerated Gradient method), i.e. a combination of the following two methods.

  • Nesterov momentum, which uses the first moment of the gradient ($m$)
  • RMSprop-like accumulation of the squared gradient ($v$)
Momentum:

$$\begin{aligned}\mathbf{g}_t &\leftarrow \nabla_{\theta_{t-1}}f\left(\theta_{t-1}\right)\\ \mathbf{m}_t&\leftarrow \mu\mathbf{m}_{t-1} + \mathbf{g}_t\\\theta_{t}&\leftarrow \theta_{t-1} - \eta\mathbf{m}_t\end{aligned}$$

Nesterov's accelerated gradient:

$$\begin{aligned}\mathbf{g}_t &\leftarrow \nabla_{\theta_{t-1}}f\left(\theta_{t-1}-\eta\mu\mathbf{m}_{t-1}\right)\\\mathbf{m}_t &\leftarrow \mu\mathbf{m}_{t-1} + \mathbf{g}_t\\\theta_{t}&\leftarrow \theta_{t-1} - \eta\mathbf{m}_t \end{aligned}$$
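
A minimal sketch contrasting the two update rules above (illustrative; grad_fn is an assumed function returning $\nabla f(\theta)$):

def momentum_step(theta, m, grad_fn, lr, mu):
    # Classical momentum: evaluate the gradient at the current parameters.
    g = grad_fn(theta)
    m = mu * m + g
    return theta - lr * m, m

def nesterov_step(theta, m, grad_fn, lr, mu):
    # Nesterov momentum: evaluate the gradient at the look-ahead point.
    g = grad_fn(theta - lr * mu * m)
    m = mu * m + g
    return theta - lr * m, m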

Arguments

  • learning_rate : The learning rate.
  • beta_1 : The exponential decay rate for the 1st moment estimates.
  • beta_2 : The exponential decay rate for the 2nd moment estimates.