Optimizers

After constructing your deep learning model, you adjust its weight parameters to improve its performance.

Optimizers take the role of deciding "when", "how", and "how much" to update the weight parameters.

Available optimizers

  • SGD
  • Adagrad
  • RMSprop
  • Adadelta
  • Adam
  • Adamax
  • Nadam

learning rate

All Kerasy optimizers take a learning_rate (lr) argument, a hyperparameter with a small positive value, typically in the range between 0.0 and 1.0.

The learning rate controls how quickly the model learns. Let's see an example below (the optimizer is the simplest one, gradient descent).

In [1]:
import numpy as np
import matplotlib.pyplot as plt
from kerasy.optimizers import GradientDescent
In [2]:
# Parameters
rnd = np.random.RandomState(0)
xmin, xmax = (0,1)
num_samples = 100
num_val_samples = 1000
epochs = 100
$$ y_{\text{train}} = 2x_{\text{train}} + 1 + \underset{\sim\mathcal{N}\left(0,\,0.1^2\right)}{\text{noise}} $$
In [3]:
# Training Data / Initial weights.
X_train = np.c_[
    rnd.uniform(xmin, xmax, size=(num_samples,1)),
    np.ones(shape=(num_samples,1))
]
coeff = np.asarray([2,1]).reshape(-1,1)
y_train = X_train.dot(coeff) + rnd.normal(loc=0.0, scale=1/10, size=(num_samples,1))
Xs = np.c_[
    np.linspace(xmin,xmax,num_val_samples).reshape(-1,1),
    np.ones(shape=(num_val_samples,1))
]
w = rnd.randn(2,1) # shape=(2,1)
In [4]:
fig = plt.figure(figsize=(14,4))
for i,lr in enumerate([0.01, 0.1, 0.5]):
    ax = fig.add_subplot(1, 3, i+1)
    ax.scatter(X_train[:,0], y_train, label="data")

    # Initialize the optimizer & weights.
    opt = GradientDescent(learning_rate=lr)
    w_ = w.copy()
    for epoch in range(epochs):
        grad = (1/num_samples) * 2 * X_train.T.dot(X_train.dot(w_) - y_train)
        w_ = opt.get_updates(grad, w_, f"sample{i}")

        if (epoch<10):
            y_pred = Xs.dot(w_)
            ax.plot(Xs[:,0],y_pred,color="blue",alpha=0.1*(epoch+1))

    # After 100 epochs.
    y_pred = Xs.dot(w_)
    ax.plot(Xs[:,0],y_pred,color="red",label="prediction")
    ax.set_xlabel("x"), ax.set_ylabel("y"), ax.set_title(f"learning_rate={lr}")
    ax.legend()

plt.tight_layout()
plt.show()

From the figure above, we can see that it is important to find a good learning rate for your model and training dataset.

Notation

$$\text{(gradient) }\ g_{i}^{t} :=\frac{\partial E}{\partial w_{i}}\left(\mathbf{w}^{t-1}\right)$$

SGD

class kerasy.optimizers.SGD(
    learning_rate=0.01, momentum=0., nesterov=False, **kwargs
)

SGD (Stochastic Gradient Descent) is the basic gradient descent method, with optional (Nesterov) momentum.

$$ \begin{aligned} v_{i}^{t} &= \text{momentum}\ast v_{i}^{t-1} - \text{lr}\ast g_{i}^{t} \quad \left(v_i^0=0\right)\\ w_{i}^{t+1} &= \begin{cases} w_{i}^{t} + v_{i}^{t} & (\text{vanilla momentum})\\ w_{i}^{t} + \text{momentum}\ast v_{i}^{t} - \text{lr}\ast g_{i}^{t} & (\text{nesterov momentum})\\ \end{cases} \end{aligned} $$
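
A minimal NumPy-style sketch of this update rule (an illustration of the equations above, not kerasy's actual implementation; the function name and state handling are assumptions):

def sgd_update(w, grad, v, learning_rate=0.01, momentum=0.0, nesterov=False):
    # Velocity update (v starts as zeros with the same shape as w).
    v = momentum * v - learning_rate * grad
    if nesterov:
        # Nesterov momentum: step with the updated velocity plus a gradient correction.
        w = w + momentum * v - learning_rate * grad
    else:
        # Vanilla momentum.
        w = w + v
    return w, v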

Arguments

  • learning_rate : The learning rate.
  • momentum : Accelerates gradient descent in the relevant direction and dampens oscillations.
  • nesterov : Whether to apply Nesterov momentum.

Adagrad

class kerasy.optimizers.Adagrad(
    learning_rate=0.01, **kwargs
)

Adagrad is an optimizer with parameter-specific learning rates, which are adapted relative to how frequently a parameter gets updated during training. "The more updates a parameter receives, the smaller the updates."

This idea sounds great, but it has a drawback: once a parameter moves through a steep slope, the effective learning rate along that axis shrinks for good.

$$ \begin{aligned} v_{i}^{t} &=v_{i}^{t-1}+\left(g_{i}^{t}\right)^{2} \quad \left(v_i^0=\mathbf{0}\right)\\ w_{i}^{t+1} &=w_{i}^{t}-\frac{\eta}{\sqrt{v_{i}^{t}+\varepsilon}} \left(g_{i}^{t}\right) \end{aligned} $$
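
A minimal NumPy sketch of the Adagrad rule above (illustrative only; the accumulator v starts at zero and only ever grows):

import numpy as np

def adagrad_update(w, grad, v, learning_rate=0.01, eps=1e-7):
    # Accumulate squared gradients (v is monotonically non-decreasing).
    v = v + grad**2
    # Per-parameter step: frequently-updated parameters get smaller steps.
    w = w - learning_rate * grad / np.sqrt(v + eps)
    return w, v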

Arguments

  • learning_rate : The learning rate.

RMSprop

class kerasy.optimizers.RMSprop(
    learning_rate=0.001, rho=0.9, **kwargs
)

With Adagrad, as training runs for many iterations the effective learning rate becomes smaller and smaller, and eventually the parameters are hardly updated at all.

Therefore, RMSprop does not accumulate all past gradients; it forgets them gradually through an exponential moving average.

$$ \begin{aligned} v_{i}^{t} &=\gamma v_{i}^{t-1}+(1-\gamma)\left(g_{i}^{t}\right)^{2} & \left(v_i^0=0\right)\\ w_{i}^{t+1} &=w_{i}^{t}-\frac{\eta}{\sqrt{v_{i}^{t}+\varepsilon}} \left(g_{i}^{t}\right) \end{aligned} $$

Expanding the $v_i^t$ update rules of Adagrad and RMSprop makes the difference clear:

$$ \begin{cases} \begin{aligned} v_{i}^{t}&=\left(g_{i}^{t}\right)^{2}+\left(g_{i}^{t-1}\right)^{2}+\left(g_{i}^{t-2}\right)^{2}+\cdots & \text{(Adagrad)}\\ v_{i}^{t}&=(1-\gamma) \sum_{l=1}^{t} \gamma^{t-l}\left(g_{i}^{l}\right)^{2} & \text{(RMSprop)} \end{aligned} \end{cases} $$
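
A minimal NumPy sketch of the RMSprop rule above (illustrative; rho plays the role of $\gamma$):

import numpy as np

def rmsprop_update(w, grad, v, learning_rate=0.001, rho=0.9, eps=1e-7):
    # Exponential moving average of squared gradients: old gradients fade out.
    v = rho * v + (1 - rho) * grad**2
    w = w - learning_rate * grad / np.sqrt(v + eps)
    return w, v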

Arguments

  • learning_rate : The learning rate.
  • rho : Discounting factor for the moving average of squared gradients.

Adadelta

class kerasy.optimizers.Adadelta(
    learning_rate=1.0, rho=0.95, **kwargs
)

Adadelta is a more robust extension of Adagrad that adapts learning rates based on a moving window of gradient updates, instead of accumulating all past gradients.

$$ \begin{aligned} v_{i}^{t} &=\gamma v_{i}^{t-1}+(1-\gamma)\left(g_{i}^{t}\right)^{2} & \left(v_i^0=0\right) \\ s_{i}^{t} &=\gamma s_{i}^{t-1}+(1-\gamma)\left(\Delta w_{i}^{t-1}\right)^{2} & \left(s_i^0=0\right) \\ \Delta w_{i}^{t} &=-\frac{\sqrt{s_{i}^{t}+\epsilon}}{\sqrt{v_{i}^{t}+\epsilon}} g_{i}^{t} \\ w_{i}^{t+1} &=w_{i}^{t}+\Delta w_{i}^{t} \end{aligned} $$
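
A minimal NumPy sketch of the Adadelta rule above (illustrative; s is updated after the step, so the s passed in already contains the previous $\Delta w$, and scaling by learning_rate is an assumption based on the default of 1.0):

import numpy as np

def adadelta_update(w, grad, v, s, learning_rate=1.0, rho=0.95, eps=1e-6):
    # Running average of squared gradients.
    v = rho * v + (1 - rho) * grad**2
    # Step size is the ratio of the two RMS values.
    delta = -np.sqrt(s + eps) / np.sqrt(v + eps) * grad
    # Running average of squared updates, used at the next step.
    s = rho * s + (1 - rho) * delta**2
    return w + learning_rate * delta, v, s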

Arguments

  • learning_rate : The learning rate.
  • rho : The decay rate.

Adam

class kerasy.optimizers.Adam(
    learning_rate=0.001, beta_1=0.9, beta_2=0.999,
    amsgrad=False, **kwargs
)

Adam is a combination of the following two methods.

  • Momentum, which uses the first moment of the gradient ($m$)
  • RMSprop-like accumulation of the squared gradient ($v$)
$$ \begin{aligned} m_{i}^{t} &=\beta_{1} m_{i}^{t-1}+\left(1-\beta_{1}\right) g_{i}^{t} & \left(m_i^0=0\right)\\ v_{i}^{t} &=\beta_{2} v_{i}^{t-1}+\left(1-\beta_{2}\right)\left(g_{i}^{t}\right)^{2} & \left(v_i^0=0\right)\\ \hat{m}_{i}^{t} &=\frac{m_{i}^{t}}{1-\beta_{1}^{t}} \\ \hat{v}_{i}^{t} &=\frac{v_{i}^{t}}{1-\beta_{2}^{t}} \\ w_{i}^{t+1} &=w_{i}^{t}-\frac{\eta}{\sqrt{\hat{v}_{i}^{t}}+\varepsilon}\hat{m}_{i}^{t} \end{aligned} $$
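
A minimal NumPy sketch of the Adam rule above (illustrative; t is the 1-based step count used for bias correction, and the AMSGrad variant is omitted):

import numpy as np

def adam_update(w, grad, m, v, t, learning_rate=0.001,
                beta_1=0.9, beta_2=0.999, eps=1e-7):
    # First moment (momentum-like) and second moment (RMSprop-like) estimates.
    m = beta_1 * m + (1 - beta_1) * grad
    v = beta_2 * v + (1 - beta_2) * grad**2
    # Bias correction compensates for the zero initialization of m and v.
    m_hat = m / (1 - beta_1**t)
    v_hat = v / (1 - beta_2**t)
    w = w - learning_rate * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v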

Arguments

  • learning_rate : The learning rate.
  • beta_1 : The exponential decay rate for the 1st moment estimates.
  • beta_2 : The exponential decay rate for the 2nd moment estimates.
  • amsgrad : Whether to apply AMSGrad variant of this algorithm.

Adamax

class kerasy.optimizers.Adamax(
    learning_rate=0.002, beta_1=0.9, beta_2=0.999, **kwargs
)

Adamax is a variant of Adam based on the infinity norm ($\beta_2\rightarrow\beta_2^p, p\rightarrow\infty$).

Since norms with a large $p$ become numerically unstable, the $L_1$ and $L_2$ norms ($p=1,2$) are generally preferred; however, the $L_\infty$ norm is also known to show stable behavior.

$$ \begin{aligned} v_i^{t} &=\beta_{2}^{\infty} v_i^{t-1}+\left(1-\beta_{2}^{\infty}\right)\left|g_i^{t}\right|^{\infty} \\ &=\max \left(\beta_{2} \cdot v_i^{t-1},\left|g_i^{t}\right|\right)\\ w_{i}^{t+1} &=w_{i}^{t}-\frac{\eta}{v_i^t}\hat{m}_{i}^{t} \end{aligned} $$
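
A minimal NumPy sketch of the Adamax rule above (illustrative; the max replaces the squared-gradient average, $\hat{m}_i^t$ is the bias-corrected first moment from Adam, and a small eps is added for numerical safety):

import numpy as np

def adamax_update(w, grad, m, u, t, learning_rate=0.002,
                  beta_1=0.9, beta_2=0.999, eps=1e-7):
    # Bias-corrected first moment, as in Adam.
    m = beta_1 * m + (1 - beta_1) * grad
    m_hat = m / (1 - beta_1**t)
    # Infinity-norm accumulator: keep the larger of the decayed history and |grad|.
    u = np.maximum(beta_2 * u, np.abs(grad))
    w = w - learning_rate * m_hat / (u + eps)
    return w, m, u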

Arguments

  • learning_rate : The learning rate.
  • beta_1 : The exponential decay rate for the 1st moment estimates.
  • beta_2 : The exponential decay rate for the 2nd moment estimates.

Nadam

class kerasy.optimizers.Nadam(
    learning_rate=0.001, beta_1=0.9, beta_2=0.999, **kwargs
)

Nadam is Adam with Nesterov momentum (Nesterov's Accelerated Gradient method), i.e. a combination of the following two methods.

  • Nesterov momentum, which uses the first moment of the gradient ($m$)
  • RMSprop-like accumulation of the squared gradient ($v$)
Momentum:

$$\begin{aligned}\mathbf{g}_t &\leftarrow \nabla_{\theta_{t-1}}f\left(\theta_{t-1}\right)\\ \mathbf{m}_t&\leftarrow \mu\mathbf{m}_{t-1} + \mathbf{g}_t\\\theta_{t}&\leftarrow \theta_{t-1} - \eta\mathbf{m}_t\end{aligned}$$

Nesterov's accelerated gradient:

$$\begin{aligned}\mathbf{g}_t &\leftarrow \nabla_{\theta_{t-1}}f\left(\theta_{t-1}-\eta\mu\mathbf{m}_{t-1}\right)\\\mathbf{m}_t &\leftarrow \mu\mathbf{m}_{t-1} + \mathbf{g}_t\\\theta_{t}&\leftarrow \theta_{t-1} - \eta\mathbf{m}_t \end{aligned}$$
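
A minimal sketch contrasting the two update rules above (illustrative; grad_fn is an assumed function returning $\nabla f(\theta)$):

def momentum_step(theta, m, grad_fn, lr, mu):
    # Classical momentum: evaluate the gradient at the current parameters.
    g = grad_fn(theta)
    m = mu * m + g
    return theta - lr * m, m

def nesterov_step(theta, m, grad_fn, lr, mu):
    # Nesterov momentum: evaluate the gradient at the look-ahead point.
    g = grad_fn(theta - lr * mu * m)
    m = mu * m + g
    return theta - lr * m, m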

Arguments

  • learning_rate : The learning rate.
  • beta_1 : The exponential decay rate for the 1st moment estimates.
  • beta_2 : The exponential decay rate for the 2nd moment estimates.