SGD Variants
Posted on Thu 29 March 2018 in Basics
To train our neural net, we described the Stochastic Gradient Descent algorithm in this article and implemented it in this one. There are a few ways to improve it so that our model learns more easily.
Vanilla SGD
Just as a reminder, the basic algorithm consists in updating each of our parameters this way:

$$p_{t+1} = p_{t} - \hbox{lr} \times g_{t}$$

where \(p_{t}\) represents one of our parameters at a given step \(t\) in our loop, \(g_{t}\) is the gradient of the loss with respect to \(p_{t}\) and \(\hbox{lr}\) is a hyperparameter called the learning rate. In a pytorch-like syntax, this can be coded:
for p in parameters:
    p = p - lr * p.grad
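Concretely, with real pytorch tensors this loop could look like the following sketch, assuming the gradients have already been computed by a backward pass (the function name is just for illustration):

    import torch

    def sgd_step(parameters, lr):
        # plain SGD: move each parameter a small step against its gradient
        with torch.no_grad():
            for p in parameters:
                p -= lr * p.grad    # in-place update of the parameter
                p.grad.zero_()      # reset the gradient for the next batch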
Momentum
This improvement is based on the observation that with SGD, we don't really manage to follow the line down a steep ravine, but rather bounce from one side to the other.
In this case, we would go faster by following the blue line. To do this, we take some kind of average over the gradients, by updating our parameters like this:

$$v_{t} = \beta v_{t-1} + \hbox{lr} \times g_{t} \quad ; \quad p_{t+1} = p_{t} - v_{t}$$

where \(\beta\) is a new hyperparameter (often equal to 0.9) and \(v_{t}\) is the moving average of the gradients. More details here. In code, this would look like:
for p in parameters:
    v[p] = beta * v[p] + lr * p.grad
    p = p - v[p]
where we store the values of the different v variables in a dictionary indexed by the parameters.
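If it helps, here is a minimal sketch of that bookkeeping, assuming `model` is our network: one buffer per parameter, starting at zero.

    import torch

    # one moving-average buffer per parameter, keyed by the parameter tensor itself
    v = {p: torch.zeros_like(p) for p in model.parameters()}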
Nesterov
Nesterov is a refinement of momentum, based on the observation that when the gradient really starts to change direction (because we have passed our minimum, for instance), it takes a long time for the averaged values to catch up. In this variant, we first take the jump from \(p_{t}\) to \(p_{t} - \beta v_{t-1}\), then we compute the gradient there. To be more precise, instead of using \(g_{t} = \overrightarrow{\hbox{grad}} \hbox{ loss}(p_{t})\) we use

$$g_{t} = \overrightarrow{\hbox{grad}} \hbox{ loss}(p_{t} - \beta v_{t-1})$$

and keep the same update as in regular momentum.
This picture (coming from this website) illustrates the difference with momentum.
In code, this means we need a function that re-evaluates the gradients after we take this first step.
for p in parameters:
    p1[p] = p                        # remember the current value
    p = p - beta * v[p]              # jump ahead to p - beta * v
model.reevaluate_grads()             # gradients are now computed at the jumped position
for p in parameters:
    v[p] = beta * v[p] + lr * p.grad
    p = p1[p] - v[p]
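In practice we rarely write this by hand: pytorch's built-in optimizer implements a (slightly reformulated) version of Nesterov momentum. A sketch, where `model`, `loss_func`, `x` and `y` are placeholders for our network, loss function and current mini-batch:

    import torch.optim as optim

    # nesterov=True switches the built-in SGD optimizer to Nesterov momentum
    optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9, nesterov=True)

    loss = loss_func(model(x), y)   # forward pass and loss on the current mini-batch
    loss.backward()                 # compute the gradients
    optimizer.step()                # apply the Nesterov-momentum update
    optimizer.zero_grad()           # reset the gradients for the next batch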
RMS Prop
This is another variant of SGD, proposed by Geoffrey Hinton in this course. It suggests dividing each gradient by a moving average of its norm. More specifically, the update in this method looks like this:

$$n_{t} = \beta n_{t-1} + (1-\beta) g_{t}^{2} \quad ; \quad p_{t+1} = p_{t} - \hbox{lr} \times \frac{g_{t}}{\sqrt{n_{t} + \epsilon}}$$

where \(\beta\) is a new hyperparameter (usually 0.9) and \(\epsilon\) is a small value to avoid dividing by zero (usually \(10^{-8}\)). It's easily coded:
for p in parameters:
    n[p] = beta * n[p] + (1-beta) * p.grad ** 2
    p = p - lr * p.grad / np.sqrt(n[p] + epsilon)
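The same optimizer exists as `torch.optim.RMSprop`; a minimal sketch, assuming `model` is our network (note that pytorch calls the moving-average coefficient `alpha` instead of `beta`):

    import torch.optim as optim

    optimizer = optim.RMSprop(model.parameters(), lr=0.001, alpha=0.9, eps=1e-8)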
Adam
Adam is a mix between RMS Prop and momentum. Here is the full article explaining it. The update in this method is:

$$m_{t} = \beta_{1} m_{t-1} + (1-\beta_{1}) g_{t} \quad ; \quad n_{t} = \beta_{2} n_{t-1} + (1-\beta_{2}) g_{t}^{2}$$

$$\widehat{m_{t}} = \frac{m_{t}}{1-\beta_{1}^{t}} \quad ; \quad \widehat{n_{t}} = \frac{n_{t}}{1-\beta_{2}^{t}} \quad ; \quad p_{t+1} = p_{t} - \hbox{lr} \times \frac{\widehat{m_{t}}}{\sqrt{\widehat{n_{t}} + \epsilon}}$$
where \(\beta_{1}\) and \(\beta_{2}\) are two new hyperparameters, the advised values being 0.9 and 0.999. We go from \(m_{t}\) to \(\widehat{m_{t}}\) to get smoother first values (as we explained when we implemented the learning rate finder). The same goes for \(n_{t}\) and \(\widehat{n_{t}}\).
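To see why this matters, take the very first step \(t=1\): since \(m_{0} = 0\), we get \(m_{1} = (1-\beta_{1}) g_{1}\), which is ten times too small for \(\beta_{1} = 0.9\); dividing by \(1-\beta_{1}^{1} = 1-\beta_{1}\) gives \(\widehat{m_{1}} = g_{1}\), so the very first update already has the right scale.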
We can code this quite easily:
for p in parameters:
    m[p] = beta1 * m[p] + (1-beta1) * p.grad
    n[p] = beta2 * n[p] + (1-beta2) * p.grad ** 2
    m_hat = m[p] / (1-beta1**t)
    n_hat = n[p] / (1-beta2**t)
    p = p - lr * m_hat / np.sqrt(n_hat + epsilon)
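Here as well, the built-in `torch.optim.Adam` can replace the manual loop; a sketch, assuming `model` is our network (the advised values of \(\beta_{1}\), \(\beta_{2}\) and \(\epsilon\) are also pytorch's defaults):

    import torch.optim as optim

    optimizer = optim.Adam(model.parameters(), lr=0.001, betas=(0.9, 0.999), eps=1e-8)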
Both RMS Prop and Adam have the advantage of smoothing the gradients with this moving average of the norm. This way, we can pick a higher learning rate while avoiding exploding gradients.
In conclusion
To get a sense of how these methods compare to one another, I've applied each of them to the digit classifier we built in pytorch and plotted the smoothed loss (as described when we implemented the learning rate finder) for each of them.
This is the loss as we go through our mini-batches with the same initial model. All variants do better than vanilla SGD, but that's probably because it needed more time to get to a better spot. What's interesting is that RMSProp and Adam manage to get the loss really low extremely fast.