<h1>Deep Painterly Harmonization</h1>
<p>Another data science student's blog, Sylvain Gugger, 2018-05-03</p>
<p>In this post we'll decode <a class="reference external" href="https://arxiv.org/abs/1804.03189">this research article</a> and get some cool results integrating random objects into paintings while preserving their style, like this</p>
<img alt="Example 1" class="align-center" src="../images/art8_eiffel.png" style="width: 600px;" />
<p>or like this</p>
<img alt="Example 2" class="align-center" src="../images/art8_shield.png" style="width: 600px;" />
<p>It goes with <a class="reference external" href="https://github.com/sgugger/Deep-Learning/blob/master/DeepPainterlyHarmonization.ipynb">this notebook</a> where I have tried to replicate the experiments (all images shown in this post come from applying it).</p>
<div class="section" id="style-transfer">
<h2>Style Transfer</h2>
<p>To read more about the basics of style transfer, I can only recommend the <a class="reference external" href="http://fast.ai">fast.ai</a> course, or <a class="reference external" href="https://medium.com/@shivamgoel1791/everything-you-need-to-know-about-neural-style-transfer-994530cc9a6e">this blog post</a> by an international fellow colleague. Since there's a lot to cover,
I will assume you are familiar with this. To make things simple, we will try to match in some way (that is going to be defined later) the features a CNN computes on our
input image with the features it computes on our output image.</p>
<p>The model chosen by the team who wrote the article is VGG19. However, I found similar results with VGG16, which is a bit faster and lighter in terms of memory, so I used VGG16. Then
we will grab the results of five convolutional layers: the first one and the ones just after the MaxPooling layers (where we halve the resolution). The idea is that each will give
us a different kind of information. The first convolutional layer is very close to the image, so it will focus more on the details, while the fifth one will be more conceptual
and its activations will represent general properties of the picture.</p>
<p>Now to properly integrate the new object in the painting, the authors of the article propose two different phases. The first one will focus more on the general style, giving
an intermediate result where the object will still stand out a bit in the picture. The second phase will focus more on the details, and on smoothing the edges that could have
appeared during the first part. Here is an example of the result of the two stages.</p>
<img alt="Stage 1 and 2" class="align-center" src="../images/art8_eiffel_stages.png" style="width: 600px;" />
<p>Before going further, a bit of vocabulary. As in the article, we'll call the content picture the painting with our object pasted on it and the style picture the original painting.
The input is the content picture for phase 1, the result of this first stage for phase 2. In both cases, we'll compute the results of the convolutional layers for the content
picture and the style picture at first, which will serve as our reference features. Then we compute the results of the same convolutional layers for our input, compare them to these
references and calculate a loss from that.</p>
<p>At this point, it's all a matter of classic training: we'll compute the gradients of this loss and use them to get a better input, then repeat the process. As long as our
loss properly represents what we want (the object transformed to the style of the painting), we should get some good results. Since the number of parameters is way less than usual
(only the pixels of our input compared to all the weights of a model, usually) we can use an optimizer that will not only use the gradients, but also an approximation of the second
derivatives (the Hessian matrix) built from the gradients of the previous steps. Without going into the details, this will allow us to make smarter steps each time we update our input, and converge a lot faster. Specifically, we'll
use the <a class="reference external" href="https://en.wikipedia.org/wiki/Limited-memory_BFGS">LBFGS optimizer</a> which is already implemented in pytorch.</p>
<p>This all seems pretty straightforward, but that's because we didn't get to the tough part yet: what are these two magical loss functions (one for stage 1 and one for stage 2) that
we will use?</p>
</div>
<div class="section" id="the-first-pass">
<h2>The first pass</h2>
<p>The loss function used in this pass is exactly the same as in the <a class="reference external" href="https://www.cv-foundation.org/openaccess/content_cvpr_2016/papers/Gatys_Image_Style_Transfer_CVPR_2016_paper.pdf">original work of Gatys et al.</a> There is a content loss, that measures the difference between
our input and the content image, a style loss, that measures the difference between our input and the style image, and we sum them with certain weights to get our final loss.</p>
<p>The main difference is that the article of Gatys et al. was intended to match a whole picture with a certain style, whereas we only have to worry about part of the picture, the object we add.
That means we will mask all the parts of the image that have nothing to do with it when we compute our loss.</p>
<img alt="Mask on the content image" class="align-center" src="../images/art8_mask.png" style="width: 600px;" />
<p>In practice, we will use a slightly dilated mask, that encircles a bit more than just the object we're adding (as the authors did in the <a class="reference external" href="https://github.com/luanfujun/deep-painterly-harmonization">code they published</a>). We don't apply that mask before sending the content image, the style image or the output image in the model,
which would make us lose information too early, but we resize it to match the dimensions of our different features (the results from the convolutional layers) and apply it to
those.</p>
<p>The content loss is then pretty straightforward: it's the mean-squared error between the masked features of our content image and the masked features of our input. The authors
chose to use the result of the fourth convolutional layer only for this content loss. Using one of the first convolutional layers would force the final output to match the initial
object too closely.</p>
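<p>As a minimal sketch of this masked mean-squared error, here is a plain-Python version on a tiny hand-made feature map (one channel, flattened). The function name masked_mse and the toy values are illustrative and do not come from the notebook, and averaging over the masked positions only is an assumption made for this example.</p>

```python
def masked_mse(input_feats, content_feats, mask):
    """Mean-squared error restricted to the positions where mask is 1.

    The three arguments are flat lists of the same length, standing in
    for one channel of a convolutional feature map (toy data).
    """
    kept = [(a - b) ** 2 for a, b, m in zip(input_feats, content_feats, mask) if m]
    return sum(kept) / len(kept)


content = [0.0, 1.0, 2.0, 3.0]
current = [5.0, 1.5, 2.0, 9.0]
mask = [0, 1, 1, 0]  # only the two middle positions belong to the object

loss = masked_mse(current, content, mask)
print(loss)  # 0.125: the large differences outside the mask are ignored
```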
<p>The style loss is a bit trickier. We'll use Gram matrices like we do for regular style transfer, but the problem with the mask is that it might hide some useful information
regarding the style. For the content, all the details we needed were inside the mask, because that's where the object we are adding is, but the general style of the painting is
more global. That's why, before we apply the mask to the style features, we will do some kind of matching to reorganize them.</p>
<img alt="Mask on the style image" class="align-center" src="../images/art8_mask_style.png" style="width: 600px;" />
<p>To be more specific, for each layer of results we have from our model, we'll look at each 3 by 3 (by the number of channels) part of the content features (or patch, as they call it
in the article) and find the 3 by 3 patch in the style features that looks the most like it, and match them. To measure how much two patches look alike, we'll use the cosine similarity
between them.</p>
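<p>To make the matching concrete, here is a small plain-Python sketch that assumes the patches have already been extracted and flattened into vectors. The helper names cosine_similarity and best_matches are hypothetical, and a real implementation would do this on the GPU for every location of the feature map.</p>

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two flattened patches."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-8)  # small epsilon avoids dividing by zero


def best_matches(content_patches, style_patches):
    """For each content patch, index of the most similar style patch."""
    return [
        max(range(len(style_patches)),
            key=lambda j: cosine_similarity(patch, style_patches[j]))
        for patch in content_patches
    ]


# Toy patches: each vector stands for a flattened 3 x 3 x channels block.
content_patches = [[1.0, 0.0, 0.0], [0.0, 2.0, 2.0]]
style_patches = [[0.0, 1.0, 1.0], [3.0, 0.1, 0.0], [1.0, 1.0, 1.0]]

mapping = best_matches(content_patches, style_patches)
print(mapping)  # [1, 0]
```

<p>Note that the second content patch is matched with the first style patch even though their norms differ: the cosine similarity only looks at the direction of the vectors.</p>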
<p>Once that mapping is done (note that it is done once and for all between the content and the style), we will transform the style features so that the center of each 3 by 3 patch
in the content features is aligned with its match in the style features. Then
we will apply the resized mask on the input features and the style features, compute the Gram matrices of both of them, then take the mean-squared error to give us the style loss.
The authors chose to use the convolutional layers number 3, 4 and 5 for this style loss, and take the mean of the three of them.</p>
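<p>The last two steps (masking, then comparing Gram matrices) can be sketched on toy data as follows. This is plain Python with made-up features for a single layer, and the style features are assumed to have already been rearranged by the matching. The helper names are illustrative only.</p>

```python
def apply_mask(feats, mask):
    """Zero out the locations outside the mask, channel by channel."""
    return [[a * m for a, m in zip(channel, mask)] for channel in feats]


def gram_matrix(feats):
    """Gram matrix of features given as a list of channels,
    each channel being a flat list of activations."""
    return [[sum(a * b for a, b in zip(ci, cj)) for cj in feats] for ci in feats]


def mse(m1, m2):
    """Mean-squared error between two matrices of the same shape."""
    diffs = [(a - b) ** 2 for r1, r2 in zip(m1, m2) for a, b in zip(r1, r2)]
    return sum(diffs) / len(diffs)


mask = [1, 1, 0]                                   # the last location is outside the object
input_feats = [[1.0, 2.0, 7.0], [0.0, 1.0, 7.0]]   # 2 channels, 3 locations
style_feats = [[1.0, 1.0, 9.0], [1.0, 0.0, 9.0]]   # assumed already matched to the content

g_input = gram_matrix(apply_mask(input_feats, mask))
g_style = gram_matrix(apply_mask(style_feats, mask))
style_loss = mse(g_input, g_style)
print(g_input)     # [[5.0, 2.0], [2.0, 1.0]]
print(g_style)     # [[2.0, 1.0], [1.0, 1.0]]
print(style_loss)  # 2.75
```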
<p>The final loss of this first stage is then:</p>
<div class="math">
\begin{equation*}
\mathcal{L} = 5 \mathcal{L}_{content} + 100 \mathcal{L}_{style}
\end{equation*}
</div>
<p>Once we're done with the construction of our input (they use 1000 iterations in the paper), we blend our output with the style picture to get our final output. We
could just use the mask around our object, but that would give an abrupt transition that would stand out, so we apply a Gaussian blur on the sides of the mask (so that we get from
the 1s to the 0s a bit more smoothly), then compute</p>
<div class="math">
\begin{equation*}
\hbox{final output} = \hbox{blurred mask} \times \hbox{output} + (1 - \hbox{blurred mask}) \times \hbox{style picture}.
\end{equation*}
</div>
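<p>On a single row of pixels, the blending formula above behaves as in this small sketch. The values are made up, and the blurred mask is written by hand instead of being produced by an actual Gaussian blur.</p>

```python
def blend(output, style, blurred_mask):
    """Pixel-wise mix: blurred_mask * output + (1 - blurred_mask) * style."""
    return [m * o + (1 - m) * s for o, s, m in zip(output, style, blurred_mask)]


output = [10.0, 10.0, 10.0, 10.0]      # result of the optimization
style = [0.0, 0.0, 0.0, 0.0]           # original painting
blurred_mask = [1.0, 0.75, 0.25, 0.0]  # soft edge going from 1 (object) to 0 (painting)

final = blend(output, style, blurred_mask)
print(final)  # [10.0, 7.5, 2.5, 0.0]
```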
</div>
<div class="section" id="the-second-pass">
<h2>The second pass</h2>
<p>As good as the results of the first pass already are, they usually have two flaws that make the object we added in the painting stand out. First, we didn't use any of the features
of the first convolutional layers, so the fine details, especially those of the painting style, won't be present. Second, we didn't do anything to make sure our final picture is smooth.</p>
<p>To remedy those two things, the authors propose to do a second pass to refine the first result. The first change is in the matching process. This time, the matching between the
content and the style is done on a reference layer first, and the results are transported to the others, so the mapping is no longer different for each layer like in the first pass.
Then, after doing the first mapping between the content and the style for this reference layer like in the first stage, they refine it by trying to ensure that adjacent vectors in the
style features remain adjacent through the mapping.</p>
<img alt="Neighbor matching algorithm" class="align-center" src="../images/art8_algo2.png" style="width: 400px;" />
<p>For each pixel p, we consider a set of candidates built by going through the neighbors p' of p (in the code they take the full 5 by 5 square centered on p), taking the value
given by our first match at p', and applying to it the inverse of the translation that goes from p to p'. Then we keep the candidate that minimizes the L2 loss between its style
features and the ones of its neighbors.</p>
<p>Once that matching is done for the reference layer (the authors chose the fourth one), we resize it for the other layers, then proceed like in the first stage to compute the
style loss. There is just one difference: they indicate in their article that they remove the repetitions of the style vectors picked more than once. This is possible because
the Gram matrix doesn't care about the exact spatial location of a style feature (since we sum over all locations for each coefficient), but having the same
style vectors too many times apparently hurts the performance a bit.</p>
<p>The matching being done, the authors this time use the convolutional layers number 1, 2, 3 and 4 for the style loss (and take the mean of them), and the fourth convolutional layer
for the content loss. To add more details to the final output, they also consider two more losses. The first one, the Total Variation loss, sums the squared differences between
adjacent pixel values, which will ensure the result is smoother:</p>
<div class="math">
\begin{equation*}
\mathcal{L}_{tv} = \sum_{x,y} ((O_{x,y} - O_{x-1,y})^{2} + (O_{x,y} - O_{x,y-1})^{2})
\end{equation*}
</div>
<p>where O denotes our output. The last one is the histogram loss introduced in <a class="reference external" href="https://arxiv.org/abs/1701.08893">this other article</a>.</p>
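<p>The Total Variation formula can be checked on a tiny one-channel image with the sketch below. It is plain Python, and skipping the terms that would fall outside the image (at x = 0 or y = 0) is a choice made for this example.</p>

```python
def tv_loss(img):
    """Sum of squared differences between each pixel and its neighbours
    at (x-1, y) and (x, y-1). img is a list of rows (one channel)."""
    total = 0.0
    for x in range(len(img)):
        for y in range(len(img[0])):
            if x > 0:
                total += (img[x][y] - img[x - 1][y]) ** 2
            if y > 0:
                total += (img[x][y] - img[x][y - 1]) ** 2
    return total


flat = [[1.0, 1.0], [1.0, 1.0]]
noisy = [[1.0, 3.0], [0.0, 1.0]]
print(tv_loss(flat))   # 0.0: a constant image is perfectly smooth
print(tv_loss(noisy))  # 10.0
```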
</div>
<div class="section" id="the-histogram-loss">
<h2>The histogram loss</h2>
<p>Histogram matching is a technique that is often used to give a photograph the luminosity or shadows of another. The technique itself is explained on <a class="reference external" href="https://en.wikipedia.org/wiki/Histogram_matching">Wikipedia</a> and here is a concrete example of application.</p>
<img alt="Histogram matching" class="align-center" src="../images/art8_hist_match.png" style="width: 600px;" />
<p>In their paper, Pierre Wilmot et al. found that applying the same technique to define another loss could help preserve the textures of the style picture. They recommended using
it for the features of the first convolutional layer and the fourth one, for both the fine details and the more general aspects of the style.</p>
<p>The idea is, for these two layers, to compute the histogram of each channel of the style features as a reference. Then, at each pass of our training, we calculate the remapping of
our output features so that their histogram (for each channel) matches the style reference. We then define the histogram loss as the mean-squared error between the output
features and their remapped version. The challenge here is to compute that remapping.</p>
<p>Let's say we are trying to change x so that it matches a histogram hist. We sort x first, while keeping the permutation we had to do (it will be used at the end to put the new values
we interpolate in their right place). Then, when we treat the i-th value, we look at the first index idx such that hist.cumsum(idx) is at least i (which means the i-th value of the
data whose histogram we are trying to match is in the bin with index idx). The value attributed to x[i] is basically</p>
<div class="math">
\begin{equation*}
\hbox{min} + \hbox{idx} \times \frac{\hbox{max} - \hbox{min}}{n_{bins}}
\end{equation*}
</div>
<p>where <span class="math">\(\hbox{min}\)</span> and <span class="math">\(\hbox{max}\)</span> are the minimum and the maximum values of the data. This formula needs a slight correction because if we have
several values of x with the same index idx, we want them to be evenly distributed inside the range of the bin. So we compute the ratio</p>
<div class="math">
\begin{equation*}
\hbox{ratio} = \frac{i - \hbox{hist.cumsum}(\hbox{idx}-1)}{\hbox{hist}[\hbox{idx}]}
\end{equation*}
</div>
<p>and finally put</p>
<div class="math">
\begin{equation*}
x[i] = \hbox{min} + (\hbox{idx} + \hbox{ratio} ) \times \frac{\hbox{max} - \hbox{min}}{n_{bins}}.
\end{equation*}
</div>
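<p>Before vectorizing anything, the formulas above can be put together in a naive for loop. The sketch below is plain Python for a single channel. The function name remap_to_histogram is hypothetical, ranks are counted from 1, and hist is assumed to contain one count per bin with a total equal to the number of values.</p>

```python
from itertools import accumulate


def remap_to_histogram(x, hist, lo, hi):
    """Return a copy of x whose histogram over [lo, hi] follows hist.

    hist holds one count per bin and is assumed to sum to len(x).
    """
    n_bins = len(hist)
    step = (hi - lo) / n_bins
    cum = list(accumulate(hist))   # cum[k] = hist[0] + ... + hist[k]
    cum_prev = [0] + cum[:-1]      # cumulative sum up to the previous bin

    order = sorted(range(len(x)), key=lambda k: x[k])  # permutation that sorts x
    new_x = [0.0] * len(x)
    for rank, position in enumerate(order, start=1):
        # First bin whose cumulative count reaches this rank.
        idx = next(k for k, c in enumerate(cum) if c >= rank)
        ratio = (rank - cum_prev[idx]) / hist[idx]
        # Writing at `position` undoes the sort at the same time.
        new_x[position] = lo + (idx + ratio) * step
    return new_x


x = [0.9, 0.1, 0.5, 0.3]
hist = [1, 3]  # target: 1 value in the first bin, 3 values in the second
remapped = remap_to_histogram(x, hist, lo=0.0, hi=1.0)
print(remapped)  # roughly [1.0, 0.5, 0.833, 0.667]
```

<p>The smallest value ends up in the first bin and the three others are spread evenly over the second bin, while every value keeps its original position in the list.</p>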
<p>Now we just have to do this for all possible i and all the channels. Of course, a simple for loop just won't do if we want to use the GPU to handle all the computations
quickly (and if we want 1000 iterations we had better compute this remapping as quickly as we can). Let's assume we have our input x of size ch (for channels) by a given n (the number
of activations we keep) and a variable hist_ref of size ch by n_bins (they picked 256 in the paper). Sorting x for each channel and keeping the corresponding mapping is easy with
pytorch:</p>
<pre class="code literal-block">
sorted_x, sort_idx = x.data.sort(1)
</pre>
<p>Then we have to adapt our histogram a bit because x and our reference may not have the same number of activations (we removed some style features, the ones that appeared more than
once). A histogram for x would have a total sum of n, so we just have to rescale each line of hist_ref so that it sums to n.</p>
<pre class="code literal-block">
hist = hist_ref * n/hist_ref.sum(1).unsqueeze(1)#Normalization between the different lengths of masks.
cum_ref = hist.cumsum(1)
cum_prev = torch.cat([torch.zeros(ch,1).cuda(), cum_ref[:,:-1]],1)
</pre>
<p>The cumsums will be used later: we will need both the cumulative sums of hist and a shifted version that contains the cumulative sums for the previous index. To replace our for loop
we will create a tensor that contains all the values i from 1 to n. To determine the first index idx such that hist.cumsum(idx) is at least i, I've used this line:</p>
<pre class="code literal-block">
rng = torch.arange(1,n+1).unsqueeze(0).cuda()
idx = (cum_ref.unsqueeze(1) - rng.unsqueeze(2) < 0).sum(2).long()
</pre>
<p>Since all the lines of cum_ref are sorted by ascending values, the sum over the booleans corresponding to the test cum_ref - i &lt; 0 counts the entries below i, which gives us the first index
where cum_ref reaches i. Then we use this tensor idx to get all the values in cum_prev and hist that we will need. Since each channel has its own line of indices, we use gather along
the bins dimension, which reads cum_prev[c, idx[c, i]] for every channel c (flattening both tensors and indexing directly would only ever read the first channel).</p>
<pre class="code literal-block">
idx = idx.clamp(max=n_bins-1)  # guard against rounding errors in the cumulative sums
ymin, ymax = x.data.min(1)[0].unsqueeze(1), x.data.max(1)[0].unsqueeze(1)
step = (ymax-ymin)/n_bins
# gather(1, idx) picks, for each channel, the entries at the indices stored in idx
ratio = (rng - cum_prev.gather(1, idx)) / (1e-8 + hist.gather(1, idx))
ratio = ratio.clamp(0,1)
new_x = ymin + (ratio + idx.float()) * step
</pre>
<p>At this stage new_x contains all the values of our remapping, but they are sorted. We have to use the inverse permutation of the one we applied at the beginning to finish the
process. To find the inverse permutation I've simply chosen to take the arg sort:</p>
<pre class="code literal-block">
_, remap = sort_idx.sort()  # the arg sort of a permutation is its inverse
new_x = new_x.gather(1, remap)  # put each value back at its original position, channel by channel
</pre>
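<p>The arg sort trick is easy to verify on a small list. The sketch below uses plain Python lists instead of tensors, and the helper argsort is defined here only for the illustration.</p>

```python
def argsort(values):
    """Indices that would sort values in ascending order."""
    return sorted(range(len(values)), key=lambda k: values[k])


x = [0.9, 0.1, 0.5, 0.3]
sort_idx = argsort(x)                    # [1, 3, 2, 0]
sorted_x = [x[k] for k in sort_idx]      # [0.1, 0.3, 0.5, 0.9]

remap = argsort(sort_idx)                # inverse permutation: [3, 0, 2, 1]
restored = [sorted_x[k] for k in remap]  # values back in their original order
print(restored == x)  # True
```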
</div>
<div class="section" id="normalization">
<h2>Normalization</h2>
<p>In the end, the biggest challenge I faced while working on the implementation of this article is the imbalance between the style features and the input features: in the second
phase, the mask applied to the style features and the one applied to the input features are different, so the gram matrices we compute from them have different ranges of values. I
haven't really understood the way the authors of the paper dealt with this in their code so I chose my own approach.</p>
<p>If we apply a mask with <span class="math">\(n_{1}\)</span> elements for the style features and a mask with <span class="math">\(n_{2}\)</span> elements for the input features, I decided to multiply the style features by
<span class="math">\(\sqrt{\frac{n_{2}}{n_{1}}}\)</span> to artificially <em>resize</em> them. Why? Well the gram matrix is computed by doing a sum, which will either have <span class="math">\(n_{1}\)</span> or <span class="math">\(n_{2}\)</span>
elements, of products of two elements of our features. So inside that sum, when we compute the gram matrix of the style features, the square root will disappear and we will
multiply the result by <span class="math">\(\frac{n_{2}}{n_{1}}\)</span>, which is a way to <em>resize</em> this sum of <span class="math">\(n_{1}\)</span> elements to a sum of <span class="math">\(n_{2}\)</span> elements.</p>
<p>Without this little trick, trainings usually gave me this:</p>
<img alt="Result without normalization" class="align-center" src="../images/art8_bug_norm.png" style="width: 600px;" />
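<p>To see why the square root is the right factor, here is a toy check in plain Python with constant activations. The sizes n1 and n2 are made up, and only one coefficient of the Gram matrix is computed.</p>

```python
import math


def gram_entry(channel_a, channel_b):
    """One coefficient of a Gram matrix: a sum of products over locations."""
    return sum(a * b for a, b in zip(channel_a, channel_b))


n1, n2 = 8, 2                # masked locations: style (n1) versus input (n2)
style_channel = [1.0] * n1   # constant toy activations
input_channel = [1.0] * n2

scale = math.sqrt(n2 / n1)
rescaled_style = [scale * v for v in style_channel]

raw = gram_entry(style_channel, style_channel)          # 8.0, a sum of n1 terms
rescaled = gram_entry(rescaled_style, rescaled_style)   # 2.0, comparable to n2 terms
target = gram_entry(input_channel, input_channel)       # 2.0
print(raw, rescaled, target)
```

<p>Each product picks up the factor twice, so the whole sum is multiplied by n2 / n1 and lands in the same range as the coefficient computed on the input features.</p>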
<p>For the histograms, we also have a resize to do, which is just done by multiplying the histogram of the style features by this ratio <span class="math">\(\frac{n_{2}}{n_{1}}\)</span>. Then in the article
they used the minimum and maximum values of the style features to reconstruct the remapped output features, which didn't make any sense to me, since the histogram loss then compares
those remapped features to the output features, so I used the minimums and maximums of the output features.</p>
<p>At the end, those four losses are summed with some weights to give the final loss of the second stage:</p>
<div class="math">
\begin{equation*}
\mathcal{L} = \mathcal{L}_{c} + w_{s} \mathcal{L}_{s} + w_{h} \mathcal{L}_{hist} + w_{tv} \mathcal{L}_{tv}
\end{equation*}
</div>
<p>where they determine a parameter <span class="math">\(\tau\)</span> by training a neural net they call a painting estimator then use</p>
<div class="math">
\begin{equation*}
\left \{ \begin{array}{l} w_{s} = \tau \\ w_{tv} = \frac{10 \tau}{(1 + \exp(10^{4} \hbox{mtv} -25))} \\ w_{h} = (10 - w_{tv}) * \tau \end{array} \right .
\end{equation*}
</div>
<p>I've taken the formulas used in their code, which are different from the ones they put in their article. The quantity mtv is the median of all the variational losses (the things
we sum to compute TV loss) on the style picture. Of course, the values of tau that worked for them aren't necessarily the best ones since I've used different scaling for the
losses. There are probably some better values that could be used. I didn't get the histogram loss to show any real contribution to the picture, for instance.</p>
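<p>To get a feeling for how these weights react to mtv, the formulas can be evaluated directly. In the sketch below the function name stage2_weights is hypothetical, and the values of tau and mtv are made up for the illustration (they are not values from the paper or the notebook).</p>

```python
import math


def stage2_weights(tau, mtv):
    """Weights of the style, total variation and histogram losses,
    following the formulas quoted above."""
    w_s = tau
    w_tv = 10 * tau / (1 + math.exp(1e4 * mtv - 25))
    w_h = (10 - w_tv) * tau
    return w_s, w_tv, w_h


# Made-up inputs, only meant to show the two regimes of the sigmoid.
balanced = stage2_weights(tau=1.0, mtv=0.0025)  # exponent close to 0: w_tv is about 5
textured = stage2_weights(tau=1.0, mtv=0.01)    # large exponent: w_tv is almost 0
print(balanced)
print(textured)
```

<p>A style picture with a lot of local variation (large mtv) switches the Total Variation weight off, which makes sense: smoothing the object would make it stand out against a rough painting.</p>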
<p>Lastly, for the second stage, we use the result from the first stage to compute the remapping but it's slightly better to use the initial input image for the reconstruction
(which the authors do in their code). See the top of the Eiffel tower here, reconstructed from the input picture on the left and from the result of stage one on the right.</p>
<img alt="Comparison of inputs for stage 2" class="align-center" src="../images/art8_comp_init.png" style="width: 600px;" />
</div>
<h1>Pointer cache for Language Model</h1>
<p>Sylvain Gugger, 2018-04-26</p>
<p>You can easily boost the performance of a language model based on RNNs by adding a pointer cache on top of it. The idea was introduced by Grave et al. in <a class="reference external" href="https://arxiv.org/pdf/1612.04426.pdf">this article</a> and their results showed how this simple technique can make your perplexity decrease by 10 points without additional training.
This sounds exciting, so let's see what this is all about and implement that in pytorch with the fastai library.</p>
<div class="section" id="the-pointer-cache">
<h2>The pointer cache</h2>
<p>To understand the general idea, we have to go back to the basics of a language model built on an RNN.</p>
<img alt="An example of RNN" class="align-center" src="../images/art6_rnn.png" style="width: 500px;" />
<p>Here our inputs are words, and the outputs are our predictions for the next word to come: <span class="math">\(o_{1}\)</span> should be <span class="math">\(i_{2}\)</span>, <span class="math">\(o_{2}\)</span> should be <span class="math">\(i_{3}\)</span> and
so forth. What's actually inside the black box doesn't matter here, as long as we remember there is a hidden state that will be passed along the way, updated, and used
to make the next predictions. When the black box is a multiple-layer RNN, what we note <span class="math">\(h_{t}\)</span> is the last hidden state (the one from the final layer), which is
also the one used by the decoder to compute <span class="math">\(o_{t}\)</span>.</p>
<p>Even if we had some kind of information on all the inputs <span class="math">\(i_{1},\dots,i_{t}\)</span> in our hidden state to predict <span class="math">\(o_{t}\)</span>, it's all squeezed in the size of that
hidden state, and if <span class="math">\(t\)</span> is large, it has been a long time since we saw the first inputs, so all their context has probably been forgotten by now. The idea behind
the pointer cache is to use those inputs again to adjust the prediction <span class="math">\(o_{t}\)</span> a bit.</p>
<img alt="RNN with cache" class="align-center" src="../images/art7_rnn_cache.png" style="width: 600px;" />
<p>More precisely, when trying to predict <span class="math">\(o_{t}\)</span>, we take a look at all the previous couples <span class="math">\((h_{1},i_{2}),\dots,(h_{t-1},i_{t})\)</span>. The hidden state <span class="math">\(h_{1}\)</span>
was supposed to predict <span class="math">\(i_{2}\)</span>, the hidden state <span class="math">\(h_{2}\)</span> was supposed to predict <span class="math">\(i_{3}\)</span> and so forth. If the hidden state we have right now, <span class="math">\(h_{t}\)</span>,
<em>looks a lot like</em> one of the previous hidden states <span class="math">\(h_{k}\)</span>, well maybe the word we are trying to predict is the same as the one <span class="math">\(h_{k}\)</span> was supposed to predict, and we know that
word is <span class="math">\(i_{k+1}\)</span>, so we should boost the probability of this word in our output <span class="math">\(o_{t}\)</span>.</p>
<p>That's the main idea behind this pointer cache technique: we really want to predict the same word as that previous hidden state, so we point at it. The cache is just that instead
of looking through the history since the beginning, we only take a window of a certain length <span class="math">\(n\)</span>, so we look back at the <span class="math">\(n\)</span> previous couples <span class="math">\((h_{k},i_{k+1})\)</span>.</p>
<p>There is just one thing to clarify: how does one code this <em>looks a lot like</em> thing? We simply take the dot product of <span class="math">\(h_{t}\)</span> with <span class="math">\(h_{i}\)</span> (which is the exact same
idea as the one we saw in style transfer during the last lesson of <a class="reference external" href="http://fast.ai">fast.ai</a>). The dot product will be very high if the coordinates of <span class="math">\(h_{t}\)</span> and <span class="math">\(h_{i}\)</span> are very high together or very low (aka very high negatives) together,
so it gives us a sense of how similar they are.</p>
</div>
<div class="section" id="from-the-math">
<h2>From the math...</h2>
<p>This is why in the article mentioned earlier, they come up with the formula:</p>
<div class="math">
\begin{equation*}
p_{cache}(w | h_{1..t} x_{1..t}) \propto \sum_{i=1}^{t-1} \text{𝟙}_{\{w = x_{i+1}\}} \exp(\theta h_{t}^{T} h_{i})
\end{equation*}
</div>
<p>It looks a lot more complicated but there is not much more than what I explained before in this line. Let's break it down in bits!</p>
<p>The first part is the <span class="math">\(p_{cache}(w | h_{1..t} x_{1..t})\)</span>. It represents a probability, more specifically a probability to have the word <span class="math">\(w\)</span> while
knowing <span class="math">\(h_{1..t} x_{1..t}\)</span>, which is a shorter way of writing <span class="math">\(h_{1},\dots,h_{t},x_{1},\dots,x_{t}\)</span>. The <span class="math">\(h_{k}\)</span> are the hidden states and the <span class="math">\(x_{k}\)</span> the
inputs (what I called <span class="math">\(i_{k}\)</span> because input doesn't begin with an x). So this whole thing is just a fancy way of writing what is our desired output: a vector that will
contain the probabilities that the next word is <span class="math">\(w\)</span> knowing all the previous inputs and hidden states.</p>
<p>Then there is this weird symbol <span class="math">\(\propto\)</span> (which I honestly didn't know). While looking it up to type the formula, I found this <a class="reference external" href="http://detexify.kirelabs.org/classify.html">very cool website</a> where you can draw a mathematical symbol, and it will spit out its LaTeX code, and a google search of it will probably give you
all the information you need to understand its meaning. Hope this trick can help you in breaking down future formulas.</p>
<p>Anyway, they don't use the equal sign but this <em>proportional to</em> because since we want a probability, we will have to have things that add up to one in the end. They don't want to
bother with it for now, so this is just a way of saying: we'll give that value, and at the end, divide by the sum of all of those so we're sure it adds up to one.</p>
<p>Then comes a sum, going from 1 to <span class="math">\((t-1)\)</span>, that just means we look at all our previous hidden states. All? Not really, because this weird 𝟙 with a double bar is an indicator
function. More than its name, you're probably interested in what it does. When we have a 𝟙 like this, there is a condition written in index (here
<span class="math">\(\{w = x_{i+1}\}\)</span>) and the quantity is equal to 1 when the condition is true, 0 when the condition is false. So we're not summing over all the previous states, but only
those that satisfy that condition, aka the ones where <span class="math">\(x_{i+1}\)</span> (which is the word we were trying to predict) is the same as <span class="math">\(w\)</span> (the word we want to assign a
probability to now).</p>
<p>Let's sum up so far: to assign a probability to this word w, let's look back at all the previous states where we were trying to predict w. Now for all of those states, we compute
the quantity <span class="math">\(\exp(\theta h_{t}^{T} h_{i})\)</span>. Here <span class="math">\(h_{t}^{T}h_{i}\)</span> is another way to write the dot product of <span class="math">\(h_{t}\)</span> and <span class="math">\(h_{i}\)</span>, which we already
established is a measure of how much <span class="math">\(h_{t}\)</span> and <span class="math">\(h_{i}\)</span> look alike. We multiply this by a hyper-parameter <span class="math">\(\theta\)</span> and then take the exponential of it.</p>
<p>Why the exponential? Remember the little bit with the weird symbol <span class="math">\(\propto\)</span>: we will have to divide by the sum of everything at the end. Taking exponentials of quantities
then dividing by the sum of them all... this should remind you of something. That's right, a softmax! For one, this will ensure that all our probabilities add up to one, but
mostly, it will make one of them stand out more than the others, because that's what softmax does. In the end, it'll help us point at one specific previous hidden state, the one
that looks the most like the one we have.</p>
<p>So in the end, we compute the softmax s of <span class="math">\(\theta h_{1} \cdot h_{t}, \dots, \theta h_{t-1} \cdot h_{t}\)</span> and attribute to <span class="math">\(p_{cache}(w)\)</span> the sum of all the
coordinates of s corresponding to hidden state <span class="math">\(h_{i}\)</span> where we were trying to predict <span class="math">\(w\)</span>.</p>
<p>There is just one last step, but it's an easy one. Our final probability for the word w is</p>
<div class="math">
\begin{equation*}
p(w) = (1-\lambda)p_{vocab}(w) + \lambda p_{cache}(w).
\end{equation*}
</div>
<p>I removed all the <span class="math">\(| h_{1..t} x_{1..t}\)</span> because they aren't really useful. So our final probability is a blend between this <span class="math">\(p_{cache}(w)\)</span> we just computed and
<span class="math">\(p_{vocab}(w)\)</span>, which is their notation for the probabilities in our output <span class="math">\(o_{t}\)</span>, and we have another hyper-parameter <span class="math">\(\lambda\)</span> that will decide how much
of the cache we take, and how much of the output of our RNN.</p>
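<p>To make the formula concrete, here is a small sketch that follows it literally, with an explicit loop playing the role of the indicator function. The function name cache_probs and all the toy numbers (three cached states, a vocabulary of four words, the values of theta and lambda) are made up for the illustration; they don't come from the paper.</p>

```python
import numpy as np

def cache_probs(h_t, hid_cache, targ_cache, vocab_size, theta):
    """p_cache(w) for every word w, written as the literal sum over the cache."""
    # Scores theta * (h_t . h_i), one per cached hidden state.
    scores = np.array([theta * np.dot(h_t, h_i) for h_i in hid_cache])
    weights = np.exp(scores - scores.max())   # shifting by the max doesn't change the softmax
    weights = weights / weights.sum()         # the division hidden behind the "proportional to"
    p_cache = np.zeros(vocab_size)
    for weight, word in zip(weights, targ_cache):
        p_cache[word] += weight               # the indicator: only add where x_{i+1} == w
    return p_cache

# Toy data (assumed values): 3 cached hidden states of size 2, vocabulary of 4 words.
hid_cache = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
targ_cache = [2, 0, 2]                        # the word that followed each cached state
h_t = np.array([1.0, 0.5])                    # current hidden state
p_vocab = np.array([0.1, 0.2, 0.3, 0.4])      # made-up softmaxed output of the RNN

lambd, theta = 0.2, 1.0
p_cache = cache_probs(h_t, hid_cache, targ_cache, 4, theta)
p = (1 - lambd) * p_vocab + lambd * p_cache   # final blend
print(p_cache, p)
```

<p>Words that never appeared as a target in the cache (here 1 and 3) get a cache probability of zero, so for them the final probability is just the RNN output scaled by one minus lambda.</p>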
</div>
<div class="section" id="to-the-code">
<h2>...to the code</h2>
<p>Now that we have completely explained the formula, let's see how we code this. Let's say, at a given point where we have to give the probabilities for each word, we
have:</p>
<ul class="simple">
<li>our output of the network (softmaxed) in a torch vector named pv</li>
<li>the current hidden state in a torch vector named hiddens[i]</li>
<li>our cache of hidden states in a torch Tensor called hid_cache</li>
<li>our cache of targets in a torch Tensor called targ_cache.</li>
</ul>
<p>Then first we take all the dot products between the hidden states in our cache and the current hidden state:</p>
<pre class="code python literal-block">
<span class="n">all_dot_prods</span> <span class="o">=</span> <span class="n">torch</span><span class="o">.</span><span class="n">mv</span><span class="p">(</span><span class="n">theta</span> <span class="o">*</span> <span class="n">hid_cache</span><span class="p">,</span> <span class="n">hiddens</span><span class="p">[</span><span class="n">i</span><span class="p">])</span>
</pre>
<p>The torch function mv computes a matrix-vector product, which directly gives the dot product between each row of hid_cache and the vector hiddens[i]. Then we softmax this:</p>
<pre class="code python literal-block">
<span class="n">softmaxed</span> <span class="o">=</span> <span class="n">F</span><span class="o">.</span><span class="n">softmax</span><span class="p">(</span><span class="n">all_dot_prods</span><span class="p">,</span> <span class="n">dim</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span>
</pre>
<p>Then we want, for each word w, to take the sum of all the probabilities corresponding to states where we had to predict w. To do this, I used the same trick as the implementation
of Stephen Merity et al. <a class="reference external" href="https://github.com/salesforce/awd-lstm-lm">here on github</a>. If we consider the targets are one-hot encoded (so targ_cache has one row per cached state and vocab_size columns), we just have to expand our softmaxed vector (which has the size of our cache)
along a new dimension so that it has vocab_size columns, then we multiply it by targ_cache (which will zero all the things we don't want) and sum over the first axis. All of
this is done with:</p>
<pre class="code python literal-block">
<span class="n">softmaxed</span> <span class="o">=</span> <span class="n">F</span><span class="o">.</span><span class="n">softmax</span><span class="p">(</span><span class="n">all_dot_prods</span><span class="p">,</span> <span class="n">dim</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span><span class="o">.</span><span class="n">unsqueeze</span><span class="p">(</span><span class="mi">1</span><span class="p">)</span>
<span class="n">p_cache</span> <span class="o">=</span> <span class="p">(</span><span class="n">softmaxed</span><span class="o">.</span><span class="n">expand_as</span><span class="p">(</span><span class="n">targ_cache</span><span class="p">)</span> <span class="o">*</span> <span class="n">targ_cache</span><span class="p">)</span><span class="o">.</span><span class="n">sum</span><span class="p">(</span><span class="mi">0</span><span class="p">)</span><span class="o">.</span><span class="n">squeeze</span><span class="p">()</span>
</pre>
<p>Then our final predictions are given by</p>
<pre class="code python literal-block">
<span class="n">p</span> <span class="o">=</span> <span class="p">(</span><span class="mi">1</span><span class="o">-</span><span class="n">lambd</span><span class="p">)</span> <span class="o">*</span> <span class="n">pv</span> <span class="o">+</span> <span class="n">lambd</span> <span class="o">*</span> <span class="n">p_cache</span>
</pre>
<p>and the associated CrossEntropy Loss is given by</p>
<pre class="code python literal-block">
<span class="o">-</span><span class="n">torch</span><span class="o">.</span><span class="n">log</span><span class="p">(</span><span class="n">p</span><span class="p">[</span><span class="n">target</span><span class="p">])</span><span class="o">.</span><span class="n">data</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span>
</pre>
<p>if the current target is named target.</p>
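<p>Put together, the steps above can be run end to end on toy tensors. This is only a sketch: the sizes, the random values and the use of F.one_hot to build targ_cache are assumptions made for the example, not what the notebook does.</p>

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
vocab_size, n_hid, cache_size = 5, 3, 4               # toy sizes (assumed)
theta, lambd = 0.5, 0.2                               # toy hyper-parameters (assumed)

pv = F.softmax(torch.randn(vocab_size), dim=0)        # stand-in for the softmaxed RNN output
hidden = torch.randn(n_hid)                           # current hidden state
hid_cache = torch.randn(cache_size, n_hid)            # one cached hidden state per row
targets = torch.tensor([1, 3, 1, 0])                  # word that followed each cached state
targ_cache = F.one_hot(targets, vocab_size).float()   # shape (cache_size, vocab_size)

all_dot_prods = torch.mv(theta * hid_cache, hidden)               # shape (cache_size,)
softmaxed = F.softmax(all_dot_prods, dim=0).unsqueeze(1)          # shape (cache_size, 1)
p_cache = (softmaxed.expand_as(targ_cache) * targ_cache).sum(0)   # shape (vocab_size,)
p = (1 - lambd) * pv + lambd * p_cache                            # final probabilities

target = 1
loss = -torch.log(p[target]).item()                   # cross entropy for this single prediction
print(p, loss)
```

<p>Words 2 and 4 never appear in the toy cache, so their entries in p_cache stay at zero, and p still sums to one because it blends two probability distributions.</p>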
<p>With all of this, we're ready to fully code the cache pointer and I've done an implementation relying on the <a class="reference external" href="https://github.com/fastai/fastai">fastai library</a> that you can find in <a class="reference external" href="https://github.com/sgugger/Deep-Learning/blob/master/Cache%20pointer.ipynb">this notebook</a>. As an example, the model I provide for testing goes from a perplexity of 74.06 to 54.43.</p>
</div>
<script type='text/javascript'>if (!document.getElementById('mathjaxscript_pelican_#%@#$@#')) {
var align = "center",
indent = "0em",
linebreak = "false";
if (false) {
align = (screen.width < 768) ? "left" : align;
indent = (screen.width < 768) ? "0em" : indent;
linebreak = (screen.width < 768) ? 'true' : linebreak;
}
var mathjaxscript = document.createElement('script');
mathjaxscript.id = 'mathjaxscript_pelican_#%@#$@#';
mathjaxscript.type = 'text/javascript';
mathjaxscript.src = 'https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.0/MathJax.js?config=TeX-AMS-MML_HTMLorMML';
mathjaxscript[(window.opera ? "innerHTML" : "text")] =
"MathJax.Hub.Config({" +
" config: ['MMLorHTML.js']," +
" TeX: { extensions: ['AMSmath.js','AMSsymbols.js','noErrors.js','noUndefined.js'], equationNumbers: { autoNumber: 'AMS' } }," +
" jax: ['input/TeX','input/MathML','output/HTML-CSS']," +
" extensions: ['tex2jax.js','mml2jax.js','MathMenu.js','MathZoom.js']," +
" displayAlign: '"+ align +"'," +
" displayIndent: '"+ indent +"'," +
" showMathMenu: true," +
" messageStyle: 'normal'," +
" tex2jax: { " +
" inlineMath: [ ['\\\\(','\\\\)'] ], " +
" displayMath: [ ['$$','$$'] ]," +
" processEscapes: true," +
" preview: 'TeX'," +
" }, " +
" 'HTML-CSS': { " +
" styles: { '.MathJax_Display, .MathJax .mo, .MathJax .mi, .MathJax .mn': {color: 'inherit ! important'} }," +
" linebreaks: { automatic: "+ linebreak +", width: '90% container' }," +
" }, " +
"}); " +
"if ('default' !== 'default') {" +
"MathJax.Hub.Register.StartupHook('HTML-CSS Jax Ready',function () {" +
"var VARIANT = MathJax.OutputJax['HTML-CSS'].FONTDATA.VARIANT;" +
"VARIANT['normal'].fonts.unshift('MathJax_default');" +
"VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
"VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
"VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
"});" +
"MathJax.Hub.Register.StartupHook('SVG Jax Ready',function () {" +
"var VARIANT = MathJax.OutputJax.SVG.FONTDATA.VARIANT;" +
"VARIANT['normal'].fonts.unshift('MathJax_default');" +
"VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
"VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
"VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
"});" +
"}";
(document.body || document.getElementsByTagName('head')[0]).appendChild(mathjaxscript);
}
</script>Recurrent Neural Network2018-04-14T16:31:00-04:002018-04-14T16:31:00-04:00Sylvain Guggertag:None,2018-04-14:/recurrent-neural-network.html<p class="first last">In Natural Language Processing, traditional neural networks struggle to properly execute the task we give them. To predict the next word in a sentence for instance, or grasp its meaning to somehow classify it, you need to have a structure that can keep some memory of the words it saw before. That's what Recurrent Neural Networks have been designed to do, and we'll look into them in this article.</p>
<p>A Recurrent Neural Network is called as such because it executes the same task repeatedly, getting an input (for instance a word in a sentence), using it to update a hidden state
and giving an output. This hidden state is the crucial part that allows the RNN to keep some memory of what it saw, encoding the general meaning of the part of the sentence it read.</p>
<div class="section" id="the-general-principle">
<h2>The general principle</h2>
<p>A Recurrent Neural Network basically looks like this:</p>
<img alt="An example of RNN" class="align-center" src="../images/art6_rnn.png" style="width: 500px;" />
<p>From our first input <span class="math">\(i_{1}\)</span>, we compute a first hidden state <span class="math">\(h_{1}\)</span>. How? Like a neural net does, with a linear layer. From this hidden state, we compute an output
<span class="math">\(o_{1}\)</span>, again with a linear layer. This part is no different than a regular neural net with one hidden layer. What changes is that when we have our second input <span class="math">\(i_{2}\)</span>,
we use the same linear layer as for <span class="math">\(i_{1}\)</span> to compute the arrow going from <span class="math">\(i_{2}\)</span> to <span class="math">\(h_{2}\)</span>, but we add something: the previous hidden state <span class="math">\(h_{1}\)</span> goes
through a linear layer of its own and we merge the two results to get <span class="math">\(h_{2}\)</span>, then <span class="math">\(o_{2}\)</span> is calculated from <span class="math">\(h_{2}\)</span> the same way <span class="math">\(o_{1}\)</span> was from
<span class="math">\(h_{1}\)</span>.</p>
<p>Then we continue the same way up until the end. There may be many arrows in that figure, but there are only three linear layers in this neural net; they are just repeated several times.
Specifically, we need a layer <span class="math">\(L_{ih}\)</span>, that goes from input to hidden, a layer <span class="math">\(L_{hh}\)</span> that goes from hidden to hidden, and a layer <span class="math">\(L_{ho}\)</span> that goes from the
hidden state to the output. Following the same notations, we have weight matrices <span class="math">\(W_{ih}\)</span>, <span class="math">\(W_{hh}\)</span> and <span class="math">\(W_{ho}\)</span>, bias vectors <span class="math">\(b_{ih}\)</span>, <span class="math">\(b_{hh}\)</span> and
<span class="math">\(b_{ho}\)</span>. If we denote by <span class="math">\(n_{in}\)</span> the size of the input, <span class="math">\(n_{hid}\)</span> the size of the hidden state and <span class="math">\(n_{out}\)</span> the size of the output, naturally we have the
following sizes:</p>
<div class="math">
\begin{equation*}
\begin{array}{|c|c|c|}
\hline W_{ih} & W_{hh} & W_{ho} \\ \hline n_{in} \times n_{hid} & n_{hid} \times n_{hid} & n_{hid} \times n_{out} \\
\hline b_{ih} & b_{hh} & b_{ho} \\ \hline n_{hid} & n_{hid} & n_{out} \\ \hline \end{array}
\end{equation*}
</div>
<p>and the following equations to compute the next stage from the previous one:</p>
<div class="math">
\begin{equation*}
\left \{ \begin{array}{l} h_{k} = \hbox{tanh}(i_{k} W_{ih} + b_{ih} + h_{k-1} W_{hh} + b_{hh}) \\ o_{k} = f(h_{k} W_{ho} + b_{ho}) \end{array} \right .
\end{equation*}
</div>
<p>To complete these, the first hidden state <span class="math">\(h_{0}\)</span> is assumed to be zeros. Note that the way we merged the two arrows here is by summing them, after applying the
linear layers. An equivalent way to see this is that we concatenated <span class="math">\(i_{k}\)</span> and <span class="math">\(h_{k-1}\)</span> then applied a weight matrix of size <span class="math">\((n_{in}+n_{hid}) \times n_{hid}\)</span>.</p>
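<p>The equivalence between summing and concatenating can be checked numerically. In this sketch the inputs are stored as rows, which matches the matrix sizes given in the table above; the sizes and the random values are arbitrary choices for the example.</p>

```python
import torch

torch.manual_seed(0)
bs, n_in, n_hid = 2, 4, 3            # arbitrary toy sizes
i_k = torch.randn(bs, n_in)          # one input per row
h_prev = torch.randn(bs, n_hid)      # previous hidden states, one per row
W_ih = torch.randn(n_in, n_hid)
W_hh = torch.randn(n_hid, n_hid)

# Version 1: two separate linear maps, then a sum.
summed = i_k @ W_ih + h_prev @ W_hh

# Version 2: concatenate the inputs, stack the weights, apply a single matrix.
W_cat = torch.cat([W_ih, W_hh], dim=0)                     # (n_in + n_hid, n_hid)
concatenated = torch.cat([i_k, h_prev], dim=1) @ W_cat     # (bs, n_hid)

same = torch.allclose(summed, concatenated, atol=1e-5)
print(same)
```

<p>Both versions give the same pre-activation, up to floating point rounding, so the choice between them is a matter of convenience.</p>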
<p>The two biases <span class="math">\(b_{ih}\)</span> and <span class="math">\(b_{hh}\)</span> are redundant, so we could use only one of them. The non-linearity inside an RNN is often tanh, because it has the advantage of
outputting values between -1 and 1; a ReLU would allow for values to grow larger and larger as we apply the same linear unit every time, giving a more unstable model, and a sigmoid
wouldn't give us any negatives. The non-linearity that goes to the output <span class="math">\(f\)</span> can vary depending on our needs.</p>
</div>
<div class="section" id="how-to-use-them">
<h2>How to use them</h2>
<p>A classical use for RNNs is to try to predict the next character (or word) in a text, given the previous ones. Both problems are the same, the only difference is the size
of our vocabulary: for a character-level model, we would have something between 26 and a few hundred characters (if we want to have all the special ones, lower and upper cases).
At a word level, the vocabulary will easily have a size in the tens (if not hundreds) of thousands.</p>
<p>In both cases, the first step will be to split the text into tokens, which will be the characters or the words composing it (in the second case, we should actually be smarter
than just taking the words, but this would be a subject for another article). Those tokens are then numericalized from 1 to n, the size of our vocabulary. We can add a few extra tokens
like &lt;bos&gt; (for beginning of stream), &lt;eos&gt; (end of stream) or &lt;unk&gt; (unknown, very useful when working at a word level: there's no use keeping the words that are barely present
in our corpus since the model probably won't learn anything about them, so we can replace them all by &lt;unk&gt;).</p>
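<p>As a rough illustration of tokenizing and numericalizing, here is a minimal word-level sketch. The helper names build_vocab and numericalize, the min_freq threshold and the naive split on spaces are all assumptions made for the example (and the ids start at 0 here rather than 1).</p>

```python
from collections import Counter

def build_vocab(tokens, min_freq=2):
    """Map tokens to ids; tokens seen fewer than min_freq times will share the <unk> id."""
    counts = Counter(tokens)
    kept = sorted(tok for tok, count in counts.items() if count >= min_freq)
    itos = ["<unk>", "<bos>", "<eos>"] + kept      # id -> token
    stoi = {tok: i for i, tok in enumerate(itos)}  # token -> id
    return itos, stoi

def numericalize(tokens, stoi):
    """Turn a list of tokens into a list of ids, framed by <bos> and <eos>."""
    unk = stoi["<unk>"]
    return [stoi["<bos>"]] + [stoi.get(tok, unk) for tok in tokens] + [stoi["<eos>"]]

text = "the cat sat on the mat and the dog sat too"
tokens = text.split()                              # naive word-level tokenization
itos, stoi = build_vocab(tokens)
ids = numericalize(tokens, stoi)
print(itos)
print(ids)
```

<p>Only "the" and "sat" appear at least twice in this tiny corpus, so every other word is mapped to the unknown token.</p>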
<p>Having done that, we're almost ready to feed text to an RNN. The last step is to transform those numbers that compose our texts into vectors that can pass through a neural net. When
dealing at a character level, we often one-hot encode those numbers, which means transforming i into the vector of size n with zeros everywhere except the i-th coordinate, which is 1.
But doing this with words, where n can be so large, might not be the best idea. Instead, in those cases, we use embeddings, which means replacing each token by a vector of a given size,
with random coordinates at the beginning, that the network will then learn to make better.</p>
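<p>The two encodings can be compared side by side in PyTorch. This is a small sketch with a made-up vocabulary size and embedding size; it also checks that looking up an embedding gives the same result as multiplying the one-hot vector by the embedding matrix.</p>

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

n, emb_size = 6, 3                       # toy vocabulary size and embedding size (assumed)
ids = torch.tensor([0, 2, 5])            # three numericalized tokens

one_hot = F.one_hot(ids, num_classes=n).float()    # shape (3, n), a single 1 per row
embedding = nn.Embedding(n, emb_size)              # random rows at first, learned during training
embedded = embedding(ids)                          # shape (3, emb_size)

# An embedding lookup selects the same rows as a one-hot vector times the weight matrix.
same = torch.allclose(one_hot @ embedding.weight, embedded)
print(one_hot.shape, embedded.shape, same)
```

<p>The lookup avoids ever building the large one-hot vectors, which is what makes embeddings practical when n is in the tens of thousands.</p>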
<p>We'll go into the specifics of this in another article. For now let's just see how to implement a basic RNN. The model itself is very easily coded in PyTorch: since it builds
its computation graph dynamically, we can just do a for loop in the forward pass. There's a trick for the initialization: the matrix <span class="math">\(W_{hh}\)</span> should be the identity matrix at the beginning
and not a random one. It makes sense when we think it will be applied a lot of times to the hidden state, and not changing it at the beginning seems like a good idea. Of course,
with SGD, it won't stay an identity matrix for very long.</p>
<pre class="code python literal-block">
<span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="nn">np</span>
<span class="kn">import</span> <span class="nn">torch</span>
<span class="kn">import</span> <span class="nn">torch.nn</span> <span class="k">as</span> <span class="nn">nn</span>

<span class="k">class</span> <span class="nc">RNN</span><span class="p">(</span><span class="n">nn</span><span class="o">.</span><span class="n">Module</span><span class="p">):</span>
    <span class="k">def</span> <span class="fm">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">n_in</span><span class="p">,</span> <span class="n">n_hid</span><span class="p">,</span> <span class="n">n_out</span><span class="p">,</span> <span class="n">size</span><span class="p">):</span>
        <span class="nb">super</span><span class="p">()</span><span class="o">.</span><span class="fm">__init__</span><span class="p">()</span>
        <span class="c1"># nn.Parameter registers the tensors so that an optimizer can update them</span>
        <span class="bp">self</span><span class="o">.</span><span class="n">wgts_ih</span> <span class="o">=</span> <span class="n">nn</span><span class="o">.</span><span class="n">Parameter</span><span class="p">(</span><span class="n">torch</span><span class="o">.</span><span class="n">randn</span><span class="p">(</span><span class="n">n_in</span><span class="p">,</span><span class="n">n_hid</span><span class="p">)</span> <span class="o">*</span> <span class="n">np</span><span class="o">.</span><span class="n">sqrt</span><span class="p">(</span><span class="mi">2</span><span class="o">/</span><span class="n">n_in</span><span class="p">))</span>
        <span class="bp">self</span><span class="o">.</span><span class="n">wgts_hh</span> <span class="o">=</span> <span class="n">nn</span><span class="o">.</span><span class="n">Parameter</span><span class="p">(</span><span class="n">torch</span><span class="o">.</span><span class="n">eye</span><span class="p">(</span><span class="n">n_hid</span><span class="p">))</span>
        <span class="bp">self</span><span class="o">.</span><span class="n">wgts_ho</span> <span class="o">=</span> <span class="n">nn</span><span class="o">.</span><span class="n">Parameter</span><span class="p">(</span><span class="n">torch</span><span class="o">.</span><span class="n">randn</span><span class="p">(</span><span class="n">n_hid</span><span class="p">,</span><span class="n">n_out</span><span class="p">)</span> <span class="o">*</span> <span class="n">np</span><span class="o">.</span><span class="n">sqrt</span><span class="p">(</span><span class="mi">2</span><span class="o">/</span><span class="n">n_out</span><span class="p">))</span>
        <span class="bp">self</span><span class="o">.</span><span class="n">b_ih</span> <span class="o">=</span> <span class="n">nn</span><span class="o">.</span><span class="n">Parameter</span><span class="p">(</span><span class="n">torch</span><span class="o">.</span><span class="n">zeros</span><span class="p">(</span><span class="n">n_hid</span><span class="p">))</span>
        <span class="bp">self</span><span class="o">.</span><span class="n">b_ho</span> <span class="o">=</span> <span class="n">nn</span><span class="o">.</span><span class="n">Parameter</span><span class="p">(</span><span class="n">torch</span><span class="o">.</span><span class="n">zeros</span><span class="p">(</span><span class="n">n_out</span><span class="p">))</span>
        <span class="bp">self</span><span class="o">.</span><span class="n">size</span><span class="p">,</span> <span class="bp">self</span><span class="o">.</span><span class="n">n_hid</span> <span class="o">=</span> <span class="n">size</span><span class="p">,</span> <span class="n">n_hid</span>

    <span class="k">def</span> <span class="nf">forward</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span><span class="n">x</span><span class="p">):</span>
        <span class="n">bs</span> <span class="o">=</span> <span class="n">x</span><span class="o">.</span><span class="n">size</span><span class="p">(</span><span class="mi">1</span><span class="p">)</span>
        <span class="n">hid</span> <span class="o">=</span> <span class="n">torch</span><span class="o">.</span><span class="n">zeros</span><span class="p">(</span><span class="n">bs</span><span class="p">,</span> <span class="bp">self</span><span class="o">.</span><span class="n">n_hid</span><span class="p">)</span>
        <span class="n">outs</span> <span class="o">=</span> <span class="p">[]</span>
        <span class="c1"># same weights applied at every time step</span>
        <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">size</span><span class="p">):</span>
            <span class="n">hid</span> <span class="o">=</span> <span class="n">torch</span><span class="o">.</span><span class="n">tanh</span><span class="p">(</span><span class="n">torch</span><span class="o">.</span><span class="n">mm</span><span class="p">(</span><span class="n">x</span><span class="p">[</span><span class="n">i</span><span class="p">],</span> <span class="bp">self</span><span class="o">.</span><span class="n">wgts_ih</span><span class="p">)</span> <span class="o">+</span> <span class="n">torch</span><span class="o">.</span><span class="n">mm</span><span class="p">(</span><span class="n">hid</span><span class="p">,</span> <span class="bp">self</span><span class="o">.</span><span class="n">wgts_hh</span><span class="p">)</span> <span class="o">+</span> <span class="bp">self</span><span class="o">.</span><span class="n">b_ih</span><span class="p">)</span>
            <span class="n">out</span> <span class="o">=</span> <span class="n">torch</span><span class="o">.</span><span class="n">mm</span><span class="p">(</span><span class="n">hid</span><span class="p">,</span> <span class="bp">self</span><span class="o">.</span><span class="n">wgts_ho</span><span class="p">)</span> <span class="o">+</span> <span class="bp">self</span><span class="o">.</span><span class="n">b_ho</span>
            <span class="n">outs</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">out</span><span class="o">.</span><span class="n">unsqueeze</span><span class="p">(</span><span class="mi">0</span><span class="p">))</span>
        <span class="k">return</span> <span class="n">torch</span><span class="o">.</span><span class="n">cat</span><span class="p">(</span><span class="n">outs</span><span class="p">)</span>
</pre>
<p>Note that here in the forward pass, our tensor x has three dimensions: there is the length of the sentence, the mini-batch and the size of our vocabulary. Often in RNNs, they
are indexed in this order, since it allows us to easily go through each character of our sentences via x[0], x[1], ... Here we presented the output in the same way, but
depending on the goal, you might want to only keep the final output. Lastly, there's no activation function for the output since it will be computed in the loss.</p>
<p>Now that we have seen what's inside an RNN, we can use the module of the same name in PyTorch, which would just be:</p>
<pre class="code python literal-block">
<span class="n">myRNN</span> <span class="o">=</span> <span class="n">nn</span><span class="o">.</span><span class="n">RNN</span><span class="p">(</span><span class="n">n_in</span><span class="p">,</span> <span class="n">n_hid</span><span class="p">)</span>
</pre>
<p>Here we don't specify an output size since PyTorch will only give us the list of hidden states. We can decide to apply the same linear layer as we did before if we need to.
When we call this network on an input (of size sequence length by batch size by vocab size), we can also specify an initial hidden state (which will be zeros if we don't), then
the output will be a tuple containing two things: the list of hidden states in a tensor (except the initial one) and the last hidden state.</p>
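<p>A quick way to get familiar with this interface is to look at the shapes on random data. The sizes below are arbitrary, and the module is left untrained since only the shapes matter here.</p>

```python
import torch
import torch.nn as nn

seq_len, bs, n_in, n_hid = 5, 2, 4, 3     # arbitrary toy sizes
myRNN = nn.RNN(n_in, n_hid)
x = torch.randn(seq_len, bs, n_in)

out, last_hid = myRNN(x)                  # the initial hidden state defaults to zeros
# out: (seq_len, bs, n_hid), one hidden state per time step
# last_hid: (1, bs, n_hid), the leading 1 is num_layers * num_directions

h0 = torch.zeros(1, bs, n_hid)
out_explicit, _ = myRNN(x, h0)            # passing the zeros ourselves gives the same result
print(out.shape, last_hid.shape)
```

<p>For this single-layer, one-directional RNN, the last slice of out along the first dimension is the same as the last hidden state.</p>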
<p>If we want to reproduce the previous RNN, we then have to do the following:</p>
<pre class="code python literal-block">
<span class="k">class</span> <span class="nc">RNN</span><span class="p">(</span><span class="n">nn</span><span class="o">.</span><span class="n">Module</span><span class="p">):</span>
    <span class="k">def</span> <span class="fm">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">n_in</span><span class="p">,</span> <span class="n">n_hid</span><span class="p">,</span> <span class="n">n_out</span><span class="p">):</span>
        <span class="nb">super</span><span class="p">()</span><span class="o">.</span><span class="fm">__init__</span><span class="p">()</span>
        <span class="bp">self</span><span class="o">.</span><span class="n">rnn</span> <span class="o">=</span> <span class="n">nn</span><span class="o">.</span><span class="n">RNN</span><span class="p">(</span><span class="n">n_in</span><span class="p">,</span><span class="n">n_hid</span><span class="p">)</span>
        <span class="bp">self</span><span class="o">.</span><span class="n">linear</span> <span class="o">=</span> <span class="n">nn</span><span class="o">.</span><span class="n">Linear</span><span class="p">(</span><span class="n">n_hid</span><span class="p">,</span><span class="n">n_out</span><span class="p">)</span>

    <span class="k">def</span> <span class="nf">forward</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span><span class="n">x</span><span class="p">):</span>
        <span class="n">out</span><span class="p">,</span> <span class="n">hid</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">rnn</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>
        <span class="k">return</span> <span class="bp">self</span><span class="o">.</span><span class="n">linear</span><span class="p">(</span><span class="n">out</span><span class="p">)</span>
</pre>
<p>The last linear layer acts on the last dimension of the out tensor, so it is applied independently along the first two dimensions (one for the sequence length and one for the batch size).</p>
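<p>This behavior of nn.Linear on a three-dimensional tensor can be verified with a short sketch. The tensor out is random here, standing in for the hidden states returned by nn.RNN, and the sizes are arbitrary.</p>

```python
import torch
import torch.nn as nn

seq_len, bs, n_hid, n_out = 5, 2, 3, 7    # arbitrary toy sizes
out = torch.randn(seq_len, bs, n_hid)     # stand-in for the hidden states returned by nn.RNN
linear = nn.Linear(n_hid, n_out)

all_at_once = linear(out)                                            # (seq_len, bs, n_out)
step_by_step = torch.stack([linear(out[t]) for t in range(seq_len)])  # one time step at a time
same = torch.allclose(all_at_once, step_by_step, atol=1e-6)
print(all_at_once.shape, same)
```

<p>Calling the layer once on the whole tensor gives the same values as looping over the time steps, which is why the forward pass doesn't need an explicit loop for the output.</p>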
</div>
<div class="section" id="let-s-see-some-results">
<h2>Let's see some results!</h2>
<p>By training this kind of network for a very long time on a large corpus of text, one can get surprising results, even if they are just character-level trained RNNs. This
<a class="reference external" href="https://arxiv.org/abs/1803.09820">article</a> shows quite a few of them.</p>
<p>To get a trained RNN to generate text, we just have to give it a seed: since it knows how to predict the next word from the last few, it needs a beginning. In
lesson 10 of the <a class="reference external" href="http://fast.ai">fast.ai</a> MOOC, Jeremy shows a language model pre-trained on a subset of Wikipedia. It's slightly more complex than the basic RNN we just saw, using three LSTMs, but
we'll go into the depth of that in another article. Once the model is loaded in his notebook, we can use it to generate predictions by implementing this function:</p>
<pre class="code python literal-block">
<span class="k">def</span> <span class="nf">what_next</span><span class="p">(</span><span class="n">seq</span><span class="p">,</span> <span class="n">res_len</span><span class="p">):</span>
    <span class="n">learner</span><span class="o">.</span><span class="n">model</span><span class="o">.</span><span class="n">reset</span><span class="p">()</span>
    <span class="n">tok</span> <span class="o">=</span> <span class="n">Tokenizer</span><span class="p">()</span><span class="o">.</span><span class="n">proc_text</span><span class="p">(</span><span class="n">seq</span><span class="p">)</span>
    <span class="n">ids</span> <span class="o">=</span> <span class="p">[</span><span class="n">stoi</span><span class="p">[</span><span class="n">word</span><span class="p">]</span> <span class="k">for</span> <span class="n">word</span> <span class="ow">in</span> <span class="n">tok</span><span class="p">]</span>
    <span class="n">res</span> <span class="o">=</span> <span class="p">[]</span>
    <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="n">ids</span><span class="p">:</span>
        <span class="n">x</span> <span class="o">=</span> <span class="n">V</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">array</span><span class="p">(</span><span class="n">i</span><span class="p">))</span><span class="o">.</span><span class="n">unsqueeze</span><span class="p">(</span><span class="mi">0</span><span class="p">)</span>
        <span class="n">preds</span> <span class="o">=</span> <span class="n">learner</span><span class="o">.</span><span class="n">model</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>
    <span class="n">val</span><span class="p">,</span><span class="n">idx</span> <span class="o">=</span> <span class="n">preds</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span><span class="o">.</span><span class="n">data</span><span class="o">.</span><span class="n">max</span><span class="p">(</span><span class="mi">1</span><span class="p">)</span>
    <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">res_len</span><span class="p">):</span>
        <span class="n">res</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">idx</span><span class="p">[</span><span class="mi">0</span><span class="p">])</span>
        <span class="n">x</span> <span class="o">=</span> <span class="n">V</span><span class="p">(</span><span class="n">idx</span><span class="p">)</span><span class="o">.</span><span class="n">unsqueeze</span><span class="p">(</span><span class="mi">0</span><span class="p">)</span>
        <span class="n">preds</span> <span class="o">=</span> <span class="n">learner</span><span class="o">.</span><span class="n">model</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>
        <span class="n">val</span><span class="p">,</span><span class="n">idx</span> <span class="o">=</span> <span class="n">preds</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span><span class="o">.</span><span class="n">data</span><span class="o">.</span><span class="n">max</span><span class="p">(</span><span class="mi">1</span><span class="p">)</span>
    <span class="k">return</span> <span class="p">[</span><span class="n">itos</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="n">res</span><span class="p">]</span>
</pre>
<p>where Tokenizer is the object we use to tokenize text (the spacy tokenizer in the notebook), and stoi and itos are the mappings from tokens to ids and from ids to tokens. Using this to ask the
language model <em>What is a recurrent neural network?</em> I got this answer:</p>
<pre class="literal-block">
the first of these was the first of the series, the first of which was released in october of that year.
the first, " the last of the u_n ", was released on october 1, and the second, " the last of the u_n ",
was released on november 1.
</pre>
<p>It's not perfect, obviously, but it's interesting to note that it clearly learned basic grammar, or how to use punctuation, even closing its own quotes.</p>
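<p>The function what_next always keeps the highest-scoring token and feeds it back to the model. The same greedy loop can be sketched without any trained model, with a hypothetical table of bigram scores standing in for the language model (the table and the function name greedy_generate are invented for this example).</p>

```python
# Hypothetical bigram scores standing in for a trained language model.
scores = {
    "the": {"cat": 0.6, "dog": 0.4},
    "cat": {"sat": 0.9, "the": 0.1},
    "sat": {"the": 0.7, "cat": 0.3},
    "dog": {"sat": 1.0},
}

def greedy_generate(seed, res_len):
    """Repeatedly feed back the most likely next token, starting from the end of the seed."""
    current, res = seed[-1], []
    for _ in range(res_len):
        following = scores[current]
        current = max(following, key=following.get)   # argmax over the next-token scores
        res.append(current)
    return res

generated = greedy_generate(["the"], 5)
print(generated)
```

<p>With these scores the generation quickly cycles through the same three words. Always taking the argmax can lead to this kind of loop, which may be part of the reason the sample above repeats itself.</p>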
</div>
<script type='text/javascript'>if (!document.getElementById('mathjaxscript_pelican_#%@#$@#')) {
var align = "center",
indent = "0em",
linebreak = "false";
if (false) {
align = (screen.width < 768) ? "left" : align;
indent = (screen.width < 768) ? "0em" : indent;
linebreak = (screen.width < 768) ? 'true' : linebreak;
}
var mathjaxscript = document.createElement('script');
mathjaxscript.id = 'mathjaxscript_pelican_#%@#$@#';
mathjaxscript.type = 'text/javascript';
mathjaxscript.src = 'https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.0/MathJax.js?config=TeX-AMS-MML_HTMLorMML';
mathjaxscript[(window.opera ? "innerHTML" : "text")] =
"MathJax.Hub.Config({" +
" config: ['MMLorHTML.js']," +
" TeX: { extensions: ['AMSmath.js','AMSsymbols.js','noErrors.js','noUndefined.js'], equationNumbers: { autoNumber: 'AMS' } }," +
" jax: ['input/TeX','input/MathML','output/HTML-CSS']," +
" extensions: ['tex2jax.js','mml2jax.js','MathMenu.js','MathZoom.js']," +
" displayAlign: '"+ align +"'," +
" displayIndent: '"+ indent +"'," +
" showMathMenu: true," +
" messageStyle: 'normal'," +
" tex2jax: { " +
" inlineMath: [ ['\\\\(','\\\\)'] ], " +
" displayMath: [ ['$$','$$'] ]," +
" processEscapes: true," +
" preview: 'TeX'," +
" }, " +
" 'HTML-CSS': { " +
" styles: { '.MathJax_Display, .MathJax .mo, .MathJax .mi, .MathJax .mn': {color: 'inherit ! important'} }," +
" linebreaks: { automatic: "+ linebreak +", width: '90% container' }," +
" }, " +
"}); " +
"if ('default' !== 'default') {" +
"MathJax.Hub.Register.StartupHook('HTML-CSS Jax Ready',function () {" +
"var VARIANT = MathJax.OutputJax['HTML-CSS'].FONTDATA.VARIANT;" +
"VARIANT['normal'].fonts.unshift('MathJax_default');" +
"VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
"VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
"VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
"});" +
"MathJax.Hub.Register.StartupHook('SVG Jax Ready',function () {" +
"var VARIANT = MathJax.OutputJax.SVG.FONTDATA.VARIANT;" +
"VARIANT['normal'].fonts.unshift('MathJax_default');" +
"VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
"VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
"VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
"});" +
"}";
(document.body || document.getElementsByTagName('head')[0]).appendChild(mathjaxscript);
}
</script>The 1cycle policy2018-04-07T15:23:00-04:002018-04-07T15:23:00-04:00Sylvain Guggertag:None,2018-04-07:/the-1cycle-policy.html<p class="first last">Properly setting the hyper-parameters of a neural network can be challenging; fortunately, there are some recipes that can help.</p>
<p>Here, we will dig into the <a class="reference external" href="https://arxiv.org/abs/1803.09820">first part</a> of Leslie Smith's work about setting hyper-parameters (namely learning rate, momentum and weight decay). In particular, his 1cycle policy
trains complex models very quickly. As an example, we'll see how it allows us to train a resnet-56 on cifar10 to the same or a better precision than the authors in
<a class="reference external" href="https://arxiv.org/abs/1512.03385">their original paper</a> but with far fewer iterations.</p>
<p>By training with high learning rates we can reach a model that gets <strong>93% accuracy in 70 epochs</strong> which is less than
7k iterations (as opposed to the 64k iterations which made roughly 360 epochs in the original paper).</p>
<p><a class="reference external" href="https://github.com/sgugger/Deep-Learning/blob/master/Cyclical%20LR%20and%20momentums.ipynb">This notebook</a> contains all the experiments.
They are done with the same data-augmentation as in the original paper with one minor tweak: we randomly flip the picture horizontally and take a random crop after adding a padding of
4 pixels on each side. The minor tweak is that we don't color the padded pixels in black, but use a reflection padding, since it's the one implemented
in the fastai library. This is probably why we get slightly better results than Leslie when doing the experiments with the same hyper-parameters.</p>
<div class="section" id="using-high-learning-rates">
<h2>Using high learning rates</h2>
<p>We have already seen how to implement the <a class="reference external" href="/how-do-you-find-a-good-learning-rate.html">learning rate finder</a>. As a reminder: begin to train the model while increasing the learning rate from
a very low value to a very large one, and stop when the loss starts to really get out of control. Plot the losses against the learning rates and pick a value a bit before the minimum,
where the loss still improves. Here for instance, anything between <span class="math">\(10^{-2}\)</span> and <span class="math">\(3 \times 10^{-2}\)</span> seems like a good idea.</p>
<img alt="An example of curve when finder the learning rate" class="align-center" src="../images/art2_courbe_lr.png" style="width: 400px;" />
<p>The learning rate finder was already an idea of the same author, and he completes it in his <a class="reference external" href="https://arxiv.org/abs/1803.09820">latest article</a> with a good approach to adopt during training.</p>
<p>He recommends doing a cycle with two steps of equal lengths, one going from a lower learning rate to a higher one, then one going back to the minimum. The maximum should be the value picked
with the Learning Rate Finder, and the lower one can be ten times lower. Then, the length of this cycle should be slightly less than the total number of epochs, and, in the last
part of training, we should allow the learning rate to decrease more than the minimum, by several orders of magnitude.</p>
<img alt="Learning rates to use in a cycle" class="align-center" src="../images/art5_lr_schedule.png" style="width: 400px;" />
<p>The idea of starting slower isn't new: using a lower value to warm up the training is often done, and this is exactly what the first part is achieving. Leslie doesn't recommend
switching to a higher value directly, however, but rather going there slowly and linearly, and taking as much time going up as going down.</p>
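<p>To make the schedule concrete, here is a minimal sketch of it as a plain Python function. The name <tt class="docutils literal">one_cycle_lr</tt> and its arguments are hypothetical (they don't come from the paper or from the fastai library), and the choice of spending 90% of the iterations in the cycle is only an assumption standing in for "slightly less than the total number of epochs".</p>

```python
def one_cycle_lr(it, total_its, lr_max, cycle_frac=0.9, div=10.0, final_div=100.0):
    """Learning rate at iteration `it` (0-based) out of `total_its` iterations.

    cycle_frac: share of the training spent in the two-step cycle (assumed value).
    div:        ratio between the maximum and the minimum learning rate.
    final_div:  how far below the minimum we go during the last part.
    """
    lr_min = lr_max / div
    cycle_len = round(total_its * cycle_frac)
    half = cycle_len / 2
    if it < half:
        # Step 1: go linearly from lr_min up to lr_max.
        return lr_min + (lr_max - lr_min) * it / half
    if it < cycle_len:
        # Step 2: go linearly from lr_max back down to lr_min.
        return lr_max - (lr_max - lr_min) * (it - half) / half
    # Last part: keep decreasing linearly, from lr_min to lr_min / final_div.
    tail = total_its - cycle_len
    frac = (it - cycle_len + 1) / tail
    return lr_min - (lr_min - lr_min / final_div) * frac


# One value per iteration, here for a toy run of 100 iterations.
lrs = [one_cycle_lr(i, 100, lr_max=0.8) for i in range(100)]
```

<p>With these defaults the learning rate starts at a tenth of the maximum, peaks in the middle of the cycle and ends at one hundredth of the minimum.</p>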
<p>What he observed during his experiments is that during the middle of the cycle, the high learning rates act as a regularization method and keep the network from overfitting.
They prevent the model from landing in a steep area of the loss function, preferring to find a minimum that is flatter. He explained in <a class="reference external" href="https://arxiv.org/abs/1708.07120">this other paper</a> how he observed that by using this policy, approximations of the Hessian were lower, indicating that the SGD was finding a wider flat area.</p>
<p>Then the last part of the training, with descending learning rates up until annihilation, will allow us to go inside a steeper local minimum inside that smoother part. During the
part with high learning rates, we don't see substantial improvements in the loss or the accuracy, and the validation loss sometimes spikes very high, but we see all the benefits
of doing this when we finally lower the learning rates at the end.</p>
<img alt="Losses during a full cycle" class="align-center" src="../images/art5_losses.png" style="width: 400px;" />
<p>In this graph, the learning rate was rising from 0.08 to 0.8 between epochs 0 and 41, getting back to 0.08 between epochs 41 and 82 then going to one hundredth of 0.08 in the last
few epochs. We can see how the validation loss gets a little bit more volatile during the high learning rate part of the cycle (epochs 20 to 60 mostly) but the important part is
that on average, the distance between the training loss and the validation loss doesn't increase. We only really start to overfit at the end of the cycle, when the learning rate
gets annihilated.</p>
<p>Surprisingly, applying this policy even allows us to pick larger maximum learning rates, closer to the minimum of the plot we draw when using the learning rate finder. Those
trainings are a bit more dangerous in the sense that the loss can go too far away and make the whole thing diverge. In those cases, it can be worth trying a longer cycle before going
to a slower learning rate, since a long warm-up seems to help.</p>
<img alt="Losses during a full cycle" class="align-center" src="../images/art5_superconvergence.png" style="width: 400px;" />
<p>In this graph, the learning rate was rising from 0.15 to 3 between epochs 0 and 22.5, getting back to 0.15 between epochs 22.5 and 45 then going to one hundredth of 0.15 in the last
few epochs. With very high learning rates, we get to learn faster <strong>and</strong> prevent overfitting. The difference between the validation loss and the training loss stays extremely low
up until we annihilate the learning rates. This is the phenomenon Leslie Smith describes as super convergence.</p>
<p>With this technique, we can train a resnet-56 to have 92.3% accuracy on cifar10 in barely 50 epochs. Going to a cycle of 70 epochs gets us to 93% accuracy.</p>
<p>By contrast, a smaller cycle followed by a longer annihilation will result in something like this:</p>
<img alt="An example of overfitting" class="align-center" src="../images/art5_overfitting.png" style="width: 400px;" />
<p>Here our two steps end at epoch 42 and the rest of the training is spent with a learning rate slowly decreasing. The validation loss stops decreasing, which causes more and more
overfitting, and the accuracy barely improves.</p>
</div>
<div class="section" id="cyclical-momentum">
<h2>Cyclical momentum</h2>
<p>To accompany the movement toward larger learning rates, Leslie found in his experiments that decreasing the momentum led to better results. This supports the intuition that in
that part of the training, we want the SGD to quickly go in new directions to find a flatter area, so the new gradients need to be given more weight. In practice, he recommends
picking two values like 0.85 and 0.95, and decreasing from the higher one to the lower one when we increase the learning rate, then going back to the higher momentum as the learning
rate goes down.</p>
<img alt="Learning rate and momentum schedule" class="align-center" src="../images/art5_full_schedule.png" style="width: 600px;" />
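<p>The momentum schedule mirrors the learning rate one, and a sketch of it could look like this. As before, the function name and the <tt class="docutils literal">cycle_frac</tt> value are hypothetical; keeping the momentum at its higher value during the final annihilation phase is also an assumption made for this illustration.</p>

```python
def one_cycle_momentum(it, total_its, mom_max=0.95, mom_min=0.85, cycle_frac=0.9):
    """Momentum at iteration `it` (0-based) out of `total_its` iterations."""
    cycle_len = round(total_its * cycle_frac)
    half = cycle_len / 2
    if it < half:
        # The learning rate goes up, so the momentum goes down.
        return mom_max - (mom_max - mom_min) * it / half
    if it < cycle_len:
        # The learning rate goes down, so the momentum goes back up.
        return mom_min + (mom_max - mom_min) * (it - half) / half
    # Assumption: stay at the higher momentum while the learning rate is annihilated.
    return mom_max


moms = [one_cycle_momentum(i, 100) for i in range(100)]
```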
<p>According to Leslie, training with the single best constant value of momentum can give the same final results, but using cyclical momentums removes the hassle of trying multiple values
and running several full cycles, losing precious time.</p>
<p>Even if using cyclical momentum always gave slightly better results, I didn't find the same gap as in the paper between using a constant momentum and cyclical ones.</p>
</div>
<div class="section" id="all-the-other-parameters-matter">
<h2>All the other parameters matter</h2>
<p>The way we tune all the other hyper-parameters of the model will impact the best learning rate. That's why when we run the Learning Rate Finder, it's very important to use it
with the exact same conditions as during our training. For instance different batch sizes or weight decays will impact the results:</p>
<img alt="LR Finder for various weight decay values" class="align-center" src="../images/art5_wds.png" style="width: 400px;" />
<p>This can be useful to set some hyper-parameters. For instance, with weight decay, Leslie's advice is to run the learning rate finder
for a few values of weight decay, and pick the largest one that will still let us train at a high maximum learning rate. This is how we can come up with the <span class="math">\(10^{-4}\)</span> used
in our experiments.</p>
<p>In his opinion, the batch size should be set to the highest possible value to fit in the available memory. Then the other hyper-parameters we may have (dropout for instance) can
be tuned the same way as weight decay, or just by trying them on a cycle and seeing the results they give. The only thing is to never forget to re-run the Learning Rate Finder, especially
when deciding to pick a strategy with an aggressive learning rate close to the maximum possible value.</p>
<p>Training with the 1cycle policy at high learning rates is a method of regularization in itself, so we shouldn't be surprised if we have to reduce the other forms of regularization
we were previously using when we put it in place. It will however be more efficient, since we can train for a long time at large learning rates.</p>
</div>
Convolution in depth2018-04-05T11:03:00-04:002018-04-05T11:03:00-04:00Sylvain Guggertag:None,2018-04-05:/convolution-in-depth.html<p class="first last">CNNs (Convolutional Neural Network) are the most powerful networks used in computer vision. Let's see what a convolutional layer is all about, from the definition to the implementation in numpy, even with the back propagation.</p>
<p>Since AlexNet won the ImageNet competition in 2012, linear layers have been progressively replaced by convolutional ones in neural networks trained to perform a task related to
image recognition. Let's see what those layers do and how to implement them from scratch.</p>
<div class="section" id="what-is-convolution">
<h2>What is convolution?</h2>
<p>The idea behind convolution is the use of image kernels. A kernel is a small matrix (usually of size 3 by 3) used to apply an effect to an image (like sharpening, blurring...). This
is best shown on <a class="reference external" href="http://setosa.io/ev/image-kernels/">this super cool page</a> where you can actually see the direct effect on any image you like.</p>
<p>The core idea is that an image is just a bunch of numbers. Its representation in a computer is an array of size width by height pixels, and each pixel is associated with three float
values ranging from 0 to 1 (or integers going from 0 to 255). These three numbers represent the red-ness, green-ness and blue-ness of said pixel, the combination of the three
capturing its color. A fourth channel can be added to represent the transparency of the pixel but we won't focus on that.</p>
<p>If the image is black and white, a single value can be used per pixel, with 0 meaning black and 1 (or 255) meaning white. Let's begin with this for the explanation. The convolution
of our image by a given kernel of a given size is obtained by putting the kernel in front of every area of the picture, like a sliding window, to then do the element-wise product
of the numbers in our kernel by the ones in the picture it overlaps and summing all of these, like in this picture:</p>
<img alt="One convolution" class="align-center" src="../images/art4_one_conv.png" style="width: 600px;" />
<p>Then we repeat the process by moving the kernel on every possible area of the picture.</p>
<img alt="Full convolutional map" class="align-center" src="../images/art4_full_conv.png" style="width: 600px;" />
<p>As shown on <a class="reference external" href="http://setosa.io/ev/image-kernels/">this page mentioned earlier</a>, by doing this on all the areas of our picture, sliding the kernel in all the possible positions, it will give another array of numbers that
we can also interpret as a picture, and depending on the values inside of our kernel, this will apply different effects on the original image. The process is shown on <a class="reference external" href="https://youtu.be/Oqm9vsf_hvU">this video</a>.</p>
<p>The idea behind a convolutional layer in a neural net is then to initialize the weights of kernels like the one we just saw at random, then use SGD to find the best possible
parameters. It's possible to do this since the operation we are doing with our sliding window looks like</p>
<div class="math">
\begin{equation*}
y = \sum w_{i,j} x_{i,j}
\end{equation*}
</div>
<p>with the <span class="math">\(w_{i,j}\)</span> being the weights in our kernel and the <span class="math">\(x_{i,j}\)</span> being the values of our pixels. We can even decide to add a bias to have exactly the same
transformation as a linear layer, the only difference being that the weights of a given kernel are the same and applied to the whole picture.</p>
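<p>As an illustration of that sliding-window sum, here is a deliberately naive sketch for a single-channel picture, with no padding and a stride of one. The function name <tt class="docutils literal">conv2d_naive</tt> and the blurring kernel are only examples chosen for this snippet.</p>

```python
import numpy as np

def conv2d_naive(x, w, b=0.0):
    """Slide the kernel `w` over the 2-D array `x` (no padding, stride 1)."""
    n1, n2 = x.shape
    k1, k2 = w.shape
    out = np.zeros((n1 - k1 + 1, n2 - k2 + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # The part of the picture the kernel currently overlaps.
            window = x[i:i + k1, j:j + k2]
            # y = sum of w_ij * x_ij, plus an optional bias.
            out[i, j] = np.sum(window * w) + b
    return out

x = np.arange(49, dtype=float).reshape(7, 7)   # a fake 7 by 7 picture
w = np.ones((3, 3)) / 9.0                      # a simple blurring kernel
y = conv2d_naive(x, w)
print(y.shape)
```

<p>The two nested loops are exactly the "sliding window"; they are far too slow for real use, which is why the implementation later in this article avoids them.</p>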
<p>By stacking several convolutional layers one on top of the other, the hope is to get a neural network that captures the information we want on our image.</p>
</div>
<div class="section" id="stride-and-padding">
<h2>Stride and padding</h2>
<p>Before we implement a convolutional layer in python, there are a few additional tweaks we can add. Padding consists in adding a few pixels with a zero value on each (or a few) side of the picture.
By doing this, we can have an output that has exactly the same dimension as the input. For instance if we have a 7 by 7 image with a 3 by 3 kernel like in the
picture before, we can put the sliding window on 5 (7 - 3 + 1) different positions in width and height, so we get a 5 by 5 output. Adding a border of width one pixel all around
the picture will change the original image into a 9 by 9 picture and make an output of size 7 by 7.</p>
<p>A stride in convolution is just like the step in a for loop: instead of going through every window one after the other, we skip a given amount each time. Here is the result of
a convolution with a padding of one and a stride of two:</p>
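<p>The 7 by 7 example can be checked quickly with <tt class="docutils literal">np.pad</tt>, which is one possible way (not necessarily the one used later in this article) to add a border of zeros:</p>

```python
import numpy as np

x = np.arange(49).reshape(7, 7)
# Add a border of one pixel with value 0 all around the picture.
x_pad = np.pad(x, 1, mode="constant", constant_values=0)

k = 3  # kernel size
positions_before = x.shape[0] - k + 1      # window positions without padding
positions_after = x_pad.shape[0] - k + 1   # window positions with padding
print(x_pad.shape, positions_before, positions_after)
```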
<img alt="Full convolutional map with padding and stride" class="align-center" src="../images/art4_conv_stridepad.png" style="width: 600px;" />
<p>In the end, if our picture has <span class="math">\(n_{1}\)</span> rows and <span class="math">\(n_{2}\)</span> columns, our kernel <span class="math">\(k_{1}\)</span> rows and <span class="math">\(k_{2}\)</span> columns, with a padding of <span class="math">\(p\)</span> and a stride of <span class="math">\(s\)</span>,
the dimension of the new picture is</p>
<div class="math">
\begin{equation*}
\left \lfloor \frac{n_{1} + 2*p - k_{1}}{s} \right \rfloor + 1 \quad \hbox{by} \quad \left \lfloor \frac{n_{2} + 2*p - k_{2}}{s} \right \rfloor + 1.
\end{equation*}
</div>
<p>Why is that? Well for the width dimension, our picture has a size of <span class="math">\(n_{1} + 2*p\)</span> since we added <span class="math">\(p\)</span> pixels on each side. We begin in position 0 and the maximum index
at the far right is <span class="math">\(n_{1} + 2*p-k_{1}\)</span>. Since we move by steps of length <span class="math">\(s\)</span>, the last position we reach is <span class="math">\(\hbox{nb} s\)</span> where <span class="math">\(\hbox{nb}\)</span> is the highest
number satisfying</p>
<div class="math">
\begin{equation*}
\hbox{nb} \times s \leq n_{1} + 2*p - k_{1}
\end{equation*}
</div>
<p>which gives us</p>
<div class="math">
\begin{equation*}
\hbox{nb} = \left \lfloor \frac{n_{1} + 2*p - k_{1}}{s} \right \rfloor.
\end{equation*}
</div>
<p>Then from 0 to <span class="math">\(\hbox{nb}\)</span>, there are <span class="math">\(\hbox{nb}+1\)</span> integers, which is how we find the width of the output. It's the same reasoning for the height.</p>
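<p>The formula and the counting argument behind it fit in a few lines of Python. The two helper names below are hypothetical, introduced only to check the reasoning on small cases:</p>

```python
def conv_output_size(n, k, p=0, s=1):
    """Number of window positions along one dimension: floor((n + 2p - k) / s) + 1."""
    return (n + 2 * p - k) // s + 1

def window_starts(n, k, p=0, s=1):
    """Start index of every window along one dimension of the padded picture."""
    last_start = n + 2 * p - k          # maximum index a window can start at
    return list(range(0, last_start + 1, s))

print(conv_output_size(7, 3))             # no padding, stride 1
print(conv_output_size(7, 3, p=1))        # padding 1, stride 1
print(conv_output_size(7, 3, p=1, s=2))   # padding 1, stride 2
print(window_starts(7, 3, p=1, s=2))
```

<p>Counting the start indexes directly always gives the same number as the closed formula, which is the argument made above.</p>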
</div>
<div class="section" id="more-channels">
<h2>More channels</h2>
<p>We gave the example of a black and white picture, but when we have an image with colors, there are three different channels. This means that our filter will need the same number
of channels. In the previous example, instead of having just one 3 by 3 kernel, we'll have three. One for the red values of each pixel, one for the green values of each pixel and
one for the blue values. So our filter is 3 channels by 3 by 3. We place the red part in front of the red channel of our picture, the green part in front of the green channel
and the blue part in front of the blue channel, each time at exactly the same place like this.</p>
<img alt="A convolution on three channels at once" class="align-center" src="../images/art4_conv_three_channels.png" style="width: 600px;" />
<p>The results of those three intermediate convolutions <span class="math">\(y_{R}\)</span>, <span class="math">\(y_{G}\)</span> and <span class="math">\(y_{B}\)</span> are computed as before, and we sum them to get our final activation. It gets
a bit more complicated because this is just one kernel. To make a full layer, we will consider several of them and use them all on all the possible places of our picture, with
padding and stride if applicable.</p>
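<p>For one window and one filter, the computation just described can be sketched as follows. The random picture, the random filter and the window position are arbitrary values picked for the example.</p>

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.random((3, 7, 7))   # picture: channels (R, G, B) by rows by columns
w = rng.random((3, 3, 3))   # one filter: channels by 3 by 3

i, j = 2, 4                 # top-left corner of the window we look at
# One intermediate result per channel: y_R, y_G and y_B.
y_rgb = [np.sum(x[c, i:i + 3, j:j + 3] * w[c]) for c in range(3)]
activation = sum(y_rgb)

# Same thing in one go: element-wise product over the whole 3 by 3 by 3 block.
activation_direct = np.sum(x[:, i:i + 3, j:j + 3] * w)
print(np.isclose(activation, activation_direct))
```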
<p>Before the layer we had a picture with 3 channels, width <span class="math">\(n_{1}\)</span> and height <span class="math">\(n_{2}\)</span>, after, we have another representation of that image with as many channels as we
decided to take filters (let's say <span class="math">\(nb_{F}\)</span>), width <span class="math">\(nb_{1}\)</span> and height <span class="math">\(nb_{2}\)</span> (those two numbers being calculated with the formula above). If the initial
channels represented the red-ness, green-ness, blue-ness of a pixel, the new ones will represent things like horizontal-ness or blurriness of a given area.</p>
<p>When we stack this into a new convolutional layer (with kernels of size <span class="math">\(nb_{F}\)</span> by 3 by 3) it becomes harder to figure out what the channels we obtain represent, but we
don't necessarily need to understand their meaning. What's important is that the neural net will find a set of weights (via SGD) that helps it get the key information it needs
to eventually perform the task it was given (like identifying the digit in the picture, or classifying it between cat and dog).</p>
</div>
<div class="section" id="let-s-code-this">
<h2>Let's code this!</h2>
<p>Coding this in numpy is not the easiest thing so feel free to skip this part. It does provide good practice on vectorization and broadcasting though. We won't code the convolution
as a loop since it would be very inefficient when we have to do it on a whole mini-batch. Instead, we will vectorize our picture so that the convolution operation just becomes
a matrix product (which it is in essence, since it's a linear operation). This means taking each small window our kernel will look at and writing the numbers we see in a row.</p>
<p>Going back to the 7 by 7 matrix from before with a 3 by 3 kernel, it will change it like this (red window is written in red in our vector).</p>
<img alt="Vectorizing a picture" class="align-center" src="../images/art4_vectorize.png" style="width: 500px;" />
<p>If we denote by <span class="math">\(\hbox{vec}(x)\)</span> the vectorization of <span class="math">\(x\)</span>, and if we write our weights in a column <span class="math">\(W\)</span> (in the same order as with our windows), then the result of our
convolution is just</p>
<div class="math">
\begin{equation*}
\hbox{vec}(x) \times W
\end{equation*}
</div>
<p>If we want to have the result of the convolution with all our filters at once, we just have to concatenate the corresponding columns into a matrix (that we will still call
<span class="math">\(W\)</span>) and the same matrix product will give us all the results at the same time.</p>
<p>That's for one channel, but what happens when we have more? We can just concatenate the vectorization for each channel in a very big <span class="math">\(\hbox{vec}(x)\)</span>, and put all the weights
in the same order in a column of <span class="math">\(W\)</span>.</p>
<img alt="Color channels concatenated" class="align-center" src="../images/art4_concat_vec.png" style="width: 400px;" />
<p>Each row in this table represents a particular 3 by 3 window, which has 9 coordinates in each of the channels (red, green and blue), which is why they have 27 coordinates. There
are 25 possibilities to align a window in front of our picture, which is why there are 25 rows.</p>
<p>The last thing we can do is define a bias for each of our kernels, and if we write them in a table named <span class="math">\(B\)</span>, the result of our convolution is</p>
<div class="math">
\begin{equation*}
y_{1} = \hbox{vec}(x) \times W + B
\end{equation*}
</div>
<p>where <span class="math">\(B\)</span> has as many columns as <span class="math">\(W\)</span> (<span class="math">\(nb_{F}\)</span>) and its coordinates are broadcast over the rows (i.e. repeated as many times as necessary to make <span class="math">\(B\)</span> the
same size as the matrix before).</p>
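<p>Here is a small check that the matrix product really gives the convolution. To build <span class="math">\(\hbox{vec}(x)\)</span> it uses <tt class="docutils literal">sliding_window_view</tt> from NumPy (available from version 1.20), which is a shortcut swapped in for this sketch and not the vectorization function written later in this article. All the sizes and the random values are arbitrary.</p>

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

rng = np.random.default_rng(0)
ch, n1, n2, k1, k2, nb_f = 3, 7, 7, 3, 3, 4
x = rng.random((ch, n1, n2))
filters = rng.random((nb_f, ch, k1, k2))
B = rng.random(nb_f)

# Shape (ch, 5, 5, k1, k2): one k1 by k2 patch per channel and window position.
windows = sliding_window_view(x, (k1, k2), axis=(1, 2))
# One row per window, the channels concatenated one after the other: shape (25, 27).
vec_x = windows.transpose(1, 2, 0, 3, 4).reshape(-1, ch * k1 * k2)
# One column per filter, weights in the same (channel, row, column) order: shape (27, 4).
W = filters.reshape(nb_f, -1).T

y1 = vec_x @ W + B   # B is broadcast over the 25 rows

# Compare one entry with the direct sum of products for window (i, j) and filter f.
i, j, f = 1, 3, 2
direct = np.sum(x[:, i:i + k1, j:j + k2] * filters[f]) + B[f]
print(y1.shape, np.isclose(y1[i * 5 + j, f], direct))
```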
<p>Last part is to reshape the result <span class="math">\(y_{1}\)</span>. Let's go back to the input <span class="math">\(x\)</span>. In practice, we will get a whole mini-batch of them, which gives a new dimension (three
was too easy already...). So the size of <span class="math">\(x\)</span> is <span class="math">\(m\)</span> (for mini-batch) by <span class="math">\(ch\)</span> (3 if we have the original picture) by <span class="math">\(n_{1}\)</span> by <span class="math">\(n_{2}\)</span>. When
we vectorize it, <span class="math">\(\hbox{vec}(x)\)</span> has a size of <span class="math">\(m\)</span> by <span class="math">\(nb_{1} \times nb_{2}\)</span> (all the possibilities to put our filter in front of our image) by
<span class="math">\(k_{1} \times k_{2} \times ch\)</span> (where the kernel is assumed of size <span class="math">\(k_{1}\)</span> by <span class="math">\(k_{2}\)</span>).</p>
<p>Our matrix <span class="math">\(W\)</span> has <span class="math">\(nb_{F}\)</span> columns (the number of filters) and <span class="math">\(k_{1} \times k_{2} \times ch\)</span> rows, so the product will give us a result <span class="math">\(y_{1}\)</span>
of size <span class="math">\(m\)</span> by <span class="math">\(nb_{1} \times nb_{2}\)</span> by <span class="math">\(nb_{F}\)</span> (assuming we <em>broadcast</em> the product over the first dimension, doing it for the whole mini-batch). The channels
should be the second dimension if we want to be consistent with how we treated <span class="math">\(x\)</span> so we have to transpose the last two dimensions, and finally reshape the result
as <span class="math">\(y\)</span>, with a shape of <span class="math">\(m\)</span> by <span class="math">\(nb_{F}\)</span> by <span class="math">\(nb_{1}\)</span> by <span class="math">\(nb_{2}\)</span>.</p>
<p>That sounds awfully complicated but as soon as we are done with the first part (vectorize the picture) the rest will be very easy.</p>
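<p>The bookkeeping on shapes can be traced on random arrays without writing the vectorization itself. In this sketch <tt class="docutils literal">vec_x</tt> is just filled with random numbers of the right shape, and the sizes are arbitrary:</p>

```python
import numpy as np

rng = np.random.default_rng(0)
m, ch, k1, k2, nb1, nb2, nb_f = 8, 3, 3, 3, 5, 5, 4

vec_x = rng.random((m, nb1 * nb2, k1 * k2 * ch))   # stand-in for vec(x)
W = rng.random((k1 * k2 * ch, nb_f))
B = rng.random(nb_f)

# The matrix product is broadcast over the mini-batch dimension.
y1 = np.matmul(vec_x, W) + B                        # m by nb1*nb2 by nb_F
# Put the channels in second position, then unfold the window positions.
y = np.transpose(y1, (0, 2, 1)).reshape(m, nb_f, nb1, nb2)
print(y1.shape, y.shape)

# Entry (i, j) of channel f comes from row i * nb2 + j and column f of y1.
b, f, i, j = 1, 2, 3, 4
print(y[b, f, i, j] == y1[b, i * nb2 + j, f])
```

<p>Reshaping <span class="math">\(y_{1}\)</span> directly, without the transposition, would give an array of the right shape but with the values in the wrong places.</p>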
</div>
<div class="section" id="forward-pass">
<h2>Forward pass</h2>
<p>Let's assume first we already have that vectorization function that I'll call arr2vec. Since it's the hardest bit, let's keep it for the end. As we saw before, there's no point
in our computation where we will need the weights other than in the form of the matrix <span class="math">\(W\)</span>, so that's how we will store and create them. As for a linear layer, the best
way to initialize them is by following a normal distribution with a standard deviation of <span class="math">\(\sqrt{\frac{2}{ch}}\)</span> if <span class="math">\(ch\)</span> is the number of channels of the input.</p>
<p>For the forward pass, we vectorize the mini-batch of inputs x, then we multiply the result by our weights and add our bias. Then we have to swap the last two axes and reshape
with the output size, which can be computed with the formulas above. All in all, this gives us:</p>
<pre class="code python literal-block">
<span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="nn">np</span>

<span class="k">class</span> <span class="nc">Convolution</span><span class="p">():</span>
    <span class="k">def</span> <span class="fm">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">nc_in</span><span class="p">,</span> <span class="n">nc_out</span><span class="p">,</span> <span class="n">kernel_size</span><span class="p">,</span> <span class="n">stride</span><span class="o">=</span><span class="mi">2</span><span class="p">,</span><span class="n">padding</span><span class="o">=</span><span class="mi">1</span><span class="p">):</span>
        <span class="bp">self</span><span class="o">.</span><span class="n">kernel_size</span> <span class="o">=</span> <span class="n">kernel_size</span>
        <span class="c1"># The weights are stored directly as the matrix W (one column per filter)</span>
        <span class="bp">self</span><span class="o">.</span><span class="n">weights</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">randn</span><span class="p">(</span><span class="n">nc_in</span> <span class="o">*</span> <span class="n">kernel_size</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">*</span> <span class="n">kernel_size</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span> <span class="p">,</span><span class="n">nc_out</span><span class="p">)</span> <span class="o">*</span> <span class="n">np</span><span class="o">.</span><span class="n">sqrt</span><span class="p">(</span><span class="mi">2</span><span class="o">/</span><span class="n">nc_in</span><span class="p">)</span>
        <span class="bp">self</span><span class="o">.</span><span class="n">biases</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">zeros</span><span class="p">(</span><span class="n">nc_out</span><span class="p">)</span>
        <span class="bp">self</span><span class="o">.</span><span class="n">stride</span> <span class="o">=</span> <span class="n">stride</span>
        <span class="bp">self</span><span class="o">.</span><span class="n">padding</span> <span class="o">=</span> <span class="n">padding</span>

    <span class="k">def</span> <span class="nf">forward</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span><span class="n">x</span><span class="p">):</span>
        <span class="n">mb</span><span class="p">,</span> <span class="n">ch</span><span class="p">,</span> <span class="n">n</span><span class="p">,</span> <span class="n">p</span> <span class="o">=</span> <span class="n">x</span><span class="o">.</span><span class="n">shape</span>
        <span class="c1"># vec(x) times W plus B, broadcast over the mini-batch</span>
        <span class="n">y</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">matmul</span><span class="p">(</span><span class="n">arr2vec</span><span class="p">(</span><span class="n">x</span><span class="p">,</span><span class="bp">self</span><span class="o">.</span><span class="n">kernel_size</span><span class="p">,</span><span class="bp">self</span><span class="o">.</span><span class="n">stride</span><span class="p">,</span><span class="bp">self</span><span class="o">.</span><span class="n">padding</span><span class="p">),</span> <span class="bp">self</span><span class="o">.</span><span class="n">weights</span><span class="p">)</span> <span class="o">+</span> <span class="bp">self</span><span class="o">.</span><span class="n">biases</span>
        <span class="c1"># Put the channels in second position before reshaping</span>
        <span class="n">y</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">transpose</span><span class="p">(</span><span class="n">y</span><span class="p">,(</span><span class="mi">0</span><span class="p">,</span><span class="mi">2</span><span class="p">,</span><span class="mi">1</span><span class="p">))</span>
        <span class="n">n1</span> <span class="o">=</span> <span class="p">(</span><span class="n">n</span><span class="o">-</span><span class="bp">self</span><span class="o">.</span><span class="n">kernel_size</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span><span class="o">+</span> <span class="mi">2</span> <span class="o">*</span> <span class="bp">self</span><span class="o">.</span><span class="n">padding</span><span class="p">)</span> <span class="o">//</span><span class="bp">self</span><span class="o">.</span><span class="n">stride</span> <span class="o">+</span> <span class="mi">1</span>
        <span class="n">p1</span> <span class="o">=</span> <span class="p">(</span><span class="n">p</span><span class="o">-</span><span class="bp">self</span><span class="o">.</span><span class="n">kernel_size</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span><span class="o">+</span><span class="mi">2</span> <span class="o">*</span> <span class="bp">self</span><span class="o">.</span><span class="n">padding</span> <span class="p">)</span><span class="o">//</span><span class="bp">self</span><span class="o">.</span><span class="n">stride</span> <span class="o">+</span> <span class="mi">1</span>
        <span class="k">return</span> <span class="n">y</span><span class="o">.</span><span class="n">reshape</span><span class="p">(</span><span class="n">mb</span><span class="p">,</span><span class="bp">self</span><span class="o">.</span><span class="n">biases</span><span class="o">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span><span class="n">n1</span><span class="p">,</span><span class="n">p1</span><span class="p">)</span>
</pre>
<p>The arr2vec function remains. To write it, let's go back to the previous picture:</p>
<img alt="Vectorizing a picture" class="align-center" src="../images/art4_vectorize.png" style="width: 500px;" />
<p>The whole problem is to build this vectorization. Once it is done, we'll just have to take the corresponding elements in our array x (instead of aligning 1,2,3, we'll need x[1],x[2],x[3]). We
can note that each row in the vectorization can be deduced from the first by adding the same number everywhere. Let's forget about padding and stride to begin with and
start with this first line.</p>
<p>Since we're in Python, indexes begin at 0. Then we just want the numbers <span class="math">\(j+i*n_{2}\)</span> where <span class="math">\(i\)</span> goes from 0 to <span class="math">\(k_{1}-1\)</span>, <span class="math">\(j\)</span> goes from 0 to <span class="math">\(k_{2}-1\)</span> and
<span class="math">\(n_{2}\)</span> is the number of columns. Those will form the grid of our kernel. Then we have to determine all the possible start indexes, which correspond to the points with
coordinates <span class="math">\((i,j)\)</span> where <span class="math">\(i\)</span> can vary from 0 to <span class="math">\(n_{1} - k_{1}\)</span> and <span class="math">\(j\)</span> can vary from 0 to <span class="math">\(n_{2} - k_{2}\)</span> (both included). For a given pair <span class="math">\((i,j)\)</span>,
the associated index is <span class="math">\(j + i * n_{2}\)</span>. This gives us:</p>
<pre class="code python literal-block">
<span class="n">grid</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">array</span><span class="p">([</span><span class="n">j</span> <span class="o">+</span> <span class="n">n2</span><span class="o">*</span><span class="n">i</span> <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">k1</span><span class="p">)</span> <span class="k">for</span> <span class="n">j</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">k2</span><span class="p">)])</span>
<span class="n">start_idx</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">array</span><span class="p">([</span><span class="n">j</span> <span class="o">+</span> <span class="n">n2</span><span class="o">*</span><span class="n">i</span> <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">n1</span><span class="o">-</span><span class="n">k1</span><span class="o">+</span><span class="mi">1</span><span class="p">)</span> <span class="k">for</span> <span class="n">j</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">n2</span><span class="o">-</span><span class="n">k2</span><span class="o">+</span><span class="mi">1</span><span class="p">)</span> <span class="p">])</span>
</pre>
<p>Why the np.array? Well, once we have done this, we want the array whose rows are grid plus each element of start_idx, which is very easy to do with broadcasting. Our
vectorized array of indexes is:</p>
<pre class="code python literal-block">
<span class="n">grid</span><span class="p">[</span><span class="bp">None</span><span class="p">,:]</span> <span class="o">+</span> <span class="n">start_idx</span><span class="p">[:,</span><span class="bp">None</span><span class="p">]</span>
</pre>
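<p>To make this concrete, here is a small sketch on a 3 by 4 picture with a 2 by 2 kernel (these sizes, and the names idx, vec and windows, are just assumptions for the illustration). It compares the broadcasted indexes with windows extracted by plain slicing.</p>

```python
import numpy as np

n1, n2 = 3, 4   # picture size (rows, columns), chosen for the example
k1, k2 = 2, 2   # kernel size, chosen for the example

grid = np.array([j + n2*i for i in range(k1) for j in range(k2)])
start_idx = np.array([j + n2*i for i in range(n1-k1+1) for j in range(n2-k2+1)])
idx = grid[None, :] + start_idx[:, None]   # one row of indexes per window

# A flattened picture whose values are easy to recognize.
x = np.arange(n1 * n2) * 10
vec = x[idx]

# Reference: cut each window out of the 2D picture and flatten it.
windows = np.array([x.reshape(n1, n2)[i:i+k1, j:j+k2].ravel()
                    for i in range(n1-k1+1) for j in range(n2-k2+1)])
print(idx)
print(np.array_equal(vec, windows))
```

<p>Each row of idx is the first row (the grid) shifted by one start index, which is exactly the remark we made on the picture.</p>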
<p>Let's add a bit more complexity now. This is all we need for one channel, but we will actually get <span class="math">\(ch\)</span> of them. Our start indexes won't change since they are the same
for all the channels, but our grid must include more elements. Specifically, we need to duplicate the grid <span class="math">\(ch\)</span> times, adding <span class="math">\(n_{1} \times n_{2}\)</span> for each new channel. This
is done by</p>
<pre class="code python literal-block">
<span class="n">grid</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">array</span><span class="p">([</span><span class="n">j</span> <span class="o">+</span> <span class="n">n2</span><span class="o">*</span><span class="n">i</span> <span class="o">+</span> <span class="n">n1</span> <span class="o">*</span> <span class="n">n2</span> <span class="o">*</span> <span class="n">k</span> <span class="k">for</span> <span class="n">k</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span><span class="n">ch</span><span class="p">)</span> <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">k1</span><span class="p">)</span> <span class="k">for</span> <span class="n">j</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">k2</span><span class="p">)])</span>
</pre>
<p>Now for the stride and padding. Padding adds zeros on the sides of the picture, so we can begin with this.</p>
<pre class="code python literal-block">
<span class="n">y</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">zeros</span><span class="p">((</span><span class="n">mb</span><span class="p">,</span><span class="n">ch</span><span class="p">,</span><span class="n">n1</span><span class="o">+</span><span class="mi">2</span><span class="o">*</span><span class="n">padding</span><span class="p">,</span><span class="n">n2</span><span class="o">+</span><span class="mi">2</span><span class="o">*</span><span class="n">padding</span><span class="p">))</span>
<span class="n">y</span><span class="p">[:,:,</span><span class="n">padding</span><span class="p">:</span><span class="n">n1</span><span class="o">+</span><span class="n">padding</span><span class="p">,</span><span class="n">padding</span><span class="p">:</span><span class="n">n2</span><span class="o">+</span><span class="n">padding</span><span class="p">]</span> <span class="o">=</span> <span class="n">x</span>
</pre>
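<p>As a side note, the same padded array can be obtained with numpy.pad. The sketch below (with made-up sizes) only checks that both ways agree. It is not what the rest of the article uses.</p>

```python
import numpy as np

mb, ch, n1, n2, padding = 2, 3, 4, 5, 1   # sizes chosen for the example
x = np.random.randn(mb, ch, n1, n2)

# Manual padding, as in the article.
y = np.zeros((mb, ch, n1+2*padding, n2+2*padding))
y[:, :, padding:n1+padding, padding:n2+padding] = x

# Same thing with numpy.pad: no padding on the batch and channel axes.
pad_width = ((0, 0), (0, 0), (padding, padding), (padding, padding))
y_pad = np.pad(x, pad_width, mode="constant")

print(y.shape, np.array_equal(y, y_pad))
```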
<p>This doesn't change our grid much, we will just have to adapt the sizes of our picture (now <span class="math">\(n_{1} + 2p\)</span> by <span class="math">\(n_{2} + 2p\)</span>). The start indexes will change slightly: the
upper bounds for the indexes <span class="math">\(i\)</span> and <span class="math">\(j\)</span> are now <span class="math">\(n_{1} +2p - k_{1}\)</span> and <span class="math">\(n_{2} + 2p - k_{2}\)</span>. Stride only adds one thing: we loop with a step.</p>
<pre class="code python literal-block">
<span class="n">start_idx</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">array</span><span class="p">([</span><span class="n">j</span> <span class="o">+</span> <span class="p">(</span><span class="n">n2</span><span class="o">+</span><span class="mi">2</span><span class="o">*</span><span class="n">padding</span><span class="p">)</span><span class="o">*</span><span class="n">i</span> <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span><span class="n">n1</span><span class="o">-</span><span class="n">k1</span><span class="o">+</span><span class="mi">1</span><span class="o">+</span><span class="mi">2</span><span class="o">*</span><span class="n">padding</span><span class="p">,</span><span class="n">stride</span><span class="p">)</span> <span class="k">for</span> <span class="n">j</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span><span class="n">n2</span><span class="o">-</span><span class="n">k2</span><span class="o">+</span><span class="mi">1</span><span class="o">+</span><span class="mi">2</span><span class="o">*</span><span class="n">padding</span><span class="p">,</span><span class="n">stride</span><span class="p">)</span> <span class="p">])</span>
<span class="n">grid</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">array</span><span class="p">([</span><span class="n">j</span> <span class="o">+</span> <span class="p">(</span><span class="n">n2</span><span class="o">+</span><span class="mi">2</span><span class="o">*</span><span class="n">padding</span><span class="p">)</span><span class="o">*</span><span class="n">i</span> <span class="o">+</span> <span class="p">(</span><span class="n">n1</span><span class="o">+</span><span class="mi">2</span><span class="o">*</span><span class="n">padding</span><span class="p">)</span> <span class="o">*</span> <span class="p">(</span><span class="n">n2</span><span class="o">+</span><span class="mi">2</span><span class="o">*</span><span class="n">padding</span><span class="p">)</span> <span class="o">*</span> <span class="n">k</span> <span class="k">for</span> <span class="n">k</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span><span class="n">ch</span><span class="p">)</span> <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">k1</span><span class="p">)</span> <span class="k">for</span> <span class="n">j</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">k2</span><span class="p">)])</span>
<span class="n">to_take</span> <span class="o">=</span> <span class="n">start_idx</span><span class="p">[:,</span><span class="bp">None</span><span class="p">]</span> <span class="o">+</span> <span class="n">grid</span><span class="p">[</span><span class="bp">None</span><span class="p">,:]</span>
</pre>
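<p>A quick sanity check we can run here: the number of start indexes should match the output size computed in the forward pass, (n - k + 2 * padding) // stride + 1 along each dimension. The sizes below are arbitrary, and out1 and out2 are names introduced for this sketch.</p>

```python
import numpy as np

n1, n2 = 5, 6          # picture size, chosen for the example
k1, k2 = 3, 3          # kernel size
stride, padding = 2, 1

start_idx = np.array([j + (n2+2*padding)*i
                      for i in range(0, n1-k1+1+2*padding, stride)
                      for j in range(0, n2-k2+1+2*padding, stride)])

# Output sizes, with the same formula as in the forward method.
out1 = (n1 - k1 + 2*padding) // stride + 1
out2 = (n2 - k2 + 2*padding) // stride + 1

print(start_idx)
print(len(start_idx) == out1 * out2)
```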
<p>The last step is to do this for each element of the mini-batch. Again, this will easily be done with a bit of broadcasting:</p>
<pre class="code python literal-block">
<span class="n">batch</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">array</span><span class="p">(</span><span class="nb">range</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span><span class="n">mb</span><span class="p">))</span> <span class="o">*</span> <span class="n">ch</span> <span class="o">*</span> <span class="p">(</span><span class="n">n1</span><span class="o">+</span><span class="mi">2</span><span class="o">*</span><span class="n">padding</span><span class="p">)</span> <span class="o">*</span> <span class="p">(</span><span class="n">n2</span><span class="o">+</span><span class="mi">2</span><span class="o">*</span><span class="n">padding</span><span class="p">)</span>
<span class="n">batch</span><span class="p">[:,</span><span class="bp">None</span><span class="p">,</span><span class="bp">None</span><span class="p">]</span> <span class="o">+</span> <span class="n">to_take</span><span class="p">[</span><span class="bp">None</span><span class="p">,:,:]</span>
</pre>
<p>This final array has exactly the same shape as our desired output, and contains all the indexes we have to take in our array y. We just have to use the function numpy.take
to select the corresponding elements in y.</p>
<pre class="code python literal-block">
<span class="k">def</span> <span class="nf">arr2vec</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">kernel_size</span><span class="p">,</span> <span class="n">stride</span><span class="o">=</span><span class="mi">1</span><span class="p">,</span><span class="n">padding</span><span class="o">=</span><span class="mi">0</span><span class="p">):</span>
    <span class="n">k1</span><span class="p">,</span><span class="n">k2</span> <span class="o">=</span> <span class="n">kernel_size</span>
    <span class="n">mb</span><span class="p">,</span> <span class="n">ch</span><span class="p">,</span> <span class="n">n1</span><span class="p">,</span> <span class="n">n2</span> <span class="o">=</span> <span class="n">x</span><span class="o">.</span><span class="n">shape</span>
    <span class="n">y</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">zeros</span><span class="p">((</span><span class="n">mb</span><span class="p">,</span><span class="n">ch</span><span class="p">,</span><span class="n">n1</span><span class="o">+</span><span class="mi">2</span><span class="o">*</span><span class="n">padding</span><span class="p">,</span><span class="n">n2</span><span class="o">+</span><span class="mi">2</span><span class="o">*</span><span class="n">padding</span><span class="p">))</span>
    <span class="n">y</span><span class="p">[:,:,</span><span class="n">padding</span><span class="p">:</span><span class="n">n1</span><span class="o">+</span><span class="n">padding</span><span class="p">,</span><span class="n">padding</span><span class="p">:</span><span class="n">n2</span><span class="o">+</span><span class="n">padding</span><span class="p">]</span> <span class="o">=</span> <span class="n">x</span>
    <span class="n">start_idx</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">array</span><span class="p">([</span><span class="n">j</span> <span class="o">+</span> <span class="p">(</span><span class="n">n2</span><span class="o">+</span><span class="mi">2</span><span class="o">*</span><span class="n">padding</span><span class="p">)</span><span class="o">*</span><span class="n">i</span> <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span><span class="n">n1</span><span class="o">-</span><span class="n">k1</span><span class="o">+</span><span class="mi">1</span><span class="o">+</span><span class="mi">2</span><span class="o">*</span><span class="n">padding</span><span class="p">,</span><span class="n">stride</span><span class="p">)</span> <span class="k">for</span> <span class="n">j</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span><span class="n">n2</span><span class="o">-</span><span class="n">k2</span><span class="o">+</span><span class="mi">1</span><span class="o">+</span><span class="mi">2</span><span class="o">*</span><span class="n">padding</span><span class="p">,</span><span class="n">stride</span><span class="p">)</span> <span class="p">])</span>
    <span class="n">grid</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">array</span><span class="p">([</span><span class="n">j</span> <span class="o">+</span> <span class="p">(</span><span class="n">n2</span><span class="o">+</span><span class="mi">2</span><span class="o">*</span><span class="n">padding</span><span class="p">)</span><span class="o">*</span><span class="n">i</span> <span class="o">+</span> <span class="p">(</span><span class="n">n1</span><span class="o">+</span><span class="mi">2</span><span class="o">*</span><span class="n">padding</span><span class="p">)</span> <span class="o">*</span> <span class="p">(</span><span class="n">n2</span><span class="o">+</span><span class="mi">2</span><span class="o">*</span><span class="n">padding</span><span class="p">)</span> <span class="o">*</span> <span class="n">k</span> <span class="k">for</span> <span class="n">k</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span><span class="n">ch</span><span class="p">)</span> <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">k1</span><span class="p">)</span> <span class="k">for</span> <span class="n">j</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">k2</span><span class="p">)])</span>
    <span class="n">to_take</span> <span class="o">=</span> <span class="n">start_idx</span><span class="p">[:,</span><span class="bp">None</span><span class="p">]</span> <span class="o">+</span> <span class="n">grid</span><span class="p">[</span><span class="bp">None</span><span class="p">,:]</span>
    <span class="n">batch</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">array</span><span class="p">(</span><span class="nb">range</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span><span class="n">mb</span><span class="p">))</span> <span class="o">*</span> <span class="n">ch</span> <span class="o">*</span> <span class="p">(</span><span class="n">n1</span><span class="o">+</span><span class="mi">2</span><span class="o">*</span><span class="n">padding</span><span class="p">)</span> <span class="o">*</span> <span class="p">(</span><span class="n">n2</span><span class="o">+</span><span class="mi">2</span><span class="o">*</span><span class="n">padding</span><span class="p">)</span>
    <span class="k">return</span> <span class="n">y</span><span class="o">.</span><span class="n">take</span><span class="p">(</span><span class="n">batch</span><span class="p">[:,</span><span class="bp">None</span><span class="p">,</span><span class="bp">None</span><span class="p">]</span> <span class="o">+</span> <span class="n">to_take</span><span class="p">[</span><span class="bp">None</span><span class="p">,:,:])</span>
</pre>
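<p>If you want to convince yourself that the indexes are right, one way is to compare the result with a slow version that cuts every window with explicit loops. The sketch below repeats arr2vec (lightly condensed, with two helper names h and w for the padded sizes) so that it runs on its own. naive_arr2vec is a hypothetical reference written only for this check.</p>

```python
import numpy as np

def arr2vec(x, kernel_size, stride=1, padding=0):
    # Same logic as the function above, repeated so the snippet is self-contained.
    k1, k2 = kernel_size
    mb, ch, n1, n2 = x.shape
    h, w = n1 + 2*padding, n2 + 2*padding   # padded sizes
    y = np.zeros((mb, ch, h, w))
    y[:, :, padding:n1+padding, padding:n2+padding] = x
    start_idx = np.array([j + w*i for i in range(0, h-k1+1, stride)
                          for j in range(0, w-k2+1, stride)])
    grid = np.array([j + w*i + h*w*k for k in range(ch)
                     for i in range(k1) for j in range(k2)])
    to_take = start_idx[:, None] + grid[None, :]
    batch = np.arange(mb) * ch * h * w
    return y.take(batch[:, None, None] + to_take[None, :, :])

def naive_arr2vec(x, kernel_size, stride=1, padding=0):
    # Hypothetical slow reference: slice each window and flatten it.
    k1, k2 = kernel_size
    pad_width = ((0, 0), (0, 0), (padding, padding), (padding, padding))
    y = np.pad(x, pad_width, mode="constant")
    mb, ch, h, w = y.shape
    rows = [[y[m, :, i:i+k1, j:j+k2].ravel()
             for i in range(0, h-k1+1, stride)
             for j in range(0, w-k2+1, stride)]
            for m in range(mb)]
    return np.array(rows)

x = np.random.randn(2, 3, 5, 6)
fast = arr2vec(x, (3, 3), stride=2, padding=1)
slow = naive_arr2vec(x, (3, 3), stride=2, padding=1)
print(fast.shape, np.allclose(fast, slow))
```

<p>The output has one row per window position and one column per element of the kernel over all the channels, which is the shape we need to multiply by the weights.</p>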
</div>
<div class="section" id="back-propagation">
<h2>Back propagation</h2>
<p>If you've made it this far, there is just one last step to have completely understood a convolutional layer: we need to compute the gradients of the loss with respect to the
weights, the biases and the inputs, being given the gradients of the loss with respect to the outputs.</p>
<p>At heart, a convolutional layer is just a certain type of linear layer, so the formulas we had seen for the back-propagation through a linear layer will be useful here too. There is
just a bit of reshaping, transposing, and... sadly... going back from a vector to an array. But let's keep this for last, since it'll be the worst.</p>
<p>When we receive our gradients in a variable grads, they will have the same shape as our final output y, so <span class="math">\(m\)</span> by <span class="math">\(nb_{F}\)</span> by <span class="math">\(nb_{1}\)</span> by <span class="math">\(nb_{2}\)</span>. To go
back to <span class="math">\(y_{1}\)</span>, we have to reshape our gradients as <span class="math">\(m\)</span> by <span class="math">\(nb_{F}\)</span> by <span class="math">\(nb_{1} \times nb_{2}\)</span> then invert the two last coordinates. This will give us
grad1.</p>
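<p>Here is a small sketch showing that this reshape followed by a transpose really undoes what the forward pass did at the end. The sizes are arbitrary, and y1 just holds random numbers standing for the output of the matrix product.</p>

```python
import numpy as np

m, nb_F, nb_1, nb_2 = 2, 4, 3, 5   # sizes chosen for the example

# What the forward pass has before its final transpose and reshape.
y1 = np.random.randn(m, nb_1 * nb_2, nb_F)

# End of the forward pass: swap the last two axes, then unfold the positions.
y = np.transpose(y1, (0, 2, 1)).reshape(m, nb_F, nb_1, nb_2)

# Start of the backward pass: fold the positions, then swap the axes back.
grad1 = np.transpose(y.reshape(m, nb_F, nb_1 * nb_2), (0, 2, 1))

print(grad1.shape, np.array_equal(grad1, y1))
```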
<p>The operation we did at this stage is</p>
<div class="math">
\begin{equation*}
y_{1} = \hbox{vec}(x) \times W + B
\end{equation*}
</div>
<p>so we already know the gradient of the loss with respect to <span class="math">\(\hbox{vec}(x)\)</span> is <span class="math">\(\hbox{grad}_{1} \times {}^{t} W\)</span> (like <a class="reference external" href="/a-simple-neural-net-in-numpy.html">in this article</a>).
Following the same lead, the gradients of the loss with respect to the biases should be in grad1, but this time this array has one dimension too many. That's because each bias
is used for all the activations (whereas before they were only used once). We will have to sum the gradients over all the activations in which they appear (that's the second dimension of
grad1), then take the mean over the mini-batch (which is the first dimension).</p>
<p>Why the sum? It comes from the chain rule. Since a given bias <span class="math">\(b\)</span> is used to compute <span class="math">\(z_{1},\dots,z_{N}\)</span> (where <span class="math">\(N = nb_{1} \times nb_{2}\)</span>) we have</p>
<div class="math">
\begin{equation*}
\frac{\partial \hbox{loss}}{\partial b} = \sum_{i=1}^{N} \frac{\partial \hbox{loss}}{\partial z_{i}} \times \frac{\partial z_{i}}{\partial b} = \sum_{i=1}^{N} \frac{\partial \hbox{loss}}{\partial z_{i}}
\end{equation*}
</div>
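<p>We can check this formula numerically on a toy example. In the sketch below, everything is an assumption made for the test: a stands for the product of vec(x) by W, c is an array of fixed coefficients, and the loss of one sample is the sum of c times the activations, so that its gradient with respect to the activations is simply c.</p>

```python
import numpy as np

rng = np.random.default_rng(0)
mb, N, nb_F = 4, 6, 3                    # sizes chosen for the example
a = rng.standard_normal((mb, N, nb_F))   # stands for vec(x) times W
c = rng.standard_normal((mb, N, nb_F))   # coefficients of the toy loss
b = rng.standard_normal(nb_F)            # one bias per filter

def loss(b):
    z = a + b                                # each bias is reused at all N positions
    return (c * z).sum(axis=(1, 2)).mean()   # mean over the mini-batch

# Formula from the text: sum over the activations, mean over the mini-batch.
grad1 = c
grad_b = grad1.sum(axis=1).mean(axis=0)

# Centered finite differences, one bias at a time.
eps = 1e-6
numeric = np.array([(loss(b + eps*np.eye(nb_F)[f]) - loss(b - eps*np.eye(nb_F)[f])) / (2*eps)
                    for f in range(nb_F)])
print(np.allclose(grad_b, numeric, atol=1e-6))
```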
<p>It'll be the same for the weights: for a given weight <span class="math">\(w_{i,j}\)</span>, we have to compute all the vec(x)[:,:,i] * grad1[:,:,j] then take the sum over the second axis, and take the
mean over the first axis.</p>
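<p>In numpy, all those products can be computed at once with broadcasting. The sketch below (with arbitrary sizes, and old_x standing for vec(x)) builds them that way, then compares with a plain matrix product of the transposed vec(x) by grad1, which gives the same numbers without creating the big intermediate array. This second version is only an alternative I'm suggesting here.</p>

```python
import numpy as np

mb, N, K, nb_F = 2, 6, 12, 4            # K = ch * k1 * k2, sizes chosen for the example
old_x = np.random.randn(mb, N, K)       # vec(x)
grad1 = np.random.randn(mb, N, nb_F)

# All the products vec(x)[:,:,i] * grad1[:,:,j]: shape (mb, N, K, nb_F).
prods = np.matmul(old_x[:, :, :, None], grad1[:, :, None, :])
grad_w = prods.sum(axis=1).mean(axis=0)

# Alternative: the sum over the positions is a matrix product.
grad_w_alt = np.matmul(np.transpose(old_x, (0, 2, 1)), grad1).mean(axis=0)

print(grad_w.shape, np.allclose(grad_w, grad_w_alt))
```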
<p>Then, we have to reshape the gradient of the loss with respect to <span class="math">\(\hbox{vec}(x)\)</span> with the initial shape of x, which will need another function called vec2arr that we will
code last. With all of this, we can write the full Convolution class:</p>
<pre class="code python literal-block">
<span class="k">class</span> <span class="nc">Convolution</span><span class="p">():</span>
    <span class="k">def</span> <span class="fm">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">nc_in</span><span class="p">,</span> <span class="n">nc_out</span><span class="p">,</span> <span class="n">kernel_size</span><span class="p">,</span> <span class="n">stride</span><span class="o">=</span><span class="mi">2</span><span class="p">,</span><span class="n">padding</span><span class="o">=</span><span class="mi">1</span><span class="p">):</span>
        <span class="bp">self</span><span class="o">.</span><span class="n">kernel_size</span> <span class="o">=</span> <span class="n">kernel_size</span>
        <span class="bp">self</span><span class="o">.</span><span class="n">weights</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">randn</span><span class="p">(</span><span class="n">nc_in</span> <span class="o">*</span> <span class="n">kernel_size</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">*</span> <span class="n">kernel_size</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span> <span class="p">,</span><span class="n">nc_out</span><span class="p">)</span> <span class="o">*</span> <span class="n">np</span><span class="o">.</span><span class="n">sqrt</span><span class="p">(</span><span class="mi">2</span><span class="o">/</span><span class="n">nc_in</span><span class="p">)</span>
        <span class="bp">self</span><span class="o">.</span><span class="n">biases</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">zeros</span><span class="p">(</span><span class="n">nc_out</span><span class="p">)</span>
        <span class="bp">self</span><span class="o">.</span><span class="n">stride</span> <span class="o">=</span> <span class="n">stride</span>
        <span class="bp">self</span><span class="o">.</span><span class="n">padding</span> <span class="o">=</span> <span class="n">padding</span>
    <span class="k">def</span> <span class="nf">forward</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span><span class="n">x</span><span class="p">):</span>
        <span class="n">mb</span><span class="p">,</span> <span class="n">ch</span><span class="p">,</span> <span class="n">n</span><span class="p">,</span> <span class="n">p</span> <span class="o">=</span> <span class="n">x</span><span class="o">.</span><span class="n">shape</span>
        <span class="bp">self</span><span class="o">.</span><span class="n">old_size</span> <span class="o">=</span> <span class="p">(</span><span class="n">n</span><span class="p">,</span><span class="n">p</span><span class="p">)</span>
        <span class="bp">self</span><span class="o">.</span><span class="n">old_x</span> <span class="o">=</span> <span class="n">arr2vec</span><span class="p">(</span><span class="n">x</span><span class="p">,</span><span class="bp">self</span><span class="o">.</span><span class="n">kernel_size</span><span class="p">,</span><span class="bp">self</span><span class="o">.</span><span class="n">stride</span><span class="p">,</span><span class="bp">self</span><span class="o">.</span><span class="n">padding</span><span class="p">)</span>
        <span class="n">y</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">matmul</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">old_x</span><span class="p">,</span> <span class="bp">self</span><span class="o">.</span><span class="n">weights</span><span class="p">)</span> <span class="o">+</span> <span class="bp">self</span><span class="o">.</span><span class="n">biases</span>
        <span class="n">y</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">transpose</span><span class="p">(</span><span class="n">y</span><span class="p">,(</span><span class="mi">0</span><span class="p">,</span><span class="mi">2</span><span class="p">,</span><span class="mi">1</span><span class="p">))</span>
        <span class="n">n1</span> <span class="o">=</span> <span class="p">(</span><span class="n">n</span><span class="o">-</span><span class="bp">self</span><span class="o">.</span><span class="n">kernel_size</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span><span class="o">+</span> <span class="mi">2</span> <span class="o">*</span> <span class="bp">self</span><span class="o">.</span><span class="n">padding</span><span class="p">)</span> <span class="o">//</span><span class="bp">self</span><span class="o">.</span><span class="n">stride</span> <span class="o">+</span> <span class="mi">1</span>
        <span class="n">p1</span> <span class="o">=</span> <span class="p">(</span><span class="n">p</span><span class="o">-</span><span class="bp">self</span><span class="o">.</span><span class="n">kernel_size</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span><span class="o">+</span><span class="mi">2</span> <span class="o">*</span> <span class="bp">self</span><span class="o">.</span><span class="n">padding</span> <span class="p">)</span><span class="o">//</span><span class="bp">self</span><span class="o">.</span><span class="n">stride</span> <span class="o">+</span> <span class="mi">1</span>
        <span class="k">return</span> <span class="n">y</span><span class="o">.</span><span class="n">reshape</span><span class="p">(</span><span class="n">mb</span><span class="p">,</span><span class="bp">self</span><span class="o">.</span><span class="n">biases</span><span class="o">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span><span class="n">n1</span><span class="p">,</span><span class="n">p1</span><span class="p">)</span>
    <span class="k">def</span> <span class="nf">backward</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span><span class="n">grad</span><span class="p">):</span>
        <span class="n">mb</span><span class="p">,</span> <span class="n">ch_out</span><span class="p">,</span> <span class="n">n1</span><span class="p">,</span> <span class="n">p1</span> <span class="o">=</span> <span class="n">grad</span><span class="o">.</span><span class="n">shape</span>
        <span class="n">grad</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">transpose</span><span class="p">(</span><span class="n">grad</span><span class="o">.</span><span class="n">reshape</span><span class="p">(</span><span class="n">mb</span><span class="p">,</span><span class="n">ch_out</span><span class="p">,</span><span class="n">n1</span><span class="o">*</span><span class="n">p1</span><span class="p">),(</span><span class="mi">0</span><span class="p">,</span><span class="mi">2</span><span class="p">,</span><span class="mi">1</span><span class="p">))</span>
        <span class="bp">self</span><span class="o">.</span><span class="n">grad_b</span> <span class="o">=</span> <span class="n">grad</span><span class="o">.</span><span class="n">sum</span><span class="p">(</span><span class="n">axis</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span><span class="o">.</span><span class="n">mean</span><span class="p">(</span><span class="n">axis</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span>
        <span class="bp">self</span><span class="o">.</span><span class="n">grad_w</span> <span class="o">=</span> <span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">matmul</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">old_x</span><span class="p">[:,:,:,</span><span class="bp">None</span><span class="p">],</span><span class="n">grad</span><span class="p">[:,:,</span><span class="bp">None</span><span class="p">,:]))</span><span class="o">.</span><span class="n">sum</span><span class="p">(</span><span class="n">axis</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span><span class="o">.</span><span class="n">mean</span><span class="p">(</span><span class="n">axis</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span>
        <span class="n">new_grad</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">matmul</span><span class="p">(</span><span class="n">grad</span><span class="p">,</span><span class="bp">self</span><span class="o">.</span><span class="n">weights</span><span class="o">.</span><span class="n">transpose</span><span class="p">())</span>
        <span class="k">return</span> <span class="n">vec2arr</span><span class="p">(</span><span class="n">new_grad</span><span class="p">,</span> <span class="bp">self</span><span class="o">.</span><span class="n">kernel_size</span><span class="p">,</span> <span class="bp">self</span><span class="o">.</span><span class="n">old_size</span><span class="p">,</span> <span class="bp">self</span><span class="o">.</span><span class="n">stride</span><span class="p">,</span> <span class="bp">self</span><span class="o">.</span><span class="n">padding</span><span class="p">)</span>
</pre>
<p>The last function takes a vectorized input and has to compute an array associated to it by <em>reversing</em> what arr2vec is doing. It's not just repositioning elements: in the
earlier example, 3 was present three times. The elements on those positions must be summed (the chain rule again) and the result placed in the position where 3 was in the initial
array.</p>
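<p>To see what this summing means on a tiny case, here is a sketch that uses numpy's np.add.at to scatter and add the values back, reusing the same indexes arr2vec builds. This is not the approach followed in the rest of the article, just a compact illustration on one 3 by 3 picture with one channel, a 2 by 2 kernel, no padding and a stride of 1 (all assumptions for the example).</p>

```python
import numpy as np

n1, n2 = 3, 3   # picture size, chosen for the example
k1, k2 = 2, 2   # kernel size

# Indexes used by the vectorization (one row per window).
grid = np.array([j + n2*i for i in range(k1) for j in range(k2)])
start_idx = np.array([j + n2*i for i in range(n1-k1+1) for j in range(n2-k2+1)])
idx = start_idx[:, None] + grid[None, :]

# Pretend every vectorized position received a gradient of 1.
vec_grad = np.ones(idx.shape)

# Send each value back to the pixel it came from, adding up the repeats.
out = np.zeros(n1 * n2)
np.add.at(out, idx, vec_grad)
out = out.reshape(n1, n2)
print(out)
```

<p>With gradients all equal to 1, the result simply counts how many windows passed over each pixel: once for the corners, twice for the sides and four times for the center.</p>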
<p>So that's what we have to do: for each element in our initial array, we have to locate all the spots the arr2vec function put it in, sum the elements we get, and put that in
our result array. To use vectorization, we will create a big array of numbers, with as many rows as there are elements in our input, so <span class="math">\(N = m \times ch \times n_{1} \times n_{2}\)</span>, and as
many columns as necessary. On each row, we will have the positions of where the arr2vec function would have placed that element, so we will just have to take the sum over the
second axis and reshape at the end.</p>
<p>First, let's check how many windows could have passed over the element with coordinates <span class="math">\((i,j)\)</span>. That's all the windows that started at <span class="math">\((i-k1i,j-k2j)\)</span> where <span class="math">\(k1i\)</span>
can go from 0 to <span class="math">\(k_{1}-1\)</span> and <span class="math">\(k2j\)</span> can go from 0 to <span class="math">\(k_{2}-1\)</span>. Of course, this will sometimes give us negative coordinates, or coordinates of windows that
go too far on the right or the bottom of the picture. We'll deal with those with a mask, but first, let's compute them all.</p>
<pre class="code python literal-block">
<span class="n">idx</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">array</span><span class="p">([[[</span><span class="n">i</span><span class="o">-</span><span class="n">k1i</span><span class="p">,</span> <span class="n">j</span><span class="o">-</span><span class="n">k2j</span><span class="p">]</span> <span class="k">for</span> <span class="n">k1i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">k1</span><span class="p">)</span> <span class="k">for</span> <span class="n">k2j</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">k2</span><span class="p">)]</span> <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">n</span><span class="p">)</span> <span class="k">for</span> <span class="n">j</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">p</span><span class="p">)])</span>
<span class="n">in_bounds</span> <span class="o">=</span> <span class="p">(</span><span class="n">idx</span><span class="p">[:,:,</span><span class="mi">0</span><span class="p">]</span> <span class="o">>=</span> <span class="o">-</span><span class="n">padding</span><span class="p">)</span> <span class="o">*</span> <span class="p">(</span><span class="n">idx</span><span class="p">[:,:,</span><span class="mi">0</span><span class="p">]</span> <span class="o"><=</span> <span class="n">n</span><span class="o">-</span><span class="n">k1</span><span class="o">+</span><span class="n">padding</span><span class="p">)</span>
<span class="n">in_bounds</span> <span class="o">*=</span> <span class="p">(</span><span class="n">idx</span><span class="p">[:,:,</span><span class="mi">1</span><span class="p">]</span> <span class="o">>=</span> <span class="o">-</span><span class="n">padding</span><span class="p">)</span> <span class="o">*</span> <span class="p">(</span><span class="n">idx</span><span class="p">[:,:,</span><span class="mi">1</span><span class="p">]</span> <span class="o"><=</span> <span class="n">p</span><span class="o">-</span><span class="n">k2</span><span class="o">+</span><span class="n">padding</span><span class="p">)</span>
</pre>
<p>The second array, in_bounds, is a boolean mask that checks that the corners of our windows are inside the picture, taking the padding into account. We need another mask to take
the stride into account: some of those windows aren't considered if the stride is different from one.</p>
<pre class="code python literal-block">
<span class="n">in_strides</span> <span class="o">=</span> <span class="p">((</span><span class="n">idx</span><span class="p">[:,:,</span><span class="mi">0</span><span class="p">]</span><span class="o">+</span><span class="n">padding</span><span class="p">)</span><span class="o">%</span><span class="n">stride</span><span class="o">==</span><span class="mi">0</span><span class="p">)</span> <span class="o">*</span> <span class="p">((</span><span class="n">idx</span><span class="p">[:,:,</span><span class="mi">1</span><span class="p">]</span><span class="o">+</span><span class="n">padding</span><span class="p">)</span><span class="o">%</span><span class="n">stride</span><span class="o">==</span><span class="mi">0</span><span class="p">)</span>
</pre>
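<p>As a quick sanity check of these two masks, here is a minimal sketch for a single element. The helper name covering_windows is hypothetical (it is not part of the notebook); it assumes the same conventions as above for padding and stride, and returns the top-left corners of the windows that really cover the element.</p>

```python
import numpy as np

def covering_windows(i, j, n, p, k1, k2, stride=1, padding=0):
    """Top-left corners of the windows that pass over element (i, j).

    Hypothetical helper: same masks as in the text, for one element only.
    """
    # Every candidate corner (i - k1i, j - k2j).
    cand = np.array([[i - k1i, j - k2j] for k1i in range(k1) for k2j in range(k2)])
    # Keep the corners whose window fits inside the (padded) picture.
    in_bounds = (cand[:, 0] >= -padding) & (cand[:, 0] <= n - k1 + padding)
    in_bounds &= (cand[:, 1] >= -padding) & (cand[:, 1] <= p - k2 + padding)
    # Keep the corners that the stride really visits.
    in_strides = ((cand[:, 0] + padding) % stride == 0) & ((cand[:, 1] + padding) % stride == 0)
    return cand[in_bounds & in_strides]

# Element (1, 1) of a 4 by 4 picture with a 3 by 3 kernel.
corners = covering_windows(1, 1, n=4, p=4, k1=3, k2=3)
```

<p>With a stride of one, four windows cover the element (1, 1) in this small case; with a stride of two, only the window starting at (0, 0) is left.</p>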
<p>Now we can just convert the pairs of coordinates into indexes, and stack the channels one after the other at the bottom.</p>
<pre class="code python literal-block">
<span class="n">to_take</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">concatenate</span><span class="p">([</span><span class="n">idx</span><span class="p">[:,:,</span><span class="mi">0</span><span class="p">]</span> <span class="o">*</span> <span class="n">k2</span> <span class="o">+</span> <span class="n">idx</span><span class="p">[:,:,</span><span class="mi">1</span><span class="p">]</span> <span class="o">+</span> <span class="n">k1</span><span class="o">*</span><span class="n">k2</span><span class="o">*</span><span class="n">c</span> <span class="k">for</span> <span class="n">c</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">ch</span><span class="p">)],</span> <span class="n">axis</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span>
</pre>
<p>At this stage, each line holds (when it's in bounds and in-stride) the indexes of the elements to pick in each column. For these to correspond to the indexes of the elements in the
flattened array, we have to add to each column a multiple of the number of columns in our input (which I called ftrs).</p>
<pre class="code python literal-block">
<span class="n">to_take</span> <span class="o">=</span> <span class="n">to_take</span> <span class="o">+</span> <span class="n">np</span><span class="o">.</span><span class="n">array</span><span class="p">([</span><span class="n">ftrs</span> <span class="o">*</span> <span class="n">i</span> <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">k1</span><span class="o">*</span><span class="n">k2</span><span class="p">)])</span>
<span class="n">to_take</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">concatenate</span><span class="p">([</span><span class="n">to_take</span> <span class="o">+</span> <span class="n">md</span><span class="o">*</span><span class="n">ftrs</span><span class="o">*</span><span class="n">m</span> <span class="k">for</span> <span class="n">m</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">mb</span><span class="p">)],</span> <span class="n">axis</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span>
</pre>
<p>The second line stacks all the mini-batches along the same dimension. The last thing we have to do is expand our mask to make it the same size.</p>
<pre class="code python literal-block">
<span class="n">in_bounds</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">tile</span><span class="p">(</span><span class="n">in_bounds</span> <span class="o">*</span> <span class="n">in_strides</span><span class="p">,(</span><span class="n">ch</span> <span class="o">*</span> <span class="n">mb</span><span class="p">,</span><span class="mi">1</span><span class="p">))</span>
</pre>
<p>and we're ready to take our inputs and sum them!</p>
<pre class="code python literal-block">
<span class="k">def</span> <span class="nf">vec2arr</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">kernel_size</span><span class="p">,</span> <span class="n">old_shape</span><span class="p">,</span> <span class="n">stride</span><span class="o">=</span><span class="mi">1</span><span class="p">,</span><span class="n">padding</span><span class="o">=</span><span class="mi">0</span><span class="p">):</span>
    <span class="n">k1</span><span class="p">,</span><span class="n">k2</span> <span class="o">=</span> <span class="n">kernel_size</span>
    <span class="n">n</span><span class="p">,</span><span class="n">p</span> <span class="o">=</span> <span class="n">old_shape</span>
    <span class="n">mb</span><span class="p">,</span> <span class="n">md</span><span class="p">,</span> <span class="n">ftrs</span> <span class="o">=</span> <span class="n">x</span><span class="o">.</span><span class="n">shape</span>
    <span class="n">ch</span> <span class="o">=</span> <span class="n">ftrs</span> <span class="o">//</span> <span class="p">(</span><span class="n">k1</span><span class="o">*</span><span class="n">k2</span><span class="p">)</span>
    <span class="n">idx</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">array</span><span class="p">([[[</span><span class="n">i</span><span class="o">-</span><span class="n">k1i</span><span class="p">,</span> <span class="n">j</span><span class="o">-</span><span class="n">k2j</span><span class="p">]</span> <span class="k">for</span> <span class="n">k1i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">k1</span><span class="p">)</span> <span class="k">for</span> <span class="n">k2j</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">k2</span><span class="p">)]</span> <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">n</span><span class="p">)</span> <span class="k">for</span> <span class="n">j</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">p</span><span class="p">)])</span>
    <span class="n">in_bounds</span> <span class="o">=</span> <span class="p">(</span><span class="n">idx</span><span class="p">[:,:,</span><span class="mi">0</span><span class="p">]</span> <span class="o">>=</span> <span class="o">-</span><span class="n">padding</span><span class="p">)</span> <span class="o">*</span> <span class="p">(</span><span class="n">idx</span><span class="p">[:,:,</span><span class="mi">0</span><span class="p">]</span> <span class="o"><=</span> <span class="n">n</span><span class="o">-</span><span class="n">k1</span><span class="o">+</span><span class="n">padding</span><span class="p">)</span>
    <span class="n">in_bounds</span> <span class="o">*=</span> <span class="p">(</span><span class="n">idx</span><span class="p">[:,:,</span><span class="mi">1</span><span class="p">]</span> <span class="o">>=</span> <span class="o">-</span><span class="n">padding</span><span class="p">)</span> <span class="o">*</span> <span class="p">(</span><span class="n">idx</span><span class="p">[:,:,</span><span class="mi">1</span><span class="p">]</span> <span class="o"><=</span> <span class="n">p</span><span class="o">-</span><span class="n">k2</span><span class="o">+</span><span class="n">padding</span><span class="p">)</span>
    <span class="n">in_strides</span> <span class="o">=</span> <span class="p">((</span><span class="n">idx</span><span class="p">[:,:,</span><span class="mi">0</span><span class="p">]</span><span class="o">+</span><span class="n">padding</span><span class="p">)</span><span class="o">%</span><span class="n">stride</span><span class="o">==</span><span class="mi">0</span><span class="p">)</span> <span class="o">*</span> <span class="p">((</span><span class="n">idx</span><span class="p">[:,:,</span><span class="mi">1</span><span class="p">]</span><span class="o">+</span><span class="n">padding</span><span class="p">)</span><span class="o">%</span><span class="n">stride</span><span class="o">==</span><span class="mi">0</span><span class="p">)</span>
    <span class="n">to_take</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">concatenate</span><span class="p">([</span><span class="n">idx</span><span class="p">[:,:,</span><span class="mi">0</span><span class="p">]</span> <span class="o">*</span> <span class="n">k2</span> <span class="o">+</span> <span class="n">idx</span><span class="p">[:,:,</span><span class="mi">1</span><span class="p">]</span> <span class="o">+</span> <span class="n">k1</span><span class="o">*</span><span class="n">k2</span><span class="o">*</span><span class="n">c</span> <span class="k">for</span> <span class="n">c</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">ch</span><span class="p">)],</span> <span class="n">axis</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span>
    <span class="n">to_take</span> <span class="o">=</span> <span class="n">to_take</span> <span class="o">+</span> <span class="n">np</span><span class="o">.</span><span class="n">array</span><span class="p">([</span><span class="n">ftrs</span> <span class="o">*</span> <span class="n">i</span> <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">k1</span><span class="o">*</span><span class="n">k2</span><span class="p">)])</span>
    <span class="n">to_take</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">concatenate</span><span class="p">([</span><span class="n">to_take</span> <span class="o">+</span> <span class="n">md</span><span class="o">*</span><span class="n">ftrs</span><span class="o">*</span><span class="n">m</span> <span class="k">for</span> <span class="n">m</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">mb</span><span class="p">)],</span> <span class="n">axis</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span>
    <span class="n">in_bounds</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">tile</span><span class="p">(</span><span class="n">in_bounds</span> <span class="o">*</span> <span class="n">in_strides</span><span class="p">,(</span><span class="n">ch</span> <span class="o">*</span> <span class="n">mb</span><span class="p">,</span><span class="mi">1</span><span class="p">))</span>
    <span class="k">return</span> <span class="n">np</span><span class="o">.</span><span class="n">where</span><span class="p">(</span><span class="n">in_bounds</span><span class="p">,</span> <span class="n">np</span><span class="o">.</span><span class="n">take</span><span class="p">(</span><span class="n">x</span><span class="p">,</span><span class="n">to_take</span><span class="p">),</span> <span class="mi">0</span><span class="p">)</span><span class="o">.</span><span class="n">sum</span><span class="p">(</span><span class="n">axis</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span><span class="o">.</span><span class="n">reshape</span><span class="p">(</span><span class="n">mb</span><span class="p">,</span><span class="n">ch</span><span class="p">,</span><span class="n">n</span><span class="p">,</span><span class="n">p</span><span class="p">)</span>
</pre>
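<p>The last line relies on three numpy building blocks: np.take with indexes into the flattened array, np.where to zero out the masked positions, and a sum over the second axis. Here is a minimal sketch of that pattern on a toy array; the names rows, cols and mask are only there for illustration and have nothing to do with the shapes used in vec2arr.</p>

```python
import numpy as np

x = np.arange(12).reshape(3, 4)        # toy array with 3 rows and 4 columns

# Pairs (row, col) to pick, laid out as a 2 x 2 grid of positions.
rows = np.array([[0, 2], [1, 1]])
cols = np.array([[1, 3], [0, 2]])

# Row-major flat index: row * number_of_columns + col.
flat = rows * x.shape[1] + cols
picked = np.take(x, flat)              # same values as x[rows, cols]

# Positions to ignore are replaced by zero before summing each line.
mask = np.array([[True, False], [True, True]])
summed = np.where(mask, picked, 0).sum(axis=1)
```

<p>Here picked contains the same values as x[rows, cols], and each entry of summed adds up only the positions that the mask keeps.</p>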
</div>
<script type='text/javascript'>if (!document.getElementById('mathjaxscript_pelican_#%@#$@#')) {
var align = "center",
indent = "0em",
linebreak = "false";
if (false) {
align = (screen.width < 768) ? "left" : align;
indent = (screen.width < 768) ? "0em" : indent;
linebreak = (screen.width < 768) ? 'true' : linebreak;
}
var mathjaxscript = document.createElement('script');
mathjaxscript.id = 'mathjaxscript_pelican_#%@#$@#';
mathjaxscript.type = 'text/javascript';
mathjaxscript.src = 'https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.0/MathJax.js?config=TeX-AMS-MML_HTMLorMML';
mathjaxscript[(window.opera ? "innerHTML" : "text")] =
"MathJax.Hub.Config({" +
" config: ['MMLorHTML.js']," +
" TeX: { extensions: ['AMSmath.js','AMSsymbols.js','noErrors.js','noUndefined.js'], equationNumbers: { autoNumber: 'AMS' } }," +
" jax: ['input/TeX','input/MathML','output/HTML-CSS']," +
" extensions: ['tex2jax.js','mml2jax.js','MathMenu.js','MathZoom.js']," +
" displayAlign: '"+ align +"'," +
" displayIndent: '"+ indent +"'," +
" showMathMenu: true," +
" messageStyle: 'normal'," +
" tex2jax: { " +
" inlineMath: [ ['\\\\(','\\\\)'] ], " +
" displayMath: [ ['$$','$$'] ]," +
" processEscapes: true," +
" preview: 'TeX'," +
" }, " +
" 'HTML-CSS': { " +
" styles: { '.MathJax_Display, .MathJax .mo, .MathJax .mi, .MathJax .mn': {color: 'inherit ! important'} }," +
" linebreaks: { automatic: "+ linebreak +", width: '90% container' }," +
" }, " +
"}); " +
"if ('default' !== 'default') {" +
"MathJax.Hub.Register.StartupHook('HTML-CSS Jax Ready',function () {" +
"var VARIANT = MathJax.OutputJax['HTML-CSS'].FONTDATA.VARIANT;" +
"VARIANT['normal'].fonts.unshift('MathJax_default');" +
"VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
"VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
"VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
"});" +
"MathJax.Hub.Register.StartupHook('SVG Jax Ready',function () {" +
"var VARIANT = MathJax.OutputJax.SVG.FONTDATA.VARIANT;" +
"VARIANT['normal'].fonts.unshift('MathJax_default');" +
"VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
"VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
"VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
"});" +
"}";
(document.body || document.getElementsByTagName('head')[0]).appendChild(mathjaxscript);
}
</script>SGD Variants2018-03-29T16:35:00-04:002018-03-29T16:35:00-04:00Sylvain Guggertag:None,2018-03-29:/sgd-variants.html<p class="first last">Let's get a rapid overview and implementations of the common variants of SGD.</p>
<p>To train our neural net we detailed the algorithm of Stochastic Gradient Descent in <a class="reference external" href="/what-is-deep-learning.html">this article</a> and implemented it in
<a class="reference external" href="/a-simple-neural-net-in-numpy.html">this one</a>. To make it easier for our model to learn, there are a few ways we can improve it.</p>
<div class="section" id="vanilla-sgd">
<h2>Vanilla SGD</h2>
<p>As a reminder, the basic algorithm consists in changing each of our parameters this way</p>
<div class="math">
\begin{equation*}
p_{t} = p_{t-1} - \hbox{lr} \times g_{t-1}
\end{equation*}
</div>
<p>where <span class="math">\(p_{t}\)</span> represents one of our parameters at a given step <span class="math">\(t\)</span> in our loop, <span class="math">\(g_{t}\)</span> the gradient of the loss with respect to <span class="math">\(p_{t}\)</span> and <span class="math">\(\hbox{lr}\)</span>
is a hyperparameter called the learning rate. In a pytorch-like syntax, this can be coded:</p>
<pre class="code python literal-block">
<span class="k">for</span> <span class="n">p</span> <span class="ow">in</span> <span class="n">parameters</span><span class="p">:</span>
    <span class="n">p</span> <span class="o">=</span> <span class="n">p</span> <span class="o">-</span> <span class="n">lr</span> <span class="o">*</span> <span class="n">p</span><span class="o">.</span><span class="n">grad</span>
</pre>
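<p>To see the update run end to end, here is a minimal numpy sketch on a toy problem. The quadratic loss, the grad function and the values of lr and steps are assumptions made for the illustration; they are not taken from the digit classifier. Note that the update is done in place, since rebinding the loop variable would not change the stored parameter.</p>

```python
import numpy as np

def grad(p):
    # Gradient of the toy loss 0.5 * (p[0]**2 + 10 * p[1]**2) (an assumption).
    return np.array([1.0, 10.0]) * p

def sgd(p, lr=0.05, steps=200):
    p = p.copy()
    for _ in range(steps):
        p -= lr * grad(p)    # in-place update of the parameters
    return p

p_final = sgd(np.array([1.0, 1.0]))
```

<p>On this toy loss the minimum is at zero, so p_final should end up very close to it.</p>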
</div>
<div class="section" id="momentum">
<h2>Momentum</h2>
<p>This improvement is based on the observation that with SGD, we don't really manage to follow the line down a steep ravine, but rather bounce from one side to the other.</p>
<img alt="Going up and down the ravine with SGD" class="align-center" src="../images/art3_momentum.png" style="width: 300px;" />
<p>In this case, we would go faster by following the blue line. To do this, we will take some kind of average over the gradients, by updating our parameters like this:</p>
<div class="math">
\begin{align*}
v_{t} &= \beta v_{t-1} + \hbox{lr} \times g_{t} \\
p_{t+1} &= p_{t} - v_{t}
\end{align*}
</div>
<p>where <span class="math">\(\beta\)</span> is a new hyperparameter (often equal to 0.9). More details <a class="reference external" href="https://www.sciencedirect.com/science/article/pii/S0893608098001166?via%3Dihub">here</a>. In code, this would look like:</p>
<pre class="code python literal-block">
<span class="k">for</span> <span class="n">p</span> <span class="ow">in</span> <span class="n">parameters</span><span class="p">:</span>
    <span class="n">v</span><span class="p">[</span><span class="n">p</span><span class="p">]</span> <span class="o">=</span> <span class="n">beta</span> <span class="o">*</span> <span class="n">v</span><span class="p">[</span><span class="n">p</span><span class="p">]</span> <span class="o">+</span> <span class="n">lr</span> <span class="o">*</span> <span class="n">p</span><span class="o">.</span><span class="n">grad</span>
    <span class="n">p</span> <span class="o">=</span> <span class="n">p</span> <span class="o">-</span> <span class="n">v</span><span class="p">[</span><span class="n">p</span><span class="p">]</span>
</pre>
<p>where we would store the values of the various variables v in a dictionary indexed by the parameters.</p>
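<p>Here is a minimal numpy sketch of the momentum update on a toy quadratic loss. The loss, the grad function and the default values are assumptions chosen for the illustration; with a single parameter array, the dictionary of velocities reduces to one array v.</p>

```python
import numpy as np

def grad(p):
    # Gradient of the toy loss 0.5 * (p[0]**2 + 10 * p[1]**2) (an assumption).
    return np.array([1.0, 10.0]) * p

def sgd_momentum(p, lr=0.05, beta=0.9, steps=300):
    p = p.copy()
    v = np.zeros_like(p)                 # one velocity per parameter
    for _ in range(steps):
        v = beta * v + lr * grad(p)      # decayed sum of the past gradients
        p -= v
    return p

p_final = sgd_momentum(np.array([1.0, 1.0]))
```

<p>The velocity starts at zero, so the very first step is the same as a vanilla SGD step.</p>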
</div>
<div class="section" id="nesterov">
<h2>Nesterov</h2>
<p>Nesterov is an improvement on momentum based on the observation that in the momentum variant, when the gradient starts to really change direction (because we have passed our
minimum for instance), it takes a really long time for the averaged values to realize it. In this variant, we first take the jump from <span class="math">\(p_{t}\)</span> to
<span class="math">\(p_{t} - \beta v_{t-1}\)</span> then we compute the gradient. To be more precise, instead of using <span class="math">\(g_{t} = \overrightarrow{\hbox{grad}} \hbox{ loss}(p_{t})\)</span> we use</p>
<div class="math">
\begin{align*}
v_{t} &= \beta v_{t-1} + \hbox{lr} \times \overrightarrow{\hbox{grad}} \hbox{ loss}(p_{t} - \beta v_{t-1}) \\
p_{t+1} &= p_{t} - v_{t}
\end{align*}
</div>
<p>This picture (coming from <a class="reference external" href="http://cs231n.github.io/neural-networks-3/#sgd">this website</a>) explains the difference with momentum</p>
<img alt="Going up and down the ravine with SGD" class="align-center" src="../images/art3_nesterov.jpg" style="width: 600px;" />
<p>In code, we need a function that reevaluates the gradients after this first step.</p>
<pre class="code python literal-block">
<span class="k">for</span> <span class="n">p</span> <span class="ow">in</span> <span class="n">parameters</span><span class="p">:</span>
    <span class="n">p1</span> <span class="o">=</span> <span class="n">p</span> <span class="o">-</span> <span class="n">beta</span> <span class="o">*</span> <span class="n">v</span><span class="p">[</span><span class="n">p</span><span class="p">]</span>
<span class="n">model</span><span class="o">.</span><span class="n">reevaluate_grads</span><span class="p">()</span>
<span class="k">for</span> <span class="n">p</span> <span class="ow">in</span> <span class="n">parameters</span><span class="p">:</span>
    <span class="n">v</span><span class="p">[</span><span class="n">p</span><span class="p">]</span> <span class="o">=</span> <span class="n">beta</span> <span class="o">*</span> <span class="n">v</span><span class="p">[</span><span class="n">p</span><span class="p">]</span> <span class="o">+</span> <span class="n">lr</span> <span class="o">*</span> <span class="n">p</span><span class="o">.</span><span class="n">grad</span>
    <span class="n">p</span> <span class="o">=</span> <span class="n">p</span> <span class="o">-</span> <span class="n">v</span><span class="p">[</span><span class="n">p</span><span class="p">]</span>
</pre>
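<p>When the gradient is available as a plain function, the look-ahead step can be written explicitly. The sketch below is one way to do it in numpy on a toy quadratic loss; the loss, the grad function and the default values are assumptions made for the illustration.</p>

```python
import numpy as np

def grad(p):
    # Gradient of the toy loss 0.5 * (p[0]**2 + 10 * p[1]**2) (an assumption).
    return np.array([1.0, 10.0]) * p

def sgd_nesterov(p, lr=0.05, beta=0.9, steps=300):
    p = p.copy()
    v = np.zeros_like(p)
    for _ in range(steps):
        lookahead = p - beta * v                 # first take the jump
        v = beta * v + lr * grad(lookahead)      # gradient evaluated after the jump
        p -= v
    return p

p_final = sgd_nesterov(np.array([1.0, 1.0]))
```

<p>The only difference with the momentum sketch is the point where the gradient is evaluated: lookahead instead of p.</p>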
</div>
<div class="section" id="rms-prop">
<h2>RMS Prop</h2>
<p>This is another variant of SGD that has been proposed by Geoffrey Hinton in <a class="reference external" href="http://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf">this course</a>. It suggests dividing each gradient by the square root of a moving average of its squared values. More specifically, the update
in this method looks like this:</p>
<div class="math">
\begin{align*}
n_{t} &= \beta n_{t-1} + (1-\beta) g_{t}^{2} \\
p_{t+1} &= p_{t} - \frac{\hbox{lr}}{\sqrt{n_{t}+\epsilon}} g_{t}
\end{align*}
</div>
<p>where <span class="math">\(\beta\)</span> is a new hyperparameter (usually 0.9) and <span class="math">\(\epsilon\)</span> a small value to avoid dividing by zero (usually <span class="math">\(10^{-8}\)</span>). It's easily coded:</p>
<pre class="code python literal-block">
<span class="k">for</span> <span class="n">p</span> <span class="ow">in</span> <span class="n">parameters</span><span class="p">:</span>
    <span class="n">n</span><span class="p">[</span><span class="n">p</span><span class="p">]</span> <span class="o">=</span> <span class="n">beta</span> <span class="o">*</span> <span class="n">n</span><span class="p">[</span><span class="n">p</span><span class="p">]</span> <span class="o">+</span> <span class="p">(</span><span class="mi">1</span><span class="o">-</span><span class="n">beta</span><span class="p">)</span> <span class="o">*</span> <span class="n">p</span><span class="o">.</span><span class="n">grad</span> <span class="o">**</span> <span class="mi">2</span>
    <span class="n">p</span> <span class="o">=</span> <span class="n">p</span> <span class="o">-</span> <span class="n">lr</span> <span class="o">*</span> <span class="n">p</span><span class="o">.</span><span class="n">grad</span> <span class="o">/</span> <span class="n">np</span><span class="o">.</span><span class="n">sqrt</span><span class="p">(</span><span class="n">n</span><span class="p">[</span><span class="n">p</span><span class="p">]</span> <span class="o">+</span> <span class="n">epsilon</span><span class="p">)</span>
</pre>
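<p>A minimal numpy sketch of this update on a toy quadratic loss follows. The loss, the grad function and the default values are assumptions chosen for the illustration; the squares and the division are applied element by element.</p>

```python
import numpy as np

def grad(p):
    # Gradient of the toy loss 0.5 * (p[0]**2 + 10 * p[1]**2) (an assumption).
    return np.array([1.0, 10.0]) * p

def rmsprop(p, lr=0.01, beta=0.9, eps=1e-8, steps=500):
    p = p.copy()
    n = np.zeros_like(p)                     # moving average of the squared gradients
    for _ in range(steps):
        g = grad(p)
        n = beta * n + (1 - beta) * g ** 2
        p -= lr * g / np.sqrt(n + eps)       # element-wise rescaling of the gradient
    return p

p_final = rmsprop(np.array([1.0, 1.0]))
```

<p>Because each coordinate is divided by its own moving average, both coordinates move at a similar pace here even though one gradient is ten times larger than the other. Each step is at most lr / sqrt(1 - beta) in size, so with a constant learning rate the parameters hover near the minimum rather than settle exactly on it.</p>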
</div>
<div class="section" id="adam">
<h2>Adam</h2>
<p>Adam is a mix between RMS Prop and momentum. Here is the <a class="reference external" href="https://arxiv.org/abs/1412.6980">full article</a> explaining it. The update in this method is:</p>
<div class="math">
\begin{align*}
m_{t} &= \beta_{1} m_{t-1} + (1-\beta_{1}) g_{t} \\
n_{t} &= \beta_{2} n_{t-1} + (1-\beta_{2}) g_{t}^{2} \\
\widehat{m_{t}} &= \frac{m_{t}}{1-\beta_{1}^{t}} \\
\widehat{n_{t}} &= \frac{n_{t}}{1-\beta_{2}^{t}} \\
p_{t+1} &= p_{t} - \frac{\hbox{lr}}{\sqrt{\widehat{n_{t}}+\epsilon}} \widehat{m_{t}}
\end{align*}
</div>
<p>where <span class="math">\(\beta_{1}\)</span> and <span class="math">\(\beta_{2}\)</span> are two hyperparameters, the advised values being 0.9 and 0.999. We go from <span class="math">\(m_{t}\)</span> to <span class="math">\(\widehat{m_{t}}\)</span> to have smoother
first values (as we explained when we implemented the <a class="reference external" href="/how-do-you-find-a-good-learning-rate.html">learning rate finder</a>). It's the same for <span class="math">\(n_{t}\)</span> and <span class="math">\(\widehat{n_{t}}\)</span>.</p>
<p>We can code this quite easily:</p>
<pre class="code python literal-block">
<span class="k">for</span> <span class="n">p</span> <span class="ow">in</span> <span class="n">parameters</span><span class="p">:</span>
    <span class="n">m</span><span class="p">[</span><span class="n">p</span><span class="p">]</span> <span class="o">=</span> <span class="n">beta1</span> <span class="o">*</span> <span class="n">m</span><span class="p">[</span><span class="n">p</span><span class="p">]</span> <span class="o">+</span> <span class="p">(</span><span class="mi">1</span><span class="o">-</span><span class="n">beta1</span><span class="p">)</span> <span class="o">*</span> <span class="n">p</span><span class="o">.</span><span class="n">grad</span>
    <span class="n">n</span><span class="p">[</span><span class="n">p</span><span class="p">]</span> <span class="o">=</span> <span class="n">beta2</span> <span class="o">*</span> <span class="n">n</span><span class="p">[</span><span class="n">p</span><span class="p">]</span> <span class="o">+</span> <span class="p">(</span><span class="mi">1</span><span class="o">-</span><span class="n">beta2</span><span class="p">)</span> <span class="o">*</span> <span class="n">p</span><span class="o">.</span><span class="n">grad</span> <span class="o">**</span> <span class="mi">2</span>
    <span class="n">m_hat</span> <span class="o">=</span> <span class="n">m</span><span class="p">[</span><span class="n">p</span><span class="p">]</span> <span class="o">/</span> <span class="p">(</span><span class="mi">1</span><span class="o">-</span><span class="n">beta1</span><span class="o">**</span><span class="n">t</span><span class="p">)</span>
    <span class="n">n_hat</span> <span class="o">=</span> <span class="n">n</span><span class="p">[</span><span class="n">p</span><span class="p">]</span> <span class="o">/</span> <span class="p">(</span><span class="mi">1</span><span class="o">-</span><span class="n">beta2</span><span class="o">**</span><span class="n">t</span><span class="p">)</span>
    <span class="n">p</span> <span class="o">=</span> <span class="n">p</span> <span class="o">-</span> <span class="n">lr</span> <span class="o">*</span> <span class="n">m_hat</span> <span class="o">/</span> <span class="n">np</span><span class="o">.</span><span class="n">sqrt</span><span class="p">(</span><span class="n">n_hat</span> <span class="o">+</span> <span class="n">epsilon</span><span class="p">)</span>
</pre>
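<p>The step counter t has to start at 1 for the two corrections to be defined. Here is a minimal numpy sketch of the whole loop on a toy quadratic loss; the loss, the grad function and the values of lr and steps are assumptions made for the illustration, and epsilon is kept inside the square root as in the formula above.</p>

```python
import numpy as np

def grad(p):
    # Gradient of the toy loss 0.5 * (p[0]**2 + 10 * p[1]**2) (an assumption).
    return np.array([1.0, 10.0]) * p

def adam(p, lr=0.01, beta1=0.9, beta2=0.999, eps=1e-8, steps=1000):
    p = p.copy()
    m = np.zeros_like(p)                     # moving average of the gradients
    n = np.zeros_like(p)                     # moving average of the squared gradients
    for t in range(1, steps + 1):            # t starts at 1 so that 1 - beta**t is not zero
        g = grad(p)
        m = beta1 * m + (1 - beta1) * g
        n = beta2 * n + (1 - beta2) * g ** 2
        m_hat = m / (1 - beta1 ** t)         # correction of the first values
        n_hat = n / (1 - beta2 ** t)
        p -= lr * m_hat / np.sqrt(n_hat + eps)
    return p

p_final = adam(np.array([1.0, 1.0]))
```

<p>At the first step the corrected averages are just the gradient and its square, so each parameter moves by about lr whatever the size of its gradient.</p>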
<p>Both RMS Prop and Adam have the advantage of smoothing the gradients with this moving average of the norm. This way, we can pick a higher learning rate while
avoiding the phenomenon where gradients explode.</p>
</div>
<div class="section" id="in-conclusion">
<h2>In conclusion</h2>
<p>To get a sense of how these methods compare to each other, I've applied each of them to the
<a class="reference external" href="/a-neural-net-in-pytorch.html">digit classifier we built in pytorch</a> and plotted the smoothed loss
(as described when we implemented the <a class="reference external" href="/how-do-you-find-a-good-learning-rate.html">learning rate finder</a>) for each of them.</p>
<img alt="Loss over iterations with all the SGD variants" class="align-center" src="../images/art3_variants.png" style="width: 600px;" />
<p>This is the loss as we go through our mini-batches with the same initial model. All variants do better than vanilla SGD, but that's probably because vanilla SGD needed more time to get to a
better spot. What's interesting is that RMSProp and Adam tend to get the loss really low extremely fast.</p>
</div>
A simple neural net in numpy2018-03-20T16:15:00-04:002018-03-20T16:15:00-04:00Sylvain Guggertag:None,2018-03-20:/a-simple-neural-net-in-numpy.html<p class="first last">Now that we have seen how to build a neural net in pytorch, let's try to take it a step further and try to do the same thing in numpy.</p>
<p>Numpy doesn't have GPU-acceleration, so this is just to force us to understand what's going on behind the scenes, and how to code the things pytorch does automatically.
The main thing we have to dig into is how it computes the gradient of the loss with respect to all the parameters of our neural net. We'll see that despite the fact it seems
a very hard thing to do, calculating the gradients involves roughly the same things (so takes approximately the same time) as computing the outputs of our network from the inputs.</p>
<div class="section" id="back-propagation">
<h2>Back propagation</h2>
<p>If we take the same example as in <a class="reference external" href="/a-neural-net-in-pytorch.html">this article</a>, our neural network has two linear layers, the first activation function being a ReLU and
the last one softmax (or log softmax) and the loss function the Cross Entropy.
If we really wanted to, we could write down the (horrible) formula that gives the loss in terms of our inputs, the theoretical labels and
all the parameters of the neural net, then compute the derivatives with respect to each weight and each bias, and finally, implement the corresponding formulas.</p>
<p>Needless to say that would be painful (though what we'll do still is), and not very helpful in general since each time we change our network, we would have to redo the whole process. There is a smarter way
to go that relies on the chain rule. Basically, our loss function is just a composition of simpler functions, let's say:</p>
<div class="math">
\begin{equation*}
\hbox{loss} = f_{p} \circ f_{p-1} \circ \cdots \circ f_{1}(x)
\end{equation*}
</div>
<p>where <span class="math">\(f_{p}\)</span> would be the Cross Entropy, <span class="math">\(f_{p-1}\)</span> our softmax activation, <span class="math">\(f_{p-2}\)</span> the last linear layer and so on, down to <span class="math">\(f_{1}\)</span>, the first layer applied to the input. Furthermore, let's write</p>
<div class="math">
\begin{equation*}
\left \{ \begin{array}{l} x_{1} = f_{1}(x) \\ x_{2} = f_{2}(f_{1}(x)) \\ \vdots \\ x_{p} = f_{p} \circ f_{p-1} \circ \cdots \circ f_{1}(x) \end{array} \right .
\end{equation*}
</div>
<p>Then our loss is simply <span class="math">\(x_{p}\)</span>. We can compute (almost) easily the derivatives of all the <span class="math">\(f_{i}\)</span> (because they are the simple parts composing our loss function) so
we will start from the very end and go backward until we reach <span class="math">\(x_{0} = x\)</span>. At the end, we have <span class="math">\(x_{p} = f_{p}(x_{p-1})\)</span> so</p>
<div class="math">
\begin{equation*}
\frac{\partial \hbox{loss}}{\partial x_{p-1}} = \frac{\partial f_{p}}{\partial x_{p-1}}(x_{p-1})
\end{equation*}
</div>
<p>Then <span class="math">\(x_{p-1} = f_{p-1}(x_{p-2})\)</span> so by the chain rule</p>
<div class="math">
\begin{equation*}
\frac{\partial \hbox{loss}}{\partial x_{p-2}} = \frac{\partial \hbox{loss}}{\partial x_{p-1}} \times \frac{\partial f_{p-1}}{\partial x_{p-2}}(x_{p-2}) = \frac{\partial f_{p}}{\partial x_{p-1}} \times \frac{\partial f_{p-1}}{\partial x_{p-2}}(x_{p-2})
\end{equation*}
</div>
<p>and so forth. In practice, this gets a tiny bit more complicated when the function <span class="math">\(f_{i}\)</span> depends on more than one variable, but we will study that in
detail when we need it (for the softmax and a linear layer).</p>
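<p>To make the backward walk concrete, here is a minimal sketch on a made-up chain of three scalar functions (they are not the layers of our network, just easy functions to differentiate). It stores every intermediate value in the forward pass, multiplies one local derivative per step going backward, and compares the result with a finite-difference estimate.</p>

```python
import numpy as np

# Hypothetical toy chain, applied in this order: sin, then exp, then square.
funcs = [np.sin, np.exp, np.square]
derivs = [np.cos, np.exp, lambda t: 2 * t]   # derivative of each function

def forward(x):
    """Return every intermediate value of the chain, starting with the input."""
    xs = [x]
    for f in funcs:
        xs.append(f(xs[-1]))
    return xs

def backward(xs):
    """Start from the end and multiply by one local derivative per step."""
    grad = 1.0   # derivative of the loss with respect to itself
    for df, x_prev in zip(reversed(derivs), reversed(xs[:-1])):
        grad = grad * df(x_prev)
    return grad

x0 = 0.3
xs = forward(x0)
analytic = backward(xs)

# Central finite difference on the whole composition, as an independent check
eps = 1e-6
numeric = (forward(x0 + eps)[-1] - forward(x0 - eps)[-1]) / (2 * eps)
print(analytic, numeric)
```

<p>The backward function only needs the values saved during the forward pass, which is exactly why each of our classes will keep a copy of what it received or returned.</p>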
<p>To code this in a flexible manner, and since I need some training in Object-Oriented Programming in Python, we will define each tiny bit of our neural network as a class.
Each one will have a forward method (that gives the result of an input going through that layer or activation function) and a backward method (that will compute the step
in the back propagation of going from after this layer/activation to before). The forward method will receive the output of the previous part of the neural net and return its own output.</p>
<p>The backward method will receive the derivatives of the loss function with respect to the outputs of this layer (passed back by the next part of the network) and will have to compute the derivatives of the loss function
with respect to the inputs it received. If that sounds unclear, reread this after the next paragraph and I hope it'll make more sense.</p>
<p>It seems complicated but it's not that difficult, just a bit of math to bear with. Our hard bits will be the linear layer and the softmax activation, so let's keep them for the end.</p>
</div>
<div class="section" id="activation-function">
<h2>Activation function</h2>
<p>If <span class="math">\(f\)</span> is an activation function, it receives the result of a layer <span class="math">\(x\)</span> and is applied element-wise to compute the output which is <span class="math">\(y = f(x)\)</span>. In
practice, <span class="math">\(x\)</span> is a whole mini-batch of inputs, so it's an array with as many rows as the size of our mini-batch and as many columns as there were neurons in
the previous layer. It's not really important since the function is applied element-wise, so we can safely imagine that <span class="math">\(x\)</span> is just one entry of the array.</p>
<p>The forward pass is straightforward, and the back propagation step isn't really complicated either. If we know the derivative of our loss function with respect to <span class="math">\(y\)</span>, then</p>
<div class="math">
\begin{equation*}
\frac{\partial \hbox{loss}}{\partial x} = \frac{\partial \hbox{loss}}{\partial y} \times \frac{\partial f}{\partial x} = \frac{\partial \hbox{loss}}{\partial y} \times f'(x).
\end{equation*}
</div>
<p>where <span class="math">\(f'\)</span> is the derivative of the function <span class="math">\(f\)</span>. So if we receive a variable named <span class="math">\(\hbox{grad}\)</span> that contains all the derivatives of the loss with respect
to all the <span class="math">\(y\)</span>, the derivatives of the loss with respect to all the <span class="math">\(x\)</span> are simply</p>
<div class="math">
\begin{equation*}
f'(x) \odot \hbox{grad}
\end{equation*}
</div>
<p>where <span class="math">\(f'\)</span> is applied element-wise and <span class="math">\(\odot\)</span> represents the product of the two arrays element-wise.
Note that we will need to know <span class="math">\(x\)</span> when it's time for the back propagation step, so let's save it when we do the forward pass
inside a parameter of our class.</p>
<p>The easiest to implement will be the ReLU activation function. Since we have</p>
<div class="math">
\begin{equation*}
\hbox{ReLU}(x) = \max(0,x) = \left \{ \begin{array}{l} x \hbox{ if } x > 0 \\ 0 \hbox{ otherwise} \end{array} \right .
\end{equation*}
</div>
<p>we can compute its derivative very easily:</p>
<div class="math">
\begin{equation*}
\hbox{ReLU}'(x) = \left \{ \begin{array}{l} 1 \hbox{ if } x > 0 \\ 0 \hbox{ otherwise} \end{array} \right .
\end{equation*}
</div>
<p>Then the ReLU class is easily coded.</p>
<pre class="code python literal-block">
<span class="kn">import</span> <span class="nn">numpy</span> <span class="kn">as</span> <span class="nn">np</span>

<span class="k">class</span> <span class="nc">ReLU</span><span class="p">():</span>
    <span class="k">def</span> <span class="nf">forward</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">x</span><span class="p">):</span>
        <span class="bp">self</span><span class="o">.</span><span class="n">old_x</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">copy</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>
        <span class="k">return</span> <span class="n">np</span><span class="o">.</span><span class="n">clip</span><span class="p">(</span><span class="n">x</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="bp">None</span><span class="p">)</span>
    <span class="k">def</span> <span class="nf">backward</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">grad</span><span class="p">):</span>
        <span class="k">return</span> <span class="n">np</span><span class="o">.</span><span class="n">where</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">old_x</span><span class="o">></span><span class="mi">0</span><span class="p">,</span><span class="n">grad</span><span class="p">,</span><span class="mi">0</span><span class="p">)</span>
</pre>
<p>We just simplified the multiplication between grad and an array of 0 and 1 by the where statement.</p>
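<p>As a quick sanity check, here is a small sketch that runs the class on a made-up mini-batch and compares the where shortcut with the explicit multiplication by the array of 0 and 1. The input and gradient values are arbitrary.</p>

```python
import numpy as np

class ReLU():
    def forward(self, x):
        self.old_x = np.copy(x)          # saved for the backward pass
        return np.clip(x, 0, None)
    def backward(self, grad):
        return np.where(self.old_x > 0, grad, 0)

relu = ReLU()
x = np.array([[-1.0, 2.0], [3.0, -4.0]])        # made-up mini-batch of two rows
grad = np.array([[10.0, 20.0], [30.0, 40.0]])   # made-up upstream gradient

out = relu.forward(x)
back = relu.backward(grad)

# The explicit version: ReLU'(x) is 1 where x > 0 and 0 elsewhere
mask = (x > 0).astype(float)
print(out)
print(back)
```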
<p>We won't need it for our example of neural net, but let's do the sigmoid too. It's defined by</p>
<div class="math">
\begin{equation*}
\sigma(x) = \frac{\mathrm{e}^{x}}{1 + \mathrm{e}^{x}}
\end{equation*}
</div>
<p>and by using the traditional rule to differentiate a quotient,</p>
<div class="math">
\begin{align*}
\sigma'(x) &= \frac{\mathrm{e}^{x}(1+\mathrm{e}^{x}) - \mathrm{e}^{x} \times \mathrm{e}^{x}}{(1+\mathrm{e}^{x})^{2}} = \frac{\mathrm{e}^{x}}{1+\mathrm{e}^{x}} - \frac{\mathrm{e}^{2x}}{(1+\mathrm{e}^{x})^{2}} \\
&= \sigma(x) - \sigma(x)^{2} = \sigma(x) (1 - \sigma(x))
\end{align*}
</div>
<p>Then the sigmoid class is</p>
<pre class="code python literal-block">
<span class="k">class</span> <span class="nc">Sigmoid</span><span class="p">():</span>
    <span class="k">def</span> <span class="nf">forward</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">x</span><span class="p">):</span>
        <span class="bp">self</span><span class="o">.</span><span class="n">old_y</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">exp</span><span class="p">(</span><span class="n">x</span><span class="p">)</span> <span class="o">/</span> <span class="p">(</span><span class="mf">1.</span> <span class="o">+</span> <span class="n">np</span><span class="o">.</span><span class="n">exp</span><span class="p">(</span><span class="n">x</span><span class="p">))</span>
        <span class="k">return</span> <span class="bp">self</span><span class="o">.</span><span class="n">old_y</span>
    <span class="k">def</span> <span class="nf">backward</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">grad</span><span class="p">):</span>
        <span class="k">return</span> <span class="bp">self</span><span class="o">.</span><span class="n">old_y</span> <span class="o">*</span> <span class="p">(</span><span class="mf">1.</span> <span class="o">-</span> <span class="bp">self</span><span class="o">.</span><span class="n">old_y</span><span class="p">)</span> <span class="o">*</span> <span class="n">grad</span>
</pre>
<p>Note that here we store the result of the forward pass and not the old value of x, because that's what we will need for the back propagation step.</p>
<p>The tanh class would be very similar to write.</p>
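<p>For the record, here is what a tanh class could look like. This is a sketch that follows the same pattern as the sigmoid (it isn't used anywhere else in this article): since the derivative of tanh is 1 - tanh², we store the output again, and a finite-difference check confirms the backward step.</p>

```python
import numpy as np

class Tanh():
    def forward(self, x):
        self.old_y = np.tanh(x)          # the output is all we need later
        return self.old_y
    def backward(self, grad):
        # tanh'(x) = 1 - tanh(x)^2
        return (1. - self.old_y ** 2) * grad

x = np.linspace(-2., 2., 5)              # a few arbitrary test points
act = Tanh()
y = act.forward(x)
analytic = act.backward(np.ones_like(x))

# Central finite differences as an independent estimate of the derivative
eps = 1e-6
numeric = (np.tanh(x + eps) - np.tanh(x - eps)) / (2 * eps)
print(np.abs(analytic - numeric).max())
```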
</div>
<div class="section" id="softmax">
<h2>Softmax</h2>
<p>The softmax activation is a bit different since the result depends on all the inputs and it's not just applied to each element. If our input is <span class="math">\(x_{1},\dots,x_{n}\)</span> the
output <span class="math">\(y_{1},\dots,y_{n}\)</span> is defined by</p>
<div class="math">
\begin{equation*}
\hbox{softmax}_{i}(x) = y_{i} = \frac{\mathrm{e}^{x_{i}}}{\mathrm{e}^{x_{1}} + \cdots + \mathrm{e}^{x_{n}}}
\end{equation*}
</div>
<p>When we want to take the derivative of <span class="math">\(y_{i}\)</span>, we have <span class="math">\(n\)</span> different variables with respect to which we can differentiate. We compute</p>
<div class="math">
\begin{align*}
\frac{\partial y_{i}}{\partial x_{i}} &= \frac{\mathrm{e}^{x_{i}}(\mathrm{e}^{x_{1}} + \cdots + \mathrm{e}^{x_{n}}) - \mathrm{e}^{x_{i}} \times \mathrm{e}^{x_{i}}}{(\mathrm{e}^{x_{1}} + \cdots + \mathrm{e}^{x_{n}})^{2}} \\
&= \frac{\mathrm{e}^{x_{i}}}{\mathrm{e}^{x_{1}} + \cdots + \mathrm{e}^{x_{n}}} - \frac{\mathrm{e}^{2x_{i}}}{(\mathrm{e}^{x_{1}} + \cdots + \mathrm{e}^{x_{n}})^{2}} \\
&= y_{i} - y_{i}^{2} = y_{i}(1-y_{i})
\end{align*}
</div>
<p>and if <span class="math">\(j \neq i\)</span></p>
<div class="math">
\begin{align*}
\frac{\partial y_{i}}{\partial x_{j}} &= - \frac{\mathrm{e}^{x_{i}}\mathrm{e}^{x_{j}}}{(\mathrm{e}^{x_{1}} + \cdots + \mathrm{e}^{x_{n}})^{2}} \\
&= - \frac{\mathrm{e}^{x_{i}}}{\mathrm{e}^{x_{1}} + \cdots + \mathrm{e}^{x_{n}}} \times \frac{\mathrm{e}^{x_{j}}}{\mathrm{e}^{x_{1}} + \cdots + \mathrm{e}^{x_{n}}} \\
&= -y_{i} y_{j}
\end{align*}
</div>
<p>Now we will get the derivatives of the loss with respect to the <span class="math">\(y_{i}\)</span> and we will have to compute the derivatives of the loss with respect to the <span class="math">\(x_{j}\)</span>. In this
case, since each <span class="math">\(y_{k}\)</span> depends on the variable <span class="math">\(x_{j}\)</span>, the chain rule is written:</p>
<div class="math">
\begin{align*}
\frac{\partial \hbox{loss}}{\partial x_{j}} &= \sum_{k=1}^{n} \frac{\partial \hbox{loss}}{\partial y_{k}} \times \frac{\partial y_{k}}{\partial x_{j}} \\
&= \frac{\partial \hbox{loss}}{\partial y_{j}} (y_{j}-y_{j}^{2}) - \sum_{k \neq j} y_{k}y_{j} \frac{\partial \hbox{loss}}{\partial y_{k}}\\
&= y_{j} \frac{\partial \hbox{loss}}{\partial y_{j}} - \sum_{k=1}^{n} y_{k}y_{j} \frac{\partial \hbox{loss}}{\partial y_{k}}
\end{align*}
</div>
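<p>If you don't trust the algebra, a numerical check is cheap. The sketch below works on a single made-up vector (no mini-batch), builds the matrix of partial derivatives from the two cases computed above, applies the chain rule, and compares with finite differences. The "loss" used here is just the dot product of the softmax output with an arbitrary vector g, chosen so that its derivatives with respect to the outputs are exactly g.</p>

```python
import numpy as np

def softmax(x):
    """Softmax of a single vector (kept simple on purpose)."""
    e = np.exp(x)
    return e / e.sum()

x = np.array([0.5, -1.0, 2.0])    # made-up input
g = np.array([0.3, -0.7, 1.1])    # made-up derivatives of the loss w.r.t. y
y = softmax(x)
n = len(x)

# jac[i, j] holds the derivative of y_i with respect to x_j
jac = np.empty((n, n))
for i in range(n):
    for j in range(n):
        jac[i, j] = y[i] * (1 - y[i]) if i == j else -y[i] * y[j]

# Chain rule: sum over k of g_k times the derivative of y_k w.r.t. x_j
grad_x = g @ jac

# Central finite differences on the scalar loss g . softmax(x)
eps = 1e-6
basis = np.eye(n)
numeric = np.array([
    (g @ softmax(x + eps * basis[j]) - g @ softmax(x - eps * basis[j])) / (2 * eps)
    for j in range(n)
])
print(grad_x, numeric)
```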
<p>Now when we implement this, we have to remember that x is a mini-batch of inputs. In the forward pass, the sum that we see in the denominator is to be taken over each row
of the exponential of x, which we can do with</p>
<pre class="code python literal-block">
<span class="n">np</span><span class="o">.</span><span class="n">exp</span><span class="p">(</span><span class="n">x</span><span class="p">)</span><span class="o">.</span><span class="n">sum</span><span class="p">(</span><span class="n">axis</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>
</pre>
<p>This array will have a shape of (mb,), where mb is the size of our mini-batch. We want to divide np.exp(x) by this array, but since x has a shape (mb,n), we have to convert
this array into an array of shape (mb,1), otherwise the two won't be broadcastable (numpy tries to add the ones at the beginning of the shape when two arrays don't have
the same number of dimensions). The trick is done with reshape or expand_dims:</p>
<pre class="code python literal-block">
<span class="n">np</span><span class="o">.</span><span class="n">expand_dims</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">exp</span><span class="p">(</span><span class="n">x</span><span class="p">)</span><span class="o">.</span><span class="n">sum</span><span class="p">(</span><span class="n">axis</span><span class="o">=</span><span class="mi">1</span><span class="p">),</span><span class="n">axis</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>
</pre>
<p>Another way that's shorter is to add a None index:</p>
<pre class="code python literal-block">
<span class="n">np</span><span class="o">.</span><span class="n">exp</span><span class="p">(</span><span class="n">x</span><span class="p">)</span><span class="o">.</span><span class="n">sum</span><span class="p">(</span><span class="n">axis</span><span class="o">=</span><span class="mi">1</span><span class="p">)[:,</span><span class="bp">None</span><span class="p">]</span>
</pre>
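<p>Here is a small sketch on a made-up array with mb=2 and n=3 that shows the shapes involved. It also uses keepdims=True, an option of sum that isn't mentioned above but gives the same (mb,1) result directly.</p>

```python
import numpy as np

x = np.arange(6, dtype=float).reshape(2, 3)   # made-up mini-batch: mb=2, n=3
row_sums = np.exp(x).sum(axis=1)              # shape (2,)

# Dividing a (2,3) array by a (2,) array is not broadcastable
try:
    np.exp(x) / row_sums
    failed = False
except ValueError:
    failed = True

a = np.expand_dims(row_sums, axis=1)          # shape (2,1)
b = row_sums[:, None]                         # shape (2,1)
c = np.exp(x).sum(axis=1, keepdims=True)      # shape (2,1), dimension kept by sum

probs = np.exp(x) / b                         # each row now sums to 1
print(row_sums.shape, a.shape, b.shape, c.shape, failed)
```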
<p>For the backward function, we can note that in our formula, <span class="math">\(y_{j}\)</span> can be factored out, so the values we have to compute are the</p>
<div class="math">
\begin{equation*}
y_{j} \left ( g_{j} - \sum_{k=1}^{n} y_{k} g_{k} \right )
\end{equation*}
</div>
<p>where g denotes the gradient of the loss with respect to the <span class="math">\(y_{j}\)</span>. Again, the sum is to be taken over each row (and we have to add a dimension this time as well),
which gives us:</p>
<pre class="code python literal-block">
<span class="k">class</span> <span class="nc">Softmax</span><span class="p">():</span>
    <span class="k">def</span> <span class="nf">forward</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span><span class="n">x</span><span class="p">):</span>
        <span class="bp">self</span><span class="o">.</span><span class="n">old_y</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">exp</span><span class="p">(</span><span class="n">x</span><span class="p">)</span> <span class="o">/</span> <span class="n">np</span><span class="o">.</span><span class="n">exp</span><span class="p">(</span><span class="n">x</span><span class="p">)</span><span class="o">.</span><span class="n">sum</span><span class="p">(</span><span class="n">axis</span><span class="o">=</span><span class="mi">1</span><span class="p">)[:,</span><span class="bp">None</span><span class="p">]</span>
        <span class="k">return</span> <span class="bp">self</span><span class="o">.</span><span class="n">old_y</span>
    <span class="k">def</span> <span class="nf">backward</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span><span class="n">grad</span><span class="p">):</span>
        <span class="k">return</span> <span class="bp">self</span><span class="o">.</span><span class="n">old_y</span> <span class="o">*</span> <span class="p">(</span><span class="n">grad</span> <span class="o">-</span><span class="p">(</span><span class="n">grad</span> <span class="o">*</span> <span class="bp">self</span><span class="o">.</span><span class="n">old_y</span><span class="p">)</span><span class="o">.</span><span class="n">sum</span><span class="p">(</span><span class="n">axis</span><span class="o">=</span><span class="mi">1</span><span class="p">)[:,</span><span class="bp">None</span><span class="p">])</span>
</pre>
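<p>One practical caveat: np.exp overflows for large inputs. A common workaround, which is an addition on my part and not what the class above does, is to subtract the maximum of each row before taking the exponential. The softmax output is unchanged because the common factor cancels between numerator and denominator. A sketch of that variant, checked against the explicit row-by-row matrix of partial derivatives on random data:</p>

```python
import numpy as np

class StableSoftmax():
    """Variant of the Softmax class; the row-max shift is an assumption added here."""
    def forward(self, x):
        shifted = x - x.max(axis=1)[:, None]     # largest entry of each row becomes 0
        e = np.exp(shifted)
        self.old_y = e / e.sum(axis=1)[:, None]
        return self.old_y
    def backward(self, grad):
        # Same formula as before: y_j * (g_j - sum_k y_k g_k), row by row
        return self.old_y * (grad - (grad * self.old_y).sum(axis=1)[:, None])

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 3))      # random mini-batch: mb=4, n=3
g = rng.normal(size=(4, 3))      # random upstream gradient

sm = StableSoftmax()
y = sm.forward(x)
fast = sm.backward(g)

# Slow reference: one explicit matrix diag(y) - y y^T per row
slow = np.stack([g[i] @ (np.diag(y[i]) - np.outer(y[i], y[i])) for i in range(len(x))])

plain = np.exp(x) / np.exp(x).sum(axis=1)[:, None]   # the unshifted formula
big = StableSoftmax().forward(np.array([[1000.0, 1000.0]]))
print(np.abs(fast - slow).max(), big)
```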
</div>
<div class="section" id="cross-entropy-cost">
<h2>Cross Entropy cost</h2>
<p>The cost function is a little different in the sense it takes an output and a target, then returns a single real number. When we apply it to a mini-batch though, we have two arrays
x and y of the same size (mb by n, the number of outputs) which represent a mini-batch of outputs of our network and the targets they should match, and it will return a vector
of size mb.</p>
<p>The cross-entropy cost is the sum of the <span class="math">\(-\ln(x_{i})\)</span> over all the indexes <span class="math">\(i\)</span> for which <span class="math">\(y_{i}\)</span> equals 1. In practice though, just in case our network
returns a value of <span class="math">\(x_{i}\)</span> too close to zero, we clip its value to a minimum of <span class="math">\(10^{-8}\)</span> (usually).</p>
<p>For the backward function, we don't have any old gradients to pass, since this is the first step of computing the derivatives of our loss. In the case of the cross-entropy loss,
those are <span class="math">\(-\frac{1}{x_{i}}\)</span> for each <span class="math">\(i\)</span> for which <span class="math">\(y_{i}\)</span> equals 1, 0 otherwise. Thus we can code:</p>
<pre class="code python literal-block">
<span class="k">class</span> <span class="nc">CrossEntropy</span><span class="p">():</span>
    <span class="k">def</span> <span class="nf">forward</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span><span class="n">x</span><span class="p">,</span><span class="n">y</span><span class="p">):</span>
        <span class="bp">self</span><span class="o">.</span><span class="n">old_x</span> <span class="o">=</span> <span class="n">x</span><span class="o">.</span><span class="n">clip</span><span class="p">(</span><span class="nb">min</span><span class="o">=</span><span class="mf">1e-8</span><span class="p">,</span><span class="nb">max</span><span class="o">=</span><span class="bp">None</span><span class="p">)</span>
        <span class="bp">self</span><span class="o">.</span><span class="n">old_y</span> <span class="o">=</span> <span class="n">y</span>
        <span class="k">return</span> <span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">where</span><span class="p">(</span><span class="n">y</span><span class="o">==</span><span class="mi">1</span><span class="p">,</span><span class="o">-</span><span class="n">np</span><span class="o">.</span><span class="n">log</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">old_x</span><span class="p">),</span> <span class="mi">0</span><span class="p">))</span><span class="o">.</span><span class="n">sum</span><span class="p">(</span><span class="n">axis</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>
    <span class="k">def</span> <span class="nf">backward</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
        <span class="k">return</span> <span class="n">np</span><span class="o">.</span><span class="n">where</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">old_y</span><span class="o">==</span><span class="mi">1</span><span class="p">,</span><span class="o">-</span><span class="mi">1</span><span class="o">/</span><span class="bp">self</span><span class="o">.</span><span class="n">old_x</span><span class="p">,</span> <span class="mi">0</span><span class="p">)</span>
</pre>
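<p>A nice way to test the softmax and cross-entropy formulas together: when the targets are one-hot encoded, chaining the two backward steps should collapse to the well-known result "softmax output minus target". The sketch below checks this on random data. The labels, the sizes and the np.eye trick for one-hot encoding are choices made for this example only.</p>

```python
import numpy as np

rng = np.random.default_rng(0)
logits = rng.normal(size=(5, 4))          # made-up inputs of the softmax: mb=5, n=4
labels = np.array([0, 3, 1, 1, 2])        # made-up class index for each row
targets = np.eye(4)[labels]               # one-hot encoding, shape (5, 4)

# Softmax forward
y = np.exp(logits) / np.exp(logits).sum(axis=1)[:, None]

# Cross-entropy forward and backward, same formulas as the class above
y_clipped = y.clip(min=1e-8, max=None)
loss = np.where(targets == 1, -np.log(y_clipped), 0).sum(axis=1)
grad_y = np.where(targets == 1, -1 / y_clipped, 0)

# Softmax backward applied to the gradient coming from the cost
grad_x = y * (grad_y - (grad_y * y).sum(axis=1)[:, None])

print(loss.shape)
print(np.abs(grad_x - (y - targets)).max())
```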
</div>
<div class="section" id="linear-layer">
<h2>Linear Layer</h2>
<p>We have done everything else, so now is the time to focus on a linear layer. Here we have a few parameters, the weights and the biases. If the layer we consider has <span class="math">\(n_{in}\)</span>
inputs and <span class="math">\(n_{out}\)</span> outputs, the weights are stored in a matrix <span class="math">\(W\)</span> of size <span class="math">\(n_{in} \times n_{out}\)</span> and the bias is a vector <span class="math">\(B\)</span> of size <span class="math">\(n_{out}\)</span>. The output
is given by <span class="math">\(Y = XW + B\)</span> (where <span class="math">\(X\)</span> is the input).</p>
<p>This formula holds for just one vector of inputs or for a mini-batch. In the second case, we just have to think of <span class="math">\(B\)</span> as a matrix with mb rows, all equal to the bias
vector (which is the usual broadcasting in numpy).</p>
<p>The forward pass will be very easy to implement. For the backward pass, not only will we have to compute the gradients of the loss with respect to <span class="math">\(X\)</span> given the gradients
of the loss with respect to <span class="math">\(Y\)</span> (to be able to continue our back propagation), but we will also have to calculate and store the gradients of the loss with respect to all the
weights and biases, since those are the things we will need to do a step in our gradient descent (and the whole reason we are doing this back propagation).</p>
<p>Let's begin with the weights and biases. In terms of coordinates, the formula above can be rewritten</p>
<div class="math">
\begin{equation*}
y_{i} = \sum_{k=1}^{n_{in}} x_{k} w_{k,i} + b_{i}
\end{equation*}
</div>
<p>So we have immediately</p>
<div class="math">
\begin{equation*}
\frac{\partial \hbox{loss}}{\partial b_{i}} = \frac{\partial \hbox{loss}}{\partial y_{i}} \times \frac{\partial y_{i}}{\partial b_{i}} = \frac{\partial \hbox{loss}}{\partial y_{i}}
\end{equation*}
</div>
<p>and</p>
<div class="math">
\begin{equation*}
\frac{\partial \hbox{loss}}{\partial w_{k,i}} = \frac{\partial \hbox{loss}}{\partial y_{i}} \times \frac{\partial y_{i}}{\partial w_{k,i}} = x_{k} \frac{\partial \hbox{loss}}{\partial y_{i}}.
\end{equation*}
</div>
<p>There are no sums here because <span class="math">\(b_{i}\)</span> and <span class="math">\(w_{k,i}\)</span> only appear to define <span class="math">\(y_{i}\)</span>.</p>
<p>If we have the derivatives of the loss with respect to the <span class="math">\(y_{i}\)</span> in a variable called grad, the derivatives of the loss with respect to the biases are in grad, and the
derivatives of the loss with respect to the weights are in the array</p>
<pre class="code python literal-block">
<span class="n">np</span><span class="o">.</span><span class="n">matmul</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">old_x</span><span class="p">[:,:,</span><span class="bp">None</span><span class="p">],</span><span class="n">grad</span><span class="p">[:,</span><span class="bp">None</span><span class="p">,:])</span>
</pre>
<p>Why is that? Since x has a size <span class="math">\((mb,n_{in})\)</span> and grad has a size <span class="math">\((mb,n_{out})\)</span>, we transform those two arrays into tensors with dimensions
<span class="math">\((mb,n_{in},1)\)</span> and <span class="math">\((mb,1,n_{out})\)</span>. That way, the traditional matrix product applied to the last two dimensions will give us, for each element of the mini-batch, the products
of <span class="math">\(x_{k}\)</span> by <span class="math">\(\hbox{grad}_{i}\)</span> for all <span class="math">\(k\)</span> and <span class="math">\(i\)</span>.</p>
<p>As we explained in the <a class="reference external" href="/what-is-deep-learning.html">introduction</a> we average the gradients over the mini-batch to apply our step of the SGD, so we will store the mean
over the first axis of those two arrays.</p>
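<p>The following sketch checks the shapes on random data and shows two equivalent ways to get the averaged weight gradient without building the intermediate 3D tensor. The sizes are arbitrary, and the shortcuts are alternatives I'm suggesting, not what the class below uses.</p>

```python
import numpy as np

rng = np.random.default_rng(0)
mb, n_in, n_out = 8, 5, 3                 # arbitrary sizes for the example
x = rng.normal(size=(mb, n_in))
grad = rng.normal(size=(mb, n_out))

# One (n_in, n_out) outer product per element of the mini-batch
per_sample = np.matmul(x[:, :, None], grad[:, None, :])   # shape (mb, n_in, n_out)
grad_w = per_sample.mean(axis=0)
grad_b = grad.mean(axis=0)

# Same average in one matrix product: the sum over the mini-batch is x^T grad
grad_w_fast = x.T @ grad / mb

# And with einsum, spelling out the indices (b: mini-batch, k: input, i: output)
grad_w_einsum = np.einsum('bk,bi->ki', x, grad) / mb

print(per_sample.shape, grad_w.shape, grad_b.shape)
```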
<p>Then, once this is done, we still need to compute the derivatives of the loss with respect to the <span class="math">\(x_{k}\)</span>. This is given by the formula</p>
<div class="math">
\begin{equation*}
\frac{\partial \hbox{loss}}{\partial x_{k}} = \sum_{i=1}^{n_{out}} \frac{\partial \hbox{loss}}{\partial y_{i}} \times \frac{\partial y_{i}}{\partial x_{k}} = \sum_{i=1}^{n_{out}} \frac{\partial \hbox{loss}}{\partial y_{i}} w_{k,i}.
\end{equation*}
</div>
<p>This can be rewritten as a simple matrix product:</p>
<div class="math">
\begin{equation*}
\hbox{new grad} = (\hbox{old grad}) \times {}^{t}W.
\end{equation*}
</div>
<p>With all of this, we can finally code our own linear class.</p>
<pre class="code python literal-block">
<span class="k">class</span> <span class="nc">Linear</span><span class="p">():</span>
    <span class="k">def</span> <span class="fm">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span><span class="n">n_in</span><span class="p">,</span><span class="n">n_out</span><span class="p">):</span>
        <span class="bp">self</span><span class="o">.</span><span class="n">weights</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">randn</span><span class="p">(</span><span class="n">n_in</span><span class="p">,</span><span class="n">n_out</span><span class="p">)</span> <span class="o">*</span> <span class="n">np</span><span class="o">.</span><span class="n">sqrt</span><span class="p">(</span><span class="mi">2</span><span class="o">/</span><span class="n">n_in</span><span class="p">)</span>
        <span class="bp">self</span><span class="o">.</span><span class="n">biases</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">zeros</span><span class="p">(</span><span class="n">n_out</span><span class="p">)</span>
    <span class="k">def</span> <span class="nf">forward</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">x</span><span class="p">):</span>
        <span class="bp">self</span><span class="o">.</span><span class="n">old_x</span> <span class="o">=</span> <span class="n">x</span>
        <span class="k">return</span> <span class="n">np</span><span class="o">.</span><span class="n">dot</span><span class="p">(</span><span class="n">x</span><span class="p">,</span><span class="bp">self</span><span class="o">.</span><span class="n">weights</span><span class="p">)</span> <span class="o">+</span> <span class="bp">self</span><span class="o">.</span><span class="n">biases</span>
    <span class="k">def</span> <span class="nf">backward</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span><span class="n">grad</span><span class="p">):</span>
        <span class="bp">self</span><span class="o">.</span><span class="n">grad_b</span> <span class="o">=</span> <span class="n">grad</span><span class="o">.</span><span class="n">mean</span><span class="p">(</span><span class="n">axis</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span>
        <span class="bp">self</span><span class="o">.</span><span class="n">grad_w</span> <span class="o">=</span> <span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">matmul</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">old_x</span><span class="p">[:,:,</span><span class="bp">None</span><span class="p">],</span><span class="n">grad</span><span class="p">[:,</span><span class="bp">None</span><span class="p">,:]))</span><span class="o">.</span><span class="n">mean</span><span class="p">(</span><span class="n">axis</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span>
        <span class="k">return</span> <span class="n">np</span><span class="o">.</span><span class="n">dot</span><span class="p">(</span><span class="n">grad</span><span class="p">,</span><span class="bp">self</span><span class="o">.</span><span class="n">weights</span><span class="o">.</span><span class="n">transpose</span><span class="p">())</span>
</pre>
<p>Note that we initialize the weights randomly as we create the network. The rule usually used for this is explained <a class="reference external" href="https://arxiv-web3.library.cornell.edu/abs/1502.01852">here</a>; I may write another article detailing the reasoning behind it later. The biases are initialized to zero.</p>
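<p>Before trusting a backward method like this one, it's worth comparing it with finite differences. The sketch below copies the Linear class and uses a made-up loss, the mean over the mini-batch of the sum of y times an arbitrary array c, so that the gradient passed to backward is simply c. Everything besides the class itself (sizes, seed, loss) is an assumption made for the check.</p>

```python
import numpy as np

class Linear():
    def __init__(self, n_in, n_out):
        self.weights = np.random.randn(n_in, n_out) * np.sqrt(2 / n_in)
        self.biases = np.zeros(n_out)
    def forward(self, x):
        self.old_x = x
        return np.dot(x, self.weights) + self.biases
    def backward(self, grad):
        self.grad_b = grad.mean(axis=0)
        self.grad_w = (np.matmul(self.old_x[:, :, None], grad[:, None, :])).mean(axis=0)
        return np.dot(grad, self.weights.transpose())

np.random.seed(0)
layer = Linear(4, 3)
x = np.random.randn(6, 4)        # made-up mini-batch: mb=6
c = np.random.randn(6, 3)        # made-up coefficients defining the toy loss

def mean_loss():
    """Toy loss: per-sample sum of y * c, averaged over the mini-batch."""
    return (layer.forward(x) * c).sum(axis=1).mean()

layer.forward(x)
grad_x = layer.backward(c)       # the derivatives of the per-sample loss w.r.t. y are c

# Finite differences on each weight, one at a time
eps = 1e-6
num_w = np.zeros_like(layer.weights)
for k in range(4):
    for i in range(3):
        layer.weights[k, i] += eps
        plus = mean_loss()
        layer.weights[k, i] -= 2 * eps
        minus = mean_loss()
        layer.weights[k, i] += eps           # restore the original weight
        num_w[k, i] = (plus - minus) / (2 * eps)

print(np.abs(layer.grad_w - num_w).max())
```

<p>Because grad_w and grad_b are averaged over the mini-batch, they match the gradient of the mean loss, while the array returned by backward still has one row per element of the mini-batch.</p>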
<p>We can now group all those layers in a model</p>
<pre class="code python literal-block">
<span class="k">class</span> <span class="nc">Model</span><span class="p">():</span>
    <span class="k">def</span> <span class="fm">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">layers</span><span class="p">,</span> <span class="n">cost</span><span class="p">):</span>
        <span class="bp">self</span><span class="o">.</span><span class="n">layers</span> <span class="o">=</span> <span class="n">layers</span>
        <span class="bp">self</span><span class="o">.</span><span class="n">cost</span> <span class="o">=</span> <span class="n">cost</span>
    <span class="k">def</span> <span class="nf">forward</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span><span class="n">x</span><span class="p">):</span>
        <span class="k">for</span> <span class="n">layer</span> <span class="ow">in</span> <span class="bp">self</span><span class="o">.</span><span class="n">layers</span><span class="p">:</span>
            <span class="n">x</span> <span class="o">=</span> <span class="n">layer</span><span class="o">.</span><span class="n">forward</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>
        <span class="k">return</span> <span class="n">x</span>
    <span class="k">def</span> <span class="nf">loss</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span><span class="n">x</span><span class="p">,</span><span class="n">y</span><span class="p">):</span>
        <span class="k">return</span> <span class="bp">self</span><span class="o">.</span><span class="n">cost</span><span class="o">.</span><span class="n">forward</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">forward</span><span class="p">(</span><span class="n">x</span><span class="p">),</span><span class="n">y</span><span class="p">)</span>
    <span class="k">def</span> <span class="nf">backward</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
        <span class="n">grad</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">cost</span><span class="o">.</span><span class="n">backward</span><span class="p">()</span>
        <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">layers</span><span class="p">)</span><span class="o">-</span><span class="mi">1</span><span class="p">,</span><span class="o">-</span><span class="mi">1</span><span class="p">,</span><span class="o">-</span><span class="mi">1</span><span class="p">):</span>
            <span class="n">grad</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">layers</span><span class="p">[</span><span class="n">i</span><span class="p">]</span><span class="o">.</span><span class="n">backward</span><span class="p">(</span><span class="n">grad</span><span class="p">)</span>
</pre>
<p>We can then create a model that looks like the one we defined for our digit classification like this:</p>
<pre class="code python literal-block">
<span class="n">net</span> <span class="o">=</span> <span class="n">Model</span><span class="p">([</span><span class="n">Linear</span><span class="p">(</span><span class="mi">784</span><span class="p">,</span><span class="mi">100</span><span class="p">),</span> <span class="n">ReLU</span><span class="p">(),</span> <span class="n">Linear</span><span class="p">(</span><span class="mi">100</span><span class="p">,</span><span class="mi">10</span><span class="p">),</span> <span class="n">Softmax</span><span class="p">()],</span> <span class="n">CrossEntropy</span><span class="p">())</span>
</pre>
<p>The training loop would then look something like this:</p>
<pre class="code python literal-block">
<span class="k">def</span> <span class="nf">train</span><span class="p">(</span><span class="n">model</span><span class="p">,</span><span class="n">lr</span><span class="p">,</span><span class="n">nb_epoch</span><span class="p">,</span><span class="n">data</span><span class="p">):</span>
    <span class="k">for</span> <span class="n">epoch</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">nb_epoch</span><span class="p">):</span>
        <span class="n">running_loss</span> <span class="o">=</span> <span class="mf">0.</span>
        <span class="n">num_inputs</span> <span class="o">=</span> <span class="mi">0</span>
        <span class="k">for</span> <span class="n">mini_batch</span> <span class="ow">in</span> <span class="n">data</span><span class="p">:</span>
            <span class="n">inputs</span><span class="p">,</span><span class="n">targets</span> <span class="o">=</span> <span class="n">mini_batch</span>
            <span class="n">num_inputs</span> <span class="o">+=</span> <span class="n">inputs</span><span class="o">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span>
            <span class="c1">#Forward pass + compute loss</span>
            <span class="n">running_loss</span> <span class="o">+=</span> <span class="n">model</span><span class="o">.</span><span class="n">loss</span><span class="p">(</span><span class="n">inputs</span><span class="p">,</span><span class="n">targets</span><span class="p">)</span><span class="o">.</span><span class="n">sum</span><span class="p">()</span>
            <span class="c1">#Back propagation</span>
            <span class="n">model</span><span class="o">.</span><span class="n">backward</span><span class="p">()</span>
            <span class="c1">#Update of the parameters</span>
            <span class="k">for</span> <span class="n">layer</span> <span class="ow">in</span> <span class="n">model</span><span class="o">.</span><span class="n">layers</span><span class="p">:</span>
                <span class="k">if</span> <span class="nb">type</span><span class="p">(</span><span class="n">layer</span><span class="p">)</span> <span class="o">==</span> <span class="n">Linear</span><span class="p">:</span>
                    <span class="n">layer</span><span class="o">.</span><span class="n">weights</span> <span class="o">-=</span> <span class="n">lr</span> <span class="o">*</span> <span class="n">layer</span><span class="o">.</span><span class="n">grad_w</span>
                    <span class="n">layer</span><span class="o">.</span><span class="n">biases</span> <span class="o">-=</span> <span class="n">lr</span> <span class="o">*</span> <span class="n">layer</span><span class="o">.</span><span class="n">grad_b</span>
        <span class="k">print</span><span class="p">(</span><span class="n">f</span><span class="s1">'Epoch {epoch+1}/{nb_epoch}: loss = {running_loss/num_inputs}'</span><span class="p">)</span>
</pre>
<p>To test it, we can use the data from the MNIST dataset loaded in <a class="reference external" href="https://github.com/sgugger/Deep-Learning/blob/master/First%20neural%20net%20in%20pytorch.ipynb">this notebook</a>. We just have to convert all the torch tensors obtained into numpy arrays,
flatten the inputs and, for the targets, replace each label by a one-hot vector, with zeros everywhere and a one at the index of the label (because that's what our loss function needs). Here's an example of how to do this:</p>
<pre class="code python literal-block">
<span class="k">def</span> <span class="nf">load_minibatches</span><span class="p">(</span><span class="n">batch_size</span><span class="o">=</span><span class="mi">64</span><span class="p">):</span>
    <span class="n">tsfms</span> <span class="o">=</span> <span class="n">transforms</span><span class="o">.</span><span class="n">Compose</span><span class="p">([</span><span class="n">transforms</span><span class="o">.</span><span class="n">ToTensor</span><span class="p">(),</span> <span class="n">transforms</span><span class="o">.</span><span class="n">Normalize</span><span class="p">((</span><span class="mf">0.1307</span><span class="p">,),</span> <span class="p">(</span><span class="mf">0.3081</span><span class="p">,))])</span>
    <span class="n">trn_set</span> <span class="o">=</span> <span class="n">datasets</span><span class="o">.</span><span class="n">MNIST</span><span class="p">(</span><span class="s1">'.'</span><span class="p">,</span> <span class="n">train</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span> <span class="n">download</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span> <span class="n">transform</span><span class="o">=</span><span class="n">tsfms</span><span class="p">)</span>
    <span class="n">trn_loader</span> <span class="o">=</span> <span class="n">torch</span><span class="o">.</span><span class="n">utils</span><span class="o">.</span><span class="n">data</span><span class="o">.</span><span class="n">DataLoader</span><span class="p">(</span><span class="n">trn_set</span><span class="p">,</span> <span class="n">batch_size</span><span class="o">=</span><span class="n">batch_size</span><span class="p">,</span> <span class="n">shuffle</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span> <span class="n">num_workers</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span>
    <span class="n">data</span> <span class="o">=</span> <span class="p">[]</span>
    <span class="k">for</span> <span class="n">mb</span> <span class="ow">in</span> <span class="n">trn_loader</span><span class="p">:</span>
        <span class="n">inputs_t</span><span class="p">,</span><span class="n">targets_t</span> <span class="o">=</span> <span class="n">mb</span>
        <span class="n">inputs</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">zeros</span><span class="p">((</span><span class="n">inputs_t</span><span class="o">.</span><span class="n">size</span><span class="p">(</span><span class="mi">0</span><span class="p">),</span><span class="mi">784</span><span class="p">))</span>
        <span class="n">targets</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">zeros</span><span class="p">((</span><span class="n">inputs_t</span><span class="o">.</span><span class="n">size</span><span class="p">(</span><span class="mi">0</span><span class="p">),</span><span class="mi">10</span><span class="p">))</span>
        <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span><span class="n">inputs_t</span><span class="o">.</span><span class="n">size</span><span class="p">(</span><span class="mi">0</span><span class="p">)):</span>
            <span class="n">targets</span><span class="p">[</span><span class="n">i</span><span class="p">,</span><span class="n">targets_t</span><span class="p">[</span><span class="n">i</span><span class="p">]]</span> <span class="o">=</span> <span class="mf">1.</span>
            <span class="k">for</span> <span class="n">j</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span><span class="mi">28</span><span class="p">):</span>
                <span class="k">for</span> <span class="n">k</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span><span class="mi">28</span><span class="p">):</span>
                    <span class="n">inputs</span><span class="p">[</span><span class="n">i</span><span class="p">,</span><span class="n">j</span><span class="o">*</span><span class="mi">28</span><span class="o">+</span><span class="n">k</span><span class="p">]</span> <span class="o">=</span> <span class="n">inputs_t</span><span class="p">[</span><span class="n">i</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="n">j</span><span class="p">,</span><span class="n">k</span><span class="p">]</span>
        <span class="n">data</span><span class="o">.</span><span class="n">append</span><span class="p">((</span><span class="n">inputs</span><span class="p">,</span><span class="n">targets</span><span class="p">))</span>
    <span class="k">return</span> <span class="n">data</span>
</pre>
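<p>The element-by-element loops above are easy to follow but slow. As a rough sketch of an alternative, assuming the mini-batch has already been turned into NumPy arrays (a torch tensor can be converted with its numpy() method), the same flattening and one-hot encoding can be written with array operations. The helper name to_flat_one_hot is hypothetical and is not part of the notebook:</p>

```python
import numpy as np

def to_flat_one_hot(images, labels, num_classes=10):
    """Flatten a batch of images and one-hot encode its labels.

    images: array of shape (batch, 1, 28, 28)
    labels: integer array of shape (batch,)
    """
    images = np.asarray(images, dtype=np.float64)
    labels = np.asarray(labels, dtype=np.int64)
    # Row-major reshape puts pixel (j, k) at index j*28 + k, like the loops above
    inputs = images.reshape(images.shape[0], -1)
    # Start from zeros and put a single 1 in the column given by each label
    targets = np.zeros((labels.shape[0], num_classes))
    targets[np.arange(labels.shape[0]), labels] = 1.
    return inputs, targets
```

<p>Inside the loop over trn_loader, this would be called as to_flat_one_hot(inputs_t.numpy(), targets_t.numpy()).</p>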
<p>It's slower than the pytorch version but at least we can say we've gone through every detail of building a neural net from scratch.</p>
</div>
<script type='text/javascript'>if (!document.getElementById('mathjaxscript_pelican_#%@#$@#')) {
var align = "center",
indent = "0em",
linebreak = "false";
if (false) {
align = (screen.width < 768) ? "left" : align;
indent = (screen.width < 768) ? "0em" : indent;
linebreak = (screen.width < 768) ? 'true' : linebreak;
}
var mathjaxscript = document.createElement('script');
mathjaxscript.id = 'mathjaxscript_pelican_#%@#$@#';
mathjaxscript.type = 'text/javascript';
mathjaxscript.src = 'https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.0/MathJax.js?config=TeX-AMS-MML_HTMLorMML';
mathjaxscript[(window.opera ? "innerHTML" : "text")] =
"MathJax.Hub.Config({" +
" config: ['MMLorHTML.js']," +
" TeX: { extensions: ['AMSmath.js','AMSsymbols.js','noErrors.js','noUndefined.js'], equationNumbers: { autoNumber: 'AMS' } }," +
" jax: ['input/TeX','input/MathML','output/HTML-CSS']," +
" extensions: ['tex2jax.js','mml2jax.js','MathMenu.js','MathZoom.js']," +
" displayAlign: '"+ align +"'," +
" displayIndent: '"+ indent +"'," +
" showMathMenu: true," +
" messageStyle: 'normal'," +
" tex2jax: { " +
" inlineMath: [ ['\\\\(','\\\\)'] ], " +
" displayMath: [ ['$$','$$'] ]," +
" processEscapes: true," +
" preview: 'TeX'," +
" }, " +
" 'HTML-CSS': { " +
" styles: { '.MathJax_Display, .MathJax .mo, .MathJax .mi, .MathJax .mn': {color: 'inherit ! important'} }," +
" linebreaks: { automatic: "+ linebreak +", width: '90% container' }," +
" }, " +
"}); " +
"if ('default' !== 'default') {" +
"MathJax.Hub.Register.StartupHook('HTML-CSS Jax Ready',function () {" +
"var VARIANT = MathJax.OutputJax['HTML-CSS'].FONTDATA.VARIANT;" +
"VARIANT['normal'].fonts.unshift('MathJax_default');" +
"VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
"VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
"VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
"});" +
"MathJax.Hub.Register.StartupHook('SVG Jax Ready',function () {" +
"var VARIANT = MathJax.OutputJax.SVG.FONTDATA.VARIANT;" +
"VARIANT['normal'].fonts.unshift('MathJax_default');" +
"VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
"VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
"VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
"});" +
"}";
(document.body || document.getElementsByTagName('head')[0]).appendChild(mathjaxscript);
}
</script>How Do You Find A Good Learning Rate2018-03-20T16:15:00-04:002018-03-20T16:15:00-04:00Sylvain Guggertag:None,2018-03-20:/how-do-you-find-a-good-learning-rate.html<p class="first last">This is the main hyper-parameter to set when we train a neural net, but how do you determine the best value? Here's a technique to quickly decide on one.</p>
<div class="section" id="the-theory">
<h2>The theory</h2>
<p>How do you decide on a learning rate? If it's too low, your neural net is going to take forever to learn (try to use <span class="math">\(10^{-5}\)</span> instead of <span class="math">\(10^{-2}\)</span> in
<a class="reference external" href="/a-neural-net-in-pytorch.html">the previous article</a> for instance). But if it's too high, each step you take will overshoot the minimum and you'll never get to an acceptable loss.
Worse, a high learning rate could make the loss increase until it reaches NaN.</p>
<p>Why is that? If your gradients are really high, then a high learning rate is going to take you to a spot so far away from the minimum that you will probably be worse off than before
in terms of loss. Even on something as simple as a parabola, see how a high learning rate quickly gets you further and further away from the minimum.</p>
<img alt="" class="align-center" src="../images/art2_explode.png" style="width: 500px;" />
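<p>To make the parabola example concrete, here is a minimal sketch of plain gradient descent on <span class="math">\(f(x) = x^{2}\)</span>. The function name descend and the two learning rates are illustrative choices, not values taken from the figure:</p>

```python
def descend(lr, x0=1.0, steps=5):
    """Run gradient descent on f(x) = x**2 and return the successive positions."""
    xs = [x0]
    for _ in range(steps):
        grad = 2 * xs[-1]              # derivative of x**2
        xs.append(xs[-1] - lr * grad)  # one gradient descent step
    return xs

small = descend(lr=0.1)   # each step multiplies x by 1 - 2*0.1 = 0.8
large = descend(lr=1.1)   # each step multiplies x by 1 - 2*1.1 = -1.2
print([round(x, 3) for x in small])
print([round(x, 3) for x in large])
```

<p>With the small learning rate the positions shrink toward the minimum at 0. With the large one they change sign at every step and grow in absolute value, which is the explosion described above.</p>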
<p>So we have to pick the right value, not too high and not too low. For a long time, it's been a game of trial and error, but in <a class="reference external" href="https://arxiv.org/abs/1506.01186">this article</a> another approach is presented. Over an epoch, begin your SGD with a very low learning rate (like <span class="math">\(10^{-8}\)</span>) but change it
(by multiplying it by a certain factor for instance)
at each mini-batch until it reaches a very high value (like 1 or 10). Record the loss at each iteration and, once you're finished, plot those losses against the learning
rate. You'll find something like this:</p>
<img alt="Plot of the loss against the learning rate" class="align-center" src="../images/art2_courbe_lr.png" />
<p>The loss decreases at the beginning, then it stops and starts increasing again, usually extremely quickly. That's because with very low learning rates, we get better and better,
especially since we increase them. Then comes a point where we reach a value that's too high and the phenomenon shown before happens. Looking at this graph, what is the best
learning rate to choose? Not the one corresponding to the minimum.</p>
<p>Why? Well, the learning rate that corresponds to the minimum value is already a bit too high, since we are at the edge between improving and getting all over the place.
We want to go one order of magnitude lower, to a value that's still aggressive (so that we train quickly) but on the safe side of an explosion. In the example described by
the picture above, for instance, we don't want to pick <span class="math">\(10^{-1}\)</span> but rather <span class="math">\(10^{-2}\)</span>.</p>
<p>This method can be applied on top of every variant of SGD, and any kind of network. We just have to go through one epoch (usually less) and record the values of our loss
to get the data for our plot.</p>
</div>
<div class="section" id="in-practice">
<h2>In practice</h2>
<p>How do we code this? Well it's pretty simple when we use the fastai library. As detailed in the <a class="reference external" href="http://course.fast.ai/lessons/lesson1.html">first lesson</a>, if we have built a learner object for our model, we just have
to type</p>
<pre class="code python literal-block">
<span class="n">learner</span><span class="o">.</span><span class="n">lr_find</span><span class="p">()</span>
<span class="n">learner</span><span class="o">.</span><span class="n">sched</span><span class="o">.</span><span class="n">plot</span><span class="p">()</span>
</pre>
<p>and we'll get a picture very similar to the one above. Let's do it ourselves though, to be sure we have understood everything that goes on behind the scenes. It's going to be pretty
easy since we just have to adapt the training loop seen in <a class="reference external" href="/a-neural-net-in-pytorch.html">that article</a>; there are just a few tweaks.</p>
<p>The first one is that we won't really plot the loss of each mini-batch, but some smoother version of it. If we tried to plot the raw loss, we would end up with a graph like
this one:</p>
<img alt="Plot of the loss against the learning rate" class="align-center" src="../images/art2_loss_vs_lr.png" />
<p>Even if we can see a global trend (and that's because I truncated the part where it goes up to infinity on the right), it's not as clear as the previous graph. To smooth those
losses we will take their exponentially weighted averages. It sounds far more complicated than it is, and if you're familiar with the momentum variant of SGD it's exactly
the same. At each step where we get a loss, we define this average loss by</p>
<div class="math">
\begin{equation*}
\hbox{avg loss} = \beta * \hbox{old avg loss} + (1-\beta) * \hbox{loss}
\end{equation*}
</div>
<p>where <span class="math">\(\beta\)</span> is a parameter we get to pick between 0 and 1. This way the average losses will reduce the noise and give us a smoother graph where we'll definitely be able to
see the trend. This also explains why we are <em>too late</em> when we reach the minimum in our first curve: this averaged loss will stay low when our losses start to explode, and
it'll take a bit of time before it starts to really increase.</p>
<p>If you don't see the exponential weighting behind this average, it's because it's hidden in our recursive formula. If our losses are <span class="math">\(l_{0},\dots,l_{n}\)</span> (and the average starts at 0) then the
exponentially weighted loss at a given index <span class="math">\(i\)</span> is</p>
<div class="math">
\begin{align*}
\hbox{avg loss}_{i} &= \beta \hbox{avg loss}_{i-1} + (1-\beta) l_{i} = \beta^{2} \hbox{avg loss}_{i-2} + \beta(1-\beta) l_{i-1} + (1-\beta) l_{i} \\
&= \beta^{3} \hbox{avg loss}_{i-3} + \beta^{2}(1-\beta) l_{i-2} + \beta(1-\beta) l_{i-1} + (1-\beta) l_{i} \\
&\vdots \\
&= (1-\beta) \beta^{i} l_{0} + (1-\beta) \beta^{i-1} l_{1} + \cdots + (1-\beta) \beta l_{i-1} + (1-\beta) l_{i}
\end{align*}
</div>
<p>so the weights are all powers of <span class="math">\(\beta\)</span> multiplied by <span class="math">\(1-\beta\)</span>. If you remember the formula giving the sum of a geometric sequence, the sum of our weights is</p>
<div class="math">
\begin{equation*}
(1-\beta) \beta^{i} + (1-\beta) \beta^{i-1} + \cdots + (1-\beta) \beta + (1-\beta) = (1-\beta) * \frac{1-\beta^{i+1}}{1-\beta} = 1-\beta^{i+1}
\end{equation*}
</div>
<p>so to really be an average, we have to divide our average loss by this factor. In the end, the loss we will plot is</p>
<div class="math">
\begin{equation*}
\hbox{smoothed loss}_{i} = \frac{\hbox{avg loss}_{i}}{1-\beta^{i+1}}.
\end{equation*}
</div>
<p>This doesn't really change a thing when <span class="math">\(i\)</span> is big, because <span class="math">\(\beta^{i+1}\)</span> will be very close to 0. But for the first values of <span class="math">\(i\)</span>, it ensures we get better
results. This is called the bias-corrected version of our average.</p>
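<p>As a small sketch of the two formulas above, the following function computes the bias-corrected smoothed losses from a list of raw losses. The name smooth_losses is only illustrative; the default value of beta is the one used later in this article:</p>

```python
def smooth_losses(losses, beta=0.98):
    """Return the bias-corrected exponentially weighted averages of `losses`."""
    avg_loss = 0.
    smoothed = []
    for i, loss in enumerate(losses):
        # Recursive formula: avg loss = beta * old avg loss + (1-beta) * loss
        avg_loss = beta * avg_loss + (1 - beta) * loss
        # Divide by the sum of the weights, 1 - beta**(i+1)
        smoothed.append(avg_loss / (1 - beta ** (i + 1)))
    return smoothed
```

<p>A quick way to see what the correction does: on a constant sequence of losses, the smoothed values are equal to that constant from the very first step (up to rounding), whereas the uncorrected average would start at only <span class="math">\(1-\beta\)</span> times the loss.</p>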
<p>The next thing we will change in our training loop is that we probably won't need to do one whole epoch: if the loss is starting to explode, we probably don't want to continue.
The criterion that's implemented in the fastai library and that seems to work pretty well is:</p>
<div class="math">
\begin{equation*}
\hbox{current smoothed loss} > 4 \times \hbox{minimum smoothed loss}
\end{equation*}
</div>
<p>Lastly, we need just a tiny bit of math to figure out by how much to multiply our learning rate at each step. If we begin with a learning rate of <span class="math">\(\hbox{lr}_{0}\)</span> and
multiply it at each step by <span class="math">\(q\)</span> then at the <span class="math">\(i\)</span>-th step, our learning rate will be</p>
<div class="math">
\begin{equation*}
\hbox{lr}_{i} = \hbox{lr}_{0} \times q^{i}
\end{equation*}
</div>
<p>Now, we want to figure out <span class="math">\(q\)</span> knowing <span class="math">\(\hbox{lr}_{0}\)</span> and <span class="math">\(\hbox{lr}_{N-1}\)</span> (the final value after <span class="math">\(N\)</span> steps) so we isolate it:</p>
<div class="math">
\begin{equation*}
\hbox{lr}_{N-1} = \hbox{lr}_{0} \times q^{N-1} \quad \Longleftrightarrow \quad q^{N-1} = \frac{\hbox{lr}_{N-1}}{\hbox{lr}_{0}} \quad \Longleftrightarrow \quad q = \left ( \frac{\hbox{lr}_{N-1}}{\hbox{lr}_{0}} \right )^{\frac{1}{N-1}}
\end{equation*}
</div>
<p>Why go through this trouble and not just take learning rates by regularly splitting the interval between our initial value and our final value? We have to remember we will
plot the loss against the logs of the learning rates at the end, and if we take the log of our <span class="math">\(\hbox{lr}_{i}\)</span> we have</p>
<div class="math">
\begin{equation*}
\log(\hbox{lr}_{i}) = \log(\hbox{lr}_{0}) + i \log(q) = \log(\hbox{lr}_{0}) + i\frac{\log(\hbox{lr}_{N-1}) - \log(\hbox{lr}_{0})}{N-1},
\end{equation*}
</div>
<p>which corresponds to regularly splitting the interval between our initial value and our final value... but on a log scale! That way we're sure to have evenly spaced points on our
curve, whereas by taking</p>
<div class="math">
\begin{equation*}
\hbox{lr}_{i} = \hbox{lr}_{0} + i \frac{\hbox{lr}_{N-1} - \hbox{lr}_{0}}{N-1}
\end{equation*}
</div>
<p>we would have had all the points concentrated near the end (since <span class="math">\(\hbox{lr}_{N-1}\)</span> is much bigger than <span class="math">\(\hbox{lr}_{0}\)</span>).</p>
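<p>The formula for <span class="math">\(q\)</span> can be checked numerically. This is a minimal sketch with a hypothetical helper lr_schedule and only 10 steps, so that the numbers stay readable:</p>

```python
import math

def lr_schedule(init_value=1e-8, final_value=10., num_steps=100):
    """Learning rates going from init_value to final_value, evenly spaced on a log scale."""
    # q = (lr_{N-1} / lr_0) ** (1 / (N-1))
    q = (final_value / init_value) ** (1 / (num_steps - 1))
    return [init_value * q ** i for i in range(num_steps)]

lrs = lr_schedule(num_steps=10)
# Gaps between consecutive learning rates on a log10 scale
log_steps = [math.log10(b) - math.log10(a) for a, b in zip(lrs, lrs[1:])]
print(log_steps)
```

<p>All the gaps come out (nearly) equal to <span class="math">\(\log_{10}(q)\)</span>, which is about 1 with these values, so each step moves the learning rate up by one order of magnitude.</p>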
<p>With all of this, we're ready to alter our previous training loop. This all supposes that you've got a neural net defined (in the variable called net), a data loader called
trn_loader, an optimizer and a loss function (called criterion).</p>
<pre class="code python literal-block">
<span class="k">def</span> <span class="nf">find_lr</span><span class="p">(</span><span class="n">init_value</span> <span class="o">=</span> <span class="mf">1e-8</span><span class="p">,</span> <span class="n">final_value</span><span class="o">=</span><span class="mf">10.</span><span class="p">,</span> <span class="n">beta</span> <span class="o">=</span> <span class="mf">0.98</span><span class="p">):</span>
    <span class="n">num</span> <span class="o">=</span> <span class="nb">len</span><span class="p">(</span><span class="n">trn_loader</span><span class="p">)</span><span class="o">-</span><span class="mi">1</span>
    <span class="n">mult</span> <span class="o">=</span> <span class="p">(</span><span class="n">final_value</span> <span class="o">/</span> <span class="n">init_value</span><span class="p">)</span> <span class="o">**</span> <span class="p">(</span><span class="mi">1</span><span class="o">/</span><span class="n">num</span><span class="p">)</span>
    <span class="n">lr</span> <span class="o">=</span> <span class="n">init_value</span>
    <span class="n">optimizer</span><span class="o">.</span><span class="n">param_groups</span><span class="p">[</span><span class="mi">0</span><span class="p">][</span><span class="s1">'lr'</span><span class="p">]</span> <span class="o">=</span> <span class="n">lr</span>
    <span class="n">avg_loss</span> <span class="o">=</span> <span class="mf">0.</span>
    <span class="n">best_loss</span> <span class="o">=</span> <span class="mf">0.</span>
    <span class="n">batch_num</span> <span class="o">=</span> <span class="mi">0</span>
    <span class="n">losses</span> <span class="o">=</span> <span class="p">[]</span>
    <span class="n">log_lrs</span> <span class="o">=</span> <span class="p">[]</span>
    <span class="k">for</span> <span class="n">data</span> <span class="ow">in</span> <span class="n">trn_loader</span><span class="p">:</span>
        <span class="n">batch_num</span> <span class="o">+=</span> <span class="mi">1</span>
        <span class="c1">#As before, get the loss for this mini-batch of inputs/outputs</span>
        <span class="n">inputs</span><span class="p">,</span><span class="n">labels</span> <span class="o">=</span> <span class="n">data</span>
        <span class="n">inputs</span><span class="p">,</span> <span class="n">labels</span> <span class="o">=</span> <span class="n">Variable</span><span class="p">(</span><span class="n">inputs</span><span class="p">),</span> <span class="n">Variable</span><span class="p">(</span><span class="n">labels</span><span class="p">)</span>
        <span class="n">optimizer</span><span class="o">.</span><span class="n">zero_grad</span><span class="p">()</span>
        <span class="n">outputs</span> <span class="o">=</span> <span class="n">net</span><span class="p">(</span><span class="n">inputs</span><span class="p">)</span>
        <span class="n">loss</span> <span class="o">=</span> <span class="n">criterion</span><span class="p">(</span><span class="n">outputs</span><span class="p">,</span> <span class="n">labels</span><span class="p">)</span>
        <span class="c1">#Compute the smoothed loss (loss.item() replaces loss.data[0] since pytorch 0.4)</span>
        <span class="n">avg_loss</span> <span class="o">=</span> <span class="n">beta</span> <span class="o">*</span> <span class="n">avg_loss</span> <span class="o">+</span> <span class="p">(</span><span class="mi">1</span><span class="o">-</span><span class="n">beta</span><span class="p">)</span> <span class="o">*</span><span class="n">loss</span><span class="o">.</span><span class="n">item</span><span class="p">()</span>
        <span class="n">smoothed_loss</span> <span class="o">=</span> <span class="n">avg_loss</span> <span class="o">/</span> <span class="p">(</span><span class="mi">1</span> <span class="o">-</span> <span class="n">beta</span><span class="o">**</span><span class="n">batch_num</span><span class="p">)</span>
        <span class="c1">#Stop if the loss is exploding</span>
        <span class="k">if</span> <span class="n">batch_num</span> <span class="o">></span> <span class="mi">1</span> <span class="ow">and</span> <span class="n">smoothed_loss</span> <span class="o">></span> <span class="mi">4</span> <span class="o">*</span> <span class="n">best_loss</span><span class="p">:</span>
            <span class="k">return</span> <span class="n">log_lrs</span><span class="p">,</span> <span class="n">losses</span>
        <span class="c1">#Record the best loss</span>
        <span class="k">if</span> <span class="n">smoothed_loss</span> <span class="o"><</span> <span class="n">best_loss</span> <span class="ow">or</span> <span class="n">batch_num</span><span class="o">==</span><span class="mi">1</span><span class="p">:</span>
            <span class="n">best_loss</span> <span class="o">=</span> <span class="n">smoothed_loss</span>
        <span class="c1">#Store the values</span>
        <span class="n">losses</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">smoothed_loss</span><span class="p">)</span>
        <span class="n">log_lrs</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">math</span><span class="o">.</span><span class="n">log10</span><span class="p">(</span><span class="n">lr</span><span class="p">))</span>
        <span class="c1">#Do the SGD step</span>
        <span class="n">loss</span><span class="o">.</span><span class="n">backward</span><span class="p">()</span>
        <span class="n">optimizer</span><span class="o">.</span><span class="n">step</span><span class="p">()</span>
        <span class="c1">#Update the lr for the next step</span>
        <span class="n">lr</span> <span class="o">*=</span> <span class="n">mult</span>
        <span class="n">optimizer</span><span class="o">.</span><span class="n">param_groups</span><span class="p">[</span><span class="mi">0</span><span class="p">][</span><span class="s1">'lr'</span><span class="p">]</span> <span class="o">=</span> <span class="n">lr</span>
    <span class="k">return</span> <span class="n">log_lrs</span><span class="p">,</span> <span class="n">losses</span>
</pre>
<p>Note that the learning rate is stored in the dictionary found in optimizer.param_groups. If we go back to our notebook with the MNIST data set, we can then define our
neural net, an optimizer and the loss function.</p>
<pre class="code python literal-block">
<span class="n">net</span> <span class="o">=</span> <span class="n">SimpleNeuralNet</span><span class="p">(</span><span class="mi">28</span><span class="o">*</span><span class="mi">28</span><span class="p">,</span><span class="mi">100</span><span class="p">,</span><span class="mi">10</span><span class="p">)</span>
<span class="n">optimizer</span> <span class="o">=</span> <span class="n">optim</span><span class="o">.</span><span class="n">SGD</span><span class="p">(</span><span class="n">net</span><span class="o">.</span><span class="n">parameters</span><span class="p">(),</span><span class="n">lr</span><span class="o">=</span><span class="mf">1e-1</span><span class="p">)</span>
<span class="n">criterion</span> <span class="o">=</span> <span class="n">F</span><span class="o">.</span><span class="n">nll_loss</span>
</pre>
<p>And after this we can call this function to find our learning rate and plot the results.</p>
<pre class="code python literal-block">
<span class="n">logs</span><span class="p">,</span><span class="n">losses</span> <span class="o">=</span> <span class="n">find_lr</span><span class="p">()</span>
<span class="n">plt</span><span class="o">.</span><span class="n">plot</span><span class="p">(</span><span class="n">logs</span><span class="p">[</span><span class="mi">10</span><span class="p">:</span><span class="o">-</span><span class="mi">5</span><span class="p">],</span><span class="n">losses</span><span class="p">[</span><span class="mi">10</span><span class="p">:</span><span class="o">-</span><span class="mi">5</span><span class="p">])</span>
</pre>
<p>The skip of the first 10 values and the last 5 is another thing that the fastai library does by default, to remove the initial and final high losses and focus on the interesting
parts of the graph. I added all of this at the end of the previous notebook, and you can find it <a class="reference external" href="https://github.com/sgugger/Deep-Learning/blob/master/Learning%20rate%20finder.ipynb">here</a>.</p>
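<p>The rule of thumb from the theory section (take the learning rate one order of magnitude below the one that gives the minimum loss) can be turned into a few lines of code. This is only a rough sketch with a hypothetical helper suggest_lr, and looking at the plot remains the more reliable way to choose:</p>

```python
def suggest_lr(log_lrs, losses):
    """Return a learning rate one order of magnitude below the one with the lowest loss."""
    # Index of the smallest recorded (smoothed) loss
    best = min(range(len(losses)), key=lambda i: losses[i])
    # log_lrs holds base-10 logs, so subtracting 1 divides the learning rate by 10
    return 10 ** (log_lrs[best] - 1)
```

<p>It takes the two lists returned by find_lr, for instance suggest_lr(logs, losses).</p>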
<p>This code modifies the neural net and its optimizer, so we have to be careful to reinitialize those after doing this, with the learning rate set to the best value we found. An improvement to the code
would be to save the initial state and reload it when we're done (which is what the fastai library does).</p>
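<p>Here is one possible sketch of that save-and-reload idea. It is not how the fastai library implements it. It only assumes that the net and the optimizer expose state_dict() and load_state_dict(), as pytorch modules and optimizers do, and the helper name run_and_restore is hypothetical:</p>

```python
import copy

def run_and_restore(net, optimizer, experiment):
    """Run `experiment()` and then put `net` and `optimizer` back in their initial state."""
    # state_dict() returns references to the live values, so we need deep copies
    net_state = copy.deepcopy(net.state_dict())
    opt_state = copy.deepcopy(optimizer.state_dict())
    try:
        return experiment()
    finally:
        # Restore the saved state even if the experiment raised an error
        net.load_state_dict(net_state)
        optimizer.load_state_dict(opt_state)
```

<p>With the objects defined above, the learning rate finder would then be called as logs, losses = run_and_restore(net, optimizer, find_lr).</p>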
</div>
A Neural Net In Pytorch2018-03-16T10:13:00-04:002018-03-16T10:13:00-04:00Sylvain Guggertag:None,2018-03-16:/a-neural-net-in-pytorch.html<p class="first last">The theory is all really nice, but let's actually build a neural net and train it! We'll see how a simple neural net with one hidden layer can learn to recognize digits very efficiently.</p>
<p>This article goes with <a class="reference external" href="https://github.com/sgugger/Deep-Learning/blob/master/First%20neural%20net%20in%20pytorch.ipynb">this notebook</a> if you want to run the experiment yourself.
In particular, I won't explain the specifics of getting the data and preprocessing it here.</p>
<div class="section" id="pytorch">
<h2>Pytorch</h2>
<p><a class="reference external" href="http://pytorch.org/">Pytorch</a> is a Python library that provides everything needed to implement Deep Learning easily. In particular, it enables GPU-accelerated computations and provides
automatic differentiation. We have seen why the latter is useful in the <a class="reference external" href="/what-is-deep-learning.html">previous article</a>, and this is the reason why we will never have to worry
about calculating gradients (unless we really want to dig into that).</p>
<p>But why GPUs? As we have seen, Deep Learning is just a succession of linear operations with a few functions applied element-wise in between, and it happens that GPUs are really
good (and fast!) at those, because that's basically what is needed to decide which color each pixel of the screen should have when playing a game. Thanks to the gaming industry,
research on GPUs has made them extremely efficient, which is also why Deep Learning has become better in a lot of different areas. We can consider deeper networks and train
them on much more data nowadays.</p>
<p>To use the full potential of this library, we're going to need at least one efficient GPU, preferably several. A gaming computer can have one, but the best way is to rent some. Services
to rent GPUs by the hour have flourished and you can easily find some powerful virtual machines with efficient GPUs for less than fifty cents an hour. I'm personally using
Paperspace at the moment.</p>
<p>I'm mostly using pytorch because the library of fast.ai is built on top of it, but I really like the way it uses Python functionalities (as we'll see, it makes good use of
Object Oriented Programming in Python) and the fact that the gradients are dynamically computed. It makes the implementation of Recurrent Neural Networks a lot easier in my
opinion, but we'll see more of that later.</p>
</div>
<div class="section" id="mnist-dataset">
<h2>MNIST Dataset</h2>
<p>To have some data on which to try our neural net, we will use the MNIST Dataset. It's a set of hand-written digits that contains 70,000 pictures with their labels. It's divided
in two parts, one training set with 60,000 digits (on which we will train our model) and 10,000 others that form the test set. These were drawn by different people from the ones
in the training set, and by evaluating how well our model does on this set, we will see how well it actually generalizes what it learned.</p>
<p>We'll skip the part as to how to get those sets and how to treat them since it's all shown in the notebook. Let's go to the part where we define our neural net instead. The
pictures we are given have a size of 28 by 28 pixels, each pixel having a value of 0 (white) to 1 (black), so that makes 784 inputs. For this simple model, we choose one
hidden layer of 100 neurons, and then an output size of 10 since we have ten different digits.</p>
<p>Why 10 and not 1? It's true that in this case we could have asked for just one output going from 0 to 9 (and there are ways to make sure it'd behave like this) but in
image classification problems, we often give as many outputs as there are classes to determine. What if our next problem is to say if the picture is of a dog, a cat, a frog or
a horse? One output won't really represent this, whereas four outputs will certainly help, each of them representing the probability it's in a given class.</p>
</div>
<div class="section" id="softmax">
<h2>Softmax</h2>
<p>When we have a classification problem and a neural network trying to solve it with <span class="math">\(N\)</span> outputs (the number of classes), we would like those outputs to represent the probabilities
the input is in each of the classes. To make sure that our final <span class="math">\(N\)</span> numbers are all positive and add up to one, we use the softmax activation for the last layer.</p>
<p>If <span class="math">\(z_{1},\dots,z_{N}\)</span> are the last activations given by our final linear layer, instead of pushing them through a ReLU or a sigmoid, we define the outputs
<span class="math">\(y_{1},\dots,y_{N}\)</span> by</p>
<div class="math">
\begin{equation*}
y_{i} = \frac{\mathrm{e}^{z_{i}}}{\mathrm{e}^{z_{1}} + \cdots + \mathrm{e}^{z_{N}}} = \frac{\mathrm{e}^{z_{i}}}{\sum_{k=1}^{N} \mathrm{e}^{z_{k}}}.
\end{equation*}
</div>
<p>As we take the exponentials of the <span class="math">\(z_{i}\)</span>, we are sure all of them are positive. Then since we divide by their sum, they must all add up to one, so softmax satisfies
all the prerequisites we wanted for our final output.</p>
<p>One nice side effect (and which is the reason we chose the exponential) is that if one of the <span class="math">\(z_{i}\)</span> is bigger than the others, its exponential will be a lot
bigger. This will have the effect that the corresponding <span class="math">\(y_{i}\)</span> will be pushed toward 1, while the other <span class="math">\(y_{j}\)</span> are pushed toward zero. Softmax is an activation that really
<em>wants</em> to pick one class over the others.</p>
<p>It's not essential, and a neural net could certainly learn with ReLU or sigmoid as its final activation function, but by using softmax we are making it easier for it to have
an output that is close to what we really want, so it will learn faster and generalize better.</p>
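<p>To make the formula concrete, here is a minimal sketch of softmax in plain Python. The function name and the example scores are only illustrative, and subtracting the maximum before taking the exponentials is a common stability trick that isn't part of the formula above (it cancels out in the ratio).</p>

```python
import math

def softmax(zs):
    """Turn a list of raw scores into probabilities that add up to one."""
    # Subtracting the maximum does not change the result (it cancels out in
    # the ratio) but it keeps the exponentials from overflowing.
    m = max(zs)
    exps = [math.exp(z - m) for z in zs]
    total = sum(exps)
    return [e / total for e in exps]

# Three illustrative scores: the last one is clearly the biggest.
probs = softmax([1.0, 2.0, 5.0])
print(probs)       # every entry is positive, the last one dominates
print(sum(probs))  # adds up to one (up to rounding)
```

<p>With these scores, the biggest input gets most of the probability mass, which is the "picking one class" behavior described above.</p>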
</div>
<div class="section" id="cross-entropy">
<h2>Cross Entropy</h2>
<p>To evaluate how badly our model is doing, we saw the Mean Squared Error loss in the last article. When the output activation function is softmax or a sigmoid, another
function is usually used, called Cross Entropy Loss. If the correct class our model should pick is the <span class="math">\(i\)</span>-th, we define the loss as being <span class="math">\(-\ln(y_{i})\)</span> when
the output is <span class="math">\((y_{1},\dots,y_{N})\)</span>.</p>
<p>Since all the <span class="math">\(y_{i}\)</span> are between 0 and 1, this loss is never negative, and it vanishes when <span class="math">\(y_{i} = 1\)</span>. If <span class="math">\(y_{i}\)</span> is really low though (meaning our model
gives little probability to the correct class), it'll get particularly high.</p>
<p>If we had multiple correct answers (in a multi-label classification problem) we would sum the <span class="math">\(-\ln(y_{i})\)</span> over all the correct classes <span class="math">\(i\)</span>.</p>
<p>Note that with the usual formulas, we have</p>
<div class="math">
\begin{equation*}
\ln(y_{i}) = \ln \left ( \frac{\mathrm{e}^{z_{i}}}{\sum_{k=1}^{N} \mathrm{e}^{z_{k}}} \right ) = \ln(\mathrm{e}^{z_{i}}) - \ln \left ( \sum_{k=1}^{N} \mathrm{e}^{z_{k}} \right ) = z_{i} - \ln \left ( \sum_{k=1}^{N} \mathrm{e}^{z_{k}} \right ).
\end{equation*}
</div>
<p>so the derivative of the loss with respect to <span class="math">\(z_{i}\)</span> is</p>
<div class="math">
\begin{equation*}
\frac{\partial \hbox{loss}}{\partial z_{i}} = -1 + \frac{\mathrm{e}^{z_{i}}}{\sum_{k=1}^{N} \mathrm{e}^{z_{k}}} = y_{i} - 1
\end{equation*}
</div>
<p>and the derivative of the loss with respect to <span class="math">\(z_{j}\)</span> with <span class="math">\(j \neq i\)</span> is</p>
<div class="math">
\begin{equation*}
\frac{\partial \hbox{loss}}{\partial z_{j}} = \frac{\mathrm{e}^{z_{j}}}{\sum_{k=1}^{N} \mathrm{e}^{z_{k}}} = y_{j}
\end{equation*}
</div>
<p>so it's always <span class="math">\(y_{j} - \hat{y_{j}}\)</span>, where <span class="math">\(\hat{y_{j}}\)</span> is the output we are supposed to obtain (1 for the correct class, 0 for the others). This simplification makes it easier to compute the gradients, and
it also has the advantage of giving a higher gradient when the error is big, whereas with the MSE loss we'd end up with smaller ones, hence learning more slowly.</p>
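<p>If you'd rather check this derivative than take it on faith, here is a small sketch in plain Python that compares the formula with a numerical estimate. The scores, the target class and the step <code>eps</code> are arbitrary values picked for the illustration.</p>

```python
import math

def softmax(zs):
    m = max(zs)  # stability trick, cancels out in the ratio
    exps = [math.exp(z - m) for z in zs]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(zs, target):
    """Loss -ln(y_target), computed from the raw scores zs."""
    return -math.log(softmax(zs)[target])

# Arbitrary scores, with class 0 as the correct one.
zs, target, eps = [0.5, -1.0, 2.0], 0, 1e-6
ys = softmax(zs)

# Gradient given by the formulas: y_j - 1 for the correct class, y_j otherwise.
analytic = [y - (1.0 if j == target else 0.0) for j, y in enumerate(ys)]

# Numerical gradient, estimated with central finite differences.
numeric = []
for j in range(len(zs)):
    up = [z + (eps if k == j else 0.0) for k, z in enumerate(zs)]
    down = [z - (eps if k == j else 0.0) for k, z in enumerate(zs)]
    numeric.append((cross_entropy(up, target) - cross_entropy(down, target)) / (2 * eps))

print(analytic)
print(numeric)  # should agree with the line above to several decimals
```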
<p>In practice, pytorch implemented the computation of log softmax faster than softmax, and since we're using the log of the softmax in our loss function, we'll use
log softmax as the output activation function. The only thing we have to remember is that we'll then receive the logs of the probabilities for our input to be in each class,
which means we'll have to put them through exp if we want to see the actual probabilities.</p>
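<p>Here is a short sketch of that relationship. It assumes a recent version of pytorch, and the scores and labels are made up for the example: we go from log softmax back to probabilities with exp, and check that the negative log likelihood loss applied to the log softmax gives the same number as <code>F.cross_entropy</code> applied to the raw scores.</p>

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
z = torch.randn(4, 10)                # made-up scores: 4 inputs, 10 classes
labels = torch.tensor([3, 0, 7, 1])   # made-up correct classes

log_probs = F.log_softmax(z, dim=-1)  # logs of the probabilities
probs = log_probs.exp()               # the actual probabilities

loss_a = F.nll_loss(log_probs, labels)  # loss computed from the log softmax
loss_b = F.cross_entropy(z, labels)     # same loss computed from the raw scores

print(probs.sum(dim=-1))  # each line adds up to one (up to rounding)
print(loss_a.item(), loss_b.item())
```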
</div>
<div class="section" id="writing-our-model">
<h2>Writing our model</h2>
<p>In what follows we assume the following imports have been made:</p>
<pre class="code python literal-block">
<span class="kn">import</span> <span class="nn">torch</span>
<span class="kn">import</span> <span class="nn">torch.nn</span> <span class="kn">as</span> <span class="nn">nn</span>
<span class="kn">import</span> <span class="nn">torch.nn.functional</span> <span class="kn">as</span> <span class="nn">F</span>
<span class="kn">import</span> <span class="nn">torch.optim</span> <span class="kn">as</span> <span class="nn">optim</span>
<span class="kn">from</span> <span class="nn">torch.autograd</span> <span class="kn">import</span> <span class="n">Variable</span>
</pre>
<p>The first module contains the basic functions of torch, allowing us to build and manipulate tensors, which are the arrays this library handles. The submodule nn contains
all the building blocks we will need for a neural net (like linear layers), and its submodule functional has the functions we will apply in between (like ReLU, softmax...). The aliases are the same as in the
pytorch documentation, and the ones usually used. We'll see what optim and Variable are used for a bit later.</p>
<p>To write our neural net in pytorch, we create a specific kind of nn.Module, which is the generic pytorch class that handles models. To do so, we only have to create a new
subclass of nn.Module:</p>
<pre class="code python literal-block">
<span class="k">class</span> <span class="nc">SimpleNeuralNet</span><span class="p">(</span><span class="n">nn</span><span class="o">.</span><span class="n">Module</span><span class="p">):</span>
</pre>
<p>Then in this class, we have to define two functions: the initialization and the forward pass. In the first function, we create the actual layers, with their weights and biases,
and in the second one, we explain how to compute the output from the input.</p>
<p>In the initialization, we have to remember to initialize the parent class (nn.Module) or we won't be able to use all the properties of those nn.Module, then we just define
our two layers, which can simply be done by using nn.Linear. This is another subclass of nn.Module which represents a classic linear layer. Note that once we have defined
our own custom nn.Module, we can use it inside the definition of another one.</p>
<pre class="code python literal-block">
    <span class="k">def</span> <span class="fm">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span><span class="n">n_in</span><span class="p">,</span><span class="n">n_hidden</span><span class="p">,</span><span class="n">n_out</span><span class="p">):</span>
        <span class="nb">super</span><span class="p">()</span><span class="o">.</span><span class="fm">__init__</span><span class="p">()</span>
        <span class="bp">self</span><span class="o">.</span><span class="n">linear1</span> <span class="o">=</span> <span class="n">nn</span><span class="o">.</span><span class="n">Linear</span><span class="p">(</span><span class="n">n_in</span><span class="p">,</span><span class="n">n_hidden</span><span class="p">)</span>
        <span class="bp">self</span><span class="o">.</span><span class="n">linear2</span> <span class="o">=</span> <span class="n">nn</span><span class="o">.</span><span class="n">Linear</span><span class="p">(</span><span class="n">n_hidden</span><span class="p">,</span><span class="n">n_out</span><span class="p">)</span>
</pre>
<p>The code is pretty straightforward, our linear layers have been automatically initialized by pytorch, with random weights and biases.
For the forward pass, it's almost as easy, there's just one little problem. Our input is going to be a mini-batch of images. Inside pytorch,
it will be stored as a tensor (think array) of size mb by 1 by 28 by 28, where mb is the number we choose for our mini-batch size (64 in the notebook).</p>
<p>Why is that? Well it's faster to compute all the outputs of the mini-batch at the same time. If we remember how a linear layer works, we calculate <span class="math">\(XW + B\)</span> where
<span class="math">\(X\)</span> is the input viewed as a line, <span class="math">\(W\)</span> the weight matrix and <span class="math">\(B\)</span> the vector of biases. Instead of doing this mb times, we can be more efficient and do all
the operations at once, if we replace <span class="math">\(X\)</span> by a matrix, each line being one of the different inputs of the mini-batch: <span class="math">\(X_{1},\dots,X_{\mathrm{mb}}\)</span>. This way, <span class="math">\(XW + B'\)</span>
is going to be a matrix where each line is a vector of outputs, the only trick being to replace <span class="math">\(B\)</span> by a matrix <span class="math">\(B'\)</span> with the same number of lines as <span class="math">\(X\)</span>, repeating <span class="math">\(B\)</span> on each line.</p>
<div class="math">
\begin{equation*}
\left ( \begin{array}{c} X_{1} \\ X_{2} \\ \vdots \\ X_{\mathrm{mb}} \end{array} \right ) \times W + \left ( \begin{array}{c} B \\ B \\ \vdots \\ B \end{array} \right ) = \left ( \begin{array}{c} Y_{1} \\ Y_{2} \\ \vdots \\ Y_{\mathrm{mb}} \end{array} \right )
\end{equation*}
</div>
<p>This process is called vectorization.</p>
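<p>As a quick illustration, here is a sketch that computes the outputs one input at a time and then all at once, and checks that both give the same result. The sizes are tiny made-up values, and it relies on pytorch broadcasting the bias vector over the lines, so we don't even have to build the repeated matrix ourselves.</p>

```python
import torch

torch.manual_seed(0)
mb, n_in, n_out = 3, 4, 2    # tiny made-up sizes
X = torch.randn(mb, n_in)    # one input per line
W = torch.randn(n_in, n_out) # weight matrix
B = torch.randn(n_out)       # vector of biases

# One input at a time: mb separate vector-matrix products.
one_by_one = torch.stack([X[i] @ W + B for i in range(mb)])

# All at once: a single matrix product, B is broadcast on each line.
all_at_once = X @ W + B

print(all_at_once.shape)
print(torch.allclose(one_by_one, all_at_once, atol=1e-6))
```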
<p>So that explains the first dimension in our tensor. The last two are the actual size of the picture (28 by 28 pixels) and pytorch adds a dimension because it knows our input is
an image, and usually images have three channels (for red, green and blue). We have 1 here because the picture is black and white.</p>
<p>Following the logic of this vectorization process, the first linear layer is going to expect a tensor of size mb by 784 (which is the result of 28 * 28), so we have to resize
our input (we usually say flatten). To do so, we use the method view:</p>
<pre class="code python literal-block">
<span class="n">x</span> <span class="o">=</span> <span class="n">x</span><span class="o">.</span><span class="n">view</span><span class="p">(</span><span class="n">x</span><span class="o">.</span><span class="n">size</span><span class="p">(</span><span class="mi">0</span><span class="p">),</span><span class="o">-</span><span class="mi">1</span><span class="p">)</span>
</pre>
<p>In this line, we tell pytorch to transform x into a two-dimensional array, with a first dimension being the same as the first dimension of x (the mini-batch size), and the second, whatever it needs
to be so that it fits the previous shape of x (here 1 * 28 * 28 = 784).</p>
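<p>A tiny example on a fake mini-batch (a tensor of zeros standing in for real images) shows what happens to the shape:</p>

```python
import torch

# A fake mini-batch of 64 black and white images of 28 by 28 pixels.
x = torch.zeros(64, 1, 28, 28)

# Keep the first dimension, let pytorch work out the second one.
flat = x.view(x.size(0), -1)

print(x.shape)     # torch.Size([64, 1, 28, 28])
print(flat.shape)  # torch.Size([64, 784])
```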
<p>Once we have this line, the rest of the forward pass is easy: we apply the first linear layer, a ReLU, the second linear layer, and the log softmax. Note that all the functions
we need are in the F (for nn.functional) library.</p>
<pre class="code python literal-block">
    <span class="k">def</span> <span class="nf">forward</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span><span class="n">x</span><span class="p">):</span>
        <span class="n">x</span> <span class="o">=</span> <span class="n">x</span><span class="o">.</span><span class="n">view</span><span class="p">(</span><span class="n">x</span><span class="o">.</span><span class="n">size</span><span class="p">(</span><span class="mi">0</span><span class="p">),</span><span class="o">-</span><span class="mi">1</span><span class="p">)</span>
        <span class="n">x</span> <span class="o">=</span> <span class="n">F</span><span class="o">.</span><span class="n">relu</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">linear1</span><span class="p">(</span><span class="n">x</span><span class="p">))</span>
        <span class="k">return</span> <span class="n">F</span><span class="o">.</span><span class="n">log_softmax</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">linear2</span><span class="p">(</span><span class="n">x</span><span class="p">),</span> <span class="n">dim</span><span class="o">=-</span><span class="mi">1</span><span class="p">)</span>
</pre>
<p>Then, we just have to create an instance of our model by calling the class with the arguments it needs (here n_in, n_hidden and n_out).</p>
<pre class="code python literal-block">
<span class="n">net</span> <span class="o">=</span> <span class="n">SimpleNeuralNet</span><span class="p">(</span><span class="mi">784</span><span class="p">,</span><span class="mi">100</span><span class="p">,</span><span class="mi">10</span><span class="p">)</span>
</pre>
<p>The only parameter we can choose here is the number of neurons in the hidden layer. I've picked 100 but you can try something else.</p>
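<p>For reference, here are the pieces above put together in one block, followed by a quick sanity check. The check itself is an addition for this illustration: it feeds random pixels (not real MNIST images) to the untrained model, looks at the shape of the output and counts the parameters.</p>

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleNeuralNet(nn.Module):
    def __init__(self, n_in, n_hidden, n_out):
        super().__init__()
        self.linear1 = nn.Linear(n_in, n_hidden)
        self.linear2 = nn.Linear(n_hidden, n_out)

    def forward(self, x):
        x = x.view(x.size(0), -1)       # flatten each image into a line
        x = F.relu(self.linear1(x))     # hidden layer followed by a ReLU
        return F.log_softmax(self.linear2(x), dim=-1)

net = SimpleNeuralNet(784, 100, 10)

# Random pixels standing in for a mini-batch of 64 images.
fake_batch = torch.rand(64, 1, 28, 28)
log_probs = net(fake_batch)

# Weights and biases of both layers: 784*100 + 100 + 100*10 + 10.
n_params = sum(p.numel() for p in net.parameters())

print(log_probs.shape)  # torch.Size([64, 10])
print(n_params)         # 79510
```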
</div>
<div class="section" id="the-training-loop">
<h2>The training loop</h2>
<p>Now that we have our model, we must train it to recognize digits. With a random initialization, we can expect it to have a 10% accuracy at the beginning. But we'll see how
quickly it improves when applying SGD.</p>
<p>The key thing pytorch provides us with is automatic differentiation. This means we won't have to compute the gradients ourselves. There are two little things to keep in mind, though.
The first one is that pytorch must remember how an output was created from an input, to be able to roll back from this definition and calculate the gradients. This is done
through the Variable object. Instead of feeding a tensor to our model, we will wrap it in a Variable.</p>
<pre class="code python literal-block">
<span class="n">x</span> <span class="o">=</span> <span class="n">Variable</span><span class="p">(</span><span class="n">inputs</span><span class="p">,</span> <span class="n">requires_grad</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
</pre>
<p>The new object x still has all the inputs, that we can find in x.data, but this new object has other attributes, one of them being the gradient. If we call the model on x to
get the outputs and feed that in the loss function (with the expected label) we'll be able to get the derivatives of the loss function with respect to x. We told pytorch we would
need them when we typed requires_grad=True.</p>
<pre class="code python literal-block">
<span class="n">outputs</span> <span class="o">=</span> <span class="n">net</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>
<span class="n">loss</span> <span class="o">=</span> <span class="n">F</span><span class="o">.</span><span class="n">nll_loss</span><span class="p">(</span><span class="n">outputs</span><span class="p">,</span><span class="n">Variable</span><span class="p">(</span><span class="n">labels</span><span class="p">))</span>
</pre>
<p>Note that we don't use the Cross Entropy loss function since the outputs are already the logarithms of the softmax, and that the labels must also be wrapped inside a Variable.</p>
<p>Once we have done this, we ask pytorch to compute the gradients of the loss like this:</p>
<pre class="code python literal-block">
<span class="n">loss</span><span class="o">.</span><span class="n">backward</span><span class="p">()</span>
</pre>
<p>and the derivatives of the loss with respect to x for instance, will be in the Variable x.grad (or x.grad.data if we want the values).</p>
<p>The second thing we don't want to forget is that pytorch accumulates the gradients. That means it adds the new gradients to the ones already stored each time we call this backward function. This is why we have
to reinitialize them via x.grad.data.zero_() before we want to calculate new derivatives.</p>
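<p>The accumulation is easy to see on a toy example. This sketch assumes a recent version of pytorch, where a tensor created with <code>requires_grad=True</code> tracks its gradient directly (no Variable wrapper); the function we differentiate, the sum of the squares, is just an arbitrary choice whose gradient is 2x.</p>

```python
import torch

x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)

# First backward pass: the gradient of sum(x^2) is 2 * x.
(x ** 2).sum().backward()
first = x.grad.clone()

# Second backward pass without resetting: the new gradient is added on top.
(x ** 2).sum().backward()
second = x.grad.clone()

# Reset the gradient, then compute it again: we are back to 2 * x.
x.grad.zero_()
(x ** 2).sum().backward()
third = x.grad.clone()

print(first)   # tensor([2., 4., 6.])
print(second)  # tensor([ 4.,  8., 12.])
print(third)   # tensor([2., 4., 6.])
```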
<p>Then, the actual step of the SGD can be done automatically by the use of a pytorch optimizer. We can use the library optim to define one, and will have to pass it the
parameters we want to change at each step (in our case, all the weights and biases in our network) and the learning rate we want to use. Here we define</p>
<pre class="code python literal-block">
<span class="n">optimizer</span> <span class="o">=</span> <span class="n">optim</span><span class="o">.</span><span class="n">SGD</span><span class="p">(</span><span class="n">net</span><span class="o">.</span><span class="n">parameters</span><span class="p">(),</span><span class="n">lr</span><span class="o">=</span><span class="mf">0.01</span><span class="p">)</span>
</pre>
<p>Then we won't need to write the line where we subtract from each parameter the learning rate multiplied by the gradient, this will all be done by calling optimizer.step().
To reinitialize all the gradients of the parameters of our model, we'll just have to type optimizer.zero_grad().</p>
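<p>To see what the optimizer does for us, here is a sketch comparing one step written by hand with one call to optimizer.step(). It assumes a recent version of pytorch and plain SGD (no momentum, no weight decay); the two-coordinate parameter, the loss (sum of the squares) and the learning rate are toy choices for the illustration.</p>

```python
import torch
import torch.optim as optim

lr = 0.1
w_manual = torch.tensor([1.0, -2.0], requires_grad=True)
w_optim = torch.tensor([1.0, -2.0], requires_grad=True)
optimizer = optim.SGD([w_optim], lr=lr)

# Step written by hand: subtract the learning rate times the gradient.
loss = (w_manual ** 2).sum()
loss.backward()
with torch.no_grad():          # the update itself must not be tracked
    w_manual -= lr * w_manual.grad
w_manual.grad.zero_()          # reset the gradient for the next step

# Same step with the optimizer.
optimizer.zero_grad()
loss = (w_optim ** 2).sum()
loss.backward()
optimizer.step()

print(w_manual)  # both parameters moved to 0.8 times their old value
print(w_optim)
```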
<p>Once this is all done, we can write our training loop. It consists, for each epoch, in looping through all the data, computing the outputs of each mini-batch of inputs, comparing
them with their expected labels via the loss function, computing the gradients of the loss function with respect to all the parameters and adjusting them accordingly. We just
add the computation of the accuracy to print how well we are doing at the end of each epoch.</p>
<pre class="code python literal-block">
<span class="k">def</span> <span class="nf">train</span><span class="p">(</span><span class="n">nb_epoch</span><span class="p">):</span>
    <span class="k">for</span> <span class="n">epoch</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">nb_epoch</span><span class="p">):</span>
        <span class="n">running_loss</span> <span class="o">=</span> <span class="mf">0.</span>
        <span class="n">corrects</span> <span class="o">=</span> <span class="mi">0</span>
        <span class="k">print</span><span class="p">(</span><span class="n">f</span><span class="s1">'Epoch {epoch+1}:'</span><span class="p">)</span>
        <span class="k">for</span> <span class="n">data</span> <span class="ow">in</span> <span class="n">trn_loader</span><span class="p">:</span>
            <span class="c1">#separate the inputs from the labels</span>
            <span class="n">inputs</span><span class="p">,</span><span class="n">labels</span> <span class="o">=</span> <span class="n">data</span>
            <span class="c1">#wrap those into variables to keep track of how they are created and be able to compute their gradient.</span>
            <span class="n">inputs</span><span class="p">,</span> <span class="n">labels</span> <span class="o">=</span> <span class="n">Variable</span><span class="p">(</span><span class="n">inputs</span><span class="p">),</span> <span class="n">Variable</span><span class="p">(</span><span class="n">labels</span><span class="p">)</span>
            <span class="c1">#Put the gradients back to zero</span>
            <span class="n">optimizer</span><span class="o">.</span><span class="n">zero_grad</span><span class="p">()</span>
            <span class="c1">#Compute the outputs given by our model at this stage.</span>
            <span class="n">outputs</span> <span class="o">=</span> <span class="n">net</span><span class="p">(</span><span class="n">inputs</span><span class="p">)</span>
            <span class="n">_</span><span class="p">,</span><span class="n">preds</span> <span class="o">=</span> <span class="n">torch</span><span class="o">.</span><span class="n">max</span><span class="p">(</span><span class="n">outputs</span><span class="o">.</span><span class="n">data</span><span class="p">,</span><span class="mi">1</span><span class="p">)</span>
            <span class="c1">#Compute the loss</span>
            <span class="n">loss</span> <span class="o">=</span> <span class="n">F</span><span class="o">.</span><span class="n">nll_loss</span><span class="p">(</span><span class="n">outputs</span><span class="p">,</span> <span class="n">labels</span><span class="p">)</span>
            <span class="n">running_loss</span> <span class="o">+=</span> <span class="n">loss</span><span class="o">.</span><span class="n">data</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">*</span> <span class="n">inputs</span><span class="o">.</span><span class="n">size</span><span class="p">(</span><span class="mi">0</span><span class="p">)</span>
            <span class="n">corrects</span> <span class="o">+=</span> <span class="n">torch</span><span class="o">.</span><span class="n">sum</span><span class="p">(</span><span class="n">labels</span><span class="o">.</span><span class="n">data</span> <span class="o">==</span> <span class="n">preds</span><span class="p">)</span>
            <span class="c1">#Backpropagate the computation of the gradients</span>
            <span class="n">loss</span><span class="o">.</span><span class="n">backward</span><span class="p">()</span>
            <span class="c1">#Do the step of the SGD</span>
            <span class="n">optimizer</span><span class="o">.</span><span class="n">step</span><span class="p">()</span>
        <span class="k">print</span><span class="p">(</span><span class="n">f</span><span class="s1">'Loss: {running_loss/len(trn_set)} Accuracy: {100.*corrects/len(trn_set)}'</span><span class="p">)</span>
</pre>
<p>After training our simple neural net for 10 epochs on the train set, we get an accuracy of 96.23%. It seems like a great result but we need to see if it generalizes well or
if our model just learned to recognize the particular images of the training set extremely well (we call this overfitting).</p>
<p>The loop to check how well our model is doing on the test set is very similar to the training loop, minus the gradients, and as shown in the notebook, we get a 96% accuracy
there. Not bad for such a simple model!</p>
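<p>The notebook has the actual evaluation code; what follows is only a sketch of what such a loop can look like, assuming a recent version of pytorch. The function name <code>evaluate</code>, the small stand-in model and the random data are all made up so that the block runs on its own, so the printed numbers mean nothing here (expect an accuracy around chance level).</p>

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader, TensorDataset

def evaluate(model, loader):
    """Return the mean loss and the accuracy (in percent) of model on loader."""
    model.eval()
    total_loss, corrects, count = 0.0, 0, 0
    with torch.no_grad():  # no gradients needed when we only evaluate
        for inputs, labels in loader:
            outputs = model(inputs)        # log probabilities
            preds = outputs.argmax(dim=1)  # most likely class for each input
            total_loss += F.nll_loss(outputs, labels, reduction='sum').item()
            corrects += (preds == labels).sum().item()
            count += labels.size(0)
    return total_loss / count, 100.0 * corrects / count

# Stand-in model and random data, only there to make the sketch runnable.
torch.manual_seed(0)
model = nn.Sequential(nn.Flatten(), nn.Linear(784, 10), nn.LogSoftmax(dim=-1))
fake_set = TensorDataset(torch.rand(256, 1, 28, 28), torch.randint(0, 10, (256,)))
loss, acc = evaluate(model, DataLoader(fake_set, batch_size=64))

print(f'Loss: {loss} Accuracy: {acc}')
```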
</div>
<script type='text/javascript'>if (!document.getElementById('mathjaxscript_pelican_#%@#$@#')) {
var align = "center",
indent = "0em",
linebreak = "false";
if (false) {
align = (screen.width < 768) ? "left" : align;
indent = (screen.width < 768) ? "0em" : indent;
linebreak = (screen.width < 768) ? 'true' : linebreak;
}
var mathjaxscript = document.createElement('script');
mathjaxscript.id = 'mathjaxscript_pelican_#%@#$@#';
mathjaxscript.type = 'text/javascript';
mathjaxscript.src = 'https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.0/MathJax.js?config=TeX-AMS-MML_HTMLorMML';
mathjaxscript[(window.opera ? "innerHTML" : "text")] =
"MathJax.Hub.Config({" +
" config: ['MMLorHTML.js']," +
" TeX: { extensions: ['AMSmath.js','AMSsymbols.js','noErrors.js','noUndefined.js'], equationNumbers: { autoNumber: 'AMS' } }," +
" jax: ['input/TeX','input/MathML','output/HTML-CSS']," +
" extensions: ['tex2jax.js','mml2jax.js','MathMenu.js','MathZoom.js']," +
" displayAlign: '"+ align +"'," +
" displayIndent: '"+ indent +"'," +
" showMathMenu: true," +
" messageStyle: 'normal'," +
" tex2jax: { " +
" inlineMath: [ ['\\\\(','\\\\)'] ], " +
" displayMath: [ ['$$','$$'] ]," +
" processEscapes: true," +
" preview: 'TeX'," +
" }, " +
" 'HTML-CSS': { " +
" styles: { '.MathJax_Display, .MathJax .mo, .MathJax .mi, .MathJax .mn': {color: 'inherit ! important'} }," +
" linebreaks: { automatic: "+ linebreak +", width: '90% container' }," +
" }, " +
"}); " +
"if ('default' !== 'default') {" +
"MathJax.Hub.Register.StartupHook('HTML-CSS Jax Ready',function () {" +
"var VARIANT = MathJax.OutputJax['HTML-CSS'].FONTDATA.VARIANT;" +
"VARIANT['normal'].fonts.unshift('MathJax_default');" +
"VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
"VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
"VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
"});" +
"MathJax.Hub.Register.StartupHook('SVG Jax Ready',function () {" +
"var VARIANT = MathJax.OutputJax.SVG.FONTDATA.VARIANT;" +
"VARIANT['normal'].fonts.unshift('MathJax_default');" +
"VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
"VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
"VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
"});" +
"}";
(document.body || document.getElementsByTagName('head')[0]).appendChild(mathjaxscript);
}
</script>What Is Deep Learning?2018-03-13T17:20:00-04:002018-03-13T17:20:00-04:00Sylvain Guggertag:None,2018-03-13:/what-is-deep-learning.html<p class="first last">What is deep learning? It's a class of algorithms where you train something called a neural net to complete a specific task. Let's begin with a general overview and we will dig into the details in subsequent articles.</p>
<p>What is deep learning? It's a class of algorithms where you train something called a neural net to complete a specific task. Let's begin with a general overview and we will dig into
the details in subsequent articles.</p>
<div class="section" id="a-neural-net">
<h2>A neural net</h2>
<p>To some extent, a neural net is an attempt by engineers to get a computer to replicate the behavior of our brain. A neuron in our brain gets a .</p>
<img alt="Neuron model" class="align-center" src="../images/art1_neuron.png" />
<p>A model of that is to consider a structure getting a certain number of entries, each of them having a weight attributed to them. If this neuron has <span class="math">\(n\)</span> entries that are
<span class="math">\(x_{1},\dots,x_{n}\)</span>, with the weights <span class="math">\(w_{1},\dots,w_{n}\)</span>, we consider the sum of the inputs multiplied by their weights to compute the output:</p>
<div class="math">
\begin{equation*}
y = f \left ( w_{1}x_{1} + \cdots + w_{n}x_{n} \right )
\end{equation*}
</div>
<p>or in a more compact way</p>
<div class="math">
\begin{equation*}
y = f \left ( \sum_{i=1}^{n} w_{i} x_{i} \right ).
\end{equation*}
</div>
<p>Here <span class="math">\(y\)</span> is the output and <span class="math">\(f\)</span> a function called the activation function. A few classical activation functions are the rectified linear unit (ReLU), the
sigmoid function or the hyperbolic tangent function:</p>
<img alt="Graph of the ReLU function" class="align-center" src="../images/art1_relu.png" style="width: 500px;" />
<div class="math">
\begin{equation*}
\hbox{ReLU}(x) = \max(0,x)
\end{equation*}
</div>
<img alt="Graph of the sigmoid function" class="align-center" src="../images/art1_sigmoid.png" style="width: 400px;" />
<div class="math">
\begin{equation*}
\sigma(x) = \frac{\mathrm{e}^{x}}{1+\mathrm{e}^{x}}
\end{equation*}
</div>
<img alt="Graph of the tanh function" class="align-center" src="../images/art1_tanh.png" style="width: 500px;" />
<div class="math">
\begin{equation*}
\hbox{tanh}(x) = \frac{\mathrm{e}^{x} - \mathrm{e}^{-x}}{\mathrm{e}^{x} + \mathrm{e}^{-x}}
\end{equation*}
</div>
<p>The function <span class="math">\(\hbox{tanh}\)</span> is actually a sigmoid that we enlarged and translated to go from -1 to 1. To do this, we just have to evaluate the sigmoid at <span class="math">\(2x\)</span>, multiply it by 2 (the length of the range between
-1 and 1) then subtract 1:</p>
<div class="math">
\begin{equation*}
\hbox{tanh}(x) = 2 \sigma(2x) - 1
\end{equation*}
</div>
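<p>This identity is easy to check numerically. The sketch below uses only the standard math module; the sample points are arbitrary.</p>

```python
import math

def sigmoid(x):
    # Same function as e^x / (1 + e^x), written in an equivalent form.
    return 1.0 / (1.0 + math.exp(-x))

# A few arbitrary points, negative, zero and positive.
points = [-3.0, -0.5, 0.0, 1.0, 2.5]
pairs = [(math.tanh(x), 2 * sigmoid(2 * x) - 1) for x in points]

for x, (left, right) in zip(points, pairs):
    print(x, left, right)  # the last two columns agree up to rounding
```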
<p>These three functions are just the three most popular, but we could take any function, as long as it's non-linear and easy to differentiate (for reasons we will see later).</p>
<p>One last parameter we usually consider in our neuron is something called a bias, that is added to the weighted sum of the inputs before going into the activation function.</p>
<div class="math">
\begin{equation*}
y = f \left ( \sum_{i=1}^{n} w_{i} x_{i} + b \right ).
\end{equation*}
</div>
<p>If we consider the case of a ReLU activation function (which basically replaces the negative values by zero), the opposite of this bias is then the value the weighted sum has to exceed to get an
output that isn't zero.</p>
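<p>Here is a minimal sketch of one such neuron in plain Python. The function names and the numbers are made up for the illustration: with a bias of -1, the neuron stays silent until the weighted sum of its inputs goes above 1.</p>

```python
def relu(x):
    return max(0.0, x)

def neuron(inputs, weights, bias, activation=relu):
    """Weighted sum of the inputs, plus the bias, through the activation."""
    weighted_sum = sum(w * x for w, x in zip(weights, inputs))
    return activation(weighted_sum + bias)

weights, bias = [0.5, -1.0], -1.0

# Weighted sum is 0.0, below the threshold of 1.0: the output is zero.
quiet = neuron([1.0, 0.5], weights, bias)
# Weighted sum is 1.5, above the threshold of 1.0: the output is 0.5.
active = neuron([4.0, 0.5], weights, bias)

print(quiet, active)  # 0.0 0.5
```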
<p>So that's one neuron. In a neural net, we have quite a few of them, grouped in what we call layers. An example of a neural net with a single layer would look like this:</p>
<img alt="Neural net with one layer" class="align-center" src="../images/art1_simplennet.jpg" />
<p>So let's say we want to build a neural net with an input size <span class="math">\(n_{in}\)</span> and an output size <span class="math">\(n_{out}\)</span>. We then must have <span class="math">\(n_{out}\)</span> neurons.
Each one of our neurons then has <span class="math">\(n_{in}\)</span> inputs, so it must have as many weights. If we consider neuron number <span class="math">\(i\)</span>, we can call its weights
<span class="math">\(w_{1,i},\dots,w_{n_{in},i}\)</span> and its bias <span class="math">\(b_{i}\)</span>; the output number <span class="math">\(i\)</span> is then</p>
<div class="math">
\begin{equation*}
y_{i} = f \left ( \sum_{k=1}^{n_{in}} x_{k} w_{k,i} + b_{i} \right )
\end{equation*}
</div>
<p>where <span class="math">\(x_{1},\dots,x_{n_{in}}\)</span> are the coordinates of the input. There is a more compact way to write this, with a little bit of linear algebra. The big sum inside the
parentheses is just the i-th coordinate of the matrix product <span class="math">\(XW\)</span> if we define the matrix <span class="math">\(W\)</span> as the array of weights <span class="math">\((w_{k,i})\)</span> (with <span class="math">\(n_{in}\)</span> rows and
<span class="math">\(n_{out}\)</span> columns) and <span class="math">\(X\)</span> as a vector containing the inputs (viewed as a single row). If we then denote by <span class="math">\(Y\)</span> the vector containing the outputs and by <span class="math">\(B\)</span> the
vector containing the biases (both have <span class="math">\(n_{out}\)</span> coordinates), we can simply write</p>
<div class="math">
\begin{equation*}
Y = f(XW + B)
\end{equation*}
</div>
<p>where <span class="math">\(f\)</span> is applied to each one of the coordinates of the vector <span class="math">\(XW + B\)</span>.</p>
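<p>The formula for one layer can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the name <code>layer_forward</code>, the choice of a sigmoid for the activation and the numbers are all made up, and a real implementation would use a matrix library rather than nested lists.</p>

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def layer_forward(x, W, B, f):
    """Compute Y = f(XW + B) for one input row x.

    W is stored as n_in rows of n_out columns, so W[k][i] is the weight
    linking input k to neuron i, and B holds one bias per neuron.
    """
    n_in, n_out = len(W), len(W[0])
    return [
        f(sum(x[k] * W[k][i] for k in range(n_in)) + B[i])
        for i in range(n_out)
    ]

# Toy layer with n_in = 3 inputs and n_out = 2 neurons (made-up numbers).
x = [1.0, 2.0, 3.0]
W = [[0.1, -0.2],
     [0.0, 0.5],
     [0.3, 0.1]]
B = [-1.0, 0.0]
y = layer_forward(x, W, B, sigmoid)
```

<p>The result <code>y</code> has one coordinate per neuron, each obtained by applying the activation to its own weighted sum plus bias.</p>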
<p>This is the only thing a neural net does: apply a linear operation, then an activation function. Except it does that many times: instead of having just one layer of neurons, we
have multiple ones, each one feeding the next.</p>
<img alt="Neural net with three layers" class="align-center" src="../images/art1_complexnnet.png" />
<p>Here we have three layers, each one having its own set of weights <span class="math">\(W_{l}\)</span>, its vector of biases <span class="math">\(B_{l}\)</span> and its activation function <span class="math">\(f_{l}\)</span>. The only constraint
is that each vector of biases has the same number of coordinates as the number of columns of the weight matrix, which must also be the number of rows of the next weight matrix.</p>
<p>If we have an input <span class="math">\(X\)</span>, we compute the output by going through each layer, one after the other:</p>
<div class="math">
\begin{equation*}
\left \{ \begin{array}{l} X_{0} = X \\ X_{1} = f_{1}(X_{0}W_{1} + B_{1}) \\ X_{2} = f_{2}(X_{1}W_{2} + B_{2}) \\ \vdots \\ X_{L} = f_{L}(X_{L-1}W_{L} + B_{L}) \end{array} \right .
\end{equation*}
</div>
<p>where <span class="math">\(L\)</span> is the number of layers. This is when we see why each activation function must be non-linear. If one (say <span class="math">\(f_{1}\)</span>) was linear, the operations
going from <span class="math">\(X_{0}\)</span> to <span class="math">\(X_{1}W_{2} + B_{2}\)</span> would all be linear, so they could be summarized into</p>
<div class="math">
\begin{equation*}
X_{1}W_{2} + B_{2} = X_{0}W'_{1} + B'_{1}
\end{equation*}
</div>
<p>and there wouldn't be any need to have that initial first layer.</p>
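<p>Here is a small sketch of that collapse, taking the identity as the (linear) first activation. The helpers <code>matmul</code> and <code>add</code> and all the numbers are assumptions made for the example; <code>W_merged</code> and <code>B_merged</code> play the roles of <span class="math">\(W'_{1}\)</span> and <span class="math">\(B'_{1}\)</span>.</p>

```python
def matmul(A, B):
    """Product of two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def add(A, B):
    """Entry-wise sum of two matrices of the same shape."""
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

# Two layers where the first activation is the identity (a linear function).
X0 = [[1.0, 2.0]]                          # one input row, n_in = 2
W1 = [[1.0, 0.0, -1.0], [2.0, 1.0, 0.0]]
B1 = [[0.5, -0.5, 1.0]]
W2 = [[1.0], [2.0], [3.0]]
B2 = [[-1.0]]

X1 = add(matmul(X0, W1), B1)               # no non-linearity applied here
two_layers = add(matmul(X1, W2), B2)

# The same map written as a single linear operation.
W_merged = matmul(W1, W2)                  # plays the role of W'_1
B_merged = add(matmul(B1, W2), B2)         # plays the role of B'_1
one_layer = add(matmul(X0, W_merged), B_merged)
```

<p>Both <code>two_layers</code> and <code>one_layer</code> hold the same value, which is why the extra layer brings nothing without a non-linear activation in between.</p>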
</div>
<div class="section" id="training">
<h2>Training</h2>
<p>Now that we know what a neural net is, we can study how we can teach it to solve a particular problem. All the weights and the biases of the neural net we saw before (that are
called the parameters of our model) are initialized at random, so initially, the output the neural net will compute has nothing to do with what we would expect. It's through
a process called training that we will make our model better.</p>
<p>To do this, we need a set of labeled data, which is a collection of inputs where we know the desired output, for instance, in an image classification problem, pictures that have
been classified for us. We can then evaluate how badly our model is doing by computing all the outputs and comparing them to the theoretical ones. To give a value to this, we
use an error function.</p>
<p>An error function that is often used is called MSE, for Mean Squared Error. If <span class="math">\(Y\)</span> is the output we computed and <span class="math">\(Z\)</span> the one we should have found, one way to see
how far away <span class="math">\(Y\)</span> is from <span class="math">\(Z\)</span> is to take the mean of the errors between each coordinate <span class="math">\(y_{i}\)</span> and <span class="math">\(z_{i}\)</span>. This error can be represented by
<span class="math">\((z_{i}-y_{i})^{2}\)</span>. The square is there to get rid of the negatives (an error of -4 is as bad as an error of 4). If <span class="math">\(Y\)</span> and <span class="math">\(Z\)</span> are of size <span class="math">\(n_{out}\)</span>, this
can be written</p>
<div class="math">
\begin{equation*}
\hbox{MSE}(Y,Z) = \frac{1}{n_{out}} \sum_{i=1}^{n_{out}} (z_{i}-y_{i})^{2}.
\end{equation*}
</div>
<p>And then we can define the total loss by taking the mean of the loss on all our data. If we have <span class="math">\(N\)</span> inputs <span class="math">\(X_{1},\dots,X_{N}\)</span> that are labeled
<span class="math">\(Z_{1},\dots,Z_{N}\)</span> (the theoretical outputs we are supposed to get), and if we call <span class="math">\(Y_{i}\)</span> what the network returns when we feed it <span class="math">\(X_{i}\)</span>,
we then have</p>
<div class="math">
\begin{equation*}
\hbox{loss} = \frac{1}{N} \sum_{k=1}^{N} \hbox{MSE}(Y_{k},Z_{k})
\end{equation*}
</div>
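<p>Both formulas translate directly into code. The sketch below uses plain Python lists; the function names <code>mse</code> and <code>total_loss</code> and the sample values are assumptions made for the example.</p>

```python
def mse(y, z):
    """Mean squared error between a computed output y and its target z."""
    return sum((zi - yi) ** 2 for yi, zi in zip(y, z)) / len(y)

def total_loss(outputs, targets):
    """Average of the per-sample MSE over the whole dataset."""
    return sum(mse(y, z) for y, z in zip(outputs, targets)) / len(outputs)

# Two samples with n_out = 2 (made-up numbers).
outputs = [[1.0, 2.0], [0.0, 0.0]]
targets = [[1.0, 4.0], [3.0, 0.0]]
loss = total_loss(outputs, targets)
```

<p>Here the two samples have an MSE of 2.0 and 4.5, so the total loss is their mean, 3.25.</p>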
<p>Any kind of function could be used for the loss, as long as it's never negative (you don't want the loss on one sample to cancel out the loss on another), it vanishes only when the
output matches the target, and it's easy to differentiate. The total loss is then obtained by averaging the loss over all the samples of the dataset.</p>
<p>Since the network's parameters are initialized at random, this loss will be pretty bad at the beginning. Training is a process during which the computer will compute this loss,
analyze why it's so bad, and try to do a little bit better the next time. More specifically, we will try to determine a new set of parameters (all the weights and the biases)
that will give us a slightly better loss. Then by repeating this over and over again, we should find the set of parameters that minimize this loss.</p>
<p>The exciting thing with neural networks is that even if they learn on a specific dataset, they tend to generalize pretty well (and there's a bunch of techniques we can use to
make sure the model doesn't overfit to the training data). In image recognition for instance, those kinds of models can reach better accuracy than humans on some benchmarks.</p>
<p>To minimize this loss, we use an algorithm called SGD for Stochastic Gradient Descent. The idea is fairly simple: if you're in the mountains and looking for the point that is
at the lowest altitude, you just take a step down, and a step down, and so forth until you reach that particular spot. This is going to be exactly the same for our neural
net and its loss. To minimize that function, we will take a little step down.</p>
<p>This loss function depends on a certain number of parameters <span class="math">\(p_{1},\dots,p_{t}\)</span> (all the weights and all the biases). Now, with just a little bit of math, we know that
the way down for a function of <span class="math">\(t\)</span> variables (the direction in which it decreases fastest) is given by the opposite of the gradient. The gradient is the vector</p>
<div class="math">
\begin{equation*}
\overrightarrow{\hbox{grad}}(\hbox{loss}) = \left ( \frac{\partial \hbox{loss}}{\partial p_{1}}, \dots, \frac{\partial \hbox{loss}}{\partial p_{t}} \right )
\end{equation*}
</div>
<p>To update our parameters, we just have to take a step along the opposite of the gradient, which means subtracting from the vector <span class="math">\((p_{1},\dots,p_{t})\)</span> a little bit
multiplied by this gradient vector. How much? That's the question that has been driving a lot of data scientists crazy, and we will give an answer in another article.
This little bit is called the learning rate, and if we denote it by <span class="math">\(\hbox{lr}\)</span> we can update our parameters with the formula:</p>
<div class="math">
\begin{equation*}
\hbox{new } p_{i} = \hbox{old } p_{i} - \hbox{lr} \times \frac{\partial \hbox{loss}}{\partial p_{i}}.
\end{equation*}
</div>
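<p>As a minimal sketch of this update rule, here it is applied repeatedly to a toy loss with two parameters. The loss, its hand-computed gradient and the names <code>sgd_step</code>, <code>loss</code> and <code>loss_gradient</code> are hypothetical; a real network would get its gradient from the data instead.</p>

```python
def sgd_step(params, grads, lr):
    """Apply new p_i = old p_i - lr * d(loss)/d(p_i) to every parameter."""
    return [p - lr * g for p, g in zip(params, grads)]

def loss(params):
    """Hypothetical toy loss, minimal at p = (3, -1)."""
    p1, p2 = params
    return (p1 - 3.0) ** 2 + (p2 + 1.0) ** 2

def loss_gradient(params):
    """Partial derivatives of the toy loss above, computed by hand."""
    p1, p2 = params
    return [2.0 * (p1 - 3.0), 2.0 * (p2 + 1.0)]

params = [0.0, 0.0]
history = [loss(params)]
for _ in range(50):
    params = sgd_step(params, loss_gradient(params), lr=0.1)
    history.append(loss(params))
```

<p>After these steps, <code>params</code> sits close to the minimum at (3, -1) and <code>history</code> records the loss going down along the way.</p>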
<p>By doing this, the loss should be lower the next time we compute the outputs on all our data (the main exception would be if we chose a learning rate that is too high,
which would make us overshoot the spot where our function is lowest, but we'll get back to this later). So by repeating this step over and over again, we will eventually get
to a minimum of our loss function, and a very good model.</p>
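<p>The effect of a learning rate that is too high can be seen on a one-parameter toy problem. In this sketch the loss is assumed to be <code>p**2</code> (so its gradient is <code>2 * p</code>), and the function name <code>descend</code> and the two learning rates are picked only for illustration.</p>

```python
def descend(lr, steps=20, start=1.0):
    """Gradient descent on loss(p) = p**2, whose gradient is 2 * p."""
    p = start
    for _ in range(steps):
        p = p - lr * 2.0 * p
    return p

small_lr = descend(lr=0.1)   # each step multiplies p by 0.8, so p shrinks toward 0
large_lr = descend(lr=1.1)   # each step multiplies p by -1.2, so p overshoots and grows
```

<p>With the small learning rate the parameter approaches the minimum at 0; with the large one every step jumps over the minimum and lands further away than before.</p>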
<p>This explains the Gradient Descent in SGD but not the Stochastic part. The random part appears by necessity: very often our training dataset has a lot of labeled inputs. It can
be as big as a million images. That's why, for a step of gradient descent, we don't compute the total loss, but rather the loss on a smaller sample called a mini-batch. If we
choose to take the sample <span class="math">\((X_{k_{1}},Z_{k_{1}}),\dots,(X_{k_{mb}},Z_{k_{mb}})\)</span> the loss on this mini-batch will just be:</p>
<div class="math">
\begin{equation*}
\hbox{loss}' = \frac{1}{mb} \sum_{q=1}^{mb} \hbox{MSE}(Y_{k_{q}},Z_{k_{q}})
\end{equation*}
</div>
<p>The idea is that this new loss will have a gradient that is close to the gradient of the real loss (since we're averaging over a mini-batch and not just taking one sample) but takes
less time to compute. In practice, to make sure we still see all of the data, we don't let the mini-batches overlap: each randomly picked mini-batch takes a different part of our
training set, and we update all the parameters of our network after each one, until we have seen all the inputs once. This whole process is called an epoch.</p>
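<p>One common way to get random mini-batches that never overlap is to shuffle the indices once per epoch and then cut them into chunks. The sketch below assumes that approach; the generator name <code>minibatches</code> and the sizes are made up for the example, and only the standard <code>random</code> module is used.</p>

```python
import random

def minibatches(n_samples, batch_size, rng):
    """Yield lists of sample indices that cover the dataset exactly once.

    The indices are shuffled, then cut into consecutive chunks, so the
    mini-batches are random but never overlap within an epoch.
    """
    indices = list(range(n_samples))
    rng.shuffle(indices)
    for start in range(0, n_samples, batch_size):
        yield indices[start:start + batch_size]

rng = random.Random(0)        # fixed seed, only to make the sketch repeatable
epoch = list(minibatches(n_samples=10, batch_size=4, rng=rng))
# One parameter update would be made per mini-batch in `epoch`;
# the last batch is smaller when batch_size does not divide n_samples.
```

<p>Calling the generator again gives a new shuffle, which corresponds to starting the next epoch.</p>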
<p>We can then run as many epochs as we want (or as we have time for). As long as the learning rate is low enough, the neural network should progress and become better each time.
The gradient may seem a bit complicated to evaluate, but it can be computed exactly by using the chain rule, going backward from the end (computing the derivatives of the loss
function with respect to the obtained outputs), through each layer, back to the beginning.</p>
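<p>To give an idea of what going backward with the chain rule looks like, here is a sketch on the smallest possible network: one sigmoid neuron with one input, followed by a squared error. All names and values are assumptions for the illustration, and the finite-difference estimate at the end is only there as a sanity check.</p>

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss_for(w, b, x, z_target):
    """One sigmoid neuron followed by a squared error."""
    y = sigmoid(w * x + b)
    return (z_target - y) ** 2

def backward(w, b, x, z_target):
    """Gradient of the loss with respect to w and b, via the chain rule."""
    s = w * x + b
    y = sigmoid(s)
    dloss_dy = -2.0 * (z_target - y)   # start from the end: loss w.r.t. the output
    dy_ds = y * (1.0 - y)              # derivative of the sigmoid
    dloss_ds = dloss_dy * dy_ds        # go back through the activation
    return dloss_ds * x, dloss_ds      # ds/dw = x and ds/db = 1

w, b, x, z_target = 0.5, -0.3, 2.0, 1.0
grad_w, grad_b = backward(w, b, x, z_target)

# Sanity check against a finite-difference estimate of the same derivatives.
eps = 1e-6
approx_w = (loss_for(w + eps, b, x, z_target) - loss_for(w - eps, b, x, z_target)) / (2 * eps)
approx_b = (loss_for(w, b + eps, x, z_target) - loss_for(w, b - eps, x, z_target)) / (2 * eps)
```

<p>The two ways of computing the gradient agree up to rounding, but the chain-rule version needs a single backward pass instead of two extra evaluations of the loss per parameter.</p>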
<p>That is all the general theory behind a neural network. I will dig more into the details in further articles to explain the different layers we can find, the little tweaks we can
add to SGD to make it train faster, how to set the learning rate and how to compute those gradients. We'll see how to code simple and more complex examples of neural networks
in pytorch, but you can already jump a bit ahead and look at the <a class="reference external" href="http://course.fast.ai/lessons/lesson1.html">first video</a> of the deep-learning course of fast.ai, and train in a few minutes a neural net that tells cats from
dogs with 99% accuracy.</p>
</div>
<script type='text/javascript'>if (!document.getElementById('mathjaxscript_pelican_#%@#$@#')) {
var align = "center",
indent = "0em",
linebreak = "false";
if (false) {
align = (screen.width < 768) ? "left" : align;
indent = (screen.width < 768) ? "0em" : indent;
linebreak = (screen.width < 768) ? 'true' : linebreak;
}
var mathjaxscript = document.createElement('script');
mathjaxscript.id = 'mathjaxscript_pelican_#%@#$@#';
mathjaxscript.type = 'text/javascript';
mathjaxscript.src = 'https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.0/MathJax.js?config=TeX-AMS-MML_HTMLorMML';
mathjaxscript[(window.opera ? "innerHTML" : "text")] =
"MathJax.Hub.Config({" +
" config: ['MMLorHTML.js']," +
" TeX: { extensions: ['AMSmath.js','AMSsymbols.js','noErrors.js','noUndefined.js'], equationNumbers: { autoNumber: 'AMS' } }," +
" jax: ['input/TeX','input/MathML','output/HTML-CSS']," +
" extensions: ['tex2jax.js','mml2jax.js','MathMenu.js','MathZoom.js']," +
" displayAlign: '"+ align +"'," +
" displayIndent: '"+ indent +"'," +
" showMathMenu: true," +
" messageStyle: 'normal'," +
" tex2jax: { " +
" inlineMath: [ ['\\\\(','\\\\)'] ], " +
" displayMath: [ ['$$','$$'] ]," +
" processEscapes: true," +
" preview: 'TeX'," +
" }, " +
" 'HTML-CSS': { " +
" styles: { '.MathJax_Display, .MathJax .mo, .MathJax .mi, .MathJax .mn': {color: 'inherit ! important'} }," +
" linebreaks: { automatic: "+ linebreak +", width: '90% container' }," +
" }, " +
"}); " +
"if ('default' !== 'default') {" +
"MathJax.Hub.Register.StartupHook('HTML-CSS Jax Ready',function () {" +
"var VARIANT = MathJax.OutputJax['HTML-CSS'].FONTDATA.VARIANT;" +
"VARIANT['normal'].fonts.unshift('MathJax_default');" +
"VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
"VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
"VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
"});" +
"MathJax.Hub.Register.StartupHook('SVG Jax Ready',function () {" +
"var VARIANT = MathJax.OutputJax.SVG.FONTDATA.VARIANT;" +
"VARIANT['normal'].fonts.unshift('MathJax_default');" +
"VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
"VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
"VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
"});" +
"}";
(document.body || document.getElementsByTagName('head')[0]).appendChild(mathjaxscript);
}
</script>Why Write A Blog?2018-03-12T16:57:00-04:002018-03-12T16:57:00-04:00Sylvain Guggertag:None,2018-03-12:/why-write-a-blog.html<p>I've just been accepted to follow the 2018 version of Deep learning - part 2 on <a class="reference external" href="http://fast.ai/">fast.ai</a>, and I'm pretty excited about it. As I'm reaching the stage at which this is becoming more
than a hobby and the plan is to switch careers to Data Science, I've taken the time to ponder all the alternatives and I've settled on self-studying.</p>
<p>Which brings me to this blog. How do you measure your progress when there's no one to grade your work? One solution I've found is to write about what I learn. Nothing here is going
to be new; I simply intend to explain in my own words concepts that have been detailed elsewhere (probably with fewer grammatical mistakes!). I could say that
my only reader is going to be my mom, but she doesn't read English, so that won't even be the case. As an ex-teacher, I simply believe you've
never completely mastered something until you've taught it to someone else.</p>
<p>For now the curriculum I have settled on is:</p>
<ul class="simple">
<li>the deep learning course of <a class="reference external" href="http://fast.ai/">fast.ai</a> (part 1 and 2);</li>
<li>the machine learning course of <a class="reference external" href="http://fast.ai/">fast.ai</a>;</li>
<li>the online book of <a class="reference external" href="http://neuralnetworksanddeeplearning.com/">Michael Nielsen</a>;</li>
<li>Python for data analysis.</li>
</ul>
<p>I will update this list as it grows. I will try to get to the bottom of all the concepts I learn, and I intend to code everything from scratch. Since the
fast.ai library is built on top of pytorch, this is the library I will mostly use, along with numpy and pandas. All these articles will be in the Deep learning category.</p>
<p>I also plan to play around with different approaches and parameters, to highlight the importance of each decision we make when building a model. To that end, I'll design and
implement as many experiments as I can and put the results in the Experiments category.</p>
<p>Explaining things is all very good, but it's even better to show what you can actually do. I plan to enter a few <a class="reference external" href="http://kaggle.com/">Kaggle</a> competitions, hopefully achieve a good ranking,
and I will also use some of the articles of this blog as a portfolio to demonstrate my skills. Those articles will be in the Portfolio category.</p>