A Neural Net In Pytorch

Posted on Fri 16 March 2018 in Basics

This article goes with this notebook if you want to really do the experiment. In particular, I won't explain the specifics of getting the data and preprocessing it here.


Pytorch is a Python library that provides everything needed to implement Deep Learning easily. In particular, it enables GPU-accelerated computations and provides automatic differentiation. We have seen why the latter is useful in the previous article, and this is the reason why we will never have to worry about calculating gradients (unless we really want to dig into that).

But why GPUs? As we have seen, Deep Learning is just a succession of linear operations with a few functions applied element-wise in between, and it happens that GPUs are really good (and fast!) at those, because that's basically what is needed to decide which color each pixel of the screen should have when playing a game. Thanks to the gaming industry, research on GPUs has made them extremely efficient, which is also why Deep Learning has become better in a lot of different areas. We can consider deeper networks and train them on much more data nowadays.

To use the full potential of this library, we're going to need one, preferably several, efficient GPUs. A gaming computer can have one, but the best way is to rent some. Services to rent GPUs by the hour have flourished and you can easily find some powerful virtual machines with efficient GPUs for less than fifty cents an hour. I'm personally using Paperspace at the moment.

I'm mostly using pytorch because the fast.ai library is built on top of it, but I really like the way it uses Python features (as we'll see, it makes good use of Object Oriented Programming in Python) and the fact that the gradients are dynamically computed. This makes the implementation of Recurrent Neural Networks a lot easier in my opinion, but we'll see more of that later.

MNIST Dataset

To have some data on which to try our neural net, we will use the MNIST Dataset. It's a set of hand-written digits that contains 70,000 pictures with their labels. It's divided in two parts, one training set with 60,000 digits (on which we will train our model) and 10,000 others that form the test set. These were drawn by different people from the ones in the first set, and by evaluating how well our model does on this set, we will see how well it actually generalizes what it learned.

We'll skip the part about how to get those sets and how to preprocess them since it's all shown in the notebook. Let's go to the part where we define our neural net instead. The pictures we are given have a size of 28 by 28 pixels, each pixel having a value from 0 (white) to 1 (black), so that makes 784 inputs. For this simple model, we choose one hidden layer of 100 neurons, and then an output size of 10 since we have ten different digits.

Why 10 and not 1? It's true that in this case we could have asked for just one output going from 0 to 9 (and there are ways to make sure it'd behave like this) but in image classification problems, we often give as many outputs as there are classes to determine. What if our next problem is to say if the picture is of a dog, a cat, a frog or a horse? One output won't really represent this, whereas four outputs will certainly help, each of them representing the probability it's in a given class.


Softmax

When we have a classification problem and a neural network trying to solve it with \(N\) outputs (the number of classes), we would like those outputs to represent the probabilities the input is in each of the classes. To make sure that our final \(N\) numbers are all positive and add up to one, we use the softmax activation for the last layer.

If \(z_{1},\dots,z_{N}\) are the last activations given by our final linear layer, instead of pushing them through a ReLU or a sigmoid, we define the outputs \(y_{1},\dots,y_{N}\) by

\begin{equation*} y_{i} = \frac{\mathrm{e}^{z_{i}}}{\mathrm{e}^{z_{1}} + \cdots + \mathrm{e}^{z_{N}}} = \frac{\mathrm{e}^{z_{i}}}{\sum_{k=1}^{N} \mathrm{e}^{z_{k}}}. \end{equation*}

As we take the exponentials of the \(z_{i}\), we are sure all of them are positive. Then since we divide by their sum, they must all add up to one, so softmax satisfies all the prerequisites we wanted for our final output.

One nice side effect (and the reason we chose the exponential) is that if one of the \(z_{i}\) is bigger than the others, its exponential will be a lot bigger. As the gap grows, the corresponding \(y_{i}\) gets close to 1, while the other \(y_{j}\) get close to zero. Softmax is an activation that really wants to pick one class over the others.

It's not essential, and a neural net could certainly learn with ReLU or sigmoid as its final activation function, but by using softmax we are making it easier for it to have an output that is close to what we really want, so it will learn faster and generalize better.
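To make the formula concrete, here is a minimal sketch of softmax in plain Python. The input values are made up for the illustration, and subtracting the maximum before taking the exponentials is a standard trick to avoid overflow (it doesn't change the result), not something the formula above requires.

```python
import math

def softmax(zs):
    """Softmax of a list of floats, following the formula above."""
    # Subtracting the max leaves the result unchanged but keeps exp() from overflowing.
    m = max(zs)
    exps = [math.exp(z - m) for z in zs]
    total = sum(exps)
    return [e / total for e in exps]

# Made-up activations: the last one is clearly the biggest.
ys = softmax([1.0, 2.0, 5.0])
print(ys)       # three positive numbers, the last one dominating
print(sum(ys))  # adds up to one, up to rounding
```

The last output takes most of the probability mass even though the activations are only a few units apart, which is the "pick one class" behavior described above.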

Cross Entropy

To evaluate how badly our model is doing, we saw the Mean Squared Error loss in the last article. When the output activation function is softmax or a sigmoid, another function is usually used, called Cross Entropy Loss. If the correct class our model should pick is the \(i\)-th, we define the loss as being \(-\ln(y_{i})\) when the output is \((y_{1},\dots,y_{N})\).

Since all the \(y_{i}\) are between 0 and 1, this loss is a positive number, and it vanishes when \(y_{i} = 1\). If \(y_{i}\) is really low though (meaning we are making a mistake by giving the correct class a low probability) it'll get particularly high.

If we had multiple correct answers (in a multi-classification problem) we would sum the \(-\ln(y_{i})\) over all the correct classes \(i\).

Note that with the usual formulas, we have

\begin{equation*} \ln(y_{i}) = \ln \left ( \frac{\mathrm{e}^{z_{i}}}{\sum_{k=1}^{N} \mathrm{e}^{z_{k}}} \right ) = \ln(\mathrm{e}^{z_{i}}) - \ln \left ( \sum_{k=1}^{N} \mathrm{e}^{z_{k}} \right ) = z_{i} - \ln \left ( \sum_{k=1}^{N} \mathrm{e}^{z_{k}} \right ), \end{equation*}

so the derivative of the loss with respect to \(z_{i}\) is

\begin{equation*} \frac{\partial \hbox{loss}}{\partial z_{i}} = -1 + \frac{\mathrm{e}^{z_{i}}}{\sum_{k=1}^{N} \mathrm{e}^{z_{k}}} = y_{i} - 1 \end{equation*}

and the derivative of the loss with respect to \(z_{j}\) with \(j \neq i\) is

\begin{equation*} \frac{\partial \hbox{loss}}{\partial z_{j}} = \frac{\mathrm{e}^{z_{j}}}{\sum_{k=1}^{N} \mathrm{e}^{z_{k}}} = y_{j} \end{equation*}

so it's always \(y_{j} - \hat{y}_{j}\), where \(\hat{y}_{j}\) is the output we are supposed to obtain (1 for the correct class, 0 for the others). This simplification makes it easier to compute the gradients, and it also has the advantage of giving a higher gradient when the error is big, whereas with the MSE loss we'd end up with smaller ones, hence learning more slowly.
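We can check this formula numerically by letting pytorch compute the gradient for us. This is only a sketch: the activations and the target index are made up, and it assumes a recent pytorch (0.4 or later) where a tensor created with requires_grad=True tracks gradients directly.

```python
import torch
import torch.nn.functional as F

# Made-up last activations for a 3-class problem.
z = torch.tensor([1.0, 2.0, 0.5], requires_grad=True)
target = 1  # index of the correct class (arbitrary choice)

y = F.softmax(z, dim=0)
loss = -torch.log(y[target])  # cross entropy for a single correct class
loss.backward()               # autograd fills z.grad

# Expected output: 1 for the correct class, 0 for the others.
y_hat = torch.zeros(3)
y_hat[target] = 1.0

print(z.grad)              # gradient computed by pytorch
print(y.detach() - y_hat)  # gradient predicted by the formula
```

Both printed tensors should agree up to rounding errors.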

In practice, pytorch computes log softmax directly, which is faster and numerically more stable than computing softmax and then taking its log. Since we're using the log of the softmax in our loss function, we'll use log softmax as the output activation function. The only thing we have to remember is that we'll then receive the logs of the probabilities for our input to be in each class, which means we'll have to put them through exp if we want to see the actual probabilities.
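As a quick illustration of that last remark, here is a small sketch (with made-up activations) showing that taking the exponential of the log softmax gives back the softmax probabilities.

```python
import torch
import torch.nn.functional as F

# One made-up row of activations (a mini-batch of size 1 with 3 classes).
z = torch.tensor([[1.0, 2.0, 5.0]])

log_probs = F.log_softmax(z, dim=-1)  # what our model will output
probs = log_probs.exp()               # back to actual probabilities

print(log_probs)  # all negative numbers, since the probabilities are below 1
print(probs)      # same values as F.softmax(z, dim=-1)
```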

Writing our model

In what follows, we assume the following imports have been done:

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.autograd import Variable

The first module contains the basic functions of torch, allowing us to build and manipulate tensors, which are the arrays this library handles. The submodule nn contains all the functions we will need to build a neural net, and its submodule functional has all the functions we will need (like ReLU, softmax...). The aliases are the same as in the pytorch documentation, and the ones usually used. We'll see what optim and Variable are used for a bit later.

To write our neural net in pytorch, we create a specific kind of nn.Module, which is the generic pytorch class that handles models. To do so, we only have to create a new subclass of nn.Module:

class SimpleNeuralNet(nn.Module):

Then in this class, we have to define two functions: the initialization and the forward pass. In the first function, we create the actual layers, with their weights and biases, and in the second one, we explain how to compute the output from the input.

In the initialization, we have to remember to initialize the parent class (nn.Module) or we won't be able to use all the properties of an nn.Module. Then we just define our two layers, which can simply be done by using nn.Linear. This is another subclass of nn.Module which represents a classic linear layer. Note that once we have defined our own custom nn.Module, we can use it inside the definition of another one.

def __init__(self,n_in,n_hidden,n_out):
    #Initialize the parent class (nn.Module) first
    super().__init__()
    self.linear1 = nn.Linear(n_in,n_hidden)
    self.linear2 = nn.Linear(n_hidden,n_out)

The code is pretty straightforward: our linear layers have been automatically initialized by pytorch, with random weights and biases. For the forward pass, it's almost as easy, but there's just one little problem. Our input is going to be a mini-batch of images. Inside pytorch, it will be stored as a tensor (think array) of size mb by 1 by 28 by 28, where mb is the number we choose for our mini-batch size (64 in the notebook).

Why is that? Well, it's faster to compute all the outputs of the mini-batch at the same time. If we remember how a linear layer works, we calculate \(XW + B\) where \(X\) is the input viewed as a row vector, \(W\) the weight matrix and \(B\) the vector of biases. Instead of doing this mb times, we can be more efficient and do all the operations at once, if we replace \(X\) by a matrix, each row being one of the different inputs of the mini-batch: \(X_{1},\dots,X_{mb}\). This way, \(XW + B'\) is going to be a matrix where each row is a vector of outputs, the only trick being to replace \(B\) by a matrix \(B'\) with the same number of rows as \(X\), repeating \(B\) each time.

\begin{equation*} \left ( \begin{array}{c} X_{1} \\ X_{2} \\ \vdots \\ X_{mb} \end{array} \right ) \times W + \left ( \begin{array}{c} B \\ B \\ \vdots \\ B \end{array} \right ) = \left ( \begin{array}{c} Y_{1} \\ Y_{2} \\ \vdots \\ Y_{mb} \end{array} \right ) \end{equation*}

This process is called vectorization.
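Here is a small sketch of that idea, with random tensors standing in for a real mini-batch and a real layer (the sizes are picked to match our first layer, the mini-batch size of 4 is arbitrary). It computes the outputs one input at a time, then all at once, and the two results should match. Note that pytorch repeats \(B\) on each row for us (this is called broadcasting), so we don't have to build the repeated matrix by hand.

```python
import torch

torch.manual_seed(0)
mb, n_in, n_out = 4, 784, 100  # small mini-batch, sizes of our first layer

X = torch.randn(mb, n_in)      # each row is one flattened input
W = torch.randn(n_in, n_out)   # weight matrix
B = torch.randn(n_out)         # vector of biases

# One input at a time: mb separate vector-matrix products.
one_by_one = torch.stack([X[i] @ W + B for i in range(mb)])

# All at once: a single matrix product, B is added to every row.
all_at_once = X @ W + B

print(all_at_once.shape)  # torch.Size([4, 100])
print(torch.allclose(one_by_one, all_at_once, atol=1e-3))
```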

So that explains the first dimension in our tensor. The last two are the actual size of the picture (28 by 28 pixels) and pytorch adds a dimension because it knows our input is an image, and usually images have three channels (for red, green and blue). We have 1 here because the picture is in grayscale.

Following the logic of this vectorization process, the first linear layer is going to expect a tensor of size mb by 784 (which is the result of 28 * 28), so we have to resize our input (we usually say flatten). To do so, we use the method view:

x = x.view(x.size(0),-1)

In this line, we tell pytorch to transform x into a two-dimensional array, with a first dimension being the same as the first dimension of x (the mini-batch size), and the second being whatever it needs to be so that all the elements of x fit (that's what the -1 means).
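To see what view does to the shape, here is a tiny sketch with a tensor of zeros standing in for a mini-batch of 64 images.

```python
import torch

# Stand-in for a mini-batch: 64 images, 1 channel, 28 by 28 pixels.
x = torch.zeros(64, 1, 28, 28)

flat = x.view(x.size(0), -1)  # keep the first dimension, flatten the rest

print(x.shape)     # torch.Size([64, 1, 28, 28])
print(flat.shape)  # torch.Size([64, 784])
```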

Once we have this line, the rest of the forward pass is easy: we apply the first linear layer, a ReLU, the second linear layer, and the log softmax. Note that all the functions we need are in the F (for nn.functional) library.

def forward(self,x):
    x = x.view(x.size(0),-1)
    x = F.relu(self.linear1(x))
    return F.log_softmax(self.linear2(x), dim=-1)

Then, we just have to create an instance of our model by calling the class with the arguments it needs (here n_in, n_hidden and n_out).

net = SimpleNeuralNet(784,100,10)

The only parameter we can choose here is the number of neurons in the hidden layer. I've picked 100 but you can try something else.
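Putting the pieces of this section together, here is a sketch of the whole class that we can run on a fake mini-batch to check the shapes. The random input is only a stand-in for real MNIST images, and the parameter count is just there to show what net.parameters() contains.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleNeuralNet(nn.Module):
    def __init__(self, n_in, n_hidden, n_out):
        super().__init__()  # initialize the parent class first
        self.linear1 = nn.Linear(n_in, n_hidden)
        self.linear2 = nn.Linear(n_hidden, n_out)

    def forward(self, x):
        x = x.view(x.size(0), -1)    # flatten to mb by 784
        x = F.relu(self.linear1(x))  # hidden layer with ReLU
        return F.log_softmax(self.linear2(x), dim=-1)

net = SimpleNeuralNet(784, 100, 10)

# Fake mini-batch of 64 random "images", only to check the shapes.
fake_batch = torch.rand(64, 1, 28, 28)
log_probs = net(fake_batch)

# Weights and biases of both layers: 784*100 + 100 + 100*10 + 10.
n_params = sum(p.numel() for p in net.parameters())

print(log_probs.shape)  # torch.Size([64, 10])
print(n_params)         # 79510
```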

The training loop

Now that we have our model, we must train it to recognize digits. With a random initialization, we can expect it to have an accuracy of about 10% at the beginning (no better than guessing among ten digits). But we'll see how quickly it improves when applying SGD.

The key thing pytorch provides us with is automatic differentiation. This means we won't have to compute the gradients ourselves. There are two little things to think of, though. The first one is that pytorch must remember how an output was created from an input, to be able to roll back from this definition and calculate the gradients. This is done through the Variable object. Instead of feeding a tensor to our model, we will wrap it in a Variable.

x = Variable(inputs, requires_grad=True)

The new object x still has all the inputs, that we can find in x.data, but this new object has other attributes, one of them being the gradient. If we call the model on x to get the outputs and feed that in the loss function (with the expected label) we'll be able to get the derivatives of the loss function with respect to x. We told pytorch we would need them when we typed requires_grad=True.

outputs = net(x)
loss = F.nll_loss(outputs,Variable(labels))

Note that we don't use the Cross Entropy loss function since the outputs are already the logarithms of the softmax, and that the labels must also be wrapped inside a Variable.
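To illustrate this remark, the sketch below checks on random numbers (standing in for real activations and labels) that F.nll_loss applied to the log softmax gives the same value as F.cross_entropy applied to the raw activations. It assumes a recent pytorch (0.4 or later), where plain tensors can be passed directly.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(5, 10)               # fake last activations: 5 inputs, 10 classes
labels = torch.tensor([3, 0, 9, 1, 7])    # made-up correct classes

# What we do in this article: log softmax in the model, then nll_loss.
loss_a = F.nll_loss(F.log_softmax(logits, dim=-1), labels)

# The all-in-one version: cross_entropy takes the raw activations.
loss_b = F.cross_entropy(logits, labels)

print(loss_a.item(), loss_b.item())  # the two values should match
```

In both cases the result is the mean of \(-\ln(y_{i})\) over the mini-batch, where \(i\) is the correct class of each input.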

Once we have done this, we ask pytorch to compute the gradients of the loss like this:

loss.backward()

and the derivatives of the loss with respect to x, for instance, will be in the Variable x.grad (or x.grad.data if we want the values).

The second thing we don't want to forget is that pytorch accumulates the gradients. That means it adds the new gradients to the ones already stored each time we call this backward function. This is why we have to reinitialize them via x.grad.data.zero_() before we calculate new derivatives.
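Here is a tiny sketch of this accumulation on a toy function, \(x^{2}\), whose derivative at \(x = 2\) is 4. It assumes a recent pytorch (0.4 or later), where a tensor with requires_grad=True plays the role of the Variable.

```python
import torch

x = torch.tensor([2.0], requires_grad=True)

(x ** 2).sum().backward()
first = x.grad.clone()   # 2 * x = 4

(x ** 2).sum().backward()
second = x.grad.clone()  # 4 + 4 = 8: the new gradient was added to the old one

x.grad.zero_()           # put the gradient back to zero
(x ** 2).sum().backward()
third = x.grad.clone()   # 4 again

print(first, second, third)
```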

Then, the actual step of the SGD can be done automatically by the use of a pytorch optimizer. We can use the optim module to define one, and we will have to pass it the parameters we want to change at each step (in our case, all the weights and biases in our network) and the learning rate we want to use. Here we define

optimizer = optim.SGD(net.parameters(),lr=0.01)

Then we won't need to write the line where we subtract from each parameter the learning rate multiplied by the gradient: this will all be done by calling optimizer.step(). To reinitialize all the gradients of the parameters of our model, we'll just have to type optimizer.zero_grad().
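To make sure we understand what optimizer.step() does, here is a sketch on a toy parameter w (made up for the example, not part of our model) comparing the optimizer's update with the manual formula, parameter minus learning rate times gradient.

```python
import torch
import torch.optim as optim

# A toy "parameter" and a toy loss, only for the illustration.
w = torch.tensor([1.0, -2.0], requires_grad=True)
optimizer = optim.SGD([w], lr=0.01)

optimizer.zero_grad()
loss = (w ** 2).sum()
loss.backward()  # w.grad is now 2 * w = [2, -4]

# What the SGD step should give, computed by hand before the update.
expected = w.detach() - 0.01 * w.grad

optimizer.step()  # updates w in place

print(w.detach())  # same values as expected
print(expected)
```

This only holds for plain SGD as defined here; with options like momentum or weight decay, the update would be different.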

Once this is all done, we can write our training loop. It consists, for each epoch, in going through all the data, computing the outputs of each mini-batch of inputs, comparing them with their expected labels via the loss function, computing the gradients of the loss function with respect to all the parameters and adjusting them accordingly. We just add the computation of the accuracy to print how well we are doing at the end of each epoch.

def train(nb_epoch):
    for epoch in range(nb_epoch):
        running_loss = 0.
        corrects = 0
        print(f'Epoch {epoch+1}:')
        for data in trn_loader:
            #separate the inputs from the labels
            inputs,labels = data
            #wrap those into variables to keep track of how they are created and be able to compute their gradient.
            inputs, labels = Variable(inputs), Variable(labels)
            #Put the gradients back to zero
            optimizer.zero_grad()
            #Compute the outputs given by our model at this stage.
            outputs = net(inputs)
            _,preds = torch.max(outputs.data,1)
            #Compute the loss
            loss = F.nll_loss(outputs, labels)
            #loss.item() gives the Python number (pytorch 0.4 and later; older versions used loss.data[0])
            running_loss += loss.item() * inputs.size(0)
            corrects += torch.sum(labels.data == preds).item()
            #Backpropagate the computation of the gradients
            loss.backward()
            #Do the step of the SGD
            optimizer.step()
        print(f'Loss: {running_loss/len(trn_set)}  Accuracy: {100.*corrects/len(trn_set)}')

After training our simple neural net for 10 epochs on the training set, we get an accuracy of 96.23%. It seems like a great result but we need to see if it generalizes well or if our model just learned to recognize the particular images of the training set extremely well (we call this overfitting).

The loop to check how well our model is doing on the test set is very similar to the training loop, minus the gradients, and as shown in the notebook, we get a 96% accuracy there. Not bad for such a simple model!
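For reference, here is a sketch of what such an evaluation loop can look like. It is not the exact code of the notebook: the function name evaluate, the stand-in model and the random data are all made up so that the example runs on its own, and it assumes a recent pytorch where torch.no_grad() is available to skip the gradient bookkeeping.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader, TensorDataset

def evaluate(model, loader):
    """Return (average loss, accuracy in %) of model over loader, without gradients."""
    model.eval()
    total_loss, corrects, seen = 0.0, 0, 0
    with torch.no_grad():  # no need to track operations here
        for inputs, labels in loader:
            outputs = model(inputs)  # log probabilities
            # Sum the losses so that we can average over the whole set at the end.
            total_loss += F.nll_loss(outputs, labels, reduction='sum').item()
            preds = outputs.argmax(dim=1)
            corrects += (preds == labels).sum().item()
            seen += labels.size(0)
    return total_loss / seen, 100.0 * corrects / seen

# Stand-ins so the sketch runs alone: random "images" and an untrained model.
torch.manual_seed(0)
fake_set = TensorDataset(torch.rand(32, 1, 28, 28), torch.randint(0, 10, (32,)))
fake_loader = DataLoader(fake_set, batch_size=8)
fake_net = nn.Sequential(nn.Flatten(), nn.Linear(784, 10), nn.LogSoftmax(dim=-1))

test_loss, test_acc = evaluate(fake_net, fake_loader)
print(f'Loss: {test_loss}  Accuracy: {test_acc}')
```

With the real model and the real test loader in place of the stand-ins, this would return the test loss and accuracy discussed above.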