How to build your first image classifier using PyTorch

Neural networks are everywhere nowadays. But while it seems that
literally everyone is using a neural network today, creating and
training your own neural network for the first time can be quite
a hurdle to overcome. In this blog post I’ll take you by the hand
and show you how to train an image classifier — using PyTorch!

Why not Keras?

Before we start, you might ask why I’ve chosen to use PyTorch,
and not Keras. Of course there are pros and cons for each of the
options, but I am not going to attempt to make a good overview here.
I’m not the right person to ask for a comparison because I
have no experience with Keras, so if you are looking for an article
on the differences between these (and possibly more) options
you could have a look here,
here
or here.

Convolutional Neural Networks

The tool that we are going to use to make a classifier is called a
convolutional neural network, or CNN. You can find a great explanation
of what these are right
here on wikipedia.

But we are not going to fully train one ourselves: that would take way more time
than I would be willing to spend. Instead, we are going to do transfer learning,
where we take a pre-trained CNN and replace only the last layer by a layer
of our own. Then we only need to train that single layer, as all the other
layers already have weights that are quite sensible. Here we exploit the fact
that the images we are interested in have a lot of the same properties
as those images that the original network was trained on. You can find a
great explanation of transfer learning
here.

Defining a neural network

Before we do any transfer learning, lets have a look at how we can define
our own CNN in PyTorch. Here is a minimal example:

from torch.nn import Conv2d, functional as F, Linear, MaxPool2d, Module

class Net(Module):

    def __init__(self):
        super(Net, self).__init__()
        self.conv = Conv2d(3, 18, kernel_size=3, stride=1, padding=1)
        self.pool = MaxPool2d(kernel_size=2, stride=2, padding=0)
        self.fc1 = Linear(18 * 16 * 16, 64)
        self.fc2 = Linear(64, 10)

    def forward(self, x):
        x = F.relu(self.conv(x))
        x = self.pool(x)
        x = x.view(-1, 18 * 16 * 16)
        x = F.relu(self.fc1(x))
        x = self.fc2(x)
        return x

We define a neural network by creating a class that inherits from Module.
When we initialise the network we define the layers of the network:

a 2D convolutional layer,
a max pooling layer,
two linear layers.

In the forward method we define what happens to any input x that we feed
into the network. This argument x is a PyTorch tensor (a multi-dimensional
array), which in our case is a batch of images that each
have 3 channels (RGB) and are 32 by 32 pixels: the shape of x is then (b, 3, 32, 32)
where b is the batch size.

The first statement of our forward method applies the convolutional layer
to the input, which results in a 18-channel, 32 by 32 tensor for each input image.
Immediately after that we apply
the ReLU function.

Next, we apply the max pooling layer, which reduces the tensor to size (b, 18, 16, 16).
The view method of x reshapes the tensor to the specified shape, where the
value of -1 indicates that PyTorch is supposed to figure out this dimension: this
allows us to work with varying batch sizes. The result is a 1D vector of size 4608
for each element of our batch.

Finally, we apply the two linear (fully connected) layers with yet another
relu in between. This first reduces our shape from (b, 4608) to (b, 64) and then
to (b, 10): our output is 10 values for each image.

We can interpret these outputs as the some kind of probability for each
class to be the correct class: this model would be a classifier for 10 classes.

Using a pre-trained model

If instead of defining our own model we want to use a pre-trained model,
PyTorch provides quite a few that we can easily use. All we need to do to use
Squeezenet for example is:

from torchvision.models import squeezenet1_0

model = squeezenet1_0(pretrained=True)

We can have a look at the structure of this model by simply printing it:
print(model) gives us:

SqueezeNet(
  (features): Sequential(
    (0): Conv2d(3, 96, kernel_size=(7, 7), stride=(2, 2))
    (1): ReLU(inplace)
    (2): MaxPool2d(kernel_size=3, stride=2, padding=0, dilation=1, ceil_mode=True)
    (3): Fire(
      (squeeze): Conv2d(96, 16, kernel_size=(1, 1), stride=(1, 1))
      (squeeze_activation): ReLU(inplace)
      (expand1x1): Conv2d(16, 64, kernel_size=(1, 1), stride=(1, 1))
      (expand1x1_activation): ReLU(inplace)
      (expand3x3): Conv2d(16, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
      (expand3x3_activation): ReLU(inplace)
    )
    ...
  )
  (classifier): Sequential(
    (0): Dropout(p=0.5)
    (1): Conv2d(512, 1000, kernel_size=(1, 1), stride=(1, 1))
    (2): ReLU(inplace)
    (3): AvgPool2d(kernel_size=13, stride=1, padding=0)
  )
)

The network consists of two parts: the features and the classifier.
I’ve truncated the output of the features part in order to keep some
readability: it contains 12 layers out of which eight are Fire modules.
These modules contain six sublayers and are the defining feature of Squeezenet,
read more about them here.

For us the classifier part is much more interesting though: this is where
the network makes the final classification based on the features that were
created in the previous layers. If we want to do transfer learning, this
is the layer that we want to replace.

Note that Squeezenet was designed for and trained upon an
ImageNet
data set, which contains 1000 classes. We can replace the Conv2d layer
with our own layer with the appropriate number of classes. For example:

model.num_classes = n_classes
model.classifier[1] = nn.Conv2d(512, n_classes, kernel_size=(1, 1), stride=(1, 1))

Here we also set the num_classes attribute of the network which is internally
used to re-shape the final output of the network.

Training

Now that we have a model set up, the next step is training it. For that
we need the following train loop:

from torch.nn import CrossEntropyLoss
from torch.optim import SGD

model.train()

criterion = CrossEntropyLoss()
optimizer = SGD(model.parameters(), lr=1E-3, momentum=0.9)

for inputs, labels in loader:
    optimizer.zero_grad()
    outputs = model(inputs)

    loss = criterion(outputs, labels)

    loss.backward()
    optimizer.step()

First, we set the model to training mode — I’ll explain below what that means.
Next, we define our loss,
Cross-Entropy loss, and our optimizer:
Stochastic Gradient Descent.
Let’s not worry about the parameters of the optimizer just yet.

The we start the training loop. We loop over the contents of a loader object,
which we’ll look at in a minute. Every iteration it yields two items:
the inputs and the labels. They are PyTorch tensors of which the
first dimension is the batch size. The inputs can be
directly fed to the model, while labels has the single dimension of which
the size is equal to the batch size: it represents the class of each image.

We start each iteration by resetting the optimizer by calling zero_grad,
and then feeding the inputs through the model. Next, we use our loss function
to compute the loss on the results of the model. While we do those computations
PyTorch automatically tracks our operations and when we call backward() on
the result it calculates the derivative (gradient) of each of the steps
with respect to the inputs. This gradient is then what the optimizer can
use to optimize the weights when we call step().

We call the full training loop over all elements in the loader an epoch.

Evaluation

After training for one or more epochs you are probably interested in the
performance of your network. We can evaluate that by computing the total
loss on the evaluation set, like this:

from torch import max, no_grad

model.eval()

loss = 0
with no_grad():
    for inputs, labels in loader:
        outputs = model(inputs)
        loss += criterion(outputs, labels)
        _, predictions = max(outputs.data, dim=1)
        ...

First we need to set our model to evaluation mode (which is the same
as disabling the training mode using .train(False)). This disables features
that are handy using train time, such as
dropout, in
order to get the maximum performance out of our network. Next, we enter the
no_grad context, in which the automatic computation of gradients is disabled:
we do not need that during evaluation.

Then we have a loop similar to the one in the training case: we loop over
the inputs and the labels from the loader, pass the inputs to the model
and calculate the loss. In addition, we could inspect the predictions of the
model (and possibly use them) by using the torch.max function, which returns
a tuple of (maximum values, positions). These positions correspond to the output
node (and hence class) that has the highest probability according to our model,
which we can interpret as the index of the most probable class.

The loader

Of course data is essential to either training or evaluating a classifier.
In the previous two segments we looped through the contents of this loader
object, which we did not define before. In order to create it, we must first
define a data set.

Of course a single data set is not enough: we need both a training and a testing
data set. In addition you may want to have a validation data set as well.
Assuming that you have your images in a folder structure like this:

images/
  train/
    class_1/
    class_2/
    ...
  train/
    class_1/
    class_2/
    ...

we can define the data sets as follows:

from torchvision import transforms
from torchvision.datasets import ImageFolder

train_transform = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
])
test_transform = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor()
])

train_set = ImageFolder('images/train', transform=train_transform)
test_set = ImageFolder('images/test', transform=test_transform)

To each image set we provide a transformation which tells PyTorch what to
do with the images when reading them. We define two transformations,
one for each data set.

Let’s have a look at the test_transform first: when we read a test image, we

resize the image such that the smallest dimension of the image is 256 pixels, then we
crop a square of 224 x 224 pixels from the center of the resized image, and finally
convert the result to a tensor so that PyTorch can pass it through a model.

In the train_transform we do something different: we

randomly take a crop of a random size (between certain limits) and aspect ratio and resize that to 224×224,
randomly flip the image horizontally, and finally
convert the result to a tensor.

This means that although the model will encounter each training image once during
every epoch, the exact images it will be seeing vary from epoch to epoch: sometimes
it will be seeing most of the image and other times it will see only a small crop.
Since most objects still look roughly the same when we horizontally flip the image,
we want the model to also learn from the flipped images. Vertically flipped (upside-down)
images usually do not look like the same object anymore, so we only flip horizontally.

All this randomly transforming the training images helps to prevent our model
to overfit:
it cannot learn by heart that a small portion of an image belongs to a certain
label because every epoch it sees a different subset of the image.

Once we have defined the data sets, we can create the loaders:

from torch.utils.data import DataLoader

train_loader = DataLoader(
    dataset=train_set,
    batch_size=32,
    num_workers=4,
    shuffle=True,
)

test_loader = DataLoader(
    dataset=test_set,
    batch_size=32,
    num_workers=4,
    shuffle=True,
)

To each we provide the respective data set, and we specify that:

the batch size is 32 (feel free to try other values),
we want four processes to read and transform the images, and
we want to read the images in random order.

Now we have all ingredients to really start training our model! But…

Learning rate

Back when we defined the optimizer,

optimizer = SGD(model.parameters(), lr=1E-3, momentum=0.9)

we skipped over its parameters. And especially the first one, lr, the learning rate,
is very important. This parameter defines how much the weights will be changed
in every optimization step. In other words, it defines our step size when we are
looking for the most optimal set of weights.

Let’s have a look at a 1D example. Suppose we are looking to find the minimum value
in the curves depicted below. If our learning rate is too large then we might
actually walk away from the minimum, as we see on the left. If, on the other hand
our learning rate is too low, we will be moving very slowly and we run the risk
of getting stuck in a local optimum.

Now you might be inclined to perform a classical hyper-parameter search, by simply
trying out a lot of values for the learning rate and seeing how well the model
performs in the end. But training a single
models takes at least a few hours on a decent GPU, so training tens (or hundreds!)
of these models would become a costly affair.

A better way to figure out the optimal value of the learning rate is to do a learning
rate sweep: we train our model for a number of batches for a range of learning
rates. In the example here I’ve included a little pseudocode:

def set_learning_rate(optimizer, learning_rate):
    for param_group in optimizer.param_groups:
        param_group['lr'] = learning_rate

learning_rates = np.logspace(min_lr, max_lr, num=n_steps)
results = []
for learning_rate in learning_rates:
    set_learning_rate(optimizer, learning_rate)
    train_batches(...)
    results.append(evaluate(...))

The result should look something like this:

We see that in the beginning we learn very very slowly, but it improves
after a while. Then, when the learning rate passes some point around (10^{-2}) we
see the performance of our network going down (the loss goes up), up to the
point where the results are terrible. Your ideal setting is there where the
improvement is the fastest, i.e. where the line goes down the steepest. For the above
example that would be somewhere around (10^{-3}).

After the sweep, do not forget to reset the network to the state before you did the sweep,
as the batches with the highest learning rates will most likely have ruined
your networks’ performance.

Learning rate scheduler

Unfortunately doing a sweep once is not enough, as the best learning rate depends on
the state of our network. The closer we come to the ideal weights, the lower we should
set our learning rate. We can solve this by using a learning rate scheduler.

For example, we can use the ReduceLROnPlateau scheduler which decreases the learning
rate when the loss has been stable for a while:

from torch.optim.lr_scheduler import ReduceLROnPlateau

scheduler = ReduceLROnPlateau(optimizer, factor=0.5, patience=10)

This scheduler is configured to reduce the learning rate by a factor 2 if the
performance was stable for 10 epochs.
All we have to do next is call scheduler.step(test_loss) after every epoch,
and the scheduler will automatically adapt the learning rate to the situation.

The result will look something like the figure below: every once in a while the scheduler
will decide to reduce the learning rate when it thinks the loss is not improving
enough.

Now all that you need to start making your own image classifier is a data set!

Where next?

If you’re looking for more example code, have a look at this project
which I used to build an image classifier that can recognize skylines of a few large cities.
I gave a talk about the project on EuroPython 2019, of which you can find the
slides here.
And of course the PyTorch docs are your
friend whenever you are building something like this!

Join us for more on deep learning!

Want to get the hang of deep learning? Our three-day Deep Learning course will take you through the theory you need to know and provides you with loads of hands-on experience.