In our previous exploration, we built a rudimentary digit classifier using a basic distance metric and the power of broadcasting. While functional, this approach relied on simple pixel similarity and lacked the ability to truly learn and improve. It’s time to change that.
In this article, we’ll introduce Stochastic Gradient Descent (SGD), a fundamental optimization algorithm that forms the backbone of many machine learning models. We’ll see how SGD allows our model to move beyond mere pixel matching and develop a more nuanced understanding of handwritten digits.
Environment Setup:
To follow along, you’ll need a Conda environment with fastai and fastbook installed. I’ll be using the environment from Part 1 of this series. If you’re new here or need a refresher, head back to Part 1 for detailed setup instructions.
With your environment ready, launch your Jupyter Notebook. We’ll begin by loading the MNIST dataset using the same code as before:
from fastbook import *
from fastai.vision.all import *
path = untar_data(URLs.MNIST_SAMPLE)
threes = (path/'train'/'3').ls().sorted()
sevens = (path/'train'/'7').ls().sorted()
seven_tensors = [tensor(Image.open(o)) for o in sevens]
three_tensors = [tensor(Image.open(o)) for o in threes]
stacked_sevens = torch.stack(seven_tensors).float()/255
stacked_threes = torch.stack(three_tensors).float()/255
valid_3_tens = torch.stack([tensor(Image.open(o)) for o in (path/'valid'/'3').ls()])
valid_3_tens = valid_3_tens.float()/255
valid_7_tens = torch.stack([tensor(Image.open(o)) for o in (path/'valid'/'7').ls()])
valid_7_tens = valid_7_tens.float()/255
Let’s start with the training set.
train_x = torch.cat([stacked_threes, stacked_sevens]).view(-1, 28*28)
Let’s break down what’s happening here:
We have stacked_threes and stacked_sevens, which are rank-3 tensors holding our training images for 3s and 7s, respectively. We use torch.cat to combine (concatenate) these tensors into a single tensor, train_x. Think of it like merging two stacks of images into one.
Reshaping with .view(): Next, we use .view(-1, 28*28) to reshape this combined tensor. What we’re essentially doing is transforming each 28×28 image into a single row of 784 pixels (28*28 = 784).
The -1 is a clever trick. It tells PyTorch, “You figure out the size of this dimension based on the other one and the total number of elements.” In this case, if we had 6000 ‘3’ images and 6000 ‘7’ images, the first dimension would become 12000, to accommodate all of our newly flattened images. So, our original tensors with shapes like [6000+6000, 28, 28] are now transformed into [12000, 784]. We’ve essentially converted our image data from a rank-3 tensor (a stack of 2D images) into a rank-2 tensor (a 2D matrix or table, where each row is a flattened image).
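To make the reshaping concrete, here is a minimal sketch using a tiny made-up tensor in place of the real images. The name fake_images and the count of 5 are assumptions for illustration only.

```python
import torch

# Hypothetical stand-in for the real data: 5 blank "images" of 28x28 pixels.
fake_images = torch.zeros(5, 28, 28)

# -1 asks PyTorch to infer the first dimension: (5*28*28) / 784 = 5.
flat = fake_images.view(-1, 28*28)

print(fake_images.shape)  # torch.Size([5, 28, 28])
print(flat.shape)         # torch.Size([5, 784])
```

Each row of flat is one image laid out as 784 consecutive pixel values.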
We also need labels for our training data — a way to tell our model which images are 3s and which are 7s. That’s where train_y comes in:
train_y = tensor([1]*len(threes) + [0]*len(sevens)).unsqueeze(1)
print(train_x.shape, train_y.shape)
We create a tensor train_y where each ‘3’ image is assigned a label of 1 and each ‘7’ image is assigned a label of 0. We do this by making a list of 1s that is as long as the number of ‘3’ images we have, and a list of 0s that is as long as the number of ‘7’ images, joining them together and turning them into a tensor.
Adding a Dimension with .unsqueeze(): We then add a dimension to our labels using .unsqueeze(1). This converts our labels from a 1D tensor into a 2D tensor, where each label is a separate row in a single column. Without this, the labels would have shape [12396] while the predictions we compute later have shape [12396, 1], and the two would not line up row by row when we compare them. After unsqueezing, the shape of our label tensor is [12396, 1].
Matching Order: The order here is crucial! Notice that we created train_y by concatenating a list of 1s (for 3s) followed by 0s (for 7s). This perfectly mirrors how we combined our image data in train_x, where stacked_threes came before stacked_sevens. This ensures that each image in train_x is associated with the correct label in train_y.
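Here is a small sketch of the label construction and the effect of .unsqueeze(1). The counts of 3 and 2 are made up so the result is easy to read, and torch.tensor is used in place of fastai's tensor helper.

```python
import torch

n_threes, n_sevens = 3, 2  # hypothetical counts, far smaller than the real dataset

# 1s for the 3s first, then 0s for the 7s, matching the order of the images.
labels = torch.tensor([1]*n_threes + [0]*n_sevens)
print(labels.shape)  # torch.Size([5])

# Turn the flat list of labels into a single column.
column = labels.unsqueeze(1)
print(column.shape)  # torch.Size([5, 1])
```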
Putting it Together: Creating a Dataset
To make it easier to work with our data, we can combine train_x and train_y into a single dataset:
dset = list(zip(train_x, train_y))
x, y = dset[0]
print(x.shape, y)
We use zip to pair up each image (from train_x) with its corresponding label (from train_y).
Then we can access an element of dset, for example the first one, and see that it’s a pair of an image and its label. The shape of x will be [784] (our flattened image). Since the first image is a 3, y is tensor([1]); for a 7 it would be tensor([0]).
Finally, we repeat the same steps for our validation data:
valid_x = torch.cat([valid_3_tens, valid_7_tens]).view(-1, 28*28)
valid_y = tensor([1]*len(valid_3_tens) + [0]*len(valid_7_tens)).unsqueeze(1)
valid_dset = list(zip(valid_x, valid_y))
Alright, we’ve got our training and validation datasets ready. Now, here’s where the “learning” part starts. We need to introduce something called weights. Think of weights as knobs or dials that our model can adjust to get better at recognizing digits.
Initially, we’ll set these weights to random values. We’ll create one weight for each pixel in our images. Here’s how we do it in code:
def init_params(size, std=1.0):
    return (torch.randn(size)*std).requires_grad_()
weights = init_params((28*28,1))
torch.randn(size) generates random numbers drawn from a standard normal distribution (a bell curve centered on 0, so most values are close to 0). We then multiply them by std to scale their spread.
requires_grad_() is important! It tells PyTorch to record the operations performed on these weights so that it can later compute gradients with respect to them. This is PyTorch’s autograd system at work.
Our weights tensor now holds 784 random values, one for each pixel. These weights, together with another concept called the bias which we’ll see shortly, are called the parameters of our model.
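As a quick illustration of what requires_grad_() buys us, the sketch below creates three weights and asks autograd for a gradient. The toy "loss" (the sum of squared weights) is an assumption chosen only because its gradient, 2*w, is easy to check by hand.

```python
import torch

torch.manual_seed(0)  # only so the sketch is repeatable

def init_params(size, std=1.0):
    return (torch.randn(size)*std).requires_grad_()

w = init_params((3, 1))
print(w.shape, w.requires_grad)  # torch.Size([3, 1]) True

# Toy loss for illustration: sum of squares, whose gradient is 2*w.
toy_loss = (w**2).sum()
toy_loss.backward()  # autograd fills in w.grad

print(torch.allclose(w.grad, 2*w.detach()))  # True
```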
Adding Bias: Shifting the Line
Our weights are important, but they’re not enough on their own. We also need a bias. Think of the bias as a way to shift our prediction up or down. Without it, an image whose pixels are all zero could only ever produce a prediction of exactly 0; the bias lets the model output something else.
We’ll initialize the bias to a random number, just like we did with the weights:
bias = init_params(1)
Now we have a single random value in our bias tensor.
Okay, we have weights and a bias. How do we use them to make a prediction? For each image, we’ll do the following:
- Multiply each pixel value by its corresponding weight.
- Add up all those weighted pixel values.
- Add the bias to that sum.
This might sound familiar. It’s basically the equation of a line: y = wx + b, where:
- y is our prediction (how much does the model think this is a ‘3’)
- w are the weights
- x are the pixel values
- b is the bias
Let’s see how this works for the first image in our training set:
(train_x[0] * weights.T).sum() + bias
Why the .T?
Good question! The .T is for transpose. train_x[0] has a shape of [784] (a single flattened image), and weights has a shape of [784, 1] (784 rows, 1 column). Multiplying them directly would broadcast into a [784, 784] grid, which is not what we want. The transpose flips the rows and columns of weights, so weights.T is [1, 784], and the element-wise product pairs each pixel with its own weight.
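The shapes are easier to see on a tiny example. The sketch below uses a made-up 4-pixel "image" and hand-picked weights (all values are assumptions for illustration) to show what happens with and without the transpose.

```python
import torch

x = torch.tensor([0.0, 0.5, 1.0, 0.25])         # hypothetical 4-pixel image, shape [4]
w = torch.tensor([[0.1], [0.2], [0.3], [0.4]])  # one weight per pixel, shape [4, 1]
b = torch.tensor([0.5])                         # bias

print((x * w).shape)    # torch.Size([4, 4]): broadcasting builds a full grid
print((x * w.T).shape)  # torch.Size([1, 4]): each pixel meets its own weight

# 0*0.1 + 0.5*0.2 + 1.0*0.3 + 0.25*0.4 + 0.5, which is approximately 1.0
pred = (x * w.T).sum() + b
print(pred)
```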
Matrix Multiplication: A Faster Way to Predict
Calculating predictions one by one using a loop would be really slow, especially when dealing with thousands of images. Thankfully, there’s a much faster way: matrix multiplication.
Matrix multiplication lets us calculate all the predictions at once. It’s a fundamental operation in linear algebra and is highly optimized in libraries like PyTorch.
In Python, we use the @ operator for matrix multiplication:
def linear1(xb):
    return xb@weights + bias
preds = linear1(train_x)
preds
Remember that train_x holds our entire training dataset — thousands of images, each flattened into a row of 784 pixels (28×28). So, train_x is a big matrix with thousands of rows and 784 columns.
Now, weights is our tensor of weights, one for each pixel (a column vector of shape 784×1).
When we do xb@weights (which is equivalent to train_x @ weights in this case), we’re performing matrix multiplication. This single operation computes the multiply-and-sum described earlier for every image in our training set simultaneously. Then + bias adds the bias to each image’s result; the bias is a single number, but it is broadcast to every row (every image).
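To check that the matrix product really matches the one-image-at-a-time calculation, here is a small sketch on random made-up data. The batch of 6 "images" with 4 pixels each is an assumption to keep things readable.

```python
import torch

torch.manual_seed(0)
xb = torch.rand(6, 4)   # hypothetical batch: 6 images, 4 pixels each
w = torch.randn(4, 1)   # one weight per pixel
b = torch.randn(1)      # bias

# All predictions at once.
batched = xb @ w + b    # shape [6, 1]

# The same calculation, one image at a time.
looped = torch.stack([(row * w.T).sum() + b for row in xb])

print(batched.shape)                   # torch.Size([6, 1])
print(torch.allclose(batched, looped))  # True
```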
Measuring Success: How Good Are Our Predictions?
Now that we have predictions, we need to measure how good they are. Initially, with random weights and bias, our model will probably be quite bad at recognizing digits.
A simple way to check is to see if a prediction is greater than 0. If it is, we’ll guess it’s a “3”; otherwise, we’ll guess it’s a “7”. We can compare these guesses to the actual labels (the train_y values) to get our accuracy.
corrects = (preds>0.0).float() == train_y
corrects
corrects.float().mean().item()
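Here is the same accuracy calculation on four hand-written predictions, so you can follow it by eye. The values in toy_preds and toy_targets are invented for this sketch.

```python
import torch

toy_preds = torch.tensor([[2.3], [-0.7], [0.1], [-1.5]])  # hypothetical raw predictions
toy_targets = torch.tensor([[1], [1], [0], [0]])          # 1 means "3", 0 means "7"

# Guess "3" when the prediction is above 0, then compare with the labels.
toy_corrects = (toy_preds > 0.0).float() == toy_targets
print(toy_corrects.squeeze(1).tolist())  # [True, False, False, True]

toy_accuracy = toy_corrects.float().mean().item()
print(toy_accuracy)  # 0.5
```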
We want to improve our model by adjusting the weights and the bias. To do that, we need a loss function. A loss function tells us how far off our predictions are from the actual values.
You might think, “Why not just use accuracy as our loss function?” It seems logical, but it turns out that accuracy has a big problem: it doesn’t change smoothly.
Imagine you’re trying to climb a hill, but the hill is made of giant steps instead of a smooth slope. You’d have a hard time figuring out which direction to go to reach the top.
Similarly, accuracy only changes when a prediction flips from “3” to “7” or vice versa. Small changes in weights usually won’t change the accuracy at all. This means the gradient (which tells us the direction to adjust the weights) will be zero most of the time. And if the gradient is zero, our model can’t learn!
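A tiny experiment shows this flatness. In the sketch below (toy data and weights invented for illustration, with no bias for brevity), nudging every weight by 0.001 leaves the accuracy exactly where it was, so it gives no hint about which direction is better.

```python
import torch

# Hypothetical toy data: 3 "images" with 2 pixels each.
x = torch.tensor([[1.0, 2.0], [3.0, -1.0], [-2.0, 0.5]])
y = torch.tensor([[0.0], [0.0], [1.0]])

def accuracy(w):
    preds = x @ w
    return ((preds > 0).float() == y).float().mean().item()

w = torch.tensor([[0.5], [-0.3]])
w_nudged = w + 0.001  # a small change to every weight

print(accuracy(w), accuracy(w_nudged))  # the two values are identical
```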
A Better Loss Function: Measuring the “Distance”
We need a loss function that changes smoothly, even with small changes in weights. We want a function that gives us a lower value when our predictions are getting better.
Here’s the key idea: instead of just checking if a prediction is right or wrong, we’ll measure how close it is to the correct answer.
- If the image is a “3” (target is 1), we want our prediction to be close to 1.
- If the image is a “7” (target is 0), we want our prediction to be close to 0.
Here’s a loss function that does this, which we will call mnist_loss:
def mnist_loss(predictions, targets):
    return torch.where(targets==1, 1-predictions, predictions).mean()
torch.where(a, b, c) is like a vectorized if/else in PyTorch. For each element, it checks the condition a and returns the value from b where it is true and the value from c otherwise. For example, if the target is 1 (it’s a 3), 1 - predictions will be small if predictions is close to 1. If the target is 0 (it’s a 7), predictions will be small if predictions is close to 0.
This mnist_loss function gives us a lower value when our predictions are closer to the targets, even if they’re not exactly right yet.
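A worked example helps here. The three predictions and targets below are made up; the sketch shows the per-item "distance" and how the loss drops when one wrong prediction improves.

```python
import torch

def mnist_loss(predictions, targets):
    return torch.where(targets==1, 1-predictions, predictions).mean()

trgts = torch.tensor([1, 0, 1])        # hypothetical targets
prds = torch.tensor([0.9, 0.4, 0.2])   # hypothetical predictions

# Per-item distance from the target: roughly [0.1, 0.4, 0.8]
print(torch.where(trgts==1, 1-prds, prds))

loss_before = mnist_loss(prds, trgts)  # about 0.433

# Improve the last prediction from 0.2 to 0.8 and the loss falls.
loss_after = mnist_loss(torch.tensor([0.9, 0.4, 0.8]), trgts)  # about 0.233
print(loss_before.item(), loss_after.item())
```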
Ensuring Predictions are Between 0 and 1
Our mnist_loss function assumes that predictions are always between 0 and 1. We’ll see how to guarantee this in the next step using an activation function called Sigmoid.
Introducing Sigmoid: Smoothing Out Our Predictions
Remember our mnist_loss function? It works best when our predictions are between 0 and 1. But what if our weights * pixels + bias calculation spits out a number like -5 or 234? That’s where the sigmoid function comes in.
Think of sigmoid as a “smoosher.” It takes any number, no matter how large or small, positive or negative, and gently squeezes it into the range between 0 and 1. Here’s the formula:
def sigmoid(x):
    return 1/(1+torch.exp(-x))
Luckily, PyTorch has a built-in, super-fast version of sigmoid (torch.sigmoid), so we don’t need to use this one.
If you plot the function, you’ll notice a few key things about the sigmoid curve:
- It always outputs a value between 0 and 1.
- It’s a smooth, S-shaped curve.
- As the input gets larger (more positive), the output gets closer to 1.
- As the input gets smaller (more negative), the output gets closer to 0.
This smoothness is crucial because it makes it easier for our optimization algorithm (SGD) to find good gradients and update our weights effectively.
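To put numbers on the curve described above, the following sketch evaluates sigmoid at a few inputs, including the -5 and 234 mentioned earlier, and checks the hand-written formula against torch.sigmoid.

```python
import torch

def sigmoid(x):
    return 1/(1+torch.exp(-x))

xs = torch.tensor([-5.0, 0.0, 234.0])

builtin = torch.sigmoid(xs)
manual = sigmoid(xs)

# Roughly 0.0067, 0.5 and 1.0: every output lands between 0 and 1.
print(builtin)
print(torch.allclose(builtin, manual))  # True
```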
Let’s update our mnist_loss function to use sigmoid:
def mnist_loss(predictions, targets):
    predictions = predictions.sigmoid()
    return torch.where(targets==1, 1-predictions, predictions).mean()
Now, we first pass our raw predictions through sigmoid() to squish them between 0 and 1. Then, the rest of the mnist_loss calculation proceeds as before. This ensures that our loss function always behaves nicely, regardless of the raw prediction values.
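As a quick check, here is the sigmoid version of the loss applied to raw predictions that fall well outside the 0 to 1 range. The three raw values and targets are invented for this sketch.

```python
import torch

def mnist_loss(predictions, targets):
    predictions = predictions.sigmoid()
    return torch.where(targets==1, 1-predictions, predictions).mean()

raw_preds = torch.tensor([[4.0], [-3.0], [0.0]])  # hypothetical raw model outputs
targets = torch.tensor([[1], [0], [1]])

# sigmoid maps the raw values to about 0.982, 0.047 and 0.5,
# so the per-item distances are about 0.018, 0.047 and 0.5.
loss = mnist_loss(raw_preds, targets)
print(loss.item())  # about 0.188
```

The first two predictions are confidently right and contribute very little; the undecided third one dominates the loss.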
Why Loss vs. Metrics?
- Metrics are for us, humans. They tell us how well our model is doing in a way that we understand (e.g., “This model is 95% accurate”).
- Loss functions are for the optimization algorithm (SGD). They guide the learning process by providing a smooth, differentiable measure of how far off our predictions are.
The Importance of Smoothness
For automated learning to work well, the loss function needs to be reasonably smooth. This means it should respond to small changes in the weights with small changes in the loss value. A smooth loss function gives us meaningful gradients, which are like signposts telling us which direction to adjust the weights to improve the model.
Our mnist_loss (with sigmoid) is a good example of a smooth loss function. It might not perfectly match what we ultimately care about (accuracy), but it’s a good compromise — a function that’s both optimizable and reasonably aligned with our goal.
Stochastic Gradient Descent (SGD): Learning in Small Steps
Now that we have a good loss function, let’s talk about how we use it to train our model. This is where Stochastic Gradient Descent (SGD) comes in.
SGD is an optimization algorithm that helps us find the best values for our weights and bias — the values that minimize our loss function.
Here’s the basic idea:
- Calculate the loss: We feed our model some data (a mini-batch, which we’ll explain shortly) and calculate the loss using our mnist_loss function.
- Calculate the gradients: We use PyTorch’s autograd feature to automatically calculate the gradients of the loss with respect to each weight and the bias. The gradient tells us how much the loss would change if we made a small change to each weight or the bias.
- Update the weights and bias: We nudge the weights and bias in the opposite direction of the gradient. This is like rolling a ball downhill — we’re moving towards a lower point in the loss landscape.
- Repeat: We repeat the three steps above many times, gradually improving our model’s predictions and reducing the loss.
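The steps above can be sketched in a few lines of PyTorch. This is only a rough illustration on made-up data, not the training loop for our MNIST model: the toy dataset, the learning rate lr, and the number of steps are all assumptions chosen for the example.

```python
import torch

torch.manual_seed(0)

# Hypothetical toy data: 8 "images" of 4 pixels; label is 1 when the first pixel is bright.
x = torch.rand(8, 4)
y = (x[:, :1] > 0.5).float()

w = torch.randn(4, 1).requires_grad_()
b = torch.zeros(1).requires_grad_()
lr = 1.0  # learning rate: an assumed value, not taken from the article

def mnist_loss(predictions, targets):
    predictions = predictions.sigmoid()
    return torch.where(targets==1, 1-predictions, predictions).mean()

losses = []
for step in range(50):                 # repeat
    loss = mnist_loss(x @ w + b, y)    # calculate the loss
    loss.backward()                    # calculate the gradients
    with torch.no_grad():              # nudge parameters against the gradient
        w -= lr * w.grad
        b -= lr * b.grad
        w.grad.zero_()                 # reset gradients for the next step
        b.grad.zero_()
    losses.append(loss.item())

print(losses[0], losses[-1])  # the loss after training is lower than at the start
```

For simplicity this sketch uses all 8 items at every step; choosing how much data to use per step is a separate decision.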
Mini-Batches: Finding the Right Balance
In SGD, we have a choice: how much data do we use to calculate the loss and gradients in each step?
- Full-batch gradient descent: We could use the entire dataset for each update. This would give us the most accurate estimate of the gradients, but it would be very slow and computationally expensive.
- Stochastic gradient descent (with a batch size of 1): We could use only one data point for each update. This would be very fast, but the gradients would be very noisy and unreliable because they’re based on so little information.
The sweet spot is usually somewhere in between: using a mini-batch of data for each update. A mini-batch is a small, randomly selected subset of our dataset (e.g., 32, 128 or 256 images).
Why Mini-Batches?
- Mini-batches give us a more accurate estimate of the gradients than using a single data point, leading to more stable learning.
- They are much faster to process than the entire dataset, especially on GPUs, which are designed to work on batches of data in parallel. Mini-batches let us take advantage of this parallelism.
- Using different mini-batches in each step introduces some randomness into the training process, which can actually help our model generalize better to new, unseen data.
The optimal batch size depends on factors like the dataset size, the model complexity, and the hardware we’re using.
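To get a feel for how batch size affects gradient noise, here is a rough simulation. It does not use real gradients: per_example_grads is a made-up list of noisy numbers standing in for the gradient each training item would contribute to a single weight, and batch_estimates is a hypothetical helper.

```python
import torch

torch.manual_seed(0)

# Hypothetical per-example gradients for one weight: noisy values around 2.0.
per_example_grads = torch.randn(1000) + 2.0
full_batch = per_example_grads.mean()  # what full-batch gradient descent would see

def batch_estimates(batch_size, n_draws=200):
    # Each draw averages the gradient over a randomly chosen mini-batch.
    ests = []
    for _ in range(n_draws):
        idx = torch.randperm(len(per_example_grads))[:batch_size]
        ests.append(per_example_grads[idx].mean())
    return torch.stack(ests)

spread_1 = batch_estimates(1).std().item()    # batch size 1: very noisy
spread_64 = batch_estimates(64).std().item()  # mini-batch of 64: much steadier

print(full_batch.item(), spread_1, spread_64)
```

The estimates from batches of 64 cluster far more tightly around the full-batch value than the single-item estimates do, while each batch still touches only a small fraction of the data.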
DataLoader: Making Mini-Batches Easy
Manually creating mini-batches and shuffling our data for each epoch would be a pain. Thankfully, fastai provides a handy tool called DataLoader that does all this for us.
A DataLoader takes any Python collection (including a Dataset, which we’ll meet in a moment) and automatically creates an iterator that serves up mini-batches. It can also shuffle the data for us on each epoch, which is important for good generalization.
Here’s a simple example:
coll = range(15)
dl = DataLoader(coll, batch_size=5, shuffle=True)
list(dl)
This creates a DataLoader from a simple collection of numbers and returns shuffled mini-batches of size 5.
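Conceptually, the shuffling and batching boil down to something like the plain-Python sketch below. This is not how fastai implements DataLoader; simple_batches is a hypothetical helper written only to show the idea.

```python
import random

def simple_batches(collection, batch_size, shuffle=True, seed=None):
    """Hypothetical helper: yield lists of up to batch_size items."""
    items = list(collection)
    if shuffle:
        random.Random(seed).shuffle(items)  # shuffle a copy, not the original
    for i in range(0, len(items), batch_size):
        yield items[i:i+batch_size]

batches = list(simple_batches(range(15), batch_size=5, seed=0))
print(len(batches))  # 3
print(batches)       # three lists of 5 numbers each, in shuffled order
```

The real DataLoader additionally collates each batch into tensors and can load data in parallel.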
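For training our model, we need a Dataset: a collection of pairs of independent and dependent variables (images and labels), just like the dset we built earlier. Passing a Dataset to a DataLoader gives us mini-batches of such pairs. Here’s an extremely simple Dataset that pairs each index with a letter of the alphabet: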
ds = L(enumerate(string.ascii_lowercase)) # An indexed alphabet dataset
dl = DataLoader(ds, batch_size=6, shuffle=True)
list(dl)
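To see how pairs are grouped into a batch, here is a sketch using PyTorch’s own torch.utils.data.DataLoader as a stand-in for the fastai one used above; the output format of the two may differ in detail. Each batch comes back as a pair: all the indices together, and all the letters together.

```python
import string
import torch
from torch.utils.data import DataLoader  # PyTorch's DataLoader, a stand-in for fastai's

ds = list(enumerate(string.ascii_lowercase))  # pairs like (0, 'a'), (1, 'b'), ...
dl = DataLoader(ds, batch_size=6, shuffle=True)

xb, yb = next(iter(dl))  # first mini-batch: 6 indices and their 6 letters
print(xb)  # a tensor of 6 shuffled indices
print(yb)  # the 6 matching letters

# Shuffling reorders the pairs but never separates an index from its letter.
pairs_match = all(string.ascii_lowercase[int(i)] == c for i, c in zip(xb, yb))
print(pairs_match)  # True
```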
The stage is set. We’ve mastered the basics, and now it’s time for the main event: implementing SGD. Join me in Part 3 as we write our first training loop and put all this newfound knowledge to the test. We’ll finally see our MNIST classifier learn and improve before our eyes!