
Over the past few weeks we’ve been studying the fundamental building blocks of machine learning and neural network classifiers.
We started with an introduction to linear classification that discussed the concept of parameterized learning , and how this type of learning enables us to define a scoring function that maps our input data to output class labels.
Thisscoring function is defined in terms of parameters ; specifically, our weight matrix W and our bias vector b . Our scoring function accepts these parameters as inputs and returns a predicted class label for each input data point .
Fromthere, we discussed two common loss functions:Multi-class SVM loss andcross-entropy loss (commonly referred to in the same breath as “Softmax classifiers”). Loss functions, at the most basic level, are used to quantify how “good” or “bad” a given predictor (i.e., a set of parameters) are at classifying the input data points in our dataset.
Given these building blocks, we can now move on to arguably the most important aspect of machine learning, neural networks, and deep learning ― optimization.
Throughout this discussion we’ve learned that high classification accuracy is dependent on finding a set of weights W such that our data points are correctly classified. Given W ,can compute our output class labels via our scoring function . And finally, we can determine how good/poor our classifications are given some W via our loss function .
But how do we go about finding and obtaining a weight matrix W that obtains high classification accuracy?
Do we randomly initialize W , evaluate, and repeat over and over again, hoping that at some point we land on a W that obtains reasonable classification accuracy?
Well we could ― and it some cases that might work just fine.
But in most situations, we instead need to define an optimization algorithm that allows us to iteratively improve our weight matrix W .
In today’s blog post, we’ll be looking at arguably themost common algorithm used to find optimal values of W ― gradient descent.
Looking for the source code to this post?
Jump right to the downloads section. Gradient descent with pythonThe gradient descent algorithm comes in two flavors:
The standard “vanilla” implementation. The optimized “stochastic” version that is more commonly used.Today well be reviewing the basic vanilla implementation to form a baseline for our understanding. Then next week I’ll be discussing the stochastic version of gradient descent.
Gradient descent is an optimization algorithmThe gradient descent method is an iterative optimization algorithm that operates over a loss landscape.
We can visualize our loss landscape as a bowl, similar to the one you may eat cereal or soup out of:

Figure 1:A plot of our loss landscape. We typically see this landscape depicted as a “bowl”. Our goal is to move towards the basin of this bowl where this is minimal loss.
The surface of our bowl is called our loss landscape , which is essentially a plot of our loss function.
The difference between our loss landscape and your cereal bowl is that your cereal bowl only exists in three dimensions, while your loss landscape exists in many dimensions , perhaps tens, hundreds, or even thousands of dimensions.
Each position along the surface of the bowl corresponds to a particular loss value given our set of parameters, W (weight matrix) and b (bias vector).
Our goal is to try different values of W and b , evaluate their loss, and then take a step towards more optimal values that will (ideally) have lower loss.
Iteratively repeating this process will allow us to navigate our loss landscape, following the gradient of the loss function (the bowl), and find a set of parameters that have minimum loss and high classification accuracy.
The “gradient” in gradient descentTo make our explanation of gradient descent a little more intuitive, let’s pretend that we have a robot ― let’s name him Chad:

Figure 2:Introducing our robot, Chad, who will help us understand the concept of gradient descent.
We place Chad on a random position in our bowl (i.e., the loss landscape):

Figure 3:Chad is placed on a random position on the loss landscape. However, Chad has only one sensor ― the loss value at the exact position he is standing at. Using this sensor (and this sensor alone), how is he going to get to the bottom of the basin?
It’s now Chad’s job to navigate to the bottom of the basin (where thereis minimum loss).
Seems easy enough, right? All Chad has to do is orient himself such that he’s facing “downhill” and then ride the slope until he reaches the bottom of the basin.
But we have a problem: Chad isn’t a very smart robot.
Chadonly has one sensor ― this sensor allows him to take his weight matrix W and compute a loss function L .
Therefore, Chad is able to compute his (relative) position on the loss landscape, but he has absolutely no idea in which direction he should take a step to move himself closer to the bottom of the basin.
What is Chad to do?
The answer is to apply gradient descent.
All we need to do is follow the slope of the gradient W . We can compute the gradient of W across all dimensions using the following equation:

In > 1 dimensions, our gradient becomes a vector of partial derivatives.
The problem with this equation is that:
It’s an approximation to the gradient. It’s very slow.In practice, we use the analytic gradient instead. This method is exact, fast, but extremely challenging to implement due to partial derivatives and multivariable calculus. You can read more about the numeric and analytic gradients here .
For the sake of this discussion, simply try to internalize what gradient descent is doing: attempting to optimize our parameters for low loss and high classification accuracy.
Pseudocode for gradient descentBelow I have included some Python-like pseudocode of the standard, vanilla gradient descent algorithm, inspired by the CS231n slides :
while True: Wgradient = evaluate_gradient(loss, data, W) W += -alpha * WgradientThis pseudocode is essentially what all variations of gradient descent are built off of.
We start off on Line 1 by looping until some condition is met. Normally this condition is either:
A specified number of epochs has passed