Gradient descent with Python

Over the past few weeks we’ve been studying the fundamental building blocks of machine learning and neural network classifiers.

We started with an introduction to linear classification that discussed the concept of parameterized learning, and how this type of learning enables us to define a scoring function that maps our input data to output class labels.

This scoring function is defined in terms of parameters; specifically, our weight matrix W and our bias vector b. Our scoring function accepts these parameters as inputs and returns a predicted class label for each input data point.
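As a minimal sketch of such a scoring function, assuming the linear form f(x) = Wx + b from the linear classification discussion (the function names, array shapes, and values below are made up for illustration):

```python
import numpy as np

def score(W, b, x):
    """Return one raw score per class for a single data point x."""
    return W.dot(x) + b

def predict(W, b, x):
    """The predicted class label is the index of the highest score."""
    return int(np.argmax(score(W, b, x)))

# Hypothetical example: 3 classes, 4 input features
W = np.array([[0.2, -0.5, 0.1, 2.0],
              [1.5, 1.3, 2.1, 0.0],
              [0.0, 0.25, 0.2, -0.3]])
b = np.array([1.1, 3.2, -1.2])
x = np.array([1.0, 0.0, 2.0, 1.0])

scores = score(W, b, x)    # one score per class
label = predict(W, b, x)   # index of the winning class
```

Each row of W acts as a template for one class, so changing W and b changes which class wins for a given x.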

From there, we discussed two common loss functions: multi-class SVM loss and cross-entropy loss (commonly referred to in the same breath as “Softmax classifiers”). Loss functions, at the most basic level, are used to quantify how “good” or “bad” a given predictor (i.e., a set of parameters) is at classifying the input data points in our dataset.
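To make the two loss functions concrete, here is a minimal sketch for a single data point. It assumes we already have a vector of class scores and uses a hinge margin of 1 for the multi-class SVM loss; the function names and sample scores are illustrative choices, not taken from the original posts:

```python
import numpy as np

def multiclass_svm_loss(scores, correct_class, margin=1.0):
    """Hinge loss: penalize every wrong class whose score is not at
    least `margin` below the score of the correct class."""
    margins = np.maximum(0.0, scores - scores[correct_class] + margin)
    margins[correct_class] = 0.0  # the correct class contributes no loss
    return float(np.sum(margins))

def cross_entropy_loss(scores, correct_class):
    """Negative log of the softmax probability of the correct class."""
    shifted = scores - np.max(scores)  # shift for numerical stability
    log_probs = shifted - np.log(np.sum(np.exp(shifted)))
    return float(-log_probs[correct_class])

scores = np.array([3.5, 8.9, -1.1])
good = multiclass_svm_loss(scores, correct_class=1)  # correct class scores highest
bad = multiclass_svm_loss(scores, correct_class=2)   # correct class scores lowest
```

Both functions return a small value when the correct class has the highest score and a large value when it does not, which is exactly the “good” versus “bad” signal described above.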

Given these building blocks, we can now move on to arguably the most important aspect of machine learning, neural networks, and deep learning ― optimization.

Throughout this discussion we’ve learned that high classification accuracy is dependent on finding a set of weights W such that our data points are correctly classified. Given W, we can compute our output class labels via our scoring function. And finally, we can determine how good/poor our classifications are given some W via our loss function.

But how do we go about finding a weight matrix W that obtains high classification accuracy?

Do we randomly initialize W, evaluate, and repeat over and over again, hoping that at some point we land on a W that obtains reasonable classification accuracy?

Well, we could, and in some cases that might work just fine.

But in most situations, we instead need to define an optimization algorithm that allows us to iteratively improve our weight matrix W.

In today’s blog post, we’ll be looking at arguably the most common algorithm used to find optimal values of W: gradient descent.

Gradient descent with Python

The gradient descent algorithm comes in two flavors:

The standard “vanilla” implementation.

The optimized “stochastic” version that is more commonly used.

Today we’ll be reviewing the basic vanilla implementation to form a baseline for our understanding. Then next week I’ll be discussing the stochastic version of gradient descent.

Gradient descent is an optimization algorithm

The gradient descent method is an iterative optimization algorithm that operates over a loss landscape.

We can visualize our loss landscape as a bowl, similar to the one you may eat cereal or soup out of:

Figure 1: A plot of our loss landscape. We typically see this landscape depicted as a “bowl”. Our goal is to move towards the basin of this bowl where there is minimal loss.

The surface of our bowl is called our loss landscape, which is essentially a plot of our loss function.

The difference between our loss landscape and your cereal bowl is that your cereal bowl only exists in three dimensions, while our loss landscape exists in many dimensions, perhaps tens, hundreds, or even thousands of dimensions.

Each position along the surface of the bowl corresponds to a particular loss value given our set of parameters, W (weight matrix) and b (bias vector).

Our goal is to try different values of W and b, evaluate their loss, and then take a step towards more optimal values that will (ideally) have lower loss.

Iteratively repeating this process will allow us to navigate our loss landscape, following the gradient of the loss function (the bowl), and find a set of parameters that have minimum loss and high classification accuracy.

The “gradient” in gradient descent

To make our explanation of gradient descent a little more intuitive, let’s pretend that we have a robot ― let’s name him Chad:

Figure 2: Introducing our robot, Chad, who will help us understand the concept of gradient descent.

We place Chad on a random position in our bowl (i.e., the loss landscape):

Figure 3: Chad is placed on a random position on the loss landscape. However, Chad has only one sensor: the loss value at the exact position where he is standing. Using this sensor (and this sensor alone), how is he going to get to the bottom of the basin?

It’s now Chad’s job to navigate to the bottom of the basin (where there is minimum loss).

Seems easy enough, right? All Chad has to do is orient himself such that he’s facing “downhill” and then ride the slope until he reaches the bottom of the basin.

But we have a problem: Chad isn’t a very smart robot.

Chad only has one sensor, which allows him to take his weight matrix W and compute the value of the loss function L.

Therefore, Chad is able to compute his (relative) position on the loss landscape, but he has absolutely no idea in which direction he should take a step to move himself closer to the bottom of the basin.

What is Chad to do?

The answer is to apply gradient descent.

All we need to do is follow the slope given by the gradient of the loss with respect to W. We can compute this gradient across all dimensions using the following equation:
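Assuming the equation meant here is the standard limit definition of the derivative, which a numerical gradient approximates by using a small but finite h, it reads:

```latex
\frac{df(x)}{dx} = \lim_{h \to 0} \frac{f(x + h) - f(x)}{h}
```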

In more than one dimension, our gradient becomes a vector of partial derivatives.

The problem with this equation is that:

It’s an approximation to the gradient.

It’s very slow.

In practice, we use the analytic gradient instead. This method is exact and fast, but extremely challenging to implement due to partial derivatives and multivariable calculus. You can read more about the numeric and analytic gradients here.
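The trade-off between the two can be seen in a small sketch. It assumes a toy bowl-shaped loss L(W) = sum(W ** 2), whose analytic gradient is 2W; the loss, the step size h, and the function names are all illustrative choices:

```python
import numpy as np

def loss(W):
    """Toy bowl-shaped loss; its minimum is at W = 0."""
    return float(np.sum(W ** 2))

def numerical_gradient(f, W, h=1e-5):
    """Approximate the gradient with one finite difference per dimension.
    This needs one extra loss evaluation for every entry of W, which is
    why it becomes slow when W is large."""
    grad = np.zeros_like(W)
    base = f(W)
    for i in np.ndindex(W.shape):
        old = W[i]
        W[i] = old + h                # nudge a single dimension
        grad[i] = (f(W) - base) / h   # slope along that dimension
        W[i] = old                    # restore the original value
    return grad

def analytic_gradient(W):
    """Exact gradient of sum(W ** 2), derived with calculus."""
    return 2.0 * W

W = np.array([[1.0, -2.0], [0.5, 3.0]])
approx = numerical_gradient(loss, W)
exact = analytic_gradient(W)
max_error = float(np.max(np.abs(approx - exact)))
```

The two results agree closely, but the numerical version had to evaluate the loss once per weight, while the analytic version is a single array operation.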

For the sake of this discussion, simply try to internalize what gradient descent is doing: attempting to optimize our parameters for low loss and high classification accuracy.

Pseudocode for gradient descent

Below I have included some Python-like pseudocode of the standard, vanilla gradient descent algorithm, inspired by the CS231n slides:

while True:
    Wgradient = evaluate_gradient(loss, data, W)
    W += -alpha * Wgradient

This pseudocode is essentially what all variations of gradient descent are built off of.
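A runnable version of this loop might look like the sketch below. It assumes a simple least-squares loss on synthetic data, a fixed learning rate alpha, and a fixed number of epochs as the stopping condition. The data, the loss, and the body of evaluate_gradient are stand-ins chosen for illustration, not the post's own implementation:

```python
import numpy as np

def evaluate_gradient(X, y, W):
    """Gradient of the loss 0.5 * mean((X.dot(W) - y) ** 2) with respect to W."""
    error = X.dot(W) - y
    return X.T.dot(error) / X.shape[0]

def loss(X, y, W):
    """Half the mean squared error of the predictions X.dot(W)."""
    return float(0.5 * np.mean((X.dot(W) - y) ** 2))

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 2))    # synthetic data: 100 points, 2 features
true_W = np.array([2.0, -3.0])   # weights we hope to recover
y = X.dot(true_W)

W = rng.normal(size=2)           # random starting position on the landscape
alpha = 0.1                      # learning rate (step size)
initial_loss = loss(X, y, W)

for epoch in range(200):         # stop after a fixed number of epochs
    Wgradient = evaluate_gradient(X, y, W)
    W += -alpha * Wgradient      # step against the gradient

final_loss = loss(X, y, W)
```

Because the update always moves against the gradient, final_loss ends up well below initial_loss and W lands close to true_W. A larger alpha takes bigger steps but can overshoot the basin.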

We start off on Line 1 by looping until some condition is met. Normally this condition is either:

A specified number of epochs has passed
