The core of many machine learning algorithms is optimization.
Optimization algorithms are used by machine learning algorithms to find a good set of model parameters given a training dataset.
The most common optimization algorithm used in machine learning is stochastic gradient descent.
In this tutorial, you will discover how to implement stochastic gradient descent to optimize a linear regression algorithm from scratch with Python.
After completing this tutorial, you will know:
- How to estimate linear regression coefficients using stochastic gradient descent.
- How to make predictions for multivariate linear regression.
- How to implement linear regression with stochastic gradient descent to make predictions on new data.

Let's get started.

How to Implement Linear Regression With Stochastic Gradient Descent From Scratch With Python
Photo by star5112, some rights reserved.
Description

In this section, we will describe linear regression, the stochastic gradient descent technique and the wine quality dataset used in this tutorial.
Multivariate Linear Regression

Linear regression is a technique for predicting a real value.
Confusingly, these problems where a real value is to be predicted are called regression problems.
Linear regression is a technique where a straight line is used to model the relationship between input and output values. In more than two dimensions, this straight line may be thought of as a plane or hyperplane.
Predictions are made by combining the input values to estimate the output value.
Each input attribute (x) is weighted using a coefficient (b), and the goal of the learning algorithm is to discover a set of coefficients that results in good predictions (y).
y = b0 + b1 * x1 + b2 * x2 + ...

Coefficients can be found using stochastic gradient descent.
Stochastic Gradient Descent

Gradient descent is the process of minimizing a function by following the gradients of the cost function.
This involves knowing the form of the cost as well as the derivative so that from a given point you know the gradient and can move in that direction, e.g. downhill towards the minimum value.
In machine learning, we can use a technique called stochastic gradient descent, which evaluates and updates the coefficients on every iteration, to minimize the error of a model on our training data.
The way this optimization algorithm works is that each training instance is shown to the model one at a time. The model makes a prediction for a training instance, the error is calculated and the model is updated in order to reduce the error for the next prediction. This process is repeated for a fixed number of iterations.
This procedure can be used to find the set of coefficients that results in the smallest error for the model on the training data. On each iteration, the coefficients (b, in machine learning parlance) are updated using the equation:
b = b - learning_rate * error * x

Where b is the coefficient or weight being optimized, learning_rate is a learning rate that you must configure (e.g. 0.01), error is the prediction error for the model on the training data attributed to the weight, and x is the input value.
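To make the update rule concrete, below is a minimal sketch of a single stochastic gradient descent update for one training row, assuming the row stores its input values first and its expected output last. The function name sgd_update is illustrative only and is not part of the tutorial's final code.

```python
# A minimal sketch of one stochastic gradient descent update for one row.
# Assumes row = [x1, x2, ..., y] and coef = [b0, b1, b2, ...].
def sgd_update(row, coef, l_rate):
	# Predict with the current coefficients (the last column is the target)
	yhat = coef[0]
	for i in range(len(row) - 1):
		yhat += coef[i + 1] * row[i]
	# Prediction error for this row
	error = yhat - row[-1]
	# Update the intercept (b0), which has no associated input value
	coef[0] = coef[0] - l_rate * error
	# Update each remaining coefficient using its input value
	for i in range(len(row) - 1):
		coef[i + 1] = coef[i + 1] - l_rate * error * row[i]
	return coef
```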
Wine Quality Dataset

After we develop our linear regression algorithm with stochastic gradient descent, we will use it to model the wine quality dataset.
This dataset comprises the details of 4,898 white wines, including measurements like acidity and pH. The goal is to use these objective measures to predict the wine quality on a scale between 0 and 10.
Below is a sample of the first 5 records from this dataset.
7,0.27,0.36,20.7,0.045,45,170,1.001,3,0.45,8.8,6
6.3,0.3,0.34,1.6,0.049,14,132,0.994,3.3,0.49,9.5,6
8.1,0.28,0.4,6.9,0.05,30,97,0.9951,3.26,0.44,10.1,6
7.2,0.23,0.32,8.5,0.058,47,186,0.9956,3.19,0.4,9.9,6
7.2,0.23,0.32,8.5,0.058,47,186,0.9956,3.19,0.4,9.9,6

The dataset must be normalized to values between 0 and 1, as each attribute has different units and in turn different scales.
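One common way to do this is min-max scaling. The sketch below assumes the dataset is held in memory as a list of rows of floats; the function names dataset_minmax and normalize_dataset are illustrative.

```python
# A minimal sketch of min-max normalization, assuming the dataset
# is a list of rows of floats.
def dataset_minmax(dataset):
	# Find the min and max value for each column
	return [[min(col), max(col)] for col in zip(*dataset)]

def normalize_dataset(dataset, minmax):
	# Rescale each column's values to the range 0-1, in place
	for row in dataset:
		for i in range(len(row)):
			row[i] = (row[i] - minmax[i][0]) / (minmax[i][1] - minmax[i][0])
```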
By predicting the mean value (Zero Rule Algorithm) on the normalized dataset, a baseline root mean squared error (RMSE) of 0.148 can be achieved.
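For reference, a baseline like this could be computed with a short sketch such as the one below, which predicts the mean of the training targets for every test row and scores the constant prediction with RMSE; the function name zero_rule_rmse is illustrative.

```python
from math import sqrt

# A minimal sketch of the Zero Rule baseline scored with RMSE.
def zero_rule_rmse(train_targets, test_targets):
	# Predict the mean of the training targets for every test row
	mean_value = sum(train_targets) / float(len(train_targets))
	# Root mean squared error of the constant prediction
	squared_errors = [(mean_value - y) ** 2 for y in test_targets]
	return sqrt(sum(squared_errors) / float(len(test_targets)))
```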
You can learn more about the dataset on the UCI Machine Learning Repository.
You can download the dataset and save it in your current working directory with the name winequality-white.csv. You must remove the header information and the empty lines from the end of the file, and convert the “;” value separator to “,” to meet CSV format.
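If you want to load the prepared file in Python, a minimal sketch using the standard csv module might look like the following; it assumes the header and trailing empty lines have already been removed and that all values are numeric.

```python
from csv import reader

# A minimal sketch for loading the prepared winequality-white.csv file.
def load_csv(filename):
	dataset = list()
	with open(filename, 'r') as file:
		for row in reader(file):
			if not row:
				continue  # skip any empty lines
			dataset.append([float(value) for value in row])
	return dataset
```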
Tutorial

This tutorial is broken down into 3 parts:
1. Making Predictions.
2. Estimating Coefficients.
3. Wine Quality Prediction.

This will provide the foundation you need to implement and apply linear regression with stochastic gradient descent on your own predictive modeling problems.
1. Making Predictions

The first step is to develop a function that can make predictions.
This will be needed both in the evaluation of candidate coefficient values in stochastic gradient descent and after the model is finalized and we wish to start making predictions on test data or new data.
Below is a function named predict() that predicts an output value for a row given a set of coefficients.
The first coefficient is always the intercept, also called the bias or b0, as it is standalone and not responsible for a specific input value.
```python
# Make a prediction with coefficients
def predict(row, coefficients):
	yhat = coefficients[0]
	for i in range(len(row)-1):
		yhat += coefficients[i + 1] * row[i]
	return yhat
```

We can contrive a small dataset to test our prediction function.
x, y
1, 1
2, 3
4, 3
3, 2
5, 5

Below is a plot of this dataset.

Small Contrived Dataset For Linear Regression
We can also use previously prepared coefficients to make predictions for this dataset.
Putting this all together we can test our predict() function below.
```python
# Make a prediction with coefficients
def predict(row, coefficients):
	yhat = coefficients[0]
	for i in range(len(row)-1):
		yhat += coefficients[i + 1] * row[i]
	return yhat

dataset = [[1, 1], [2, 3], [4, 3], [3, 2], [5, 5]]
coef = [0.4, 0.8]
for row in dataset:
	yhat = predict(row, coef)
	print("Expected=%.3f, Predicted=%.3f" % (row[-1], yhat))
```

There is a single input value (x) and two coefficient values (b0 and b1). The prediction equation we have modeled for this problem is:
y = b0 + b1 * x

or, with the specific coefficient values we chose by hand:
y = 0.4 + 0.8 * x

Running this function we get predictions that are reasonably close to the expected output (y) values.
Expected=1.000, Predicted=1.200
Expected=3.000, Predicted=2.000
Expected=3.000, Predicted=3.600
Expected=2.000, Predicted=2.800
Expected=5.000, Predicted=4.400

Now we are ready to implement stochastic gradient descent to optimize our coefficient values.
2. Estimating Coefficients

We can estimate the coefficient values for our training data using stochastic gradient descent.
Stochastic