How To Implement Simple Linear Regression From Scratch With Python

Linear regression is a prediction method that is more than 200 years old.

Simple linear regression is a great first machine learning algorithm to implement as it requires you to estimate properties from your training dataset, but is simple enough for beginners to understand.

In this tutorial, you will discover how to implement the simple linear regression algorithm from scratch in Python.

After completing this tutorial you will know:

How to estimate statistical quantities from training data.
How to estimate linear regression coefficients from data.
How to make predictions using linear regression for new data.

Let’s get started.

Photo by Kamyar Adl, some rights reserved.

Description

This section is divided into two parts, a description of the simple linear regression technique and a description of the dataset to which we will later apply it.

Simple Linear Regression

Linear regression assumes a linear or straight line relationship between the input variables (X) and the single output variable (y).

More specifically, that output (y) can be calculated from a linear combination of the input variables (X). When there is a single input variable, the method is referred to as a simple linear regression.

In simple linear regression we can use statistics on the training data to estimate the coefficients required by the model to make predictions on new data.

The line for a simple linear regression model can be written as:

y = b0 + b1 * x

where b0 and b1 are the coefficients we must estimate from the training data.

Once the coefficients are known, we can use this equation to estimate output values for y given new input examples of x.

Estimating the coefficients requires that you calculate statistical properties from the data such as mean, variance and covariance.

All the algebra has been taken care of and we are left with some arithmetic to implement to estimate the simple linear regression coefficients.

Briefly, we can estimate the coefficients as follows:

B1 = sum((x(i) - mean(x)) * (y(i) - mean(y))) / sum( (x(i) - mean(x))^2 )
B0 = mean(y) - B1 * mean(x)

where x(i) and y(i) are the ith values of the input x and the output y.

Don’t worry if this is not clear right now; these are the functions we will implement in the tutorial.
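To make the two formulas concrete, here is a minimal sketch that applies them directly to a tiny made-up dataset. The function name estimate_coefficients() and the data are assumptions chosen for illustration, and the code is not necessarily in the form the tutorial builds up step by step.

```python
# Sketch: apply the B1 and B0 formulas directly (illustrative names and data)
def estimate_coefficients(x, y):
    mean_x = sum(x) / float(len(x))
    mean_y = sum(y) / float(len(y))
    # numerator: sum of products of the deviations from the two means
    numerator = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    # denominator: sum of squared deviations of x from its mean
    denominator = sum((xi - mean_x) ** 2 for xi in x)
    b1 = numerator / denominator
    b0 = mean_y - b1 * mean_x
    return b0, b1

# made-up data that lies exactly on the line y = 1 + 2 * x
x = [1, 2, 3, 4]
y = [3, 5, 7, 9]
b0, b1 = estimate_coefficients(x, y)
print('B0=%.3f, B1=%.3f' % (b0, b1))

# predict y for a new input using y = b0 + b1 * x
prediction = b0 + b1 * 5
print('prediction for x=5: %.3f' % prediction)
```

Because the made-up points sit exactly on a line, the estimated coefficients should recover that line's intercept and slope. Real data will not fit this neatly.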

Swedish Insurance Dataset

We will use a real dataset to demonstrate simple linear regression.

The dataset is called the “Auto Insurance in Sweden” dataset and involves predicting the total payment for all the claims in thousands of Swedish Kronor (y) given the total number of claims (x).

This means that for a new number of claims (x) we will be able to predict the total payment of claims (y).

Here is a small sample of the first 5 records of the dataset.

108,392.5
19,46.2
13,15.7
124,422.2
40,119.4

Using the Zero Rule algorithm (that predicts the mean value) a Root Mean Squared Error or RMSE of about 72.251 (thousands of Kronor) is expected.
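As a rough illustration of what the Zero Rule baseline does, the sketch below predicts the mean payment for every record and scores it with RMSE. It uses only the five sample records shown above and evaluates on the same rows it was fit on, so it will not reproduce the figure quoted for the full dataset. The function names zero_rule_predict() and rmse() are assumptions for this example.

```python
from math import sqrt

# Zero Rule baseline: always predict the mean of the training outputs
def zero_rule_predict(train_y, n_predictions):
    mean_y = sum(train_y) / float(len(train_y))
    return [mean_y for _ in range(n_predictions)]

# Root mean squared error between actual and predicted values
def rmse(actual, predicted):
    squared_errors = [(p - a) ** 2 for a, p in zip(actual, predicted)]
    return sqrt(sum(squared_errors) / float(len(actual)))

# total payments (y) from the five sample records only
payments = [392.5, 46.2, 15.7, 422.2, 119.4]
predictions = zero_rule_predict(payments, len(payments))
baseline_rmse = rmse(payments, predictions)
print('Zero Rule prediction: %.3f' % predictions[0])
print('Zero Rule RMSE on the sample: %.3f' % baseline_rmse)
```

Any regression model worth keeping should achieve a lower RMSE than this baseline when both are evaluated the same way.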

Below is a scatter plot of the entire dataset.

Swedish Insurance Dataset

You can download the raw dataset from here or here.

Save it to a CSV file in your local working directory with the name “insurance.csv”.

Note, you may need to convert the European decimal comma “,” to the decimal point “.”. You will also need to change the file from whitespace-separated values to CSV format.
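If you would rather do the conversion in code than by hand, something along these lines may work. It is only a sketch: it assumes each raw record is one line of whitespace-separated fields with decimal commas, which may not match the file you downloaded, and the helper names convert_record() and to_csv_text() are hypothetical.

```python
import csv
import io

# Convert one whitespace-separated record with decimal commas
# into a list of fields that use decimal points
def convert_record(line):
    return [field.replace(',', '.') for field in line.split()]

# Convert raw text (one record per line) into CSV text
def to_csv_text(raw_text):
    output = io.StringIO()
    writer = csv.writer(output, lineterminator='\n')
    for line in raw_text.splitlines():
        if line.strip():  # skip blank lines
            writer.writerow(convert_record(line))
    return output.getvalue()

# hypothetical raw content; the real file layout may differ
raw = "108\t392,5\n19\t46,2\n13\t15,7\n"
csv_text = to_csv_text(raw)
print(csv_text)
```

The resulting text can then be written to insurance.csv with the built-in open() function. Check the first few lines of the output before using it, and remove any header or description lines in the raw file first.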

Tutorial

This tutorial is broken down into five parts:

1. Calculate Mean and Variance.
2. Calculate Covariance.
3. Estimate Coefficients.
4. Make Predictions.
5. Predict Insurance.

These steps will give you the foundation you need to implement and train simple linear regression models for your own prediction problems.

1. Calculate Mean and Variance

The first step is to estimate the mean and the variance of both the input and output variables from the training data.

The mean of a list of numbers can be calculated as:

mean(x) = sum(x) / count(x)

Below is a function named mean() that implements this behavior for a list of numbers.

# Calculate the mean value of a list of numbers
def mean(values):
    return sum(values) / float(len(values))

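The variance is the sum of the squared differences of each value from the mean value. Strictly speaking, this sum is the variance multiplied by the number of values; we leave out the division because it cancels out when the coefficients are calculated.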

Variance for a list of numbers can be calculated as:

variance = sum( (x - mean(x))^2 )

Below is a function named variance() that calculates the variance of a list of numbers. It requires the mean of the list to be provided as an argument, just so we don’t have to calculate it more than once.

# Calculate the variance of a list of numbers
def variance(values, mean):
    return sum([(x-mean)**2 for x in values])

We can put these two functions together and test them on a small contrived dataset.

Below is a small dataset of x and y values.

x, y
1, 1
2, 3
4, 3
3, 2
5, 5

We can plot this dataset on a scatter plot graph as follows:

Small Contrived Dataset For Simple Linear Regression

We can calculate the mean and variance for both the x and y values in the example below.

# Estimate Mean and Variance

# Calculate the mean value of a list of numbers
def mean(values):
    return sum(values) / float(len(values))

# Calculate the variance of a list of numbers
def variance(values, mean):
    return sum([(x-mean)**2 for x in values])

# calculate mean and variance
dataset = [[1, 1], [2, 3], [4, 3], [3, 2], [5, 5]]
x = [row[0] for row in dataset]
y = [row[1] for row in dataset]
mean_x, mean_y = mean(x), mean(y)
var_x, var_y = variance(x, mean_x), variance(y, mean_y)
print('x stats: mean=%.3f variance=%.3f' % (mean_x, var_x))
print('y stats: mean=%.3f variance=%.3f' % (mean_y, var_y))

Running this example prints out the mean and variance for both columns.

x stats: mean=3.000 variance=10.000
y stats: mean=2.800 variance=8.800
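One way to sanity-check these numbers is against Python's standard statistics module. Note that the variance() function in this tutorial returns the sum of squared differences without dividing by the number of values, whereas statistics.pvariance() does divide, so the sketch below multiplies by the count to compare like with like. The variable names are assumptions for this example.

```python
from statistics import mean, pvariance

x = [1, 2, 4, 3, 5]
y = [1, 3, 3, 2, 5]

# pvariance() divides the sum of squared differences by the count,
# so multiplying by the count recovers the un-normalized sum
sum_sq_x = pvariance(x) * len(x)
sum_sq_y = pvariance(y) * len(y)
print('x: mean=%.3f sum of squares=%.3f' % (mean(x), sum_sq_x))
print('y: mean=%.3f sum of squares=%.3f' % (mean(y), sum_sq_y))
```

If the two approaches agree, the from-scratch functions are behaving as intended on this dataset.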

This is our first step. Next, we need to put these values to use in calculating the covariance.

2. Calculate Covariance

The covariance of two variables describes how the values of those variables change together.

Covariance is a generalization of correlation. Correlation is limited to two variables, whereas covariance can be calculated between two or more variables.

Additionally, covariance can be normalized to produce a correlation value.

Nevertheless, we can calculate the covariance between two variables as follows:

covariance = sum((x(i) - mean(x)) * (y(i) - mean(y)))

Below is a function named covariance() that implements this statistic. It builds upon the previous step and takes the lists of x and y values as well as the mean of these values as arguments.
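A minimal sketch of such a function follows, tested on the small contrived dataset from the previous step. The argument order (x, mean_x, y, mean_y) is an assumption, since the text names the arguments but not their order, and like the variance calculation it returns the un-normalized sum from the formula above.

```python
# Calculate covariance between x and y (un-normalized sum of products)
# Assumed argument order: x values, mean of x, y values, mean of y
def covariance(x, mean_x, y, mean_y):
    covar = 0.0
    for i in range(len(x)):
        covar += (x[i] - mean_x) * (y[i] - mean_y)
    return covar

# small contrived dataset
x = [1, 2, 4, 3, 5]
y = [1, 3, 3, 2, 5]
mean_x = sum(x) / float(len(x))
mean_y = sum(y) / float(len(y))
covar = covariance(x, mean_x, y, mean_y)
print('Covariance: %.3f' % covar)
```

A positive result indicates that x and y tend to increase together on this dataset.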
