I remember experimenting with doing regressions in Python using R-style formulae a long time ago, and it being a bit complicated. Luckily it's become really easy now, and I'll show you just how easy.
Before running this you will need to install the pandas, statsmodels and patsy packages. If you're using conda you should be able to do this by running the following from the terminal:
conda install pandas statsmodels patsy
(and then say yes when it asks you to confirm it)
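If you're not using conda, the pip equivalent should work too (assuming pip points at the Python installation you're actually using):

pip install pandas statsmodels patsy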
import pandas as pd
from statsmodels.formula.api import ols

Before we can do any regression, we need some data, so let's read some data on cars:
df = pd.read_csv("http://web.pdx.edu/~gerbing/data/cars.csv")

You may have noticed from the code above that you can just give a URL to the read_csv function and it will download and open it for you. Handy!
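If you don't want to hit the network every time you run the script, one option is to save a local copy on the first run. A minimal sketch (the local file name is just an example):

import pandas as pd

# Download once and keep a local copy
df = pd.read_csv("http://web.pdx.edu/~gerbing/data/cars.csv")
df.to_csv("cars.csv", index=False)

# Later runs can read the local file instead
df = pd.read_csv("cars.csv")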
Anyway, here is the data:
df.head()

   Model                     MPG   Cylinders  Engine Disp  Horsepower  Weight  Accelerate  Year  Origin
0  amc ambassador dpl        15.0  8          390.0        190         3850    8.5         70    American
1  amc gremlin               21.0  6          199.0        90          2648    15.0        70    American
2  amc hornet                18.0  6          199.0        97          2774    15.5        70    American
3  amc rebel sst             16.0  8          304.0        150         3433    12.0        70    American
4  buick estate wagon (sw)   14.0  8          455.0        225         3086    10.0        70    American
Before we do our regression it might be a good idea to look at simple correlations between columns. We can get the correlations between each pair of columns using the corr() method:
df.corr()

              MPG        Cylinders  Engine Disp  Horsepower  Weight     Accelerate  Year
MPG           1.000000   -0.777618  -0.805127    -0.778427   -0.832244   0.423329    0.580541
Cylinders    -0.777618    1.000000   0.950823     0.842983    0.897527  -0.504683   -0.345647
Engine Disp  -0.805127    0.950823   1.000000     0.897257    0.932994  -0.543800   -0.369855
Horsepower   -0.778427    0.842983   0.897257     1.000000    0.864538  -0.689196   -0.416361
Weight       -0.832244    0.897527   0.932994     0.864538    1.000000  -0.416839   -0.309120
Accelerate    0.423329   -0.504683  -0.543800    -0.689196   -0.416839   1.000000    0.290316
Year          0.580541   -0.345647  -0.369855    -0.416361   -0.309120   0.290316    1.000000
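If all you care about is what correlates with MPG, it's handy to pull out just that column and sort it. One caveat: in recent versions of pandas you need to pass numeric_only=True, because the Model and Origin columns are strings. A minimal sketch:

# Correlations of each numeric column with MPG, from most negative to most positive
df.corr(numeric_only=True)["MPG"].sort_values()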
Now we can do some regression using R-style formulae. In this case we’re trying to predict MPG based on the year that the car was released:
model = ols("MPG ~ Year", data=df) results = model.fit()The ‘formula’ that we used above is the same as R uses: on the left is the dependent variable, on the right is the independent variable. The ols method is nice and easy, we just give it the formula, and then the DataFrame to use to get the data from (in this case, it’s called df ). We then call fit() to actually do the regression.
We can easily get a summary of the results here, including all sorts of crazy statistical measures!
results.summary()

                            OLS Regression Results
==============================================================================
Dep. Variable:                    MPG   R-squared:                       0.337
Model:                            OLS   Adj. R-squared:                  0.335
Method:                 Least Squares   F-statistic:                     198.3
Date:                Sat, 20 Aug 2016   Prob (F-statistic):           1.08e-36
Time:                        10:42:17   Log-Likelihood:                -1280.6
No. Observations:                 392   AIC:                             2565.
Df Residuals:                     390   BIC:                             2573.
Df Model:                           1
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|   [95.0% Conf. Int.]
------------------------------------------------------------------------------
Intercept    -70.0117      6.645    -10.536      0.000     -83.076   -56.947
Year           1.2300      0.087     14.080      0.000       1.058     1.402
==============================================================================
Omnibus:                       21.407   Durbin-Watson:                   1.121
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               15.843
Skew:                           0.387   Prob(JB):                     0.000363
Kurtosis:                       2.391   Cond. No.                     1.57e+03
==============================================================================
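As a quick sanity check on those coefficients, the fitted line is just MPG = -70.0117 + 1.2300 × Year. Note that Year in this dataset appears to be two-digit (the head above shows 70), so for a 1980 car:

# Coefficients taken from the summary above
year = 80  # i.e. 1980
predicted_mpg = -70.0117 + 1.2300 * year
print(predicted_mpg)  # about 28.4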
We can do a more complex model easily too. First let's list the columns of the data to remind us what variables we have:

df.columns

Index(['Model', 'MPG', 'Cylinders', 'Engine Disp', 'Horsepower', 'Weight',
       'Accelerate', 'Year', 'Origin'],
      dtype='object')

We can now add in more variables to do multiple regression:
model = ols("MPG ~ Year + Weight + Horsepower", data=df) results = model.fit() results.summary() OLS Regression Results Dep. Variable: MPG R-squared: 0.808 Model: OLS Adj. R-squared: 0.807 Method: Least Squares F-statistic: 545.4 Date: Sat, 20 Aug 2016 Prob (F-statistic): 9.37e-139 Time: 10:42:17 Log-Likelihood: -1037.4 No. Observations: 392 AIC: 2083. Df Residuals: 388 BIC: 2099. Df Model: 3 Covariance Type: nonrobust coef std err t P>|t| [95.0% Conf. Int.] Intercept -13.7194 4.182 -3.281 0.001 -21.941 -5.498 Year 0.7487 0.052 14.365 0.000 0.646 0.851 Weight -0.0064 0.000 -15.768 0.000 -0.007 -0.006 Horsepower -0.0050 0.009 -0.530 0.597 -0.024 0.014 Omnibus: 41.952 Durbin-Watson: 1.423 Prob(Omnibus): 0.000 Jarque-Bera (JB): 69.490 Skew: 0.671 Prob(JB): 8.14e-16 Kurtosis: 4.566 Cond. No. 7.48e+04We can see that bringing in some extra variables has increased the $R^2$ value from ~0.3 to ~0.8 although we can see that the P value for the Horsepower is very high. If we remove Horsepower from the regression then it barely changes the results:
model = ols("MPG ~ Year + Weight", data=df) results = model.fit() results.summary() OLS Regression Results Dep. Variable: MPG R-squared: 0.808 Model: OLS Adj. R-squared: 0.807 Method: Least Squares F-statistic: 819.5 Date: Sat, 20 Aug 2016 Prob (F-statistic): 3.33e-140 Time: 10:42:17 Log-Likelihood: -1037.6 No. Observations: 392 AIC: 2081. Df Residuals: 389 BIC: 2093. Df Model: 2 Covariance Type: nonrobust coef std err t P>|t| [95.0% Conf. Int.] Intercept -14.3473 4.007 -3.581 0.000 -22.224 -6.470 Year 0.7573 0.049 15.308 0.000 0.660 0.855 Weight -0.0066 0.000 -30.911 0.000 -0.007 -0.006 Omnibus: 42.504 Durbin-Watson: 1.425 Prob(Omnibus): 0.000 Jarque-Bera (JB): 71.997 Skew: 0.670 Prob(JB): 2.32e-16 Kurtosis: 4.616 Cond. No. 7.17e+04We can also see if introducing categorical variables helps with the regression. In this case, we only have one categorical variable, called Origin . Patsy automatically treats strings as categorical variables, so we don’t have to do anything special but if needed we could wrap the variable name in C() to force it to be a categorical variable.
model = ols("MPG ~ Year + Origin", data=df) results = model.fit() results.summary() OLS Regression Results Dep. Variable: MPG R-squared: 0.579 Model: OLS Adj. R-squared: 0.576 Method: Least Squares F-statistic: 178.0 Date: Sat, 20 Aug 2016 Prob (F-statistic): 1.42e-72 Time: 10:42:17 Log-Likelihood: -1191.5 No. Observations: 392 AIC: 2391. Df Residuals: 388 BIC: 2407. Df Model: 3 Covariance Type: nonrobust coef std err t P>|t| [95.0% Conf. Int.] Intercept -61.2643 5.393 -11.360 0.000 -71.868 -50.661 Origin[T.European] 7.4784 0.697 10.734 0.000 6.109 8.848 Origin[T.Japanese] 8.4262 0.671 12.564 0.000 7.108 9.745 Year 1.0755 0.071 15.102 0.000 0.935 1.216 Omnibus: 10.231 Durbin-Watson: 1.656 Prob(Omnibus): 0.006 Jarque-Bera (JB): 10.589 Skew: 0.402 Prob(JB): 0.00502 Kurtosis: 2.980 Cond. No. 1.60e+03You can see here that Patsy has automatically created extra variables for Origin : in this case, European and Japanese, with the ‘default’ being American. You can configure how this is done very easily see here .
Just for reference, you can easily get any of the statistical outputs as attributes on the results object:
results.rsquared

0.57919459237581172
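A few other attributes that are often useful (these are all part of the statsmodels results API):

results.pvalues     # p-value for each coefficient
results.bse         # standard error for each coefficient
results.conf_int()  # confidence intervals as a DataFrame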
results.params

Intercept             -61.264305
Origin[T.European]      7.478449
Origin[T.Japanese]      8.426227
Year                    1.075484
dtype: float64

You can also really easily use the model to predict based on values you've got:
results.predict({'Year':1990, 'Origin':'European'})

array([ 2086.42634156])

Note the rather absurd prediction: the Year column in this dataset is two-digit, so passing 1990 extrapolates wildly off the end of the data. Passing Year=90 for a 1990 car gives a much more sensible answer of about 43 MPG.
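predict also accepts a DataFrame, so you can score several hypothetical cars at once (a sketch; the values below are made up):

new_cars = pd.DataFrame({
    "Year": [75, 80, 82],
    "Origin": ["American", "European", "Japanese"],
})
results.predict(new_cars)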