I’ve learned many things since I joined Civis. Least expected, though, is a new appreciation for simple linear regression and classification models. Shortly after I started, I was asked to evaluate a collection of modeling pipelines on a sample of typical prediction problems at Civis. Glancing at the list, I distinctly remember thinking that the real task was to measure how much better tree-based models like XGBoost would be compared to some of the “lesser” models like logistic regression.
I set to work installing and running each of the tools on the list, collecting accuracy and runtime metrics. While most were written in Python using Scikit-Learn, NumPy, and SciPy, there was a two-stage model that used both the R package glmnet and Scikit-Learn. Thinking it would be easier to have a tool written in a single language, I started looking for the Scikit-Learn analog of glmnet, specifically the cv.glmnet function in that R package. Unfortunately, I could not find anything written in Python that emulated the functionality we needed from glmnet.
Still, calling R from Python bothered me, so I started reading the glmnet code to see if I could copy and paste my way to success. As it turns out, the majority of the code was written in FORTRAN, a language so old it’s from a time when computers had only uppercase letters. Believe it or not, this was actually good news: I just needed to write some Python code to convert the data to the shape and format expected by the FORTRAN glmnet functions, and everything would just work. A few weeks later, we had a Python implementation of glmnet. And today, we’re open sourcing the code.
So what is glmnet and why do we like it at Civis? From the package documentation:
“glmnet is a package that fits a generalized linear model via penalized maximum likelihood.”
In less formal terms, glmnet fits lasso, ridge, and elastic net versions of linear and logistic regression. For the moment, we are just going to discuss the lasso and its feature selection properties, but for a more complete explanation, see two of our favorite textbooks, ESL and ISL.
Feature selection, the process of identifying which features to include in a modeling task, is a challenging problem. Often, the problem is sufficiently complex that the right answer is not at all obvious. My colleague Katie discussed feature selection in depth while building a model to predict whether water wells need maintenance. In that example, she used SelectKBest, which selects the best features based on univariate statistical tests, a measure of how much each feature is related to the outcome of interest.
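As a rough sketch of that univariate approach (not Katie's exact pipeline; the dataset and the choice of k=10 here are placeholders), SelectKBest can be used like this:

from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# placeholder data standing in for real well-maintenance features
X, y = make_classification(n_samples=1000, n_features=50, n_informative=8)

# keep the 10 features with the strongest univariate association with the outcome
selector = SelectKBest(score_func=f_classif, k=10)
X_reduced = selector.fit_transform(X, y)

print(X_reduced.shape)                      # (1000, 10)
print(selector.get_support(indices=True))   # indices of the selected features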
The lasso is another tool we can use to accomplish the same objective. The lasso works by fitting a normal logistic regression but imposing a penalty on the absolute value of the model’s coefficients. This has the effect of shrinking the coefficients, some all the way to zero, meaning those features are effectively excluded and the resulting model is sparse. We could stop here if our task were simply producing a classification model, or we could use the selected features, those with non-zero coefficients, as one step in a Scikit-Learn pipeline, as sketched below.
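Here is a minimal sketch of that second route, using Scikit-Learn's own L1-penalized logistic regression as the selection step (glmnet itself is introduced below); the downstream model and the value of C are arbitrary choices for illustration:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

X, y = make_classification(n_samples=1000, n_features=50, n_informative=8)

# step one fits a lasso-style model and drops features whose coefficients shrink to zero;
# step two trains a classifier on only the surviving features
pipeline = Pipeline([
    ("select", SelectFromModel(LogisticRegression(penalty="l1", solver="liblinear", C=0.1))),
    ("model", RandomForestClassifier(n_estimators=100)),
])
pipeline.fit(X, y)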
I omitted one important detail above, and it turns out to be one of the key reasons we use glmnet instead of other lasso solvers. I mentioned that the lasso imposes a cost on including features in the model, but not how we balance that cost against the desire to find an accurate model. This balance, the regularization strength, is actually a parameter of the model we must choose. Typically, we run a grid search, fitting many models at different values of the regularization strength. As you can imagine, this isn’t particularly fast; one of the innovations of the glmnet authors was making this process of fitting many models at different regularization strengths fast and efficient through some clever math tricks. A single call to the glmnet solver returns model solutions for a whole range of regularization strengths (referred to as the regularization path). Scikit-Learn has a few solvers that are similar to glmnet, ElasticNetCV and LogisticRegressionCV, but they have some limitations. The former only works for linear regression, and the latter does not handle the elastic net penalty. They also require the user to supply the full sequence of regularization parameters, whereas glmnet will determine a suitable sequence from the input data.
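For the curious, the objective glmnet minimizes has roughly the following form, where $\ell$ is the loss for the model (squared error for linear regression, logistic loss for classification), $\lambda$ is the regularization strength swept over by the path, and $\alpha$ (which shows up in the example below) mixes the lasso and ridge penalties:

$$
\min_{\beta_0,\,\beta}\ \frac{1}{N}\sum_{i=1}^{N} \ell\!\left(y_i,\ \beta_0 + x_i^\top \beta\right) \;+\; \lambda\left[\frac{1-\alpha}{2}\lVert\beta\rVert_2^2 + \alpha\lVert\beta\rVert_1\right]
$$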
A brief example with synthetic data:
import numpy as np
from sklearn.datasets import make_classification
from sklearn.cross_validation import train_test_split  # sklearn.model_selection in newer scikit-learn versions
import matplotlib.pyplot as plt
import seaborn as sns
from glmnet import LogitNet  # or ElasticNet for regression

Generate some synthetic data:
X, y = make_classification(n_samples=10000, n_features=100, n_informative=25, n_redundant=10)
X_train, X_test, y_train, y_test = train_test_split(X, y)  # hold out a test set for evaluation

Fitting a model should be familiar to anyone who has used Scikit-Learn. First, we instantiate the estimator, supplying any data-independent parameters. In our case, a few relevant options are:
alpha: the mix between the lasso and ridge penalties, with 1 being pure lasso and 0 pure ridge
n_folds: the number of cross-validation folds used to evaluate model performance
Additional options are documented in the class docstring; ?LogitNet will display it if you happen to be using the IPython interpreter or a Jupyter notebook. The LogitNet and ElasticNet classes in the Python package are similar to cv.glmnet in the R package in that they run k-fold cross-validation to evaluate model performance for each value of the regularization parameter and automatically select the best one.
m = LogitNet(alpha=0.75, n_folds=3)

Next, we call the fit method of the estimator, passing our training covariates X_train and labels y_train.
m = m.fit(X_train, y_train)

We can also plot the coefficient path. This is a plot of the coefficient of each feature in the model as a function of the regularization parameter. When this parameter is very large, all of the coefficients are zero; as it is lowered, features start entering the model. Note that the x-axis here is reversed.
for i in range(m.coef_path_.shape[1]):
    plt.plot(m.lambda_path_, m.coef_path_[0, i, :])

ax = plt.gca()
ax.set_xlim(right=m.lambda_path_.max())
ax.set_xlabel("lambda")
ax.set_ylabel("Coef Value")
ax.invert_xaxis()
plt.show()
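Once fit, the estimator behaves like any other Scikit-Learn classifier. As a quick sanity check on the held-out data from the split above (the metric here is just an example):

from sklearn.metrics import accuracy_score

# predict labels for the held-out test set and compare against the truth
y_pred = m.predict(X_test)
print(accuracy_score(y_test, y_pred))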
glmnet may be installed from PyPI and is coming soon to conda-forge. The source code is also available on GitHub. We encourage you to try it on your projects; issues and pull requests are welcome.