
Feature Importance and Feature Selection With XGBoost in Python

A benefit of using ensembles of decision tree methods like gradient boosting is that they can automatically provide estimates of feature importance from a trained predictive model.

In this post you will discover how you can estimate the importance of features for a predictive modeling problem using the XGBoost library in Python.

After reading this post you will know:

How feature importance is calculated using the gradient boosting algorithm.
How to plot feature importance in Python calculated by the XGBoost model.
How to use feature importance calculated by XGBoost to perform feature selection.

Let’s get started.



Photo by Keith Roper, some rights reserved.

Feature Importance in Gradient Boosting

A benefit of using gradient boosting is that after the boosted trees are constructed, it is relatively straightforward to retrieve importance scores for each attribute.

Generally, importance provides a score that indicates how useful or valuable each feature was in the construction of the boosted decision trees within the model. The more an attribute is used to make key decisions with decision trees, the higher its relative importance.

This importance is calculated explicitly for each attribute in the dataset, allowing attributes to be ranked and compared to each other.

Importance is calculated for a single decision tree by the amount that each attribute split point improves the performance measure, weighted by the number of observations the node is responsible for. The performance measure may be the purity (Gini index) used to select the split points or another, more specific error function.

The feature importances are then averaged across all of the decision trees within the model.

For more technical information on how feature importance is calculated in boosted decision trees, see Section 10.13.1 “Relative Importance of Predictor Variables” of the book The Elements of Statistical Learning: Data Mining, Inference, and Prediction, page 367.

Also, see Matthew Drury's answer to the StackOverflow question “Relative variable importance for Boosting”, where he provides a very detailed and practical answer.
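As a quick, hedged sketch (my own addition, not part of the original post), the underlying Booster object of a trained model can report importance broken down by different statistics; the exact API availability depends on the XGBoost version installed:

# sketch: inspect different importance types on an already-fitted XGBClassifier named `model`
# (availability of get_booster() and the importance types depends on your XGBoost version)
booster = model.get_booster()                        # underlying Booster object
print(booster.get_score(importance_type='weight'))   # how often each feature is used to split
print(booster.get_score(importance_type='gain'))     # average loss improvement when the feature is used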

Manually Plot Feature Importance

A trained XGBoost model automatically calculates feature importance on your predictive modeling problem.

These importance scores are available in the feature_importances_ member variable of the trained model. For example, they can be printed directly as follows:

print(model.feature_importances_)

We can plot these scores on a bar chart directly to get a visual indication of the relative importance of each feature in the dataset. For example:

# plot
pyplot.bar(range(len(model.feature_importances_)), model.feature_importances_)
pyplot.show()

We can demonstrate this by training an XGBoost model on the Pima Indians onset of diabetes dataset and creating a bar chart from the calculated feature importances.

# plot feature importance manually
from numpy import loadtxt
from xgboost import XGBClassifier
from matplotlib import pyplot
# load data
dataset = loadtxt('pima-indians-diabetes.csv', delimiter=",")
# split data into X and y
X = dataset[:,0:8]
y = dataset[:,8]
# fit model on training data
model = XGBClassifier()
model.fit(X, y)
# feature importance
print(model.feature_importances_)
# plot
pyplot.bar(range(len(model.feature_importances_)), model.feature_importances_)
pyplot.show()

Running this example first outputs the importance scores:

[ 0.089701 0.17109634 0.08139535 0.04651163 0.10465116 0.2026578 0.1627907 0.14119601]

We also get a bar chart of the relative importances.


Manual Bar Chart of XGBoost Feature Importance

A downside of this plot is that the features are ordered by their input index rather than their importance. We could sort the features before plotting.
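As a small sketch of that idea (my own addition, reusing the fitted model from the listing above), the scores can be sorted with NumPy before plotting:

# sketch: sort features by importance before plotting manually
from numpy import argsort
importances = model.feature_importances_
order = argsort(importances)                       # feature indices from least to most important
pyplot.bar(range(len(importances)), importances[order])
pyplot.xticks(range(len(importances)), order)      # label each bar with its original feature index
pyplot.show()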

Thankfully, there is a built-in plot function to help us.

Using the Built-in XGBoost Feature Importance Plot

The XGBoost library provides a built-in function to plot features ordered by their importance.

The function is called plot_importance() and can be used as follows:

# plot feature importance
plot_importance(model)
pyplot.show()

For example, below is a complete code listing plotting the feature importance for the Pima Indians dataset using the built-in plot_importance() function.

# plot feature importance using built-in function
from numpy import loadtxt
from xgboost import XGBClassifier
from xgboost import plot_importance
from matplotlib import pyplot
# load data
dataset = loadtxt('pima-indians-diabetes.csv', delimiter=",")
# split data into X and y
X = dataset[:,0:8]
y = dataset[:,8]
# fit model on training data
model = XGBClassifier()
model.fit(X, y)
# plot feature importance
plot_importance(model)
pyplot.show()

Running the example gives us a more useful bar chart.


XGBoost Feature Importance Bar Chart

You can see that features are automatically named according to their index in the input array (X) from F0 to F7.

Manually mapping these indices to names in the problem description, we can see that the plot shows F5 (body mass index) has the highest importance and F3 (skin fold thickness) has the lowest importance.
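If you would rather see readable names directly on the plot, one hedged option (my own addition, and behavior that depends on your XGBoost version) is to fit the model on a pandas DataFrame so the booster records the column names; in recent releases plot_importance() also accepts optional arguments such as importance_type and max_num_features:

# sketch: readable feature names on the built-in importance plot
# the short column names below are the conventional ones for this dataset; adjust as needed
import pandas
from xgboost import XGBClassifier, plot_importance
from matplotlib import pyplot
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = pandas.read_csv('pima-indians-diabetes.csv', header=None, names=names)
X = data.iloc[:, 0:8]
y = data.iloc[:, 8]
model = XGBClassifier()
model.fit(X, y)                                    # fitting on a DataFrame records the column names
plot_importance(model, importance_type='gain', max_num_features=5)  # optional args, if supported
pyplot.show()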

Feature Selection with XGBoost Feature Importance Scores

Feature importance scores can be used for feature selection in scikit-learn.

This is done using the SelectFromModel class that takes a model and can transform a dataset into a subset with selected features.

This class can take a pre-trained model, such as one trained on the entire training dataset. It can then use a threshold to decide which features to select. This threshold is used when you call the transform() method on the SelectFromModel instance, so the same subset of features is selected consistently from the training dataset and any new data.
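The worked example from the original post is truncated here, but as a rough, hedged sketch of how the pieces fit together, SelectFromModel can wrap an already-fitted XGBClassifier and keep only the features whose importance meets a chosen threshold (the threshold value below is illustrative):

# sketch: feature selection with SelectFromModel and a fitted XGBClassifier
from numpy import loadtxt
from xgboost import XGBClassifier
from sklearn.feature_selection import SelectFromModel
# load and split data
dataset = loadtxt('pima-indians-diabetes.csv', delimiter=",")
X = dataset[:,0:8]
y = dataset[:,8]
# fit model on training data
model = XGBClassifier()
model.fit(X, y)
# prefit=True reuses the already-trained model instead of refitting inside SelectFromModel
selection = SelectFromModel(model, threshold=0.15, prefit=True)
select_X = selection.transform(X)   # keep only features with importance >= 0.15 (illustrative)
print(select_X.shape)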
