Guide to Encoding Categorical Values in Python

Introduction

In many practical Data Science activities, the data set will contain categorical variables. These variables are typically stored as text values that represent various traits. Some examples include color (“Red”, “Yellow”, “Blue”), size (“Small”, “Medium”, “Large”), or geographic designations (state or country). Regardless of what the value is used for, the challenge is determining how to use this data in the analysis. Many machine learning algorithms can support categorical values without further manipulation, but many more cannot. Therefore, the analyst is faced with the challenge of figuring out how to turn these text attributes into numerical values for further processing.

As with many other aspects of the Data Science world, there is no single answer on how to approach this problem. Each approach has trade-offs and can affect the outcome of the analysis. Fortunately, the Python tools pandas and scikit-learn provide several approaches that can be applied to transform categorical data into suitable numeric values. This article is a survey of some common (and a few more complex) approaches, in the hope that it will help others apply these techniques to their real-world problems.
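To make the idea of turning text attributes into numbers concrete, here is a minimal sketch of two common pandas approaches on a tiny hypothetical frame. The frame, its column names, and the chosen size ordering are assumptions made for illustration, not part of the data set used in this article.

```python
import pandas as pd

# Hypothetical toy data; the columns and values are illustrative only
toy = pd.DataFrame({
    "size": ["Small", "Large", "Medium", "Small"],
    "color": ["Red", "Blue", "Red", "Yellow"],
})

# Approach 1: integer codes with an explicit order (suits ordered traits such as size)
size_order = ["Small", "Medium", "Large"]
toy["size_code"] = pd.Categorical(toy["size"], categories=size_order, ordered=True).codes

# Approach 2: one indicator column per value (suits unordered traits such as color)
color_dummies = pd.get_dummies(toy["color"], prefix="color")

print(toy)
print(color_dummies)
```

Integer codes keep the data compact but imply an ordering, while indicator columns avoid that implication at the cost of extra columns. Which trade-off is acceptable depends on the variable and the model.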

The Data Set

For this article, I was able to find a good data set at the UCI Machine Learning Repository. This particular Automobile Data Set includes a good mix of categorical and continuous values and serves as a useful example that is relatively easy to understand. Since domain understanding is an important aspect of deciding how to encode various categorical values, this data set makes a good case study.

Before we get started encoding the various values, we need to import the data and do some minor cleanups. Fortunately, pandas makes this straightforward:

    import pandas as pd
    import numpy as np

    # Define the headers since the data does not have any
    headers = ["symboling", "normalized_losses", "make", "fuel_type", "aspiration",
               "num_doors", "body_style", "drive_wheels", "engine_location",
               "wheel_base", "length", "width", "height", "curb_weight",
               "engine_type", "num_cylinders", "engine_size", "fuel_system",
               "bore", "stroke", "compression_ratio", "horsepower", "peak_rpm",
               "city_mpg", "highway_mpg", "price"]

    # Read in the CSV file and convert "?" to NaN
    df = pd.read_csv("http://mlr.cs.umass.edu/ml/machine-learning-databases/autos/imports-85.data",
                     header=None, names=headers, na_values="?")
    df.head()

       symboling  normalized_losses         make  fuel_type  aspiration  num_doors   body_style  drive_wheels  engine_location  wheel_base  …  engine_size  fuel_system  bore  stroke  compression_ratio  horsepower  peak_rpm  city_mpg  highway_mpg    price
    0          3                NaN  alfa-romero        gas         std        two  convertible           rwd            front        88.6  …          130         mpfi  3.47    2.68                9.0       111.0    5000.0        21           27  13495.0
    1          3                NaN  alfa-romero        gas         std        two  convertible           rwd            front        88.6  …          130         mpfi  3.47    2.68                9.0       111.0    5000.0        21           27  16500.0
    2          1                NaN  alfa-romero        gas         std        two    hatchback           rwd            front        94.5  …          152         mpfi  2.68    3.47                9.0       154.0    5000.0        19           26  16500.0
    3          2              164.0         audi        gas         std       four        sedan           fwd            front        99.8  …          109         mpfi  3.19    3.40               10.0       102.0    5500.0        24           30  13950.0
    4          2              164.0         audi        gas         std       four        sedan           4wd            front        99.4  …          136         mpfi  3.19    3.40                8.0       115.0    5500.0        18           22  17450.0
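The download above depends on network access, so the effect of `na_values="?"` can also be checked offline. The following is a small sketch that runs the same kind of `read_csv` call against an in-memory string. The two sample rows and the shortened header list are assumptions modeled on the first few columns of the output, not the real file contents.

```python
import io
import pandas as pd

# Two illustrative rows in a headerless, comma-separated layout (assumed, not the real file)
sample = io.StringIO("3,?,alfa-romero,gas\n2,164,audi,gas\n")

# Hypothetical subset of the column names, used only for this sketch
sample_headers = ["symboling", "normalized_losses", "make", "fuel_type"]

# header=None: the data has no header row; na_values="?": treat "?" as missing
sample_df = pd.read_csv(sample, header=None, names=sample_headers, na_values="?")

print(sample_df)
print(sample_df.dtypes)
```

Because the "?" is read as missing rather than as text, the `normalized_losses` column can be parsed as a float column instead of falling back to strings.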

The final check we want to do is see what data types we have:

    df.dtypes

    symboling              int64
    normalized_losses    float64
    make                  object
    fuel_type             object
    aspiration            object
    num_doors             object
    body_style            object
    drive_wheels          object
    engine_location       object
    wheel_base           float64
    length               float64
    width                float64
    height               float64
    curb_weight            int64
    engine_type           object
    num_cylinders         object
    engine_size            int64
    fuel_system           object
    bore                 float64
    stroke               float64
    compression_ratio    float64
    horsepower           float64
    peak_rpm             float64
    city_mpg               int64
    highway_mpg            int64
    price                float64
    dtype: object

Since this article focuses only on encoding the categorical variables, we are going to include only the object columns in our dataframe. Pandas has a helpful select_dtypes function which we can use to build a new dataframe containing only the object columns.

    obj_df = df.select_dtypes(include=['object']).copy()
    obj_df.head()

              make  fuel_type  aspiration  num_doors   body_style  drive_wheels  engine_location  engine_type  num_cylinders  fuel_system
    0  alfa-romero        gas         std        two  convertible           rwd            front         dohc           four         mpfi
    1  alfa-romero        gas         std
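As a self-contained illustration of `select_dtypes`, here is a sketch on a small hypothetical frame. The values are made up, and the explicit cast to `object` is an assumption added so the selection behaves the same even if a pandas version infers a dedicated string dtype for text.

```python
import pandas as pd

# Illustrative frame mixing text and numeric columns (values are made up)
mixed = pd.DataFrame({
    "make": ["audi", "bmw"],
    "num_doors": ["four", "two"],
    "city_mpg": [24, 23],
    "price": [13950.0, 16430.0],
})

# Force the text columns to object dtype so the selection below is predictable
mixed = mixed.astype({"make": object, "num_doors": object})

# Keep only the object columns; .copy() gives an independent frame to modify
text_only = mixed.select_dtypes(include=["object"]).copy()
text_only["num_doors"] = text_only["num_doors"].str.upper()

print(text_only)
print(mixed["num_doors"].tolist())  # the original frame is left untouched
```

Working on a copy means later encoding experiments can overwrite columns freely without changing the frame that was loaded from the file.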
