Guide to Encoding Categorical Values in Python

Introduction

In many practical Data Science activities, the data set will contain categorical variables. These variables are typically stored as text values which represent various traits. Some examples include color (“Red”, “Yellow”, “Blue”), size (“Small”, “Medium”, “Large”) or geographic designations (State or Country). Regardless of what the value is used for, the challenge is determining how to use this data in the analysis. Many machine learning algorithms can support categorical values without further manipulation, but there are many more algorithms that do not. Therefore, the analyst is faced with the challenge of figuring out how to turn these text attributes into numerical values for further processing.

As with many other aspects of the Data Science world, there is no single answer on how to approach this problem. Each approach has trade-offs and a potential impact on the outcome of the analysis. Fortunately, the Python tools of pandas and scikit-learn provide several approaches that can be applied to transform the categorical data into suitable numeric values. This article is a survey of some of the common (and a few more complex) approaches, in the hope that it will help others apply these techniques to their real-world problems.
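To make the problem concrete before turning to the real data, here is a minimal sketch of two common encodings in pandas. The `toy` dataframe and its values are hypothetical and are not part of the automobile data set; the ordering in `size_order` is an assumption made for illustration.

```python
import pandas as pd

# Hypothetical toy data, not taken from the automobile data set
toy = pd.DataFrame({
    "color": ["Red", "Yellow", "Blue", "Red"],
    "size": ["Small", "Large", "Medium", "Small"],
})

# Approach 1: integer codes. The category order is stated explicitly
# (an assumption for this sketch) so the codes follow Small < Medium < Large.
size_order = ["Small", "Medium", "Large"]
toy["size_code"] = pd.Categorical(
    toy["size"], categories=size_order, ordered=True
).codes

# Approach 2: one indicator column per distinct value (one-hot encoding)
one_hot = pd.get_dummies(toy["color"], prefix="color")

print(toy)
print(one_hot)
```

Integer codes keep the data compact but imply an ordering, which suits something like size and is questionable for color. Indicator columns avoid that implied order at the cost of adding one column per distinct value.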

The Data Set

For this article, I was able to find a good dataset at the UCI Machine Learning Repository. This particular Automobile Data Set includes a good mix of categorical values as well as continuous values and serves as a useful example that is relatively easy to understand. Since domain understanding is an important aspect when deciding how to encode various categorical values, this data set makes a good case study.

Before we get started encoding the various values, we need to import the data and do some minor cleanups. Fortunately, pandas makes this straightforward:

    import pandas as pd
    import numpy as np

    # Define the headers since the data does not have any
    headers = ["symboling", "normalized_losses", "make", "fuel_type", "aspiration",
               "num_doors", "body_style", "drive_wheels", "engine_location",
               "wheel_base", "length", "width", "height", "curb_weight",
               "engine_type", "num_cylinders", "engine_size", "fuel_system",
               "bore", "stroke", "compression_ratio", "horsepower", "peak_rpm",
               "city_mpg", "highway_mpg", "price"]

    # Read in the CSV file and convert "?" to NaN
    df = pd.read_csv("http://mlr.cs.umass.edu/ml/machine-learning-databases/autos/imports-85.data",
                     header=None, names=headers, na_values="?")
    df.head()

       symboling  normalized_losses         make  fuel_type  aspiration  num_doors   body_style  drive_wheels  engine_location  wheel_base  …  engine_size  fuel_system  bore  stroke  compression_ratio  horsepower  peak_rpm  city_mpg  highway_mpg    price
    0          3                NaN  alfa-romero        gas         std        two  convertible           rwd            front        88.6  …          130         mpfi  3.47    2.68                9.0       111.0    5000.0        21           27  13495.0
    1          3                NaN  alfa-romero        gas         std        two  convertible           rwd            front        88.6  …          130         mpfi  3.47    2.68                9.0       111.0    5000.0        21           27  16500.0
    2          1                NaN  alfa-romero        gas         std        two    hatchback           rwd            front        94.5  …          152         mpfi  2.68    3.47                9.0       154.0    5000.0        19           26  16500.0
    3          2              164.0         audi        gas         std       four        sedan           fwd            front        99.8  …          109         mpfi  3.19    3.40               10.0       102.0    5500.0        24           30  13950.0
    4          2              164.0         audi        gas         std       four        sedan           4wd            front        99.4  …          136         mpfi  3.19    3.40                8.0       115.0    5500.0        18           22  17450.0
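If the URL is not reachable, the effect of header=None, names and na_values can still be checked on a small inline sample. The following is a sketch only: the three rows and the shortened column list are hypothetical stand-ins in the same headerless, comma-separated layout, not rows copied from the real file.

```python
import io
import pandas as pd

# Hypothetical three-row sample; "?" marks a missing value as in the real file
sample = io.StringIO(
    "3,?,alfa-romero\n"
    "2,164,audi\n"
    "1,?,bmw\n"
)

# Same read_csv options as above, with a shortened (illustrative) header list
sample_df = pd.read_csv(
    sample,
    header=None,
    names=["symboling", "normalized_losses", "make"],
    na_values="?",
)

print(sample_df)
print(sample_df.dtypes)
```

Because the "?" entries are converted to NaN while parsing, the normalized_losses column is read as a float column rather than as text, which matches the float64 type reported for that column in the real data.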

The final check we want to do is see what data types we have:

    df.dtypes

    symboling              int64
    normalized_losses    float64
    make                  object
    fuel_type             object
    aspiration            object
    num_doors             object
    body_style            object
    drive_wheels          object
    engine_location       object
    wheel_base           float64
    length               float64
    width                float64
    height               float64
    curb_weight            int64
    engine_type           object
    num_cylinders         object
    engine_size            int64
    fuel_system           object
    bore                 float64
    stroke               float64
    compression_ratio    float64
    horsepower           float64
    peak_rpm             float64
    city_mpg               int64
    highway_mpg            int64
    price                float64
    dtype: object

Since this article will only focus on encoding the categorical variables, we are going to include only the object columns in our dataframe. Pandas has a helpful select_dtypes function which we can use to build a new dataframe containing only the object columns.

    obj_df = df.select_dtypes(include=['object']).copy()
    obj_df.head()

              make  fuel_type  aspiration  num_doors   body_style  drive_wheels  engine_location  engine_type  num_cylinders  fuel_system
    0  alfa-romero        gas         std        two  convertible           rwd            front         dohc           four         mpfi
    1  alfa-romero        gas         std
