Outlier removal in Python using IQR rule

My previous post ‘Outlier removal in R using IQR rule’ has been one of the most visited posts on here. So now let’s have a look at it in Python. This time we’ll be using Pandas and NumPy, along with the Titanic dataset. As a little extra, we will also log transform the data.

If you are really interested in identifying outliers (or novelty detection), I would recommend this paper as a good starting point:

Pimentel, M.A.F., Clifton, D.A., Clifton, L., Tarassenko, L., 2014. A review of novelty detection. Signal Processing.

This is just a quick example to get you started.

Load the packages and the data

Here I have thrown in a couple of extra lines: the pd.set_option calls make pandas display only 10 rows but up to 200 columns. I always set things up like this. We’re also using Seaborn for some box plots. I always import Seaborn and call set_style and set_context, which makes the Matplotlib plots look better (publication quality).

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
MAX_ROWS = 10
pd.set_option('display.max_rows', MAX_ROWS)
pd.set_option('display.max_columns', 200)
sns.set_style("whitegrid")
sns.set_context("paper")

Next we load the data. As always, mine is saved in a subfolder called ‘Data’…

df = pd.read_csv('Data/Titanic.csv')
# View
df

We should see the full dataset. Note: it has 891 rows and 12 columns.

Visualise the data

We are going to work with the Fare variable, so let’s have a look at it…

i = 'Fare'
plt.figure(figsize=(10,8))
plt.subplot(211)
plt.xlim(df[i].min(), df[i].max()*1.1)
ax = df[i].plot(kind='kde')
plt.subplot(212)
plt.xlim(df[i].min(), df[i].max()*1.1)
sns.boxplot(x=df[i])

Note: here I am setting i to ‘Fare’. I do this sometimes to save having to change it in many places if I want to view other variables. We are also setting the x-axis limits to run from the variable’s min to 1.1 times its max.

You should see this…

[Figure: density plot (top) and box plot (bottom) of Fare]

Here we have two plots, the density plot and the box plot. This is a good way to view the data: the density plot (top) shows that there are some data points in the tails, but they are difficult to see, whereas they are clear in the box plot (thank you Seaborn).

Transform the data

Next we are going to butcher the data. I use the word butcher because I’m going to get rid of lots of rows just for this demonstration.

# Remove any zeros (otherwise we get -inf)
df.loc[df.Fare == 0, 'Fare'] = np.nan
# Drop NA
df.dropna(inplace=True)
# Log Transform
df['Log_' + i] = np.log(df[i])

So what have we done?

We set all zero values in Fare to NaN. A zero causes a problem with a log transform because np.log(0) is -inf.

We drop all the rows with a NaN in any column. This is a bit extreme in this example (change df.dropna(inplace=True) to df.dropna(subset=['Fare'], inplace=True) to keep more data).

We create a new variable called 'Log_' + i where i is 'Fare', so the new variable is Log_Fare.
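The gentler variant mentioned above can be sketched on a small made-up frame. The values and the Age column here are purely illustrative and are not taken from the Titanic file; the point is only that rows missing something other than Fare survive the drop.

```python
import numpy as np
import pandas as pd

# Toy data standing in for the Titanic frame (values are made up)
toy = pd.DataFrame({
    "Fare": [7.25, 0.0, 71.28, 8.05, 512.33],
    "Age": [22.0, 40.0, np.nan, 35.0, 36.0],
})

# Treat zero fares as missing so np.log never sees a zero
toy.loc[toy["Fare"] == 0, "Fare"] = np.nan

# Drop only the rows where Fare is missing; the row with a missing Age is kept
toy = toy.dropna(subset=["Fare"]).copy()

# Log transform into a new column
toy["Log_Fare"] = np.log(toy["Fare"])
```

With a plain dropna() the row with the missing Age would have been lost as well, which is what happens to many rows in the full dataset.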

Plot the values as before, changing i to ‘Log_Fare’…

i = 'Log_Fare'
plt.figure(figsize=(10,8))
plt.subplot(211)
plt.xlim(df[i].min(), df[i].max()*1.1)
ax = df[i].plot(kind='kde')
plt.subplot(212)
plt.xlim(df[i].min(), df[i].max()*1.1)
sns.boxplot(x=df[i])

And we get this….

Determine the Min and Max

Next we need to determine the min and max cutoffs for detecting the outliers. As discussed here we do this…

Step 1, get the Interquartile Range

Step 2, calculate the upper and lower values
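Written out, the two steps look like this, with Q1 and Q3 denoting the 25th and 75th percentiles and 1.5 being the conventional multiplier (the same one used in the code in this post). The symbol names are my own shorthand:

```latex
\begin{aligned}
\mathrm{IQR} &= Q_3 - Q_1 \\
\text{min} &= Q_1 - 1.5 \times \mathrm{IQR} \\
\text{max} &= Q_3 + 1.5 \times \mathrm{IQR}
\end{aligned}
```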

In Python this is…

q75, q25 = np.percentile(df.Log_Fare.dropna(), [75, 25])
iqr = q75 - q25
# Note: the names min and max shadow Python's built-in functions from here on
min = q25 - (iqr*1.5)
max = q75 + (iqr*1.5)
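If you want to reuse the calculation for other variables, it could be wrapped in a small helper. This is only a sketch: iqr_bounds and its k parameter are names I have made up for illustration, and returning the cutoffs as lower and upper avoids overwriting Python's built-in min and max.

```python
import numpy as np
import pandas as pd

def iqr_bounds(values, k=1.5):
    """Return (lower, upper) cutoffs from the IQR rule, ignoring NaNs."""
    values = pd.Series(values).dropna()
    q25, q75 = np.percentile(values, [25, 75])
    iqr = q75 - q25
    return q25 - k * iqr, q75 + k * iqr

# Toy example: the quartiles are 3 and 7, so the IQR is 4
lower, upper = iqr_bounds([1, 2, 3, 4, 5, 6, 7, 8, 100])
```

On the dataframe from this post the call would be iqr_bounds(df['Log_Fare']).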

We can visualise this using similar code as shown above by adding plt.axvline.

i = 'Log_Fare'
plt.figure(figsize=(10,8))
plt.subplot(211)
plt.xlim(df[i].min(), df[i].max()*1.1)
plt.axvline(x=min)
plt.axvline(x=max)
ax = df[i].plot(kind='kde')
plt.subplot(212)
plt.xlim(df[i].min(), df[i].max()*1.1)
sns.boxplot(x=df[i])
plt.axvline(x=min)
plt.axvline(x=max)

Finishing touches

Now let’s identify the outliers. First we create a new variable in the dataframe called ‘Outlier’, defaulted to 0; then, if a row is outside this range, we set it to 1. Note: i should still be ‘Log_Fare’.

df['Outlier'] = 0
df.loc[df[i] < min, 'Outlier'] = 1
df.loc[df[i] > max, 'Outlier'] = 1
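As an aside, the same flag can be built in one step with pandas' Series.between. This is an alternative sketch on made-up values and cutoffs, not the code used above; lower and upper stand in for the min and max cutoffs.

```python
import pandas as pd

# Toy values and cutoffs, for illustration only
toy = pd.DataFrame({"Log_Fare": [0.5, 2.0, 2.5, 3.0, 6.5]})
lower, upper = 1.0, 5.0

# between() is inclusive by default, so a value exactly on a cutoff is not
# flagged, which matches the strict < and > comparisons.
# Caveat: a NaN would be flagged as 1 here, so drop NaNs first.
toy["Outlier"] = (~toy["Log_Fare"].between(lower, upper)).astype(int)
```

Here the first and last rows get an Outlier value of 1 and the rest stay at 0.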

Now we can plot the original data and the data without the outliers (Clean Data).
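One possible way to do that is sketched here. The helper names split_clean and plot_before_after are my own, the toy frame is made up, and I have used plain Matplotlib histograms rather than the Seaborn plots from earlier to keep the example short.

```python
import pandas as pd

def split_clean(df, flag="Outlier"):
    """Return a copy of the rows whose outlier flag is 0."""
    return df.loc[df[flag] == 0].copy()

def plot_before_after(df, col, flag="Outlier"):
    """Histogram of col for all rows (top) and for the clean rows (bottom)."""
    # Imported here so split_clean can be used without Matplotlib installed
    import matplotlib.pyplot as plt
    clean = split_clean(df, flag)
    fig, (ax_all, ax_clean) = plt.subplots(2, 1, figsize=(10, 8), sharex=True)
    ax_all.hist(df[col].dropna(), bins=20)
    ax_all.set_title("Original data")
    ax_clean.hist(clean[col].dropna(), bins=20)
    ax_clean.set_title("Clean data")
    return fig

# Toy frame that already carries an Outlier flag
toy = pd.DataFrame({
    "Log_Fare": [0.1, 2.0, 2.2, 2.5, 3.0, 7.0],
    "Outlier": [1, 0, 0, 0, 0, 1],
})
clean = split_clean(toy)
```

With the dataframe from this post you would call plot_before_after(df, 'Log_Fare').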

Summary

A quick breakdown of what we have done. We loaded the data into Python and removed any rows with missing data. We then used a log transform to transform the data (ideally to a more Gaussian distribution). Then we determined a min and max value and used them to identify which values are outliers.
