Python Pandas Groupby Tutorial

In this Pandas groupby tutorial we are going to learn how to organize Pandas dataframes by groups. More specifically, we are going to learn how to group by one and multiple columns. Furthermore, we are going to learn how to calculate some basic summary statistics (e.g., mean, median), convert Pandas groupby to dataframe, calculate the percentage of observations in each group, and many more useful things.

More about working with Pandas: Pandas Dataframe Tutorial

First of all we are going to import pandas as pd and read a CSV file into a dataframe using the read_csv function. In the example below, we use index_col=0 because the first column in the dataset is the index column.

import pandas as pd

data_url = 'http://vincentarelbundock.github.io/Rdatasets/csv/carData/Salaries.csv'
df = pd.read_csv(data_url, index_col=0)
df.head()

We used Pandas head to see the first 5 rows of our dataframe. In the output of df.head() we can see that we have at least three variables that we can group our data by: “rank”, “discipline”, and “sex”. Of course, we could also group it by yrs.since.phd or yrs.service, but that would probably result in a lot of groups. As previously mentioned, we are going to use Pandas groupby to group a dataframe based on one, two, three, or more columns.
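
To get a feeling for which columns make sensible grouping keys, one option is to count the distinct values in each column. The sketch below uses a small made-up CSV with the same column names as the Salaries data (the rows are invented for illustration) and reads it from a string, so it runs without a network connection.

```python
import io
import pandas as pd

# Hypothetical sample rows; only the column names mirror the Salaries dataset
csv_text = '''"","rank","discipline","yrs.since.phd","yrs.service","sex","salary"
"1","Prof","B",19,18,"Male",139750
"2","Prof","B",20,16,"Male",173200
"3","AsstProf","B",4,3,"Male",79750
"4","AssocProf","A",12,8,"Female",83900
"5","AsstProf","A",2,1,"Female",72500
'''

# index_col=0 uses the first (unnamed) column as the row index
toy = pd.read_csv(io.StringIO(csv_text), index_col=0)

# Distinct values per column, i.e. the number of groups each column would give
n_groups = toy.nunique()
print(n_groups)
```

Columns with few distinct values (here rank, discipline, and sex) give a handful of groups, whereas a column such as yrs.since.phd can give almost one group per row.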

Data can be loaded from other file formats as well (e.g., Excel, HTML, JSON):

Pandas Excel Tutorial: How to Read and Write Excel Files
Explorative Data Analysis with Pandas, SciPy, and Seaborn includes a short introduction to Pandas read_html

Python Pandas Groupby Example

We are starting with the simplest example: grouping by one column. In the Pandas groupby example below we are going to group by the column “rank”.

There are many different methods that we can use on Pandas groupby objects (and Pandas dataframe objects). All available methods on a Python object can be found using this code:

import IPython.utils.text

# Grouping by one factor
df_rank = df.groupby('rank')

# Getting all public methods from the groupby object
meth = [method_name for method_name in dir(df_rank)
        if callable(getattr(df_rank, method_name))
        and not method_name.startswith('_')]

# Printing the result in columns
print(IPython.utils.text.columnize(meth))

Note that in the code example above we also import IPython to print the list in columns. In the following examples we are going to use some of these methods. First, we can use the groups attribute to get a dictionary of groups:

df_rank.groups

We can also use the groupby method get_group to filter the grouped data. In the next code example we are going to select the Assistant Professor group (i.e., “AsstProf”).

# Get group
df_rank.get_group('AsstProf').head()
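
To make the relation between groups and get_group concrete, here is a minimal sketch on a toy dataframe (the values are invented for illustration and are not taken from the Salaries data). It shows that groups maps each group label to the row labels belonging to it, and that get_group returns those rows as an ordinary dataframe.

```python
import pandas as pd

# Hypothetical toy data, invented for illustration
toy = pd.DataFrame({
    'rank': ['Prof', 'AsstProf', 'Prof', 'AssocProf', 'AsstProf'],
    'salary': [139750, 79750, 173200, 83900, 72500],
})
toy_rank = toy.groupby('rank')

# .groups maps each group label to the index labels of its rows
group_rows = {label: list(rows) for label, rows in toy_rank.groups.items()}
print(group_rows)

# get_group returns the rows of a single group as a dataframe
asst = toy_rank.get_group('AsstProf')
print(asst)
```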

Pandas Groupby Count

If we want to find out how big each group is (i.e., how many observations there are in each group), we can use .size() to count the number of rows in each group:

df_rank.size()

# Output:
#
# rank
# AssocProf     64
# AsstProf      67
# Prof         266
# dtype: int64

Additionally, we can use the Pandas groupby count method to count the non-missing values by group and get a result for every column in the dataframe. If we don’t have any missing values, the number should be the same for each column within a group. Thus, this is a way we can explore the dataset and see if there are any missing values in any column.

df_rank.count()
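
The difference between size and count only shows up when values are missing. The following sketch assumes a toy dataframe in which some salaries are missing (the values are invented); subtracting count from size then gives the number of missing salaries per group.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with two missing salaries
toy = pd.DataFrame({
    'rank': ['Prof', 'Prof', 'AsstProf', 'AsstProf', 'AsstProf'],
    'salary': [139750, np.nan, 79750, 72500, np.nan],
})
by_rank = toy.groupby('rank')

sizes = by_rank.size()              # rows per group, missing values included
counts = by_rank['salary'].count()  # non-missing salaries per group
n_missing = sizes - counts          # missing salaries per group
print(n_missing)
```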

That was how to use Pandas size to count the number of rows in each group. We will return to this later, when we are grouping by multiple columns. In some cases we may want to find out the number of unique values in each group. This can be done using the groupby method nunique:

df_rank.nunique()

As can be seen in the last column (salary), there are 63 unique salaries among the Associate Professors, 53 among the Assistant Professors, and 261 among the Professors. In this example we have a complete dataset, so comparing these numbers with the group sizes shows that some people have the same salary (e.g., there are 261 unique values in the column salary for the 266 Professors). As we will see, if we have missing values in the dataframe we get a different result. In the next example we are using the Pandas mask method together with NumPy’s random.random to insert missing values (i.e., np.nan) in roughly 10% of the dataframe:

import numpy as np

df_null = df.mask(np.random.random(df.shape) < .1)
df_null.isnull().sum().reset_index(name='N Missing Values')
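
Because the missing values are drawn at random, the counts change on every run. If a repeatable result is wanted, one option (an assumption on our part, not something the example above does) is to use a seeded NumPy generator. The sketch below uses a made-up dataframe and checks that about 10% of the cells end up missing.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)  # seeded so the result is repeatable

# Hypothetical toy data: two columns with 1000 rows each
toy = pd.DataFrame({'a': range(1000), 'b': range(1000)})

# True marks the cells to blank out; each cell has a 10% chance
to_blank = rng.random(toy.shape) < 0.1
toy_null = toy.mask(to_blank)

# Share of missing values per column, expected to be close to 0.1
share_missing = toy_null.isnull().mean()
print(share_missing)
```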

Note that isnull().sum() returns a Pandas series with the column names as its index, and that we used the reset_index method above to turn it into a dataframe with a regular index. In this particular example we used the parameter name to name the count column (“N Missing Values”). This parameter, however, can only be used on Pandas series objects and not on dataframe objects.

That said, let’s return to the example; if we run the same code as above (counting unique values by group) we can see that it will not count missing values:

df_null.groupby('rank').nunique()
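
A small sketch of this behaviour, using invented values: by default nunique ignores missing values, while passing dropna=False counts NaN as one additional distinct value in each group that contains it.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing salary in each group
toy = pd.DataFrame({
    'rank': ['Prof', 'Prof', 'Prof', 'AsstProf', 'AsstProf'],
    'salary': [100.0, 100.0, np.nan, 80.0, np.nan],
})
by_rank = toy.groupby('rank')['salary']

without_nan = by_rank.nunique()           # NaN ignored (the default)
with_nan = by_rank.nunique(dropna=False)  # NaN counted as a distinct value
print(without_nan)
print(with_nan)
```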

That is, we don’t get the same numbers in the two tables because of the missing values. In the following examples we are going to work with Pandas groupby to calculate the mean, median, and standard deviation by one group.

Pandas Groupby Mean

If we want to calculate the mean salary grouped by one column (rank, in this case) it’s simple. We just use Pandas mean method on the grouped dataframe:

df_rank['salary'].mean().reset_index()

Having a column named salary may not be informative. For instance, if someone else is going to see the table they may not know that it’s the mean salary for each group. Luckily, we can add the rename method to the above code to rename the columns of the grouped data:

df_rank['salary'].mean().reset_index().rename(
    columns={'rank': 'Rank', 'salary': 'Mean Salary'})
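
As a variation, the value column can be named directly in reset_index, since the grouped mean of a single column is a Pandas series and reset_index on a series accepts the name parameter. The sketch below uses invented salaries, so the numbers are not those of the Salaries data.

```python
import pandas as pd

# Hypothetical toy data, invented for illustration
toy = pd.DataFrame({
    'rank': ['Prof', 'Prof', 'AsstProf', 'AsstProf'],
    'salary': [140000, 160000, 70000, 80000],
})

# reset_index(name=...) names the value column in one step
mean_salary = (
    toy.groupby('rank')['salary']
       .mean()
       .reset_index(name='Mean Salary')
       .rename(columns={'rank': 'Rank'})
)
print(mean_salary)
```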

Median Score of a Group Using the groupby Method in Pandas

Now let’s group by the rank of the academic and find the median salary in the next Pandas groupby example:

df.groupby('rank')['salary'].median().reset_index().rename(
    columns={'rank': 'Rank', 'salary': 'MedianSalary'})
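
One reason to look at the median as well as the mean is that the median is less affected by a few extreme values. The following sketch, with invented salaries, puts one unusually high salary in the Prof group and compares the two statistics.

```python
import pandas as pd

# Hypothetical toy data; the 300000 salary is a deliberate outlier
toy = pd.DataFrame({
    'rank': ['Prof', 'Prof', 'Prof', 'AsstProf', 'AsstProf', 'AsstProf'],
    'salary': [100000, 110000, 300000, 70000, 75000, 80000],
})
by_rank = toy.groupby('rank')['salary']

medians = by_rank.median()  # middle value in each group
means = by_rank.mean()      # pulled upwards by the outlier in the Prof group
print(medians)
print(means)
```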

Aggregate Data by Group using Pandas Groupby

Most of the time we wan
