Quantcast
Channel: CodeSection,代码区,Python开发技术文章_教程 - CodeSec
Viewing all articles
Browse latest Browse all 9596

Python Pandas Groupby Tutorial

$
0
0

In this Pandas group by we are going to learn how to organize Pandas dataframes by groups. More specifically, we are going to learn how to group by one and multiple columns. Furthermore, we are going to learn how calculate some basics summary statistics (e.g., mean, median), convert Pandas groupby to dataframe, calculate the percentage of observations in each group, and many more useful things.

More about working with Pandas: Pandas Dataframe Tutorial

First of all we are going to import pandas as pd, and read a CSV file, using the read_csv method, to a dataframe. In the example below, we use index_col=0 because the first row in the dataset is the index column.

import pandas as pd data_url = 'http://vincentarelbundock.github.io/Rdatasets/csv/carData/Salaries.csv' df = pd.read_csv(data_url, index_col=0) df.head()
Python Pandas Groupby Tutorial

We used Pandas head to se the first 5 rows of our dataframe. In the image above we can see that we have, at least, three variables that we can group our data by. That is, we can group our data by “rank”, “discipline”, and “sex”. Of course, we could also group it by yrs.since.phd or yrs.service but it may be a lot of groups. As previously mentioned we are going to use Pandas groupby to group a dataframe based on one, two, three, or more columns.

Data can be loaded from other file formats as well (e.g., Excel, HTML, JSON):

Pandas Excel Tutorial: How to Read and Write Excel Files Explorative Data Analysis with Pandas, SciPy, and Seaborn includes a short introduction to Pandas read_html python Pandas Groupby Example

We are starting with the simplest example; grouping by one column. In the Pandas groupby example below we are going to group by the column “rank”.

There are many different methods that we can use on Pandas groupby objects (and Pandas dataframe objects). All available methods on a Python object can be found using this code:

import IPython # Grouping by one factor df_rank = df.groupby('rank') # Getting all methods from the groupby object: meth = [method_name for method_name in dir(df_rank) if callable(getattr(df_rank, method_name)) & ~method_name.startswith('_')] # Printing the result print(IPython.utils.text.columnize(meth))
Python Pandas Groupby Tutorial

Note, that in the code example above we also import IPython to print the list in columns. In the following examples we are going to use some of these methods. First, we can print out the groups by using the groups method to get a dictionary of groups:

df_rank.groups
Python Pandas Groupby Tutorial

We can also use the groupby method get_group to filter the grouped data. In the next code example we are going to select the Assistant Professor group (i.e., “AsstProf”).

# Get group df_rank.get_group('AsstProf').head()
Python Pandas Groupby Tutorial
Pandas Groupby Count

If we want to find out how big each group is (e.g., how many observations in each group), we can use use .size() to count the number of rows in each group:

df_rank.size() # Output: # # rank # AssocProf 64 # AsstProf67 # Prof 266 # dtype: int64

Additionally, we can also use Pandas groupby count method to count by group(s) and get the entire dataframe. If we don’t have any missing values the number should be the same for each column and group. Thus, this is a way we can explore the dataset and see if there are any missing values in any column.

df_rank.count()
Python Pandas Groupby Tutorial

That was how to use Pandas size to count the number of rows in each group. We will return to this, later, when we are grouping by multiple columns. Now we are going to In some cases we may want to find out the number of unique values in each group. This can be done using the groupby method nunique :

df_rank.nunique()
Python Pandas Groupby Tutorial

As can be seen in the the last column (salary) there are 63 Associate Professors, 53 Assistant Proffessors, and 261 Professors in the dataset. In this example we have a complete dataset and we can see that some have the same salary (e.g., there are 261 unique values in the column salary for Professors). As we will see if we have missing values in the dataframe we would get a different result. In the next example we are using Pandas mask method together with NumPy’s random.random to insert missing values (i.e., np.NaN) in 10% of the dataframe:

df_null = df.mask(np.random.random(df.shape) < .1) df_null.isnull().sum().reset_index(name='N Missing Values')
Python Pandas Groupby Tutorial

Note, we used the reset_index method above to get the multi-level indexed grouped dataframe to become a single indexed. In the particular example, above, we used the parameter name to name the count column (“N Missing Values”). This parameter, however, can only be used on Pandas series objects and not dataframe objects.

That said, let’s return to the example; if we run the same code as above (counting unique values by group) we can see that it will not count missing values:

df_null.groupby('rank').nunique()
Python Pandas Groupby Tutorial

That is, we don’t get the same numbers in the two tables because of the missing values. In the following examples we are going to work with Pandas groupby to calculate the mean, median, and standard deviation by one group.

Pandas Groupby Mean

If we want to calculate the mean salary grouped by one column (rank, in this case) it’s simple. We just use Pandas mean method on the grouped dataframe:

df_rank['salary'].mean().reset_index()
Python Pandas Groupby Tutorial

Having a column named salary may not be useful. For instance, if someone else are going to see the table they may not know that it’s the mean salary for each group. Luckily, we can add the rename method to the above code to rename the columns of the grouped data:

df_rank['salary'].mean().reset_index().rename( columns={'rank':'Rank','salary' : 'Mean Salary'})
Python Pandas Groupby Tutorial
Median Score of a Group Using the groupby Method in Pandas

Now lets group by disciplne of the academic and find the median salary in the next Pandas groupby example

df.groupby('rank')['salary'].median().reset_index().rename( columns={'rank':'Rank','salary' : 'MedianSalary'})
Python Pandas Groupby Tutorial
Aggregate Data by Group using Pandas Groupby Most of the time we wan

Viewing all articles
Browse latest Browse all 9596

Trending Articles