Part of learning how to use any tool is exploring its strengths and weaknesses. I’m just starting to use the Python library Pandas, and my naïve use of it exposed a weakness that surprised me.
Background
Thanks to bradleypjohnson for sharing this Lucky Charms photo under CC BY 2.0.
I have a long list of objects, each with the properties “color” and “shape”. I want to count the frequency of each color/shape combination. A sample of what I’m trying to achieve could be represented in a grid like this:
        circle  square  star
blue         8      41    18
orange       5      33    25
red         53      64    58
At first I implemented this with a dictionary of collections.Counter instances, where the top-level dictionary is keyed by shape, like so:
import collections

SHAPES = ('square', 'circle', 'star', )
frequencies = {shape: collections.Counter() for shape in SHAPES}
Then I counted my frequencies using the code below. (For simplicity, assume that my objects are simple 2-tuples of (shape, color).)
for shape, color in all_my_objects:
    frequencies[shape][color] += 1
So far, so good.
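As a quick sanity check, here is that dict-of-Counters tally run end to end on a few made-up (shape, color) tuples; the sample data below is mine, invented for illustration:

```python
import collections

SHAPES = ('square', 'circle', 'star')
frequencies = {shape: collections.Counter() for shape in SHAPES}

# Hypothetical sample data standing in for all_my_objects.
all_my_objects = [
    ('circle', 'blue'),
    ('star', 'red'),
    ('circle', 'blue'),
    ('square', 'orange'),
]

for shape, color in all_my_objects:
    frequencies[shape][color] += 1

print(frequencies['circle']['blue'])  # counts both blue circles
```

Missing combinations don’t need any special handling here: a Counter returns 0 for keys it has never seen.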
Enter the Pandas

This looked to me like a perfect opportunity to use a Pandas DataFrame, which would nicely support the operations I wanted to do after tallying the frequencies, like adding a column to represent the total number (sum) of instances of each color.
It was especially easy to try out a DataFrame because my counting loop ( for...all_my_objects ) wouldn’t change, only the definition of frequencies. (Note that the code below requires that I know in advance all the possible colors I can expect to see, which the Dict + Counter version does not. This isn’t a problem for me in my real-world application.)
import pandas as pd

frequencies = pd.DataFrame(columns=SHAPES, index=COLORS, data=0,
                           dtype='int')
for shape, color in all_my_objects:
    frequencies[shape][color] += 1

It Works, But…
Both versions of the code get the job done, but using the DataFrame as a frequency counter turned out to be astonishingly slow. A DataFrame is simply not optimized for repeatedly accessing individual cells as I do above.
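For what it’s worth, pandas does provide a scalar accessor built for this access pattern: DataFrame.at, which takes the row label first and then the column label. It is substantially faster than the chained frequencies[shape][color] lookups, though still slower than a plain dict. A minimal sketch, again with made-up sample data:

```python
import pandas as pd

SHAPES = ('square', 'circle', 'star')
COLORS = ('red', 'blue', 'orange')
frequencies = pd.DataFrame(columns=SHAPES, index=COLORS, data=0, dtype='int')

# Hypothetical sample data standing in for all_my_objects.
all_my_objects = [('circle', 'blue'), ('circle', 'blue'), ('star', 'red')]

for shape, color in all_my_objects:
    # .at is pandas' fast scalar accessor: row label first, then column label.
    frequencies.at[color, shape] += 1
```

Note the reversed order relative to the chained version: .at addresses [row, column], while frequencies[shape][color] selects a column first and then a row within it.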
How Slow is it?

To isolate the effect pandas was having on performance, I used Python’s timeit module to benchmark some simpler variations on this code. In the version of Python I’m using (3.6), the default number of iterations for each timeit test is 1 million.
First, I timed how long it takes to increment a simple variable, just to get a baseline.
Second, I timed how long it takes to increment a variable stored inside a collections.Counter inside a dict . This mimics the first version of my code (above) for a frequency counter. It’s more complex than the simple variable version because Python has to resolve two hash table references (one inside the dict , and one inside the Counter ). I expected this to be slower, and it was.
Third, I timed how long it takes to increment one cell inside a 2×2 NumPy array. Since Pandas is built atop NumPy, this gives an idea of how the DataFrame’s backing store performs without Pandas involved.
Fourth, I timed how long it takes to increment one cell inside a 2×2 Pandas DataFrame. This is what I had used in my real code.
Raw Benchmark Results

Here’s what timeit showed me. Sorry for the cramped formatting.
$ python
Python 3.6.0 (v3.6.0:41df79263a11, Dec 22 2016, 17:23:13)
[GCC 4.2.1 (Apple Inc. build 5666) (dot 3)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import timeit
>>> timeit.timeit('data += 1', setup='data=0')
0.09242476700455882
>>> timeit.timeit('data[0][0]+=1',setup='from collections import Counter;data={0:Counter()}')
0.6838196019816678
>>> timeit.timeit('data[0][0]+=1',setup='import numpy as np;data=np.zeros((2,2))')
0.8909121589967981
>>> timeit.timeit('data[0][0]+=1',setup='import pandas as pd;data=pd.DataFrame(data=[[0,0],[0,0]],dtype="int")')
157.56428507200326
>>>

Benchmark Results Summary
Here’s a summary of the results from above (decimals truncated at 3 digits). The rightmost column shows the results normalized so the fastest method (incrementing a simple variable) equals 1.
                    Actual (seconds)   Normalized
Simple variable                0.092          1
Dict + Counter                 0.683      7.398
Numpy 2D array                 0.890      9.639
Pandas DataFrame             157.564   1704.784

As you can see, resolving the index references in the middle two cases (Dict + Counter in one case, NumPy array indices in the other) slows things down, which should come as no surprise. The NumPy array is a little slower than the Dict + Counter.
The DataFrame, however, is roughly 175 to 230 times slower than either of those two methods (177× the NumPy array, 230× the Dict + Counter). Ouch!
I can’t really even give you a graph of all four of these methods together because the time consumed by the DataFrame throws the chart scale out of whack.
Here’s a bar chart of the first three methods:

[bar chart: simple variable, Dict + Counter, and NumPy 2D array timings]

Here’s a bar chart of all four:

[bar chart: all four methods; the DataFrame bar dwarfs the other three]
Why Is My DataFrame Access So Slow?
One of the nice features of DataFrames is that they support dictionary-like labels for rows and columns. For instance, if I define my frequencies to look like this:
>>> SHAPES = ('square', 'circle', 'star', )
>>> COLORS = ('red', 'blue', 'orange')
>>> pd.DataFrame(columns=SHAPES, index=COLORS, data=0, dtype='int')
square circle star
red 0 0 0
blue 0 0 0
orange 0 0 0
>>>

Then frequencies['square']['orange'] is a valid reference.
Not only that, DataFrames support a variety of indexing and slicing options, including:

- A single label, e.g. 5 or 'a'
- A list or array of labels, e.g. ['a', 'b', 'c']
- A slice object with labels, e.g. 'a':'f'
- A boolean array
- A callable function with one argument

Here are those techniques applied in order to the frequencies DataFrame so you can see how they work:
>>> frequencies['star']
red 0
blue 0
orange 0
Name: star, dtype: int64
>>> frequencies[['square', 'star']]
square star
red 0 0
blue 0 0
orange 0 0
>>>
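The transcript above covers the first two techniques (a single label and a list of labels). Here is my own sketch of the remaining three (a label slice, a boolean array, and a callable) applied to the same frequencies DataFrame; the variable names are mine:

```python
import pandas as pd

SHAPES = ('square', 'circle', 'star')
COLORS = ('red', 'blue', 'orange')
frequencies = pd.DataFrame(columns=SHAPES, index=COLORS, data=0, dtype='int')

# A slice of row labels; unlike ordinary Python slices,
# label slices include BOTH endpoints.
first_two_rows = frequencies.loc['red':'blue']

# A boolean array with one entry per row.
red_and_orange = frequencies[[True, False, True]]

# A callable that takes the DataFrame and returns a valid indexer
# (here, a list of column labels).
just_stars = frequencies[lambda df: ['star']]
```

The inclusive-endpoint behavior of label slices is easy to trip over if you are used to positional slicing, which is why it is worth calling out.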