
Pandas Surprise

Summary

Part of learning how to use any tool is exploring its strengths and weaknesses. I’m just starting to use the Python library Pandas, and my naïve use of it exposed a weakness that surprised me.

Background
Thanks to bradleypjohnson for sharing this Lucky Charms photo under CC BY 2.0.

I have a long list of objects, each with the properties “color” and “shape”. I want to count the frequency of each color/shape combination. A sample of what I’m trying to achieve could be represented in a grid like this:

        circle  square  star
blue         8      41    18
orange       5      33    25
red         53      64    58

At first I implemented this with a dictionary of collections.Counter instances, where the top-level dictionary is keyed by shape, like so:

import collections
SHAPES = ('square', 'circle', 'star', )
frequencies = {shape: collections.Counter() for shape in SHAPES}

Then I counted my frequencies using the code below. (For simplicity, assume that my objects are simple 2-tuples of (shape, color).)

for shape, color in all_my_objects:
    frequencies[shape][color] += 1
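
For concreteness, here’s the whole thing as a runnable snippet. The sample data is invented purely for illustration; in the real application the objects come from elsewhere.

import collections

SHAPES = ('square', 'circle', 'star', )

# Made-up sample objects, each a (shape, color) 2-tuple.
all_my_objects = [
    ('square', 'blue'),
    ('square', 'blue'),
    ('star', 'red'),
    ('circle', 'orange'),
]

frequencies = {shape: collections.Counter() for shape in SHAPES}

for shape, color in all_my_objects:
    frequencies[shape][color] += 1

print(frequencies['square'])  # Counter({'blue': 2})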

So far, so good.

Enter the Pandas

This looked to me like a perfect opportunity to use a Pandas DataFrame, which would nicely support the operations I wanted to do after tallying the frequencies, like adding a column to represent the total number (sum) of instances of each color.

It was especially easy to try out a DataFrame because my counting loop (for...all_my_objects) wouldn’t change, only the definition of frequencies. (Note that the code below requires that I know in advance all the possible colors I can expect to see, which the Dict + Counter version does not. This isn’t a problem for me in my real-world application.)

import pandas as pd

COLORS = ('red', 'blue', 'orange')

frequencies = pd.DataFrame(columns=SHAPES, index=COLORS, data=0,
                           dtype='int')

for shape, color in all_my_objects:
    frequencies[shape][color] += 1

It Works, But…

Both versions of the code get the job done, but using the DataFrame as a frequency counter turned out to be astonishingly slow. A DataFrame is simply not optimized for repeatedly accessing individual cells as I do above.

How Slow is it?

To isolate the effect Pandas was having on performance, I used Python’s timeit module to benchmark some simpler variations on this code. In the version of Python I’m using (3.6), the default number of iterations for each timeit test is 1 million.

First, I timed how long it takes to increment a simple variable, just to get a baseline.

Second, I timed how long it takes to increment a variable stored inside a collections.Counter inside a dict. This mimics the first version of my code (above) for a frequency counter. It’s more complex than the simple variable version because Python has to resolve two hash table references (one inside the dict, and one inside the Counter; see the sketch after this list). I expected this to be slower, and it was.

Third, I timed how long it takes to increment one cell inside a 2×2 NumPy array. Since Pandas is built atop NumPy, this gives an idea of how the DataFrame’s backing store performs without Pandas involved.

Fourth, I timed how long it takes to increment one cell inside a 2×2 Pandas DataFrame. This is what I had used in my real code.
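
To make the Dict + Counter case concrete, the “two hash table references” unroll like this in terms of the original counting loop (a minimal self-contained sketch):

import collections

frequencies = {'square': collections.Counter()}
shape, color = 'square', 'blue'

# The single statement frequencies[shape][color] += 1 performs two lookups:
counter = frequencies[shape]  # first hash lookup, in the outer dict
counter[color] += 1           # second hash lookup (plus a store), in the Counter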

Raw Benchmark Results

Here’s what timeit showed me. Sorry for the cramped formatting.

$ python
Python 3.6.0 (v3.6.0:41df79263a11, Dec 22 2016, 17:23:13)
[GCC 4.2.1 (Apple Inc. build 5666) (dot 3)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import timeit
>>> timeit.timeit('data += 1', setup='data=0')
0.09242476700455882
>>> timeit.timeit('data[0][0]+=1',setup='from collections import Counter;data={0:Counter()}')
0.6838196019816678
>>> timeit.timeit('data[0][0]+=1',setup='import numpy as np;data=np.zeros((2,2))')
0.8909121589967981
>>> timeit.timeit('data[0][0]+=1',setup='import pandas as pd;data=pd.DataFrame(data=[[0,0],[0,0]],dtype="int")')
157.56428507200326
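
If you’d rather run the comparison as a standalone script than retype it in the REPL, here’s a sketch that wraps the same four timeit calls. Absolute numbers will of course vary with your machine and library versions.

import timeit

TESTS = (
    ('Simple variable', 'data += 1',
     'data = 0'),
    ('Dict + Counter', 'data[0][0] += 1',
     'from collections import Counter; data = {0: Counter()}'),
    ('NumPy 2D array', 'data[0][0] += 1',
     'import numpy as np; data = np.zeros((2, 2))'),
    ('Pandas DataFrame', 'data[0][0] += 1',
     'import pandas as pd; data = pd.DataFrame(data=[[0, 0], [0, 0]], dtype="int")'),
)

for name, statement, setup in TESTS:
    # timeit.timeit() runs the statement 1 million times by default.
    print('{:<16} {:10.3f}'.format(name, timeit.timeit(statement, setup=setup)))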
Benchmark Results Summary

Here’s a summary of the results from above (decimals truncated at three digits). The rightmost column shows the results normalized so that the fastest method (incrementing a simple variable) equals 1.

Method             Actual (seconds)   Normalized
Simple variable               0.092        1
Dict + Counter                0.683        7.398
NumPy 2D array                0.890        9.639
Pandas DataFrame            157.564     1704.784
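
The normalized column is just each measurement divided by the simple variable baseline. Using the full-precision numbers from the timeit session above:

# Full-precision timings copied from the interpreter session above.
TIMINGS = (
    ('Simple variable', 0.09242476700455882),
    ('Dict + Counter', 0.6838196019816678),
    ('NumPy 2D array', 0.8909121589967981),
    ('Pandas DataFrame', 157.56428507200326),
)

baseline = TIMINGS[0][1]
for name, seconds in TIMINGS:
    # Note: this rounds, whereas the table above truncates.
    print('{:<16} {:8.3f} {:9.3f}'.format(name, seconds, seconds / baseline))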

As you can see, resolving the index references in the middle two cases (Dict + Counter in one case, NumPy array indices in the other) slows things down, which should come as no surprise. The NumPy array is a little slower than the Dict + Counter.

The DataFrame, however, is roughly 175-230 times slower than either of those two methods. Ouch!

I can’t really even give you a graph of all four of these methods together because the time consumed by the DataFrame throws the chart scale out of whack.

Here’s a bar chart of the first three methods:

[Bar chart: timings for the simple variable, Dict + Counter, and NumPy array methods]

And here’s a bar chart of all four:

[Bar chart: all four methods, with the DataFrame bar dwarfing the others]
Why Is My DataFrame Access So Slow?

One of the nice features of DataFrames is that they support dictionary-like labels for rows and columns. For instance, if I define my frequencies to look like this:

>>> SHAPES = ('square', 'circle', 'star', )
>>> COLORS = ('red', 'blue', 'orange')
>>> frequencies = pd.DataFrame(columns=SHAPES, index=COLORS, data=0, dtype='int')
>>> frequencies
        square  circle  star
red          0       0     0
blue         0       0     0
orange       0       0     0
>>>

Then frequencies['square']['orange'] is a valid reference.
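
For example, continuing the session above:

>>> frequencies['square']['orange']
0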

Not only that, DataFrames support a variety of indexing and slicing options, including:

- A single label, e.g. 5 or 'a'
- A list or array of labels, e.g. ['a', 'b', 'c']
- A slice object with labels, e.g. 'a':'f'
- A boolean array
- A callable function with one argument

Here are the first two of those techniques applied to the frequencies DataFrame so you can see how they work (the remaining three appear in the sketch after this session):

>>> frequencies['star']
red 0
blue 0
orange 0
Name: star, dtype: int64
>>> frequencies[['square', 'star']]
        square  star
red          0     0
blue         0     0
orange       0     0
>>>
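
And here’s a sketch of the remaining three techniques (a label slice, a boolean array, and a callable) applied to the same frequencies object:

>>> frequencies['red':'blue']        # slice of row labels (inclusive)
        square  circle  star
red          0       0     0
blue         0       0     0
>>> frequencies[[True, False, True]] # boolean array selects rows
        square  circle  star
red          0       0     0
orange       0       0     0
>>> frequencies[lambda df: 'star']   # callable returning an indexer
red       0
blue      0
orange    0
Name: star, dtype: int64
>>>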
