
Notes from a Jupyter/Dataframe project


I’m looking at some League of Legends data: champion win rates by player tier. I’m curious which champions have higher win rates with skilled players, and/or which are more popular. You can see the resulting report here; the notebook (without the nice CSS) is here.

Learning me some Jupyter and Pandas

Really I’m using this project to get more comfortable with using Jupyter and Pandas DataFrames to analyze data. The DataFrames part is overkill; I’ve got about 10k of data, not the 10G of numbers that DataFrames’ highly optimized code is built for. But DataFrames have lots of functions that replicate SQL’s analysis capabilities, and they display nicely in Jupyter, so here I am.

My data starts as web pages of HTML tables. I imported those into a single Google Spreadsheet with multiple tabs. In a single sheet, rows are champions (120 or so) and columns are statistics like win rate or games played (5 of them). There are 7 sheets, one per tier: Bronze / Silver / Gold / … I then exported those sheets as CSV files, one per tier, for analysis. (Google Spreadsheets turns out to be a remarkably good HTML table data scraper.)

I then played with various forms of Python code for reading those CSV files and doing some statistics on them, starting with simple raw Python data structures (a bunch of dicts and tuples) and working all the way up to a full Pandas Panel. Along the way I learned how to analyze the data with Pandas a bit, and how to format the results nicely in Jupyter. My real goal here is to make interactive reporting as natural as if I’d loaded all the data into a SQL database: being able to do natural things like sum all values, or get the average of numbers grouped by some criterion.

Data loading

One thing I found is that the more I whipped my data into Pandas shape, the simpler my code got. I finally settled on doing all the parsing way up front, when I read the CSV files, to make clean DataFrame objects with well-named columns and rows. That makes the later analysis code much simpler.

The read_csv() function has a lot of options to configure parsing; it’s worth the time to set them. thousands=",", for instance, lets it parse things like “1,523,239” as an integer and not a string. I also ended up passing header=0 and specifying my own names=[…] for the columns, just so I could make the names valid Python identifiers. (You can access named things in a DataFrame with attribute syntax: either df[“winrate”] or df.winrate.)

But the most important thing for read_csv was index_col. The concept of index in a DataFrame is particularly important. That’s the unique key that names each row; in my data the index is the champion name. In a CSV file this would probably be the column named “label” or the like. In a database table it’d be the primary key. The word index is used all over the docs and it confuses me all the time. Once I set up my data index correctly the code got a lot better.
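To make that concrete, here’s a minimal sketch of how those options fit together (the file name and column names are made up for illustration):

    import pandas as pd

    # Hypothetical file and column names; the parsing options are the point.
    df = pd.read_csv(
        "bronze.csv",
        thousands=",",           # parse "1,523,239" as the integer 1523239
        header=0,                # the file has a header row, but...
        names=["champion", "winrate", "pickrate", "games"],  # ...override its names
        index_col="champion",    # the champion name becomes the row index
    )
    df.winrate                   # attribute access works since the names are identifiers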

The thing I find most confusing is the shape of data. DataFrames, for instance, mostly seem to want to treat a column of numbers as a single feature, but I really need to work row-wise most of the time. Indexing a DataFrame is particularly bizarre: df[“winrate”] with a single column label gives you a column, but df[n:n+2] with a slice gives you rows n and n+1. I find I often have to transpose my DataFrame to make operations convenient, or use the “axis” parameter to functions like sum() to indicate whether I want to sum over columns or over rows.
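A quick illustration of the asymmetry (column name hypothetical):

    df["winrate"]     # a single label selects a *column* (as a Series)
    df[0:2]           # a slice selects *rows* 0 and 1
    df.sum(axis=0)    # sum down each column: one value per column
    df.sum(axis=1)    # sum across each row: one value per row
    df.T              # transpose when the orientation fights you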

The final fillip was using a Pandas Panel to store my data. A Panel is a collection of DataFrames, in my case one per tier. This 3-dimensional data structure is a bit confusing; the docs apologize for the names, and it’s not clear that Panel adds a lot of value over a dict of DataFrame objects. But it’s Pandas’ way of representing a collection of spreadsheets, so I am going with it. My panel’s “items” are the tier names, the major_axis is the champion rows (named by the index), and the minor_axis is the stats (the columns).
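A sketch of the construction, assuming tier_frames is a dict mapping tier names to the per-tier DataFrames. (Panel was deprecated in pandas 0.20 and later removed, so on current pandas a plain dict of DataFrames is the way to go.)

    # tier_frames = {"Bronze": df_bronze, "Silver": df_silver, ...}  (hypothetical)
    panel = pd.Panel(tier_frames)
    panel.items         # the tier names
    panel.major_axis    # the champion names (the shared row index)
    panel.minor_axis    # the stat columns
    panel["Gold"]       # back out to a single tier's DataFrame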

Data analysis

So now I have my data loaded into a Panel; time to analyze. The pattern I’ve settled on is generating a new DataFrame that serves as a report table. For instance, I’m interested in average statistics per tier. So I start with a 3d Panel of tiers x champions x stats, and I boil that down to a 2d DataFrame of tiers x average(stats). When I do ad-hoc Python report generation I usually just iteratively print out report lines, or maybe make an array of dicts for my data. Those are still fine too, but I wanted to go full Pandas.
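A minimal sketch of that reduction, assuming the Panel above (iteritems() yields one (tier, DataFrame) pair per sheet):

    # One row per tier, one column per stat, each cell an average over champions.
    tier_means = pd.DataFrame(
        {tier: frame.mean() for tier, frame in panel.iteritems()}
    ).T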

I started out by writing lots of laborious loops over panel.iteritems() and the like. See cell 7, where I first build up a dict named ‘spreads’ and then convert it to a DataFrame. Over time I’ve refined that to simply doing operations like dividing a DataFrame by a Series. Cell 8 shows that style: the creation of pick_rates is about the same computation as spreads before, but it’s less code. The whole process feels a lot like vectorizing code, like I did for my machine learning class in Octave/MATLAB. The code gets shorter, which is nice, but it should also be more efficient, which matters for large datasets.
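The flavor of the change, with hypothetical column names (not the notebook’s actual cells):

    # Loop style: accumulate a dict, then convert.
    rates = {}
    for tier, frame in panel.iteritems():
        rates[tier] = frame.games / frame.games.sum()
    rates = pd.DataFrame(rates)

    # Vectorized style: the same computation as one expression.
    games = panel.minor_xs("games")    # champions x tiers DataFrame for one stat
    rates = games / games.sum()        # divide by a Series of per-tier totals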

Presentation

My final goal was to make nice-looking reports right in Jupyter. That proved to be quite difficult. Pandas is amazing at inlining graphs and visualizations with matplotlib and friends, but it’s not quite as good at textual reports. I mean, Jupyter will display any HTML you make in Python, but I haven’t found any good HTML generation convenience libraries out there for, say, DataFrames.
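If you do roll your own HTML, Jupyter will happily render it through IPython’s display machinery:

    from IPython.display import HTML

    HTML(df.to_html())    # to_html() is the stock table dump; any HTML string works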

Fortunately the default DataFrame presentation HTML is pretty good: labelled columns and rows, decent formatting of numbers. All that’s really missing is right alignment of numbers (and why is that not the default?!). But I want more control, for instance rounding floating point numbers to a specific precision. For that I used the DataFrame.style object, which gives you a D3-esque ability to apply per-datum CSS styles to cells. I mostly used .format() to control text formatting of numbers and .set_table_styles() to add some CSS to the HTML table. The .applymap() function is the really powerful one that lets you control things per-datum.
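A sketch of those calls, with illustrative formats and selectors (and assuming an all-numeric frame, since format() and applymap() touch every cell):

    styled = (
        df.style
        .format("{:.1%}")    # render 0.512 as "51.2%"
        .set_table_styles(
            [{"selector": "td", "props": [("text-align", "right")]}]
        )
        .applymap(lambda v: "color: red" if v < 0.5 else "")  # per-datum CSS
    )
    styled    # the last expression in a cell renders the styled table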

I also threw in the background_gradient style, to give a sort of heatmap coloring to the number cells. That required installing a bunch of crap like python-tk and numpy just to get the colormaps. But it’s pretty painless in the end.
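The heatmap itself is one call once the dependencies are in place:

    df.style.background_gradient(cmap="RdYlGn")    # per-column heatmap shading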

Conclusion

Pandas + Jupyter is a pretty good platform for ad hoc exploration of data, and with some learning and effort you can produce pretty clean code and reports. I love the way notebooks are reproducible and shareable. I think using SQL for this kind of work is probably more effective, but there’s a lot of overhead in getting set up with a database, and the interactive exploration isn’t as nice as a Jupyter notebook. (Hmmm, there’s no reason Jupyter cells couldn’t contain SQL code!)

