
Notes from a Jupyter/Dataframe project


I’m looking at some League of Legends data: champion win rates by player tier. I’m curious which champions have higher win rates with skilled players, and/or which are more popular. You can see the resulting report here; the notebook (without the nice CSS) is here.

Learning me some Jupyter and Pandas

Really I’m using this project to get more comfortable with using Jupyter and Pandas DataFrames to analyze data. The DataFrames part is overkill; I’ve got about 10k of data, not the 10G of numbers that DataFrames’ highly optimized code is built for. But DataFrames have lots of functions that replicate SQL’s analysis capabilities, and they display nicely in Jupyter, so here I am.

My data starts as web pages of HTML tables. I imported those into a single Google Spreadsheet with multiple tabs. In a single sheet, rows are champions (120 or so) and columns are statistics like win rate or games played (5 of them). There are 7 sheets, one per tier: Bronze / Silver / Gold / … I then exported those sheets as CSV files, one per tier, for analysis. (Google Spreadsheets turns out to be a remarkably good HTML table data scraper.)

I then played with various forms of Python code for reading those CSV files and doing some statistics on them, starting with simple raw Python data structures (a bunch of dicts and tuples) and working all the way up to a full Pandas Panel. Along the way I learned how to analyze the data with Pandas a bit, and how to format the results nicely in Jupyter. My real goal here is to make interactive reporting as natural as if I’d loaded all the data into a SQL database: being able to do natural things like sum all values, or get the average of numbers grouped by some criterion.

Data loading

One thing I found is that the more I whipped my data into Pandas shape, the simpler my code got. I finally settled on doing all the parsing way up front, when I read the CSV files, to make clean DataFrame objects with well-named columns and rows. That makes the later analysis code much simpler.

The read_csv() function has a lot of options to configure parsing; it’s worth the time to set them. thousands=",", for instance, lets it parse things like “1,523,239” as an integer and not a string. I also ended up passing header=0 and specifying my own names=[…] for the columns, just so I could make the names valid Python identifiers. (You can access named things in a DataFrame with attribute syntax: either df[“winrate”] or df.winrate.)

But the most important thing for read_csv was index_col. The concept of index in a DataFrame is particularly important. That’s the unique key that names each row; in my data the index is the champion name. In a CSV file this would probably be the column named “label” or the like. In a database table it’d be the primary key. The word index is used all over the docs and it confuses me all the time. Once I set up my data index correctly the code got a lot better.
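To make that concrete, here’s a minimal sketch of how those options fit together (the file name and column names are made up for illustration):

    import pandas as pd

    # Hypothetical file and column names; the parsing options are the point.
    df = pd.read_csv(
        "bronze.csv",
        thousands=",",           # parse "1,523,239" as the integer 1523239
        header=0,                # the file has a header row, but...
        names=["champion", "winrate", "pickrate", "games"],  # ...override its names
        index_col="champion",    # the champion name becomes the row index
    )
    df.winrate                   # attribute access works since the names are identifiers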

The thing I find most confusing is the shape of data. DataFrames, for instance, mostly seem to want to treat a column of numbers as a single feature, but I really need to work row-wise most of the time. Indexing a DataFrame is particularly bizarre: df[“winrate”] with a single column label gives you a column, but df[n:n+2] with a slice gives you rows n and n+1. I find I often have to transpose my DataFrame to make operations convenient, or use the “axis” parameter to functions like sum() to indicate whether I want to sum over columns or over rows.
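A quick illustration of the asymmetry (column name hypothetical):

    df["winrate"]     # a single label selects a *column* (as a Series)
    df[0:2]           # a slice selects *rows* 0 and 1
    df.sum(axis=0)    # sum down each column: one value per column
    df.sum(axis=1)    # sum across each row: one value per row
    df.T              # transpose when the orientation fights you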

The final fillip was using a Pandas Panel to store my data. A Panel is a collection of DataFrames, in my case one per tier. This 3-dimensional data structure is a bit confusing; the docs apologize for the names, and it’s not clear that Panel adds a lot of value over a dict of DataFrame objects. But it’s Pandas’ way of representing a collection of spreadsheets, so I am going with it. My panel’s “items” are the tier names, the major_axis is the champion rows (named by the index), and the minor_axis is the stats (the columns).
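A sketch of the construction, assuming tier_frames is a dict mapping tier names to the per-tier DataFrames. (Panel was deprecated in pandas 0.20 and later removed, so on current pandas a plain dict of DataFrames is the way to go.)

    # tier_frames = {"Bronze": df_bronze, "Silver": df_silver, ...}  (hypothetical)
    panel = pd.Panel(tier_frames)
    panel.items         # the tier names
    panel.major_axis    # the champion names (the shared row index)
    panel.minor_axis    # the stat columns
    panel["Gold"]       # back out to a single tier's DataFrame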

Data analysis

So now I have my data loaded into a Panel; time to analyze. The pattern I’ve settled on is generating a new DataFrame that serves as a report table. For instance, I’m interested in average statistics per tier. So I start with a 3d Panel of tiers x champions x stats, and I boil that down to a 2d DataFrame of tiers x average(stats). When I do ad-hoc Python report generation I usually just iteratively print out report lines, or maybe make an array of dicts for my data. Those are still fine too, but I wanted to go full Pandas.
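A minimal sketch of that reduction, assuming the Panel above (iteritems() yields one (tier, DataFrame) pair per sheet):

    # One row per tier, one column per stat, each cell an average over champions.
    tier_means = pd.DataFrame(
        {tier: frame.mean() for tier, frame in panel.iteritems()}
    ).T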

I started out by writing lots of laborious loops over panel.iteritems() and the like. See cell 7, where I first build up a dict named ‘spreads’ and then convert it to a DataFrame. Over time I’ve refined that to simply doing operations like dividing a DataFrame by a Series. Cell 8 shows that style: the creation of pick_rates is about the same computation as spreads before, but it’s less code. The whole process feels a lot like vectorizing code, like I did for my machine learning class in Octave/MATLAB. The code gets shorter, which is nice, but it should also be more efficient, which matters for large datasets.
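The flavor of the change, with hypothetical column names (not the notebook’s actual cells):

    # Loop style: accumulate a dict, then convert.
    rates = {}
    for tier, frame in panel.iteritems():
        rates[tier] = frame.games / frame.games.sum()
    rates = pd.DataFrame(rates)

    # Vectorized style: the same computation as one expression.
    games = panel.minor_xs("games")    # champions x tiers DataFrame for one stat
    rates = games / games.sum()        # divide by a Series of per-tier totals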

Presentation

My final goal was to make nice-looking reports right in Jupyter. That proved to be quite difficult. Pandas is amazing at inlining graphs and visualizations with matplotlib and friends, but it’s not quite as good at textual reports. I mean, Jupyter will display any HTML you make in Python, but I haven’t found any good HTML generation convenience libraries out there for, say, DataFrames.
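If you do roll your own HTML, Jupyter will happily render it through IPython’s display machinery:

    from IPython.display import HTML

    HTML(df.to_html())    # to_html() is the stock table dump; any HTML string works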

Fortunately the default DataFrame presentation HTML is pretty good: labelled columns and rows, decent formatting of numbers. All that’s really missing is right alignment of numbers (and why is that not the default?!). But I want more control, for instance rounding floating point numbers to a specific precision. For that I used the DataFrame.style object, which gives you a D3-esque ability to apply per-datum CSS styles to cells. I mostly used .format() to control text formatting of numbers and .set_table_styles() to add some CSS to the HTML table. The .applymap() function is the really powerful one that lets you control things per-datum.
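A sketch of those calls, with illustrative formats and selectors (and assuming an all-numeric frame, since format() and applymap() touch every cell):

    styled = (
        df.style
        .format("{:.1%}")    # render 0.512 as "51.2%"
        .set_table_styles(
            [{"selector": "td", "props": [("text-align", "right")]}]
        )
        .applymap(lambda v: "color: red" if v < 0.5 else "")  # per-datum CSS
    )
    styled    # the last expression in a cell renders the styled table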

I also threw in the background_gradient style, to give a sort of heatmap coloring to the number cells. That required installing a bunch of crap like python-tk and numpy just to get the colormaps. But it’s pretty painless in the end.
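The heatmap itself is one call once the dependencies are in place:

    df.style.background_gradient(cmap="RdYlGn")    # per-column heatmap shading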

Conclusion

Pandas + Jupyter is a pretty good platform for ad hoc exploration of data, and with some learning and effort you can produce pretty clean code and reports. I love the way notebooks are reproducible and shareable. I think using SQL for this kind of work is probably more effective, but there’s a lot of overhead in getting set up with a database, and the interactive exploration isn’t as nice as a Jupyter notebook. (Hmmm, there’s no reason Jupyter cells couldn’t contain SQL code!)

