A Dramatic Tour through Python’s Data Visualization Landscape (including ggplot ...

Why Even Try, Man?

I recently came upon Brian Granger and Jake VanderPlas’s Altair, a promising young visualization library. Altair seems well-suited to addressing Python’s ggplot envy, and its tie-in with JavaScript’s Vega-Lite grammar means that as the latter develops new functionality (e.g., tooltips and zooming), Altair benefits, seemingly for free!

Indeed, I was so impressed by Altair that the original thesis of my post was going to be: “Yo, use Altair.”

But then I began ruminating on my own Pythonic visualization habits and, in a painful moment of self-reflection, realized I’m all over the place: I use a hodgepodge of tools and disjointed techniques depending on the task at hand (usually whichever library I first used to accomplish that task).

This is no good. As the old saying goes: “The unexamined plot is not worth exporting to a PNG.”

Thus, I’m using my discovery of Altair as an opportunity to step back ― to investigate how Python’s visualization options hang together. I hope this investigation proves helpful for you as well.

How’s This Gonna Go?

The conceit of this post will be: “You need to do Thing X. How would you do Thing X in matplotlib? pandas? Seaborn? ggplot? Altair?” By doing many different Thing X’s, we’ll develop a reasonable list of pros, cons, and takeaways ― or at least a whole bunch of code that might be somehow useful.

(Warning: this all may happen in the form of a two-act play.)

The Options (in ~Descending Order of Subjective Complexity)

First, let’s welcome our friends:

matplotlib

The 800-pound gorilla: like most 800-pound gorillas, this one should probably be avoided unless you genuinely need its power, e.g., to make a really custom plot or produce a publication-ready graphic.

pandas

“Come for the DataFrames; stay for the plotting convenience functions that are arguably more pleasant than the matplotlib code they supplant.” ― rejected pandas taglines

(Bonus tidbit: the pandas team must include a few visualization nerds, as the library includes things like RadViz plots and Andrews Curves that I haven’t seen elsewhere.)

Seaborn

Seaborn has long been my go-to library for statistical visualization; it summarizes itself thusly:

“If matplotlib ‘tries to make easy things easy and hard things possible,’ seaborn tries to make a well-defined set of hard things easy too.”

yhat’s ggplot

A Python implementation of the grammar of graphics. This isn’t a “feature-for-feature port of ggplot2,” but there’s strong feature overlap. (And speaking as a part-time R user, the main geoms seem to be in place.)

Altair

The new guy, Altair is a “declarative statistical visualization library” with an exceedingly pleasant API.

Wonderful. Now that our guests have arrived and checked their coats, let’s settle in for our very awkward dinner conversation. Our show is entitled…

Little Shop of Python Visualization Libraries (starring all libraries as themselves)

ACT I: LINES AND DOTS

(In Scene 1, we’ll be dealing with a tidy data set named “ts.” It consists of three columns: a “dt” column (for dates); a “value” column (for values); and a “kind” column, which has four unique levels: A, B, C, and D. Here’s a preview…)

              dt kind     value
    0 2000-01-01    A  1.442521
    1 2000-01-02    A  1.981290
    2 2000-01-03    A  1.586494
    3 2000-01-04    A  1.378969
    4 2000-01-05    A -0.277937

Scene 1: How would you plot multiple time series on the same graph?
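
(A stage note for anyone following along at home: the construction of “ts” isn’t shown here, so the snippet below is only a stand-in. It assumes 100 daily observations per series and random-walk values from a seeded generator. Those choices are mine, not the original data, so the numbers won’t match the preview.)

```python
import numpy as np
import pandas as pd

# Assumption: 100 daily points per series, starting 2000-01-01.
rng = np.random.default_rng(0)
dates = pd.date_range("2000-01-01", periods=100, freq="D")

# Build one small frame per level of "kind", then stack them into
# a single tidy frame with columns dt, kind, value.
frames = []
for kind in ["A", "B", "C", "D"]:
    frames.append(
        pd.DataFrame(
            {
                "dt": dates,
                "kind": kind,
                # Assumption: a random walk stands in for the real values.
                "value": rng.standard_normal(len(dates)).cumsum(),
            }
        )
    )

ts = pd.concat(frames, ignore_index=True)
print(ts.head())
```

With a frame shaped like this in hand, the code in the scenes that follow should run as written.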

matplotlib: Ha! Haha! Beyond simple. While I could and would accomplish this task in any number of complex ways, I know your feeble brains would crumble under the weight of their ingenuity. Hence, I dumb it down, showing you two simple methods. In the first, I loop through your trumped-up matrix ― I believe you peons call it a “Data” “Frame” ― and subset it to the relevant time series. Next, I invoke my “plot” method and pass in the relevant columns from that subset.

    # MATPLOTLIB
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots(1, 1, figsize=(7.5, 5))

    # draw one line per level of "kind"
    for k in ts.kind.unique():
        tmp = ts[ts.kind == k]
        ax.plot(tmp.dt, tmp.value, label=k)

    ax.set(xlabel='Date', ylabel='Value', title='Random Timeseries')
    ax.legend(loc=2)
    fig.autofmt_xdate()
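
(Another aside from the wings: the loop above subsets the frame once per level of “kind.” A variation that some find tidier is to let pandas hand over each group via groupby. The sketch below is my own variant, not the code from the scene, and it rebuilds a small stand-in “ts” with made-up random-walk values so that it runs on its own. The Agg backend line is an assumption for headless runs; drop it in a notebook.)

```python
import matplotlib

matplotlib.use("Agg")  # assumption: no display available; remove in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Stand-in tidy data (assumed shape: dt, kind, value; values are random walks).
rng = np.random.default_rng(0)
dates = pd.date_range("2000-01-01", periods=50, freq="D")
ts = pd.concat(
    [
        pd.DataFrame(
            {"dt": dates, "kind": k, "value": rng.standard_normal(50).cumsum()}
        )
        for k in ["A", "B", "C", "D"]
    ],
    ignore_index=True,
)

fig, ax = plt.subplots(1, 1, figsize=(7.5, 5))

# groupby yields (key, sub-frame) pairs, with keys sorted by default,
# so each iteration plots exactly one time series.
for kind, grp in ts.groupby("kind"):
    ax.plot(grp["dt"], grp["value"], label=kind)

ax.set(xlabel="Date", ylabel="Value", title="Random Timeseries")
ax.legend(loc=2)
fig.autofmt_xdate()
```

The result should match the loop version line for line; the only difference is who does the subsetting.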

MPL: Next, I enlist this chump (*motions to pandas*) , and have him pivot this “Data” “Frame” so that it looks like this…

    # in matplotlib-land, the notion of a tidy
    # dataframe matters not
    dfp = ts.pivot(index='dt', columns='kind', values='value')
    dfp.head()

    kind               A         B         C         D
    dt
    2000-01-01  1.442521  1.808741  0.437415  0.096980
    2000-01-02  1.981290  2.277020  0.706127 -1.523108
    2000-01-03  1.586494  3.474392  1.358063 -3.100735
    2000-01-04  1.378969  2.906132  0.262223 -2.660599
    2000-01-05 -0.277937  3.489553  0.796743 -3.417402
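
(If pivot is new to you, here is a minimal sketch on a hypothetical four-row frame. The name “tidy” and its values are made up for illustration. It shows the tidy-to-wide reshape and, via melt, the trip back. One caveat worth knowing: pivot raises a ValueError when the same (dt, kind) pair appears more than once; pivot_table, which aggregates duplicates, is the usual fallback.)

```python
import pandas as pd

# Hypothetical tidy frame: one row per (date, kind) observation.
tidy = pd.DataFrame(
    {
        "dt": pd.to_datetime(
            ["2000-01-01", "2000-01-01", "2000-01-02", "2000-01-02"]
        ),
        "kind": ["A", "B", "A", "B"],
        "value": [1.0, 2.0, 3.0, 4.0],
    }
)

# Tidy -> wide: dates become the index, each kind becomes a column.
wide = tidy.pivot(index="dt", columns="kind", values="value")
print(wide)

# Wide -> tidy again: melt the kind columns back into rows.
back = (
    wide.reset_index()
    .melt(id_vars="dt", var_name="kind", value_name="value")
    .sort_values(["dt", "kind"])
    .reset_index(drop=True)
)
print(back)
```

The wide form is the shape matplotlib’s plot call is happy to take in one go; the tidy form is the shape the grammar-of-graphics libraries tend to prefer.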

MPL: By transforming the data into a date index with four columns, one for each line I want to plot, I can do the whole thing in one fell swoop (i.e., a single call of my “plot” function).

    # MATPLOTLIB
    fig, ax = plt.subplots(1, 1, figsize=(7.5, 5))
    ax.plot(dfp)
    ax.set(xlabel='Date', ylabel='Value', title='Random Timeseries')
    ax.legend(dfp.columns, loc=2)
    fig.autofmt_xdate()

pandas (*looking timid*): That was great, Mat. Really great. Thanks for including me. I do the same thing ― hopefully as good?

(*smiles weakly*)

    # PANDAS
    fig, ax = plt.subplots(1, 1, figsize=(7.5, 5))
    dfp.plot(ax=ax)
    ax.set(xlabel='Date', ylabel='Val
