
Python for Analytics - Exploring Data with Pandas

A Crack Team!

At Rittman Mead, we're always encouraged to branch out and pursue new skills in the field, improving our skill sets and becoming more technically fluent as a result. Another great thing about working here is that, while we all have a really solid foundation in Oracle technologies, many of us possess an incredibly diverse range of skills, fostered by years of working in tech-agnostic engagements. It's a great feeling to know that if you ever run up against some sort of bizarre challenge, or have to integrate some piece of arcane tech into an architectural solution, more than likely someone, somewhere within Rittman Mead has had to do it. It is this freedom to pursue other technical exploits, within reason of course, that has nurtured a real spirit of innovation around the company within the past year. Our GitHub is overflowing with open source visualizations and performance monitoring and maintenance scripts that are simply there for the taking! We've put a lot of time into developing this stuff so our clients and partners don't have to.

But I digress. This blog is about Python, and well, I haven't really mentioned it up until this point. It is in this spirit of innovation, learning, and frankly, automating the boring stuff, that a lot of us have been pursuing automation and analytical endeavors using the Python language. It has been touted by many as THE language for data science, and rightfully so, given its accessibility and healthy selection of libraries perfectly suited to the task, such as NumPy, Seaborn, Pandas, and Matplotlib. In today's exercise, we're going to walk through common data munging, transformation, and visualization tasks using some of these libraries in order to deliver quick insights into a data set that's near and dear to my heart: Game of Thrones battles and character deaths!


Through this process, we will be creating our own data narrative that will help to expound upon the idle numbers and labels of the data set. We'll see that the process is less a hard-and-fast set of rules for approaching data exploration, and something more akin to solving a crime, clue by clue, letting the data tell the story.

PYTHON FOR DATA SCIENCE

Aside from its myriad community-driven and community-maintained libraries, the greatest thing about Python, to me anyway, is its relatively low barrier to entry. Even with little to no previous programming experience, an enterprising lad or lady can get up and running, performing basic, practical programming tasks in no time. You'll be amazed at how quickly you'll start coming up with daily tasks that you can throw a bit (or a lot) of Python at.

Today, we'll be tackling some tasks like these, common to the everyday processes of data analysis and data science. Utilizing the Pandas library, in addition to a few others, we'll see how we can programmatically go from question to answer in no time, and with almost any structured or unstructured data set. The primary audience of this blog will be those with a bit of Python fluency, in addition to those with an interest in data science and analytics. I will, however, be explaining the steps and providing a Jupyter notebook (link here) for those who wish to follow along or who might need the extra guidance. So don't bail now! Let's get to it.

In this instance, we'll be downloading the Game of Thrones data set from Kaggle, a great site that provides open data sets and accompanying analysis projects in multiple languages. Download the data set if you'd like to follow along.

GETTING STARTED
Let's begin by taking some steps to get our heads on straight and carve out a clear work flow. I find this is really helpful, especially in programming and/or analytical scenarios where one can begin to suffer from "analysis paralysis". So, at a high level, we'll be doing the following:

First, we'll take a cursory look at the Python libraries we'll be incorporating into our data sleuthing exercise, how they're used, and some examples of their output and ideal use cases.

Next we'll use the tools in these libraries to take a deeper dive into our data set and start to construct our initial line of questioning. This is where we get to be a bit creative in coming up with how we're going to wrap our heads around the data, and what kind of questions we're going to throw at it.

We'll then chase down any leads, incorporating additional analyses where necessary, and begin to construct a narrative about our data set. At this point we'll be formulating hypotheses and attempting to construct visualizations that will help us to further or disprove our investigation.

PANDAS IN THE JUNGLE

Any great detective must always have with them a toolkit with which to thoroughly examine any crime scene, and that's essentially what we have in the Pandas, Seaborn, and NumPy ("num-pie") libraries for the Python programming language. They provide a set of functions and methods that can take an input, or a number of inputs, do some magic, and then provide us with lots of really useful information. So let's begin by examining these libraries and what we can do with each.

Pandas and Numpy

Pandas is great at doing a bunch of really common tasks related to data exploration, including, but not limited to, indexing and selection, merging and joining data sets, grouping and aggregations, and visualizing data. This will be the library with which we'll be doing a lot of the heavy lifting. Pandas also provides us with the DataFrame object, which greatly expands on NumPy's comparatively more rigid ndarray object. These 'objects' are simply containers that hold data of some kind, and allow us to interact with that data.
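To make the ndarray versus DataFrame contrast a little more concrete, here is a minimal sketch. The values and the column names `house` and `troops` are made up for illustration and do not come from the Game of Thrones files.

```python
import numpy as np
import pandas as pd

# A NumPy ndarray holds values of a single dtype and is addressed by position.
troops = np.array([1000, 2500, 400])
print(troops.mean())  # 1300.0

# A DataFrame adds labelled columns, mixed types, and missing-value handling.
df = pd.DataFrame({
    "house": ["A", "B", "A"],        # toy labels, not real houses
    "troops": [1000, np.nan, 400],   # NaN marks a missing value
})

# Select a column by label and aggregate per group; NaN is skipped by default.
per_house = df.groupby("house")["troops"].sum()
print(per_house)
```

The same grouping on a bare ndarray would need manual masking, which is the kind of heavy lifting Pandas takes off our hands.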

Matplotlib and Seaborn

Matplotlib is a robust visualization library built to enable interactive, MATLAB-style plotting on almost any platform or back-end. This library, along with Seaborn, should be your go-to for producing super malleable graphs and visualizations. Working alongside Matplotlib, Seaborn pitches itself as a go-to for statistics-based visualizations, but it also supports complex grid and algorithm-based charts. Both of these libraries will help us to make quick and insightful decisions about our data, and help us to gather evidence further supporting, or disproving, any hypotheses we might form.
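As a quick taste of the plotting side, here is a small sketch of a bar chart in plain Matplotlib. The categories and counts are toy values; the `Agg` backend line is only there so the snippet runs without a display, and can be dropped inside a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; remove this line in Jupyter
import matplotlib.pyplot as plt

labels = ["A", "B", "C"]  # toy categories
values = [3, 7, 5]        # toy counts

# One figure, one set of axes, one bar per category.
fig, ax = plt.subplots(figsize=(4, 3))
bars = ax.bar(labels, values, color="steelblue")
ax.set_xlabel("category")
ax.set_ylabel("count")
ax.set_title("Toy bar chart")
fig.tight_layout()
```

Seaborn builds on the same figure and axes objects, so anything drawn with it can still be tweaked with the Matplotlib calls above.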

THE INVESTIGATION

Now that we've armed ourselves with the tools we need to thoroughly examine any potential data set we're given, let's do just that! Opening up the Game of Thrones csv files from above, let's first take a look at what kinds of information we have. The basic stats are as follows:

Synopsis

Battles - a complete listing of the battles in the book series and their stats! Attacker, defender, army size, you name it, we've got it.

Character Deaths - something the series/show is quite known for, who died? This contains some great info, such as allegiance and nobility status.

Character Predictions - The more morbid of the lot, this data set lists predictions on which character will die. We won't be using this sheet for our exercise.

A Hypothesis of Thrones

Having just finished the monumental series myself, you could say that I'm somewhat of a subject matter expert at this point, and that we have a situation not unlike that which you might find in any organization. We've got an interested party that wants to look further into some aspect of their data. We can use our investigatory tool-set to get real results and gain some valuable and informative insights. As subject matter experts though, we should ideally be coming at our data with at least some semblance of a hypothesis, or something that we're trying to prove (or disprove, for that matter) using our data. For the sake of this exercise, and fitting in with the theme of the data, I'm going to try and dig up an answer to the following:

Does House Lannister, for as evil and scheming as they are, and as much as they get away with, eventually get what's coming to them?


As much as I'd like to believe it's true, however, we're going to need to run the numbers, and let our data do the talking.

Importing the Data

You can now follow along in the Jupyter notebook. Working with our Pandas library, we first need to get our data into some sort of workable object. As we stated before, this is the data frame. It is simply a table-type object that is really good at handling empty values and data of many different types. We can easily perform operations on these objects and visualize them with minimal fuss. So, enough talk. Let's do it!

Working in your favorite IDE (PyCharm is easy to use and comes in a free version), we start a new project, import the libraries we need, and then drop in our first piece of code. This is the section that imports our csv data set and then converts it to a data frame. So, now that we have our object, what do we do with it?
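The import step described above might look something like the sketch below. The file name `battles.csv` and the column names are assumptions about the Kaggle download, and the two rows are invented stand-ins so the snippet runs without the real file.

```python
import io
import pandas as pd

# Stand-in for the downloaded file. Column names are assumed to match
# battles.csv; the rows are toy values, not real records.
sample_csv = io.StringIO(
    "name,year,attacker_1,defender_1,attacker_outcome,attacker_size,defender_size\n"
    "Toy Battle 1,298,Lannister,Tully,win,15000,4000\n"
    "Toy Battle 2,299,Stark,Lannister,win,,10000\n"
)

# With the real download this would be pd.read_csv("battles.csv").
battles = pd.read_csv(sample_csv)

print(battles.shape)   # (rows, columns)
print(battles.dtypes)  # the empty attacker_size cell is read as NaN
print(battles.head())
```

A first look at `shape`, `dtypes`, and `head()` is usually enough to spot which columns have holes before any real analysis starts.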


A Graph Has No Name

Now that we have our data frame object, we can begin to throw some code at it, crunch some numbers, and see if, in fact, the Lannisters really did get what was coming to them by the end of book 5. Starting with the battles data set, let's see how they fared in the field through the arc of the story. Did they lose more or fewer troops, comparatively? We can do this easily by breaking our data frame into smaller, more manageable chunks, and then graphing these data points accordingly. We are going to use the data set to build a step-by-step set of analyses that examine the Lannister victories and defeats throughout the story.

Battle / Troop Loss Over Time

Did the Lannisters hit a losing streak, or did they do well throughout the story? Did they win or lose more of their battles over time?

Start with a new data frame based on house and troop sizes:
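A minimal sketch of this step, assuming the battles frame has columns named `attacker_1`, `defender_1`, `attacker_size`, and `defender_size` (an assumption about the Kaggle file). The rows are toy values.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the battles data frame (invented rows, assumed column names).
battles = pd.DataFrame({
    "name": ["Toy Battle 1", "Toy Battle 2", "Toy Battle 3"],
    "year": [298, 299, 300],
    "attacker_1": ["Lannister", "Stark", "Lannister"],
    "defender_1": ["Tully", "Lannister", "Stark"],
    "attacker_size": [15000.0, np.nan, 3000.0],
    "defender_size": [4000.0, 10000.0, np.nan],
    "attacker_outcome": ["win", "win", "loss"],
})

# Keep only the columns that describe who fought and with how many troops.
cols = ["name", "year", "attacker_1", "attacker_size", "defender_1", "defender_size"]
troop_sizes = battles[cols].copy()  # copy() so later edits don't touch battles
print(troop_sizes)
```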


Filter to get new results (Lannisters only):
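One way to sketch the filter is with a boolean mask. This assumes the house name lives in a column called `attacker_1` and that we only care about battles where the Lannisters were the attacker; both are assumptions, and the rows are toy values.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the house and troop size frame (invented rows).
troop_sizes = pd.DataFrame({
    "year": [298, 299, 300],
    "attacker_1": ["Lannister", "Stark", "Lannister"],
    "attacker_size": [15000.0, np.nan, 3000.0],
})

# Boolean mask: True on the rows where the Lannisters attacked.
is_lannister = troop_sizes["attacker_1"] == "Lannister"
lannister = troop_sizes.loc[is_lannister].sort_values("year")
print(lannister)
```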


Right away we see we have some data issues; that is, there are some holes in the attacker size column. The good thing is that we can more or less look at this small table and get all the info we need from it right away. The numbers drop significantly through the years, and that's all there really is to it. But was this, in fact, because they lost more troops, or because they simply threw fewer at the problem as they began to carve out their claim to the kingdom? This analysis is not very telling. We're going to need to do some digging elsewhere to answer our question. Let's do some comparisons.
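The holes mentioned above can be counted rather than eyeballed. The sketch below uses toy values and an assumed `attacker_size` column; it counts the missing entries and then averages the known sizes per year.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Lannister-only frame (invented rows).
lannister = pd.DataFrame({
    "year": [298, 298, 299, 300],
    "attacker_size": [15000.0, np.nan, 6000.0, np.nan],
})

# How many battles have no recorded attacker size?
missing = lannister["attacker_size"].isna().sum()
print(missing)

# Mean attacker size per year. NaN is skipped inside a group, and a year
# with no recorded sizes at all stays NaN instead of becoming 0.
by_year = lannister.groupby("year")["attacker_size"].mean()
print(by_year)
```

Leaving the gaps as NaN keeps us honest: a year with no data should not read as a year with no troops.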

% of Battles Won / Lost

So how did the Lannisters do in the field? Of the 8 battles they fought in, how many did they win? How does this compare with the other armies of the Seven Kingdoms?

As we did before, let's get a new data frame together, and then do our grunt work on it to try and answer these questions. Grabbing the columns we need, let's run the numbers on how the Lannisters stack up against the other houses of Westeros in the field.

How many battles did they fight compared to the other Houses?
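Counting battles per house can be sketched by stacking the attacker and defender columns and counting the names. Only the primary columns `attacker_1` and `defender_1` are used here, which is an assumption about the file layout; the rows are toy values.

```python
import pandas as pd

# Toy stand-in for the battles frame (invented rows).
battles = pd.DataFrame({
    "name": ["Toy 1", "Toy 2", "Toy 3", "Toy 4"],
    "attacker_1": ["Lannister", "Stark", "Lannister", "Greyjoy"],
    "defender_1": ["Tully", "Lannister", "Stark", "Stark"],
})

# A house "fought" a battle if it appears on either side of it.
participants = pd.concat(
    [battles["attacker_1"], battles["defender_1"]], ignore_index=True
)
battles_fought = participants.value_counts()
print(battles_fought)
```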


How many did they actually win?
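Counting wins takes a little care if the outcome is recorded from the attacker's point of view. The sketch below assumes an `attacker_outcome` column holding "win" or "loss" (an assumption about the file), so a defender is credited with a win whenever the attacker lost. The rows are toy values.

```python
import pandas as pd

# Toy stand-in for the battles frame (invented rows).
battles = pd.DataFrame({
    "attacker_1": ["Lannister", "Stark", "Lannister", "Greyjoy"],
    "defender_1": ["Tully", "Lannister", "Stark", "Stark"],
    "attacker_outcome": ["win", "win", "loss", "win"],
})

# Attackers win on "win"; defenders win when the attacker records a "loss".
attacker_wins = battles.loc[battles["attacker_outcome"] == "win", "attacker_1"]
defender_wins = battles.loc[battles["attacker_outcome"] == "loss", "defender_1"]
wins = pd.concat([attacker_wins, defender_wins]).value_counts()

# Battles fought on either side, for the denominator.
fought = pd.concat([battles["attacker_1"], battles["defender_1"]]).value_counts()

# Division aligns on house name; houses with no wins come out NaN, so fill with 0.
win_rate = (wins / fought).fillna(0)
print(win_rate.sort_values(ascending=False))
```

Dividing by battles fought gives a win percentage, which is a fairer comparison than raw win counts when one house fights far more often than the rest.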


We can see right away that, out of all the battles they fought throughout the series (decidedly more than the other houses), the Lannisters came out on top. Could they be the dominating force on the field, as well as at court? The Starks are the only house that meets them conflict for conflict, and the Lannisters still reign supreme! Let's take things down to a finer grain and see how those who aligned themselves with the Lion did compared to those who didn't.

Death by Allegiance

Opening up our character deaths file, right away we see we have some pretty good info here. We have a laundry list of characters, their death year, and the house, if any, to which they were aligned. Let's start by building a data frame and, first, filtering out those who are unaligned, in the Night's Watch, or Wildlings. We want to get a comparison between houses, and these groups will just muck up the works. Let's do the numbers. We can then plot this info on a basic bar chart to get a quick rundown of the massacre.
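The filtering and counting described above could be sketched as follows. The column names `Allegiances` and `Death Year`, and the exact labels "None", "Night's Watch", and "Wildling", are assumptions about the character deaths file; the characters are toy rows.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; remove this line in Jupyter
import numpy as np
import pandas as pd

# Toy stand-in for the character deaths frame (invented rows).
deaths = pd.DataFrame({
    "Name": [f"Char {i}" for i in range(1, 8)],
    "Allegiances": ["Lannister", "Lannister", "Stark", "None",
                    "Night's Watch", "Wildling", "Lannister"],
    "Death Year": [299, 300, 299, 300, 299, 300, np.nan],  # NaN = still alive
})

# Keep characters who died and who belong to a house.
excluded = ["None", "Night's Watch", "Wildling"]
died = deaths["Death Year"].notna()
# notna() also guards against "None" having been read in as a missing value.
aligned = deaths["Allegiances"].notna() & ~deaths["Allegiances"].isin(excluded)
house_deaths = deaths.loc[died & aligned]

# Deaths per house, then a basic bar chart of the counts.
death_counts = house_deaths["Allegiances"].value_counts()
ax = death_counts.plot.bar()
ax.set_xlabel("house")
ax.set_ylabel("deaths")
print(death_counts)
```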


Things are starting to look up...depending on your point of view, I guess. The Lannisters, for all their dirty business, do seem to, in fact, lose the most named characters tied to their house. Of these, let's see how many were actually nobility, or rather, the most influential in furthering their cause!
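To see what share of each house's deaths were nobles, a grouped aggregation does the job. This sketch assumes a `Nobility` column coded as 1 for noble and 0 otherwise (an assumption about the file); the rows are toy values and say nothing about the real proportions.

```python
import pandas as pd

# Toy stand-in for the already filtered house deaths frame (invented rows).
house_deaths = pd.DataFrame({
    "Allegiances": ["Lannister", "Lannister", "Stark", "Stark", "Stark", "Stark"],
    "Nobility": [1, 0, 1, 0, 0, 0],  # assumed coding: 1 = noble, 0 = not
})

# Total deaths and noble deaths per house, then the noble share.
summary = house_deaths.groupby("Allegiances").agg(
    deaths=("Nobility", "size"),
    noble_deaths=("Nobility", "sum"),
)
summary["noble_share"] = summary["noble_deaths"] / summary["deaths"]
print(summary)
```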


It would seem our Lannisters aren't too good at keeping their hands clean and letting those of lesser station do their dirty work for them. Although they have the second most aligned character deaths in the series, roughly 75% of them are noble deaths, meaning that people important to their cause are dying! Unfortunately, the only other houses that come close are the Starks (the Red Wedding, no doubt) and the Greyjoys. What this also means, however, is that our claim is gathering more support; the Lannisters may have climbed the royal ladder, but at what cost?

Paying Your Debts

We can see from the donut chart above (excuse the repetition of colors) that the Lannisters do indeed have one of the highest percentages of total deaths out of all the major houses in the Seven Kingdoms. This actually goes quite a long way in backing up our hypothesis: that of all the named characters in the series, the Lannisters lost the lion's share (pun intended). The disconcerting thing is that they either seem to bring down many others with them, or the other noble houses aren't terribly great at keeping themselves among the living either.
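A donut chart of this kind can be sketched in Matplotlib as a pie chart whose wedges are given a width smaller than the radius. The counts below are toy values, not the real death totals.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; remove this line in Jupyter
import matplotlib.pyplot as plt
import pandas as pd

# Toy deaths per house (invented numbers).
death_counts = pd.Series({"Lannister": 2, "Stark": 1, "Greyjoy": 1})

# Each house's percentage of the total.
share = death_counts / death_counts.sum() * 100

# A pie with narrow wedges leaves a hole in the middle: a donut.
fig, ax = plt.subplots()
wedges, texts, autotexts = ax.pie(
    share.to_numpy(),
    labels=list(share.index),
    autopct="%1.0f%%",
    wedgeprops={"width": 0.4},
)
ax.set_aspect("equal")  # keep the ring circular
ax.set_title("Share of deaths by house (toy data)")
```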

Conclusion

Are these figures, combined with their high number of noble deaths, enough to satisfy my desire for vengeance? Did they truly reap what they have sown? I have to say I am ultimately undecided in the matter, as, although they did lose a great many, they in turn took a greater number down along with them. Any notion of vengeful satisfaction must be tempered by this fact: although the Lannisters did end up getting hit pretty hard with significant losses, this is bittersweet when compared to the real and lasting damage they did throughout the span of the books' and show's history. Were you able to come up with any additional evidence for or against my case? Link out and show us! Thanks for reading.

