
The data pipeline (Part 2)


This is Part 2 of a 3-part series on data science in name, theory and practice.

With all of today's emphasis on big data and machine learning, it's easy to overlook an important fact: data does not just appear out of nowhere - clean, structured and ready to be analyzed. In reality, data must first be gathered, interpreted, validated and transformed. In this post, I will outline these steps within a framework I call "the data pipeline," based on the flow of data from its extraction to its presentation.



The intended audience for this post is beginners entering the field of data science in an analyst or engineer role. Terminology and relevant software are in bold. Before expanding on the steps, note that most of the software alluded to in this series will be related to Python, because that is the language I am most familiar with. Regardless, the takeaway here is the flow of data in theory; in Part 3, I will demonstrate this flow in practice.

1. Data extraction

The first step to any analysis is getting the relevant data. Sometimes, in a B2B environment, you are provided the data by the client. More often, you have to get the data yourself. This can mean scraping it from a website, interfacing with an API or even streaming it from hardware. Once you are familiar with scraping data, a new realm of opportunities emerges: any data you see, you can scrape. (Disclaimer: you should never scrape resources prohibited by a website's Terms of Service).

Two generalized software solutions I've found are import.io, a free SaaS designed to scrape structured data (eg. listings on eBay), and Google Sheets' IMPORTHTML() function (eg. a table on ESPN). If these don't fit your use case, you will have to employ a programmatic solution.

The primary Python library for scraping your data is requests. When you visit a webpage, your browser sends an HTTP request to the web server with certain headers (eg. the page it wants, which browser it is), and sometimes cookies or a payload as well; the web server then replies with an HTTP response, which most often contains the webpage's Document Object Model (DOM). The DOM is the code (HTML tags and all) that your browser uses to render the webpage you see. Requests mimics some of your browser's basic functionality: you create an HTTP request, send it off to the web server and receive the webpage's HTTP response. But because you used requests and not a browser, you get that response in code/text format (ie. not a rendered webpage), which is exactly the format you need in order to parse it.
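As a minimal sketch of that round trip (the URL here is just a placeholder), fetching a page with requests looks something like this:

```python
import requests

# Send a GET request for a page; the response body comes back as raw
# HTML text rather than a rendered page.
url = "https://example.com/some-page"   # placeholder URL
response = requests.get(url)

response.raise_for_status()   # raise an error on a 4xx/5xx status
html = response.text          # the page's HTML, ready to be parsed
print(html[:500])             # peek at the first 500 characters
```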

Since you are scraping the data you first saw from your browser, it makes sense to copy the exact request your browser sent to the webpage. Some web servers won't let you access their pages unless it looks like the request was made from a browser. So head over to Chrome Developer Tools (F12 in Chrome) and click the Network tab to start "recording". Go to any website and you'll see all the requests your browser makes (eg. the page's images, stylesheets, the DOM itself). It can sometimes be a game of guess-and-check to find the right request in a sea of other requests (if the webpage uses AJAX, you can look just at the XHR filter), but once you find it, you can copy the exact information your browser sent and paste it into the parameters requests will use.
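A sketch of what that looks like in code - the header values below are placeholders; in practice you would paste in whatever your own browser sent, along with any cookies or payload:

```python
import requests

# Headers copied (in spirit) from the Network tab in Chrome Developer Tools.
# The values here are placeholders -- use what your own browser actually sent.
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Accept": "text/html,application/xhtml+xml",
    "Referer": "https://example.com/",
}

response = requests.get("https://example.com/some-page", headers=headers)
print(response.status_code)   # 200 means the server accepted the request
```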

Requests sends the HTTP request and receives the response, but that's as far as it goes. So while you get the webpage you were looking for (with the data within it), you get a bunch of HTML tags as well. This string of text represents unstructured data - there is no delineation between the information of interest and everything else, at least as far as your machine knows. To make sense of it, you need to structure it. When it comes to parsing the DOM, you should use BeautifulSoup. BeautifulSoup is an HTML parser that structures the DOM, allowing you to quickly isolate any set of elements and extract the data you need. For example, you can iterate over the rows in a table and store certain values in a list or dictionary.
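For instance, here's a sketch of pulling every row out of a hypothetical results table - the table id and cell layout are assumptions; you'd adapt them to the actual DOM you're scraping:

```python
from bs4 import BeautifulSoup

# `html` is the response.text from the request above; the table id is a
# placeholder -- inspect the DOM to find the element you actually need.
soup = BeautifulSoup(html, "html.parser")
table = soup.find("table", {"id": "results"})

rows = []
for tr in table.find_all("tr")[1:]:          # skip the header row
    cells = [td.get_text(strip=True) for td in tr.find_all("td")]
    rows.append(cells)

print(rows[:5])    # structured data: a list of lists, one per table row
```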

Less commonly, your HTTP response won't be the DOM but instead a JSON response. You should be familiar with data serialization formats (mainly JSON and XML), though it's not really necessary to get too deep here. Once parsed, a JSON response behaves exactly like a Python dictionary, so you just need to specify keys to get the values you're interested in. Finally, if you need to scrape webpages that rely on JavaScript, you may need to simulate a browser even more closely by using something like Selenium.
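A small sketch, assuming a hypothetical JSON endpoint and made-up key names:

```python
import requests

# Hypothetical API endpoint; the key names are assumptions.
response = requests.get("https://example.com/api/items")
data = response.json()          # parse the JSON body into dicts and lists

for item in data["items"]:
    print(item["name"], item["price"])
```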

2. Data storage and data modeling

Once you've scraped the data, you need to find a place to store it. If you primarily need to write a small amount of data and store it for later, you can stick to flat files, eg. comma- or tab-delimited text files. While convenient for write operations, flat files are unsuitable for read operations. Without iterating over each line in the file, how can you extract the 200th row? Flat files lack indexes, meaning you can't specify the index on which you'd like to operate (eg. read, update, delete).
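To make that concrete, here's a sketch using Python's csv module (the filename and the `rows` list from the parsing step are assumptions): writing is trivial, but fetching "the 200th row" still means walking the file line by line.

```python
import csv

# Writing scraped rows to a flat file is easy...
with open("scraped.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["name", "price"])
    writer.writerows(rows)               # `rows` from the parsing step above

# ...but there is no index: reading the 200th row means scanning every
# line that comes before it.
with open("scraped.csv", newline="") as f:
    reader = csv.reader(f)
    for i, row in enumerate(reader):
        if i == 200:
            print(row)
            break
```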

For this reason, most large projects and enterprise-level applications are backed by databases. Databases come in many flavors, though the most common are relational (eg. Postgres, SQLite, Oracle, SQL Server) and non-relational (eg. MongoDB, Redis, Cassandra). I won't attempt to completely explain their differences, but I like to think that relational databases are used when you have neatly structured data. You specify the relationships between tables (eg. foreign keys), the constraints within tables (eg. column values must be unique) and the type of data for each column. All of these must be known and specified in the database schema in advance. On the other hand, non-relational databases shine when it comes to unstructured data. You have data and you want to throw it into a database - you don't want the database worrying about relationships or constraints or data types, but rather about how to quickly get you the data you're looking for.
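As an illustration of "schema in advance", here's a sketch of a small relational schema using Python's built-in sqlite3 module - the table and column names are made up:

```python
import sqlite3

# Types, a uniqueness constraint and a foreign key are all declared up front.
conn = sqlite3.connect("pipeline.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS sellers (
        id   INTEGER PRIMARY KEY,
        name TEXT UNIQUE NOT NULL
    )
""")
conn.execute("""
    CREATE TABLE IF NOT EXISTS listings (
        id        INTEGER PRIMARY KEY,
        seller_id INTEGER REFERENCES sellers(id),
        title     TEXT NOT NULL,
        price     REAL
    )
""")
conn.execute("INSERT OR IGNORE INTO sellers (name) VALUES (?)", ("acme",))
conn.commit()

# Indexed lookups and joins are exactly what flat files can't give you.
for row in conn.execute(
        "SELECT s.name, l.title, l.price "
        "FROM listings l JOIN sellers s ON s.id = l.seller_id "
        "WHERE l.price < ?", (100,)):
    print(row)

conn.close()
```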

In my opinion, you should work with relational databases (ie. learn SQL), especially if you're starting out. Analyzing unstructured data from a non-relational database is much harder precisely because your data lacks structure.

3. Exploratory data analysis

Once your data is persistent (ie. saved to disk rather than held only in memory), you need to know what's in it. Exploratory data analysis (EDA) is the systematic process of learning about your data: its contents, quality, relationships and assumptions. When people refer to "mastering the data," thorough EDA is an implicit assumption. Here are some important things to check for:

- Search for and utilize any associated documentation
- View all the columns (ie. even those which don't immediately fit on your screen)
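As a sketch of what those first passes can look like in code - pandas isn't mentioned above, it's simply one convenient tool for this, and the filename carries over from the flat-file example:

```python
import pandas as pd

# Assumes the scraped table was saved to scraped.csv earlier.
df = pd.read_csv("scraped.csv")

pd.set_option("display.max_columns", None)   # show every column, not a subset
print(df.columns.tolist())                   # which fields do we actually have?
print(df.dtypes)                             # are the types what we expect?
print(df.describe(include="all"))            # counts, ranges, obvious outliers
print(df.isna().sum())                       # where is data missing?
```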
