This week Pronto CycleShare, Seattle's bicycle share system, turned one year old. To celebrate this, Pronto made available a large cache of data from the first year of operation and announced the Pronto Cycle Share Data Challenge, which offers prizes for different categories of analysis.
There are a lot of tools out there that you could use to analyze data like this, but my tool of choice is (obviously) Python. In this post, I want to show how you can get started analyzing this data and joining it with other available data sources using the PyData stack, namely NumPy, Pandas, Matplotlib, and Seaborn. Here I'll take a look at some of the basic questions you can answer with this data. Later I hope to find the time to dig deeper and ask some more interesting and creative questions. Stay tuned!
For those who aren't familiar, this post is composed in the form of a Jupyter Notebook, an open document format that combines text, code, data, and graphics and is viewable through the web browser. If you have not used it before, I encourage you to try it out! You can download the notebook containing this post here, open it with Jupyter, and start asking your own questions of the data.
Downloading Pronto's Data

We'll start by downloading the data (available on Pronto's website), which you can do by uncommenting the following shell commands (the exclamation mark here is a special IPython syntax to run a shell command). The total download is about 70MB, and the unzipped files are around 900MB.
In[1]:
# !curl -O https://s3.amazonaws.com/pronto-data/open_data_year_one.zip
# !unzip open_data_year_one.zip
Next we need some standard Python package imports:
In[2]:
%matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import seaborn as sns; sns.set()
And now we load the trip data with Pandas:
In[3]:
trips = pd.read_csv('2015_trip_data.csv',
                    parse_dates=['starttime', 'stoptime'],
                    infer_datetime_format=True)
trips.head()

Out[3]:
   trip_id           starttime            stoptime    bikeid  tripduration    from_station_name                                    to_station_name from_station_id to_station_id       usertype  gender  birthyear
0      431 2014-10-13 10:31:00 2014-10-13 10:48:00  SEA00298       985.935  2nd Ave & Spring St  Occidental Park / Occidental Ave S & S Washing...          CBD-06         PS-04  Annual Member    Male       1960
1      432 2014-10-13 10:32:00 2014-10-13 10:48:00  SEA00195       926.375  2nd Ave & Spring St  Occidental Park / Occidental Ave S & S Washing...          CBD-06         PS-04  Annual Member    Male       1970
2      433 2014-10-13 10:33:00 2014-10-13 10:48:00  SEA00486       883.831  2nd Ave & Spring St  Occidental Park / Occidental Ave S & S Washing...          CBD-06         PS-04  Annual Member  Female       1988
3      434 2014-10-13 10:34:00 2014-10-13 10:48:00  SEA00333       865.937  2nd Ave & Spring St  Occidental Park / Occidental Ave S & S Washing...          CBD-06         PS-04  Annual Member  Female       1977
4      435 2014-10-13 10:34:00 2014-10-13 10:49:00  SEA00202       923.923  2nd Ave & Spring St  Occidental Park / Occidental Ave S & S Washing...          CBD-06         PS-04  Annual Member    Male       1971

Each row of this trip dataset is a single ride by a single person, and the data contains over 140,000 rows!
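As a minimal sketch of what `parse_dates` buys us (using a tiny inline CSV as a stand-in for Pronto's file, with made-up trip IDs), the parsed columns come back as true `datetime64` values, so datetime arithmetic just works:

```python
import io
import pandas as pd

# A tiny stand-in for the trip CSV (not real Pronto data)
csv = io.StringIO(
    "trip_id,starttime,stoptime\n"
    "431,2014-10-13 10:31:00,2014-10-13 10:48:00\n"
    "432,2014-10-13 10:32:00,2014-10-13 10:48:00\n"
)
df = pd.read_csv(csv, parse_dates=['starttime', 'stoptime'])

# The parsed columns are genuine datetimes, not strings,
# so we can subtract them directly:
durations = df['stoptime'] - df['starttime']
print(df.dtypes['starttime'])  # datetime64[ns]
print(durations.iloc[0])       # 0 days 00:17:00
```

Without `parse_dates`, those columns would load as plain strings and the subtraction would fail.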
Exploring Trips over Time

Let's start by looking at the trend in the number of daily trips over the course of the year.
In[4]:
# Extract the date and hour from each trip's start time
ind = pd.DatetimeIndex(trips.starttime)
trips['date'] = ind.date.astype('datetime64')
trips['hour'] = ind.hour

In[5]:
# Count trips by date
by_date = trips.pivot_table('trip_id', aggfunc='count',
                            index='date',
                            columns='usertype')
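To make the `pivot_table` call concrete, here is a toy version on a made-up miniature trip table (the dates and user types below are invented for illustration): counting `trip_id` per `date` yields one row per day and one column per `usertype`.

```python
import pandas as pd

# Made-up miniature trip table (values invented for illustration)
trips = pd.DataFrame({
    'trip_id': [1, 2, 3, 4, 5],
    'date': pd.to_datetime(['2014-10-13', '2014-10-13', '2014-10-13',
                            '2014-10-14', '2014-10-14']),
    'usertype': ['Annual Member', 'Annual Member',
                 'Short-Term Pass Holder',
                 'Annual Member', 'Short-Term Pass Holder'],
})

# Same shape of call as above: count trips per date,
# with one column per user type
by_date = trips.pivot_table('trip_id', aggfunc='count',
                            index='date', columns='usertype')
print(by_date)
```

The result is a small two-column frame: 2 annual-member trips and 1 short-term trip on the first day, 1 of each on the second, which is exactly the structure the plotting code below relies on.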
In[6]:
fig, ax = plt.subplots(2, figsize=(16, 8))
fig.subplots_adjust(hspace=0.4)
by_date.il