Jupyter notebooks provide a useful environment for interactive exploration of data. A common question I get, though, is how you can progress from this nonlinear, interactive, trial-and-error style of exploration to a more linear and reproducible analysis based on organized, packaged, and tested code. This series of videos presents a case study in how I personally approach reproducible data analysis within the Jupyter notebook.
Each video is approximately 5-8 minutes; the videos are available in a YouTube Playlist. Alternatively, below you can find the videos with some description and links to relevant resources.
In[1]:
# Quick utility to embed the videos below
from IPython.display import YouTubeVideo

def embed_video(index, playlist='PLYCpMb24GpOC704uO9svUrihl-HY1tTJJ'):
    return YouTubeVideo('', index=index - 1, list=playlist,
                        width=600, height=350)
Part 1: Loading and Visualizing Data
In this video, I introduce the dataset, and use the Jupyter notebook to download and visualize it.
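For reference, the first pass looks roughly like this (the download URL and the 'Date' column name below are assumptions about the public Fremont Bridge CSV; the video shows the real details):

from urllib.request import urlretrieve
import pandas as pd
import matplotlib.pyplot as plt

# Assumed location of the Fremont Bridge bike counter CSV (data.seattle.gov)
URL = 'https://data.seattle.gov/api/views/65db-xm6k/rows.csv?accessType=DOWNLOAD'
urlretrieve(URL, 'Fremont.csv')

# Parse the timestamp column as the index so time-based operations work directly
data = pd.read_csv('Fremont.csv', index_col='Date', parse_dates=True)

# Weekly totals give a quick first look at the seasonal structure
data.resample('W').sum().plot()
plt.show()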
In[2]: embed_video(1)
Out[2]:
Relevant resources:
Fremont Bridge Bike Counter: the website where you can explore the data
A Whirlwind Tour of Python: my book introducing the Python programming language, aimed at scientists and engineers.
Python Data Science Handbook: my book introducing Python's data science tools, including an introduction to the IPython, Pandas, and Matplotlib tools used here.
Part 2: Further Data Exploration
In this video, I do some slightly more sophisticated visualization with the data, using matplotlib and pandas.
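The core trick is a pivot table that turns each day into a 24-hour traffic profile. A minimal sketch, continuing from the Part 1 snippet (the short column names are my own relabeling and an assumption about the CSV layout):

import pandas as pd
import matplotlib.pyplot as plt

data = pd.read_csv('Fremont.csv', index_col='Date', parse_dates=True)
data.columns = ['West', 'East']              # assumed order of the two count columns
data['Total'] = data['West'] + data['East']

# Pivot to one column per day and one row per time of day,
# so each day becomes a 24-hour traffic profile
pivoted = data.pivot_table('Total', index=data.index.time, columns=data.index.date)
pivoted.plot(legend=False, alpha=0.01)
plt.show()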
In[3]: embed_video(2)
Out[3]:
Relevant Resources:
Pivot Tables Section from the Python Data Science Handbook

Part 3: Version Control with Git & GitHub
In this video, I set up a repository on GitHub and commit the notebook into version control.
In[4]: embed_video(3)
Out[4]:
Relevant Resources:
Version Control with Git: excellent novice-level tutorial from Software Carpentry
GitHub Guides: set of tutorials on using GitHub
The Whys and Hows of Licensing Scientific Code: my 2014 blog post on AstroBetter

Part 4: Working with Data and GitHub
In this video, I refactor the data download script so that it only downloads the data when needed.
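In outline, the refactored loader checks for the file on disk before downloading. A sketch of the idea (function and file names here are illustrative, not necessarily those used in the video):

import os
from urllib.request import urlretrieve
import pandas as pd

FREMONT_URL = 'https://data.seattle.gov/api/views/65db-xm6k/rows.csv?accessType=DOWNLOAD'

def get_fremont_data(filename='Fremont.csv', url=FREMONT_URL, force_download=False):
    """Load the Fremont Bridge data, downloading the CSV only if needed."""
    if force_download or not os.path.exists(filename):
        urlretrieve(url, filename)
    data = pd.read_csv(filename, index_col='Date', parse_dates=True)
    data.columns = ['West', 'East']              # assumed column layout, as above
    data['Total'] = data['West'] + data['East']
    return data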
In[5]: embed_video(4)
Out[5]:

Part 5: Creating a Python Package
In this video, I move the data download utility into its own separate package.
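Schematically, the result is a small package wrapping the loader, so the notebook only needs one import. The layout below is illustrative (the package name is a placeholder, not necessarily the one used in the video):

# Illustrative package layout -- one package directory holding the download code:
#
#   jupyterworkflow/
#       __init__.py
#       data.py          # defines get_fremont_data() from the Part 4 sketch
#   setup.py
#
# so that the notebook itself shrinks to:
from jupyterworkflow.data import get_fremont_data

data = get_fremont_data()
data.head()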
In[6]: embed_video(5)
Out[6]:
Relevant Resources:
How To Package Your Python Code: broad tutorial on Python packaging.

Part 6: Unit Testing with PyTest
In this video, I add unit tests for the data download utility.
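A sketch of what such a test can look like with pytest (the asserted column names follow the earlier sketches and are assumptions about the loader's output):

# e.g. jupyterworkflow/tests/test_data.py  (hypothetical package from Part 5)
import pandas as pd

from jupyterworkflow.data import get_fremont_data

def test_fremont_data():
    data = get_fremont_data()
    # the loader should return a time-indexed frame with the expected columns
    assert all(data.columns == ['West', 'East', 'Total'])
    assert isinstance(data.index, pd.DatetimeIndex)

Running python -m pytest from the repository root will then discover and run this test.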
In[7]: embed_video(6)
Out[7]:
Relevant resources:
Pytest Documentation
Getting Started with Pytest: a nice tutorial by Jacob Kaplan-Moss

Part 7: Refactoring for Speed
In this video, I refactor the data download function to be a bit faster.
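The main speedup comes from giving pandas an explicit strftime format instead of letting it infer one row by row. A sketch, assuming timestamps formatted like 10/03/2012 12:00:00 AM (check the actual file before relying on this):

import pandas as pd

def load_counts(filename='Fremont.csv'):
    data = pd.read_csv(filename, index_col='Date')
    try:
        # an explicit format is much faster than letting pandas guess per row
        data.index = pd.to_datetime(data.index, format='%m/%d/%Y %I:%M:%S %p')
    except (TypeError, ValueError):
        # fall back to the slower general-purpose parser if the format changes
        data.index = pd.to_datetime(data.index)
    return data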
In[8]: embed_video(7)
Out[8]:
Relevant Resources:
Python strftime reference
Pandas Datetime Section from the Python Data Science Handbook

Part 8: Debugging a Broken Function
In this video, I discover that my refactoring has caused a bug. I debug it and fix it.
In[9]: embed_video(8)
Out[9]:

Part 8.5: Finding and Fixing a scikit-learn bug
In this video, I discover a bug in the scikit-learn codebase, and go through the process of submitting a GitHub Pull Request fixing the bug.
In[10]: embed_video(9)
Out[10]:

Part 9: Further Data Exploration: PCA and GMM
In this video, I apply unsupervised learning techniques to the data to explore what we can learn from it.
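In outline: treat each day's 24-hour profile as a point in a 24-dimensional space, reduce it with PCA, and cluster the projected days with a Gaussian mixture model. A minimal sketch, assuming the pivoted table from the Part 2 snippet:

from sklearn.decomposition import PCA
from sklearn.mixture import GaussianMixture

# one row per day, one feature per hour of the day
X = pivoted.fillna(0).T.values

# project the daily profiles onto the two leading principal components
X2 = PCA(n_components=2).fit_transform(X)

# cluster the projected days into two groups
gmm = GaussianMixture(n_components=2).fit(X2)
labels = gmm.predict(X2)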
In[11]: embed_video(10)
Out[11]:
Relevant Resources:
Principal Component Analysis In-Depth from the Python Data Science Handbook
Gaussian Mixture Models In-Depth from the Python Data Science Handbook