Jupyter notebooks provide a useful environment for interactive exploration of data. A common question I get, though, is how you can progress from this nonlinear, interactive, trial-and-error style of exploration to a more linear and reproducible analysis based on organized, packaged, and tested code. This series of videos presents a case study in how I personally approach reproducible data analysis within the Jupyter notebook.
Each video is approximately 5-8 minutes; the videos are available in a YouTube Playlist. Alternatively, below you can find the videos with some description and links to relevant resources.
In[1]:
# Quick utility to embed the videos below
from IPython.display import YouTubeVideo

def embed_video(index, playlist='PLYCpMb24GpOC704uO9svUrihl-HY1tTJJ'):
    return YouTubeVideo('', index=index - 1, list=playlist,
                        width=600, height=350)
Part 1: Loading and Visualizing Data
In this video, I introduce the dataset, and use the Jupyter notebook to download and visualize it.
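For reference, the first pass looks roughly like this (the download URL and the 'Date' column name below are assumptions about the public Fremont Bridge CSV; the video shows the real details):

from urllib.request import urlretrieve
import pandas as pd
import matplotlib.pyplot as plt

# Assumed location of the Fremont Bridge bike counter CSV (data.seattle.gov)
URL = 'https://data.seattle.gov/api/views/65db-xm6k/rows.csv?accessType=DOWNLOAD'
urlretrieve(URL, 'Fremont.csv')

# Parse the timestamp column as the index so time-based operations work directly
data = pd.read_csv('Fremont.csv', index_col='Date', parse_dates=True)

# Weekly totals give a quick first look at the seasonal structure
data.resample('W').sum().plot()
plt.show()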
In[2]: embed_video(1)
Out[2]:
Relevant resources:
Fremont Bridge Bike Counter: the website where you can explore the data
A Whirlwind Tour of Python: my book introducing the Python programming language, aimed at scientists and engineers.
Python Data Science Handbook: my book introducing Python's data science tools, including an introduction to the IPython, Pandas, and Matplotlib tools used here.
Part 2: Further Data Exploration
In this video, I do some slightly more sophisticated visualization with the data, using matplotlib and pandas.
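The core trick is a pivot table that turns each day into a 24-hour traffic profile. A minimal sketch, continuing from the Part 1 snippet (the short column names are my own relabeling and an assumption about the CSV layout):

import pandas as pd
import matplotlib.pyplot as plt

data = pd.read_csv('Fremont.csv', index_col='Date', parse_dates=True)
data.columns = ['West', 'East']              # assumed order of the two count columns
data['Total'] = data['West'] + data['East']

# Pivot to one column per day and one row per time of day,
# so each day becomes a 24-hour traffic profile
pivoted = data.pivot_table('Total', index=data.index.time, columns=data.index.date)
pivoted.plot(legend=False, alpha=0.01)
plt.show()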
In[3]: embed_video(2)
Out[3]:
Relevant Resources:
Pivot Tables Section from the Python Data Science Handbook

Part 3: Version Control with Git & GitHub
In this video, I set up a repository on GitHub and commit the notebook into version control.
In[4]: embed_video(3)
Out[4]:
Relevant Resources:
Version Control with Git: excellent novice-level tutorial from Software Carpentry
GitHub Guides: set of tutorials on using GitHub
The Whys and Hows of Licensing Scientific Code: my 2014 blog post on AstroBetter

Part 4: Working with Data and GitHub
In this video, I refactor the data download script so that it only downloads the data when needed.
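In outline, the refactored loader checks for the file on disk before downloading. A sketch of the idea (function and file names here are illustrative, not necessarily those used in the video):

import os
from urllib.request import urlretrieve
import pandas as pd

FREMONT_URL = 'https://data.seattle.gov/api/views/65db-xm6k/rows.csv?accessType=DOWNLOAD'

def get_fremont_data(filename='Fremont.csv', url=FREMONT_URL, force_download=False):
    """Load the Fremont Bridge data, downloading the CSV only if needed."""
    if force_download or not os.path.exists(filename):
        urlretrieve(url, filename)
    data = pd.read_csv(filename, index_col='Date', parse_dates=True)
    data.columns = ['West', 'East']              # assumed column layout, as above
    data['Total'] = data['West'] + data['East']
    return data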
In[5]: embed_video(4)
Out[5]:

Part 5: Creating a Python Package
In this video, I move the data download utility into its own separate package.
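Schematically, the result is a small package wrapping the loader, so the notebook only needs one import. The layout below is illustrative (the package name is a placeholder, not necessarily the one used in the video):

# Illustrative package layout -- one package directory holding the download code:
#
#   jupyterworkflow/
#       __init__.py
#       data.py          # defines get_fremont_data() from the Part 4 sketch
#   setup.py
#
# so that the notebook itself shrinks to:
from jupyterworkflow.data import get_fremont_data

data = get_fremont_data()
data.head()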
In[6]: embed_video(5)
Out[6]:
Relevant Resources:
How To Package Your Python Code: broad tutorial on Python packaging.

Part 6: Unit Testing with PyTest
In this video, I add unit tests for the data download utility.
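A sketch of what such a test can look like with pytest (the asserted column names follow the earlier sketches and are assumptions about the loader's output):

# e.g. jupyterworkflow/tests/test_data.py  (hypothetical package from Part 5)
import pandas as pd

from jupyterworkflow.data import get_fremont_data

def test_fremont_data():
    data = get_fremont_data()
    # the loader should return a time-indexed frame with the expected columns
    assert all(data.columns == ['West', 'East', 'Total'])
    assert isinstance(data.index, pd.DatetimeIndex)

Running python -m pytest from the repository root will then discover and run this test.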
In[7]: embed_video(6)
Out[7]:
Relevant resources:
Pytest Documentation
Getting Started with Pytest: a nice tutorial by Jacob Kaplan-Moss

Part 7: Refactoring for Speed
In this video, I refactor the data download function to be a bit faster.
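The main speedup comes from giving pandas an explicit strftime format instead of letting it infer one row by row. A sketch, assuming timestamps formatted like 10/03/2012 12:00:00 AM (check the actual file before relying on this):

import pandas as pd

def load_counts(filename='Fremont.csv'):
    data = pd.read_csv(filename, index_col='Date')
    try:
        # an explicit format is much faster than letting pandas guess per row
        data.index = pd.to_datetime(data.index, format='%m/%d/%Y %I:%M:%S %p')
    except (TypeError, ValueError):
        # fall back to the slower general-purpose parser if the format changes
        data.index = pd.to_datetime(data.index)
    return data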
In[8]: embed_video(7)
Out[8]:
Relevant Resources:
Python strftime reference
Pandas Datetime Section from the Python Data Science Handbook

Part 8: Debugging a Broken Function
In this video, I discover that my refactoring has caused a bug. I debug it and fix it.
In[9]: embed_video(8)
Out[9]:

Part 8.5: Finding and Fixing a scikit-learn bug
In this video, I discover a bug in the scikit-learn codebase, and go through the process of submitting a GitHub Pull Request fixing the bug.
In[10]: embed_video(9)
Out[10]:

Part 9: Further Data Exploration: PCA and GMM
In this video, I apply unsupervised learning techniques to the data to explore what we can learn from it.
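In outline: treat each day's 24-hour profile as a point in a 24-dimensional space, reduce it with PCA, and cluster the projected days with a Gaussian mixture model. A minimal sketch, assuming the pivoted table from the Part 2 snippet:

from sklearn.decomposition import PCA
from sklearn.mixture import GaussianMixture

# one row per day, one feature per hour of the day
X = pivoted.fillna(0).T.values

# project the daily profiles onto the two leading principal components
X2 = PCA(n_components=2).fit_transform(X)

# cluster the projected days into two groups
gmm = GaussianMixture(n_components=2).fit(X2)
labels = gmm.predict(X2)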
In[11]: embed_video(10)
Out[11]:
Relevant Resources:
Principal Component Analysis In-Depth from the Python Data Science Handbook
Gaussian Mixture Models In-Depth from the Python Data Science Handbook