Apache Spark is a great tool for large-scale data processing. Lately, I have begun working with PySpark, a way of interfacing with Spark through Python. After a discussion with a coworker, we were curious whether PySpark could run from within an IPython Notebook. It turns out this is fairly straightforward once you set up an IPython profile.
Here’s the tl;dr summary:
1. Install Spark
2. Create a PySpark profile for IPython
3. Some config
4. Simple word count example

The steps below were successfully executed using Mac OS X 10.10.2 and Homebrew. The majority of the steps should be similar for non-Windows environments. For demonstration purposes, Spark will run in local mode, but the configuration can be updated to submit code to a cluster (a rough sketch of that change follows).
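As a sketch of what that cluster change might look like (the hostname below is a placeholder, not part of my actual setup), you would point the PYSPARK_SUBMIT_ARGS variable configured later in this post at your cluster's master URL instead of local mode:

export PYSPARK_SUBMIT_ARGS="--master spark://your-cluster-host:7077"

Everything else in the setup stays the same; only the master URL changes.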
Many thanks to my coworker Steve Wampler, who did much of the work.
Installing Spark

1. Download the source for the latest Spark release
2. Unzip the source to ~/spark-1.2.0/ (or wherever you wish to install Spark)
3. From the CLI, type: cd ~/spark-1.2.0/
4. Install the Scala build tool: brew install sbt
5. Build Spark: sbt assembly (takes a while)

Create PySpark Profile for IPython

After Spark is installed, let's start by creating a new IPython profile for PySpark.
ipython profile create pyspark

To avoid port conflicts with other IPython profiles, I updated the default port to 42424 within ~/.ipython/profile_pyspark/ipython_notebook_config.py:
c = get_config()

# Simply find this line and change the port value
c.NotebookApp.port = 42424

Set the following environment variables in .bashrc or .bash_profile:
# Set this to wherever you installed Spark
export SPARK_HOME="$HOME/spark-1.2.0"

# Where you specify options you would normally add after bin/pyspark
export PYSPARK_SUBMIT_ARGS="--master local[2]"

Create a file named ~/.ipython/profile_pyspark/startup/00-pyspark-setup.py containing the following:
# Configure the necessary Spark environment
import os
import sys

spark_home = os.environ.get('SPARK_HOME', None)
sys.path.insert(0, spark_home + "/python")

# Add py4j to the path.
# You may need to change the version number to match your install
sys.path.insert(0, os.path.join(spark_home, 'python/lib/py4j-0.8.2.1-src.zip'))

# Initialize PySpark to predefine the SparkContext variable 'sc'
execfile(os.path.join(spark_home, 'python/pyspark/shell.py'))

Now we are ready to launch a notebook using the PySpark profile:
ipython notebook --profile=pyspark

Word Count Example

Make sure the IPython pyspark profile created a SparkContext by typing sc within the notebook. You should see output similar to <pyspark.context.SparkContext at 0x1097e8e90>.
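If you want a quick sanity check beyond inspecting sc (this snippet is my own addition, not part of the original walkthrough), run a trivial job through the context:

# Run a tiny job to confirm the SparkContext is working
nums = sc.parallelize(range(1000))
print nums.count()  # 1000
print nums.sum()    # 499500

If both numbers come back, the notebook is talking to Spark correctly.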
Next, load a text file into a Spark RDD. For example, load the Spark README file:
import os
spark_home = os.environ.get('SPARK_HOME', None)
text_file = sc.textFile(spark_home + "/README.md")

The word count script below is quite simple. It takes the following steps:
1. Split each line from the file into words
2. Map each word to a tuple containing the word and an initial count of 1
3. Sum up the count for each word

word_counts = text_file \
    .flatMap(lambda line: line.split()) \
    .map(lambda word: (word, 1)) \
    .reduceByKey(lambda a, b: a + b)

At this point, the word count has not been executed (lazy evaluation). To actually count the words, execute the pipeline:
word_counts.collect()

Here's a portion of the output:
[(u'all', 1), (u'when', 1), (u'"local"', 1), (u'including', 3), (u'computation', 1), (u'Spark](#building-spark).', 1), (u'using:', 1), (u'guidance', 3), ... (u'spark://', 1), (u'programs', 2), (u'documentation', 3), (u'It', 2), (u'graphs', 1), (u'./dev/run-tests', 1), (u'first', 1), (u'latest', 1)]
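The output above comes back in no particular order. To look at the most frequent words instead (this step is my own addition and isn't part of the original example), you can pull a sorted sample back to the driver with takeOrdered:

# Fetch the 10 most frequent words, sorted by descending count
top_words = word_counts.takeOrdered(10, key=lambda x: -x[1])
for word, count in top_words:
    print("%s: %d" % (word, count))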