I have been using Apache Spark recently to process some data. I was using the built-in PySpark shell, which can run as a Jupyter notebook, but I wanted to integrate it so that I could run it from any IPython client and from the JupyterHub instance that I recently configured on our server.
Happily, I found a simple way to get that working with PySpark 1.4.0.
There are a few posts, here and here, describing ways to achieve that. However, they did not work perfectly with Spark 1.4.0, at least not in my configuration. Below I will show a few options that I came up with. I assume that Spark is already installed somewhere.
One hacky IPython kernel
The first, maybe a bit hacky, way uses kernels and goes as follows:
1. Create the kernel directory (pyspark is the kernel name):
mkdir -p ~/.ipython/kernels/pyspark
2. Create the kernel file:
touch ~/.ipython/kernels/pyspark/kernel.json
3. Put the following inside this file:
{ "display_name": "pySpark (Spark 1.4.0)", "language": "python", "argv": [ "/usr/bin/python2", "-m", "IPython.kernel", "-f", "{connection_file}" ], "env": { "SPARK_HOME": "<spark_dir>", "PYTHONPATH": "<spark_dir>/python/:<spark_dir>/python/lib/py4j-0.8.2.1-src.zip", "PYTHONSTARTUP": "<spark_dir>/python/pyspark/shell.py", "PYSPARK_SUBMIT_ARGS": "--master spark://127.0.0.1:7077 pyspark-shell" } }Replace <spark_dir> with the location of you Spark in my case it was /home/spark/local/spark-1.4.0 . You have to provide absolute paths that is why it is a bit hacky way.
In PYSPARK_SUBMIT_ARGS you can put the arguments that you normally pass to pyspark; here I am connecting to the cluster using the address and port of my master node.
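For example, on a standalone cluster you could also cap the resources the shell grabs; the values below are only an illustration, not part of my setup:
"PYSPARK_SUBMIT_ARGS": "--master spark://127.0.0.1:7077 --executor-memory 2g --total-executor-cores 4 pyspark-shell"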
4. Check that you can use this kernel and that you get a SparkContext object:
Run the IPython console:
ipython console --kernel pyspark
After it finishes loading, type sc and you should get something like:
In [1]: sc
Out[1]: <pyspark.context.SparkContext at 0x7f2b480f0e90>
It means it works!
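If you want a quick sanity check that the context really talks to the cluster, any small job will do; for example (just an illustration):
In [2]: sc.parallelize(range(100)).sum()
Out[2]: 4950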
If you work with Jupyter, you should see a new notebook type to create, as well as a new entry in the “Kernel” menu while a notebook is open.
Making it a bit nicer
I do not like hardcoded paths like the ones above. You could move all these variables out of the env section and export them, e.g. in your ~/.bashrc file, as typical environment variables. However, if you do so, the PySpark context will load every time you start Python or IPython. What is more, when you run pyspark, you will get errors saying that a Spark context has already been created.
There is another way to achieve the same thing:
1. Create a new IPython profile:
ipython profile create pyspark
2. Create a profile startup script:
touch ~/.ipython/profile_pyspark/startup/00-pyspark-setup.py
3. Put the following inside the file (as they did here):
import os
import sys

spark_home = os.environ.get('SPARK_HOME', None)
if not spark_home:
    raise ValueError('SPARK_HOME environment variable is not set')
# put PySpark and its bundled py4j on the Python path
sys.path.insert(0, os.path.join(spark_home, 'python'))
sys.path.insert(0, os.path.join(spark_home, 'python/lib/py4j-0.8.2.1-src.zip'))
# run PySpark's interactive shell setup, which creates the sc object
execfile(os.path.join(spark_home, 'python/pyspark/shell.py'))
4. You can run this profile using the following command:
ipython console --profile=pyspark
However, PySpark will probably not load, because SPARK_HOME is not defined (the startup script will just raise the ValueError).
If you define it:
export SPARK_HOME=<spark_dir>
and re-run IPython, you should see PySpark loading.
You can define PYSPARK_SUBMIT_ARGS in the same way to pass additional arguments; in my case, the cluster location:
export PYSPARK_SUBMIT_ARGS="--master spark://127.0.0.1:7077 pyspark-shell"
The IPython console now works, but if you want to use it in Jupyter you need to do a few more things, namely create a kernel that uses this profile:
1. Follow the steps from the previous section.
2. Put the following content in the kernel file:
{ "display_name": "pySpark (Spark 1.4.0)", "language": "python", "argv": [ "/usr/bin/python", "-m", "IPython.kernel", "--profile=pyspark", "-f", "{connection_file}" ] }3. You should be able to use it in IPython and Jupyter now.
Update: you can simplify this configuration a bit more if you install py4j:
pip install py4j
Then you do not need to add py4j to the PYTHONPATH manually. Just be sure that the versions match.
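With py4j installed from pip, the startup script above can be trimmed to something like this (a sketch, assuming the pip-installed py4j version matches the one bundled with your Spark):
import os
import sys

spark_home = os.environ.get('SPARK_HOME', None)
if not spark_home:
    raise ValueError('SPARK_HOME environment variable is not set')
# only PySpark itself needs to go on the path; py4j now comes from pip
sys.path.insert(0, os.path.join(spark_home, 'python'))
# run PySpark's interactive shell setup, which creates the sc object
execfile(os.path.join(spark_home, 'python/pyspark/shell.py'))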
What if I am using JupyterHub?
In my configuration I am using JupyterHub, which spawns a new notebook server when a user logs in. It runs from the root account, but when you log in with your own account it reads the kernels and profiles from your home directory, so you should have no problems running the first solution presented above.
The second one will not work unless you export those variables globally. Putting them in files like ~/.bashrc does not work, but you can put them in the kernel file (you still need one to access the profile from Jupyter). Just add at the end:
{ ... "argv": [ ... ], "env": { "SPARK_HOME": "<spark_dir>", "PYSPARK_SUBMIT_ARGS": "--master spark://127.0.0.1:7077 pyspark-shell" } }and everything should be grand!
I use the last solution; it works both in IPython and Jupyter. Instead of keeping these kernels in the .ipython directory of your home directory, you can install them globally, so that other users can use them when you are running JupyterHub.
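For example, something along these lines should work, assuming your installation looks for system-wide kernels in /usr/local/share/jupyter/kernels (check the kernel search paths of your installation first):
sudo mkdir -p /usr/local/share/jupyter/kernels
sudo cp -r ~/.ipython/kernels/pyspark /usr/local/share/jupyter/kernels/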
Hope you enjoyed!