I have been using Apache Spark recently to process some data. I was using the built-in PySpark shell, which can run as a Jupyter notebook, but I wanted to integrate it so that I could run it from any IPython client and from the JupyterHub instance that I recently configured on our server.
Happily, I found a simple way to get that working with PySpark 1.4.0.
There are a few posts, here and here, describing ways to achieve that. However, they did not work perfectly with Spark 1.4.0, at least not in my configuration. Below I will show a few options that I came up with. I assume that Spark is already installed somewhere.
One hacky IPython kernel
The first, maybe a bit hacky, way uses kernels and goes as follows:
1. Create the kernel directory (pyspark is the kernel name):
mkdir -p ~/.ipython/kernels/pyspark
2. Create the kernel file:
touch ~/.ipython/kernels/pyspark/kernel.json
3. Put the following inside this file:
{ "display_name": "pySpark (Spark 1.4.0)", "language": "python", "argv": [ "/usr/bin/python2", "-m", "IPython.kernel", "-f", "{connection_file}" ], "env": { "SPARK_HOME": "<spark_dir>", "PYTHONPATH": "<spark_dir>/python/:<spark_dir>/python/lib/py4j-0.8.2.1-src.zip", "PYTHONSTARTUP": "<spark_dir>/python/pyspark/shell.py", "PYSPARK_SUBMIT_ARGS": "--master spark://127.0.0.1:7077 pyspark-shell" } }Replace <spark_dir> with the location of you Spark in my case it was /home/spark/local/spark-1.4.0 . You have to provide absolute paths that is why it is a bit hacky way.
In PYSPARK_SUBMIT_ARGS you can put the arguments that you normally pass to pyspark; here I am connecting to the cluster using the address and port of my master node.
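For example, on a standalone cluster you could also cap the resources the shell grabs; the values below are only an illustration, not part of my setup:
"PYSPARK_SUBMIT_ARGS": "--master spark://127.0.0.1:7077 --executor-memory 2g --total-executor-cores 4 pyspark-shell"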
4. Check that you can use this kernel and that you get a SparkContext object:
Run the IPython console:
ipython console --kernel pyspark
After it finishes loading, type sc and you should get something like:
In [1]: sc
Out[1]: <pyspark.context.SparkContext at 0x7f2b480f0e90>
It means it works!
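If you want a quick sanity check that the context really talks to the cluster, any small job will do; for example (just an illustration):
In [2]: sc.parallelize(range(100)).sum()
Out[2]: 4950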
If you work with Jupyter, you should see a new notebook type to create, as well as a new entry in the “Kernel” menu while a notebook is open.
Making it a bit nicer
I do not like hardcoded paths like the ones above. You could move all these variables out of the env section and export them, e.g. in your ~/.bashrc file, as typical environment variables. However, if you do so, the PySpark context will load every time you start Python or IPython. What is more, when you run pyspark, you will get errors saying that a Spark context has already been created.
There is another way to achieve the same thing:
1. Create a new IPython profile:
ipython profile create pyspark
2. Create a profile startup script:
touch ~/.ipython/profile_pyspark/startup/00-pyspark-setup.py
3. Put the following inside the file (as they did here):
import os
import sys

spark_home = os.environ.get('SPARK_HOME', None)
if not spark_home:
    raise ValueError('SPARK_HOME environment variable is not set')
# put PySpark and its bundled py4j on the Python path
sys.path.insert(0, os.path.join(spark_home, 'python'))
sys.path.insert(0, os.path.join(spark_home, 'python/lib/py4j-0.8.2.1-src.zip'))
# run PySpark's interactive shell setup, which creates the sc object
execfile(os.path.join(spark_home, 'python/pyspark/shell.py'))
4. You can run this profile using the following command:
ipython console --profile=pyspark
However, PySpark will probably not load, because SPARK_HOME is not defined (the startup script will just raise the ValueError).
If you define it:
export SPARK_HOME=<spark_dir>
and re-run IPython, you should see PySpark loading.
You can define PYSPARK_SUBMIT_ARGS in the same way to pass additional arguments; in my case, the cluster location:
export PYSPARK_SUBMIT_ARGS="--master spark://127.0.0.1:7077 pyspark-shell"
The IPython console now works, but if you want to use it in Jupyter you need to do a few more things, namely create a kernel that uses this profile:
1. Follow the steps from the previous section.
2. Put the following content in the kernel file:
{ "display_name": "pySpark (Spark 1.4.0)", "language": "python", "argv": [ "/usr/bin/python", "-m", "IPython.kernel", "--profile=pyspark", "-f", "{connection_file}" ] }3. You should be able to use it in IPython and Jupyter now.
Update: you can simplify this configuration a bit more if you install py4j:
pip install py4j
Then you do not need to add py4j to the PYTHONPATH manually. Just be sure that the versions match.
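With py4j installed from pip, the startup script above can be trimmed to something like this (a sketch, assuming the pip-installed py4j version matches the one bundled with your Spark):
import os
import sys

spark_home = os.environ.get('SPARK_HOME', None)
if not spark_home:
    raise ValueError('SPARK_HOME environment variable is not set')
# only PySpark itself needs to go on the path; py4j now comes from pip
sys.path.insert(0, os.path.join(spark_home, 'python'))
# run PySpark's interactive shell setup, which creates the sc object
execfile(os.path.join(spark_home, 'python/pyspark/shell.py'))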
What if I am using JupyterHub?
In my configuration I am using JupyterHub, which spawns a new notebook server when a user logs in. It runs from the root account, but when you log in with your own account it reads the kernels and profiles from your home directory, so you should have no problems running the first solution presented above.
The second one will not work unless you export those variables globally. Putting them in files like ~/.bashrc does not work, but you can put them in the kernel file (you still need one to access the profile from Jupyter). Just add at the end:
{ ... "argv": [ ... ], "env": { "SPARK_HOME": "<spark_dir>", "PYSPARK_SUBMIT_ARGS": "--master spark://127.0.0.1:7077 pyspark-shell" } }and everything should be grand!
I use the last solution; it works both in IPython and Jupyter. Instead of keeping these kernels in the .ipython directory of your home directory, you can install them globally, so that other users can use them when you are running JupyterHub.
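For example, something along these lines should work, assuming your installation looks for system-wide kernels in /usr/local/share/jupyter/kernels (check the kernel search paths of your installation first):
sudo mkdir -p /usr/local/share/jupyter/kernels
sudo cp -r ~/.ipython/kernels/pyspark /usr/local/share/jupyter/kernels/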
Hope you enjoyed!