PySpark is the Spark Python API, which lets you interact with Spark through a Python shell. If you have a Python programming background, this is an excellent way to get introduced to Spark data types and parallel programming. PySpark is a particularly flexible tool for exploratory big data analysis because it integrates with the rest of the Python data analysis ecosystem, including pandas (DataFrames), NumPy (arrays), and Matplotlib (visualization). In this blog post, you’ll get some hands-on experience using PySpark and the MapR Sandbox.
Example: Using Clustering on Cyber Network Data to Identify Anomalous Behavior

Unsupervised learning is an exploratory area of data analysis. These methods are used to learn about the structure and behavior of the data. Keep in mind that they are not used to predict or classify, but rather to interpret and understand.
Clustering is a popular unsupervised learning method in which the algorithm attempts to identify natural groups within the data. K-means is the most widely used clustering algorithm; “k” is the number of groups that the data falls into. In k-means, k is assigned by the analyst, and choosing the value of k is where the interpretation of the data comes into play.
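To make that concrete, here is a minimal sketch of training a k-means model with PySpark’s MLlib. The toy vectors and the choice of k=2 are purely illustrative and are not part of this tutorial’s dataset; sc is the SparkContext created by the PySpark shell.

from numpy import array
from pyspark.mllib.clustering import KMeans

# Toy feature vectors; in practice this would be an RDD built from your data
features = sc.parallelize([
    array([0.0, 0.0]), array([0.1, 0.2]),
    array([9.0, 8.0]), array([9.2, 8.1]),
])

# The analyst supplies k; here k=2 says we expect two natural groups
model = KMeans.train(features, 2, maxIterations=10)

print(model.clusterCenters)              # the learned group centers
print(model.predict(array([0.2, 0.1])))  # the group a new point falls into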
In this example, we will be using a dataset from an annual data mining competition, The KDD Cup ( http://www.sigkdd.org/kddcup/index.php ). In 1999, the topic was network intrusion, and the data set is still available ( http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html ). We will use the kddcup.data.gz file, which consists of 42 features and approximately 4.9 million rows.
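As a preview of what a first pass over this data looks like once the file has been copied onto the sandbox and you are inside the PySpark shell, a sketch might resemble the following. The path /user/user01/kddcup.data.gz is an assumption about where you stage the file; adjust it to your own location.

# Spark reads gzipped text files directly, so there is no need to decompress first
raw = sc.textFile("/user/user01/kddcup.data.gz")

print(raw.count())   # roughly 4.9 million rows
print(raw.first())   # one comma-separated record

# Each record splits into its comma-separated fields
fields = raw.map(lambda line: line.split(","))
print(len(fields.first()))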
Using clustering on cyber network data to identify anomalous behavior is a common application of unsupervised learning. The sheer amount of data collected makes it impossible to go through each log or event to properly determine whether that network event was normal or anomalous. Intrusion Detection Systems (IDS) and Intrusion Prevention Systems (IPS) are often the only applications networks have to filter this data, and their filters are often based on signatures of anomalous activity that can take time to be updated. Before an update occurs, it is valuable to have analysis techniques for checking your network data for recent anomalous activity.
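One common way to turn cluster assignments into an anomaly check is to score each record by its distance to the nearest cluster center and then inspect the records that sit farthest away. Below is a minimal sketch of that idea, assuming model is a KMeansModel returned by KMeans.train and data is an RDD of numeric feature vectors; both names are placeholders, not variables defined earlier in this tutorial.

import numpy as np

# Pull the learned centers back as plain NumPy arrays so the scoring
# function serializes cleanly to the workers
centers = [np.asarray(c) for c in model.clusterCenters]

def distance_to_nearest_center(point):
    # Euclidean distance from the record to its closest cluster center
    return min(float(np.linalg.norm(point - c)) for c in centers)

# Score every record and look at the ten farthest from any center;
# these are the candidate anomalies to investigate
scored = data.map(lambda p: (distance_to_nearest_center(p), p))
print(scored.top(10, key=lambda pair: pair[0]))

Working with the raw center arrays, rather than the model object itself, keeps the closure small and easy to ship to the executors.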
K-means is also used in analysis of social media data, financial transactions, and demographics. For example, you can use clustering analysis to identify groups of Twitter users who Tweet from specific geographic regions using their latitude, longitude, and sentiment scores.
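A sketch of how those features might be assembled is shown below; the tweet records here are hypothetical placeholders used only to illustrate the shape of the input.

from numpy import array

# Hypothetical records: (user, latitude, longitude, sentiment score)
tweets = sc.parallelize([
    ("userA", 40.7, -74.0, 0.8),
    ("userB", 40.6, -73.9, 0.7),
    ("userC", 34.0, -118.2, -0.3),
])

# Keep only the numeric columns as the clustering features
tweet_features = tweets.map(lambda t: array([t[1], t[2], t[3]]))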
Code for computing k-means in Spark using Scala can be found in many books and blogs. Implementing this code in PySpark uses a slightly different syntax, but many elements are the same, so it will look familiar. The MapR Sandbox offers an excellent environment where Spark comes pre-installed, letting you get right to the analysis rather than worrying about software installation.
Install the Sandbox

The instructions in this example use the Sandbox in VirtualBox, but either VMware or VirtualBox can be used. For directions on installing the Sandbox in VirtualBox, follow this link:

http://maprdocs.mapr.com/51/#SandboxHadoop/t_install_sandbox_vbox.html

Start the Sandbox in Your Virtual Machine

To begin, start the MapR Sandbox that you have installed using VMware or VirtualBox. It might take a minute or two to fully initialize.

NOTE: You need to press the “command” key on macOS or the right “control” key on Windows to release your mouse cursor from the console window.
Once the Sandbox is started, take a look at what comes up. The Sandbox itself is an environment where you can interact with your data, but if you go to http://127.0.0.1:8443/ you can access the file system and familiarize yourself with how the data is stored.

For this tutorial, we will be in HUE. Launch HUE and type in the username/password combination:

Username: mapr
Password: mapr

Once HUE opens, go to the file browser:

When you are in the file browser, you will see that you are in the /user/mapr directory.

We are going to operate as user01. To get to that directory, click on the /user directory.

Make sure you see user01.
Now we have access to user01 within our Sandbox. This is where you can create folders and store data to be used to test out your Spark code. When working with the Sandbox itself, you can use the Sandbox command line if you choose, or you can connect via the terminal or PuTTY on your machine as “user01”. If you choose to connect via a terminal, use ssh and the following command:

$ ssh user01@localhost -p 2222

The password is: mapr

Welcome to your Mapr Demo Virtual machine.
[user01@maprdemo ~]$
For this tutorial, I am using a Mac laptop and a terminal application called iTerm2. I could also use the default Terminal application on my Mac.
The Sandbox comes with Spark installed. Python is also installed on the Sandbox, and the Python version is 2.6.6.
[user01@maprdemo ~]$ python --version
Python 2.6.6
PySpark uses Python and Spark; however, there are some additional packages needed. To install these additional packages, we need to become the root user for the sandbox. (password is: mapr)
[user01@maprdemo ~]$ su -
Password:
[root@maprdemo ~]#
[root@maprdemo ~]# yum -y install python-pip
[root@maprdemo ~]# pip install nose
[root@maprdemo ~]# pip install numpy
The NumPy install might take a minute or two. NumPy and Nose are packages that provide array manipulation and unit testing within Python.
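If you want to confirm the installs worked before moving on, a quick check from the root prompt is shown below; this one-liner is a convenience check, not part of the original steps.

[root@maprdemo ~]# python -c "import numpy, nose; print numpy.__version__"

When the packages are installed, switch back to user01: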
[root@maprdemo ~]# su - user01
[user01@maprdemo ~]$
PySpark in the Sandbox

To start PySpark, type the following:
[user01@maprdemo ~]$ pyspark --master yarn-client
Below is a screenshot of approximately what your output will look like. You will be in Spark, but with a Python shell.

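Once the shell is up, a quick way to confirm that it is connected to Spark is to inspect the SparkContext the shell creates for you. This is just a sanity check; the sc variable is created automatically by the PySpark shell.

>>> sc
>>> sc.master    # should report yarn-client, the master we passed on startup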
The following code will be executed within PySpark at the >>> prompt.