PySpark is the Spark Python API, which lets you interact with Spark through a Python shell. If you have a Python programming background, this is an excellent way to get introduced to Spark data types and parallel programming. PySpark is a particularly flexible tool for exploratory big data analysis because it integrates with the rest of the Python data analysis ecosystem, including pandas (DataFrames), NumPy (arrays), and Matplotlib (visualization). In this blog post, you’ll get some hands-on experience using PySpark and the MapR Sandbox.
Example: Using Clustering on Cyber Network Data to Identify Anomalous Behavior

Unsupervised learning is an exploratory area of data analysis. These methods are used to learn about the structure and behavior of the data. Keep in mind that they are not used to predict or classify, but rather to interpret and understand.
Clustering is a popular unsupervised learning method in which the algorithm attempts to identify natural groups within the data. K-means is the most widely used clustering algorithm; “k” is the number of groups that the data falls into. In k-means, k is assigned by the analyst, and choosing the value of k is where the interpretation of the data comes into play.
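To make that concrete, here is a minimal sketch of training a k-means model with PySpark’s MLlib. The toy vectors and the choice of k=2 are purely illustrative and are not part of this tutorial’s dataset; sc is the SparkContext created by the PySpark shell.

from numpy import array
from pyspark.mllib.clustering import KMeans

# Toy feature vectors; in practice this would be an RDD built from your data
features = sc.parallelize([
    array([0.0, 0.0]), array([0.1, 0.2]),
    array([9.0, 8.0]), array([9.2, 8.1]),
])

# The analyst supplies k; here k=2 says we expect two natural groups
model = KMeans.train(features, 2, maxIterations=10)

print(model.clusterCenters)              # the learned group centers
print(model.predict(array([0.2, 0.1])))  # the group a new point falls into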
In this example, we will be using a dataset from an annual data mining competition, The KDD Cup ( http://www.sigkdd.org/kddcup/index.php ). In 1999, the topic was network intrusion, and the data set is still available ( http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html ). We will use the kddcup.data.gz file, which consists of 42 features and approximately 4.9 million rows.
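As a preview of what a first pass over this data looks like once the file has been copied onto the sandbox and you are inside the PySpark shell, a sketch might resemble the following. The path /user/user01/kddcup.data.gz is an assumption about where you stage the file; adjust it to your own location.

# Spark reads gzipped text files directly, so there is no need to decompress first
raw = sc.textFile("/user/user01/kddcup.data.gz")

print(raw.count())   # roughly 4.9 million rows
print(raw.first())   # one comma-separated record

# Each record splits into its comma-separated fields
fields = raw.map(lambda line: line.split(","))
print(len(fields.first()))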
Using clustering on cyber network data to identify anomalous behavior is a common application of unsupervised learning. The sheer amount of data collected makes it impossible to go through each log or event to properly determine whether that network event was normal or anomalous. Intrusion Detection Systems (IDS) and Intrusion Prevention Systems (IPS) are often the only applications networks have to filter this data, and their filters are often based on signatures of anomalous activity that can take time to be updated. Before an update occurs, it is valuable to have analysis techniques for checking your network data for recent anomalous activity.
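One common way to turn cluster assignments into an anomaly check is to score each record by its distance to the nearest cluster center and then inspect the records that sit farthest away. Below is a minimal sketch of that idea, assuming model is a KMeansModel returned by KMeans.train and data is an RDD of numeric feature vectors; both names are placeholders, not variables defined earlier in this tutorial.

import numpy as np

# Pull the learned centers back as plain NumPy arrays so the scoring
# function serializes cleanly to the workers
centers = [np.asarray(c) for c in model.clusterCenters]

def distance_to_nearest_center(point):
    # Euclidean distance from the record to its closest cluster center
    return min(float(np.linalg.norm(point - c)) for c in centers)

# Score every record and look at the ten farthest from any center;
# these are the candidate anomalies to investigate
scored = data.map(lambda p: (distance_to_nearest_center(p), p))
print(scored.top(10, key=lambda pair: pair[0]))

Working with the raw center arrays, rather than the model object itself, keeps the closure small and easy to ship to the executors.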
K-means is also used in analysis of social media data, financial transactions, and demographics. For example, you can use clustering analysis to identify groups of Twitter users who Tweet from specific geographic regions using their latitude, longitude, and sentiment scores.
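A sketch of how those features might be assembled is shown below; the tweet records here are hypothetical placeholders used only to illustrate the shape of the input.

from numpy import array

# Hypothetical records: (user, latitude, longitude, sentiment score)
tweets = sc.parallelize([
    ("userA", 40.7, -74.0, 0.8),
    ("userB", 40.6, -73.9, 0.7),
    ("userC", 34.0, -118.2, -0.3),
])

# Keep only the numeric columns as the clustering features
tweet_features = tweets.map(lambda t: array([t[1], t[2], t[3]]))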
Code for computing k-means in Spark using Scala can be found in many books and blogs. Implementing this code in PySpark uses a slightly different syntax, but many elements are the same, so it will look familiar. The MapR Sandbox offers an excellent environment where Spark comes pre-installed, letting you get right to the analysis rather than worrying about software installation.
Install the Sandbox

The instructions in this example use the Sandbox in VirtualBox, but either VMware or VirtualBox can be used. For directions on installing the Sandbox in VirtualBox, follow this link:

http://maprdocs.mapr.com/51/#SandboxHadoop/t_install_sandbox_vbox.html

Start the Sandbox in Your Virtual Machine

To begin, start the MapR Sandbox that you have installed using VMware or VirtualBox. It might take a minute or two to fully initialize.

NOTE: You need to press the “command” key on macOS or the right “control” key on Windows to release your mouse cursor from the console window.
Once the Sandbox is started, take a look at what comes up. The Sandbox itself is an environment where you can interact with your data, but if you go to http://127.0.0.1:8443/ you can access the file system and familiarize yourself with how the data is stored.

For this tutorial, we will be in HUE. Launch HUE and type in the username/password combination:

Username: mapr
Password: mapr

Once HUE opens, go to the file browser:

When you are in the file browser, you will see that you are in the /user/mapr directory.

We are going to operate as user01. To get to that directory, click on the /user directory.

Make sure you see user01.
Now we have access to user01 within our Sandbox. This is where you can create folders and store data to be used to test out your Spark code. When working with the Sandbox itself, you can use the Sandbox command line if you choose, or you can connect via the terminal or PuTTY on your machine as “user01”. If you choose to connect via a terminal, use ssh and the following command:

$ ssh user01@localhost -p 2222

The password is: mapr

Welcome to your Mapr Demo Virtual machine.
[user01@maprdemo ~]$
For this tutorial, I am using a Mac laptop and a terminal application called iTerm2. I could also use the default Terminal application on my Mac.
The Sandbox comes with Spark installed. Python is also installed on the Sandbox, and the Python version is 2.6.6.
[user01@maprdemo ~]$ python --version
Python 2.6.6
PySpark uses Python and Spark; however, there are some additional packages needed. To install these additional packages, we need to become the root user for the sandbox. (password is: mapr)
[user01@maprdemo ~]$ su -
Password:
[root@maprdemo ~]#
[root@maprdemo ~]# yum -y install python-pip
[root@maprdemo ~]# pip install nose
[root@maprdemo ~]# pip install numpy
The NumPy install might take a minute or two. NumPy and Nose are packages that provide array manipulation and unit testing within Python.
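If you want to confirm the installs worked before moving on, a quick check from the root prompt is shown below; this one-liner is a convenience check, not part of the original steps.

[root@maprdemo ~]# python -c "import numpy, nose; print numpy.__version__"

When the packages are installed, switch back to user01: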
[root@maprdemo ~]# su - user01
[user01@maprdemo ~]$
PySpark in the Sandbox

To start PySpark, type the following:
[user01@maprdemo ~]$ pyspark --master yarn-client
Below is a screenshot of approximately what your output will look like. You will be in Spark, but with a Python shell.

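Once the shell is up, a quick way to confirm that it is connected to Spark is to inspect the SparkContext the shell creates for you. This is just a sanity check; the sc variable is created automatically by the PySpark shell.

>>> sc
>>> sc.master    # should report yarn-client, the master we passed on startup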
The following code will be executed within PySpark at the >>> prompt.