Hadoop Streaming: Writing A Hadoop MapReduce Program In Python

The quantity of digital data generated every day is growing exponentially with the advent of digital media and the Internet of Things, among other developments. This growth has created challenges in building next-generation tools and technologies to store and manipulate these data. This is where Hadoop Streaming comes in! The graph below depicts the growth of data generated annually worldwide since 2013. IDC estimates that the amount of data created annually will reach 180 zettabytes in 2025!

Source: IDC

IBM states that almost 2.5 quintillion bytes of data are created every day, with 90 percent of the world’s data created in the last two years! Storing such an expansive amount of data is a challenging task. Hadoop can handle large volumes of structured and unstructured data more efficiently than a traditional enterprise data warehouse. It stores these enormous data sets across distributed clusters of computers. Hadoop Streaming uses the MapReduce framework, which can be used to write applications that process huge amounts of data.

Since the MapReduce framework is based on Java, you might be wondering how a developer can work with it without Java experience. With Hadoop Streaming, developers can write mapper and reducer applications in their preferred language, without much knowledge of Java and without switching to new tools or technologies like Pig and Hive.

What is Hadoop Streaming?

Hadoop Streaming is a utility that comes with the Hadoop distribution. It can be used to execute programs for big data analysis. Streaming jobs can be written in languages such as Python, Java, PHP, Scala, Perl, UNIX shell, and many more. The utility allows us to create and run Map/Reduce jobs with any executable or script as the mapper and/or the reducer. For example:

    $HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
        -input myInputDirs \
        -output myOutputDir \
        -mapper /bin/cat \
        -reducer /bin/wc
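To make the contract behind this command more concrete, here is a minimal in-process sketch of what a streaming job does: the mapper sees input lines, its output is sorted, and the reducer sees the sorted lines. The names `cat_mapper`, `wc_reducer`, and `run_streaming` are hypothetical stand-ins invented for illustration; they are not part of Hadoop. The sketch also assumes a single reducer and ignores partitioning and key/value separator handling.

```python
def cat_mapper(lines):
    """Identity mapper, standing in for /bin/cat: emit every line unchanged."""
    for line in lines:
        yield line


def wc_reducer(lines):
    """Stand-in for /bin/wc: count the lines, words, and characters seen."""
    n_lines = n_words = n_chars = 0
    for line in lines:
        n_lines += 1
        n_words += len(line.split())
        n_chars += len(line)
    yield "%d %d %d\n" % (n_lines, n_words, n_chars)


def run_streaming(input_lines, mapper, reducer):
    """Simulate one job: map, sort the mapper output, then reduce."""
    mapped = list(mapper(input_lines))
    mapped.sort()  # the framework sorts mapper output before the reduce phase
    return list(reducer(mapped))


# Hypothetical two-line input, used only for this illustration
sample = ["lion deer\n", "cat\n"]
result = run_streaming(sample, cat_mapper, wc_reducer)
print(result[0].strip())  # prints "2 3 14"
```

Because the mapper and reducer only exchange lines of text, either one can be replaced by any executable that reads standard input and writes standard output.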

Parameter descriptions:

| Parameter | Optional / Required | Description |
| --- | --- | --- |
| -input directoryname or filename | Required | Input location for mapper |
| -output directoryname or filename | Required | Output location for reducer |
| -mapper executable | Required | Mapper executable |
| -reducer executable | Required | Reducer executable |
| -file filename | Optional | Make the mapper, reducer, or combiner executable available locally on the compute nodes |

Python MapReduce code:

mapper.py

    #!/usr/bin/python
    import sys

    # Word count example
    # Input comes from standard input (STDIN)
    for line in sys.stdin:
        line = line.strip()   # remove leading and trailing whitespace
        words = line.split()  # split the line into a list of words
        for word in words:
            # Write the result to standard output (STDOUT) as word<TAB>1
            print('%s\t%s' % (word, 1))  # emit the word

reducer.py

    #!/usr/bin/python
    import sys

    # Track the word currently being counted. This relies on the input
    # being sorted by word, so that identical words arrive consecutively.
    current_word = None
    current_count = 0

    # Input comes from STDIN
    for line in sys.stdin:
        line = line.strip()
        # Parse the tab-separated output of mapper.py
        word, count = line.split('\t', 1)
        try:
            count = int(count)
        except ValueError:
            # count was not a number, so skip this line
            continue
        if current_word == word:
            current_count += count
        else:
            if current_word:
                # A new word started, so emit the finished one
                print('%s\t%s' % (current_word, current_count))
            current_count = count
            current_word = word

    # Emit the last word, if there was any input
    if current_word is not None:
        print('%s\t%s' % (current_word, current_count))
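The reducer depends on its input being sorted by word, so that all counts for one word arrive together. As an alternative sketch of the same idea, the grouping can be expressed with `itertools.groupby` from the standard library. The helper names `parse` and `reduce_counts` and the sample input are assumptions made for illustration, not part of the original scripts.

```python
from itertools import groupby


def parse(lines):
    """Yield (word, count) pairs from 'word<TAB>count' lines, skipping bad ones."""
    for line in lines:
        word, sep, count = line.strip().partition("\t")
        if not sep:
            continue  # no tab found, so this is not a mapper output line
        try:
            yield word, int(count)
        except ValueError:
            continue  # count was not a number


def reduce_counts(sorted_lines):
    """Sum the counts per word; the input must already be sorted by word."""
    pairs = parse(sorted_lines)
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        yield word, sum(count for _, count in group)


# Hypothetical, already sorted mapper output
sorted_input = ["deer\t1", "deer\t1", "lion\t1", "lion\t1", "lion\t1", "mouse\t1"]
for word, total in reduce_counts(sorted_input):
    print("%s\t%d" % (word, total))
```

`groupby` only merges adjacent equal keys, so unsorted input such as `lion, deer, lion` would yield two separate `lion` entries. That is the same reason the loop-based reducer needs sorted input.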

Run:

Create a file with the following content and name it word.txt.

Cat mouse lion deer Tiger lion Elephant lion deer

Copy the mapper.py and reducer.py scripts to the same folder where the above file exists.

Open a terminal and locate the directory of the file.

Commands:
ls : to list all files in the directory
cd : to change directory/folder

See the content of the file.

Command: cat file_name

> Content of mapper.py

Command: cat mapper.py

> Content of reducer.py

Command: cat reducer.py

We can run the mapper and reducer on local files (e.g., word.txt). To run the map and reduce steps on the Hadoop Distributed File System (HDFS), we need the Hadoop Streaming jar. So before we run the scripts on HDFS, let’s run them locally to ensure that they work correctly.

> Run the mapper

Command: cat word.txt | python mapper.py
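The same local check can also be sketched in Python alone, without a shell pipeline. The functions below restate the mapper and reducer logic as generators and add an explicit sort between them, standing in for the sorting that Hadoop performs between the map and reduce phases. The function names and the in-memory `text` list are assumptions for illustration; the actual scripts read from standard input.

```python
def mapper(lines):
    """Emit 'word<TAB>1' for every whitespace-separated word."""
    for line in lines:
        for word in line.strip().split():
            yield "%s\t%d" % (word, 1)


def reducer(sorted_lines):
    """Sum the counts of consecutive identical words."""
    current_word, current_count = None, 0
    for line in sorted_lines:
        word, count = line.split("\t", 1)
        count = int(count)
        if word == current_word:
            current_count += count
        else:
            if current_word is not None:
                yield "%s\t%d" % (current_word, current_count)
            current_word, current_count = word, count
    if current_word is not None:
        yield "%s\t%d" % (current_word, current_count)


# The content of word.txt from the example above
text = ["Cat mouse lion deer Tiger lion Elephant lion deer"]

mapped = list(mapper(text))  # what "cat word.txt | python mapper.py" emits
shuffled = sorted(mapped)    # stand-in for the sort between map and reduce
for line in reducer(shuffled):
    print(line)
```

Python's `sorted` orders uppercase letters before lowercase ones, so `Cat`, `Elephant`, and `Tiger` come first here. A shell `sort` may use a different order depending on the locale, but the counts per word are the same.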