
Hadoop Streaming: Writing A Hadoop MapReduce Program In Python


The quantity of digital data generated every day is growing exponentially with the advent of digital media and the Internet of Things, among other developments. This growth makes it challenging to build next-generation tools and technologies that can store and manipulate the data. This is where Hadoop Streaming comes in. The graph below depicts the growth of data generated annually worldwide since 2013. IDC estimates that the amount of data created annually will reach 180 zettabytes in 2025.


[Figure: growth of data generated annually worldwide since 2013]

Source: IDC

IBM states that almost 2.5 quintillion bytes of data are created every day, and that 90 percent of the world's data was created in the last two years. Storing such an expansive amount of data is a challenging task. Hadoop can handle large volumes of structured and unstructured data more efficiently than a traditional enterprise data warehouse, and it stores these enormous data sets across distributed clusters of computers. Hadoop Streaming uses the MapReduce framework, which can be used to write applications that process very large amounts of data.

Since the MapReduce framework is based on Java, you might wonder how a developer without Java experience can work with it. With Hadoop Streaming, developers can write the mapper and reducer in their preferred language without much knowledge of Java, rather than switching to new tools or technologies such as Pig and Hive.

What is Hadoop Streaming?

Hadoop Streaming is a utility that comes with the Hadoop distribution. It can be used to execute programs for big data analysis. Streaming jobs can be written in languages such as Python, Java, PHP, Scala, Perl, UNIX shell, and many more. The utility allows us to create and run Map/Reduce jobs with any executable or script as the mapper and/or the reducer. For example:

$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
    -input myInputDirs \
    -output myOutputDir \
    -mapper /bin/cat \
    -reducer /bin/wc
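The same command line can be assembled programmatically, which helps when the job is launched from a script. The following is only a sketch under stated assumptions: the function name build_streaming_command is hypothetical, the jar location mirrors the example command and may differ on your installation, and /opt/hadoop is a made-up HADOOP_HOME. It assumes Python 3.

```python
def build_streaming_command(hadoop_home, input_path, output_path,
                            mapper, reducer, files=()):
    """Return the Hadoop Streaming command as a list of arguments.

    Only options shown in this article are used. The jar location mirrors
    the example command and is an assumption about the installation.
    """
    command = [
        hadoop_home + "/bin/hadoop",
        "jar",
        hadoop_home + "/hadoop-streaming.jar",
        "-input", input_path,
        "-output", output_path,
        "-mapper", mapper,
        "-reducer", reducer,
    ]
    # Each -file entry makes a local script available on the compute nodes.
    for path in files:
        command += ["-file", path]
    return command


# "/opt/hadoop" is a hypothetical HADOOP_HOME used only for illustration.
cmd = build_streaming_command(
    "/opt/hadoop", "myInputDirs", "myOutputDir", "/bin/cat", "/bin/wc"
)
print(" ".join(cmd))
```

The list form can be passed to subprocess.run on a machine where Hadoop is installed; printing it first is a cheap way to check the arguments before submitting a job.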

Parameters Description:

Parameter                           Optional / Required   Description
-input directoryname or filename    Required              Input location for mapper
-output directoryname or filename   Required              Output location for reducer
-mapper executable                  Required              Mapper executable
-reducer executable                 Required              Reducer executable
-file filename                      Optional              Make the mapper, reducer, or combiner executable available locally on the compute nodes

Python MapReduce Code:

mapper.py

#!/usr/bin/python
import sys

# Word count example
# Input comes from standard input (STDIN)
for line in sys.stdin:
    line = line.strip()   # remove leading and trailing whitespace
    words = line.split()  # split the line into a list of words
    for word in words:
        # Write the result to standard output (STDOUT):
        # the word, a tab, and a count of 1
        print('%s\t%s' % (word, 1))
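To make the mapper's output concrete, the same logic can be written as a function and applied to the sample line used later in this article. This is a minimal sketch, assuming Python 3; the name map_words is hypothetical and is not part of mapper.py.

```python
def map_words(lines):
    """Yield a (word, 1) pair for every whitespace-separated word."""
    for line in lines:
        for word in line.strip().split():
            yield word, 1


# The sample line is the content of word.txt from this article.
sample = ["Cat mouse lion deer Tiger lion Elephant lion deer"]
for word, count in map_words(sample):
    # Same tab-separated format that mapper.py writes to STDOUT
    print("%s\t%s" % (word, count))
```

Every word is emitted with a count of 1, including repeated words; adding the counts up is left entirely to the reducer.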



reducer.py

#!/usr/bin/python
import sys

# Track the word currently being counted.
# The input must be sorted by word so that equal words are adjacent.
current_word = None
current_count = 0
word = None

# Input comes from STDIN
for line in sys.stdin:
    line = line.strip()
    # Each line is "word<TAB>count", as written by mapper.py
    word, count = line.split('\t', 1)
    try:
        count = int(count)
    except ValueError:
        # Ignore lines whose count is not a number
        continue
    if current_word == word:
        current_count += count
    else:
        if current_word:
            # A new word started, so emit the total for the previous one
            print('%s\t%s' % (current_word, current_count))
        current_count = count
        current_word = word

# Emit the total for the last word, if any input was read
if current_word is not None:
    print('%s\t%s' % (current_word, current_count))
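The reducer relies on its input being sorted by word, so that all counts for one word arrive on consecutive lines. The same idea can be expressed with itertools.groupby, which groups adjacent items that share a key. This is an alternative sketch rather than the script above, assuming Python 3; the names parse_pairs and reduce_counts are hypothetical.

```python
from itertools import groupby
from operator import itemgetter


def parse_pairs(lines, separator="\t"):
    """Yield (word, count) pairs, skipping malformed lines."""
    for line in lines:
        word, sep, count = line.rstrip("\n").partition(separator)
        if not sep:
            continue  # no separator found on this line
        try:
            yield word, int(count)
        except ValueError:
            continue  # the count was not a number


def reduce_counts(lines, separator="\t"):
    """Yield (word, total) pairs; lines must already be sorted by word."""
    pairs = parse_pairs(lines, separator)
    # groupby only merges adjacent equal keys, hence the sorting requirement
    for word, group in groupby(pairs, key=itemgetter(0)):
        yield word, sum(count for _, count in group)


sorted_lines = ["deer\t1\n", "deer\t1\n", "lion\t1\n", "lion\t1\n", "lion\t1\n"]
for word, total in reduce_counts(sorted_lines):
    print("%s\t%s" % (word, total))
```

To use it as a streaming reducer, pass sys.stdin to reduce_counts instead of the sample list.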



Run:

Create a file with the following content and name it word.txt.

Cat mouse lion deer Tiger lion Elephant lion deer

Copy the mapper.py and reducer.py scripts to the same folder where the above file exists.
Open a terminal and locate the directory containing the file.

Commands:
ls : to list all files in the directory
cd : to change directory/folder

See the content of the file.

Command: cat file_name



>Content of mapper.py

command: cat mapper.py



>Content of reducer.py

command: cat reducer.py



We can run the mapper and reducer on local files (for example, word.txt). To run the map and reduce steps on data in the Hadoop Distributed File System (HDFS), we need the Hadoop Streaming jar. So before we run the scripts on HDFS, let's run them locally to ensure that they work as expected.

>Run the mapper

command: cat word.txt | python mapper.py



>Run reducer.py

command: cat word.txt | python mapper.py | sort -k1,1 | python reducer.py



We can see that the mapper and reducer work as expected on the local file, so the scripts are ready to run on Hadoop.
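One way to cross-check the result of the local pipeline is to count the words of the sample line directly with collections.Counter. This is a small sketch, assuming Python 3; the name expected_counts is hypothetical. Python's sorted() may order words differently from the shell's sort, which depends on the locale, so compare the counts rather than the line order.

```python
from collections import Counter


def expected_counts(text):
    """Count whitespace-separated words, case-sensitively like mapper.py."""
    return Counter(text.split())


# The content of word.txt from this article
counts = expected_counts("Cat mouse lion deer Tiger lion Elephant lion deer")
for word in sorted(counts):
    print("%s\t%s" % (word, counts[word]))
```

Because the count is case-sensitive, "Cat" and "cat" would be treated as different words, just as they are by the mapper and reducer scripts.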

Running the Python Code on Hadoop

Before we run the MapReduce task on Hadoop, copy the local data (word.txt) to HDFS.

>example: hdfs dfs -put source_directory hadoop_destination_directory

command: hdfs dfs -put /home/edureka/MapReduce/word.txt /user/edureka



Copy the path of the jar file

The path of the Hadoop Streaming jar depends on the installed Hadoop version, for example:

/usr/lib/hadoop-2.2.X/share/hadoop/tools/lib/hadoop-streaming-2.2.X.jar

Locate the Hadoop Streaming jar in your terminal and copy its path.

command:

ls /usr/lib/hadoop-2.2.0/share/hadoop/tools/lib/hadoop-streaming-2.2.
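Since the jar file name contains the Hadoop version, a wildcard search of the library directory can locate it without knowing the exact version. The sketch below assumes Python 3 and that the jar follows the hadoop-streaming-<version>.jar naming shown above; the function name find_streaming_jars is hypothetical.

```python
import glob
import os


def find_streaming_jars(lib_dir):
    """Return the sorted paths of hadoop-streaming jars found in lib_dir."""
    pattern = os.path.join(lib_dir, "hadoop-streaming-*.jar")
    return sorted(glob.glob(pattern))


# Directory taken from the example above; adjust it to your installation.
# The result is an empty list if the directory or the jar does not exist.
print(find_streaming_jars("/usr/lib/hadoop-2.2.0/share/hadoop/tools/lib"))
```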
