Hadoop Streaming Tutorial With Python


Page 1: Hadoop Streaming Tutorial With Python

/*
Joe Stein, Chief Architect
http://www.medialets.com
Twitter: @allthingshadoop
*/

Tutorial: Streaming Jobs (& Non-Java Hadoop)

Sample Code: https://github.com/joestein/amaunet

Page 3: Hadoop Streaming Tutorial With Python


Medialets

Page 4: Hadoop Streaming Tutorial With Python

Medialets

• Largest deployment of rich media ads for mobile devices
• Installed on hundreds of millions of devices
• 3-4 TB of new data every day
• Thousands of services in production
• Hundreds of thousands of events received every second
• Response times are measured in microseconds
• Languages
  – 35% JVM (20% Scala & 10% Java)
  – 30% Ruby
  – 20% C/C++
  – 13% Python
  – 2% Bash

Page 5: Hadoop Streaming Tutorial With Python


MapReduce 101

Why and How It Works

Page 6: Hadoop Streaming Tutorial With Python

Sample Dataset

Data set 1: countries.dat

name|key

United States|US
Canada|CA
United Kingdom|UK
Italy|IT

Page 7: Hadoop Streaming Tutorial With Python

Sample Dataset

Data set 2: customers.dat

name|type|country

Alice Bob|not bad|US
Sam Sneed|valued|CA
Jon Sneed|valued|CA
Arnold Wesise|not so good|UK
Henry Bob|not bad|US
Yo Yo Ma|not so good|CA
Jon York|valued|CA
Alex Ball|valued|UK
Jim Davis|not so bad|JA

Page 8: Hadoop Streaming Tutorial With Python

Sample Dataset

The requirement: find out, grouped by type of customer, how many of each type are in each country, with the full country name from countries.dat in the final result (not the two-letter country code).

To do this you need to:

1) Join the data sets
2) Key on country
3) Count type of customer per country
4) Output the results

A plain-Python reference version of the same logic is sketched below.
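Before mapping this onto MapReduce, it helps to see the same join and count as ordinary Python. This sketch is not from the talk; it assumes the two .dat files sit in the working directory and that the name|key and name|type|country lines above are column labels rather than literal header rows in the files.

#!/usr/bin/env python

# A plain-Python sketch of the requirement, with no Hadoop involved.
# Assumes countries.dat and customers.dat are in the working directory.
countries = {}
for line in open("countries.dat"):
    name, key = line.strip().split("|")
    countries[key] = name

counts = {}
for line in open("customers.dat"):
    name, personType, key = line.strip().split("|")
    country = countries.get(key, "%s - Unknown Country" % key)
    counts[(country, personType)] = counts.get((country, personType), 0) + 1

for country, personType in sorted(counts):
    print "%s\t%s\t%s" % (country, personType, counts[(country, personType)])

The MapReduce version in the rest of the deck produces the same result, but has to express the join and the grouping through keys and sort order instead of an in-memory dict.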

Page 9: Hadoop Streaming Tutorial With Python

Sample Dataset

United States|US
Canada|CA
United Kingdom|UK
Italy|IT

Alice Bob|not bad|US
Sam Sneed|valued|CA
Jon Sneed|valued|CA
Arnold Wesise|not so good|UK
Henry Bob|not bad|US
Yo Yo Ma|not so good|CA
Jon York|valued|CA
Alex Ball|valued|UK
Jim Davis|not so bad|JA

Canada          not so good   1
Canada          valued        3
JA - Unknown Country   not so bad   1
United Kingdom  not so good   1
United Kingdom  valued        1
United States   not bad       2

Page 10: Hadoop Streaming Tutorial With Python

So many ways to MapReduce

• Java
• Hive
• Pig
• Datameer
• Cascading
  – Cascalog
  – Scalding
• Streaming with a framework
  – Wukong
  – Dumbo
  – mrjob
• Streaming without a framework
  – You can even do it with bash scripts, but don't

Page 11: Hadoop Streaming Tutorial With Python

Why and When

There are two types of jobs in Hadoop: 1) data transformation and 2) queries.

• Java
  – Faster? Maybe not, because you might not know how to optimize it as well as the Pig and Hive committers do. And it's Java, so it does not work outside of Hadoop without other Apache projects to let it do so.
• Hive & Pig
  – Definitely a possibility, but maybe better after you have created your data set. Does not work outside of Hadoop.
• Datameer
  – WICKED cool front end, seriously!!!
• Streaming
  – With a framework: one more thing to learn.
  – Without a framework: MapReduce with and without Hadoop. Huh? Really? Yeah!!!

Page 12: Hadoop Streaming Tutorial With Python

How does streaming work

stdin & stdout

• Hadoop actually opens a process and writes to and reads from it
• Is this efficient? Yeah, it is when you look at it
• You can read/write to your process without Hadoop – score!!!
• Why would you do this?
  – You should not put things into Hadoop that don't belong there. Prototype and go live without the overhead!
  – You can have your MapReduce program run outside of Hadoop until it is ready and NEEDS to be running there
  – Really great dev lifecycles
  – Did I mention the great dev lifecycles?
  – You can write a script in 5 minutes, seriously, and then interrogate TERABYTES of data without a fuss

A minimal sketch of what such a process looks like follows.
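To make the stdin/stdout contract concrete, here is a minimal streaming-style script (a sketch, not from the deck): it is nothing but a process that reads lines on stdin and writes tab-separated key/value pairs on stdout, which is all Hadoop streaming requires of a mapper.

#!/usr/bin/env python

import sys

# A streaming mapper is just a process: lines arrive on stdin and
# tab-separated key/value pairs leave on stdout. Hadoop does the rest.
for line in sys.stdin:
    line = line.strip()
    if not line:
        continue
    print '%s\t%s' % (line, 1) # emit the line as the key with a count of 1

Because it is only a process, cat data | ./mapper.py | sort behaves locally the same way the mapper behaves inside the cluster, which is where the great dev lifecycle comes from.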

Page 13: Hadoop Streaming Tutorial With Python

Blah blah blah

Where's the beef?

#!/usr/bin/env python

import sys

# input comes from STDIN (standard input)
for line in sys.stdin:
    try:
        # sometimes bad data can cause errors; use this however you like to deal with lint and bad data
        personName = "-1"    # default, sorted as first
        personType = "-1"    # default, sorted as first
        countryName = "-1"   # default, sorted as first
        country2digit = "-1" # default, sorted as first

        # remove leading and trailing whitespace
        line = line.strip()
        splits = line.split("|")

        if len(splits) == 2: # country data
            countryName = splits[0]
            country2digit = splits[1]
        else: # people data
            personName = splits[0]
            personType = splits[1]
            country2digit = splits[2]

        print '%s^%s^%s^%s' % (country2digit, personType, personName, countryName)
    except: # errors are going to make your job fail, which you may or may not want
        pass
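To see what a single record turns into, feed one line straight to the script (named smplMapper.py in the run command later in the deck):

echo 'Sam Sneed|valued|CA' | ./smplMapper.py

prints CA^valued^Sam Sneed^-1. Putting country2digit first and defaulting missing fields to "-1" is deliberate: in a byte-wise sort, "-1" sorts ahead of real values, so each country's mapping line reaches the reducer before that country's people.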

Page 14: Hadoop Streaming Tutorial With Python

Here is the output of that

CA^-1^-1^Canada
CA^not so good^Yo Yo Ma^-1
CA^valued^Jon Sneed^-1
CA^valued^Jon York^-1
CA^valued^Sam Sneed^-1
IT^-1^-1^Italy
JA^not so bad^Jim Davis^-1
UK^-1^-1^United Kingdom
UK^not so good^Arnold Wesise^-1
UK^valued^Alex Ball^-1
US^-1^-1^United States
US^not bad^Alice Bob^-1
US^not bad^Henry Bob^-1

Page 15: Hadoop Streaming Tutorial With Python

Padding is your friend

All sorts are not created equal

Josephs-MacBook-Pro:~ josephstein$ cat test
1,,2
1,1,2
Josephs-MacBook-Pro:~ josephstein$ cat test | sort
1,,2
1,1,2

[root@megatron joestein]# cat test
1,,2
1,1,2
[root@megatron joestein]# cat test | sort
1,1,2
1,,2
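The same file sorts differently on the two machines because sort is locale-sensitive, and empty fields make the difference visible. One defensive habit, sketched here (not from the deck, and KEY_WIDTH is an arbitrary choice), is to pad key fields to a fixed width before emitting them, so any byte-wise sort agrees:

#!/usr/bin/env python

# Pad key fields to a fixed width so empty and short values sort
# consistently everywhere. KEY_WIDTH is an arbitrary choice here.
KEY_WIDTH = 20

def padKey(field):
    return field.zfill(KEY_WIDTH) # left-pad with zeros; '' becomes all zeros

print '%s,%s,%s' % (padKey('1'), padKey(''), padKey('2'))

When testing locally, forcing a byte-wise collation with LC_ALL=C sort also takes the locale out of the picture.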

Page 16: Hadoop Streaming Tutorial With Python

And the reducer:

#!/usr/bin/env python

import sys

# state carried from one input line to the next
foundKey = ""
foundValue = ""
isFirst = 1
currentCount = 0
currentCountry2digit = "-1"
currentCountryName = "-1"
isCountryMappingLine = False

# input comes from STDIN
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    try:
        # parse the input we got from the mapper
        country2digit, personType, personName, countryName = line.split('^')
        # the first line for a country should be a mapping line;
        # otherwise we need to mark the country as not known
        if personName == "-1": # this is a new country which may or may not have people in it
            currentCountryName = countryName
            currentCountry2digit = country2digit
            isCountryMappingLine = True
        else:
            isCountryMappingLine = False # this is a person we want to count
        if not isCountryMappingLine: # we only count people, but use the country line to get the right name
            # first check that the 2-digit country info matches up; it might be an unknown country
            if currentCountry2digit != country2digit:
                currentCountry2digit = country2digit
                currentCountryName = '%s - Unknown Country' % currentCountry2digit
            currentKey = '%s\t%s' % (currentCountryName, personType)
            if foundKey != currentKey: # new combination of keys to count
                if isFirst == 0:
                    print '%s\t%s' % (foundKey, currentCount)
                    currentCount = 0 # reset the count
                else:
                    isFirst = 0
                foundKey = currentKey # remember the key we saw, so next loop we know whether to increment or print
            currentCount += 1 # we increment anything that is not a mapping line
    except: # errors here would otherwise fail the whole job
        pass

# flush the final key's count
try:
    print '%s\t%s' % (foundKey, currentCount)
except:
    pass

Page 17: Hadoop Streaming Tutorial With Python

How to run it

• cat customers.dat countries.dat | ./smplMapper.py | sort | ./smplReducer.py

• su hadoop -c "hadoop jar /usr/lib/hadoop-0.20/contrib/streaming/hadoop-0.20.1+169.89-streaming.jar \
    -D mapred.map.tasks=75 -D mapred.reduce.tasks=42 \
    -file ./smplMapper.py -mapper ./smplMapper.py \
    -file ./smplReducer.py -reducer ./smplReducer.py \
    -input $1 -output $2 \
    -inputformat SequenceFileAsTextInputFormat \
    -partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner \
    -jobconf stream.map.output.field.separator=^ \
    -jobconf stream.num.map.output.key.fields=4 \
    -jobconf map.output.key.field.separator=^ \
    -jobconf num.key.fields.for.partition=1"

Page 18: Hadoop Streaming Tutorial With Python

Breaking down the Hadoop job

• -partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner
  – This is how you handle keying on values
• -jobconf stream.map.output.field.separator=^
  – Tells Hadoop how to parse your output so it can key on it
• -jobconf stream.num.map.output.key.fields=4
  – How many fields in total
• -jobconf map.output.key.field.separator=^
  – You can key on your map fields separately
• -jobconf num.key.fields.for.partition=1
  – This is how many of those fields are your "key"; the rest are for sorting
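As a rough illustration of what those settings do to one mapper line (this simulation is not Hadoop's actual code): with num.key.fields.for.partition=1, only the first ^-separated field picks the reducer, while all four key fields control the sort order that reducer sees.

#!/usr/bin/env python

# Rough simulation of KeyFieldBasedPartitioner for one mapper line with
# separator '^', 4 key fields, and 1 partition field. Not Hadoop's code.
line = 'CA^valued^Sam Sneed^-1'
fields = line.split('^')

partitionKey = fields[0]       # 'CA': decides which reducer gets the line
sortKey = '^'.join(fields[:4]) # all four fields: order within that reducer

numReducers = 42 # matches mapred.reduce.tasks in the run command above
print 'partition on %s, sort on %s' % (partitionKey, sortKey)
print 'goes to reducer %d of %d' % (hash(partitionKey) % numReducers, numReducers)

That is why every CA line, the mapping line and all the people, lands on one reducer in sorted order, which is exactly what the reducer's last-key logic depends on.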

Page 19: Hadoop Streaming Tutorial With Python

Some tips

• chmod a+x your .py files; they need to execute on the nodes, as they are LITERALLY a process that is run
• NEVER hold too much in memory; it is better to use the last-variable method than holding, say, a hashmap
• It is OK to have multiple jobs. DON'T put too much into each of them; it is better to make more than one pass over the data. Transform, then query and calculate. Creating data sets for your data lets others also interrogate the data
• To join smaller data sets, use -file and open the file in the script (see the sketch after this list)
• http://hadoop.apache.org/common/docs/r0.20.1/streaming.html
• For Ruby streaming, check out the podcast: http://allthingshadoop.com/2010/05/20/ruby-streaming-wukong-hadoop-flip-kromer-infochimps/
• Sample code for this talk: https://github.com/joestein/amaunet
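A minimal sketch of that -file tip (an illustration, not from the deck): if the job is launched with -file countries.dat, Hadoop ships the file and links it into each task's working directory, so a mapper can load it by name and do the join map-side, with no reduce-side join needed.

#!/usr/bin/env python

import sys

# countries.dat was shipped alongside the job with: -file countries.dat
# Hadoop places it in the task's working directory, so open it by name.
countries = {}
for line in open('countries.dat'):
    name, key = line.strip().split('|')
    countries[key] = name

# map customer records directly to full country names
for line in sys.stdin:
    splits = line.strip().split('|')
    if len(splits) == 3:
        personName, personType, country2digit = splits
        countryName = countries.get(country2digit, '%s - Unknown Country' % country2digit)
        print '%s\t%s' % (countryName, personType)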

Page 20: Hadoop Streaming Tutorial With Python

[email protected]/showcase

Medialets: The rich media ad platform for mobile.

We are hiring!

/*
Joe Stein, Chief Architect
http://www.medialets.com
Twitter: @allthingshadoop
*/