MapReduce with Hadoop and Ruby

Ohai Hadoop! Build your first MapReduce with Hadoop & Ruby

Upload: swanand-pagnis

Post on 03-Jul-2015

Category: Technology

DESCRIPTION

Write MapReduce jobs for Hadoop in Ruby

TRANSCRIPT

Page 1: MapReduce with Hadoop and Ruby

Ohai Hadoop!

Build your first MapReduce with Hadoop & Ruby

Page 2: MapReduce with Hadoop and Ruby

Who am I?

Tweet: @_swanand
GitHub: @swanandp
StackOverflow: @18678
Work: @Kaverisoft
Make: { DispatchTrack }
mailto:[email protected]

Ruby, Coffeescript, Java, Rails, Sinatra, Android, TextMate, Emacs, Minitest, MySQL, Cassandra, Hadoop, Mountain Lion, Curl, Zsh, GMail, Solarized, Oscar Wilde, Robert Jordan, Quentin Tarantino, Charlize Theron

Page 3: MapReduce with Hadoop and Ruby

Tell 'em what you're going to tell 'em

● MapReduce! Wait, what?
● Enter the Hadoop. *gong*
● Convention over Configuration? You wish.
● Instant Gratification. Now you're talkin'
● Further Reading. Go forth and read!

Page 4: MapReduce with Hadoop and Ruby

MapReduce! Wait, what?

● Map: Given a set of values (or key-values), output another set of values (or key-values)
● [K1, V1] -> map -> [K2, V2]
● Map each value into a new value
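A minimal Ruby sketch of the map step, assuming made-up sample data: each [K1, V1] pair (line number, word) becomes a new [K2, V2] pair (word, length). The variable names are illustrative only.

```ruby
# Input pairs [K1, V1]: line number and a word (sample data, assumed).
input = [[1, "ohai"], [2, "hadoop"], [3, "ruby"]]

# Map each pair to a new pair [K2, V2]: the word and its length.
# Every pair is handled on its own; nothing is carried between iterations.
mapped = input.map { |_line_no, word| [word, word.length] }

p mapped
```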

Page 5: MapReduce with Hadoop and Ruby

MapReduce! Wait, what?

● Reduce: Given a set of values for a key, come up with a summarized version
● K1[V1, V2 ... Vn] -> reduce -> K1[Y]
● Reduce given values into 1 value
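A minimal Ruby sketch of the reduce step, assuming made-up sample values for a single key. Summing is just one possible summary; any operation that folds the values into one would fit.

```ruby
# All values collected for one key (sample data, assumed).
key    = "e"
values = [3, 1, 4, 1, 5]

# Reduce the values for this key into a single summary value: their sum.
total = values.reduce(0) { |sum, value| sum + value }

p [key, total]
```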

Page 6: MapReduce with Hadoop and Ruby

MapReduce! Wait, what?

Page 7: MapReduce with Hadoop and Ruby

MapReduce! Um.. hmm..

Q: What is the single biggest takeaway from mapping?
A: The map operation is stateless, i.e. one iteration doesn't depend on the previous iteration.

Q: What is the single biggest takeaway from reducing?
A: Reduce represents an operation for a particular key.

Page 8: MapReduce with Hadoop and Ruby

Enter the Hadoop. *gong*

"The really interesting thing I want you to notice, here, is that as soon as you think of map and reduce as functions that everybody can use, and they use them, you only have to get one supergenius to write the hard code to run map and reduce on a global massively parallel array of computers, and all the old code that used to work fine when you just ran a loop still works only it's a zillion times faster which means it can be used to tackle huge problems in an instant." - Joel Spolsky

Page 9: MapReduce with Hadoop and Ruby

MapReduce! Oh, yeah!

1. Convert raw data into readable format
2. Iterate over data chunks, convert each chunk into meaningful key, value pairs
3. Do this for all your data using massive parallelization
4. Group all the keys and their respective values
5. Take values for a key and convert into desired meaningful format
6. Step 2 is called mapper
7. Step 5 is called reducer
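The steps above can be sketched in plain Ruby, without Hadoop. This is only an illustration under assumptions: the sample chunks and the word-count task are made up, and everything runs sequentially in one process rather than in parallel.

```ruby
# Step 1: readable data, already split into chunks (sample lines, assumed).
chunks = ["ohai hadoop", "ohai ruby", "hadoop and ruby"]

# Step 2 (mapper): turn one chunk into [key, value] pairs.
mapper = ->(chunk) { chunk.split.map { |word| [word, 1] } }

# Step 5 (reducer): turn the values for one key into a summary.
reducer = ->(key, values) { [key, values.sum] }

# Step 3: run the mapper over every chunk (sequentially here; a cluster
# would spread this work across machines).
pairs = chunks.flat_map { |chunk| mapper.call(chunk) }

# Step 4: group all values by their key.
grouped = pairs.group_by(&:first).transform_values { |kvs| kvs.map(&:last) }

# Step 5: reduce each key's values into one result.
result = grouped.map { |key, values| reducer.call(key, values) }.to_h

p result
```

Because the mapper looks at one chunk at a time and the reducer at one key at a time, either step could be handed to a different machine without changing the code inside the lambdas.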

Page 10: MapReduce with Hadoop and Ruby

Enter the Hadoop. *gong*

Same process has now become:
1. Put data into Hadoop
2. Define your mapper
3. Define your reducer
4. Run your jobs
5. Read processed data from Hadoop

Other advantages:
● Encapsulations over common problems like large files, process management, disk / node failure
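As a rough sketch of the five steps, the Ruby snippet below only builds and prints the `hadoop` commands one might run for a streaming job; it does not execute them. The jar location, directory names and script names are hypothetical placeholders, and the exact jar path depends on the installation.

```ruby
# Hypothetical location of the streaming jar; adjust for your installation.
STREAMING_JAR = "/path/to/hadoop-streaming.jar"

# Build the command lines for one streaming job (nothing is executed here).
def job_commands(local_input:, hdfs_input:, hdfs_output:, local_output:, mapper:, reducer:)
  [
    # Step 1: put data into Hadoop (HDFS).
    ["hadoop", "fs", "-put", local_input, hdfs_input],
    # Steps 2-4: run the job with the mapper and reducer scripts.
    ["hadoop", "jar", STREAMING_JAR,
     "-input", hdfs_input, "-output", hdfs_output,
     "-mapper", mapper, "-reducer", reducer,
     "-file", mapper, "-file", reducer],
    # Step 5: read processed data back from Hadoop.
    ["hadoop", "fs", "-get", hdfs_output, local_output]
  ]
end

commands = job_commands(
  local_input: "books", hdfs_input: "books",
  hdfs_output: "books-output", local_output: "results",
  mapper: "mapper.rb", reducer: "reducer.rb"
)

commands.each { |command| puts command.join(" ") }
```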

Page 11: MapReduce with Hadoop and Ruby

Convention over Configuration? You wish.

Term         Role              Configured in
Job          Top Level Descriptor
Task         job has_many tasks
NameNode     HDFS Boss         core-site.xml
DataNode     HDFS Slaves       slaves
JobTracker   MapReduce Boss    mapred-site.xml
TaskTracker  MapReduce Slave   mapred-site.xml
Client       User's window into Hadoop, through the command hadoop

Page 12: MapReduce with Hadoop and Ruby

Convention over Configuration? You wish.

● Configuration in XML & Shell scripts. Yuck!
● Respite:
  ○ Option for specifying a configuration directory
  ○ Shell script configuration is mostly ENV variables
● Which means:
  ○ Configuration can be written in YML or JSON or Ruby and exported in XML
  ○ ENV variables can be set using rake, thor or just plain Ruby
● Caveats:
  ○ No standard wrapper to do this (Go write one!)
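One way the "written in Ruby, exported in XML" idea could look is sketched below. The helper names `xml_escape` and `to_hadoop_xml` are made up for this example, the host and port are sample values, and `fs.default.name` is the property name used by Hadoop releases of that era (later releases renamed it).

```ruby
# Escape the characters that are special in XML text.
def xml_escape(text)
  text.to_s.gsub("&", "&amp;").gsub("<", "&lt;").gsub(">", "&gt;")
end

# Render a flat Ruby hash as a Hadoop-style <configuration> document.
def to_hadoop_xml(settings)
  properties = settings.map do |name, value|
    "  <property>\n" \
    "    <name>#{xml_escape(name)}</name>\n" \
    "    <value>#{xml_escape(value)}</value>\n" \
    "  </property>\n"
  end
  "<?xml version=\"1.0\"?>\n<configuration>\n#{properties.join}</configuration>\n"
end

# Sample settings for core-site.xml; the host and port are assumptions.
core_site = { "fs.default.name" => "hdfs://localhost:9000" }

puts to_hadoop_xml(core_site)
```

The same hash could just as well be loaded from a YML or JSON file before being rendered, which keeps the hand-edited configuration out of XML.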

Page 13: MapReduce with Hadoop and Ruby

Convention over Configuration? You wish.

● Default mappers and reducers are defined in Java

● Other languages supported using Streaming API

● Streaming API reads input from STDIN and writes output to STDOUT, and uses executable binaries for processing

● Caveats
  ○ No dependency management, we are on our own
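On the reducer side, a streaming script receives plain text lines on STDIN, by default one "key<TAB>value" pair per line and sorted by key, so the script has to spot where one key ends and the next begins. The sketch below shows one way to do that; `each_key_group` is a hypothetical helper name, and a StringIO stands in for STDIN so the example runs without Hadoop.

```ruby
require "stringio"

# Read "key<TAB>value" lines that arrive sorted by key and yield each key
# once, together with all of its values (as strings).
def each_key_group(input)
  return enum_for(:each_key_group, input) unless block_given?

  input.each_line
       .map { |line| line.chomp.split("\t", 2) }
       .chunk_while { |(key_a, _), (key_b, _)| key_a == key_b }
       .each { |group| yield group.first.first, group.map(&:last) }
end

# Stand-in for STDIN; a real streaming script would pass $stdin instead.
sorted_input = StringIO.new("a\t1\na\t1\nb\t1\n")

each_key_group(sorted_input) do |key, values|
  puts "#{key}\t#{values.map(&:to_i).sum}"
end
```

For brevity this sketch reads all lines into memory before grouping; a script meant for large inputs would process the stream line by line.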

Page 14: MapReduce with Hadoop and Ruby

Instant Gratification. Now you're talkin'

GOAL:
1. Take a couple of books in txt format
2. Find out the total usage of each character in the English alphabet.
3. Establish that e is the most used.
4. Why this example?
   a. Perfect use case for MapReduce.
   b. Algorithm is simple.
   c. Results are simple to analyze.
   d. Txt formatted books are easily available in Project Gutenberg.
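A possible mapper and reducer for this goal are sketched below. This is not the code from the talk: the method names are made up, and a short sample sentence stands in for the books so the whole thing can be tried locally. In real streaming scripts the two methods would live in separate files and be called with `$stdin` and `$stdout`.

```ruby
require "stringio"

# Mapper: emit "letter<TAB>1" for every English letter in the input.
def map_letters(input, output)
  input.each_line do |line|
    line.downcase.scan(/[a-z]/) { |letter| output.puts "#{letter}\t1" }
  end
end

# Reducer: sum the counts per letter and emit "letter<TAB>total".
def reduce_letters(input, output)
  totals = Hash.new(0)
  input.each_line do |line|
    letter, count = line.chomp.split("\t")
    totals[letter] += count.to_i
  end
  totals.sort.each { |letter, total| output.puts "#{letter}\t#{total}" }
end

# Local dry run on a sample sentence (not a real book).
text = StringIO.new("Ohai Hadoop! Meet the reducer.\n")
mapped = StringIO.new
map_letters(text, mapped)

# Hadoop sorts mapper output by key before the reduce phase; mimic that here.
shuffled = StringIO.new(mapped.string.lines.sort.join)
reduced = StringIO.new
reduce_letters(shuffled, reduced)

puts reduced.string
```

Sorting the reducer output by its second column then shows which letter is used most.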

Page 15: MapReduce with Hadoop and Ruby

Further Reading and Watching

● Official Documentation
● Wiki: http://wiki.apache.org/hadoop/
● Hadoop examples that ship with Hadoop
● http://www.bigfastblog.com/map-reduce-with-ruby-using-hadoop
● http://www.youtube.com/watch?v=d2xeNpfzsYI

Page 16: MapReduce with Hadoop and Ruby

Questions?

Page 17: MapReduce with Hadoop and Ruby

Thank you!