Why Hadoop MapReduce Needs Scala: An Introduction to Scoobi and Scalding
TRANSCRIPT
Page 1
@agemooij
Why Hadoop MapReduce Needs Scala
A Look at Scoobi and Scalding, Scala DSLs for Hadoop
Page 2
Obligatory “About Me” Slide
Page 3
Rocks!
Page 4
But programming it kinda Sucks!
Page 5
Hello World: Word Count using Hadoop MapReduce
Page 6
• Split lines into words
• Turn each word into a Pair(word, 1)
• Group by word
• For each word, sum the 1s to get the total
Page 7
Low-level glue code
Lots of small, unintuitive Mapper and Reducer classes
Lots of Hadoop intrusiveness (Context, Writables, Exceptions, etc.)
Actually runs the code on the cluster
Page 8
This does not make me a happy Hadoop developer!
Especially for things that are a little bit more complicated than counting words:
• Unintuitive, invasive programming model
• Hard to compose/chain jobs into real, more complicated programs
• Lots of low-level boilerplate code
• Branching, Joins, CoGroups, etc. hard to implement
Page 9
What Are the Alternatives?
Page 10
Counting Words using Apache Pig
Already a lot better, but anything more complex gets hard pretty fast.
Handy for quick exploration of data!
Pig is hard to customize/extend
Nice!
And the same goes for Hive
Page 11
```java
package cascadingtutorial.wordcount;

import java.util.Properties;

import cascading.flow.Flow;
import cascading.flow.FlowConnector;
import cascading.operation.aggregator.Count;
import cascading.operation.regex.RegexSplitGenerator;
import cascading.pipe.Each;
import cascading.pipe.Every;
import cascading.pipe.GroupBy;
import cascading.pipe.Pipe;
import cascading.scheme.Scheme;
import cascading.scheme.TextLine;
import cascading.tap.Hfs;
import cascading.tap.Lfs;
import cascading.tap.Tap;
import cascading.tuple.Fields;

/**
 * Wordcount example in Cascading
 */
public class Main
{
  public static void main(String[] args)
  {
    String inputPath = args[0];
    String outputPath = args[1];

    Scheme inputScheme = new TextLine(new Fields("offset", "line"));
    Scheme outputScheme = new TextLine();

    // Use HDFS for URI-style paths, the local filesystem otherwise
    Tap sourceTap = inputPath.matches("^[^:]+://.*") ?
        new Hfs(inputScheme, inputPath) :
        new Lfs(inputScheme, inputPath);
    Tap sinkTap = outputPath.matches("^[^:]+://.*") ?
        new Hfs(outputScheme, outputPath) :
        new Lfs(outputScheme, outputPath);

    // Split each line into words, group by word, count each group
    Pipe wcPipe = new Each("wordcount",
                           new Fields("line"),
                           new RegexSplitGenerator(new Fields("word"), "\\s+"),
                           new Fields("word"));
    wcPipe = new GroupBy(wcPipe, new Fields("word"));
    wcPipe = new Every(wcPipe, new Count(), new Fields("count", "word"));

    Properties properties = new Properties();
    FlowConnector.setApplicationJarClass(properties, Main.class);

    Flow parsedLogFlow = new FlowConnector(properties)
        .connect(sourceTap, sinkTap, wcPipe);
    parsedLogFlow.start();
    parsedLogFlow.complete();
  }
}
```
Pipes & Filters
Not very intuitive
Lots of boilerplate code
Very powerful!
Record Model
Strange new abstraction
Joins & CoGroups
Page 12
Meh... I’m lazy
I want more power with less work!
Page 13
How would we count words in plain Scala?
(My current language of choice)
Page 14
Nice! Familiar, intuitive. What if...?
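The code on this slide isn't captured in the transcript; a plain Scala collections version of word count might look like this (a minimal sketch, with a hypothetical `WordCount` object name):

```scala
// Word count over plain, in-memory Scala collections:
// split lines into words, group by word, count each group.
object WordCount {
  def count(lines: Seq[String]): Map[String, Int] =
    lines
      .flatMap(_.split("""\s+""").filter(_.nonEmpty)) // split lines into words
      .groupBy(identity)                              // group by word
      .map { case (word, ws) => word -> ws.size }     // sum the occurrences

  def main(args: Array[String]): Unit =
    println(count(Seq("hello world", "hello scala")))
}
```

Familiar higher-order functions, no Mappers, Reducers, or Writables in sight.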
Page 15
But that code doesn’t scale to my cluster!
Or does it?
Meanwhile at Google...
Page 16
Introducing Scoobi & Scalding: Scala DSLs for Hadoop MapReduce
NOTE: My relative familiarity with either platform: Scoobi 95%, Scalding 5%
Page 17
http://github.com/nicta/scoobi
A Scala library that implements a higher-level programming model for Hadoop MapReduce
Page 18
Counting Words using Scoobi
• Split lines into words
• Turn each word into a Pair(word, 1)
• Group by word
• For each word, sum the 1s to get the total
Actually runs the code on the cluster
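The word-count code on this slide isn't in the transcript; the canonical Scoobi example (per the Scoobi README, against the 0.4-era API, so treat details as illustrative) looks roughly like:

```scala
import com.nicta.scoobi.Scoobi._

// Runs as a Hadoop job: args(0) = input path, args(1) = output path.
object WordCount extends ScoobiApp {
  def run() {
    val lines: DList[String] = fromTextFile(args(0))
    val counts: DList[(String, Int)] =
      lines.flatMap(_.split("\\s+"))            // split lines into words
           .map(word => (word, 1))              // turn each word into a Pair(word, 1)
           .groupByKey                          // group by word
           .combine((a: Int, b: Int) => a + b)  // sum the 1s to get the total
    persist(toTextFile(counts, args(1)))        // triggers the MapReduce job(s)
  }
}
```

Note how closely this mirrors the in-memory collections version; only `persist` betrays that a cluster is involved. (Running it requires a Scoobi/Hadoop setup, so this is a sketch, not a tested program.)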
Page 19
Scoobi is...• A distributed collections abstraction:
• Distributed collection objects abstract data in HDFS
• Methods on these objects abstract map/reduce operations
• Programs manipulate distributed collection objects
• Scoobi turns these manipulations into MapReduce jobs
• Based on Google’s FlumeJava / Cascades
• A source code generator (it generates Java code!)
• A job plan optimizer
• Open sourced by NICTA
• Written in Scala (W00t!)
Page 20
DList[T]• Abstracts storage of data and files on HDFS
• Calling methods on DList objects to transform and manipulate them abstracts the mapper, combiner, sort-and-shuffle, and reducer phases of MapReduce
• Persisting a DList triggers compilation of the graph into one or more MR jobs and their execution
• Very familiar: like standard Scala Lists
• Strongly typed
• Parameterized with rich types and Tuples
• Easy list manipulation using typical higher order functions like map, flatMap, filter, etc.
Page 21
DList[T]
Page 22
IO
• Can read/write text files, Sequence files and Avro files
• Can influence sorting (raw, secondary)

Serialization
• Serialization of custom types through Scala type classes and WireFormat[T]
• Scoobi implements WireFormat[T] for primitive types, strings, tuples, Option[T], Either[T], Iterable[T], etc.
• Out-of-the-box support for serialization of Scala case classes
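The serialization code from these slides isn't in the transcript; a minimal sketch of what case-class serialization looks like in Scoobi (helper names per the Scoobi 0.4-era docs, so treat details as illustrative):

```scala
import com.nicta.scoobi.Scoobi._

// A custom type we want to put inside a DList
case class Person(name: String, age: Int)

object PersonJob extends ScoobiApp {
  // A WireFormat for a case class can be derived from its companion's
  // apply/unapply, so Hadoop never sees hand-written Writables
  implicit val personFormat: WireFormat[Person] =
    mkCaseWireFormat(Person, Person.unapply _)

  def run() {
    val people: DList[Person] =
      fromTextFile(args(0)) map { line =>
        val Array(name, age) = line.split(",")
        Person(name, age.trim.toInt)
      }
    persist(toTextFile(people.map(_.toString), args(1)))
  }
}
```

(As above, running this requires a Scoobi/Hadoop setup; the point is that the implicit `WireFormat[Person]` is all the serialization plumbing needed.)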
Page 23
IO/Serialization I
Page 24
IO/Serialization II
For normal (i.e. non-case) classes
Page 25
Further Info
http://nicta.github.com/scoobi/
Mailing lists on googlegroups.com
Version 0.4 released today (!)
• Avro, Sequence Files
• Materialized DObjects
• DList reduction methods (product, min, etc.)
• Vastly improved testing support
• Less overhead
• Much more
Page 26
Scalding!
http://github.com/twitter/scalding
A Scala library that implements a higher-level programming model for ~~Hadoop MapReduce~~ Cascading
Page 27
Counting Words using Scalding
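The code from this slide isn't in the transcript; the canonical fields-based Scalding word count (per the Scalding wiki, 0.5-era API, so treat details as illustrative) looks roughly like:

```scala
import com.twitter.scalding._

// Run with: --input <path> --output <path>
class WordCountJob(args: Args) extends Job(args) {
  TextLine(args("input"))                          // read lines into a 'line field
    .flatMap('line -> 'word) { line: String =>
      line.split("""\s+""").filter(_.nonEmpty)     // split each line into words
    }
    .groupBy('word) { _.size }                     // count occurrences per word
    .write(Tsv(args("output")))                    // write (word, count) pairs
}
```

Note the Cascading heritage: operations are expressed over named fields (`'line`, `'word`) rather than over a typed collection. (Running it requires a Scalding/Hadoop setup, so this is a sketch, not a tested program.)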
Page 28
Scalding is...• A distributed collections abstraction
• A wrapper around Cascading (i.e. no source code generation)
• Based on the same record model (i.e. named fields)
• Less strongly typed
• Uses Kryo Serialization
• Used by Twitter in production
• Written in Scala (W00t!)
Page 29
Further Info
@scalding
http://blog.echen.me/2012/02/09/movie-recommendations-and-more-via-mapreduce-and-scalding/
https://github.com/twitter/scalding/wiki
Current version: 0.5.4
http://github.com/twitter/scalding
Page 30
How do they compare?
• Different approaches, similar power
• Small feature differences, which will even out over time
• Scoobi gets a little closer to idiomatic Scala
• Twitter is definitely a bigger fish than NICTA, so Scalding gets all the attention
• Both open sourced (last year)
• Scoobi has better docs!
Page 31
Which one should I use?
Ehm... I’m extremely prejudiced!
Page 32
Questions?