Scalding presentation
Description: Scalding, Scala, MapReduce. 24th Hadoop London Meetup.

MapReduce with Scalding
Antonios Chalkiopoulos
24th Big Data London Meetup
Scalding.io
$ whoami
http://scalding.io
http://github.com/scalding-io
@chalkiopoulos
My recent achievement..
What are we gonna talk about..?
A Scala API on top of Cascading
But what is Cascading?
A few years ago I started on a fresh Big Data team…
Story!!
How do we efficiently develop MapReduce jobs for our new Hadoop cluster?
??
MapReduce Techs
[Diagram: abstraction layers, with Java MapReduce on top of Hadoop]
Java MapReduce Word count example
MapReduce Techs
[Diagram: abstraction layers, with Pig, Hive, Cascading and others above Java MapReduce, which runs on Hadoop]
The promise of Cascading
[1] A simple, high-level Java API for MapReduce that is easy to understand and work with.
[2] Extensions to MANY platforms
Cascading connects to:
NoSQL databases: MongoDB, Cassandra, HBase, Accumulo, …
SQL databases
Hadoop filesystem
Local filesystem
In-memory systems: Redis, Memcached, …
Search platforms: ElasticSearch, Solr, …
How does it work?
A pipeline architecture
where tuples flow through pipes
[Diagram: data enters through a source tap, tuples (Tuple1, Tuple2) flow through a pipe, and data leaves through a sink tap]
[Diagram: example flow in which log files and customer data are combined (Log & Customer) into results and final results]
Cascading Example
Word count in Cascading
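// Imports assume the Cascading 1.x package layout, which matches the FlowConnector API used here.
import java.util.Properties;
import cascading.flow.Flow;
import cascading.flow.FlowConnector;
import cascading.operation.Aggregator;
import cascading.operation.Function;
import cascading.operation.aggregator.Count;
import cascading.operation.regex.RegexGenerator;
import cascading.pipe.Each;
import cascading.pipe.Every;
import cascading.pipe.GroupBy;
import cascading.pipe.Pipe;
import cascading.scheme.Scheme;
import cascading.scheme.TextLine;
import cascading.tap.Hfs;
import cascading.tap.SinkMode;
import cascading.tap.Tap;
import cascading.tuple.Fields;

public class WordCount {
  public static void main(String[] args) {
    Properties properties = new Properties();
    FlowConnector.setApplicationJarClass(properties, WordCount.class);
    Scheme sourceScheme = new TextLine(new Fields("line"));
    Scheme sinkScheme = new TextLine(new Fields("word", "count"));
    Tap source = new Hfs(sourceScheme, args[0]);
    Tap sink = new Hfs(sinkScheme, args[1], SinkMode.REPLACE);
    Pipe assembly = new Pipe("Word Count");
    // Matches runs of non-space characters that start and end with a letter.
    String regex = "(?<!\\pL)(?=\\pL)[^ ]*(?<=\\pL)(?!\\pL)";
    Function function = new RegexGenerator(new Fields("word"), regex);
    assembly = new Each(assembly, new Fields("line"), function);
    assembly = new GroupBy(assembly, new Fields("word"));
    Aggregator count = new Count(new Fields("count"));
    assembly = new Every(assembly, count);
    FlowConnector flowConnector = new FlowConnector(properties);
    Flow flow = flowConnector.connect("word-count", source, sink, assembly);
    flow.complete();
  }
}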
70% less boilerplate code
But still some infrastructure code
No boilerplate code at all
Functional
Robust & Scalable
Runs on the JVM
Here it comes
[Diagram: abstraction layers, with Scalding above Cascading, alongside Pig, Hive and others, all above Java MapReduce on Hadoop]
The power of Scala on top of Cascading
Scala fits naturally with data
Word count in Scalding
import com.twitter.scalding._

class WordCountJob(args: Args) extends Job(args) {
  TextLine("input.txt").read
    .flatMap('line -> 'word) { line: String => line.split("\\s+") }
    .groupBy('word) { _.size }
    .write(Tsv("results.tsv"))
}
Map phase: flatMap
Reduce phase: groupBy
4 lines of code!
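To make the map phase and reduce phase split concrete, here is a minimal sketch of the same word count over an in-memory collection, using only the Scala standard library. It is not Scalding code: `WordCountSketch`, `mapPhase` and `reducePhase` are hypothetical names chosen for illustration, and the extra filter for empty strings is an assumption that the slide's job does not make.

```scala
object WordCountSketch {
  // "Map phase": split every line into words, mirroring the flatMap step.
  // Empty strings (from leading whitespace or blank lines) are dropped.
  def mapPhase(lines: Seq[String]): Seq[String] =
    lines.flatMap(_.split("\\s+")).filter(_.nonEmpty)

  // "Reduce phase": group identical words and count each group,
  // mirroring groupBy('word) { _.size }.
  def reducePhase(words: Seq[String]): Map[String, Long] =
    words.groupBy(identity).map { case (word, ws) => word -> ws.size.toLong }

  def wordCount(lines: Seq[String]): Map[String, Long] =
    reducePhase(mapPhase(lines))
}
```

On a cluster the grouping step is what forces a shuffle between mappers and reducers; in this sketch it is just an in-memory `groupBy`.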
Code that developers enjoy writing
Who is using it?
Many many others…
Scalding…
…was open sourced by Twitter in 2011
…has more than 100 open source contributors
…exposes the right abstractions
…maximizes expressiveness
…promotes extensibility
…adds new capabilities to Cascading
Core Concepts
Sources & Sinks
Tsv("data.tsv", ('productID, 'price, 'quantity))
  .read
  .write(UnpackedAvroSource("data.avro"))
Tsv, Csv, Osv, Avro, Parquet, …
Map Operations
pipe1.filter('age) { age: Int => age > 18 }
pipe1.map('price -> 'withVAT) { price: Double => price * 1.2 }
pipe1.project('name, 'surname)
15 map operations translated into map phases
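As a rough illustration of what the three operations above do to each record, the sketch below applies the same ideas to plain Scala collections. The `Person` case class and the `MapOpsSketch` object are hypothetical and exist only for this example. Returning the price together with the VAT-inclusive price mirrors the assumption that `map` in the fields API keeps the input field and appends the new one.

```scala
object MapOpsSketch {
  final case class Person(name: String, surname: String, age: Int)

  // filter: keep only the records that satisfy the predicate.
  def adults(people: Seq[Person]): Seq[Person] =
    people.filter(_.age > 18)

  // map: derive a new value per record; the original price is kept alongside it.
  def withVat(prices: Seq[Double]): Seq[(Double, Double)] =
    prices.map(price => (price, price * 1.2))

  // project: keep only the selected fields of every record.
  def project(people: Seq[Person]): Seq[(String, String)] =
    people.map(p => (p.name, p.surname))
}
```

None of these steps needs to see more than one record at a time, which is why they can all run inside a map phase.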
Join operations
pipe1.joinWithSmaller('productId -> 'productId, pipe2)
pipe1.joinWithLarger('productId -> 'productId, pipe2)
pipe1.joinWithTiny('productId -> 'productId, pipe2)
Optimize by hinting the relative sizes
Supports Left, Right, Inner, Outer Joins
pipe1
  .joinWithSmaller('productId -> 'productId, pipe2, joiner = new LeftJoin)
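Why the size hint matters can be sketched with plain collections: when one side is small enough to hold in memory, the join can be done by looking each record up in a hash map instead of shuffling both sides. The sketch below shows an inner and a left join written that way. `Sale`, `Product` and `JoinSketch` are hypothetical names, and the code assumes `productId` is unique on the small side.

```scala
object JoinSketch {
  final case class Sale(productId: Int, quantity: Int)
  final case class Product(productId: Int, name: String)

  // Inner join: keep only sales whose productId exists on the small side.
  def innerJoin(sales: Seq[Sale], products: Seq[Product]): Seq[(Sale, Product)] = {
    val byId = products.map(p => p.productId -> p).toMap // small side held in memory
    sales.flatMap(s => byId.get(s.productId).map(p => (s, p)))
  }

  // Left join: keep every sale; the product is None when nothing matches.
  def leftJoin(sales: Seq[Sale], products: Seq[Product]): Seq[(Sale, Option[Product])] = {
    val byId = products.map(p => p.productId -> p).toMap
    sales.map(s => (s, byId.get(s.productId)))
  }
}
```

A right join is the same idea with the sides swapped, and an outer join keeps unmatched records from both sides.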
Group operations
val pipe = Tsv("input", ('shopId, 'itemId, 'quantity))
  .groupBy('shopId) {
    _.sum[Long]('quantity -> 'totalSoldItems)
  }
  .write(Tsv("results.tsv"))
.groupBy: group by particular fields
.groupAll: group all data
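The difference between grouping by a field and grouping everything can be shown on an in-memory list of rows. This is a sketch with plain Scala collections, not Scalding code. The `Row` alias follows the (shopId, itemId, quantity) schema from the slide, while `GroupSketch` and its method names are made up for the example.

```scala
object GroupSketch {
  // (shopId, itemId, quantity), mirroring the Tsv schema on the slide.
  type Row = (String, String, Long)

  // groupBy analogue: one total per shopId.
  def totalSoldItems(rows: Seq[Row]): Map[String, Long] =
    rows.groupBy(_._1).map { case (shopId, rs) => shopId -> rs.map(_._3).sum }

  // groupAll analogue: a single total over all rows.
  def grandTotal(rows: Seq[Row]): Long =
    rows.map(_._3).sum
}
```

Because a group-all sends every record to a single group, it is best kept for data that has already been reduced to a small size.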
Pipe operations
val p = (pipe1 ++ pipe2)        // concatenate two pipes
  .debug                        // print sample data to the screen
  .addTrap(Tsv("bogus_lines"))  // dirty data is recorded
Simple pipe operations
Connect with external systems
Scalding + Hive
class HiveExample(args: Args) extends Job(args) {
  val USER_SCHEMA = List('userId, 'username, 'photo)

  HiveSource("myHiveTable", SinkMode.KEEP)
    .withHCatScheme(osvInputScheme(fields = USER_SCHEMA))
    .write(Tsv("outputFromHive"))
}
Define the schema
Query HCatalog
Read directly from HDFS
Scalding + ElasticSearch
val schema = List('number, 'product, 'description)

val readES = ElasticSearchTap("localhost", 9200, "index/firstType", "", schema)
  .read
  .write(Tsv("data/es-out.tsv"))

val writeES = Tsv("data.tsv")
  .read
  .write(ElasticSearchTap("localhost", 9200, "index/secondType", "", schema))
Read from ElasticSearch in one line!
Also index new data in ES
Design patterns
Dependency Injection
Late bound
External Operations
How about defining external operations?
val pipe1 = Tsv("omniture.tsv", OMNITURE_SCHEMA)
  .read
  .ETLOmnitureData
  .calculateOmnitureUserStats
  .joinWithCustomerDB('userId -> 'userId, customerPipe)
  .write(Tsv("omniture-results.tsv"))
Custom operations:
Re-usable modular code
Single responsibility
Testability
Full code: http://bit.ly/1pNSUKf
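One way to get chainable custom operations like the ones above is Scala's extension-method pattern, where an implicit class adds domain-specific methods to an existing type. The sketch below shows the pattern on a plain `Seq` rather than on a Scalding pipe, so it is an illustration of the idea and not the code behind the link. `Hit`, `HitOps`, `dropInvalid` and `visitsPerUser` are all hypothetical names.

```scala
object ExternalOpsSketch {
  final case class Hit(userId: String, url: String)

  // Extension methods: each operation has one responsibility and can be tested alone.
  implicit class HitOps(val hits: Seq[Hit]) {
    // ETL step: drop records without a user id.
    def dropInvalid: Seq[Hit] = hits.filter(_.userId.nonEmpty)

    // Stats step: number of hits per user.
    def visitsPerUser: Map[String, Int] =
      hits.groupBy(_.userId).map { case (user, hs) => user -> hs.size }
  }

  // The operations chain just like the built-in ones.
  def pipeline(hits: Seq[Hit]): Map[String, Int] =
    hits.dropInvalid.visitsPerUser
}
```

Keeping the operations in their own object means a job only has to import them, and each step can be unit tested without running the whole pipeline.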
Scalding Testing
Testing challenges in the context of MR
Acceptance Tests
Unit – Component Tests
System Tests
Integration Tests
Scalding enables testing in every layer & TDD
Example
import com.twitter.scalding._
import org.scalatest.FlatSpec
import org.scalatest.matchers.ShouldMatchers

class TsvWordCountJobTest extends FlatSpec
  with ShouldMatchers with TupleConversions {

  "WordCountJob" should "count words" in {
    JobTest(new WordCountJob(_))
      .arg("input", "inFile")
      .arg("output", "outFile")
      // TextLine sources are fed as (offset, line) pairs
      .source(TextLine("inFile"), List("0" -> "cool Scala cool"))
      .sink[(String, Int)](Tsv("outFile")) { out =>
        out.toList should contain ("cool" -> 2)
      }
      .run
      .finish
  }
}
Replaces taps with in-memory collections and asserts the expected output
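The idea of swapping taps for in-memory collections can be sketched without any framework: if a job only talks to a source and a sink through small interfaces, a test can pass in-memory implementations and inspect what was written. The code below is a simplified model of that idea and not how Scalding's JobTest is implemented. `Source`, `Sink`, `InMemorySource`, `InMemorySink` and `wordCountJob` are hypothetical names.

```scala
import scala.collection.mutable.ListBuffer

object InMemoryTestSketch {
  trait Source[A] { def read(): Iterator[A] }
  trait Sink[A] { def write(record: A): Unit }

  // Test doubles: a source backed by a Seq and a sink that buffers its records.
  final class InMemorySource[A](data: Seq[A]) extends Source[A] {
    def read(): Iterator[A] = data.iterator
  }
  final class InMemorySink[A] extends Sink[A] {
    private val buffer = ListBuffer.empty[A]
    def write(record: A): Unit = buffer += record
    def results: List[A] = buffer.toList
  }

  // The job depends only on the interfaces, so it runs unchanged in a test.
  def wordCountJob(in: Source[String], out: Sink[(String, Int)]): Unit =
    in.read().toList
      .flatMap(_.split("\\s+"))
      .filter(_.nonEmpty)
      .groupBy(identity)
      .foreach { case (word, ws) => out.write(word -> ws.size) }
}
```

A test then builds an `InMemorySource` with a few lines, runs the job, and asserts on `results`, with no cluster or filesystem involved.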
Monitoring
“Driven takes Cascading application development to the next level with management and monitoring capabilities for your apps”
http://driven.cascading.io
Collects telemetry data and exposes it through a web UI
Advanced Concepts
Scalding adds:
Typed API
Matrix API
Graphs
Machine Learning Algorithms
What does the future look like?
So far…
[Diagram: abstraction layers across Real Time, Batch and Hybrid processing]
Summingbird
A unified API for everything
[Diagram: Summingbird as an abstraction layer above Storm, Tez and Spark]
Enables the Lambda architecture
Questions?