MapReduce with Scalding
Antonios Chalkiopoulos, 24th Big Data London Meetup
Slides: files.meetup.com/1789394/BDL24-2-Scalding_24thMeetup.pdf (55 pages)

Post on 06-Oct-2020

TRANSCRIPT

Page 1:

MapReduce with Scalding Antonios Chalkiopoulos 24th Big Data London Meetup

Scalding.io

Page 2:

$ whoami

Scalding.io

http://scalding.io

http://github.com/scalding-io

@chalkiopoulos

Page 3:

My recent achievement..

Page 4:

Page 5:

What are we gonna talk about..?

Page 6:

Page 7:

A Scala API on top of Cascading

Page 8:

But what is Cascading?

Page 9:

A few years ago I started on a fresh Big Data team…

Story!!

Page 10:

How do we efficiently develop MapReduce jobs for our new Hadoop cluster?

Page 11:

MapReduce Techs

[Diagram, labeled "abstraction": Java MapReduce, Hadoop]

Page 12:

Java MapReduce word count example

Page 13:

MapReduce Techs

[Diagram, labeled "abstraction": Java MapReduce, Pig, Hive, Cascading, Others, Hadoop]

Page 14:

The promise of Cascading

Page 15:

[1] A simple, high-level Java API for MapReduce that is easy to understand and work with.

Page 16:

[2] Extensions to MANY platforms

Page 17:

Cascading

NoSQL Databases: ✓ MongoDB ✓ Cassandra ✓ HBase ✓ Accumulo …

SQL Databases

Hadoop Filesystem

Local Filesystem

In-memory systems: ✓ Redis ✓ Memcached …

Search Platforms: ✓ ElasticSearch ✓ Solr …

Page 18:

How does it work?

Page 19:

A pipeline architecture

Page 20:

where tuples flow through pipes

[Diagram: data -> Source tap -> pipes -> Sink tap -> data]
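The tap-and-pipe idea can be sketched with plain Scala collections. This is only an illustration of the flow described on the slide, not the Cascading or Scalding API: the names `PipelineSketch`, `sourceTap`, `sinkTap`, `addLength` and `dropEmpty` are hypothetical, and a tuple is assumed to be a simple map from field name to string value.

```scala
object PipelineSketch {
  // Assumption: a tuple is a map from field name to a string value.
  type Tuple = Map[String, String]
  // A pipe is any function that transforms a sequence of tuples.
  type Pipe = Seq[Tuple] => Seq[Tuple]

  // Hypothetical source tap: wraps each raw line in a tuple with a "line" field.
  def sourceTap(lines: Seq[String]): Seq[Tuple] =
    lines.map(l => Map("line" -> l))

  // Hypothetical sink tap: renders the chosen fields as one tab-separated line.
  def sinkTap(tuples: Seq[Tuple], fields: Seq[String]): Seq[String] =
    tuples.map(t => fields.map(f => t(f)).mkString("\t"))

  // Two small pipes: one adds a field, one filters tuples.
  val addLength: Pipe = _.map(t => t.updated("length", t("line").length.toString))
  val dropEmpty: Pipe = _.filter(t => t("line").nonEmpty)

  // Pipes compose into an assembly, much like chained operations on a pipe.
  val assembly: Pipe = addLength andThen dropEmpty

  def run(lines: Seq[String]): Seq[String] =
    sinkTap(assembly(sourceTap(lines)), Seq("line", "length"))
}
```

Calling `PipelineSketch.run(Seq("a b", "", "c"))` pushes three lines through the source, drops the empty one, and renders the remaining tuples at the sink.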

Page 21:

[Diagram: "Log files" and "Customer Data" sources, a combined "Log & Customer" stage, and "Results" / "Final Results" outputs]

Page 22:

Cascading Example

Page 23:

Word count in Cascading

    // Imports assume the Cascading 1.x package layout
    import java.util.Properties;
    import cascading.flow.*;
    import cascading.operation.*;
    import cascading.operation.aggregator.Count;
    import cascading.operation.regex.RegexGenerator;
    import cascading.pipe.*;
    import cascading.scheme.*;
    import cascading.tap.*;
    import cascading.tuple.Fields;

    public class WordCount {
      public static void main(String[] args) {
        Properties properties = new Properties();
        FlowConnector.setApplicationJarClass(properties, WordCount.class);

        // Source and sink taps: read lines from args[0], write (word, count) to args[1]
        Scheme sourceScheme = new TextLine(new Fields("line"));
        Scheme sinkScheme = new TextLine(new Fields("word", "count"));
        Tap source = new Hfs(sourceScheme, args[0]);
        Tap sink = new Hfs(sinkScheme, args[1], SinkMode.REPLACE);

        // Pipe assembly: split each line into words, group by word, count each group
        Pipe assembly = new Pipe("Word Count");
        String regex = "(?<!\\pL)(?=\\pL)[^ ]*(?<=\\pL)(?!\\pL)";
        Function function = new RegexGenerator(new Fields("word"), regex);
        assembly = new Each(assembly, new Fields("line"), function);
        assembly = new GroupBy(assembly, new Fields("word"));
        Aggregator count = new Count(new Fields("count"));
        assembly = new Every(assembly, count);

        // Plan the flow and run it
        FlowConnector flowConnector = new FlowConnector(properties);
        Flow flow = flowConnector.connect("word-count", source, sink, assembly);
        flow.complete();
      }
    }

70% less boilerplate code

But still some infrastructure code

Page 24:

Page 25:

✓ No boilerplate code at all

✓ Functional

✓ Robust & scalable

✓ Runs on the JVM

Page 26:

Here it comes :)

[Diagram, labeled "abstraction": Java MapReduce, Pig, Hive, Cascading, Others, Hadoop, now with Scalding added]

Page 27:

The power of Scala on top of Cascading

Page 28:

Scala fits naturally with data

Page 29:

Word count in Scalding

    import com.twitter.scalding._

    class WordCountJob(args: Args) extends Job(args) {
      TextLine("input.txt").read
        // Map phase: split every line into words
        .flatMap('line -> 'word) { line: String => line.split("\\s+") }
        // Reduce phase: count the occurrences of each word
        .groupBy('word) { _.size }
        .write(Tsv("results.tsv"))
    }

4

Code that developers enjoy writing :)
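To see what the map and reduce phases compute without a cluster, the same logic can be mimicked on an in-memory list. This is a rough sketch using only standard Scala collections, not the Scalding API; `WordCountSketch` and `wordCount` are hypothetical names, and dropping empty tokens is an added assumption.

```scala
object WordCountSketch {
  def wordCount(lines: Seq[String]): Map[String, Int] =
    lines
      // "Map phase": every line emits zero or more words
      .flatMap(line => line.split("\\s+").toSeq)
      // Assumption: ignore empty tokens produced by leading whitespace
      .filter(_.nonEmpty)
      // "Reduce phase": group identical words and count each group
      .groupBy(identity)
      .map { case (word, occurrences) => word -> occurrences.size }
}
```

For the input `Seq("cool Scala cool")` the result maps "cool" to 2 and "Scala" to 1.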

Page 30:

Who is using it?

Many many others…

Page 31:

Scalding…

…was open sourced by Twitter in 2011

…has more than 100 open source contributors

…exposes the right abstractions

…maximizes expressiveness

…promotes extensibility

…adds new capabilities to Cascading

Page 32:

Core Concepts

Page 33:

Sources & Sinks

    // Read a TSV file with three named fields and write it back out as Avro
    Tsv("data.tsv", ('productID, 'price, 'quantity))
      .read
      .write(UnpackedAvroSource("data.avro"))

✓ Tsv ✓ Csv ✓ Osv ✓ Avro ✓ Parquet ✓ …

Page 34:

Map Operations

    pipe1.filter('age) { age: Int => age > 18 }                        // keep adults only
    pipe1.map('price -> 'withVAT) { price: Double => price * 1.2 }     // add a derived field
    pipe1.project('name, 'surname)                                     // keep only these fields

15 map operations, translated into map phases

Page 35:

Join operations

    pipe1.joinWithSmaller('productId -> 'productId, pipe2)
    pipe1.joinWithLarger('productId -> 'productId, pipe2)
    pipe1.joinWithTiny('productId -> 'productId, pipe2)

Optimize by hinting the relative sizes

Supports Left, Right, Inner, Outer Joins

    pipe1
      .joinWithSmaller('productId -> 'productId, pipe2,
        joiner = new LeftJoin)
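The difference between an inner and a left join can be illustrated on small in-memory collections. The sketch below uses plain Scala, not Scalding; `JoinSketch`, `Sale` and `Product` are hypothetical names. Building an in-memory index of the smaller side is, loosely, the kind of strategy a size hint makes possible, though how Scalding plans each join variant is not shown here.

```scala
object JoinSketch {
  case class Sale(productId: Int, quantity: Int)
  case class Product(productId: Int, name: String)

  // Inner join: keep only the sales that have a matching product.
  def innerJoin(sales: Seq[Sale], products: Seq[Product]): Seq[(Sale, Product)] = {
    val byId = products.groupBy(_.productId) // index the smaller side once
    for {
      s <- sales
      p <- byId.getOrElse(s.productId, Seq.empty)
    } yield (s, p)
  }

  // Left join: keep every sale; the product is None when nothing matches.
  def leftJoin(sales: Seq[Sale], products: Seq[Product]): Seq[(Sale, Option[Product])] = {
    val byId = products.groupBy(_.productId)
    sales.flatMap { s =>
      byId.get(s.productId) match {
        case Some(ps) => ps.map(p => (s, Option(p)))
        case None     => Seq((s, Option.empty[Product]))
      }
    }
  }
}
```

With two sales and one known product, `innerJoin` returns a single pair while `leftJoin` returns both sales, the unmatched one paired with `None`.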

Page 36:

Group operations

    val pipe = Tsv("input", ('shopId, 'itemId, 'quantity))
      .groupBy('shopId) {
        _.sum[Long]('quantity -> 'totalSoldItems)
      }
      .write(Tsv("results.tsv"))

.groupBy: group by particular fields

.groupAll: group all data
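What the two grouping styles compute can be shown on an in-memory list. This is a sketch in plain Scala rather than Scalding; `GroupSketch`, `Row`, `totalSoldPerShop` and `totalSold` are hypothetical names chosen to mirror the fields on the slide.

```scala
object GroupSketch {
  case class Row(shopId: String, itemId: String, quantity: Long)

  // Like groupBy('shopId) with a sum of 'quantity: one total per shop.
  def totalSoldPerShop(rows: Seq[Row]): Map[String, Long] =
    rows.groupBy(_.shopId).map { case (shop, rs) => shop -> rs.map(_.quantity).sum }

  // Like groupAll: every row falls into a single group, giving one grand total.
  def totalSold(rows: Seq[Row]): Long =
    rows.map(_.quantity).sum
}
```

Grouping by a field yields one aggregate per key, while grouping everything collapses the whole data set into a single aggregate.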

Page 37:

Pipe operations

    val p = (pipe1 ++ pipe2)          // Concatenate 2 pipes
      .debug                          // Print sample data to screen
      .addTrap(Tsv("bogus_lines"))    // Dirty data are recorded

Simple pipe operations

Page 38:

Connect with external systems

Page 39:

Scalding + Hive

    class HiveExample(args: Args) extends Job(args) {

      // Define the schema
      val USER_SCHEMA = List('userId, 'username, 'photo)

      // Query HCatalog, then read directly from HDFS
      HiveSource("myHiveTable", SinkMode.KEEP)
        .withHCatScheme(osvInputScheme(fields = USER_SCHEMA))
        .write(Tsv("outputFromHive"))
    }

Page 40:

Scalding + ElasticSearch

    val schema = List('number, 'product, 'description)

    // Read from ElasticSearch in one line
    val readES = ElasticSearchTap("localhost", 9200, "index/firstType", "", schema)
      .read
      .write(Tsv("data/es-out.tsv"))

    // Index new data in ElasticSearch
    val writeES = Tsv("data.tsv")
      .read
      .write(ElasticSearchTap("localhost", 9200, "index/secondType", "", schema))

Read from ElasticSearch in one line! Also index new data in ES

Page 41:

Design patterns

Page 42:

✓ Dependency Injection ✓ Late bound ✓ External Operations

Page 43:

How about defining external operations?

    val pipe1 = Tsv("omniture.tsv", OMNITURE_SCHEMA)
      .read
      .ETLOmnitureData
      .calculateOmnitureUserStats
      .joinWithCustomerDB('userId -> 'userId, customerPipe)
      .write(Tsv("omniture-results.tsv"))

Custom operations: ✓ Re-usable modular code ✓ Single responsibility ✓ Testability

Full code: http://bit.ly/1pNSUKf

Page 44:

Scalding Testing

Page 45:

Testing challenges in the context of MR

[Diagram of test layers: Acceptance Tests, Unit / Component Tests, System Tests, Integration Tests]

Scalding enables testing in every layer & TDD

Page 46:

example

    import com.twitter.scalding._
    import org.scalatest.FlatSpec
    import org.scalatest.matchers.ShouldMatchers

    class TsvWordCountJobTest extends FlatSpec
      with ShouldMatchers with TupleConversions {

      "WordCountJob" should "count words" in {
        JobTest(new WordCountJob(_))
          .arg("input", "inFile")
          .arg("output", "outFile")
          // In-memory source: one (offset, line) tuple
          .source(TextLine("inFile"), List("0" -> "cool Scala cool"))
          // In-memory sink: assert on the collected output
          .sink[(String, Int)](Tsv("outFile")) { out =>
            out.toList should contain ("cool" -> 2)
          }
          .run
          .finish
      }
    }

Replaces taps with in-memory collections and asserts the expected output

Page 47:

Monitoring

Page 48:

“Driven takes Cascading application development to the next level with management and monitoring capabilities for your apps”

http://driven.cascading.io

Page 49:

Collects telemetry data and exposes it through a web UI

Page 50:

Advanced Concepts

Page 51:

Scalding adds:

  • Typed API
  • Matrix API
  • Graphs
  • Machine Learning Algorithms

Page 52:

What does the future look like?

Page 53:

So far…

[Diagram, labeled "abstraction"]

Page 54:

Real Time, Batch, Hybrid

[Diagram, labeled "abstraction"]

Summingbird

A unified API for everything

Storm TEZ Spark

Enables the Lambda architecture

Page 55:

Questions?