hadoop user group eu 2014

A QUICK INTRODUCTION TO THE CASCADING ECOSYSTEM Chris K Wensel | Hadoop Summit EU 2014

Upload: cwensel

Post on 27-Jan-2015


TRANSCRIPT

Page 1: Hadoop User Group EU 2014

A QUICK INTRODUCTION TO THE CASCADING ECOSYSTEM

Chris K Wensel | Hadoop Summit EU 2014

Page 2: Hadoop User Group EU 2014

WHO AM I?

• Lead developer of the Cascading open-source project

• Founder of Concurrent, Inc.

• Involved with Apache Hadoop since it was called Apache Nutch

• Systems Architect, not a Data Scientist

Page 3: Hadoop User Group EU 2014

For creating data-oriented applications, frameworks, and languages [on Apache Hadoop]

Originally designed to hide the complexity of Hadoop, so developers need not think in MapReduce

cascading.org

Page 4: Hadoop User Group EU 2014

SOME STATS

• Started in 2007

• 2.0 released June 2012

• 2.5 out now

• 3.0 WIP (if you look for it)

• Apache 2.0 Licensed

• Supports all Hadoop distros

Page 5: Hadoop User Group EU 2014


What’s it used for?

Page 6: Hadoop User Group EU 2014

• Cascading Java API

• Data normalization and cleansing of search and click-through logs for use by analytics tools

• Easy to operationalize heavy lifting of data

Page 7: Hadoop User Group EU 2014


• Cascalog (Clojure)

• Weather pattern modeling to protect growers against loss

• ETL against 20+ datasets daily

• Machine learning to create models

• Purchased by Monsanto for $930M US

Page 8: Hadoop User Group EU 2014

TWITTER

• Scalding (Scala)

• Machine learning (linear algebra) to improve

‣ User experience

‣ Ad quality (matching users and ad effectiveness)

• All revenue applications are running on Cascading/Scalding

• IPO

Page 9: Hadoop User Group EU 2014

• Estimate suicide risk from what people write online

• Cascading + Cassandra

• You can do more than optimize ad yields

• http://www.durkheimproject.org

Page 10: Hadoop User Group EU 2014

KEY PROJECTS

[Diagram: Lingual, Pattern, Scalding, and Cascalog built on Cascading, which runs on Apache Hadoop]

Page 11: Hadoop User Group EU 2014

CASCADING

• Java API (alternative to Hadoop MapReduce)

• Separates business logic from integration

• Testable at every lifecycle stage

• Works with any JVM language

• Many integration adapters

[Architecture diagram. Clients: Enterprise Java; Scripting (Scala, Clojure, JRuby, Jython, Groovy). Cascading: Processing API, Integration API, Scheduler API, Process Planner, Scheduler. Underneath: Apache Hadoop, Data Stores]

Page 12: Hadoop User Group EU 2014

SOME COMMON PATTERNS

• Functions

• Filters

• Joins

‣ Inner / Outer / Mixed

‣ Asymmetrical / Symmetrical

• Merge (Union)

• Grouping

‣ Secondary Sorting

‣ Unique (Distinct)

• Aggregations

‣ Count, Average, etc.

‣ Rolling windows

[Diagram: topology of a data flow built from functions and filters, arranged as a pipeline, a split, a join, and a merge]
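The patterns listed above can be illustrated without any cluster at all. The following is a minimal in-memory sketch in plain Java (java.util.stream); it does not use the Cascading API. The class name CommonPatternsSketch, its method names, and the sample data are hypothetical and chosen only for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class CommonPatternsSketch {

    // Merge (union) two inputs, apply a function, apply a filter,
    // then group by value and aggregate with a count.
    static Map<String, Long> mergeCleanAndCount(List<String> first, List<String> second) {
        return Stream.concat(first.stream(), second.stream())   // merge
            .map(String::toLowerCase)                            // function
            .filter(value -> !value.isEmpty())                   // filter
            .collect(Collectors.groupingBy(                      // grouping + aggregation
                value -> value, TreeMap::new, Collectors.counting()));
    }

    // Inner join: emit one row for each key present on both sides.
    static List<String> innerJoin(Map<Integer, String> left, Map<Integer, String> right) {
        List<String> joined = new ArrayList<>();
        for (Map.Entry<Integer, String> row : new TreeMap<>(left).entrySet()) {
            String match = right.get(row.getKey());
            if (match != null) {
                joined.add(row.getKey() + "," + row.getValue() + "," + match);
            }
        }
        return joined;
    }

    public static void main(String[] args) {
        System.out.println(mergeCleanAndCount(Arrays.asList("A", "b", ""), Arrays.asList("a")));

        Map<Integer, String> employees = new TreeMap<>();
        employees.put(1, "alice");
        employees.put(2, "bob");
        Map<Integer, String> departments = new TreeMap<>();
        departments.put(2, "sales");
        departments.put(3, "ops");
        System.out.println(innerJoin(employees, departments));
    }
}
```

Running main prints {a=2, b=1} followed by [2,bob,sales]. In Cascading the same shapes are expressed as pipes and operations that a planner maps onto cluster jobs; here they are ordinary method calls on in-memory collections.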

Page 13: Hadoop User Group EU 2014

word count – Cascading Java API

String docPath = args[ 0 ];
String wcPath = args[ 1 ];
Properties properties = new Properties();
AppProps.setApplicationJarClass( properties, Main.class );
HadoopFlowConnector flowConnector = new HadoopFlowConnector( properties );

// create source and sink taps
Tap docTap = new Hfs( new TextDelimited( true, "\t" ), docPath );
Tap wcTap = new Hfs( new TextDelimited( true, "\t" ), wcPath );

// specify a regex to split "document" text lines into token stream
Fields token = new Fields( "token" );
Fields text = new Fields( "text" );
RegexSplitGenerator splitter = new RegexSplitGenerator( token, "[ \\[\\]\\(\\),.]" );
// only returns "token"
Pipe docPipe = new Each( "token", text, splitter, Fields.RESULTS );
// determine the word counts
Pipe wcPipe = new Pipe( "wc", docPipe );
wcPipe = new GroupBy( wcPipe, token );
wcPipe = new Every( wcPipe, Fields.ALL, new Count(), Fields.ALL );

// connect the taps, pipes, etc., into a flow definition
FlowDef flowDef = FlowDef.flowDef().setName( "wc" )
  .addSource( docPipe, docTap )
  .addTailSink( wcPipe, wcTap );
// create the Flow
Flow wcFlow = flowConnector.connect( flowDef ); // <<-- Unit of Work
wcFlow.writeDOT( "wc.dot" ); // <<-- On Next Slide
wcFlow.complete(); // <<-- Runs jobs on Cluster

[Slide callouts on the code: configuration, integration, processing, scheduling]
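To make the data flow of the word count concrete, here is a plain-Java sketch that computes the same kind of result in memory, with no Cascading or Hadoop dependency. It reuses the delimiter pattern from the slide; the class name WordCountSketch and the sample lines are hypothetical, and dropping empty tokens is an assumption of this sketch rather than something the slide code does.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

public class WordCountSketch {

    // Same delimiter character class as the slide's RegexSplitGenerator pattern.
    static final String DELIMITERS = "[ \\[\\]\\(\\),.]";

    // Split each line into tokens, group equal tokens, and count each group.
    static Map<String, Long> countWords(List<String> lines) {
        return lines.stream()
            .flatMap(line -> Arrays.stream(line.split(DELIMITERS)))  // like Each + splitter
            .filter(token -> !token.isEmpty())                       // assumption: drop empty tokens
            .collect(Collectors.groupingBy(                          // like GroupBy + Every/Count
                token -> token, TreeMap::new, Collectors.counting()));
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("rain shadow (dry area)", "dry rain, rain.");
        System.out.println(countWords(lines));
    }
}
```

Running main prints {area=1, dry=2, rain=3, shadow=1}. The Cascading version differs in that the taps read and write files and the planner turns the pipe assembly into jobs on the cluster.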

Page 14: Hadoop User Group EU 2014

wc.dot

[Flow graph, from head to tail; edges are labelled with the fields passed between nodes, e.g. [{2}:'doc_id', 'text'], [{1}:'token'], [{2}:'token', 'count']]

[head]

Hfs['TextDelimited[['doc_id', 'text']->[ALL]]']['data/rain.txt']']

map: Each('token')[RegexSplitGenerator[decl:'token'][args:1]]

GroupBy('wc')[by:['token']]

reduce: Every('wc')[Count[decl:'count']]

Hfs['TextDelimited[[UNKNOWN]->['token', 'count']]']['output/wc']']

[tail]

Page 15: Hadoop User Group EU 2014

A REAL WORLD APP

1 App, 1 Flow, 75 Steps/MRJobs

[Graph of the 75 steps, numbered [1/75] to [75/75]. Steps 55, 57-62, 71, and 72 are map-only; every other step is map+reduce.]

Legend: green = map + reduce, purple = map, blue = join/merge, orange = map split

A graph of jobs, not operations!

Page 16: Hadoop User Group EU 2014


It’s not just for Java

Page 17: Hadoop User Group EU 2014

word count – Scalding (Scala)

// Sujit Pal
// sujitpal.blogspot.com/2012/08/scalding-for-impatient.html

package com.mycompany.impatient

import com.twitter.scalding._

class Part2(args : Args) extends Job(args) {
  val input = Tsv(args("input"), ('docId, 'text))
  val output = Tsv(args("output"))
  input.read.
    flatMap('text -> 'word) {
      text : String => text.split("""\s+""")
    }.
    groupBy('word) { group => group.size }.
    write(output)
}

Page 18: Hadoop User Group EU 2014

word count – Cascalog (Clojure)

; Paul Lam
; github.com/Quantisan/Impatient

(ns impatient.core
  (:use [cascalog.api]
        [cascalog.more-taps :only (hfs-delimited)])
  (:require [clojure.string :as s]
            [cascalog.ops :as c])
  (:gen-class))

(defmapcatop split [line]
  "reads in a line of string and splits it by regex"
  (s/split line #"[\[\]\\\(\),.)\s]+"))

(defn -main [in out & args]
  (?<- (hfs-delimited out)
       [?word ?count]
       ((hfs-delimited in :skip-header? true) _ ?line)
       (split ?line :> ?word)
       (c/count ?count)))

Page 19: Hadoop User Group EU 2014

“FOR THE IMPATIENT” SERIES

• Step-by-step tutorials on Cascading on GitHub

• Community has ported them to Scalding and Cascalog

• http://docs.cascading.org/impatient/

Page 20: Hadoop User Group EU 2014

WHY CASCADING?

• Foundation of patterns and best practices for building Languages, Frameworks, and Applications

• Designed to abstract Hadoop away from the business logic

• Models other than MapReduce are on the way!

Page 21: Hadoop User Group EU 2014

LINGUAL

• ANSI-compatible SQL

• JDBC Driver

• Cascading Java API

• SQL Command Shell

• Catalog Manager Tool

• Data Provider API

[Architecture diagram. Clients: CLI / Shell; Enterprise Java. Lingual: JDBC API, Lingual API, Provider API, Query Planner, Catalog. Underneath: Cascading, Apache Hadoop, Data Stores]

Page 22: Hadoop User Group EU 2014

Cascading API

FlowDef flowDef = FlowDef.flowDef()
  .setName( "sqlflow" )
  .addSource( "example.employee", emplTap )
  .addSource( "example.sales", salesTap )
  .addSink( "results", resultsTap );

SQLPlanner sqlPlanner = new SQLPlanner()
  .setSql( sqlStatement );

flowDef.addAssemblyPlanner( sqlPlanner );

Page 23: Hadoop User Group EU 2014

JDBC driver

public void run() throws ClassNotFoundException, SQLException {
  Class.forName( "cascading.lingual.jdbc.Driver" );
  Connection connection =
    DriverManager.getConnection(
      "jdbc:lingual:local;schemas=src/main/resources/data/example" );
  Statement statement = connection.createStatement();

  ResultSet resultSet = statement.executeQuery(
    "select *\n"
      + "from \"EXAMPLE\".\"SALES_FACT_1997\" as s\n"
      + "join \"EXAMPLE\".\"EMPLOYEE\" as e\n"
      + "on e.\"EMPID\" = s.\"CUST_ID\"" );

  // do something

  resultSet.close();
  statement.close();
  connection.close();
}

Page 24: Hadoop User Group EU 2014

SHELL - !TABLES


Page 25: Hadoop User Group EU 2014

# load the JDBC package
library(RJDBC)

# set up the driver
drv <- JDBC("cascading.lingual.jdbc.Driver",
  "~/src/concur/lingual/lingual-local/build/libs/lingual-local-1.0.0-wip-dev-jdbc.jar")

# set up a database connection to a local repository
connection <- dbConnect(drv,
  "jdbc:lingual:local;catalog=~/src/concur/lingual/lingual-examples/tables;schema=EMPLOYEES")

# query the repository: in this case the MySQL sample database (CSV files)
df <- dbGetQuery(connection,
  "SELECT * FROM EMPLOYEES.EMPLOYEES WHERE FIRST_NAME = 'Gina'")
head(df)

# use R functions to summarize and visualize part of the data
df$hire_age <- as.integer(as.Date(df$HIRE_DATE) - as.Date(df$BIRTH_DATE)) / 365.25
summary(df$hire_age)

library(ggplot2)
m <- ggplot(df, aes(x=hire_age))
m <- m + ggtitle("Age at hire, people named Gina")
m + geom_histogram(binwidth=1, aes(y=..density.., fill=..count..)) + geom_density()

Page 26: Hadoop User Group EU 2014

> summary(df$hire_age)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  20.86   27.89   31.70   31.61   35.01   43.92

Page 27: Hadoop User Group EU 2014


“But we use a custom data format”

Page 28: Hadoop User Group EU 2014

DATA PROVIDER API

• Any Cascading Tap and/or Scheme can be used from JDBC

• Use a “fat jar” on local disk or from a Maven repo

‣ cascading-jdbc:cascading-jdbc-oracle-provider:1.0

• The jar is dynamically loaded into the cluster

Page 29: Hadoop User Group EU 2014

SELECT ... FROM file1 JOIN file2 ON file1.id = file2.id ...

[Diagram: the query runs as a series of jobs on Amazon Elastic MapReduce, with file1 and file2 in Amazon S3 and results in Amazon Redshift]

Page 30: Hadoop User Group EU 2014

WHY LINGUAL

• Quickly migrate existing workloads from RDBMS to Hadoop

• Quickly extract data from Hadoop into applications

Page 31: Hadoop User Group EU 2014

PATTERN

• Predictive model scoring

• Java API and PMML parser

• Supports:

‣ (General) Regression

‣ Clustering

‣ Decision Trees

‣ Random Forest

‣ and ensembles of models

[Architecture diagram. Client: Enterprise Java. Pattern: PMML Parser, Pattern API. Underneath: Cascading, Apache Hadoop, Data Stores]
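As an illustration of what "scoring" means for the simplest supported model family, the sketch below applies a hand-written linear regression model to one record in plain Java. It is not Pattern's API and it does not parse PMML; in Pattern the model would come from a PMML file. The class name RegressionScoringSketch, the field names, and the coefficient values are all hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class RegressionScoringSketch {

    private final double intercept;
    private final Map<String, Double> coefficients;

    RegressionScoringSketch(double intercept, Map<String, Double> coefficients) {
        this.intercept = intercept;
        this.coefficients = coefficients;
    }

    // score = intercept + sum over fields of (coefficient * field value)
    double score(Map<String, Double> record) {
        double result = intercept;
        for (Map.Entry<String, Double> term : coefficients.entrySet()) {
            Double value = record.get(term.getKey());
            if (value == null) {
                throw new IllegalArgumentException("missing field: " + term.getKey());
            }
            result += term.getValue() * value;
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical model: 1.0 + 0.5 * sepal_length + 2.0 * petal_width
        Map<String, Double> coefficients = new LinkedHashMap<>();
        coefficients.put("sepal_length", 0.5);
        coefficients.put("petal_width", 2.0);
        RegressionScoringSketch model = new RegressionScoringSketch(1.0, coefficients);

        Map<String, Double> record = new LinkedHashMap<>();
        record.put("sepal_length", 4.0);
        record.put("petal_width", 1.5);
        System.out.println(model.score(record));
    }
}
```

Running main prints 6.0. Keeping the model (intercept and coefficients) separate from the record being scored mirrors the idea that a model is trained elsewhere and only applied on the cluster.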

Page 32: Hadoop User Group EU 2014

FlowDef flowDef = FlowDef.flowDef()
  .setName( "classifier" )
  .addSource( "input", inputTap )
  .addSink( "classify", classifyTap );

PMMLPlanner pmmlPlanner = new PMMLPlanner()
  .setPMMLInput( new File( pmmlModel ) )
  .retainOnlyActiveIncomingFields();

flowDef.addAssemblyPlanner( pmmlPlanner );

Page 33: Hadoop User Group EU 2014

WHY PATTERN

• Standards compliance provides integration with many tools

• Models are independent of data and integration

• Only debugging Cascading, not an ensemble of applications

Page 34: Hadoop User Group EU 2014

CLOSING THE LOOP

[Diagram. Desktop: import data (JDBC), create models, export models (PMML), import results (JDBC), test results. Cluster: execute models with Pattern as a PMML Flow; JDBC Flows move DATA between desktop and cluster; each Flow runs as Jobs]

Page 35: Hadoop User Group EU 2014

DRIVEN

• Understand how your application maps onto your cluster

• Identify bottlenecks (data, code, or the system)

• Jump to the line of code implicated in a failure

• Plugin available via Maven repo

• Beta UI hosted online

http://cascading.io/driven/

Page 36: Hadoop User Group EU 2014

MANAGED WITH DRIVEN


Page 37: Hadoop User Group EU 2014


Page 38: Hadoop User Group EU 2014

CASCADING 3.0

• New query planner

‣ User-definable Assertion and Transformation rules

‣ Sub-Graph Isomorphism Pattern Matching

‣ Cordella, L. P., Foggia, P., Sansone, C., & Vento, M. (2004). A (sub)graph isomorphism algorithm for matching large graphs. IEEE Transactions on Pattern Analysis and Machine Intelligence, 26(10), 1367–1372. doi:10.1109/TPAMI.2004.75

• Hadoop Tez support

• And likely other platforms

Page 39: Hadoop User Group EU 2014

THERE’S A BOOK!

Enterprise Data Workflows with Cascading, Paco Nathan

O’Reilly, 2013

amazon.com/dp/1449358721

Page 40: Hadoop User Group EU 2014

CONTACT


@cwensel | @cascading

[email protected]

www.cascading.org

www.concurrentinc.com