Scalding - Hadoop Word Count in Less than 70 Lines of Code


DESCRIPTION

Twitter's Scalding is built on top of Cascading, which in turn is built on top of Hadoop. It is essentially a DSL for writing MapReduce jobs that is pleasant to read and easy to extend.

TRANSCRIPT

Pages 1-3: Title

Scalding: Hadoop Word Count in < 70 lines of code
...in 4 lines of code

Konrad 'ktoso' Malawski, JARCamp #3, 12.04.2013

Pages 4-13: Agenda

Why Scalding? (10%)
+ Hadoop Basics (20%)
+ Enter Cascading (40%)
+ Hello Scalding (30%)
= 100%

Page 14: Why Scalding? Word Count in Types

type Word = String
type Count = Int

String => Map[Word, Count]

Pages 15-23: Why Scalding? Word Count in Scala

val text = "a a a b b"

def wordCount(text: String): Map[Word, Count] =
  text
    .split(" ")
    .map(a => (a, 1))
    .groupBy(_._1)
    .map { a => a._1 -> a._2.map(_._2).sum }

wordCount(text) should equal (Map("a" -> 3, "b" -> 2))
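For reference, here is a self-contained version of the snippet above that compiles outside a ScalaTest suite (a minimal sketch; the type aliases come from the previous slide, and a plain assert stands in for ScalaTest's should equal):

object WordCountInScala extends App {
  type Word = String
  type Count = Int

  def wordCount(text: String): Map[Word, Count] =
    text
      .split(" ")
      .map(a => (a, 1))
      .groupBy(_._1)
      .map { a => a._1 -> a._2.map(_._2).sum }

  // "a" occurs 3 times, "b" occurs 2 times
  assert(wordCount("a a a b b") == Map("a" -> 3, "b" -> 2))
}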

Pages 24-29: Stuff > Memory

Scala collections... fun, but memory bound!

val text = "so many words... waaah! ..."

text                                      // in Memory
  .split(" ")                             // in Memory
  .map(a => (a, 1))                       // in Memory
  .groupBy(_._1)                          // in Memory
  .map(a => (a._1, a._2.map(_._2).sum))   // in Memory

Every single step materializes a full collection in memory.
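To make the point concrete: even plain Scala can avoid materializing each intermediate collection by folding over a token iterator in a single pass (a sketch under the assumption that only the final counts must fit in memory; the name wordCountStreaming is ours, not from the slides):

// Single pass: no intermediate list of pairs and no groupBy result are kept;
// only the running Map of counts stays in memory.
def wordCountStreaming(lines: Iterator[String]): Map[String, Int] =
  lines
    .flatMap(_.split(" "))
    .foldLeft(Map.empty[String, Int].withDefaultValue(0)) { (counts, word) =>
      counts.updated(word, counts(word) + 1)
    }

Once the input itself no longer fits on one machine, though, this trick stops helping. That is where Hadoop comes in.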

Page 30: Scalding - Hadoop Word Count in LESS than 70 lines of code

Apache Hadoop (HDFS + MR)http://hadoop.apache.org/

Friday, April 12, 13

Pages 31-32: Why Scalding? Word Count in Hadoop MR

package org.myorg;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;

public class WordCount {

  public static class Map extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output, Reporter reporter)
        throws IOException {
      String line = value.toString();
      StringTokenizer tokenizer = new StringTokenizer(line);
      while (tokenizer.hasMoreTokens()) {
        word.set(tokenizer.nextToken());
        output.collect(word, one);
      }
    }
  }

  public static class Reduce extends MapReduceBase
      implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> output, Reporter reporter)
        throws IOException {
      int sum = 0;
      while (values.hasNext()) {
        sum += values.next().get();
      }
      output.collect(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(WordCount.class);
    conf.setJobName("wordcount");

    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);

    conf.setMapperClass(Map.class);
    conf.setCombinerClass(Reduce.class);
    conf.setReducerClass(Reduce.class);

    conf.setInputFormat(TextInputFormat.class);
    conf.setOutputFormat(TextOutputFormat.class);

    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    JobClient.runJob(conf);
  }
}

Pages 33-41: Trivia: How old is Hadoop?

(image-only slides)

Pages 42-46: Cascading

www.cascading.org/

Cascading is: Taps & Pipes & Sinks

Pages 47-55: 1: Distributed Copy

// source Tap
Tap inTap = new Hfs(new TextDelimited(true, "\t"), inPath);

// sink Tap
Tap outTap = new Hfs(new TextDelimited(true, "\t"), outPath);

// a Pipe, connects taps
Pipe copyPipe = new Pipe("copy");

// build the Flow
FlowDef flowDef = FlowDef.flowDef()
  .addSource(copyPipe, inTap)
  .addTailSink(copyPipe, outTap);

// run!
flowConnector.connect(flowDef).complete();

Pages 56-61: 1. DCP - Full Code

public class Main {
  public static void main(String[] args) {
    String inPath = args[0];
    String outPath = args[1];

    Properties props = new Properties();
    AppProps.setApplicationJarClass(props, Main.class);
    HadoopFlowConnector flowConnector = new HadoopFlowConnector(props);

    Tap inTap = new Hfs(new TextDelimited(true, "\t"), inPath);
    Tap outTap = new Hfs(new TextDelimited(true, "\t"), outPath);

    Pipe copyPipe = new Pipe("copy");

    FlowDef flowDef = FlowDef.flowDef()
      .addSource(copyPipe, inTap)
      .addTailSink(copyPipe, outTap);

    flowConnector.connect(flowDef).complete();
  }
}

Pages 62-71: 2: Word Count

String docPath = args[0];
String wcPath = args[1];

Properties props = new Properties();
AppProps.setApplicationJarClass(props, Main.class);
HadoopFlowConnector flowConnector = new HadoopFlowConnector(props);

// create source and sink taps
Tap docTap = new Hfs(new TextDelimited(true, "\t"), docPath);
Tap wcTap = new Hfs(new TextDelimited(true, "\t"), wcPath);

// specify a regex operation to split the "document" text lines into a token stream
Fields token = new Fields("token");
Fields text = new Fields("text");
RegexSplitGenerator splitter = new RegexSplitGenerator(token, "[ \\[\\]\\(\\),.]");
// only returns "token"
Pipe docPipe = new Each("token", text, splitter, Fields.RESULTS);

// determine the word counts
Pipe wcPipe = new Pipe("wc", docPipe);
wcPipe = new GroupBy(wcPipe, token);
wcPipe = new Every(wcPipe, Fields.ALL, new Count(), Fields.ALL);

// connect the taps, pipes, etc., into a flow
FlowDef flowDef = FlowDef.flowDef()
  .setName("wc")
  .addSource(docPipe, docTap)
  .addTailSink(wcPipe, wcTap);

// write a DOT file and run the flow
Flow wcFlow = flowConnector.connect(flowDef);
wcFlow.writeDOT("dot/wc.dot");
wcFlow.complete();

Pages 72-78: Cascading - how?

// pseudo code...

val flow = FlowDef
val flowConnector: FlowDef => List[MRJob] = ...

val jobs: List[MRJob] = flowConnector(flow)

HadoopCluster.execute(jobs)

Pages 79-81: Cascading tips

Pipe assembly = new Pipe("assembly");
assembly = new Each(assembly, DebugLevel.VERBOSE, new Debug());
// ...

// head and tail have same name
FlowDef flowDef = new FlowDef()
  .setName("debug")
  .addSource("assembly", source)
  .addSink("assembly", sink)
  .addTail(assembly);

flowDef.setDebugLevel(DebugLevel.NONE);

With DebugLevel.NONE set, the flowConnector will NOT create the Debug pipe!

Page 82: Scalding = Scala + Cascading

Twitter Scalding: github.com/twitter/scalding

Page 83: Scalding API

Pages 84-94: map

Scala:

val data = 1 :: 2 :: 3 :: Nil

val doubled = data map { _ * 2 }   // Int => Int

Scalding:

IterableSource(data)
  .map('number -> 'doubled) { n: Int => n * 2 }   // Int => Int

The input field 'number is available in the Pipe and stays in the Pipe; 'doubled is added next to it.
Note that you must choose the type of the function's argument (n: Int) yourself!
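For context, this is roughly how such a map step sits inside a complete job (a hypothetical sketch; DoubleJob and the one-column TSV layout are our assumptions, not from the slides):

import com.twitter.scalding._

// Hypothetical job: doubles every number in a one-column TSV.
// 'number comes from the source; map adds 'doubled alongside it.
class DoubleJob(args: Args) extends Job(args) {
  Tsv(args("input"), 'number)
    .map('number -> 'doubled) { n: Int => n * 2 }
    .write(Tsv(args("output")))   // writes both 'number and 'doubled
}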

Pages 95-106: mapTo

Scala:

var data = 1 :: 2 :: 3 :: Nil

val doubled = data map { _ * 2 }   // Int => Int
data = null   // release the reference

Scalding:

IterableSource(data)
  .mapTo('doubled) { n: Int => n * 2 }   // Int => Int

'doubled stays in the Pipe; 'number is removed.
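In other words, mapTo keeps only the newly produced fields. A hypothetical job making the contrast with map explicit (our naming, and written here with an explicit 'number -> 'doubled field pair):

import com.twitter.scalding._

// Hypothetical job: like DoubleJob above, but only 'doubled survives.
// mapTo behaves like map followed by projecting away the input field.
class MapToJob(args: Args) extends Job(args) {
  Tsv(args("input"), 'number)
    .mapTo('number -> 'doubled) { n: Int => n * 2 }
    .write(Tsv(args("output")))   // the output contains only 'doubled
}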

Pages 107-118: flatMap

Scala:

val data = "1" :: "2,2" :: "3,3,3" :: Nil   // List[String]

val numbers = data flatMap { line =>   // String
  line.split(",")                      // Array[String]
} map { _.toInt }                      // List[Int]

numbers // List[Int]
numbers should equal (List(1, 2, 2, 3, 3, 3))

Scalding:

TextLine(data)                                // like List[String]
  .flatMap('line -> 'word) { _.split(",") }   // like List[String]
  .map('word -> 'number) { _.toInt }          // like List[Int]

Here the map is applied outside the flatMap, as a separate (MR) pipe operation.

Pages 119-130: flatMap

Scala:

val data = "1" :: "2,2" :: "3,3,3" :: Nil   // List[String]

val numbers = data flatMap { line =>   // String
  line.split(",").map(_.toInt)         // Array[Int]
}

numbers // List[Int]
numbers should equal (List(1, 2, 2, 3, 3, 3))

Scalding:

TextLine(data)                                            // like List[String]
  .flatMap('line -> 'number) { _.split(",").map(_.toInt) }   // like List[Int]

Here the map happens inside the flatMap's function, in plain Scala, so the pipe needs only one operation. Both variants are contrasted in the sketch below.
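A hypothetical job putting the two variants side by side (FlatMapJob and the output paths are our naming; TextLine's 'line field is standard Scalding):

import com.twitter.scalding._

// Hypothetical job contrasting the two flatMap variants above.
class FlatMapJob(args: Args) extends Job(args) {
  // Variant 1: flatMap to strings, then convert in a separate map step.
  TextLine(args("input"))
    .flatMap('line -> 'word) { line: String => line.split(",") }
    .map('word -> 'number) { s: String => s.toInt }
    .write(Tsv(args("outputA")))

  // Variant 2: convert inside the flatMap function; one pipe operation.
  TextLine(args("input"))
    .flatMap('line -> 'number) { line: String => line.split(",").map(_.toInt) }
    .write(Tsv(args("outputB")))
}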

Pages 131-142: groupBy

Scala:

val data = 1 :: 2 :: 30 :: 42 :: Nil   // List[Int]

val groups = data groupBy { _ < 10 }

groups // Map[Boolean, List[Int]]

groups(true) should equal (List(1, 2))
groups(false) should equal (List(30, 42))

Scalding:

IterableSource(List(1, 2, 30, 42), 'num)
  .map('num -> 'lessThanTen) { i: Int => i < 10 }
  .groupBy('lessThanTen) { _.size }

This groups all rows with an equal 'lessThanTen value; the group sizes end up in a new field, which you could name via _.size('lessThanTenCounts).

Pages 143-147: groupBy

Scalding:

IterableSource(List(1, 2, 30, 42), 'num)
  .map('num -> 'lessThanTen) { i: Int => i < 10 }
  .groupBy('lessThanTen) { _.sum('total) }

'total = [3, 72]
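A hypothetical job combining both group operations in a single pass (GroupByJob is our naming; size and reduce are standard GroupBuilder operations, used here instead of sum to keep the integer types explicit):

import com.twitter.scalding._

// Hypothetical job: per group, count the rows and sum the numbers.
class GroupByJob(args: Args) extends Job(args) {
  IterableSource(List(1, 2, 30, 42), 'num)
    .map('num -> 'lessThanTen) { i: Int => i < 10 }
    .groupBy('lessThanTen) { group =>
      group
        .size('count)                                           // rows per group
        .reduce('num -> 'total) { (a: Int, b: Int) => a + b }   // sum per group
    }
    .write(Tsv(args("output")))   // expected rows: (false, 2, 72) and (true, 2, 3)
}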

Pages 148-159: Scalding API

project / discard
map / mapTo
flatMap / flatMapTo
rename
filter
unique
groupBy / groupAll / groupRandom / shuffle
limit
debug
Group operations
joins

(several of these are shown together in the sketch below)
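A hypothetical job touching several of the listed operations in one pipeline (ApiTourJob and the column layout are our assumptions, not from the slides):

import com.twitter.scalding._

// Hypothetical job: exercises filter, rename, project, unique and debug.
class ApiTourJob(args: Args) extends Job(args) {
  Tsv(args("input"), ('name, 'age, 'city))
    .filter('age) { a: Int => a >= 18 }   // keep adults only
    .rename('city -> 'town)
    .project('name, 'town)                // drop every other field
    .unique('name, 'town)                 // distinct rows only
    .debug                                // print tuples as they flow past
    .write(Tsv(args("output")))
}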

Pages 160-163: Distributed Copy in Scalding

class WordCountJob(args: Args) extends Job(args) {

  val input = Tsv(args("input"))
  val output = Tsv(args("output"))

  input.read.write(output)
}

The End.

Pages 164-165: Main Class - "Runner"

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.util.ToolRunner
import com.twitter.scalding

object ScaldingJobRunner extends App {

  ToolRunner.run(new Configuration, new scalding.Tool, args)
}

args comes for free, from the App trait.
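With such a runner on the classpath, a job is typically submitted along these lines (a hypothetical invocation; the jar name is a placeholder and the job class would in practice be fully qualified, while --hdfs, --input and --output are standard scalding.Tool / Args arguments):

hadoop jar scalding-demo.jar ScaldingJobRunner WordCountJob --hdfs --input input.txt --output counts.tsv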

Pages 166-174: Word Count in Scalding

class WordCountJob(args: Args) extends Job(args) {

  val inputFile = args("input")
  val outputFile = args("output")

  TextLine(inputFile)
    .flatMap('line -> 'word) { line: String => tokenize(line) }
    .groupBy('word) { _.size }
    .write(Tsv(outputFile))

  def tokenize(text: String): Array[String] = implemented
}

The brace on the slide marks the TextLine(...) pipeline: those 4 lines are the entire word count.
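The slides leave tokenize as "implemented"; a minimal sketch of what it might look like (this particular normalization is our assumption, not from the slides):

// Hypothetical tokenizer: lowercase, strip punctuation, split on whitespace.
def tokenize(text: String): Array[String] =
  text.toLowerCase.replaceAll("[^a-z0-9\\s]", "").split("\\s+")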

Page 175: Thanks!

Dzięki! Thanks! ありがとう! ("thanks" in Polish, English and Japanese)

Konrad Malawski @ java.pl
t: ktosopl / g: ktoso / b: blog.project13.pl