Large-scale Data Mining: MapReduce and beyond

Part 1: Basics

Spiros Papadimitriou, Google
Jimeng Sun, IBM Research
Rong Yan, Facebook

Monday, August 23, 2010



Data everywhere

Flickr (3 billion photos)
YouTube (83M videos, 15 hrs/min)
Web (10B videos watched / mo.)
Digital photos (500 billion / year)
All broadcast (70,000 TB / year)
Yahoo! Webmap (3 trillion links, 300 TB compressed, 5 PB disk)
Human genome (2-30 TB uncomp.)

So what?

more is: different!
more is: more …


Data everywhere

Opportunities
  Real-time access to content
  Richer context from users and hyperlinks
  Abundant training examples
  “Brute-force” methods may suffice

Challenges
  “Dirtier” data
  Efficient algorithms
  Scalability (with reasonable cost)


“The Google Way”

“All models are wrong, but some are useful” – George Box
“All models are wrong, and increasingly you can succeed without them.” – Peter Norvig

Google PageRank
Shotgun gene sequencing
Language translation
…

Chris Anderson, Wired (July 2008)


Getting over the marketing hype…

Cloud Computing = Internet + Commoditization/standardization

‘It’s what I and many others have worked towards our entire careers. It’s just happening now.’ – Eric Schmidt


This tutorial

Is not about cloud computing
But about large-scale data processing

Data + Algorithms


Tutorial overview

Part 1 (Spiros): Basic concepts & tools
  MapReduce & distributed storage
  Hadoop / HBase / Pig / Cascading / Hive

Part 2 (Jimeng): Algorithms
  Information retrieval
  Graph algorithms
  Clustering (k-means)
  Classification (k-NN, naïve Bayes)

Part 3 (Rong): Applications
  Text processing
  Data warehousing
  Machine learning


Outline

Introduction
MapReduce & distributed storage
Hadoop
  HBase
  Pig
  Cascading
  Hive
Summary


What is MapReduce?

Programming model?

Execution environment?

Software package?

It’s all of those things, depending on who you ask…

“MapReduce” (this talk) == distributed computation + distributed storage + scheduling / fault tolerance


Example – Programming model

employees.txt

    # LAST     FIRST    SALARY
    Smith      John     $90,000
    Brown      David    $70,000
    Johnson    George   $95,000
    Yates      John     $80,000
    Miller     Bill     $65,000
    Moore      Jack     $85,000
    Taylor     Fred     $75,000
    Smith      David    $80,000
    Harris     John     $90,000
    ...        ...      ...
    ...        ...      ...

Q: “What is the frequency of each first name?”

    from functools import reduce   # reduce is not a builtin on Python 3

    # mapper
    def getName(line):
        return line.split('\t')[1]

    # reducer
    def addCounts(hist, name):
        hist[name] = hist.get(name, 0) + 1   # dict.get takes its default positionally
        return hist

    input = open('employees.txt', 'r')
    intermediate = map(getName, input)
    result = reduce(addCounts, intermediate, {})


Example – Programming model

Q: “What is the frequency of each first name?” (same employees.txt as on the previous slide)

    from functools import reduce   # reduce is not a builtin on Python 3

    # mapper: emit a (key, value) pair
    def getName(line):
        return (line.split('\t')[1], 1)

    # reducer: fold one (key, value) pair into the histogram
    def addCounts(hist, pair):
        name, c = pair             # tuple parameters in the signature are Python 2 only
        hist[name] = hist.get(name, 0) + c
        return hist

    input = open('employees.txt', 'r')
    intermediate = map(getName, input)
    result = reduce(addCounts, intermediate, {})

Key-value iterators
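The key-value form is what lets a framework group intermediate pairs by key and hand each reducer call one key with all of its values. Below is a minimal single-machine sketch of that grouping step. It assumes tab-separated records in the employees.txt layout. The in-memory `lines` sample and the helper names `map_record`, `shuffle` and `reduce_key` are made up for illustration; they are not part of the slides or of Hadoop.

```python
from collections import defaultdict

# Hypothetical sample records in the employees.txt layout (LAST, FIRST, SALARY).
lines = [
    "Smith\tJohn\t$90,000",
    "Brown\tDavid\t$70,000",
    "Yates\tJohn\t$80,000",
    "Smith\tDavid\t$80,000",
    "Harris\tJohn\t$90,000",
]

def map_record(line):
    """Mapper: emit one (first_name, 1) pair per input record."""
    return (line.split("\t")[1], 1)

def shuffle(pairs):
    """Group values by key, standing in for what the framework does
    between the map and reduce phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_key(key, values):
    """Reducer: sum all counts seen for one key."""
    return (key, sum(values))

pairs = [map_record(line) for line in lines]
result = dict(reduce_key(k, vs) for k, vs in shuffle(pairs).items())
print(result)
```

Here the reducer sees an iterable of values for one key, which is roughly the shape of the `reduce(Text key, Iterator<LongWritable> vals, ...)` signature in the Hadoop version of this example.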


Example – Programming model: Hadoop / Java

    public class HistogramJob extends Configured implements Tool {

      public static class FieldMapper extends MapReduceBase
          implements Mapper<LongWritable,Text,Text,LongWritable> {

        private static LongWritable ONE = new LongWritable(1);
        private static Text firstname = new Text();

        @Override
        public void map (LongWritable key, Text value,
                         OutputCollector<Text,LongWritable> out,
                         Reporter r) throws IOException {
          // emit (first name, 1) for each input line
          firstname.set(value.toString().split("\t")[1]);
          out.collect(firstname, ONE);
        }
      } // class FieldMapper

      // class HistogramJob continues on the following slides

Callouts on the slide: “non-boilerplate”, “typed…”


Example – Programming model: Hadoop / Java

      public static class LongSumReducer extends MapReduceBase
          implements Reducer<Text,LongWritable,Text,LongWritable> {

        private static LongWritable sum = new LongWritable();

        @Override
        public void reduce (Text key, Iterator<LongWritable> vals,
                            OutputCollector<Text,LongWritable> out,
                            Reporter r) throws IOException {
          // sum all counts collected for this key
          long s = 0;
          while (vals.hasNext())
            s += vals.next().get();
          sum.set(s);
          out.collect(key, sum);
        }
      } // class LongSumReducer


Example – Programming model: Hadoop / Java

      public int run (String[] args) throws Exception {
        JobConf job = new JobConf(getConf(), HistogramJob.class);
        job.setJobName("Histogram");
        FileInputFormat.setInputPaths(job, args[0]);
        job.setMapperClass(FieldMapper.class);
        job.setCombinerClass(LongSumReducer.class);
        job.setReducerClass(LongSumReducer.class);
        // ...
        JobClient.runJob(job);
        return 0;
      } // run()

      public static void main (String[] args) throws Exception {
        ToolRunner.run(new Configuration(), new HistogramJob(), args);
      } // main()
    } // class HistogramJob

~30 lines = 25 boilerplate (Eclipse) + 5 actual code


MapReduce for…

Distributed clusters
  Google’s original
  Hadoop (Apache Software Foundation)

Hardware
  SMP/CMP: Phoenix (Stanford)
  Cell BE

Other
  Skynet (in Ruby/DRB)
  QtConcurrent
  BashReduce
  …many more


Recap

Quick-n-dirty script vs Hadoop
  Quick-n-dirty script: ~5 lines of (non-boilerplate) code; single machine, local drive
  Hadoop: ~5 lines of (non-boilerplate) code; up to thousands of machines and drives

What is hidden to achieve this:
  Data partitioning, placement and replication
  Computation placement (and replication)
  Number of nodes (mappers / reducers)
  …

As a programmer, you don’t need to know what I’m about to show you next…

Execution model: Flow

[Diagram] Input file: SPLIT 0, SPLIT 1, SPLIT 2, SPLIT 3 → four MAPPERs → two REDUCERs → Output file: PART 0, PART 1

Labels added to the diagram step by step:
  Sequential scan
  Key/value iterators
  All-to-all, hash partitioning
  Sort-merge

Example records traced through the flow:
  Input: Smith John $90,000 and Yates John $80,000
  After the mappers: John 1, John 1
  After the reducer: John 2
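The flow can be imitated on one machine to see how the four labels fit together. The sketch below is only an illustration: the split contents are made-up sample records, `mapper` and `partition` are hypothetical helper names, and CRC32 is used as the hash purely so that runs are repeatable (it is not claimed to be the partitioner Hadoop uses).

```python
import zlib
from itertools import groupby

NUM_REDUCERS = 2  # one per output part (PART 0 and PART 1)

# Hypothetical input file, already divided into four splits.
splits = [
    ["Smith\tJohn\t$90,000", "Brown\tDavid\t$70,000"],
    ["Johnson\tGeorge\t$95,000", "Yates\tJohn\t$80,000"],
    ["Miller\tBill\t$65,000", "Moore\tJack\t$85,000"],
    ["Taylor\tFred\t$75,000", "Smith\tDavid\t$80,000"],
]

def mapper(split):
    """Sequential scan of one split, emitting (first_name, 1) pairs."""
    for line in split:
        yield (line.split("\t")[1], 1)

def partition(key):
    """Choose a reducer from a hash of the key, so equal keys always
    end up at the same reducer."""
    return zlib.crc32(key.encode("utf-8")) % NUM_REDUCERS

# All-to-all: any mapper may send pairs to any reducer.
reducer_inputs = [[] for _ in range(NUM_REDUCERS)]
for split in splits:
    for key, value in mapper(split):
        reducer_inputs[partition(key)].append((key, value))

# Sort-merge: each reducer sorts its pairs so equal keys are adjacent,
# then sums the values of each run of equal keys.
parts = []
for pairs in reducer_inputs:
    pairs.sort(key=lambda kv: kv[0])
    part = {key: sum(v for _, v in group)
            for key, group in groupby(pairs, key=lambda kv: kv[0])}
    parts.append(part)

print(parts)
```

Which names land in which part depends on the hash, but every name appears in exactly one part, and the parts together hold the full histogram.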

Execution model: Placement

[Diagram: splits and their replicas stored on hosts, with mappers started next to them. Labels are listed in the order they appear on the slide.]

HOST 0
  SPLIT 0, replica 1/3
  MAPPER
  SPLIT 1, replica 2/3
  SPLIT 3, replica 2/3
HOST 1
  SPLIT 0, replica 2/3
  SPLIT 4, replica 1/3
  SPLIT 3, replica 1/3
HOST 2
  SPLIT 3, replica 3/3
  MAPPER
  SPLIT 2, replica 2/3
  SPLIT 0, replica 3/3
HOST 3
  SPLIT 2, replica 3/3
  MAPPER
  SPLIT 1, replica 1/3
  SPLIT 4, replica 2/3
  MAPPER
HOST 4
HOST 5
HOST 6

Computation co-located with data (as much as possible)
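One way to picture "computation co-located with data" is a greedy rule: run each split's mapper on a host that already stores a replica of that split, preferring the host with the fewest mappers so far. The sketch below assumes exactly that rule. It is not Hadoop's scheduler, `place_mappers` is a hypothetical helper, and the replica table only contains the replicas listed for hosts 0 to 3 on this slide.

```python
# Replica locations as listed for hosts 0 to 3 (split -> hosts holding a copy).
replicas = {
    0: [0, 1, 2],
    1: [0, 3],
    2: [2, 3],
    3: [0, 1, 2],
    4: [1, 3],
}

def place_mappers(replicas):
    """Greedy, data-local placement: for each split, pick the least-loaded
    host among those that already hold a replica of that split
    (ties broken by the lower host number)."""
    load = {}       # host -> number of mappers assigned so far
    placement = {}  # split -> chosen host
    for split, hosts in sorted(replicas.items()):
        host = min(hosts, key=lambda h: (load.get(h, 0), h))
        placement[split] = host
        load[host] = load.get(host, 0) + 1
    return placement

placement = place_mappers(replicas)
print(placement)
```

With three replicas per split there is usually some host where the mapper can read its input from a local drive; a real scheduler also has to cope with busy or failed hosts, which this sketch ignores.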

Page 59: Large-scale Data Mining: MapReduce and beyondysmoon/courses/2011_1/... · Tutorial overview Part 1 (Spiros): Basic concepts & tools MapReduce & distributed storage Hadoop / HBase

19

Execution model: Placement

[Figure: split replicas, mappers, combiners and the reducer placed on cluster hosts]

HOST 0: SPLIT 0 (replica 1/3), MAPPER, SPLIT 1 (replica 2/3), SPLIT 3 (replica 2/3)
HOST 1: SPLIT 0 (replica 2/3), SPLIT 4 (replica 1/3), SPLIT 3 (replica 1/3)
HOST 2: SPLIT 3 (replica 3/3), MAPPER, SPLIT 2 (replica 2/3), SPLIT 0 (replica 3/3)
HOST 3: SPLIT 2 (replica 3/3), MAPPER, SPLIT 1 (replica 1/3), SPLIT 4 (replica 2/3), MAPPER
HOST 4, HOST 5, HOST 6
REDUCER

Rack/network-aware
C: COMBINER (shown four times in the figure)
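The placement idea in the figure can be sketched in a few lines of Python. This is a toy scheduler under stated assumptions, not Hadoop's actual algorithm: it uses the replica layout of hosts 0 to 3 as listed on the slide, an assumed limit of two mapper slots per host, and a hypothetical `place_mappers` helper that only ever picks a host already holding a replica of the split.

```python
# Replica layout read off the slide: host -> ids of the splits stored locally.
REPLICAS = {
    "HOST 0": {0, 1, 3},
    "HOST 1": {0, 4, 3},
    "HOST 2": {3, 2, 0},
    "HOST 3": {2, 1, 4},
}


def place_mappers(splits, replicas, slots_per_host=2):
    """Assign each split to a host that stores a replica of it.

    Hypothetical helper for illustration; slots_per_host is an assumption.
    """
    load = {host: 0 for host in replicas}
    placement = {}
    for split in splits:
        # Candidates hold the data locally and still have a free slot.
        local = [h for h in sorted(replicas)
                 if split in replicas[h] and load[h] < slots_per_host]
        if not local:
            raise RuntimeError(f"no data-local slot for split {split}")
        # Prefer the least-loaded local host to spread the work.
        host = min(local, key=lambda h: load[h])
        placement[split] = host
        load[host] += 1
    return placement


placement = place_mappers(range(5), REPLICAS)
```

Because every mapper reads its split from local disk in this sketch, only the (combined) map output has to cross the network on its way to the reducer.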

MapReduce Summary

Simple programming model
Scalable, fault-tolerant
Ideal for (pre-)processing large volumes of data

‘However, if the data center is the computer, it leads to the even more intriguing question “What is the equivalent of the ADD instruction for a data center?” […] If MapReduce is the first instruction of the “data center computer”, I can’t wait to see the rest of the instruction set, as well as the data center programming language, the data center operating system, the data center storage systems, and more.’

– David Patterson, “The Data Center Is The Computer”, CACM, Jan. 2008

Outline

Introduction
MapReduce & distributed storage
Hadoop
HBase
Pig
Cascading
Hive
Summary

Hadoop

[Figure: Hadoop component stack: Core, Avro, MapReduce, HDFS, ZooKeeper, HBase, Hive, Pig, Chukwa]

Hadoop’s stated mission (Doug Cutting interview):
  Commoditize infrastructure for web-scale, data-intensive applications

Who uses Hadoop?

Yahoo!
Facebook
Last.fm
Rackspace
Digg
Apache Nutch
… more in part 3

Hadoop components

Core: Filesystems and I/O
  Abstraction APIs
  RPC / Persistence

Avro: Cross-language serialization
  RPC / persistence
  ~ Google ProtoBuf / FB Thrift

MapReduce: Distributed execution (batch)
  Programming model
  Scalability / fault-tolerance

HDFS: Distributed storage (read-opt.)
  Replication / scalability
  ~ Google filesystem (GFS)

ZooKeeper: Coordination service
  Locking / configuration
  ~ Google Chubby

HBase: Column-oriented, sparse store
  Batch & random access
  ~ Google BigTable

Pig: Data flow language
  Procedural SQL-inspired lang.
  Execution environment

Hive: Distributed data warehouse
  SQL-like query language
  Data mgmt / query execution

Hadoop

[Figure: Hadoop component stack: HBase, MapReduce, Core, Avro, HDFS, ZooKeeper, Hive, Pig, Chukwa, … more]

MapReduce

Mapper: (k1, v1) → (k2, v2)[]
  E.g., (void, textline : string) → (first : string, count : int)
Reducer: (k2, v2[]) → (k3, v3)[]
  E.g., (first : string, counts : int[]) → (first : string, total : int)
Combiner: (k2, v2[]) → (k2, v2)[]
Partition: (k2, v2) → int
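The four signatures above can be tried out with a small in-memory simulation. This is a minimal sketch rather than Hadoop code: `run_job` and `group` are hypothetical helpers, "first" is assumed to mean the first token of each text line, and the partition function takes the number of reducers as an extra argument.

```python
from collections import defaultdict


def mapper(_key, textline):
    """(k1, v1) -> [(k2, v2)]: emit (first token, 1) for one line of text."""
    words = textline.split()
    return [(words[0], 1)] if words else []


def combiner(key, values):
    """(k2, [v2]) -> [(k2, v2)]: pre-aggregate one mapper's output locally."""
    return [(key, sum(values))]


def reducer(key, values):
    """(k2, [v2]) -> [(k3, v3)]: total count per key."""
    return [(key, sum(values))]


def partition(key, _value, num_reducers):
    """(k2, v2) -> int: pick the reducer that receives this key."""
    return sum(key.encode("utf-8")) % num_reducers


def group(pairs):
    """Collect values under their key (stand-in for sort/shuffle grouping)."""
    grouped = defaultdict(list)
    for k, v in pairs:
        grouped[k].append(v)
    return grouped


def run_job(splits, num_reducers=2):
    # Map and combine, one input split at a time.
    buckets = [[] for _ in range(num_reducers)]
    for split in splits:
        mapped = [kv for line in split for kv in mapper(None, line)]
        for k, vs in group(mapped).items():
            for k2, v2 in combiner(k, vs):
                buckets[partition(k2, v2, num_reducers)].append((k2, v2))
    # Shuffle and reduce, one partition at a time, keys in sorted order.
    output = {}
    for bucket in buckets:
        for k, vs in sorted(group(bucket).items()):
            output.update(reducer(k, vs))
    return output


splits = [["john smith", "david brown"],
          ["john yates", "bill miller", "john harris"]]
totals = run_job(splits)
```

The combiner has the same input type as the reducer but must emit the mapper's output type, which is why it can be applied zero or more times without changing the result of a sum.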

Mapper interface

interface Mapper<K1, V1, K2, V2> {
  void configure (JobConf conf);
  void map (K1 key, V1 value,
            OutputCollector<K2, V2> out,
            Reporter reporter);
  void close();
}

Initialize in configure()
Clean-up in close()
Emit via out.collect(key,val) any time

Reducer interface

interface Reducer<K2, V2, K3, V3> {
  void configure (JobConf conf);
  void reduce (K2 key, Iterator<V2> values,
               OutputCollector<K3, V3> out,
               Reporter reporter);
  void close();
}

Initialize in configure()
Clean-up in close()
Emit via out.collect(key,val) any time
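The configure / map / close lifecycle, and the remark that output can be emitted at any time, can be illustrated with a plain-Python analogue. Everything here is a stand-in: `Collector`, `CountingMapper` and `run_task` are hypothetical names, the `lowercase` option is invented for the example, and the `Reporter` argument is left out. The mapper counts in memory and emits only in `close()`, keeping a reference to the collector it saw in `map()`.

```python
class Collector:
    """Stand-in for OutputCollector: records the emitted pairs."""

    def __init__(self):
        self.pairs = []

    def collect(self, key, value):
        self.pairs.append((key, value))


class CountingMapper:
    """Counts first tokens in memory and emits the totals in close()."""

    def configure(self, conf):
        # Called once per task, before any record is seen.
        self.counts = {}
        self.lowercase = conf.get("lowercase", True)  # hypothetical option
        self.out = None

    def map(self, key, value, out):
        # Called once per input record; remember the collector for close().
        self.out = out
        tokens = value.split()
        if tokens:
            word = tokens[0].lower() if self.lowercase else tokens[0]
            self.counts[word] = self.counts.get(word, 0) + 1

    def close(self):
        # Called once after the last record; emitting here is still allowed.
        if self.out is not None:
            for word, n in sorted(self.counts.items()):
                self.out.collect(word, n)


def run_task(mapper, records, conf):
    """Drive one mapper instance through its whole lifecycle."""
    out = Collector()
    mapper.configure(conf)
    for key, value in records:
        mapper.map(key, value, out)
    mapper.close()
    return out.pairs


lines = ["John Smith", "john Yates", "David Brown"]
pairs = run_task(CountingMapper(), enumerate(lines), {})
```

Deferring the emit to `close()` trades memory for less intermediate output, which plays a role similar to a combiner.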

Some canonical examples

Histogram-type jobs:
  Graph construction (bucket = edge)
  K-means et al. (bucket = cluster center)
Inverted index:
  Text indices
  Matrix transpose
Sorting
Equi-join
More details in part 2
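As a concrete instance of the inverted-index pattern listed above, here is a small in-memory sketch. The function names and the three sample documents are made up for illustration, and tokenization is a plain whitespace split; a matrix transpose follows the same shape with (row, column) playing the role of (document, term).

```python
from collections import defaultdict


def map_doc(doc_id, text):
    # Emit (term, doc_id) once per distinct term in the document.
    return [(term, doc_id) for term in sorted(set(text.lower().split()))]


def reduce_term(term, doc_ids):
    # Posting list: sorted ids of the documents that contain the term.
    return term, sorted(doc_ids)


def inverted_index(docs):
    # The dictionary stands in for the shuffle phase (group by term).
    shuffled = defaultdict(list)
    for doc_id, text in docs.items():
        for term, d in map_doc(doc_id, text):
            shuffled[term].append(d)
    return dict(reduce_term(t, ds) for t, ds in shuffled.items())


docs = {
    1: "large scale data mining",
    2: "data everywhere",
    3: "mining large graphs",
}
index = inverted_index(docs)
```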

Equi-joins: “Reduce-side”

Inputs:
  (Smith, 7), (Jones, 7), (Brown, 7), (Davis, 3), (Dukes, 5), (Black, 3), (Gruhl, 7)
  (Sales, 3), (Devel, 7), (Acct., 5)

MAP (records with join key 7):
  7: (, (Smith))    -OR-    (7,): (Smith)
  7: (, (Jones))
  7: (, (Brown))
  7: (, (Gruhl))
  7: (, (Devel))    -OR-    (7,): (Devel)

SHUF.:
  7: {(, (Smith)), (, (Jones)), (, (Brown)), (, (Gruhl)), (, (Devel)) }

RED.:
  (Smith, Devel), (Jones, Devel), (Brown, Devel), (Gruhl, Devel)
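The reduce-side join above can be reproduced in memory with the slide's own data. This is a sketch under assumptions: the tag values `"emp"` and `"dept"` are our own naming for "which input did this record come from", and a dictionary stands in for the shuffle.

```python
from collections import defaultdict

employees = [("Smith", 7), ("Jones", 7), ("Brown", 7), ("Davis", 3),
             ("Dukes", 5), ("Black", 3), ("Gruhl", 7)]
departments = [("Sales", 3), ("Devel", 7), ("Acct.", 5)]


def map_tagged(records, tag):
    # Key every record by the join column and tag the value with its source.
    return [(key, (tag, name)) for name, key in records]


def reduce_join(key, tagged_values):
    # Separate the two sides, then pair every employee with every department.
    depts = [v for tag, v in tagged_values if tag == "dept"]
    emps = [v for tag, v in tagged_values if tag == "emp"]
    return [(e, d) for e in emps for d in depts]


# Shuffle: group all tagged values that share a join key.
shuffled = defaultdict(list)
for key, value in map_tagged(employees, "emp") + map_tagged(departments, "dept"):
    shuffled[key].append(value)

joined = {key: reduce_join(key, values) for key, values in shuffled.items()}
```

In this variant the reducer buffers both sides of a key before pairing them. The -OR- form on the slide moves the tag into the key, which, with a suitable sort order, would let one side arrive first so that only that side needs buffering.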

HDFS & MapReduce processes

[Figure: the placement diagram (hosts 0 to 6, split replicas, mappers, combiners, reducer) annotated with the Hadoop daemons; HOST 0 to HOST 3 are shown enlarged]

DATANODE (DN): runs on the hosts that store split replicas
NAMENODE
TASKTRACKER (TT): runs on the hosts that execute tasks
JOBTRACKER

Hadoop Streaming & Pipes

Don’t have to use Java for MapReduce
Hadoop Streaming:
  Use stdin/stdout & text format
  Any language (C/C++, Perl, Python, shell, etc)
Hadoop Pipes:
  Use sockets & binary format (more efficient)
  C++ library required
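A Streaming-style mapper and reducer only read lines and write lines, so they are easy to sketch and to test locally. The example below assumes tab-separated `key<TAB>value` lines and counts the first token of each input line; the in-memory `io.StringIO` pipeline is a local stand-in for `mapper | sort | reducer`, not how a cluster job is launched.

```python
import io
from itertools import groupby


def stream_mapper(stdin, stdout):
    # Write one "key<TAB>value" line per input record.
    for line in stdin:
        fields = line.split()
        if fields:
            stdout.write(f"{fields[0]}\t1\n")


def stream_reducer(stdin, stdout):
    # Input arrives sorted by key, so equal keys are adjacent.
    rows = (line.rstrip("\n").split("\t", 1) for line in stdin)
    for key, group in groupby(rows, key=lambda kv: kv[0]):
        stdout.write(f"{key}\t{sum(int(v) for _, v in group)}\n")


# Local stand-in for the map, sort and reduce stages.
mapped = io.StringIO()
stream_mapper(io.StringIO("john smith\ndavid brown\njohn yates\n"), mapped)
shuffled = io.StringIO(
    "".join(sorted(mapped.getvalue().splitlines(keepends=True))))
reduced = io.StringIO()
stream_reducer(shuffled, reduced)
result = reduced.getvalue()
```

In an actual script the two functions would be called with `sys.stdin` and `sys.stdout` instead of the in-memory buffers.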


HBase introduction

MapReduce canonical example:
  Inverted index (more in Part 2)
Batch computations on large datasets:
  Build static index on crawl snapshot
However, in reality crawled pages are:
  Updated by crawler
  Augmented by other parsers/analytics
  Retrieved by cache search
  Etc…

HBase introduction

MapReduce & HDFS:
  Distributed storage + computation
  Good for batch processing
  But: no facilities for accessing or updating individual items
HBase:
  Adds random-access read / write operations
  Originally developed at Powerset
  Based on Google’s Bigtable

HBase data model

[Figure: a table of cells addressed by row key, column family and column]

Row key (billions; sorted)
Column family (hundreds; fixed)
Column (millions)
(thousands) Partitioned over many nodes

Keys and cell values are arbitrary byte arrays
Can use any underlying data store (local, HDFS, S3, etc)

Data model example

Row key: empId
profile: family
  profile:last = Smith
  profile:first = John
  profile:salary = $90,000
bm: (bookmarks) family
  bm:url1 … bm:urlN
Always access via primary key
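The data model can be pictured as a sorted, sparse map from row key to `family:column` cells, with all keys and values stored as bytes. The toy class below is a sketch of that picture only, not the HBase client API: `TinyTable`, its methods and the `emp…` row keys are hypothetical, and it ignores versions, regions and persistence.

```python
import bisect


class TinyTable:
    """Toy model: sorted row keys -> sparse {b"family:column": value}."""

    def __init__(self, families):
        self.families = set(families)  # fixed when the table is created
        self.keys = []                 # row keys, kept sorted
        self.rows = {}                 # row key -> dict of cells

    def put(self, row, column, value):
        family = column.split(b":", 1)[0]
        if family not in self.families:
            raise KeyError(f"unknown column family {family!r}")
        if row not in self.rows:
            bisect.insort(self.keys, row)
            self.rows[row] = {}
        self.rows[row][column] = value

    def get(self, row):
        # Access is always by row key.
        return self.rows.get(row, {})

    def scan(self, start, stop):
        # Sorted keys make a range scan a contiguous slice [start, stop).
        lo = bisect.bisect_left(self.keys, start)
        hi = bisect.bisect_left(self.keys, stop)
        return [(k, self.rows[k]) for k in self.keys[lo:hi]]


table = TinyTable([b"profile", b"bm"])
table.put(b"emp0042", b"profile:last", b"Smith")
table.put(b"emp0042", b"profile:first", b"John")
table.put(b"emp0042", b"profile:salary", b"$90,000")
table.put(b"emp0042", b"bm:url1", b"")
table.put(b"emp0007", b"profile:last", b"Brown")

row = table.get(b"emp0042")
in_range = [key for key, _ in table.scan(b"emp0000", b"emp0040")]
```

Columns inside a family are created on write, so rows can carry different sets of columns, while the set of families stays fixed.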

HBase vs. RDBMS

Different solution, similar problems
RDBMSes:
  Row-oriented
  Fixed-schema
  ACID
HBase et al.:
  Designed from ground-up to scale out, by adding commodity machines
  Simple consistency scheme: atomic row writes
  Fault tolerance
  Batch processing
  No (real) indexes


Pig introduction

“ ~5 lines of non-boilerplate code ”
Writing a single MapReduce job requires significant gruntwork
  Boilerplates (mapper/reducer, create job, etc)
  Input / output formats
Many tasks require more than one MapReduce job

Pig main features

Data structures (multi-valued, nested)
Pig-latin: data flow language
  SQL-inspired, but imperative (not declarative)

Pig example

Q: “What is the frequency of each first name?”

employees.txt

# LAST    FIRST    SALARY
Smith     John     $90,000
Brown     David    $70,000
Johnson   George   $95,000
Yates     John     $80,000
Miller    Bill     $65,000
Moore     Jack     $85,000
Taylor    Fred     $75,000
Smith     David    $80,000
Harris    John     $90,000
...       ...      ...

records = LOAD filename
    AS (last: chararray, first: chararray, salary:int);
grouped = GROUP records BY first;
counts = FOREACH grouped GENERATE group, COUNT(records.first);
DUMP counts;
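To make the data flow of the script concrete, each Pig Latin statement can be mirrored by one step in plain Python. This is only an analogy for reading the script, not how Pig executes it; the sample rows are the ones shown for employees.txt, with salaries assumed to be stored as plain integers and fields assumed to be tab-separated.

```python
from itertools import groupby

EMPLOYEES = (
    "Smith\tJohn\t90000\n"
    "Brown\tDavid\t70000\n"
    "Johnson\tGeorge\t95000\n"
    "Yates\tJohn\t80000\n"
    "Miller\tBill\t65000\n"
    "Moore\tJack\t85000\n"
    "Taylor\tFred\t75000\n"
    "Smith\tDavid\t80000\n"
    "Harris\tJohn\t90000\n"
)

# records = LOAD ... AS (last, first, salary)
records = [tuple(line.split("\t")) for line in EMPLOYEES.splitlines()]
records = [(last, first, int(salary)) for last, first, salary in records]

# grouped = GROUP records BY first   (groupby needs input sorted by the key)
by_first = sorted(records, key=lambda r: r[1])
grouped = [(first, list(rows))
           for first, rows in groupby(by_first, key=lambda r: r[1])]

# counts = FOREACH grouped GENERATE group, COUNT(records.first)
counts = [(first, len(rows)) for first, rows in grouped]
```

Each intermediate name (`records`, `grouped`, `counts`) holds a whole relation, which is the imperative, step-by-step style that distinguishes Pig Latin from a single declarative SQL query.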

Pig schemas

Schema = tuple data type
Schemas are optional!
  Data-loading step is not required
  “Unknown” schema: similar to AWK ($0, $1, ..)
Support for most common datatypes
Support for nesting

Pig Latin feature summary

Data loading / storing
  LOAD / STORE / DUMP
Filtering
  FILTER / DISTINCT / FOREACH / STREAM
Group-by
  GROUP
Join & co-group
  JOIN / COGROUP / CROSS
Sorting
  ORDER / LIMIT
Combining / splitting
  UNION / SPLIT


Cascading introduction

Provides higher-level abstraction
  Fields, Tuples
  Pipes
  Operations
  Taps, Schemes, Flows
Ease composition of multi-job flows
Library, not a new language

Cascading example

Q: “What is the frequency of each first name?”

employees.txt

# LAST    FIRST    SALARY
Smith     John     $90,000
Brown     David    $70,000
...       ...      ...

Scheme srcScheme = new TextLine();
Tap source = new Hfs(srcScheme, filename);
Scheme dstScheme = new TextLine();
Tap sink = new Hfs(dstScheme, filename, REPLACE);

Pipe assembly = new Pipe("lastnames");

Function splitter = new RegexSplitter(
    new Fields("last", "first", "salary"), "\t");
assembly = new Each(assembly, new Fields("line"), splitter);

assembly = new GroupBy(assembly, new Fields("first"));

Aggregator count = new Count(new Fields("count"));
assembly = new Every(assembly, count);

FlowConnector flowConnector = new FlowConnector();
Flow flow = flowConnector.connect("last-names", source, sink, assembly);
flow.complete();
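The way a pipe assembly is built up from Each, GroupBy and Every stages can be mimicked with ordinary function composition. The sketch below is an analogy in plain Python, not the Cascading API: `each`, `group_by`, `every` and `connect` are hypothetical helpers that loosely correspond to the classes used above, and the input is three in-memory lines instead of an HDFS tap.

```python
def each(function):
    # Like an Each pipe: apply a function to every tuple in the stream.
    return lambda stream: (function(t) for t in stream)


def group_by(field):
    # Like GroupBy: collect tuples under the value of one field.
    def op(stream):
        groups = {}
        for t in stream:
            groups.setdefault(t[field], []).append(t)
        return sorted(groups.items())
    return op


def every(aggregator, key_field, out_field):
    # Like an Every pipe: run an aggregator over each group.
    return lambda groups: ({key_field: key, out_field: aggregator(rows)}
                           for key, rows in groups)


def connect(source, *pipes):
    # Like connecting a flow: feed the source through the stages in order.
    stream = source
    for pipe in pipes:
        stream = pipe(stream)
    return list(stream)


def split_line(line):
    last, first, salary = line.split("\t")
    return {"last": last, "first": first, "salary": salary}


lines = ["Smith\tJohn\t$90,000", "Brown\tDavid\t$70,000", "Yates\tJohn\t$80,000"]
assembly = [each(split_line), group_by("first"), every(len, "first", "count")]
result = connect(lines, *assembly)
```

Because the assembly is just a value built in the host language, stages can be reused and combined with ordinary code, which is the point of the "library, not a new language" design.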

Cascading feature summary

Pipes: transform streams of tuples
  Each
  GroupBy / CoGroup
  Every
  SubAssembly
Operations: what is done to tuples
  Function
  Filter
  Aggregator / Buffer


Hive introduction

Originally developed at Facebook
  Now a Hadoop sub-project
Data warehouse infrastructure
  Execution: MapReduce
  Storage: HDFS files
Large datasets, e.g. Facebook daily logs
  30GB (Jan’08), 200GB (Mar’08), 15+TB (2009)
Hive QL: SQL-like query language

Hive example

Q: “What is the frequency of each first name?”

employees.txt

# LAST    FIRST    SALARY
Smith     John     $90,000
Brown     David    $70,000
Johnson   George   $95,000
Yates     John     $80,000
Miller    Bill     $65,000
Moore     Jack     $85,000
Taylor    Fred     $75,000
Smith     David    $80,000
Harris    John     $90,000
...       ...      ...

CREATE EXTERNAL TABLE records (last STRING, first STRING, salary INT)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
  STORED AS TEXTFILE
  LOCATION filename;

SELECT records.first, COUNT(1)
FROM records
GROUP BY records.first;
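Since the query itself is ordinary SQL, its result can be checked on a laptop with Python's built-in `sqlite3` module. This is only a local stand-in for the SELECT: SQLite is not Hive, the table is filled with INSERTs instead of pointing at an external file, the column names are double-quoted as a precaution, and salaries are assumed to be plain integers.

```python
import sqlite3

rows = [
    ("Smith", "John", 90000), ("Brown", "David", 70000),
    ("Johnson", "George", 95000), ("Yates", "John", 80000),
    ("Miller", "Bill", 65000), ("Moore", "Jack", 85000),
    ("Taylor", "Fred", 75000), ("Smith", "David", 80000),
    ("Harris", "John", 90000),
]

conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE records ("last" TEXT, "first" TEXT, salary INTEGER)')
conn.executemany("INSERT INTO records VALUES (?, ?, ?)", rows)

# Same aggregate as the Hive query; ORDER BY is added only to make the
# output order deterministic for this example.
query = """
    SELECT records."first", COUNT(1)
    FROM records
    GROUP BY records."first"
    ORDER BY records."first"
"""
counts = conn.execute(query).fetchall()
conn.close()
```

On a cluster, Hive would compile the same GROUP BY into a MapReduce job over the files behind the table rather than scanning an in-memory database.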

Hive schemas

Data should belong to tables
  But can also use pre-existing data
  Data loading optional (like Pig) but encouraged
Partitioning columns:
  Mapped to HDFS directories
  E.g., (date, time) → datadir/2009-03-12/18_30_00
Data columns (the rest):
  Stored in HDFS files
Support for most common data types
Support for pluggable serialization
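The split between partitioning columns (encoded in the directory path) and data columns (stored inside the files) can be sketched as follows. The layout mirrors the slide's example path; `partition_dir`, the sample rows and the `user` / `bytes` data columns are made up for illustration, and the directory naming used by an actual Hive installation may differ from this shorthand.

```python
import posixpath
from collections import defaultdict


def partition_dir(base, partition_values):
    # One directory level per partitioning column, as in the slide's example.
    return posixpath.join(base, *partition_values)


rows = [
    {"date": "2009-03-12", "time": "18_30_00", "user": "a", "bytes": 10},
    {"date": "2009-03-12", "time": "18_30_00", "user": "b", "bytes": 20},
    {"date": "2009-03-13", "time": "09_00_00", "user": "a", "bytes": 5},
]

# path -> rows stored in the file(s) under that directory
files = defaultdict(list)
for row in rows:
    path = partition_dir("datadir", (row["date"], row["time"]))
    # Only the data columns go into the file; partition values live in the path.
    files[path].append((row["user"], row["bytes"]))

# A query restricted to one date only has to read the matching directories.
pruned = sorted(p for p in files if p.startswith("datadir/2009-03-12/"))
```

Filtering on a partitioning column therefore becomes a choice of which directories to read, before any file is opened.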

Hive QL feature summary

Basic SQL
  FROM subqueries
  JOIN (only equi-joins)
  Multi GROUP BY
  Multi-table insert
  Sampling
Extensibility
  Pluggable MapReduce scripts
  User Defined Functions
  User Defined Types
  SerDe (serializer / deserializer)



Recap
  Scalable: all
  High(-er) level: all except MR
  Existing language: MR, Cascading
  “Schemas”: HBase, Pig, Hive, (Cascading)
  Pluggable data types: all
  Easy transition: Hive, (Pig)


Related projects
  Higher level (computation):
    Dryad & DryadLINQ (Microsoft) [EuroSys 2007]
    Sawzall (Google) [Sci Prog Journal 2005]
  Higher level (storage):
    Bigtable [OSDI 2006] / Hypertable
  Lower level:
    Kosmos Filesystem (Kosmix)
    VSN (Parascale)
    EC2 / S3 (Amazon)
    Ceph / Lustre / PanFS
  Sector / Sphere (http://sector.sf.net/)
  …


Summary
  MapReduce: Simplified parallel programming model
  Hadoop: Built from the ground up for:
    Scalability
    Fault-tolerance
    Clusters of commodity hardware
  Growing collection of components and extensions (HBase, Pig, Hive, etc.)
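
As a reminder of what the "simplified parallel programming model" amounts to, here is a small single-process sketch in Python of the map, shuffle (group by key), and reduce phases, using word count as the job. It is an illustration only, not Hadoop's API: the names map_fn, reduce_fn, and run_mapreduce and the sample documents are hypothetical, and a real framework would distribute the three phases across machines and handle failures.

```python
from collections import defaultdict


def map_fn(_key, text):
    """Map: emit (word, 1) for every word in one input record."""
    for word in text.lower().split():
        yield word, 1


def reduce_fn(word, counts):
    """Reduce: sum all counts that were grouped under one word."""
    yield word, sum(counts)


def run_mapreduce(records, mapper, reducer):
    """Run mapper over all records, group values by key, then run reducer."""
    groups = defaultdict(list)
    for key, value in records:
        for out_key, out_value in mapper(key, value):
            groups[out_key].append(out_value)  # the "shuffle" step
    output = []
    for out_key in sorted(groups):  # keys are processed in sorted order
        output.extend(reducer(out_key, groups[out_key]))
    return output


# Hypothetical input: (document id, document text) pairs.
docs = [(0, "the quick brown fox"), (1, "the lazy dog and the fox")]
word_counts = dict(run_mapreduce(docs, map_fn, reduce_fn))
```

The user supplies only map_fn and reduce_fn; grouping, ordering, and (in a real system) partitioning and fault tolerance belong to the framework, which is the division of labor the summary refers to.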


Tutorial overview
  Part 1 (Spiros): Basic concepts & tools
    MapReduce & distributed storage
    Hadoop / HBase / Pig / Cascading / Hive
  Part 2 (Jimeng): Algorithms
    Information retrieval
    Graph algorithms
    Clustering (k-means)
    Classification (k-NN, naïve Bayes)
  Part 3 (Rong): Applications
    Text processing
    Data warehousing
    Machine learning

NEXT: Part 2 (Jimeng): Algorithms

