Thinking at Scale with Hadoop
DESCRIPTION
Presentation by Ken Krugler at the SDForum SAM SIG (Software Architecture & Modeling) meeting on Sept 22nd, 2010. This talk provides a brief introduction to Map-Reduce & Hadoop, then discusses challenges of implementing complex data processing using low-level Map-Reduce support, and a number of solutions.
TRANSCRIPT
Copyright (c) 2008 Scale Unlimited, Inc. All Rights Reserved. Reproduction or distribution of this document in any form without prior written permission is forbidden.
Thinking at Scale with Hadoop
Photo by: i_pinz, flickr
Ken Krugler
Bixo Labs & Scale Unlimited
Friday, September 24, 2010
Introduction - Ken Krugler
Founder, Bixo Labs - http://bixolabs.com
Large scale web mining workflows, running in Amazon’s cloud
Instructor, Scale Unlimited - http://scaleunlimited.com
Teaching Hadoop Boot Camp
Founder, Krugle (2005 - 2009)
Code search startup
Parse/index 2.6 billion lines of open source code
Goals for Talk
Intro to Map-Reduce
Overview of Hadoop
Challenges with Map-Reduce
Solutions to Complexity
Introduction to Map-Reduce
Photo by: exfordy, flickr
Map -> The ‘map’ function in the MapReduce algorithm, user defined
Reduce -> The ‘reduce’ function in the MapReduce algorithm, user defined
Group -> The ‘magic’ that happens between Map and Reduce
Key Value Pair -> two units of data, exchanged between Map & Reduce
Definitions
How MapReduce Works
Map translates input keys and values into new keys and values
The system groups each unique key with all of its values
Reduce translates the values of each unique key into new keys and values
Map:    [K1,V1] -> [K2,V2]
Group:  [K2,V2] -> [K2,{V2,V2,....}]
Reduce: [K2,{V2,V2,....}] -> [K3,V3]
All Together
[Diagram: many Mappers feeding a smaller number of Reducers through the Shuffle]
Canonical Example - Word Count
Word Count
Read a document, parse out the words, count the frequency of each word
Specifically, in MapReduce
With a document consisting of lines of text
Translate each line of text into key = word and value = 1
Sum the values (ones) for each unique word
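The three steps above can be sketched as a plain in-memory simulation (illustrative Java only, not the Hadoop API; class and method names are hypothetical):

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class WordCountSketch {

    // Map phase: translate one line of text into [word, 1] pairs.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.split("\\s+")) {
            if (!word.isEmpty()) {
                pairs.add(Map.entry(word, 1));
            }
        }
        return pairs;
    }

    // Group phase: collect all values for each unique key.
    static Map<String, List<Integer>> group(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    // Reduce phase: sum the values (ones) for each unique word.
    static Map<String, Integer> reduce(Map<String, List<Integer>> grouped) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (Map.Entry<String, List<Integer>> entry : grouped.entrySet()) {
            counts.put(entry.getKey(), entry.getValue().stream().mapToInt(Integer::intValue).sum());
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = reduce(group(map("to be or not to be")));
        System.out.println(counts);  // {to=2, be=2, or=1, not=1}
    }
}
```

In real Hadoop the map and reduce functions run on different machines and the group step happens in the shuffle, but the data flow is exactly this.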
Input [K1,V1] - line number, line of text:
[1, When in the Course of human events, it becomes]
[2, necessary for one people to dissolve the political bands]
[3, which have connected them with another, and to assume]

Map (user defined) emits [K2,V2], i.e. [word,1]:
[When,1] [in,1] [the,1] [Course,1] ... [k,1]

Group collects [K2,{V2,V2,....}], i.e. [word,{1,1,...}]:
[When,{1,1,1,1}] [people,{1,1,1,1,1,1}] [dissolve,{1,1,1}] [connected,{1,1,1,1,1}] ... [k,{v,v,...}]

Reduce (user defined) emits [K3,V3], i.e. [word,sum]:
[When,4] [people,6] [dissolve,3] [connected,5] ... [k,sum(v,v,v,..)]
Divide & Conquer
Because
The Map function only cares about the current key and value, and
The Reduce function only cares about the current key and its values
Mappers & Reducers can run Map/Reduce functions independently
Then
A Mapper can invoke Map on an arbitrary number of input keys and values
A Reducer can invoke Reduce on an arbitrary number of the unique keys
Any number of Mappers & Reducers can be run on any number of nodes
Overview of Hadoop
Photo by: exfordy, flickr
Logical Architecture
Logically, Hadoop is simply a computing cluster that provides:
a Storage layer, and
an Execution layer
[Diagram: a cluster providing a Storage layer and an Execution layer]
Logical Architecture
Where users can submit jobs that:
Run on the Execution layer, and
Read/write data from the Storage layer
[Diagram: client users submitting jobs that run on the Execution layer and read/write the Storage layer]
Storage - HDFS
Performance
Optimized for many concurrent serialized reads of large files
But only a single writer (write-once, no append)
Fidelity
Replication via block replication
[Diagram: HDFS blocks replicated across nodes, racks, and geographies]
Execution
Performance
Divide and Conquer
MapReduce
Data Affinity
Move code to data
Fidelity
Applications must choose to fail on bad data, or ignore bad data
[Diagram: map and reduce tasks spread across nodes, racks, and geographies]
Architectural Components
Solid boxes are unique applications
Dashed boxes are child JVM instances on same node as parent
Dotted boxes are blocks of managed files on same node as parent
[Diagram: a Client submits jobs to the JobTracker, which hands tasks to TaskTrackers that spawn mapper and reducer child JVMs; the NameNode (with a Secondary NameNode) handles namespace operations while DataNodes handle block read/writes]
Challenges of Map-Reduce
Photo by: Graham Racher, flickr
Complexity of keys/values
Simple != Simplicity
People think about processing records/objects
The correct key definition changes during a workflow
A “value” is typically a set of fields
Complexity of chaining jobs
Most real-world workflows require multiple Map-Reduce jobs
Connecting data between jobs is tedious
Keeping key/value definitions in sync is error-prone
As an example...
A trivial filter/count/sort workflow
// Counts the pages the users clicked and sorts by number of clicks
private void start(String inputPath, String outputPath, String filterUser) throws Exception {
    // filter and count
    Path outputDir = new Path(outputPath);
    Path parent = outputDir.getParent();
    Path tmpOutput = new Path(parent, "" + System.currentTimeMillis());

    // Create the job configuration
    Configuration conf = new Configuration();
    conf.set(FilterMapper.USER_KEY, filterUser);

    Job filterCountJob = new Job(conf, "filter+count job");
    FileInputFormat.setInputPaths(filterCountJob, new Path(inputPath));
    FileOutputFormat.setOutputPath(filterCountJob, tmpOutput);

    filterCountJob.setMapperClass(FilterMapper.class);
    filterCountJob.setMapOutputKeyClass(Text.class);
    filterCountJob.setMapOutputValueClass(LongWritable.class);

    filterCountJob.setReducerClass(CounterReducer.class);
    filterCountJob.setOutputKeyClass(Text.class);
    filterCountJob.setOutputValueClass(LongWritable.class);
    filterCountJob.setInputFormatClass(TextInputFormat.class);
    filterCountJob.setOutputFormatClass(SequenceFileOutputFormat.class);

    // run job and block until job is done
    filterCountJob.waitForCompletion(false);

    // sort
    Job sortJob = new Job(conf, "sort job");
    FileInputFormat.setInputPaths(sortJob, tmpOutput);
    FileOutputFormat.setOutputPath(sortJob, outputDir);

    sortJob.setMapperClass(SortMapper.class);
    sortJob.setMapOutputKeyClass(UserAndCount.class);
    sortJob.setMapOutputValueClass(NullWritable.class);

    // sorts are only stable with one reducer, but then you lose distribution
    sortJob.setNumReduceTasks(1);

    sortJob.setInputFormatClass(SequenceFileInputFormat.class);
    sortJob.setOutputFormatClass(TextOutputFormat.class);
    // run job and block until job is done
    sortJob.waitForCompletion(false);

    // Get rid of temp directory used to bridge between the two jobs
    FileSystem.get(conf).delete(tmpOutput, true);
}
Not seeing the forest...
Too much low-level detail obscures intent
Not seeing intent leads to bugs
And makes optimization harder
Hard to use MR optimizations
MR => MR{0,1} => M{1,}R{0,1} => M{1,}R{0,1}M{0,}
Combiners can reduce map output size
Map-side joins where one of the sides is “small”
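For an associative operation like the word-count sum, a combiner is just the reduce function applied to each mapper's local output before the shuffle. A minimal illustration (plain Java, not the Hadoop Combiner API) of why this is safe: summing per-mapper partial sums gives the same answer as summing all the raw ones, while shipping far fewer values:

```java
import java.util.List;

public class CombinerSketch {

    // The reduce-side aggregation: sum a list of counts.
    static int sum(List<Integer> values) {
        return values.stream().mapToInt(Integer::intValue).sum();
    }

    public static void main(String[] args) {
        // Two mappers each emitted [word,1] several times for the same word.
        List<Integer> mapper1 = List.of(1, 1, 1);     // 3 values over the wire
        List<Integer> mapper2 = List.of(1, 1, 1, 1);  // 4 values over the wire

        // Without a combiner: ship all 7 ones, reduce sums them.
        int withoutCombiner = sum(List.of(1, 1, 1, 1, 1, 1, 1));

        // With a combiner: each mapper pre-sums locally, ship only 2 values.
        int withCombiner = sum(List.of(sum(mapper1), sum(mapper2)));

        System.out.println(withoutCombiner == withCombiner);  // true: both are 7
    }
}
```

This only works when the operation is associative and commutative, which is why Hadoop makes the combiner an opt-in.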
Solutions to Complexity
Cascading - Tuples, Pipes, and Operations
FlumeJava - Java Classes and Deferred Execution
Pig/Hive - “It’s just a big database”
Datameer - “It’s just a big spreadsheet”
Cascading
Photo by: Graham Racher, flickr
Cascading
An API for defining data processing workflows
First public release Jan 2008
Currently 1.1.1 and supports Hadoop 0.19.x through 0.20.2
http://www.cascading.org
Cascading
Basic unit of data is “Tuple” - roughly equal to a record
Implements all standard data processing operations
function, filter, group by, co-group, aggregator, etc
Complex applications are built by chaining operations with pipes
And are planned into the minimum number of MapReduce jobs
Operations on Tuples
Map and Reduce communicate through Keys and Values
Our problems are really about elements of Keys and Values
e.g. “username”, not “line”
The effect is more reusability of our ‘map’ and ‘reduce’ operations
Mapping onto Hadoop
Functions & Filters are map operations
Aggregates & Buffers are reduce operations
Built-in support for joining, multi-field sorting
Concept of “Taps” with “Schemes” as input and output formats
Loose Coupling
[Diagram: Source -> func -> func -> Group -> aggr -> Sink]
Above, all the possible relationships between operations and data
If Fields are the interface between operations, we can build flexible apps
Layered Over MapReduce
[Diagram: a flow of Sources, funcs, a CoGroup, a GroupBy, aggrs, a temp tap, and a Sink, planned onto alternating Map and Reduce phases]
Cascading Code
Tap input = new Hfs( new TextLine( new Fields( "line" ) ), inputPath );
Tap output = new Hfs( new TextLine( 1 ), outputPath, SinkMode.REPLACE );

Pipe pipe = new Pipe( "example" );

RegexSplitGenerator function = new RegexSplitGenerator( new Fields( "word" ), "\\s+" );
pipe = new Each( pipe, new Fields( "line" ), function );

pipe = new GroupBy( pipe, new Fields( "word" ) );

pipe = new Every( pipe, new Fields( "word" ), new Count( new Fields( "count" ) ) );
pipe = new Every( pipe, new Fields( "word" ), new Average( new Fields( "avg" ) ) );

pipe = new GroupBy( pipe, new Fields( "count", -1 ) );

Properties properties = new Properties();
FlowConnector.setApplicationJarClass( properties, Main.class );
Flow flow = new FlowConnector( properties ).connect( "example flow", input, output, pipe );
flow.complete();
[Diagram: Cascading's planner output for a log-processing flow - an 'arrival rate' assembly parses 'ts' (seconds) and 'tm' (minutes) from SequenceFile logs with a DateParser and an ExpressionFunction, GroupBy('tsCount') and GroupBy('tmCount') group them, Every(Count) counts them, and results land in TextLine taps at output/arrivalrate/sec and output/arrivalrate/min]
[Diagram: planner output for a much larger flow - many Each/CoGroup/GroupBy/Every steps stitched together with TempHfs SequenceFile taps between the generated Map-Reduce jobs]
FlumeJava
Photo by: Graham Racher, flickr
FlumeJava
Recent paper (June 2010) from Google - Chambers et al.
Internal project to make it easier to write Map-Reduce jobs
Equivalent open source project just started
Plume - http://github.com/tdunning/Plume
Java Classes, Deferred Exec
Java-centric view of map-reduce processing
Handful of classes to represent immutable, parallel collections
PCollection, PTable
Small number of primitive operations on classes
parallelDo, groupByKey, combineValues, flatten
Deferred execution allows construction/optimization of workflow
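The key idea, deferred execution, can be sketched with a toy class of my own (loosely in the spirit of FlumeJava's PCollection; this is not the real FlumeJava or Plume API): each operation only records a step in the pipeline, and nothing runs until the whole plan is known - which is what lets a real planner fuse steps into a minimal set of Map-Reduce jobs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

public class DeferredSketch {

    // A toy deferred parallel collection (hypothetical, for illustration only).
    static class PCollection<T> {
        private final List<?> source;
        private final List<Function<Object, Object>> pipeline;

        private PCollection(List<?> source, List<Function<Object, Object>> pipeline) {
            this.source = source;
            this.pipeline = pipeline;
        }

        static <T> PCollection<T> of(List<T> data) {
            return new PCollection<>(data, new ArrayList<>());
        }

        // Deferred: just append the step to the plan, don't execute anything.
        @SuppressWarnings("unchecked")
        <R> PCollection<R> parallelDo(Function<T, R> fn) {
            List<Function<Object, Object>> next = new ArrayList<>(pipeline);
            next.add((Function<Object, Object>) fn);
            return new PCollection<>(source, next);
        }

        // Only here do the recorded steps actually run (a planner could
        // inspect and optimize the pipeline first).
        @SuppressWarnings("unchecked")
        List<T> run() {
            List<Object> current = new ArrayList<>(source);
            for (Function<Object, Object> step : pipeline) {
                List<Object> out = new ArrayList<>();
                for (Object item : current) {
                    out.add(step.apply(item));
                }
                current = out;
            }
            return (List<T>) current;
        }
    }

    public static void main(String[] args) {
        List<Integer> result = PCollection.of(List.of(1, 2, 3))
                .parallelDo((Integer x) -> x * 2)
                .parallelDo((Integer x) -> x + 1)
                .run();
        System.out.println(result);  // [3, 5, 7]
    }
}
```

FlumeJava's real planner goes much further (fusing parallelDo chains, placing groupByKey boundaries), but the build-then-run split is the enabling mechanism.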
FlumeJava Code
PCollection<String> lines = readTextFileCollection("/gfs/data/shakes/hamlet.txt");

PCollection<String> words = lines.parallelDo(new DoFn<String,String>() {
  void process(String line, EmitFn<String> emitFn) {
    for (String word : splitIntoWords(line)) {
      emitFn.emit(word);
    }
  }
}, collectionOf(strings()));
Results of Use at Google
[Chart: lines of code vs. number of Map-Reduce jobs, from FlumeJava's use at Google]
Pig & Hive
Photo by: Tambako the Jaguar, flickr
Pig
Pig is a data flow ‘programming environment’ for processing large files
PigLatin is the text language it provides for defining data flows
Developed by researchers at Yahoo! and widely used internally
A sub-project of Hadoop
APL licensed
http://hadoop.apache.org/pig/
First incubator release v0.1.0 in Sept 2008
Pig
Provides two kinds of optimizations:
Physical - caching and stream optimizations at the MapReduce level
Algebraic - mathematical reductions at the syntax level
Allows for User Defined Functions (UDF)
must register jar of functions via register path/some.jar
can be called directly from PigLatin
B = foreach A generate myfunc.MyEvalFunc($0);
PigLatin - Example
W = LOAD 'filename' AS (url, outlink);
G = GROUP W BY url;
R = FOREACH G {
    FW = FILTER W BY outlink eq 'www.apache.org';
    PW = FW.outlink;
    DW = DISTINCT PW;
    GENERATE group, COUNT(DW);
}
Hive
Hive is a data warehouse built on top of flat files
Developed by Facebook
Available as a Hadoop contribution in v0.19.0
http://wiki.apache.org/hadoop/Hive
Hive
Data Organization into Tables with logical and hash partitioning
A Metastore to store metadata about Tables/Partitions/Buckets
Partitions are ‘how’ the data is stored (collections of related rows)
Buckets further divide Partitions based on a hash of a column
metadata is actually stored in a RDBMS
A SQL-like query language over object data stored in Tables
Custom types, structs of primitive types
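The bucketing idea above - deterministically assigning a row to one of N buckets from the hash of a column - can be illustrated in plain Java (the exact hash function Hive uses is not shown here; this is just the mechanism):

```java
public class BucketSketch {

    // Assign a row to one of numBuckets based on the hash of a column value.
    // (Illustrative only; Hive's actual hashing may differ.)
    static int bucketFor(String columnValue, int numBuckets) {
        // Mask off the sign bit so the result is always in [0, numBuckets).
        return (columnValue.hashCode() & Integer.MAX_VALUE) % numBuckets;
    }

    public static void main(String[] args) {
        int numBuckets = 32;
        // The same userid always lands in the same bucket, so a query that
        // samples or joins on that column can read just one bucket's file
        // instead of scanning the whole partition.
        System.out.println(bucketFor("user_12345", numBuckets)
                == bucketFor("user_12345", numBuckets));  // true
    }
}
```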
Hive - Queries
Queries always write results into a table
FROM user
INSERT OVERWRITE TABLE user_active
SELECT user.*
WHERE user.active = true;

FROM pv_users
INSERT OVERWRITE TABLE pv_gender_sum
SELECT pv_users.gender, count_distinct(pv_users.userid)
GROUP BY pv_users.gender
INSERT OVERWRITE DIRECTORY '/user/facebook/tmp/pv_age_sum.txt'
SELECT pv_users.age, count_distinct(pv_users.userid)
GROUP BY pv_users.age;
Datameer Analytics Solution (DAS)
Photo by: Tambako the Jaguar, flickr
DAS - Big Spreadsheets
Commercial solution from Datameer
Similar to IBM BigSheets
GUI + workflow editor on top of Hadoop
Define data sources, operations (via spreadsheet), and data sinks
http://www.datameer.com
Summary
Photo by: Tambako the Jaguar, flickr
Copyright (c) 2008 Scale Unlimited, Inc. All Rights Reserved. Reproduction or distribution of this document in any form without prior written permission is forbidden.
Thinking at Scale with Hadoop
Understand the underlying Map-Reduce architecture
Model the problem as a workflow with operations on records
Consider using Cascading or (eventually) Plume
Use Pig/Hive to solve query-centric problems
Look at DAS or BigSheets for easier analytics solutions
Resources
Hadoop mailing lists - many, e.g. [email protected]
Hadoop: The Definitive Guide by Tom White (O’Reilly)
Hadoop Training
Scale Unlimited: http://scaleunlimited.com/courses
Cloudera: http://www.cloudera.com/hadoop-training/
Cascading: http://www.cascading.org
DAS: http://www.datameer.com
Questions?