MapReduce Design Patterns
Anastasiia Kornilova, SoftServe Data Science Group
MapReduce Components
❖ record reader
❖ map
❖ combiner
❖ partitioner
❖ shuffle and sort
❖ reduce
❖ output format
[Diagram: Reader -> Mapper -> Combiner -> Partitioner -> Shuffle and sort -> Reducer -> Output]
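The components listed above can be sketched as a toy, single-process word count. This is only an illustration in plain Python, not Hadoop code: every function name here (`record_reader`, `map_fn`, `combine`, `partition`, `reduce_fn`, `run_job`) is a hypothetical stand-in for the corresponding component.

```python
from collections import defaultdict

def record_reader(text):
    """Record reader: turn raw input into (offset, line) records."""
    for offset, line in enumerate(text.splitlines()):
        yield offset, line

def map_fn(_key, line):
    """Map: emit (word, 1) for every word in the line."""
    for word in line.lower().split():
        yield word, 1

def combine(pairs):
    """Combiner: pre-aggregate one mapper's output before it is shuffled."""
    local = defaultdict(int)
    for key, value in pairs:
        local[key] += value
    return list(local.items())

def partition(key, num_reducers):
    """Partitioner: pick a reducer for a key (deterministic toy hash)."""
    return sum(key.encode("utf-8")) % num_reducers

def reduce_fn(key, values):
    """Reduce: sum all partial counts for one key."""
    return key, sum(values)

def run_job(splits, num_reducers=2):
    # One bucket of key -> [values] per reducer.
    buckets = [defaultdict(list) for _ in range(num_reducers)]
    for split in splits:  # one mapper per input split
        mapped = [kv for k, line in record_reader(split) for kv in map_fn(k, line)]
        for key, value in combine(mapped):
            buckets[partition(key, num_reducers)][key].append(value)
    output = []
    for bucket in buckets:
        # Shuffle and sort: each reducer sees its keys in sorted order.
        for key in sorted(bucket):
            output.append(reduce_fn(key, bucket[key]))
    return output  # output format: here simply a list of (word, count) pairs
```

For example, `dict(run_job(["a b a", "b c"]))` gives the total count per word across both splits.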
MapReduce Patterns
❖ Filtering Patterns
❖ Summarization Patterns
❖ Join Patterns
❖ Data Organization Patterns
❖ Metapatterns
❖ Input and Output Patterns
Filtering patterns
❖ Filtering
❖ Bloom filtering
❖ Top-N
❖ Distinct
Filtering
❖ Closer view of data
❖ Tracking a thread of events
❖ Distributed grep
❖ Data cleansing
❖ Simple random sampling
❖ Removing low scoring data
[Diagram: three input splits, each read by its own Filter Mapper, each mapper writing its own output file]
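A minimal sketch of the filtering pattern, using distributed grep as the example. It assumes each split is a list of text lines; the function names and the sample log lines are hypothetical. There is no reduce step: each mapper writes its own output.

```python
import re

def filter_mapper(records, keep):
    """Emit only the records for which the predicate holds."""
    for record in records:
        if keep(record):
            yield record

def run_filter_job(splits, keep):
    """Map-only job: one output list per input split, no reducer."""
    return [list(filter_mapper(split, keep)) for split in splits]

# Distributed grep: keep the lines that match a regular expression.
pattern = re.compile(r"ERROR")
splits = [
    ["INFO start", "ERROR disk full"],
    ["ERROR timeout", "INFO done"],
    ["INFO idle"],
]
outputs = run_filter_job(splits, lambda line: pattern.search(line) is not None)
```

Swapping the predicate gives the other use cases, for instance `lambda r: random.random() < 0.01` for simple random sampling.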
Bloom filtering
❖ Removing most of the non-watched values
❖ Prefiltering a data set for an expensive set membership check
• Probabilistic data structure
• Membership is tested by comparing hash function results
• Answer: "probably yes" or "no"
Step 1 - Filter Training
[Diagram: input split -> Bloom Filter Training -> output file]
Step 2 - Bloom Filtering via MapReduce
[Diagram: input split -> Bloom Filter Mapper (loads the filter from the distributed cache) -> Bloom Filter Test: "maybe" -> output file, "no" -> discarded]
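The two steps above can be sketched as follows. This is a toy Bloom filter, assuming SHA-256 with a seed prefix as the family of hash functions and arbitrary sizes (1024 bits, 3 hashes). The class and function names are hypothetical, and the filter is passed as an argument instead of being loaded from a distributed cache.

```python
import hashlib

class BloomFilter:
    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = [False] * num_bits

    def _positions(self, item):
        # Derive num_hashes bit positions by hashing the item with different seeds.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = True

    def might_contain(self, item):
        # False means "definitely not present"; True means "maybe present".
        return all(self.bits[pos] for pos in self._positions(item))

def train_filter(watched_values):
    """Step 1: build the filter from the set of watched values."""
    bloom = BloomFilter()
    for value in watched_values:
        bloom.add(value)
    return bloom

def bloom_filter_mapper(records, bloom):
    """Step 2: keep (key, value) records whose key passes the filter test."""
    for key, value in records:
        if bloom.might_contain(key):
            yield key, value  # "maybe": written to the output
        # "no": the record is discarded
```

Records that pass may still be false positives, so an exact membership check is needed afterwards if the result must be precise. Values that were added are never rejected.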
Top N
❖ Outlier analysis
❖ Select interesting data
❖ Catchy dashboards
[Diagram: each input split -> Top Ten Mapper -> local top 10; all local top 10 lists -> a single Top Ten Reducer -> final top 10 -> output]
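A minimal sketch of the local/final top-N idea, assuming records are `(id, score)` tuples ranked by score. The function names are hypothetical.

```python
import heapq

def top_n_mapper(records, n=10):
    """Each mapper keeps only the n best records of its own split."""
    return heapq.nlargest(n, records, key=lambda r: r[1])

def top_n_reducer(local_tops, n=10):
    """A single reducer merges the local lists into the final top n."""
    merged = [record for local in local_tops for record in local]
    return heapq.nlargest(n, merged, key=lambda r: r[1])

def run_top_n(splits, n=10):
    return top_n_reducer([top_n_mapper(split, n) for split in splits], n)
```

With this scheme the reducer receives at most n records per mapper, which keeps the single-reducer step cheap.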
Distinct
❖ Deduplicate data
❖ Getting distinct values
❖ Protecting from inner join explosions
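One common way to express distinct, sketched here under the assumption that whole records are hashable values: the mapper emits the record itself as the key with an empty value, and the reducer emits each key once. All names are hypothetical.

```python
from collections import defaultdict

def distinct_mapper(records):
    """Emit the record as the key; the value carries no information."""
    for record in records:
        yield record, None

def distinct_reducer(key, _values):
    """All duplicates arrive under one key, so emit the key once."""
    return key

def run_distinct(splits):
    groups = defaultdict(list)  # stands in for shuffle and sort
    for split in splits:
        for key, value in distinct_mapper(split):
            groups[key].append(value)
    return [distinct_reducer(key, groups[key]) for key in sorted(groups)]
```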
Summarization patterns
❖ Numerical summarization
❖ Inverted index
❖ Counting with counters
Numerical summarization
❖ Word count
❖ Record count
❖ Min/Max/Count
❖ Average/Median/Standard deviation
[Diagram: three Mappers emit (key, summary field) pairs -> Partitioners -> two Reducers, which output (group B, summary) and (group D, summary)]
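A minimal sketch of min/max/count/average per group, assuming records are `(key, number)` pairs. The average is carried as a (sum, count) pair so that partial results can be merged safely. All names are hypothetical.

```python
from collections import defaultdict

def summary_mapper(records):
    """Emit (key, partial summary) with the partial as (min, max, count, sum)."""
    for key, value in records:
        yield key, (value, value, 1, value)

def merge(a, b):
    """Merge two partial summaries; usable in a combiner as well as a reducer."""
    return (min(a[0], b[0]), max(a[1], b[1]), a[2] + b[2], a[3] + b[3])

def summary_reducer(key, partials):
    total = partials[0]
    for partial in partials[1:]:
        total = merge(total, partial)
    low, high, count, value_sum = total
    return key, {"min": low, "max": high, "count": count, "avg": value_sum / count}

def run_summary(splits):
    groups = defaultdict(list)  # stands in for partitioning and shuffle
    for split in splits:
        for key, partial in summary_mapper(split):
            groups[key].append(partial)
    return dict(summary_reducer(key, groups[key]) for key in sorted(groups))
```

Median does not merge this way, since it needs all values of a group (or a value-to-count map) at the reducer.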
Inverted index
[Diagram: three Mappers emit (keyword, unique ID) pairs -> Partitioners -> two Reducers, which output (keyword A, list of IDs) and (keyword D, list of IDs)]
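A minimal sketch of building an inverted index, assuming documents are given as a dict of ID to text and that keywords are whitespace-separated, lowercased words. The function names are hypothetical.

```python
from collections import defaultdict

def index_mapper(doc_id, text):
    """Emit (keyword, unique ID) once per distinct keyword in the document."""
    for keyword in set(text.lower().split()):
        yield keyword, doc_id

def index_reducer(keyword, doc_ids):
    """Collect all IDs for one keyword into a sorted list."""
    return keyword, sorted(set(doc_ids))

def build_index(documents):
    groups = defaultdict(list)  # stands in for partitioning and shuffle
    for doc_id, text in documents.items():
        for keyword, emitted_id in index_mapper(doc_id, text):
            groups[keyword].append(emitted_id)
    return dict(index_reducer(keyword, groups[keyword]) for keyword in sorted(groups))
```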
Data Organization Patterns
❖ Structured to Hierarchical
❖ Partitioning
❖ Binning
❖ Total Order Sorting
❖ Shuffling
Join patterns
❖ Reduce Side Join
❖ Replicated Join
❖ Composite Join
❖ Cartesian Product
[Diagram: Reduce Side Join. Input splits of Data Set A and Data Set B -> Join Mappers emit (key, values A) and (key, values B) -> shuffle and sort -> Join Reducers -> output parts]
Node table: id, title, tagnames, authorized, body, node type, parent id, abs parent id, added at, score, state string, last edited id, last activity id, last activity at, activity revision, extra, extra def, extra count, marked
User table: user id, reputation, gold, silver, bronze
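A minimal sketch of an inner reduce side join, assuming both data sets are lists of dicts that share a join field. The mapper tags every record with its source so the reducer can tell the two sides apart. The function names and the sample field `user_id` are hypothetical.

```python
from collections import defaultdict

def join_mapper(records, tag, key_field):
    """Emit (join key, (source tag, record)) for every record."""
    for record in records:
        yield record[key_field], (tag, record)

def join_reducer(_key, tagged_values):
    """Inner join: pair every A record with every B record of the same key."""
    left = [record for tag, record in tagged_values if tag == "A"]
    right = [record for tag, record in tagged_values if tag == "B"]
    for a in left:
        for b in right:
            yield {**a, **b}

def reduce_side_join(data_set_a, data_set_b, key_field):
    groups = defaultdict(list)  # stands in for shuffle and sort
    for tag, data in (("A", data_set_a), ("B", data_set_b)):
        for key, value in join_mapper(data, tag, key_field):
            groups[key].append(value)
    joined = []
    for key in sorted(groups):
        joined.extend(join_reducer(key, groups[key]))
    return joined
```

Keys present in only one data set produce no output here. An outer join would instead emit the unmatched records with the missing side left empty.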
Pig examples
-- Inner Join:
A = JOIN comments BY userID, users BY userID;
-- Outer Join (choose one of LEFT, RIGHT, FULL):
A = JOIN comments BY userID [LEFT | RIGHT | FULL] OUTER, users BY userID;
-- Binning:
SPLIT data INTO
    eights IF col1 == 8,
    bigs IF col1 > 8,
    smalls IF (col1 < 8 AND col1 > 0);
-- Top Ten:
B = ORDER A BY col4 DESC;
C = LIMIT B 10;
-- Filtering:
b = FILTER a BY value < 3;