MapReduce Design Patterns

Anastasiia Kornilova, SoftServe Data Science Group

MapReduce Components

❖ record reader

❖ map

❖ combiner

❖ partitioner

❖ shuffle and sort

❖ reduce

❖ output format

[Diagram: Reader → Mapper → Combiner → Partitioner → Shuffle and sort → Reducer → Output]
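To make the order of the components concrete, here is a minimal single-process sketch in Python. It is not Hadoop code: `run_job`, `wc_map`, and `wc_reduce` are hypothetical names chosen for illustration, and the combiner is applied once to all map output, whereas a real framework would run it separately on each map task.

```python
from collections import defaultdict

def run_job(records, mapper, reducer, combiner=None, num_partitions=2):
    """Toy walk through the MapReduce stages in one process (assumption: not Hadoop)."""
    # Record reader + map: each input record yields zero or more (key, value) pairs.
    mapped = [kv for record in records for kv in mapper(record)]

    # Combiner: optional local pre-aggregation of the map output.
    if combiner is not None:
        local = defaultdict(list)
        for key, value in mapped:
            local[key].append(value)
        mapped = [kv for key, values in local.items() for kv in combiner(key, values)]

    # Partitioner: decide which reducer receives each key.
    partitions = [defaultdict(list) for _ in range(num_partitions)]
    for key, value in mapped:
        partitions[hash(key) % num_partitions][key].append(value)

    # Shuffle and sort + reduce: each reducer sees its keys in sorted order.
    output = []
    for partition in partitions:
        for key in sorted(partition):
            output.extend(reducer(key, partition[key]))

    # Output format: here simply a list of (key, value) pairs.
    return output

def wc_map(line):
    # Word count mapper: one (word, 1) pair per word.
    for word in line.split():
        yield word, 1

def wc_reduce(word, counts):
    # Word count reducer; summing is associative, so it also serves as the combiner.
    yield word, sum(counts)

result = dict(run_job(["a b a", "b c"], wc_map, wc_reduce, combiner=wc_reduce))
```

The reducer can double as the combiner here only because summing partial sums gives the same total as summing the raw counts.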

MapReduce Patterns

❖ Filtering Patterns

❖ Summarization Patterns

❖ Join Patterns

❖ Data Organization Patterns

❖ Metapatterns

❖ Input and Output Patterns

Filtering patterns

❖ Filtering

❖ Bloom filtering

❖ Top-N

❖ Distinct

❖ Closer view of data

❖ Tracking a thread of events

❖ Distributed grep

❖ Data cleansing

❖ Simple random sampling

❖ Removing low scoring data

Filtering

[Diagram: Input split → Filter Mapper → Output file, shown for three parallel tasks]
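The filtering flow, in which each input split passes through its own filter mapper and lands in its own output file, might be sketched as follows. `filter_mapper` and the sample log lines are hypothetical; distributed grep is shown as one possible predicate.

```python
import re

def filter_mapper(records, predicate):
    """Map-only filter: emit a record unchanged if it passes, otherwise drop it."""
    for record in records:
        if predicate(record):
            yield record

# Distributed grep as a special case: keep lines matching a regular expression.
pattern = re.compile(r"ERROR")
split = ["INFO start", "ERROR disk full", "INFO done", "ERROR timeout"]  # made-up input split
kept = list(filter_mapper(split, lambda line: pattern.search(line) is not None))
```

Because every record is judged on its own, no reducer is needed and the splits can be processed independently.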

Bloom filtering

❖ Removing most of the non-watched values

❖ Prefiltering a data set for an expensive set membership check

• Probabilistic data structure

• Membership test compares hash function outputs

• Answer: "probably yes" or "definitely no"

Step 1 - Filter Training

[Diagram: Input split → Bloom Filter Training → Output file]

Step 2 - Bloom Filtering via MapReduce

[Diagram: Input split → Bloom Filter Mapper (load filter from distributed cache) → Bloom Filter Test → Maybe: Output file / No: Discarded]
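The two steps, training the filter and then testing each record against it in the mapper, could look roughly like this. It is a sketch under assumptions: the `BloomFilter` class, its default sizes, and the salted SHA-256 hashing scheme are illustrative choices, not taken from the slides or from Hadoop's own Bloom filter class.

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter (assumed sizes; one byte per bit for readability)."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)

    def _positions(self, item):
        # Derive num_hashes bit positions by salting a single hash function.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item):
        # True means "maybe present"; False means "definitely absent".
        return all(self.bits[pos] for pos in self._positions(item))

def train_filter(watched_values):
    """Step 1: build the filter from the set of watched values."""
    bloom = BloomFilter()
    for value in watched_values:
        bloom.add(value)
    return bloom

def bloom_filter_mapper(records, bloom, key_of):
    """Step 2: keep records whose key may be in the watched set, discard the rest."""
    for record in records:
        if bloom.might_contain(key_of(record)):
            yield record

hot_users = train_filter(["u1", "u7"])                      # hypothetical watched user IDs
comments = [("u1", "hi"), ("u2", "spam"), ("u7", "ok")]     # hypothetical records
passed = list(bloom_filter_mapper(comments, hot_users, key_of=lambda r: r[0]))
```

Records for watched keys always pass; a small fraction of unwatched keys may pass as false positives, which is why the pattern is suited to prefiltering before an exact membership check.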


Top N

❖ Outlier analysis

❖ Select interesting data

❖ Catchy dashboards

[Diagram: each Input split → Top Ten Mapper → local top 10; all local top 10 lists → single Top Ten Reducer → final top 10 → Output]
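The local-then-final structure of Top N can be sketched with Python's `heapq.nlargest`. The function names and the sample numbers are hypothetical, and `n` is set to 2 instead of 10 to keep the example short.

```python
import heapq

def top_n_mapper(records, n, score_of):
    """Return the local top n records of one input split."""
    return heapq.nlargest(n, records, key=score_of)

def top_n_reducer(local_tops, n, score_of):
    """Merge every mapper's local top n and return the final top n."""
    merged = [record for local in local_tops for record in local]
    return heapq.nlargest(n, merged, key=score_of)

splits = [[5, 1, 9], [7, 3], [8, 2, 6]]      # made-up scores in three input splits
local = [top_n_mapper(s, 2, score_of=lambda x: x) for s in splits]
final = top_n_reducer(local, 2, score_of=lambda x: x)
```

Each mapper sends at most `n` records to the reducer, so a single reducer only has to handle `n` records per mapper no matter how large the input is.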

Distinct

❖ Deduplicate data

❖ Getting distinct values

❖ Protecting from inner join explosions
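One common way to express Distinct is to emit each record as a key with an empty value and let the grouping step collapse duplicates. The sketch below assumes that approach; `run_distinct` is a hypothetical driver in which sorting stands in for shuffle and sort.

```python
from itertools import groupby

def distinct_mapper(records):
    # Emit the record itself as the key; the value carries no information.
    for record in records:
        yield record, None

def distinct_reducer(key, values):
    # Each key reaches the reducer exactly once, so emit it once and ignore the values.
    yield key

def run_distinct(splits):
    pairs = [kv for split in splits for kv in distinct_mapper(split)]
    pairs.sort(key=lambda kv: kv[0])          # stands in for shuffle and sort
    output = []
    for key, group in groupby(pairs, key=lambda kv: kv[0]):
        output.extend(distinct_reducer(key, (value for _, value in group)))
    return output

unique = run_distinct([["b", "a", "b"], ["c", "a"]])
```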

Summarization patterns

❖ Numerical summarization

❖ Inverted index

❖ Counting with counters

Numerical summarization

❖ Word count

❖ Record count

❖ Min/Max/Count

❖ Average/Median/Standard deviation

[Diagram: three Mappers emit (key, summary field) → Partitioners → two Reducers emit (group B, summary), (group D, summary)]
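As an illustration of the (key, summary field) flow, the following sketch computes min, max, count, and mean per group. The function names and sample records are hypothetical. The mean is carried as a (count, total) pair so that partial summaries can be merged in any order.

```python
from collections import defaultdict

def summarize_mapper(records):
    # Emit (key, (min, max, count, total)) for each single value.
    for key, value in records:
        yield key, (value, value, 1, value)

def merge(a, b):
    # Combine two partial summaries; associative, so it could also act as a combiner.
    return (min(a[0], b[0]), max(a[1], b[1]), a[2] + b[2], a[3] + b[3])

def summarize_reducer(key, summaries):
    iterator = iter(summaries)
    acc = next(iterator)
    for summary in iterator:
        acc = merge(acc, summary)
    lo, hi, count, total = acc
    return key, {"min": lo, "max": hi, "count": count, "mean": total / count}

def run_summary(records):
    grouped = defaultdict(list)               # stands in for partition + shuffle and sort
    for key, summary in summarize_mapper(records):
        grouped[key].append(summary)
    return dict(summarize_reducer(k, v) for k, v in sorted(grouped.items()))

stats = run_summary([("A", 4), ("B", 10), ("A", 6), ("B", 2), ("A", 2)])
```

Median does not reduce to a fixed-size mergeable summary like this, so it generally needs the reducer to see all values of a group or a different encoding.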






Inverted index

[Diagram: three Mappers emit (keyword, unique ID) → Partitioners → two Reducers emit (keyword A, list of IDs), (keyword D, list of IDs)]
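The (keyword, unique ID) to (keyword, list of IDs) flow might be sketched as below. Whitespace tokenization, lowercasing, and the sample documents are assumptions made for the example.

```python
from collections import defaultdict

def index_mapper(doc_id, text):
    # Emit (keyword, unique ID) once per distinct word in the document.
    for word in set(text.lower().split()):
        yield word, doc_id

def index_reducer(keyword, doc_ids):
    # Collect all IDs for one keyword into a sorted, duplicate-free list.
    return keyword, sorted(set(doc_ids))

def build_index(docs):
    grouped = defaultdict(list)               # stands in for partition + shuffle and sort
    for doc_id, text in docs.items():
        for keyword, emitted_id in index_mapper(doc_id, text):
            grouped[keyword].append(emitted_id)
    return dict(index_reducer(k, ids) for k, ids in grouped.items())

index = build_index({1: "map reduce", 2: "reduce side join", 3: "map side join"})
```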

Data Organization Patterns

❖ Structured to Hierarchical

❖ Partitioning

❖ Binning

❖ Total Order Sorting

❖ Shuffling

Join patterns

❖ Reduce Side Join

❖ Replicated Join

❖ Composite Join

❖ Cartesian Product

[Diagram: input splits of Data Set A and Data Set B → Join Mappers emit (key, values A) or (key, values B) → Shuffle and sort → Join Reducers → Output parts]
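A reduce side join can be sketched by tagging each record with its source data set in the mapper and pairing the two sides in the reducer. The sketch below performs an inner join; the function names and sample tuples are hypothetical.

```python
from collections import defaultdict

def join_mapper(records, tag, key_of):
    # Emit (join key, (source tag, record)) so the reducer can tell the sides apart.
    for record in records:
        yield key_of(record), (tag, record)

def join_reducer(key, tagged):
    # Inner join: pair every A record with every B record that shares the key.
    left = [record for tag, record in tagged if tag == "A"]
    right = [record for tag, record in tagged if tag == "B"]
    for a in left:
        for b in right:
            yield key, a, b

def reduce_side_join(data_a, data_b, key_a, key_b):
    grouped = defaultdict(list)               # stands in for shuffle and sort
    for key, value in join_mapper(data_a, "A", key_a):
        grouped[key].append(value)
    for key, value in join_mapper(data_b, "B", key_b):
        grouped[key].append(value)
    output = []
    for key in sorted(grouped):
        output.extend(join_reducer(key, grouped[key]))
    return output

comments = [(1, "hi"), (1, "bye"), (3, "orphan")]   # (user id, text), made up
users = [(1, "ann"), (2, "bob")]                    # (user id, name), made up
joined = reduce_side_join(comments, users, key_a=lambda r: r[0], key_b=lambda r: r[0])
```

Emitting unmatched records from `left` or `right` instead of dropping them would turn the same reducer into a left, right, or full outer join.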

Node table: id, title, tagnames, authorized, body, node type, parent id, abs parent id, added at, score, state string, last edited id, last activity id, last activity at, activity revision, extra, extra def, extra count, marked

User table: user id, reputation, gold, silver, bronze

Pig examples

-- Inner Join:
A = JOIN comments BY userID, users BY userID;

-- Outer Join:
A = JOIN comments BY userID [LEFT | RIGHT | FULL] OUTER, users BY userID;

-- Binning:
SPLIT data INTO
    eights IF col1 == 8,
    bigs IF col1 > 8,
    smalls IF (col1 < 8 AND col1 > 0);

-- Top Ten:
B = ORDER A BY col4 DESC;
C = LIMIT B 10;

-- Filtering:
b = FILTER a BY value < 3;
