MapReduce Design Patterns Anastasiia Kornilova, SoftServe Data Science Group


Post on 10-Jun-2015


MapReduce Components

❖ record reader

❖ map

❖ combiner

❖ partitioner

❖ shuffle and sort

❖ reduce

❖ output format

Diagram: Reader -> Mapper -> Combiner -> Partitioner -> Shuffle and sort -> Reducer -> Output
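The components listed above can be sketched as one small in-memory pipeline. This is only an illustration under simplifying assumptions: `run_job`, `wc_map` and `wc_reduce` are hypothetical names, everything runs in a single process, and the combiner is applied once to all map output rather than once per map task as it would be on a cluster.

```python
from collections import defaultdict


def _group_and_apply(pairs, func):
    """Group (key, value) pairs by key, then call func(key, values) in key order."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return [out for key in sorted(groups) for out in func(key, groups[key])]


def run_job(lines, mapper, reducer, combiner=None, num_reducers=2):
    """Simulate one MapReduce job in memory, one stage per component."""
    # Record reader: turn raw input into (offset, line) records.
    records = enumerate(lines)

    # Map: each record yields zero or more (key, value) pairs.
    mapped = [pair for offset, line in records for pair in mapper(offset, line)]

    # Combiner (optional): pre-aggregate map output to reduce shuffle traffic.
    if combiner is not None:
        mapped = _group_and_apply(mapped, combiner)

    # Partitioner: decide which reducer receives each key.
    partitions = [[] for _ in range(num_reducers)]
    for key, value in mapped:
        partitions[hash(key) % num_reducers].append((key, value))

    # Shuffle and sort, then reduce: group values by key and reduce each group.
    output = []
    for part in partitions:
        output.extend(_group_and_apply(part, reducer))

    # Output format: here simply a sorted list of (key, value) pairs.
    return sorted(output)


def wc_map(offset, line):
    """Word count mapper: emit (word, 1) for every word in the line."""
    for word in line.split():
        yield word, 1


def wc_reduce(word, counts):
    """Word count reducer (also usable as combiner): sum the counts."""
    yield word, sum(counts)
```

Calling `run_job(["a b a", "b c"], wc_map, wc_reduce, combiner=wc_reduce)` returns the word counts as a sorted list of pairs.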


MapReduce Patterns

❖ Filtering Patterns

❖ Summarization Patterns

❖ Join Patterns

❖ Data Organization Patterns

❖ Metapatterns

❖ Input and Output Patterns


Filtering patterns

❖ Filtering

❖ Bloom filtering

❖ Top-N

❖ Distinct


Filtering

❖ Closer view of data

❖ Tracking a thread of events

❖ Distributed grep

❖ Data cleansing

❖ Simple random sampling

❖ Removing low scoring data


Diagram (filtering): three input splits, each processed by its own Filter Mapper, which writes its own output file.
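A minimal sketch of the filter mapper from the diagram, covering two of the listed uses (distributed grep and simple random sampling). The function names are hypothetical, and each call to `filter_mapper` stands in for one mapper working on one input split.

```python
import random
import re


def filter_mapper(records, keep):
    """Emit only the records for which the predicate keep(record) is true."""
    for record in records:
        if keep(record):
            yield record


def grep_predicate(pattern):
    """Distributed grep: keep lines that match a regular expression."""
    regex = re.compile(pattern)
    return lambda line: regex.search(line) is not None


def sample_predicate(fraction, seed=None):
    """Simple random sampling: keep each record with probability `fraction`."""
    rng = random.Random(seed)
    return lambda _record: rng.random() < fraction
```

Because every record is judged on its own, no reducer is needed in this sketch: each mapper can write its kept records straight to its output file.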


Bloom filtering

❖ Removing most of the non-watched values

❖ Prefiltering a data set for an expensive set membership check

• Probabilistic data structure

• Membership is tested by comparing hash function outputs

• Answer: "probably yes" or "no"


Step 1 - Filter Training

Diagram: Input split -> Bloom Filter Training -> Output file

Step 2 - Bloom Filtering via MapReduce

Diagram: Input split -> Bloom Filter Mapper (load filter from distributed cache) -> Bloom Filter Test. "Maybe" goes to the output file; "No" is discarded.
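The two steps can be sketched as follows. This is an illustrative toy, not a production Bloom filter: the class and function names are hypothetical, the bit array uses one byte per bit for readability, and the hash functions are derived from SHA-256 with a per-hash prefix, which is an assumption made only for this example.

```python
import hashlib


class BloomFilter:
    """Toy Bloom filter: answers 'maybe present' or 'definitely absent'."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for clarity

    def _positions(self, item):
        # Derive num_hashes bit positions by salting SHA-256 with the hash index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item):
        # True can be a false positive; False is always correct.
        return all(self.bits[pos] for pos in self._positions(item))


def train_filter(watched_values, **kwargs):
    """Step 1: build the filter from the set of watched values."""
    bloom = BloomFilter(**kwargs)
    for value in watched_values:
        bloom.add(value)
    return bloom


def bloom_filter_mapper(records, bloom, key_of):
    """Step 2: keep records whose key tests 'maybe'; discard the rest."""
    for record in records:
        if bloom.might_contain(key_of(record)):
            yield record
```

On a cluster the trained filter would be shipped to every mapper (the diagram uses the distributed cache for this); here it is simply passed as an argument. Records that pass may still be false positives, which is why the pattern suits prefiltering before an exact but expensive membership check.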


Top N

❖ Outlier analysis

❖ Select interesting data

❖ Catchy dashboards


Diagram (Top Ten): each input split goes to a Top Ten Mapper, which emits its local top 10; a single Top Ten Reducer merges the local lists into the final top 10 output.
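The local-then-final flow in the diagram can be sketched with `heapq.nlargest` from the Python standard library. The function names and the `score_of` parameter are hypothetical; each call to `top_n_mapper` stands in for one mapper working on one input split.

```python
import heapq


def top_n_mapper(records, n, score_of):
    """Return the local top n records of one input split."""
    return heapq.nlargest(n, records, key=score_of)


def top_n_reducer(local_tops, n, score_of):
    """Merge all local top-n lists and keep the overall top n."""
    merged = (record for local in local_tops for record in local)
    return heapq.nlargest(n, merged, key=score_of)
```

For example, with `n=2` and three splits, the reducer sees at most six candidate records instead of the whole data set, which is what makes a single reducer practical here.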


Distinct

❖ Deduplicate data

❖ Getting distinct values

❖ Protecting from inner join explosions
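One common way to express Distinct, sketched here under the assumption that whole records are compared: the mapper emits each record as the key with an empty value, and the reducer emits each key once. The function names are hypothetical and the shuffle is simulated with a dictionary.

```python
def distinct_mapper(records):
    """Emit each record as a key; the value carries no information."""
    for record in records:
        yield record, None


def distinct_reducer(key, _values):
    """Each key arrives once per group, so emitting it once deduplicates."""
    yield key


def distinct(splits):
    """Simulate map, shuffle (group by key) and reduce over several splits."""
    grouped = {}
    for split in splits:
        for key, value in distinct_mapper(split):
            grouped.setdefault(key, []).append(value)
    return [out for key in sorted(grouped) for out in distinct_reducer(key, grouped[key])]
```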


Summarization patterns

❖ Numerical summarization

❖ Inverted index

❖ Counting with counters


Numerical summarization

❖ Word count

❖ Record count

❖ Min/Max/Count

❖ Average/Median/Standard deviation


Diagram (numerical summarization): three Mappers each emit (key, summary field) pairs through a Partitioner; two Reducers each output per-group results such as (group B, summary) and (group D, summary).
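A sketch of min/max/count/average for the flow in the diagram. The names are hypothetical. The design choice worth noting: an average of averages is not correct in general, so this sketch carries (min, max, sum, count) as the summary field and divides only at the end, which lets the same function serve as combiner and reducer.

```python
from collections import defaultdict


def summary_mapper(records):
    """Emit (group, (min, max, sum, count)) for each (group, value) record."""
    for group, value in records:
        yield group, (value, value, value, 1)


def summary_combine(group, partials):
    """Merge partial summaries; usable both as combiner and as reducer."""
    lo = min(p[0] for p in partials)
    hi = max(p[1] for p in partials)
    total = sum(p[2] for p in partials)
    count = sum(p[3] for p in partials)
    yield group, (lo, hi, total, count)


def summarize(splits):
    """Simulate map + combine per split, then shuffle and reduce."""
    shuffled = defaultdict(list)
    for split in splits:
        local = defaultdict(list)
        for group, partial in summary_mapper(split):
            local[group].append(partial)
        for group, partials in local.items():
            for key, combined in summary_combine(group, partials):
                shuffled[key].append(combined)

    result = {}
    for group, partials in shuffled.items():
        for key, (lo, hi, total, count) in summary_combine(group, partials):
            result[key] = {"min": lo, "max": hi, "count": count, "mean": total / count}
    return result
```

Median and standard deviation do not decompose this simply; they are left out of the sketch.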


Inverted index


Diagram (inverted index): three Mappers each emit (keyword, unique ID) pairs through a Partitioner; two Reducers each output entries such as (keyword A, list of IDs) and (keyword D, list of IDs).
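The (keyword, unique ID) to (keyword, list of IDs) flow can be sketched as below. The names are hypothetical, and splitting text on whitespace after lowercasing is a simplifying assumption about what counts as a keyword.

```python
from collections import defaultdict


def index_mapper(doc_id, text):
    """Emit (keyword, doc_id) once per distinct keyword in the document."""
    for keyword in set(text.lower().split()):
        yield keyword, doc_id


def index_reducer(keyword, doc_ids):
    """Collect all IDs for a keyword into one sorted, duplicate-free list."""
    yield keyword, sorted(set(doc_ids))


def build_index(documents):
    """Simulate map, shuffle (group by keyword) and reduce."""
    grouped = defaultdict(list)
    for doc_id, text in documents.items():
        for keyword, found_in in index_mapper(doc_id, text):
            grouped[keyword].append(found_in)
    return dict(out for keyword in grouped for out in index_reducer(keyword, grouped[keyword]))
```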


Data Organization Patterns

❖ Structured to Hierarchical

❖ Partitioning

❖ Binning

❖ Total Order Sorting

❖ Shuffling


Join patterns

❖ Reduce Side Join

❖ Replicated Join

❖ Composite Join

❖ Cartesian Product


Diagram (join): input splits from Data Set A and Data Set B each go to a JoinMapper, which emits (key, values A) or (key, values B); after shuffle and sort, three JoinReducers each write one output part.
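The flow in the diagram (mappers on both data sets, shuffle and sort, then join reducers) matches a reduce side join. A minimal inner-join sketch follows; the names and the "A"/"B" tags are hypothetical, and the shuffle is simulated with a dictionary.

```python
from collections import defaultdict


def join_mapper(records, tag, key_of):
    """Emit (join key, (source tag, record)) so the reducer can tell sources apart."""
    for record in records:
        yield key_of(record), (tag, record)


def join_reducer(key, tagged):
    """Inner join: pair every A record with every B record sharing the key."""
    left = [record for tag, record in tagged if tag == "A"]
    right = [record for tag, record in tagged if tag == "B"]
    for a in left:
        for b in right:
            yield key, a, b


def reduce_side_join(data_a, data_b, key_a, key_b):
    """Simulate map on both data sets, shuffle by key, then reduce."""
    grouped = defaultdict(list)
    for key, value in join_mapper(data_a, "A", key_a):
        grouped[key].append(value)
    for key, value in join_mapper(data_b, "B", key_b):
        grouped[key].append(value)
    return [row for key in sorted(grouped) for row in join_reducer(key, grouped[key])]
```

Keys present in only one data set produce no output in this inner-join version; an outer join would emit those records paired with an empty placeholder instead.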


Node table: id, title, tagnames, authorized, body, node type, parent id, abs parent id, added at, score, state string, last edited id, last activity id, last activity at, activity revision, extra, extra def, extra count, marked

User table: user id, reputation, gold, silver, bronze


Pig examples

-- Inner Join:
A = JOIN comments BY userID, users BY userID;

-- Outer Join:
A = JOIN comments BY userID [LEFT | RIGHT | FULL] OUTER, users BY userID;

-- Binning:
SPLIT data INTO
    eights IF col1 == 8,
    bigs IF col1 > 8,
    smalls IF (col1 < 8 AND col1 > 0);

-- Top Ten:
B = ORDER A BY col4 DESC;
C = LIMIT B 10;

-- Filtering:
b = FILTER a BY value < 3;
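For comparison, the binning example can be sketched in plain Python. This is an assumption-laden illustration: `bin_records` is a hypothetical name, rows are tuples, and the `col` parameter stands in for `col1`. As with the Pig conditions, rows whose value is 0 or negative match no bin and are dropped.

```python
def bin_records(rows, col=0):
    """Route each row to a bin by the value in column `col`, like Pig's SPLIT."""
    bins = {"eights": [], "bigs": [], "smalls": []}
    for row in rows:
        value = row[col]
        if value == 8:
            bins["eights"].append(row)
        elif value > 8:
            bins["bigs"].append(row)
        elif 0 < value < 8:
            bins["smalls"].append(row)
        # Values <= 0 satisfy none of the conditions and fall into no bin.
    return bins
```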