MapReduce: theory and practice

MapReduce 101

by Chaordic Systems

Brought to you by...

Big Data, what's the big deal?

Why is this talk relevant to you?

● we have too much data to process on a single computer

● we make too few informed decisions based on the data we have

● we have too little {time|CPU|memory} to analyze all this data

● 'cuz not everything needs to be on-line
It's 2013 but doing batch processing is still OK

Map-what?

And why MapReduce and not, say, MPI?

● Simple computation model
MapReduce exposes a simple (and limited) computational model. It can be restraining at times, but it is a trade-off.

● Fault tolerance, parallelization and distribution among machines for free
The framework deals with this for you so you don't have to.

● Because it is the bread and butter of Big Data processing
It is available on all major cloud computing platforms, and it is the baseline that other Big Data systems compare themselves against.

Outline

● Fast recap on python and whatnot

● Introduction to MapReduce

● Counting Words

● MrJob and EMR

● Real-life examples

Fast recap

Let's assume you know what the following is:

● JSON

● Python's yield keyword

● Generators in Python

● Amazon S3

● Amazon EC2

If you don't, raise your hand now. REALLY


Recap: JSON

JSON (JavaScript Object Notation) is a lightweight data-interchange format. It's as if XML and JavaScript slept together and gave birth to a bastard but good-looking child.

{"timestamp": "2011-08-15 22:17:31.334057",
 "track_id": "TRACCJA128F149A144",
 "tags": [["Bossa Nova", "100"],
          ["jazz", "20"],
          ["acoustic", "20"],
          ["romantic", "20"]],
 "title": "Segredo",
 "artist": "Jo\u00e3o Gilberto"}

Recap: Python generators

From Python's wiki: “Generator functions allow you to declare a function that behaves like an iterator, i.e. it can be used in a for loop.”

The difference is: a generator can be iterated (or read) only once as you don't store things in memory but create them on the fly [2].

You can create generators using the yield keyword.

http://wiki.python.org/moin/Generators

Recap: Python yield keyword

It's just like a return, but it turns your function into a generator. Your function suspends its execution after yielding a value and resumes when the next item is requested from the generator (the next loop iteration).

def count_from_1():
    i = 1
    while True:
        yield i   # hand out the current value and pause here
        i += 1    # runs only when the next value is requested

# Never terminates on its own: the generator is infinite.
for j in count_from_1():
    print(j)
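A minimal sketch of the two generator properties described in this recap, assuming a hypothetical bounded generator `count_up_to` (it is not part of the talk's code): values are produced on demand, and the generator can be read only once.

```python
def count_up_to(limit):
    """Hypothetical bounded variant of count_from_1, for illustration only."""
    i = 1
    while i <= limit:
        yield i  # execution pauses here until the next value is requested
        i += 1

numbers = count_up_to(3)
first_pass = list(numbers)   # consumes the generator
second_pass = list(numbers)  # nothing left: a generator can be read only once

print(first_pass)
print(second_pass)
```

This matters later on: the values handed to a reducer typically behave the same way, so a reducer that needs them twice has to store them first.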

Recap: Amazon S3

From Wikipedia: “Amazon S3 (Simple Storage Service) is an online storage web service offered by Amazon Web Services.”

It's like a distributed filesystem that is easy to use from other Amazon services, especially from Amazon Elastic MapReduce.

http://en.wikipedia.org/wiki/Amazon_S3

Recap: EC2 - Elastic Compute Cloud

From Wikipedia: “EC2 allows users to rent virtual computers on which to run their own computer applications”

So you can rent clusters on demand: no need to maintain, fix and keep up to date your ever-breaking cluster of computers. Less headache, moar action.

Instances can be purchased on demand for fixed prices, or you can bid on them.

http://en.wikipedia.org/wiki/Amazon_Elastic_Compute_Cloud

MapReduce: a quick introduction

MapReduce

MapReduce builds on the observation that many tasks have the same structure: computation is applied over a large number of records to generate partial results, which are then aggregated in some fashion.


Typical (big data) problem

● Iterate over a large number of records

● Extract something of interest from each (Map)

● Shuffle and sort intermediate results

● Aggregate intermediate results (Reduce)

● Generate final output

Phases of a MapReduction

A MapReduce job has the following steps:

map(key, value) -> [(key1, value1), (key1, value2)]

combine

sort + shuffle

reduce(key1, [value1, value2]) -> [(keyX, valueY)]

May happen in parallel, on multiple machines!

Notice:

The reduce phase only starts after all mappers have completed.
Yes, there is a synchronization barrier right there.

There is no global knowledge.
Neither mappers nor reducers know what other mappers (or reducers) are processing.
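The phases above can be sketched as a single-process simulation. This is only an illustration of the data flow, not how Hadoop implements it: the optional combine step is left out, nothing runs in parallel, and `run_mapreduce`, `length_mapper` and `collect_reducer` are hypothetical names made up for this example.

```python
from collections import defaultdict

def run_mapreduce(records, mapper, reducer):
    """Single-process simulation of map -> sort + shuffle -> reduce."""
    # Map phase: each (key, value) record may emit any number of pairs.
    intermediate = []
    for key, value in records:
        intermediate.extend(mapper(key, value))

    # Sort + shuffle: collect every emitted value under its key.
    groups = defaultdict(list)
    for out_key, out_value in intermediate:
        groups[out_key].append(out_value)

    # Reduce phase: starts only after all map output has been grouped,
    # which mirrors the synchronization barrier.
    results = []
    for out_key in sorted(groups):
        results.extend(reducer(out_key, groups[out_key]))
    return results

def length_mapper(key, value):
    # Hypothetical mapper: emit (word length, word) for each record.
    yield len(value), value

def collect_reducer(key, values):
    # Hypothetical reducer: emit the sorted words that share a length.
    yield key, sorted(values)

records = [(1, "map"), (2, "reduce"), (3, "key"), (4, "values")]
result = run_mapreduce(records, length_mapper, collect_reducer)
print(result)
```

Note that the mapper and the reducer only ever see one record or one key at a time, which is the "no global knowledge" constraint in practice.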

Counting Words

Counting the number of occurrences of a word in a document collection is quite a big deal.

Let's try with a small example:

"Me gusta correr, me gustas tu. Me gusta la lluvia, me gustas tu."


me 4

gusta 2

correr 1

gustas 2

tu 2

la 1

lluvia 1

Counting words - in Python

doc = open('input')
count = {}
for line in doc:
    words = line.split()
    for w in words:
        # Start at 0 the first time a word is seen.
        count[w] = count.get(w, 0) + 1

Easy, right? Yeah... too easy. Let's split what we do for each line and aggregate, shall we?
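One caveat before moving on: `line.split()` keeps capitalization and punctuation, so "Me" and "tu." would be counted separately from "me" and "tu". A minimal sketch that reproduces the counts listed for the example sentence, assuming a simple normalization step (the `tokenize` helper is hypothetical, not part of the talk's code):

```python
import re
from collections import Counter

def tokenize(line):
    """Hypothetical helper: lowercase the line and keep runs of letters only."""
    return re.findall(r"[a-z]+", line.lower())

lines = ["Me gusta correr, me gustas tu.",
         "Me gusta la lluvia, me gustas tu."]

count = Counter()
for line in lines:
    count.update(tokenize(line))

print(count["me"], count["gusta"], count["tu"])
```

The regular expression only covers unaccented ASCII letters, which is enough for this sentence but would need widening for real text.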

Counting words - in MapReduce

def map_get_words(self, key, line):
    for word in line.split():
        yield word, 1

def reduce_sum_words(self, word, occurrences):
    yield word, sum(occurrences)


What is Map's output?

key=1, line="me gusta correr me gustas tu"

('me', 1)

('gusta', 1)

('correr', 1)

('me', 1)

('gustas', 1)

('tu', 1)

key=2, line="me gusta la lluvia me gustas tu"

('me', 1)

('gusta', 1)

('la', 1)

('lluvia', 1)

('me', 1)

('gustas', 1)

('tu', 1)

What about shuffle?


Think of it as a distributed group by operation.

In the local map instance/node:

● it sorts map output values,
● groups them by their key,
● sends each group of key and associated values to the reduce node responsible for this key.

In the reduce instance/node:

● the framework joins all values associated with this key in a single list - for you, for free.
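The "distributed group by" can be sketched locally with a sort followed by `itertools.groupby`. This is a single-machine illustration only: the real shuffle also partitions keys across reducer nodes, which is not modeled here.

```python
from itertools import groupby
from operator import itemgetter

# Map output for the two example lines, in emission order.
map_output = [
    ("me", 1), ("gusta", 1), ("correr", 1), ("me", 1), ("gustas", 1), ("tu", 1),
    ("me", 1), ("gusta", 1), ("la", 1), ("lluvia", 1), ("me", 1), ("gustas", 1),
    ("tu", 1),
]

# Sort by key, then group adjacent pairs that share a key.
sorted_pairs = sorted(map_output, key=itemgetter(0))
reducer_input = [
    (key, [value for _, value in pairs])
    for key, pairs in groupby(sorted_pairs, key=itemgetter(0))
]

for key, values in reducer_input:
    print(key, values)
```

`groupby` only merges adjacent items, which is why the sort has to come first; the same reasoning explains why the framework sorts before grouping.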

What's Shuffle's output? Or: what's the Reducer's input?

Notice:

This table represents a global view.

"In real life", each reducer instance only knows about its own key and values.

Key (input)   Values
correr        [1]
gusta         [1, 1]
gustas        [1, 1]
la            [1]
lluvia        [1]
me            [1, 1, 1, 1]
tu            [1, 1]

What's Reducer output?

word      occurrences    output
correr    [1]            (correr, 1)
gusta     [1, 1]         (gusta, 2)
gustas    [1, 1]         (gustas, 2)
la        [1]            (la, 1)
lluvia    [1]            (lluvia, 1)
me        [1, 1, 1, 1]   (me, 4)
tu        [1, 1]         (tu, 2)

MapReduce (main) Implementations

Google MapReduce
● C++
● Proprietary

Apache Hadoop
● Java
  ○ interfaces for anything that runs in the JVM
  ○ Hadoop Streaming for a pipe-like, programming-language-agnostic interface
● Open source

Nobody really cares about the others (for now... ;)

Amazon Elastic MapReduce (EMR)


● Uses Hadoop with extra sauce

● Creates a Hadoop cluster on demand

● It's magical -- except when it fails

● Can be sort of unpredictable sometimes
  ○ Installing Python modules can fail for no clear reason

MrJob

It's a Python interface for Hadoop Streaming jobs that is really easy to use.

● Can run jobs locally or in EMR.
● Takes care of uploading your Python code to EMR.
● Works better if everything is in a single Python module.
● Easy interface to chain sequences of M/R steps.
● Some basic tools to aid debugging.

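Counting words: full MrJob example

from mrjob.job import MRJob


class MRWordCounter(MRJob):

    def get_words(self, key, line):
        # Mapper: emit (word, 1) for every word in the input line.
        for word in line.split():
            yield word, 1

    def sum_words(self, word, occurrences):
        # Reducer: add up all the 1s emitted for this word.
        yield word, sum(occurrences)

    def steps(self):
        # self.mr() is the step API of the mrjob version used in this talk;
        # newer mrjob releases use MRStep instead.
        return [self.mr(self.get_words, self.sum_words)]


if __name__ == '__main__':
    MRWordCounter.run()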

MrJob: Launching a job

Running it locally

python countwords.py --conf-path=mrjob.conf input.txt

Running it in EMR
Do not forget to set the AWS_ environment variables!

python countwords.py \
    --conf-path=mrjob.conf \
    -r emr \
    's3://ufcgplayground/data/words/*' \
    --no-output \
    --output-dir=s3://ufcgplayground/tmp/bla/

MrJob: Installing and environment setup

Install MrJob using pip or easy_install
Do not, I repeat, DO NOT install the version in Ubuntu/Debian.

sudo pip install mrjob

Set up your environment with AWS credentials

export AWS_ACCESS_KEY_ID=...
export AWS_SECRET_ACCESS_KEY=...

Set up your environment to look for MrJob settings:

export MRJOB_CONF=<path to mrjob.conf>

Use our sample MrJob app as your template

git clone https://github.com/chaordic/mr101ufcg.git

Modify the sample mrjob.conf so that your jobs are labeled with your team's name
It's the Right Thing © to do.

s3_logs_uri: s3://ufcgplayground/yournamehere/log/
s3_scratch_uri: s3://ufcgplayground/yournamehere/tmp/

Profit!

Target Categories

Objective: Find the most commonly viewed categories per user

Input:
● views and orders

Patterns used:
● simple aggregation

Map input

zezin, fulano, [telefone, celulares, vivo]
zezin, fulano, [telefone, celulares, vivo]
zezin, fulano, [eletro, caos, furadeira]
lojaX, fulano, [livros, arte, anime]
lojaX, fulano, [livros, arte, anime]
lojaX, fulano, [livros, arte, anime]

Key: the first two fields of each row, (store, user).

Sort + Shuffle

Reduce Input

(zezin, fulano)  [telefone, celulares, vivo]
                 [telefone, celulares, vivo]
                 [eletro, caos, furadeira]
(lojaX, fulano)  [livros, arte, anime]
                 [livros, arte, anime]
                 [livros, arte, anime]

Reduce Output

(zezin, fulano)  ([telefone, celulares, vivo], 2)
                 ([eletro, caos, furadeira], 1)
(lojaX, fulano)  ([livros, arte, anime], 3)
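The simple aggregation above can be sketched in plain Python. This is an illustration under assumptions, not the course's solution: the shuffle is simulated with a dictionary, and `map_views` and `reduce_count_categories` are hypothetical names.

```python
from collections import Counter, defaultdict

views = [
    ("zezin", "fulano", ["telefone", "celulares", "vivo"]),
    ("zezin", "fulano", ["telefone", "celulares", "vivo"]),
    ("zezin", "fulano", ["eletro", "caos", "furadeira"]),
    ("lojaX", "fulano", ["livros", "arte", "anime"]),
    ("lojaX", "fulano", ["livros", "arte", "anime"]),
    ("lojaX", "fulano", ["livros", "arte", "anime"]),
]

def map_views(store, user, category):
    # Key is (store, user); the category path becomes a hashable tuple.
    yield (store, user), tuple(category)

def reduce_count_categories(key, categories):
    # Emit each category with its view count, most viewed first.
    for category, view_count in Counter(categories).most_common():
        yield key, (list(category), view_count)

# Simulated sort + shuffle: group map output by key.
shuffled = defaultdict(list)
for store, user, category in views:
    for key, value in map_views(store, user, category):
        shuffled[key].append(value)

output = []
for key in sorted(shuffled):
    output.extend(reduce_count_categories(key, shuffled[key]))

for row in output:
    print(row)
```

Converting each category list to a tuple is a design choice forced by `Counter`, which needs hashable items.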



Filter Expensive Categories

Objective: List all categories where a user purchased something expensive.

Input:
● Orders (for price and user information)
● Products (for category information)

Patterns used:
● merge using reducer

Map Input

BuyOrders:
lojaX  livro   fulano    R$ 20
lojaX  iphone  deltrano  R$ 1800

Products:
lojaX  livro   [livros, arte, anime]
lojaX  iphone  [telefone, celulares, vivo]

We have to merge those tables above!
The common key is the first two columns of each row.

Map Output

Map input (the first two columns are the key)   Value
lojaX  livro   fulano    R$ 20                  (nothing, it's cheap)
lojaX  iphone  deltrano  R$ 1800                {"usuario": "deltrano"}
lojaX  livro   [livros, arte, anime]            {"cat": [livros...]}
lojaX  iphone  [telefone, celulares, vivo]      {"cat": [telefone...]}

Reduce Input

Key              Values
(lojaX, livro)   {"cat": [livros, arte, anime]}
(lojaX, iphone)  {"usuario": "deltrano"}
                 {"cat": [telefone, celulares, vivo]}

Those are the parts we care about!

Reduce Output

Key                Values
(lojaX, deltrano)  [telefone, celulares, vivo]
Real Datasets

Real datasets, real problems

In the following hour we will write code to analyse some real datasets:
● Twitter Dataset (from an article published in WWW'10)
● LastFM Dataset, from The Million Song Dataset

Supporting code
● available at GitHub, under https://github.com/chaordic/mr101ufcg
● comes with sample data under data for local runs.

http://an.kaist.ac.kr/traces/WWW2010.html

http://labrosa.ee.columbia.edu/millionsong/



Twitter Followers Dataset

A somewhat big dataset
● 41.7 million profiles
● 1.47 billion social relations (who follows whom)
● 25 GB of uncompressed data

Available at s3://mr101ufcg/data/twitter/ ...
● splitted/*.gz
  full dataset split into small compressed files
● numeric2screen.txt
  numeric ID to original screen name mapping
● followed_by.txt
  original 25 GB dataset as a single file


Each line in followed_by.txt has the following format:

user_id \t follower_id

For instance:

12 \t 38

12 \t 41

13 \t 47

13 \t 52

13 \t 53

14 \t 56
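A minimal sketch of reading this format, assuming the separator is a literal tab character and that the spaces around `\t` on the slide are only there for readability. The `parse_line` helper is hypothetical.

```python
from collections import defaultdict

# Sample lines in the followed_by.txt format (tab-separated, by assumption).
sample_lines = ["12\t38", "12\t41", "13\t47", "13\t52", "13\t53", "14\t56"]

def parse_line(line):
    """Return (user_id, follower_id) as integers."""
    user_id, follower_id = line.split("\t")
    # int() ignores surrounding whitespace, including a trailing newline.
    return int(user_id), int(follower_id)

# Group follower IDs under the user they follow.
followers = defaultdict(list)
for line in sample_lines:
    user_id, follower_id = parse_line(line)
    followers[user_id].append(follower_id)

print(dict(followers))
```

In a MapReduce job the grouping would be done by the shuffle, so the mapper would only need the parsing step.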

Million Song Dataset project's Last.fm Dataset

A not-so-big dataset
● 943,347 tracks
● 1.2 GB of compressed data

Yeah, it is not all that big...

Available at s3://mr101ufcg/data/lastfm/ ...
● metadata/*.gz
  Track metadata information, in JSONProtocol format.
● similars/*.gz
  Track similarity information, in JSONProtocol format.


JSONProtocol encodes a key-value pair in a single line, as two JSON-encoded values separated by a tab character ( \t ).

<JSON encoded data> \t <JSON encoded data>
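The format above can be decoded with the standard `json` module by splitting each line on its first tab. This is a sketch of the wire format only, not mrjob's own implementation; `decode_json_protocol` and `encode_json_protocol` are hypothetical helpers, and the sample line is the one shown on this slide.

```python
import json

line = ('"TRACHOZ12903CCA8B3"\t{"timestamp": "2011-09-07 22:12:47.150438", '
        '"track_id": "TRACHOZ12903CCA8B3", "tags": [], '
        '"title": "Close Up", "artist": "Charles Williams"}')

def decode_json_protocol(raw_line):
    """Split one line into its JSON-encoded key and value."""
    # JSON escapes tabs inside strings, so the first raw tab is the separator.
    raw_key, raw_value = raw_line.split("\t", 1)
    return json.loads(raw_key), json.loads(raw_value)

def encode_json_protocol(key, value):
    """Inverse of decode_json_protocol."""
    return json.dumps(key) + "\t" + json.dumps(value)

key, value = decode_json_protocol(line)
print(key, value["title"], value["artist"])
```

When a job is configured to read this format, mrjob performs the decoding for you and hands the mapper the key and value directly.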

Example line:

"TRACHOZ12903CCA8B3" \t {"timestamp": "2011-09-07 22:12:47.150438", "track_id": "TRACHOZ12903CCA8B3", "tags": [], "title": "Close Up", "artist": "Charles Williams"}

Questions?

Stuff I didn't talk about but is sorta cool

Persistent jobs

Serialization (protocols in MrJob parlance)

Amazon EMR Console

Hadoop dashboard (and port 9100)

Combiners

Are just like reducers, but run right after a Map and just before data is sent over the network during the shuffle.

Combiners must...
● be associative: a.(b.c) == (a.b).c
● be commutative: a.b == b.a
● have the same input and output types as your Map output type.

Caveats:
● Combiners can be executed zero, one or many times, so don't make your MR depend on them
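A small sketch of why these requirements matter, using made-up values for a single key split across two hypothetical mappers. Summing gives the same answer whether or not the combiner runs; averaging does not, so a mean cannot be used as a combiner as-is.

```python
def reduce_sum(values):
    return sum(values)

def mean(values):
    values = list(values)
    return sum(values) / float(len(values))

# Counts for one key, emitted by two hypothetical mappers.
mapper_outputs = [[1, 1, 1], [1, 1]]

# Combiner never runs: the reducer sees every raw value.
no_combiner = reduce_sum([v for chunk in mapper_outputs for v in chunk])

# Combiner runs once per mapper: the reducer sees partial sums.
with_combiner = reduce_sum([reduce_sum(chunk) for chunk in mapper_outputs])

# Hypothetical readings for one key, again split across two mappers.
readings = [[10, 20, 30], [40]]
true_mean = mean([v for chunk in readings for v in chunk])
mean_of_means = mean([mean(chunk) for chunk in readings])

print(no_combiner, with_combiner)
print(true_mean, mean_of_means)
```

A common workaround for averages is to have the combiner emit (sum, count) pairs and divide only in the reducer.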

Reference & Further reading

[1] MapReduce: A Crash Course
http://events.inf.ed.ac.uk/ilw/hadoop/mr-overview.pdf

[2] StackOverflow: The python yield keyword explained
http://stackoverflow.com/questions/231767/the-python-yield-keyword-explained

[3] Explicando iterables, generators e yield no python
http://diofeher.wordpress.com/2011/04/12/explicando-iterables-generators-yield-no-python/

[4] MapReduce: Simplified Data Processing on Large Clusters
http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en//archive/mapreduce-osdi04.pdf

[5] MrJob 4.0 - Quick start
http://pythonhosted.org/mrjob/guides/quickstart.html

[6] Amazon EC2 Instance Types
http://aws.amazon.com/ec2/instance-types/#instance-details

Life beyond MapReduce

What about reading up on other frameworks for distributed processing of Big Data?
● Spark: http://spark-project.org/
● Storm: http://storm-project.net/
● GraphLab: http://graphlab.org/

And don't get me started on NoSQL...

Many thanks to...

for supporting this course.

You know there will be some live, intense, groovy Elastic MapReduce action right after this presentation, right?

Questions?

Feel free to contact me at [email protected]

Or follow us @chaordic

So, let's write some code?

Twitter Dataset
● Count how many followers each user has
● Discover the user with the most followers
● What if I want the top-N most followed?

LastFM
● Merge similarity and metadata for tracks
● What is the most "plain" song?
● What is the plainest rock song according only to rock songs?

Extra slides
