MapReduce: theory and practice
DESCRIPTION
Nowadays it is easy to gather absurdly large amounts of data. But once you have them, how do you extract information from these amorphous mountains of data? In this short course we present the MapReduce programming model: how it works, what it is for, and how to build applications with it. We will also see how to use Elastic MapReduce, the Amazon service that creates MapReduce clusters on demand, so that you worry less about administering and getting access to a cluster of machines and more about how to make your code digest your data in a distributed fashion. We will see practical examples in action and code some challenges together.
TRANSCRIPT
![Page 1: MapReduce: teoria e prática](https://reader034.vdocuments.site/reader034/viewer/2022052618/554a0770b4c905507a8b55be/html5/thumbnails/1.jpg)
MapReduce 101
by Chaordic Systems
![Page 2: MapReduce: teoria e prática](https://reader034.vdocuments.site/reader034/viewer/2022052618/554a0770b4c905507a8b55be/html5/thumbnails/2.jpg)
Brought to you by...
![Page 3: MapReduce: teoria e prática](https://reader034.vdocuments.site/reader034/viewer/2022052618/554a0770b4c905507a8b55be/html5/thumbnails/3.jpg)
Big Data, what's the big deal?
Why is this talk relevant to you?
● we have too much data to process on a single computer
● we make too few informed decisions based on the data we have
● we have too little {time|CPU|memory} to analyze all this data
● 'cuz not everything needs to be on-line. It's 2013, but doing batch processing is still OK
![Page 4: MapReduce: teoria e prática](https://reader034.vdocuments.site/reader034/viewer/2022052618/554a0770b4c905507a8b55be/html5/thumbnails/4.jpg)
Map-what?
And why MapReduce and not, say, MPI?
● Simple computation model: MapReduce exposes a simple (and limited) computational model. It can be restraining at times, but that is the trade-off.
● Fault tolerance, parallelization and distribution among machines for free: the framework deals with these for you so you don't have to.
● Because it is the bread and butter of Big Data processing: it is available on all major cloud computing platforms, and it is the baseline other Big Data systems compare themselves against.
![Page 5: MapReduce: teoria e prática](https://reader034.vdocuments.site/reader034/viewer/2022052618/554a0770b4c905507a8b55be/html5/thumbnails/5.jpg)
Outline
● Fast recap on python and whatnot
● Introduction to MapReduce
● Counting Words
● MrJob and EMR
● Real-life examples
![Page 6: MapReduce: teoria e prática](https://reader034.vdocuments.site/reader034/viewer/2022052618/554a0770b4c905507a8b55be/html5/thumbnails/6.jpg)
Fast recap
![Page 7: MapReduce: teoria e prática](https://reader034.vdocuments.site/reader034/viewer/2022052618/554a0770b4c905507a8b55be/html5/thumbnails/7.jpg)
Let's assume you know what the following is:
● JSON
● Python's yield keyword
● Generators in Python
● Amazon S3
● Amazon EC2
If you don't, raise your hand now. REALLY
![Page 8: MapReduce: teoria e prática](https://reader034.vdocuments.site/reader034/viewer/2022052618/554a0770b4c905507a8b55be/html5/thumbnails/8.jpg)
Recap: JSON
JSON (JavaScript Object Notation) is a lightweight data-interchange format. It's as if XML and JavaScript slept together and gave birth to a bastard but good-looking child.
{"timestamp": "2011-08-15 22:17:31.334057",
 "track_id": "TRACCJA128F149A144",
 "tags": [["Bossa Nova", "100"],
          ["jazz", "20"],
          ["acoustic", "20"],
          ["romantic", "20"]],
 "title": "Segredo",
 "artist": "Jo\u00e3o Gilberto"}
![Page 9: MapReduce: teoria e prática](https://reader034.vdocuments.site/reader034/viewer/2022052618/554a0770b4c905507a8b55be/html5/thumbnails/9.jpg)
Recap: Python generators
From Python's wiki: “Generator functions allow you to declare a function that behaves like an iterator, i.e. it can be used in a for loop.”
The difference is that a generator can be iterated (or read) only once, because it does not store its items in memory but creates them on the fly [2].
You can create generators using the yield keyword.
![Page 10: MapReduce: teoria e prática](https://reader034.vdocuments.site/reader034/viewer/2022052618/554a0770b4c905507a8b55be/html5/thumbnails/10.jpg)
Recap: Python's yield keyword
It's just like a return, but it turns your function into a generator. Your function suspends its execution after yielding a value and resumes it when the next item is requested from the generator (the next loop iteration).
def count_from_1():
    i = 1
    while True:
        yield i
        i += 1

for j in count_from_1():  # never ends on its own: stop it with Ctrl-C
    print(j)
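A small sketch of the two generator properties mentioned in this recap: values are produced lazily, and a generator can be consumed only once. The use of `itertools.islice` and the `squares` generator are additions for illustration, not part of the original slides.

```python
from itertools import islice

def count_from_1():
    """Infinite generator: yields 1, 2, 3, ... one value per request."""
    i = 1
    while True:
        yield i
        i += 1

# islice stops after five items, so the infinite generator is safe to consume.
first_five = list(islice(count_from_1(), 5))
print(first_five)  # [1, 2, 3, 4, 5]

# A generator can be read only once: the second pass finds it exhausted.
squares = (n * n for n in range(3))
first_pass = list(squares)
second_pass = list(squares)
print(first_pass)   # [0, 1, 4]
print(second_pass)  # []
```

This matters later on: the values handed to a reducer are typically an iterator of this kind, so iterate over them once or copy them into a list first.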
![Page 11: MapReduce: teoria e prática](https://reader034.vdocuments.site/reader034/viewer/2022052618/554a0770b4c905507a8b55be/html5/thumbnails/11.jpg)
Recap: Amazon S3
From Wikipedia: “Amazon S3 (Simple Storage Service) is an online storage web service offered by Amazon Web Services.”
It's like a distributed filesystem that is easy to use from other Amazon services, especially from Amazon Elastic MapReduce.
![Page 12: MapReduce: teoria e prática](https://reader034.vdocuments.site/reader034/viewer/2022052618/554a0770b4c905507a8b55be/html5/thumbnails/12.jpg)
Recap: EC2 - Elastic Compute Cloud
From Wikipedia: “EC2 allows users to rent virtual computers on which to run their own computer applications”
So you can rent clusters on demand: no need to maintain, fix and keep up to date your own ever-breaking cluster of computers. Less headache, moar action.
Instances can be purchased on demand at fixed prices, or you can bid for them.
![Page 13: MapReduce: teoria e prática](https://reader034.vdocuments.site/reader034/viewer/2022052618/554a0770b4c905507a8b55be/html5/thumbnails/13.jpg)
MapReduce: a quick introduction
![Page 14: MapReduce: teoria e prática](https://reader034.vdocuments.site/reader034/viewer/2022052618/554a0770b4c905507a8b55be/html5/thumbnails/14.jpg)
MapReduce
MapReduce builds on the observation that many tasks have the same structure: computation is applied over a large number of records to generate partial results, which are then aggregated in some fashion.
Map: the computation applied over each record to generate partial results.
Reduce: the aggregation of those partial results.
![Page 17: MapReduce: teoria e prática](https://reader034.vdocuments.site/reader034/viewer/2022052618/554a0770b4c905507a8b55be/html5/thumbnails/17.jpg)
Typical (big data) problem
● Iterate over a large number of records (Map)
● Extract something of interest from each (Map)
● Shuffle and sort intermediate results
● Aggregate intermediate results (Reduce)
● Generate final output (Reduce)
![Page 18: MapReduce: teoria e prática](https://reader034.vdocuments.site/reader034/viewer/2022052618/554a0770b4c905507a8b55be/html5/thumbnails/18.jpg)
Phases of a MapReduction
MapReduce has the following steps:
map(key, value) -> [(key1, value1), (key1, value2)]
combine
sort + shuffle
reduce(key1, [value1, value2]) -> [(keyX, valueY)]
Each step may happen in parallel, on multiple machines!
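To make the phases concrete, here is a minimal single-process sketch of one map, sort + shuffle, reduce pass. It is not how Hadoop is implemented: there is no parallelism, and the optional combine step is left out. The names `run_mapreduce`, `parity_mapper` and `sum_reducer` are made up for this illustration.

```python
from itertools import groupby
from operator import itemgetter

def run_mapreduce(records, mapper, reducer):
    """Run one MapReduce pass in a single process (no parallelism)."""
    # Map: every (key, value) record may emit any number of pairs.
    intermediate = []
    for key, value in records:
        intermediate.extend(mapper(key, value))
    # Sort + shuffle: bring the pairs that share a key next to each other.
    intermediate.sort(key=itemgetter(0))
    # Reduce: one call per distinct key, with all of its values.
    output = []
    for key, group in groupby(intermediate, key=itemgetter(0)):
        values = [value for _, value in group]
        output.extend(reducer(key, values))
    return output

def parity_mapper(_, number):
    # Emit the number under the key "even" or "odd".
    yield ("even" if number % 2 == 0 else "odd"), number

def sum_reducer(key, values):
    yield key, sum(values)

records = list(enumerate([1, 2, 3, 4, 5]))
result = run_mapreduce(records, parity_mapper, sum_reducer)
print(result)  # [('even', 6), ('odd', 9)]
```

Note how the mapper and the reducer only ever see one record or one key at a time; everything else is the framework's job.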
![Page 19: MapReduce: teoria e prática](https://reader034.vdocuments.site/reader034/viewer/2022052618/554a0770b4c905507a8b55be/html5/thumbnails/19.jpg)
![Page 20: MapReduce: teoria e prática](https://reader034.vdocuments.site/reader034/viewer/2022052618/554a0770b4c905507a8b55be/html5/thumbnails/20.jpg)
Notice:
Reduce phase only starts after all mappers have completed. Yes, there is a synchronization barrier right there.
There is no global knowledge. Neither mappers nor reducers know what other mappers (or reducers) are processing.
![Page 21: MapReduce: teoria e prática](https://reader034.vdocuments.site/reader034/viewer/2022052618/554a0770b4c905507a8b55be/html5/thumbnails/21.jpg)
![Page 22: MapReduce: teoria e prática](https://reader034.vdocuments.site/reader034/viewer/2022052618/554a0770b4c905507a8b55be/html5/thumbnails/22.jpg)
Counting Words
Counting the number of occurrences of a word in a document collection is quite a big deal.
Let's try with a small example:
"Me gusta correr, me gustas tu. Me gusta la lluvia, me gustas tu."
![Page 23: MapReduce: teoria e prática](https://reader034.vdocuments.site/reader034/viewer/2022052618/554a0770b4c905507a8b55be/html5/thumbnails/23.jpg)
Counting Words
"Me gusta correr, me gustas tu. Me gusta la lluvia, me gustas tu."
me 4
gusta 2
correr 1
gustas 2
tu 2
la 1
lluvia 1
![Page 24: MapReduce: teoria e prática](https://reader034.vdocuments.site/reader034/viewer/2022052618/554a0770b4c905507a8b55be/html5/thumbnails/24.jpg)
Counting words - in Python
doc = open('input')
count = {}
for line in doc:
    # Split the line on whitespace to get its words.
    words = line.split()
    for w in words:
        # Start from 0 the first time a word shows up.
        count[w] = count.get(w, 0) + 1
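One detail worth noticing: the loop above splits on whitespace only, so on the sample sentence "Me" and "me", or "tu." and "tu", would be counted as different words. The sketch below reproduces the counts from the earlier table, assuming the intended normalisation is lowercasing plus keeping runs of word characters (the slides do not spell this out).

```python
import re
from collections import Counter

text = "Me gusta correr, me gustas tu.\nMe gusta la lluvia, me gustas tu."

# Lowercase first, then keep only runs of word characters as tokens.
words = re.findall(r"\w+", text.lower())
count = Counter(words)

for word, total in count.most_common():
    print(word, total)
```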
Easy, right? Yeah... too easy. Let's split what we do for each line and aggregate, shall we?
![Page 25: MapReduce: teoria e prática](https://reader034.vdocuments.site/reader034/viewer/2022052618/554a0770b4c905507a8b55be/html5/thumbnails/25.jpg)
Counting words - in MapReduce
def map_get_words(self, key, line):
    for word in line.split():
        yield word, 1

def reduce_sum_words(self, word, occurrences):
    yield word, sum(occurrences)
![Page 26: MapReduce: teoria e prática](https://reader034.vdocuments.site/reader034/viewer/2022052618/554a0770b4c905507a8b55be/html5/thumbnails/26.jpg)
def map_get_words(self, key, line):
    for word in line.split():
        yield word, 1
What is Map's output?
key=1, line="me gusta correr me gustas tu"
('me', 1)
('gusta', 1)
('correr', 1)
('me', 1)
('gustas', 1)
('tu', 1)
key=2, line="me gusta la lluvia me gustas tu"
('me', 1)
('gusta', 1)
('la', 1)
('lluvia', 1)
('me', 1)
('gustas', 1)
('tu', 1)
![Page 27: MapReduce: teoria e prática](https://reader034.vdocuments.site/reader034/viewer/2022052618/554a0770b4c905507a8b55be/html5/thumbnails/27.jpg)
What about shuffle?
![Page 28: MapReduce: teoria e prática](https://reader034.vdocuments.site/reader034/viewer/2022052618/554a0770b4c905507a8b55be/html5/thumbnails/28.jpg)
![Page 29: MapReduce: teoria e prática](https://reader034.vdocuments.site/reader034/viewer/2022052618/554a0770b4c905507a8b55be/html5/thumbnails/29.jpg)
Think of it as a distributed group-by operation.
In the local map instance/node:
● it sorts the map output values,
● groups them by their key,
● sends each group of key and associated values to the reduce node responsible for this key.
In the reduce instance/node:
● the framework joins all values associated with this key in a single list - for you, for free.
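How does a map node know which reduce node is "responsible" for a key? A common approach is to hash the key modulo the number of reducers. The sketch below assumes that scheme for illustration; `partition` and `shuffle` are hypothetical helper names, and `zlib.crc32` is used because Python's built-in `hash()` for strings can differ between processes.

```python
import zlib
from collections import defaultdict

def partition(key, num_reducers):
    """Pick the reducer responsible for a key (stable across processes)."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers

def shuffle(map_output, num_reducers):
    """Group values by key, inside the bucket of the reducer that owns the key."""
    buckets = [defaultdict(list) for _ in range(num_reducers)]
    for key, value in sorted(map_output):
        buckets[partition(key, num_reducers)][key].append(value)
    return buckets

map_output = [("me", 1), ("gusta", 1), ("me", 1), ("tu", 1)]
for reducer_id, bucket in enumerate(shuffle(map_output, 2)):
    print(reducer_id, dict(bucket))
```

Whatever the hash values turn out to be, every occurrence of a key lands in the same bucket, which is the only guarantee the reducers need.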
![Page 30: MapReduce: teoria e prática](https://reader034.vdocuments.site/reader034/viewer/2022052618/554a0770b4c905507a8b55be/html5/thumbnails/30.jpg)
What's Shuffle's output? Or: what's the Reducer's input?
Notice:
This table represents a global view.
"In real life", each reducer instance only knows about its own key and values.
Key (input) Values
correr [1]
gusta [1, 1]
gustas [1, 1]
la [1]
lluvia [1]
me [1, 1, 1, 1]
tu [1, 1]
![Page 31: MapReduce: teoria e prática](https://reader034.vdocuments.site/reader034/viewer/2022052618/554a0770b4c905507a8b55be/html5/thumbnails/31.jpg)
def reduce_sum_words(self, word, occurrences):
    yield word, sum(occurrences)
What's Reducer output?
word occurrences output
correr [1] (correr, 1)
gusta [1, 1] (gusta, 2)
gustas [1, 1] (gustas, 2)
la [1] (la, 1)
lluvia [1] (lluvia, 1)
me [1, 1, 1, 1] (me, 4)
tu [1, 1] (tu, 2)
![Page 32: MapReduce: teoria e prática](https://reader034.vdocuments.site/reader034/viewer/2022052618/554a0770b4c905507a8b55be/html5/thumbnails/32.jpg)
![Page 33: MapReduce: teoria e prática](https://reader034.vdocuments.site/reader034/viewer/2022052618/554a0770b4c905507a8b55be/html5/thumbnails/33.jpg)
MapReduce (main) Implementations
Google MapReduce
● C++
● Proprietary
Apache Hadoop
● Java
  ○ interfaces for anything that runs in the JVM
  ○ Hadoop Streaming for a pipe-like, programming-language-agnostic interface
● Open source
Nobody really cares about the others (for now... ;)
![Page 34: MapReduce: teoria e prática](https://reader034.vdocuments.site/reader034/viewer/2022052618/554a0770b4c905507a8b55be/html5/thumbnails/34.jpg)
Amazon Elastic MapReduce (EMR)
● Uses Hadoop with some extra sauce
● Creates a Hadoop cluster on demand
● It's magical -- except when it fails
● Can be sort of unpredictable sometimes
  ○ Installing Python modules can fail for no clear reason
![Page 35: MapReduce: teoria e prática](https://reader034.vdocuments.site/reader034/viewer/2022052618/554a0770b4c905507a8b55be/html5/thumbnails/35.jpg)
MrJob
It's a Python library for Hadoop Streaming jobs with a really easy-to-use interface.
● Can run jobs locally or in EMR.
● Takes care of uploading your Python code to EMR.
● Works better if everything is in a single Python module.
● Easy interface to chain sequences of M/R steps.
● Some basic tools to aid debugging.
![Page 36: MapReduce: teoria e prática](https://reader034.vdocuments.site/reader034/viewer/2022052618/554a0770b4c905507a8b55be/html5/thumbnails/36.jpg)
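Counting words: full MrJob example
from mrjob.job import MRJob

class MRWordCounter(MRJob):

    def get_words(self, key, line):
        # Mapper: emit (word, 1) for every word in the input line.
        for word in line.split():
            yield word, 1

    def sum_words(self, word, occurrences):
        # Reducer: add up all the 1s emitted for this word.
        yield word, sum(occurrences)

    def steps(self):
        # self.mr() is the step API of the mrjob release used in this talk;
        # later releases define steps with mrjob.step.MRStep instead.
        return [self.mr(self.get_words, self.sum_words),]

if __name__ == '__main__':
    MRWordCounter.run()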
![Page 37: MapReduce: teoria e prática](https://reader034.vdocuments.site/reader034/viewer/2022052618/554a0770b4c905507a8b55be/html5/thumbnails/37.jpg)
MrJob: launching a job
Running it locally:
python countwords.py --conf-path=mrjob.conf input.txt
Running it in EMR (do not forget to set the AWS_ env. vars!):
python countwords.py \
    --conf-path=mrjob.conf \
    -r emr \
    's3://ufcgplayground/data/words/*' \
    --no-output \
    --output-dir=s3://ufcgplayground/tmp/bla/
![Page 38: MapReduce: teoria e prática](https://reader034.vdocuments.site/reader034/viewer/2022052618/554a0770b4c905507a8b55be/html5/thumbnails/38.jpg)
MrJob: installing and environment setup
Install MrJob using pip or easy_install. Do not, I repeat, DO NOT install the version packaged in Ubuntu/Debian.
sudo pip install mrjob
Set up your environment with AWS credentials:
export AWS_ACCESS_KEY_ID=...
export AWS_SECRET_ACCESS_KEY=...
Set up your environment to look for MrJob settings:
export MRJOB_CONF=<path to mrjob.conf>
![Page 39: MapReduce: teoria e prática](https://reader034.vdocuments.site/reader034/viewer/2022052618/554a0770b4c905507a8b55be/html5/thumbnails/39.jpg)
MrJob: installing and environment setup (continued)
Use our sample MrJob app as your template:
git clone https://github.com/chaordic/mr101ufcg.git
Modify the sample mrjob.conf so that your jobs are labeled with your team's name. It's the Right Thing © to do.
s3_logs_uri: s3://ufcgplayground/yournamehere/log/
s3_scratch_uri: s3://ufcgplayground/yournamehere/tmp/
Profit!
![Page 40: MapReduce: teoria e prática](https://reader034.vdocuments.site/reader034/viewer/2022052618/554a0770b4c905507a8b55be/html5/thumbnails/40.jpg)
Real-life examples
![Page 41: MapReduce: teoria e prática](https://reader034.vdocuments.site/reader034/viewer/2022052618/554a0770b4c905507a8b55be/html5/thumbnails/41.jpg)
Target Categories
Objective: Find the most commonly viewed categories per user
Input:
● views and orders
Patterns used:
● simple aggregation
![Page 42: MapReduce: teoria e prática](https://reader034.vdocuments.site/reader034/viewer/2022052618/554a0770b4c905507a8b55be/html5/thumbnails/42.jpg)
Map input
zezin, fulano, [telefone, celulares, vivo]
zezin, fulano, [telefone, celulares, vivo]
zezin, fulano, [eletro, caos, furadeira]
lojaX, fulano, [livros, arte, anime]
lojaX, fulano, [livros, arte, anime]
lojaX, fulano, [livros, arte, anime]
Key: the first two fields of each record.
Sort + Shuffle
Reduce Input
(zezin, fulano)  [telefone, celulares, vivo], [telefone, celulares, vivo], [eletro, caos, furadeira]
(lojaX, fulano)  [livros, arte, anime], [livros, arte, anime], [livros, arte, anime]
![Page 46: MapReduce: teoria e prática](https://reader034.vdocuments.site/reader034/viewer/2022052618/554a0770b4c905507a8b55be/html5/thumbnails/46.jpg)
Reduce Output
(zezin, fulano)  ([telefone, celulares, vivo], 2), ([eletro, caos, furadeira], 1)
(lojaX, fulano)  ([livros, arte, anime], 3)
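A possible sketch of this "simple aggregation" pattern in plain Python. The function names and the in-memory grouping that stands in for sort + shuffle are assumptions made for illustration; the course's actual job may look different.

```python
from collections import Counter, defaultdict

def map_views(_, record):
    """Emit ((first field, user), categories) for each view record."""
    store, user, categories = record
    # Lists are not hashable, so the category path travels as a tuple.
    yield (store, user), tuple(categories)

def reduce_count_categories(key, category_paths):
    """Count how often each category path shows up for one key."""
    yield key, Counter(category_paths).most_common()

views = [
    ("zezin", "fulano", ["telefone", "celulares", "vivo"]),
    ("zezin", "fulano", ["telefone", "celulares", "vivo"]),
    ("zezin", "fulano", ["eletro", "caos", "furadeira"]),
    ("lojaX", "fulano", ["livros", "arte", "anime"]),
    ("lojaX", "fulano", ["livros", "arte", "anime"]),
    ("lojaX", "fulano", ["livros", "arte", "anime"]),
]

# Local stand-in for sort + shuffle: group map output by key.
grouped = defaultdict(list)
for line_no, record in enumerate(views):
    for key, value in map_views(line_no, record):
        grouped[key].append(value)

results = {}
for key in sorted(grouped):
    for out_key, counts in reduce_count_categories(key, grouped[key]):
        results[out_key] = counts
print(results)
```

`most_common()` returns the category paths ordered from most to least viewed, so the first element of each list is the user's top category.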
![Page 47: MapReduce: teoria e prática](https://reader034.vdocuments.site/reader034/viewer/2022052618/554a0770b4c905507a8b55be/html5/thumbnails/47.jpg)
Filter Expensive Categories
Objective: List all categories where a user purchased something expensive.
Input:
● Orders (for price and user information)
● Products (for category information)
Patterns used:
● merge using reducer
![Page 48: MapReduce: teoria e prática](https://reader034.vdocuments.site/reader034/viewer/2022052618/554a0770b4c905507a8b55be/html5/thumbnails/48.jpg)
Map Input

BuyOrders
lojaX  livro   fulano    R$ 20
lojaX  iphone  deltrano  R$ 1800

Products
lojaX  livro   [livros, arte, anime]
lojaX  iphone  [telefone, celulares, vivo]

We have to merge those tables above!
Common key: the first two fields of each row, which appear in both tables.

Map Output
Input row                                   Key              Value
lojaX  livro   fulano    R$ 20              (nothing emitted: it is cheap)
lojaX  iphone  deltrano  R$ 1800            (lojaX, iphone)  {"usuario": "deltrano"}
lojaX  livro   [livros, arte, anime]        (lojaX, livro)   {"cat": [livros, arte, anime]}
lojaX  iphone  [telefone, celulares, vivo]  (lojaX, iphone)  {"cat": [telefone, celulares, vivo]}

Reduce Input
Key              Values
(lojaX, livro)   {"cat": [livros, arte, anime]}
(lojaX, iphone)  {"usuario": "deltrano"}, {"cat": [telefone, celulares, vivo]}
The (lojaX, iphone) values are the parts we care about!

Reduce Output
(lojaX, deltrano)  [telefone, celulares, vivo]
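The "merge using reducer" pattern (a reduce-side join) could be sketched as follows. The `EXPENSIVE` threshold and the function names are hypothetical, since the slides do not say what counts as expensive, and the in-memory grouping again stands in for sort + shuffle.

```python
from collections import defaultdict

EXPENSIVE = 1000  # hypothetical threshold, in R$; the slides do not give one

def map_order(order):
    """Orders: emit the buyer only when the purchase is expensive."""
    store, product, user, price = order
    if price >= EXPENSIVE:
        yield (store, product), {"usuario": user}

def map_product(row):
    """Products: always emit the category list."""
    store, product, categories = row
    yield (store, product), {"cat": categories}

def reduce_join(key, values):
    """Pair every expensive buyer of a product with the product's categories."""
    store, _product = key
    users = [v["usuario"] for v in values if "usuario" in v]
    cats = [v["cat"] for v in values if "cat" in v]
    for user in users:
        for cat in cats:
            yield (store, user), cat

orders = [("lojaX", "livro", "fulano", 20),
          ("lojaX", "iphone", "deltrano", 1800)]
products = [("lojaX", "livro", ["livros", "arte", "anime"]),
            ("lojaX", "iphone", ["telefone", "celulares", "vivo"])]

# Local stand-in for sort + shuffle: group both map outputs by the common key.
grouped = defaultdict(list)
for order in orders:
    for key, value in map_order(order):
        grouped[key].append(value)
for row in products:
    for key, value in map_product(row):
        grouped[key].append(value)

joined = [out for key in sorted(grouped) for out in reduce_join(key, grouped[key])]
print(joined)  # [(('lojaX', 'deltrano'), ['telefone', 'celulares', 'vivo'])]
```

Tagging each value with its origin ("usuario" or "cat") is what lets the reducer tell the two tables apart once they arrive mixed under the same key.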
![Page 55: MapReduce: teoria e prática](https://reader034.vdocuments.site/reader034/viewer/2022052618/554a0770b4c905507a8b55be/html5/thumbnails/55.jpg)
Real Datasets
![Page 56: MapReduce: teoria e prática](https://reader034.vdocuments.site/reader034/viewer/2022052618/554a0770b4c905507a8b55be/html5/thumbnails/56.jpg)
Real datasets, real problems
In the following hour we will write code to analyse some real datasets:
● Twitter Dataset (from an article published in WWW'10)
● LastFM Dataset, from The Million Song Dataset
Supporting code
● available at GitHub, under https://github.com/chaordic/mr101ufcg
● comes with sample data (under data) for local runs.
![Page 57: MapReduce: teoria e prática](https://reader034.vdocuments.site/reader034/viewer/2022052618/554a0770b4c905507a8b55be/html5/thumbnails/57.jpg)
Twitter Followers Dataset
A somewhat big dataset
● 41.7 million profiles
● 1.47 billion social relations (who follows whom)
● 25 GB of uncompressed data
Available at s3://mr101ufcg/data/twitter/ ...
● splitted/*.gz
  full dataset split into small compressed files
● numeric2screen.txt
  numeric id to original screen name mapping
● followed_by.txt
  original 25 GB dataset as a single file
![Page 58: MapReduce: teoria e prática](https://reader034.vdocuments.site/reader034/viewer/2022052618/554a0770b4c905507a8b55be/html5/thumbnails/58.jpg)
Twitter Followers Dataset
Each line in followed_by.txt has the following format:
user_id \t follower_id
For instance:
12 \t 38
12 \t 41
13 \t 47
13 \t 52
13 \t 53
14 \t 56
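As a warm-up for the exercises, here is a rough sketch of how lines in this format can be parsed and aggregated into a follower count per user. It assumes each line holds exactly two integers separated by a single tab character; the function names are made up, and the grouping loop is a local stand-in for the shuffle.

```python
from collections import defaultdict

def map_follower(_, line):
    """Emit (user_id, 1) for every 'user_id <tab> follower_id' line."""
    user_id, _follower_id = line.split("\t")
    yield int(user_id), 1

def reduce_count(user_id, ones):
    yield user_id, sum(ones)

lines = ["12\t38", "12\t41", "13\t47", "13\t52", "13\t53", "14\t56"]

# Local stand-in for sort + shuffle: group the 1s by user id.
grouped = defaultdict(list)
for line_no, line in enumerate(lines):
    for key, value in map_follower(line_no, line):
        grouped[key].append(value)

followers = dict(pair for key in grouped for pair in reduce_count(key, grouped[key]))
print(followers)  # {12: 2, 13: 3, 14: 1}
```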
![Page 59: MapReduce: teoria e prática](https://reader034.vdocuments.site/reader034/viewer/2022052618/554a0770b4c905507a8b55be/html5/thumbnails/59.jpg)
Million Song Dataset project's Last.fm Dataset
A not-so-big dataset
● 943,347 tracks
● 1.2 GB of compressed data
Yeah, it is not all that big...
Available at s3://mr101ufcg/data/lastfm/ ...
● metadata/*.gz
  Track metadata information, in JSONProtocol format.
● similars/*.gz
  Track similarity information, in JSONProtocol format.
![Page 60: MapReduce: teoria e prática](https://reader034.vdocuments.site/reader034/viewer/2022052618/554a0770b4c905507a8b55be/html5/thumbnails/60.jpg)
Million Song Dataset project's Last.fm Dataset
JSONProtocol encodes key-value pair information in a single line, using JSON-encoded values separated by a tab character ( \t ).
<JSON encoded data> \t <JSON encoded data>
Example line:
"TRACHOZ12903CCA8B3" \t {"timestamp": "2011-09-07 22:12:47.150438", "track_id": "TRACHOZ12903CCA8B3", "tags": [], "title": "Close Up", "artist": "Charles Williams"}
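A line in this format can be decoded by hand with the standard `json` module, as sketched below. `parse_json_protocol` is a hypothetical helper written for this example, and it assumes the two JSON documents are separated by a single literal tab; inside a MrJob job the framework does this decoding for you.

```python
import json

def parse_json_protocol(line):
    """Split a 'JSON <tab> JSON' line into a decoded (key, value) pair."""
    # JSON strings cannot contain a raw tab, so the first tab is the separator.
    raw_key, raw_value = line.rstrip("\n").split("\t", 1)
    return json.loads(raw_key), json.loads(raw_value)

line = ('"TRACHOZ12903CCA8B3"\t{"timestamp": "2011-09-07 22:12:47.150438", '
        '"track_id": "TRACHOZ12903CCA8B3", "tags": [], '
        '"title": "Close Up", "artist": "Charles Williams"}')

key, value = parse_json_protocol(line)
print(key, value["title"], value["artist"])
```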
![Page 61: MapReduce: teoria e prática](https://reader034.vdocuments.site/reader034/viewer/2022052618/554a0770b4c905507a8b55be/html5/thumbnails/61.jpg)
Questions?
![Page 62: MapReduce: teoria e prática](https://reader034.vdocuments.site/reader034/viewer/2022052618/554a0770b4c905507a8b55be/html5/thumbnails/62.jpg)
Stuff I didn't talk about but that is sorta cool
Persistent jobs
Serialization (protocols in MrJob parlance)
Amazon EMR Console
Hadoop dashboard (and port 9100)
![Page 63: MapReduce: teoria e prática](https://reader034.vdocuments.site/reader034/viewer/2022052618/554a0770b4c905507a8b55be/html5/thumbnails/63.jpg)
Combiners
They are just like reducers, but run right after a Map and just before data is sent over the network during shuffle.
Combiners must...
● be associative: a.(b.c) == (a.b).c
● be commutative: a.b == b.a
● have the same input and output types as your Map output type.
Caveats:
● Combiners can be executed zero, one or many times, so don't make your MR depend on them.
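The caveat can be checked with a small experiment: because summing is associative and commutative, running the combiner zero, one or several times on each map task's output leaves the final result unchanged. This is only a sketch; `sum_counts`, `per_mapper` and `run` are names invented for the example.

```python
def sum_counts(word, counts):
    """Usable both as combiner and as reducer: same input and output types."""
    yield word, sum(counts)

# Values emitted for the key 'me' by two hypothetical map tasks.
per_mapper = [[1, 1, 1], [1]]

def run(combiner_passes):
    """Apply the combiner `combiner_passes` times per map task, then reduce."""
    shuffled = []
    for values in per_mapper:
        for _ in range(combiner_passes):
            values = [v for _, v in sum_counts("me", values)]
        shuffled.extend(values)
    return list(sum_counts("me", shuffled))

for passes in (0, 1, 2):
    print(passes, run(passes))
```

With one or more combiner passes only two numbers cross the "network" instead of four, yet every run prints the same total of 4 for 'me'.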
![Page 64: MapReduce: teoria e prática](https://reader034.vdocuments.site/reader034/viewer/2022052618/554a0770b4c905507a8b55be/html5/thumbnails/64.jpg)
Reference & Further reading
[1] MapReduce: A Crash Course
[2] StackOverflow: The python yield keyword explained
[3] Explicando iterables, generators e yield no python
[4] MapReduce: Simplified Data Processing on Large Clusters
![Page 65: MapReduce: teoria e prática](https://reader034.vdocuments.site/reader034/viewer/2022052618/554a0770b4c905507a8b55be/html5/thumbnails/65.jpg)
Reference & Further reading
[5] MrJob 4.0 - Quick start
[6] Amazon EC2 Instance Types
![Page 66: MapReduce: teoria e prática](https://reader034.vdocuments.site/reader034/viewer/2022052618/554a0770b4c905507a8b55be/html5/thumbnails/66.jpg)
Life beyond MapReduce
What about reading up on other frameworks for distributed processing with Big Data?
● Spark
● Storm
● GraphLab
And don't get me started on NoSQL...
![Page 67: MapReduce: teoria e prática](https://reader034.vdocuments.site/reader034/viewer/2022052618/554a0770b4c905507a8b55be/html5/thumbnails/67.jpg)
Many thanks to...
for supporting this course.
You know there will be some live, intense, groovy Elastic MapReduce action right after this presentation, right?
![Page 69: MapReduce: teoria e prática](https://reader034.vdocuments.site/reader034/viewer/2022052618/554a0770b4c905507a8b55be/html5/thumbnails/69.jpg)
![Page 70: MapReduce: teoria e prática](https://reader034.vdocuments.site/reader034/viewer/2022052618/554a0770b4c905507a8b55be/html5/thumbnails/70.jpg)
So, let's write some code?
Twitter Dataset
● Count how many followers each user has
● Discover the user with the most followers
● What if I want the top-N most followed?
LastFM
● Merge similarity and metadata for tracks
● What is the most "plain" song?
● What is the plainest rock song, considering only rock songs?
![Page 71: MapReduce: teoria e prática](https://reader034.vdocuments.site/reader034/viewer/2022052618/554a0770b4c905507a8b55be/html5/thumbnails/71.jpg)
Extra slides