MapReduce and Hadoop
Course on Systems and Architectures for Big Data, Academic Year 2016/17
Valeria Cardellini
Università degli Studi di Roma “Tor Vergata”
The reference Big Data stack
[Figure: layered stack, bottom to top: Resource Management, Data Storage, Data Processing, High-level Interfaces; Support / Integration alongside]
The Big Data landscape
The open-source landscape for Big Data
Processing Big Data
• Some frameworks and platforms to process Big Data:
  – Relational DBMS: old-fashioned, no longer applicable at this scale
  – MapReduce & Hadoop: store and process data sets at massive scale (especially Volume + Variety), also known as batch processing
  – Data stream processing: process fast data (in real time) as the data is being generated, without storing it
MapReduce
Parallel programming: background
• Parallel programming
  – Break processing into parts that can be executed concurrently on multiple processors
• Challenge
  – Identify tasks that can run concurrently and/or groups of data that can be processed concurrently
  – Not all problems can be parallelized!
Parallel programming: background
• Simplest environment for parallel programming: no dependency among data
  – Data can be split into equal-size chunks
  – Each process can work on a chunk
  – Master/worker approach
• Master:
  – Initializes the array and splits it according to the number of workers
  – Sends each worker its sub-array
  – Receives the results from each worker
• Worker:
  – Receives a sub-array from the master
  – Performs processing
  – Sends results to the master
• Single Program, Multiple Data (SPMD): technique to achieve parallelism
  – The most common style of parallel programming (a minimal sketch follows)
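To make the master/worker pattern concrete, here is a minimal Java sketch (illustrative only, not from the slides): the master initializes an array, splits it into equal-size chunks among a pool of worker threads, each worker sums its chunk, and the master combines the partial results.

    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.List;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;

    public class MasterWorkerSum {
        public static void main(String[] args) throws Exception {
            int[] data = new int[1_000_000];
            Arrays.fill(data, 1);               // master initializes the array
            int workers = 4;
            int chunk = data.length / workers;  // equal-size chunks
            ExecutorService pool = Executors.newFixedThreadPool(workers);
            List<Future<Long>> partials = new ArrayList<>();
            for (int w = 0; w < workers; w++) { // master sends each worker its sub-array
                final int from = w * chunk;
                final int to = (w == workers - 1) ? data.length : from + chunk;
                partials.add(pool.submit(() -> {
                    long sum = 0;               // worker processes its chunk...
                    for (int i = from; i < to; i++) sum += data[i];
                    return sum;                 // ...and sends the result back
                }));
            }
            long total = 0;
            for (Future<Long> f : partials) total += f.get(); // master combines results
            pool.shutdown();
            System.out.println(total);          // prints 1000000
        }
    }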
Key idea behind MapReduce: divide and conquer
• A feasible approach to tackling large-data problems
  – Partition a large problem into smaller sub-problems
  – Independent sub-problems are executed in parallel
  – Combine intermediate results from each individual worker
• The workers can be:
  – Threads in a processor core
  – Cores in a multi-core processor
  – Multiple processors in a machine
  – Many machines in a cluster
• Implementation details of divide and conquer are complex
Divide and conquer: how?
• Decompose the original problem into smaller, parallel tasks
• Schedule tasks on workers distributed in a cluster, taking into account:
  – Data locality
  – Resource availability
• Ensure workers get the data they need
• Coordinate synchronization among workers
• Share partial results
• Handle failures
Key idea behind MapReduce: scale out, not up!
• For data-intensive workloads, a large number of commodity servers is preferred over a small number of high-end servers
  – The cost of super-computers is not linear
  – Datacenter efficiency is a difficult problem to solve, but there have been recent improvements
• Processing data is quick, I/O is very slow
• Sharing vs. shared nothing:
  – Sharing: manage a common/global state
  – Shared nothing: independent entities, no common state
• Sharing is difficult:
  – Synchronization, deadlocks
  – Finite bandwidth to access data from a SAN
  – Temporal dependencies are complicated (restarts)
MapReduce
• Programming model for processing huge data sets over thousands of servers
  – Originally proposed by Google in 2004: “MapReduce: simplified data processing on large clusters” http://bit.ly/2iq7jlY
  – Based on a shared-nothing approach
• Also an associated implementation (framework) of the distributed system that runs the corresponding programs
• Some examples of applications at Google:
  – Web indexing
  – Reverse Web-link graph
  – Distributed sort
  – Web access statistics
MapReduce: programmer view
• MapReduce hides system-level details
  – Key idea: separate the what from the how
  – MapReduce abstracts away the “distributed” part of the system
  – Such details are handled by the framework
• Programmers get a simple API and don’t have to worry about handling:
  – Parallelization
  – Data distribution
  – Load balancing
  – Fault tolerance
Typical Big Data problem
• Iterate over a large number of records
• Extract something of interest from each record
• Shuffle and sort intermediate results
• Aggregate intermediate results
• Generate final output

Key idea: provide a functional abstraction of the two operations, Map and Reduce
MapReduce: model
• Processing occurs in two phases: Map and Reduce
  – Functional programming roots (e.g., Lisp)
• Map and Reduce are defined by the programmer
• Input and output: sets of key-value pairs
• Programmers specify two functions, map and reduce (see the Java correspondence below):
  map(k1, v1) → [(k2, v2)]
  reduce(k2, [v2]) → [(k3, v3)]
  – (k, v) denotes a (key, value) pair
  – […] denotes a list
  – Keys do not have to be unique: different pairs can have the same key
  – Normally the keys of input elements are not relevant
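The abstract signatures above correspond directly to the generic type parameters of Hadoop's Java API; a minimal sketch of the correspondence (assuming the org.apache.hadoop.mapreduce classes):

    // map(k1, v1) → [(k2, v2)] becomes a Mapper<K1, V1, K2, V2> whose method
    //     void map(K1 key, V1 value, Context context)
    // emits each (k2, v2) pair via context.write(k2, v2).
    //
    // reduce(k2, [v2]) → [(k3, v3)] becomes a Reducer<K2, V2, K3, V3> whose method
    //     void reduce(K2 key, Iterable<V2> values, Context context)
    // emits each (k3, v3) pair via context.write(k3, v3).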
Map
• Execute a function on a set of key-value pairs (input shard) to create a new list of values:
  map(in_key, in_value) → list(out_key, intermediate_value)
• Map calls are distributed across machines by automatically partitioning the input data into M shards
• The MapReduce library groups together all intermediate values associated with the same intermediate key and passes them to the Reduce function
Reduce
• Combine values in sets to create a new value:
  reduce(out_key, list(intermediate_value)) → list(out_value)
Your first MapReduce example
• Example: sum-of-squares (sum the squares of the numbers from 1 to n) in MapReduce fashion
• Map function: map square [1,2,3,4,5] returns [1,4,9,16,25]
• Reduce function: reduce [1,4,9,16,25] returns 55 (the sum of the squared elements)
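The same structure can be expressed with Java streams; a minimal single-machine sketch (illustrative only, no Hadoop involved):

    import java.util.stream.IntStream;

    public class SumOfSquares {
        public static void main(String[] args) {
            int n = 5;
            int sum = IntStream.rangeClosed(1, n) // [1, 2, 3, 4, 5]
                               .map(x -> x * x)   // map: square each element → [1, 4, 9, 16, 25]
                               .sum();            // reduce: add up the squares
            System.out.println(sum);              // prints 55
        }
    }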
MapReduce program
• A MapReduce program, referred to as a job, consists of:
  – Code for Map and Reduce packaged together
  – Configuration parameters (where the input lies, where the output should be stored)
  – The input data set, stored on the underlying distributed file system
    • The input will not fit on a single computer’s disk
• Each MapReduce job is divided by the system into smaller units called tasks:
  – Map tasks
  – Reduce tasks
• The output of a MapReduce job is also stored on the underlying distributed file system
MapReduce computation
1. Some number of Map tasks are each given one or more chunks of data from a distributed file system.
2. These Map tasks turn the chunks into a sequence of key-value pairs.
   – The way key-value pairs are produced from the input data is determined by the code written by the user for the Map function.
3. The key-value pairs from each Map task are collected by a master controller and sorted by key.
4. The keys are divided among all the Reduce tasks, so all key-value pairs with the same key wind up at the same Reduce task (see the sketch after this list).
5. The Reduce tasks work on one key at a time, and combine all the values associated with that key in some way.
   – The manner of combination of values is determined by the code written by the user for the Reduce function.
6. Output key-value pairs from each reducer are written persistently back onto the distributed file system.
7. The output ends up in r files, where r is the number of reducers.
   – Such output may be the input to a subsequent MapReduce phase.
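In Hadoop, the key-to-reducer assignment of step 4 is carried out by a partitioner; the default HashPartitioner boils down to the following (the sign bit of the hash is masked off so the result is a valid reducer index):

    // org.apache.hadoop.mapreduce.lib.partition.HashPartitioner
    public int getPartition(K key, V value, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }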
Where the magic happens
• Implicit between the map and reduce phases is a distributed “group by” operation on intermediate keys
  – Intermediate data arrive at each reducer in order, sorted by key
  – No ordering is guaranteed across reducers
• Intermediate keys are transient:
  – They are not stored on the distributed file system
  – They are “spilled” to the local disk of each machine in the cluster
[Figure: input splits of (k, v) pairs → Map function → intermediate (k’, v’) pairs → Reduce function → final (k’’, v’’) output pairs]
MapReduce computation: the complete picture
A simplified view of MapReduce: example
• Mappers are applied to all input key-value pairs to generate an arbitrary number of intermediate pairs
• Reducers are applied to all intermediate values associated with the same intermediate key
• Between the map and reduce phases lies a barrier that involves a large distributed sort and group by
“Hello World” in MapReduce: WordCount
• Problem: count the number of occurrences of each word in a large collection of documents
• Input: a repository of documents; each document is an element
• Map: read a document and emit a sequence of key-value pairs where keys are words of the document and values are equal to 1:
  (w1, 1), (w2, 1), …, (wn, 1)
• Shuffle and sort: group by key and generate pairs of the form
  (w1, [1, 1, …, 1]), …, (wn, [1, 1, …, 1])
• Reduce: add up all the values and emit (w1, k), …, (wn, l)
• Output: (w, m) pairs, where w is a word that appears at least once among all the input documents and m is the total number of occurrences of w among all those documents (a worked trace follows)
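To make the data flow concrete, here is a tiny hypothetical trace over two one-line documents, “the quick fox” and “the fox”:

  Map output:        (the,1) (quick,1) (fox,1)  and  (the,1) (fox,1)
  Shuffle and sort:  (fox,[1,1]) (quick,[1]) (the,[1,1])
  Reduce output:     (fox,2) (quick,1) (the,2)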
WordCount in practice
[Figure: WordCount data flow through the Map, Shuffle and Reduce stages]
WordCount: Map
• Map emits each word in the document with an associated value equal to “1”

  Map(String key, String value):
    // key: document name
    // value: document contents
    for each word w in value:
      EmitIntermediate(w, “1”);
WordCount: Reduce
• Reduce adds up all the “1” emitted for a given word

  Reduce(String key, Iterator values):
    // key: a word
    // values: a list of counts
    int result = 0;
    for each v in values:
      result += ParseInt(v);
    Emit(AsString(result));

• This is pseudo-code; for the complete code of the example see the MapReduce paper (a hedged Java sketch follows below)
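A hedged Java version of the two functions, modeled on the standard WordCount example that ships with Hadoop (class and field names follow that example; the surrounding job setup is shown later):

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class WordCount {

        // Map: emit (word, 1) for every word in the input value
        public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
            private final static IntWritable one = new IntWritable(1);
            private Text word = new Text();

            public void map(Object key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer itr = new StringTokenizer(value.toString());
                while (itr.hasMoreTokens()) {
                    word.set(itr.nextToken());
                    context.write(word, one);
                }
            }
        }

        // Reduce: sum the counts received for each word and emit (word, total)
        public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            private IntWritable result = new IntWritable();

            public void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable val : values) {
                    sum += val.get();
                }
                result.set(sum);
                context.write(key, result);
            }
        }
    }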
Another example: WordLengthCount
• Problem: count how many words of certain lengths exist in a collection of documents
• Input: a repository of documents; each document is an element
• Map: read a document and emit a sequence of key-value pairs where the key is the length of a word and the value is the word itself:
  (i, w1), …, (j, wn)
• Shuffle and sort: group by key and generate pairs of the form
  (1, [w1, …, wk]), …, (n, [wr, …, ws])
• Reduce: count the number of words in each list and emit:
  (1, l), …, (p, m)
• Output: (l, n) pairs, where l is a length and n is the total number of words of length l in the input documents (see the sketch below)
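A hedged Java sketch of WordLengthCount, using the same Hadoop API as the WordCount example above (class names here are hypothetical):

    // Map: emit (length of word, word) for each word in the document
    public static class LengthMapper extends Mapper<Object, Text, IntWritable, Text> {
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String w : value.toString().split("\\s+")) {
                if (!w.isEmpty())
                    context.write(new IntWritable(w.length()), new Text(w));
            }
        }
    }

    // Reduce: count how many words share a given length and emit (length, count)
    public static class LengthCountReducer
            extends Reducer<IntWritable, Text, IntWritable, IntWritable> {
        public void reduce(IntWritable length, Iterable<Text> words, Context context)
                throws IOException, InterruptedException {
            int count = 0;
            for (Text w : words) count++;
            context.write(length, new IntWritable(count));
        }
    }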
MapReduce: execution overview
Apache Hadoop
What is Apache Hadoop?
• Open-source software framework for reliable, scalable, distributed data-intensive computing
  – Originally developed by Yahoo!
• Goal: storage and processing of data sets at massive scale
• Infrastructure: clusters of commodity hardware
• Core components:
  – HDFS: Hadoop Distributed File System
  – Hadoop YARN
  – Hadoop MapReduce
• Includes a number of related projects
  – Among which Apache Pig, Apache Hive, Apache HBase
• Used in production by Facebook, IBM, LinkedIn, Twitter, Yahoo! and many others
Hadoop runs on clusters
• Compute nodes are stored on racks
  – 8-64 compute nodes per rack
• There can be many racks of compute nodes
• The nodes on a single rack are connected by a network, typically gigabit Ethernet
• Racks are connected by another level of network or a switch
• The bandwidth of intra-rack communication is usually much greater than that of inter-rack communication
• Compute nodes can fail! Solution:
  – Files are stored redundantly
  – Computations are divided into tasks
Hadoop runs on commodity hardware
• Pros
  – You are not tied to expensive, proprietary offerings from a single vendor
  – You can choose standardized, commonly available hardware from any of a large range of vendors to build your cluster
  – Scale out, not up
• Cons
  – Significant failure rates
  – Solution: replication!
Commodity hardware
• A typical specification of commodity hardware:
  – Processor: 2 quad-core 2-2.5 GHz CPUs
  – Memory: 16-24 GB ECC RAM
  – Storage: 4 × 1 TB SATA disks
  – Network: Gigabit Ethernet
• Yahoo! has a huge installation of Hadoop:
  – > 100,000 CPUs in > 40,000 computers
  – Used to support research for Ad Systems and Web Search
  – Also used to do scaling tests to support development of Hadoop
Hadoop core
• HDFS
  ✓ Already examined
• Hadoop YARN
  ✓ Already examined
• Hadoop MapReduce
  – System to write applications that process large data sets in parallel on large clusters (> 1000 nodes) of commodity hardware in a reliable, fault-tolerant manner
Hadoop core: MapReduce
• Programming paradigm for expressing distributed computations over multiple servers
• The powerhouse behind most of today’s big data processing
• Also used in other MPP environments and NoSQL databases (e.g., Vertica and MongoDB)
• Improving Hadoop usage:
  – Improving programmability: Pig and Hive
  – Improving data access: HBase, Sqoop and Flume
MapReduce data flow: no Reduce task
MapReduce data flow: single Reduce task
MapReduce data flow: multiple Reduce tasks
Optional Combine task
• Many MapReduce jobs are limited by the cluster bandwidth
  – How to minimize the data transferred between map and reduce tasks?
• Use a Combine task
  – An optional localized reducer that applies a user-provided method
  – Performs data aggregation on intermediate data of the same key for the Map task’s output before transmitting the result to the Reduce task
• Takes each key-value pair from the Map phase, processes it, and produces the output as key-value collection pairs (see the snippet below)
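In the Java API, a combiner is configured on the job; for WordCount the reducer class itself can double as the combiner, since summing partial counts is associative and commutative. A hedged one-liner, reusing the IntSumReducer class from the WordCount sketch shown earlier:

    job.setCombinerClass(IntSumReducer.class); // run a local reduce on each map task’s output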
Programming languages for Hadoop
• Java is the default programming language
• In Java you write a program with at least three parts (a hedged driver sketch follows):
  1. A main method which configures the job and launches it
     • Set number of reducers
     • Set mapper and reducer classes
     • Set partitioner
     • Set other Hadoop configurations
  2. A Mapper class
     • Takes (k, v) inputs, writes (k, v) outputs
  3. A Reducer class
     • Takes (k, Iterator[v]) inputs, writes (k, v) outputs
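A minimal sketch of part 1, the driver, again modeled on the standard WordCount example (input and output paths are assumed to come from the command line):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count"); // configure the job
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);     // set the mapper class
        job.setCombinerClass(IntSumReducer.class);     // optional combiner
        job.setReducerClass(IntSumReducer.class);      // set the reducer class
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1); // launch and wait
    }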
Programming languages for Hadoop
• Hadoop Streaming API to write map and reduce functions in programming languages other than Java
• Uses Unix standard streams as the interface between Hadoop and the MapReduce program
• Allows using any language (e.g., Python) that can read standard input (stdin) and write to standard output (stdout) to write the MapReduce program
  – Standard input is the stream of data going into a program; by default it comes from the keyboard
  – Standard output is the stream where a program writes its output data; by default it is the text terminal
  – The reducer interface for streaming is different than in Java: instead of receiving reduce(k, Iterator[v]), your script is sent one line per value, including the key (illustrated below)
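For instance, a streaming reducer for WordCount would read key-sorted, tab-separated lines on stdin such as the following (hypothetical input), and would itself have to detect where one key ends and the next begins while accumulating the counts:

  fox     1
  fox     1
  quick   1
  the     1
  the     1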
WordCount on Hadoop
• Let’s analyze the WordCount code on Hadoop
• The code in Java: http://bit.ly/1m9xHym
• The same code in Python
  – Use the Hadoop Streaming API for passing data between the Map and Reduce code via standard input and standard output
  – See Michael Noll’s tutorial http://bit.ly/19YutB1
Hadoop installation
• Some straightforward alternatives:
  – A tutorial on manual single-node installation to run Hadoop on Ubuntu Linux: http://bit.ly/1eMFVFm
  – Cloudera CDH, Hortonworks HDP, MapR
    • Integrated Hadoop-based stacks containing all the components needed for production, tested and packaged to work together
    • Include at least Hadoop, YARN, HDFS (not in MapR), HBase, Hive, Pig, Sqoop, Flume, ZooKeeper, Kafka, Spark
Hadoop installation
• Cloudera CDH
  – See QuickStarts http://bit.ly/2oMHob4
  – Single-node cluster: requires a previously installed hypervisor (VMware, KVM, or VirtualBox)
  – Multi-node cluster: requires a previously installed Docker (container technology, https://www.docker.com)
  – QuickStarts are not for use in production
• Hortonworks HDP
  – Requires 6.5 GB of disk space and a Linux distribution
  – Can be installed either manually or using Ambari (an open-source management platform for Hadoop)
  – Hortonworks Sandbox on a VM
    • Requires either VirtualBox, VMware or Docker
• MapR
  – Uses its own distributed file system rather than HDFS
Recall the YARN architecture
• A global ResourceManager (RM)
• A set of per-application ApplicationMasters (AMs)
• A set of NodeManagers (NMs)
YARN job execution
Data locality optimization
• Hadoop tries to run each map task on a data-local node (data locality optimization)
  – So as not to use cluster bandwidth
  – Otherwise rack-local
  – Off-rack as a last choice
Hadoop configuration
• Tuning Hadoop clusters for good performance is somewhat of a dark art
  – It involves the HW systems, the SW systems (including the OS), and the tuning/optimization of Hadoop infrastructure components, e.g.:
    • Optimal number of disks that maximizes I/O bandwidth
    • BIOS parameters
    • File system type (on Linux, EXT4 is better)
    • Open file handles
    • Optimal HDFS block size
    • JVM settings for Java heap usage and garbage collection
• See http://bit.ly/2nT7TIn for some detailed hints
How many mappers and reducers?
• Can be set in a configuration file (JobConf file) or on the command line at execution time
  – For mappers, it is just a hint
• Number of mappers
  – Driven by the number of HDFS blocks in the input files
  – You can adjust the HDFS block size to adjust the number of maps
• Number of reducers
  – Can be user-defined (default is 1)
• See http://bit.ly/2oK0D5A
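A hedged sketch of how these knobs are typically set in the Java API (property names assumed from recent Hadoop versions):

    job.setNumReduceTasks(4);                          // number of reducers: user-defined
    conf.setInt("mapreduce.job.maps", 100);            // number of mappers: only a hint
    conf.setLong("dfs.blocksize", 256L * 1024 * 1024); // larger HDFS blocks → fewer map tasks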
Hadoop in the Cloud
• Pros:
  – Gain Cloud scalability and elasticity
  – No need to manage and provision the infrastructure and the platform
• Main challenges:
  – Moving data to the Cloud
    • Latency is not zero (because of the speed of light)!
    • Minor issue: network bandwidth
  – Data security and privacy
Amazon Elastic MapReduce (EMR)
• Distributes the computational work across a cluster of virtual servers running in the Amazon cloud (EC2 instances)
• Cluster managed with Hadoop
• Input and output: Amazon S3, DynamoDB
• Not only Hadoop: also Hive, Pig and Spark
Create an EMR cluster
Steps:
1. Provide the cluster name
2. Select the EMR release and the applications to install
3. Select the instance type
4. Select the number of instances
5. Select the EC2 key pair to connect securely to the cluster
See http://amzn.to/2nnujp5
EMR cluster details
Hadoop on Google Cloud Platform
• Distributes the computational work across a cluster of virtual servers running in the Google Cloud Platform
• Cluster managed with Hadoop
• Input and output: Google Cloud Storage, HDFS
• Not only Hadoop: also Spark
References
• Dean and Ghemawat, “MapReduce: Simplified Data Processing on Large Clusters”, OSDI 2004. http://static.googleusercontent.com/media/research.google.com/it//archive/mapreduce-osdi04.pdf
• White, “Hadoop: The Definitive Guide”, 4th edition, O’Reilly, 2015.
• Miner and Shook, “MapReduce Design Patterns”, O’Reilly, 2012.
• Leskovec, Rajaraman, and Ullman, “Mining of Massive Datasets”, Chapter 2, 2014. http://www.mmds.org