MapReduce and Hadoop
Course on Systems and Architectures for Big Data, Academic Year 2016/17
Valeria Cardellini
Università degli Studi di Roma “Tor Vergata”
The reference Big Data stack
[Figure: layered stack, bottom to top: Resource Management, Data Storage, Data Processing, High-level Interfaces; Support / Integration alongside]
The Big Data landscape
The open-source landscape for Big Data
Processing Big Data
• Some frameworks and platforms to process Big Data:
  – Relational DBMS: old-fashioned, no longer applicable at this scale
  – MapReduce & Hadoop: store and process data sets at massive scale (especially Volume + Variety), also known as batch processing
  – Data stream processing: process fast data (in real time) as the data is being generated, without storing it
MapReduce
Parallel programming: background
• Parallel programming
  – Break processing into parts that can be executed concurrently on multiple processors
• Challenge
  – Identify tasks that can run concurrently and/or groups of data that can be processed concurrently
  – Not all problems can be parallelized!
Parallel programming: background
• Simplest environment for parallel programming: no dependency among data
  – Data can be split into equal-size chunks
  – Each process can work on a chunk
  – Master/worker approach
• Master:
  – Initializes the array and splits it according to the number of workers
  – Sends each worker its sub-array
  – Receives the results from each worker
• Worker:
  – Receives a sub-array from the master
  – Performs processing
  – Sends results to the master
• Single Program, Multiple Data (SPMD): technique to achieve parallelism
  – The most common style of parallel programming (a minimal sketch follows)
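To make the master/worker pattern concrete, here is a minimal Java sketch (illustrative only, not from the slides): the master initializes an array, splits it into equal-size chunks among a pool of worker threads, each worker sums its chunk, and the master combines the partial results.

    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.List;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;

    public class MasterWorkerSum {
        public static void main(String[] args) throws Exception {
            int[] data = new int[1_000_000];
            Arrays.fill(data, 1);               // master initializes the array
            int workers = 4;
            int chunk = data.length / workers;  // equal-size chunks
            ExecutorService pool = Executors.newFixedThreadPool(workers);
            List<Future<Long>> partials = new ArrayList<>();
            for (int w = 0; w < workers; w++) { // master sends each worker its sub-array
                final int from = w * chunk;
                final int to = (w == workers - 1) ? data.length : from + chunk;
                partials.add(pool.submit(() -> {
                    long sum = 0;               // worker processes its chunk...
                    for (int i = from; i < to; i++) sum += data[i];
                    return sum;                 // ...and sends the result back
                }));
            }
            long total = 0;
            for (Future<Long> f : partials) total += f.get(); // master combines results
            pool.shutdown();
            System.out.println(total);          // prints 1000000
        }
    }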
Key idea behind MapReduce: divide and conquer
• A feasible approach to tackling large-data problems
  – Partition a large problem into smaller sub-problems
  – Independent sub-problems are executed in parallel
  – Combine intermediate results from each individual worker
• The workers can be:
  – Threads in a processor core
  – Cores in a multi-core processor
  – Multiple processors in a machine
  – Many machines in a cluster
• Implementation details of divide and conquer are complex
Divide and conquer: how?
• Decompose the original problem into smaller, parallel tasks
• Schedule tasks on workers distributed in a cluster, taking into account:
  – Data locality
  – Resource availability
• Ensure workers get the data they need
• Coordinate synchronization among workers
• Share partial results
• Handle failures
Key idea behind MapReduce: scale out, not up!
• For data-intensive workloads, a large number of commodity servers is preferred over a small number of high-end servers
  – The cost of super-computers is not linear
  – Datacenter efficiency is a difficult problem to solve, but there have been recent improvements
• Processing data is quick, I/O is very slow
• Sharing vs. shared nothing:
  – Sharing: manage a common/global state
  – Shared nothing: independent entities, no common state
• Sharing is difficult:
  – Synchronization, deadlocks
  – Finite bandwidth to access data from a SAN
  – Temporal dependencies are complicated (restarts)
MapReduce
• Programming model for processing huge data sets over thousands of servers
  – Originally proposed by Google in 2004: “MapReduce: simplified data processing on large clusters” http://bit.ly/2iq7jlY
  – Based on a shared-nothing approach
• Also an associated implementation (framework) of the distributed system that runs the corresponding programs
• Some examples of applications at Google:
  – Web indexing
  – Reverse Web-link graph
  – Distributed sort
  – Web access statistics
MapReduce: programmer view
• MapReduce hides system-level details
  – Key idea: separate the what from the how
  – MapReduce abstracts away the “distributed” part of the system
  – Such details are handled by the framework
• Programmers get a simple API and don’t have to worry about handling:
  – Parallelization
  – Data distribution
  – Load balancing
  – Fault tolerance
Typical Big Data problem
• Iterate over a large number of records
• Extract something of interest from each record
• Shuffle and sort intermediate results
• Aggregate intermediate results
• Generate final output

Key idea: provide a functional abstraction of the two operations, Map and Reduce
MapReduce: model
• Processing occurs in two phases: Map and Reduce
  – Functional programming roots (e.g., Lisp)
• Map and Reduce are defined by the programmer
• Input and output: sets of key-value pairs
• Programmers specify two functions, map and reduce (see the Java correspondence below):
  map(k1, v1) → [(k2, v2)]
  reduce(k2, [v2]) → [(k3, v3)]
  – (k, v) denotes a (key, value) pair
  – […] denotes a list
  – Keys do not have to be unique: different pairs can have the same key
  – Normally the keys of input elements are not relevant
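The abstract signatures above correspond directly to the generic type parameters of Hadoop's Java API; a minimal sketch of the correspondence (assuming the org.apache.hadoop.mapreduce classes):

    // map(k1, v1) → [(k2, v2)] becomes a Mapper<K1, V1, K2, V2> whose method
    //     void map(K1 key, V1 value, Context context)
    // emits each (k2, v2) pair via context.write(k2, v2).
    //
    // reduce(k2, [v2]) → [(k3, v3)] becomes a Reducer<K2, V2, K3, V3> whose method
    //     void reduce(K2 key, Iterable<V2> values, Context context)
    // emits each (k3, v3) pair via context.write(k3, v3).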
Map
• Execute a function on a set of key-value pairs (input shard) to create a new list of values:
  map(in_key, in_value) → list(out_key, intermediate_value)
• Map calls are distributed across machines by automatically partitioning the input data into M shards
• The MapReduce library groups together all intermediate values associated with the same intermediate key and passes them to the Reduce function
Reduce
• Combine values in sets to create a new value:
  reduce(out_key, list(intermediate_value)) → list(out_value)
Your first MapReduce example
• Example: sum-of-squares (sum the squares of the numbers from 1 to n) in MapReduce fashion
• Map function: map square [1,2,3,4,5] returns [1,4,9,16,25]
• Reduce function: reduce [1,4,9,16,25] returns 55 (the sum of the squared elements)
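The same structure can be expressed with Java streams; a minimal single-machine sketch (illustrative only, no Hadoop involved):

    import java.util.stream.IntStream;

    public class SumOfSquares {
        public static void main(String[] args) {
            int n = 5;
            int sum = IntStream.rangeClosed(1, n) // [1, 2, 3, 4, 5]
                               .map(x -> x * x)   // map: square each element → [1, 4, 9, 16, 25]
                               .sum();            // reduce: add up the squares
            System.out.println(sum);              // prints 55
        }
    }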
MapReduce program
• A MapReduce program, referred to as a job, consists of:
  – Code for Map and Reduce packaged together
  – Configuration parameters (where the input lies, where the output should be stored)
  – The input data set, stored on the underlying distributed file system
    • The input will not fit on a single computer’s disk
• Each MapReduce job is divided by the system into smaller units called tasks:
  – Map tasks
  – Reduce tasks
• The output of a MapReduce job is also stored on the underlying distributed file system
MapReduce computation
1. Some number of Map tasks are each given one or more chunks of data from a distributed file system.
2. These Map tasks turn the chunks into a sequence of key-value pairs.
   – The way key-value pairs are produced from the input data is determined by the code written by the user for the Map function.
3. The key-value pairs from each Map task are collected by a master controller and sorted by key.
4. The keys are divided among all the Reduce tasks, so all key-value pairs with the same key wind up at the same Reduce task (see the sketch after this list).
5. The Reduce tasks work on one key at a time, and combine all the values associated with that key in some way.
   – The manner of combination of values is determined by the code written by the user for the Reduce function.
6. Output key-value pairs from each reducer are written persistently back onto the distributed file system.
7. The output ends up in r files, where r is the number of reducers.
   – Such output may be the input to a subsequent MapReduce phase.
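In Hadoop, the key-to-reducer assignment of step 4 is carried out by a partitioner; the default HashPartitioner boils down to the following (the sign bit of the hash is masked off so the result is a valid reducer index):

    // org.apache.hadoop.mapreduce.lib.partition.HashPartitioner
    public int getPartition(K key, V value, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }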
Where the magic happens
• Implicit between the map and reduce phases is a distributed “group by” operation on intermediate keys
  – Intermediate data arrive at each reducer in order, sorted by key
  – No ordering is guaranteed across reducers
• Intermediate keys are transient:
  – They are not stored on the distributed file system
  – They are “spilled” to the local disk of each machine in the cluster
[Figure: input splits of (k, v) pairs → Map function → intermediate (k’, v’) pairs → Reduce function → final (k’’, v’’) output pairs]
MapReduce computation: the complete picture
A simplified view of MapReduce: example
• Mappers are applied to all input key-value pairs to generate an arbitrary number of intermediate pairs
• Reducers are applied to all intermediate values associated with the same intermediate key
• Between the map and reduce phases lies a barrier that involves a large distributed sort and group by
“Hello World” in MapReduce: WordCount
• Problem: count the number of occurrences of each word in a large collection of documents
• Input: a repository of documents; each document is an element
• Map: read a document and emit a sequence of key-value pairs where keys are words of the document and values are equal to 1:
  (w1, 1), (w2, 1), …, (wn, 1)
• Shuffle and sort: group by key and generate pairs of the form
  (w1, [1, 1, …, 1]), …, (wn, [1, 1, …, 1])
• Reduce: add up all the values and emit (w1, k), …, (wn, l)
• Output: (w, m) pairs, where w is a word that appears at least once among all the input documents and m is the total number of occurrences of w among all those documents (a worked trace follows)
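To make the data flow concrete, here is a tiny hypothetical trace over two one-line documents, “the quick fox” and “the fox”:

  Map output:        (the,1) (quick,1) (fox,1)  and  (the,1) (fox,1)
  Shuffle and sort:  (fox,[1,1]) (quick,[1]) (the,[1,1])
  Reduce output:     (fox,2) (quick,1) (the,2)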
WordCount in practice
[Figure: WordCount data flow through the Map, Shuffle and Reduce stages]
WordCount: Map
• Map emits each word in the document with an associated value equal to “1”

  Map(String key, String value):
    // key: document name
    // value: document contents
    for each word w in value:
      EmitIntermediate(w, “1”);
WordCount: Reduce
• Reduce adds up all the “1” emitted for a given word

  Reduce(String key, Iterator values):
    // key: a word
    // values: a list of counts
    int result = 0;
    for each v in values:
      result += ParseInt(v);
    Emit(AsString(result));

• This is pseudo-code; for the complete code of the example see the MapReduce paper (a hedged Java sketch follows below)
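A hedged Java version of the two functions, modeled on the standard WordCount example that ships with Hadoop (class and field names follow that example; the surrounding job setup is shown later):

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class WordCount {

        // Map: emit (word, 1) for every word in the input value
        public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
            private final static IntWritable one = new IntWritable(1);
            private Text word = new Text();

            public void map(Object key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer itr = new StringTokenizer(value.toString());
                while (itr.hasMoreTokens()) {
                    word.set(itr.nextToken());
                    context.write(word, one);
                }
            }
        }

        // Reduce: sum the counts received for each word and emit (word, total)
        public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            private IntWritable result = new IntWritable();

            public void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable val : values) {
                    sum += val.get();
                }
                result.set(sum);
                context.write(key, result);
            }
        }
    }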
Another example: WordLengthCount
• Problem: count how many words of certain lengths exist in a collection of documents
• Input: a repository of documents; each document is an element
• Map: read a document and emit a sequence of key-value pairs where the key is the length of a word and the value is the word itself:
  (i, w1), …, (j, wn)
• Shuffle and sort: group by key and generate pairs of the form
  (1, [w1, …, wk]), …, (n, [wr, …, ws])
• Reduce: count the number of words in each list and emit:
  (1, l), …, (p, m)
• Output: (l, n) pairs, where l is a length and n is the total number of words of length l in the input documents (see the sketch below)
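A hedged Java sketch of WordLengthCount, using the same Hadoop API as the WordCount example above (class names here are hypothetical):

    // Map: emit (length of word, word) for each word in the document
    public static class LengthMapper extends Mapper<Object, Text, IntWritable, Text> {
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String w : value.toString().split("\\s+")) {
                if (!w.isEmpty())
                    context.write(new IntWritable(w.length()), new Text(w));
            }
        }
    }

    // Reduce: count how many words share a given length and emit (length, count)
    public static class LengthCountReducer
            extends Reducer<IntWritable, Text, IntWritable, IntWritable> {
        public void reduce(IntWritable length, Iterable<Text> words, Context context)
                throws IOException, InterruptedException {
            int count = 0;
            for (Text w : words) count++;
            context.write(length, new IntWritable(count));
        }
    }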
MapReduce: execution overview
Apache Hadoop
What is Apache Hadoop?
• Open-source software framework for reliable, scalable, distributed data-intensive computing
  – Originally developed by Yahoo!
• Goal: storage and processing of data sets at massive scale
• Infrastructure: clusters of commodity hardware
• Core components:
  – HDFS: Hadoop Distributed File System
  – Hadoop YARN
  – Hadoop MapReduce
• Includes a number of related projects
  – Among which Apache Pig, Apache Hive, Apache HBase
• Used in production by Facebook, IBM, LinkedIn, Twitter, Yahoo! and many others
Hadoop runs on clusters
• Compute nodes are stored on racks
  – 8-64 compute nodes per rack
• There can be many racks of compute nodes
• The nodes on a single rack are connected by a network, typically gigabit Ethernet
• Racks are connected by another level of network or a switch
• The bandwidth of intra-rack communication is usually much greater than that of inter-rack communication
• Compute nodes can fail! Solution:
  – Files are stored redundantly
  – Computations are divided into tasks
Hadoop runs on commodity hardware
• Pros
  – You are not tied to expensive, proprietary offerings from a single vendor
  – You can choose standardized, commonly available hardware from any of a large range of vendors to build your cluster
  – Scale out, not up
• Cons
  – Significant failure rates
  – Solution: replication!
Commodity hardware
• A typical specification of commodity hardware:
  – Processor: 2 quad-core 2-2.5 GHz CPUs
  – Memory: 16-24 GB ECC RAM
  – Storage: 4 × 1 TB SATA disks
  – Network: Gigabit Ethernet
• Yahoo! has a huge installation of Hadoop:
  – > 100,000 CPUs in > 40,000 computers
  – Used to support research for Ad Systems and Web Search
  – Also used to do scaling tests to support development of Hadoop
Hadoop core
• HDFS
  ✓ Already examined
• Hadoop YARN
  ✓ Already examined
• Hadoop MapReduce
  – System to write applications that process large data sets in parallel on large clusters (> 1000 nodes) of commodity hardware in a reliable, fault-tolerant manner
Hadoop core: MapReduce
• Programming paradigm for expressing distributed computations over multiple servers
• The powerhouse behind most of today’s big data processing
• Also used in other MPP environments and NoSQL databases (e.g., Vertica and MongoDB)
• Improving Hadoop usage:
  – Improving programmability: Pig and Hive
  – Improving data access: HBase, Sqoop and Flume
MapReduce data flow: no Reduce task
MapReduce data flow: single Reduce task
MapReduce data flow: multiple Reduce tasks
Optional Combine task
• Many MapReduce jobs are limited by the cluster bandwidth
  – How to minimize the data transferred between map and reduce tasks?
• Use a Combine task
  – An optional localized reducer that applies a user-provided method
  – Performs data aggregation on intermediate data of the same key for the Map task’s output before transmitting the result to the Reduce task
• Takes each key-value pair from the Map phase, processes it, and produces the output as key-value collection pairs (see the snippet below)
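In the Java API, a combiner is configured on the job; for WordCount the reducer class itself can double as the combiner, since summing partial counts is associative and commutative. A hedged one-liner, reusing the IntSumReducer class from the WordCount sketch shown earlier:

    job.setCombinerClass(IntSumReducer.class); // run a local reduce on each map task’s output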
Programming languages for Hadoop
• Java is the default programming language
• In Java you write a program with at least three parts (a hedged driver sketch follows):
  1. A main method which configures the job and launches it
     • Set number of reducers
     • Set mapper and reducer classes
     • Set partitioner
     • Set other Hadoop configurations
  2. A Mapper class
     • Takes (k, v) inputs, writes (k, v) outputs
  3. A Reducer class
     • Takes (k, Iterator[v]) inputs, writes (k, v) outputs
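A minimal sketch of part 1, the driver, again modeled on the standard WordCount example (input and output paths are assumed to come from the command line):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count"); // configure the job
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);     // set the mapper class
        job.setCombinerClass(IntSumReducer.class);     // optional combiner
        job.setReducerClass(IntSumReducer.class);      // set the reducer class
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1); // launch and wait
    }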
Programming languages for Hadoop
• Hadoop Streaming API to write map and reduce functions in programming languages other than Java
• Uses Unix standard streams as the interface between Hadoop and the MapReduce program
• Allows using any language (e.g., Python) that can read standard input (stdin) and write to standard output (stdout) to write the MapReduce program
  – Standard input is the stream of data going into a program; by default it comes from the keyboard
  – Standard output is the stream where a program writes its output data; by default it is the text terminal
  – The reducer interface for streaming is different than in Java: instead of receiving reduce(k, Iterator[v]), your script is sent one line per value, including the key (illustrated below)
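For instance, a streaming reducer for WordCount would read key-sorted, tab-separated lines on stdin such as the following (hypothetical input), and would itself have to detect where one key ends and the next begins while accumulating the counts:

  fox     1
  fox     1
  quick   1
  the     1
  the     1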
WordCount on Hadoop
• Let’s analyze the WordCount code on Hadoop
• The code in Java: http://bit.ly/1m9xHym
• The same code in Python
  – Use the Hadoop Streaming API for passing data between the Map and Reduce code via standard input and standard output
  – See Michael Noll’s tutorial http://bit.ly/19YutB1
Hadoop installation
• Some straightforward alternatives:
  – A tutorial on manual single-node installation to run Hadoop on Ubuntu Linux: http://bit.ly/1eMFVFm
  – Cloudera CDH, Hortonworks HDP, MapR
    • Integrated Hadoop-based stacks containing all the components needed for production, tested and packaged to work together
    • Include at least Hadoop, YARN, HDFS (not in MapR), HBase, Hive, Pig, Sqoop, Flume, ZooKeeper, Kafka, Spark
Hadoop installation
• Cloudera CDH
  – See QuickStarts http://bit.ly/2oMHob4
  – Single-node cluster: requires a previously installed hypervisor (VMware, KVM, or VirtualBox)
  – Multi-node cluster: requires a previously installed Docker (container technology, https://www.docker.com)
  – QuickStarts are not for use in production
• Hortonworks HDP
  – Requires 6.5 GB of disk space and a Linux distribution
  – Can be installed either manually or using Ambari (an open-source management platform for Hadoop)
  – Hortonworks Sandbox on a VM
    • Requires either VirtualBox, VMware or Docker
• MapR
  – Uses its own distributed file system rather than HDFS
Recall the YARN architecture
• A global ResourceManager (RM)
• A set of per-application ApplicationMasters (AMs)
• A set of NodeManagers (NMs)
YARN job execution
Data locality optimization
• Hadoop tries to run each map task on a data-local node (data locality optimization)
  – So as not to use cluster bandwidth
  – Otherwise rack-local
  – Off-rack as a last choice
Hadoop configuration
• Tuning Hadoop clusters for good performance is somewhat of a dark art
  – It involves the HW systems, the SW systems (including the OS), and the tuning/optimization of Hadoop infrastructure components, e.g.:
    • Optimal number of disks that maximizes I/O bandwidth
    • BIOS parameters
    • File system type (on Linux, EXT4 is better)
    • Open file handles
    • Optimal HDFS block size
    • JVM settings for Java heap usage and garbage collection
• See http://bit.ly/2nT7TIn for some detailed hints
How many mappers and reducers?
• Can be set in a configuration file (JobConf file) or on the command line at execution time
  – For mappers, it is just a hint
• Number of mappers
  – Driven by the number of HDFS blocks in the input files
  – You can adjust the HDFS block size to adjust the number of maps
• Number of reducers
  – Can be user-defined (default is 1)
• See http://bit.ly/2oK0D5A
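A hedged sketch of how these knobs are typically set in the Java API (property names assumed from recent Hadoop versions):

    job.setNumReduceTasks(4);                          // number of reducers: user-defined
    conf.setInt("mapreduce.job.maps", 100);            // number of mappers: only a hint
    conf.setLong("dfs.blocksize", 256L * 1024 * 1024); // larger HDFS blocks → fewer map tasks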
Hadoop in the Cloud
• Pros:
  – Gain Cloud scalability and elasticity
  – No need to manage and provision the infrastructure and the platform
• Main challenges:
  – Moving data to the Cloud
    • Latency is not zero (because of the speed of light)!
    • Minor issue: network bandwidth
  – Data security and privacy
Amazon Elastic MapReduce (EMR)
• Distributes the computational work across a cluster of virtual servers running in the Amazon cloud (EC2 instances)
• Cluster managed with Hadoop
• Input and output: Amazon S3, DynamoDB
• Not only Hadoop: also Hive, Pig and Spark
Create an EMR cluster
Steps:
1. Provide the cluster name
2. Select the EMR release and the applications to install
3. Select the instance type
4. Select the number of instances
5. Select the EC2 key pair to connect securely to the cluster
See http://amzn.to/2nnujp5
EMR cluster details
Hadoop on Google Cloud Platform
• Distributes the computational work across a cluster of virtual servers running in the Google Cloud Platform
• Cluster managed with Hadoop
• Input and output: Google Cloud Storage, HDFS
• Not only Hadoop: also Spark
References
• Dean and Ghemawat, “MapReduce: Simplified Data Processing on Large Clusters”, OSDI 2004. http://static.googleusercontent.com/media/research.google.com/it//archive/mapreduce-osdi04.pdf
• White, “Hadoop: The Definitive Guide”, 4th edition, O’Reilly, 2015.
• Miner and Shook, “MapReduce Design Patterns”, O’Reilly, 2012.
• Leskovec, Rajaraman, and Ullman, “Mining of Massive Datasets”, Chapter 2, 2014. http://www.mmds.org