Distributed Architecture: Map/Reduce
Software Architecture VO/KU (707.023/707.024)
Roman Kern
KMI, TU Graz
Dec 19, 2012
Roman Kern (KMI, TU Graz) Distributed Architecture: Map/Reduce Dec 19, 2012 1 / 61
Outline
1 Introduction
2 Independent operations
3 Distributed operations
4 Summary
Recap
Figure: Client-server style
Figure: Layered system
Peer-to-Peer
Separation between client and server is removed
Each client is a server at the same time, called peer
The goal is to distribute the processing or data among many peers
No central administration or coordination
Introduction
Distributed Architectures
Goal is to achieve a scalable infrastructure
⇒ scale horizontally (scale out)
Different levels of complexity
Depends on the systems and the required attributes
Certain approaches have evolved
Frameworks have been developed
Parallel computing vs. distributed computing
In parallel computing all components share a common memory, typically threads within a single program
In distributed computing each component has its own memory
Typically in distributed computing the individual components are connected over a network
Dedicated programming languages (or extensions) for parallel computing
http://nighthacks.com/roller/jag/resource/Fallacies.html
Different levels of complexity
Lowest complexity for operations that can easily be distributed
If they are independent and short enough to be executed independently of each other
Higher degree of complexity for operations that compute a single result on multiple nodes
Independent operations
In a simple scenario, the system just consists of separate, independent operations
The operations do not require complex interaction
Input data are typically small chunks
Shared repository - all the data is available on all nodes
Still a number of issues to address
1 Group membership
2 Leader election
3 Queues - distribution of workload
4 Distributed locks
5 Barriers
6 Shared resources
7 Configuration
Independent operations - Group membership
Group membership
When a single node comes online...
How does it know where to connect to?
How do the other members know of an added node?
⇒ Peer-to-peer architectural style
Each node is client, as well as server
Parts of the bootstrapping mechanism
Dynamic vs. static
Fully dynamic via broadcast/multicast within local area networks (UDP)
Centralised P2P - e.g. central login components/servers
Static lists of group members (needs to be configurable)
Independent operations - Leader Election
Leader Election
Not all nodes are equal, e.g. centralised components in P2P networks
Single node acts as master, others are workers
Some nodes have additional responsibilities (supernodes)
Having centralised components makes some functionality easier to implement
E.g. assign work-load
Disadvantage: might lead to a single point of failure
⇒ Client-server architectural style
Once the leader has been elected, it takes over the role of the server
All other group members then act as clients
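The slides do not prescribe an election algorithm. As a minimal sketch, assuming that every node sees the same list of live group members and that node IDs are unique, one simple rule is to treat the live member with the smallest ID as the leader. The class and method names are hypothetical.

```java
import java.util.Optional;
import java.util.SortedSet;
import java.util.TreeSet;

/** Hypothetical sketch: the live member with the smallest ID is the leader. */
class LeaderElection {
    private final SortedSet<String> liveMembers = new TreeSet<>();

    /** A node joined the group (see group membership). */
    void join(String nodeId) {
        liveMembers.add(nodeId);
    }

    /** A node left the group or was detected as failed. */
    void leave(String nodeId) {
        liveMembers.remove(nodeId);
    }

    /** Current leader (takes the server role); empty if the group has no members. */
    Optional<String> leader() {
        return liveMembers.isEmpty() ? Optional.empty() : Optional.of(liveMembers.first());
    }

    /** All members for which this returns false act as clients. */
    boolean isLeader(String nodeId) {
        return leader().map(nodeId::equals).orElse(false);
    }
}
```

When the leader leaves or fails, the next call to leader() yields the next-smallest ID, so the group is not left without a server. How a failed node is detected is outside this sketch.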
Independent operations - Queues
Queues
Important component in many distributed systems
Two types of nodes: manager of the queue, workers
Incoming requests are collected at a single point
And are stored as items in a queue
Many client nodes consume items from the queue
Queues are often FIFO (first-in, first-out)
Sometimes specific items are of higher priority
Crucial aspect is the coordinated access to the queue
Each item is only processed by a single client
What if the client crashes while processing an item from the queue?
⇒ Publish-subscribe architectural style
Basically a producer-consumer pattern
Each worker client registers itself
Queue manager notifies the worker of new items
How to schedule the workers, which should be picked next?
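One possible answer to the crash question raised above is to let the queue manager remember which item each worker currently holds and to put the item back if the worker fails. The following is a minimal in-memory sketch of that idea, not the API of any particular queueing product; WorkQueue and its methods are hypothetical, and failure detection (for example via timeouts) is assumed to happen elsewhere.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

/** Hypothetical queue manager: each item is handed to one worker at a time. */
class WorkQueue<T> {
    private final Deque<T> pending = new ArrayDeque<>();      // FIFO order
    private final Map<String, T> inFlight = new HashMap<>();  // worker -> item

    /** Incoming requests are collected here. */
    synchronized void submit(T item) {
        pending.addLast(item);
    }

    /** Hands the oldest pending item to the worker, if there is one. */
    synchronized Optional<T> take(String workerId) {
        if (inFlight.containsKey(workerId)) {
            throw new IllegalStateException(workerId + " is still busy");
        }
        T item = pending.pollFirst();
        if (item != null) {
            inFlight.put(workerId, item);
        }
        return Optional.ofNullable(item);
    }

    /** The worker finished its item; the item is not handed out again. */
    synchronized void acknowledge(String workerId) {
        inFlight.remove(workerId);
    }

    /** The worker crashed: its unfinished item goes back to the head of the queue. */
    synchronized void workerFailed(String workerId) {
        T item = inFlight.remove(workerId);
        if (item != null) {
            pending.addFirst(item);
        }
    }
}
```

With this design an item can be processed more than once, namely when a worker crashes after doing the work but before acknowledging it, so the operations should tolerate repetition.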
Independent operations - Locks
Distributed Locks
Restrict access to shared resources to only a single node at a time
E.g. allow only a single node to write to a file
May yield many non-trivial problems, for example deadlocks or race conditions
Distributed locks without a central component are very complex to realise
⇒ Blackboard architectural style
The shared repository is responsible for orchestrating the access to the locks
Notifies waiting nodes once the lock has been lifted
This functionality is often coupled with the elected leader
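A minimal sketch of such a central component, assuming a single shared resource and a first-come, first-served policy. LockManager is a hypothetical name; the string returned by release stands for the node that would be notified over the network.

```java
import java.util.ArrayDeque;
import java.util.Optional;
import java.util.Queue;

/** Hypothetical central lock manager guarding one shared resource. */
class LockManager {
    private String holder;                                   // null = lock is free
    private final Queue<String> waiting = new ArrayDeque<>();

    /** Returns true if the node holds the lock now; otherwise the node is queued. */
    synchronized boolean acquire(String nodeId) {
        if (holder == null) {
            holder = nodeId;
            return true;
        }
        if (!holder.equals(nodeId) && !waiting.contains(nodeId)) {
            waiting.add(nodeId);
        }
        return holder.equals(nodeId);
    }

    /** Lifts the lock and returns the waiting node to notify, if any. */
    synchronized Optional<String> release(String nodeId) {
        if (!nodeId.equals(holder)) {
            throw new IllegalStateException(nodeId + " does not hold the lock");
        }
        holder = waiting.poll();   // next waiter becomes the holder, or null
        return Optional.ofNullable(holder);
    }
}
```

Because only the manager decides who holds the lock, two nodes can never hold it at the same time. The price is that the manager itself becomes a central component.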
Independent operations - Barriers
Barriers
Specific type of distributed lock
Synchronise multiple nodes
E.g. multiple nodes should wait until a certain state has been reached
Used when a part of the processing can be done in parallel and some parts cannot be distributed
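The behaviour of a barrier can be tried out within a single JVM, where threads stand in for nodes (an assumption of this sketch; a distributed barrier needs a coordination service instead). The example uses java.util.concurrent.CyclicBarrier; BarrierDemo and its method are hypothetical names.

```java
import java.util.concurrent.CyclicBarrier;
import java.util.concurrent.atomic.AtomicInteger;

/** In-process analogy of a barrier: threads play the role of nodes. */
class BarrierDemo {
    /** Runs the given number of workers; returns how many saw phase one complete. */
    static int run(int nodes) {
        CyclicBarrier barrier = new CyclicBarrier(nodes);
        AtomicInteger phaseOneDone = new AtomicInteger();
        AtomicInteger sawCompletePhaseOne = new AtomicInteger();

        Thread[] workers = new Thread[nodes];
        for (int i = 0; i < nodes; i++) {
            workers[i] = new Thread(() -> {
                try {
                    phaseOneDone.incrementAndGet();   // part that runs in parallel
                    barrier.await();                  // wait until all nodes got here
                    if (phaseOneDone.get() == nodes) {
                        sawCompletePhaseOne.incrementAndGet();
                    }
                } catch (Exception e) {
                    Thread.currentThread().interrupt();
                }
            });
            workers[i].start();
        }
        for (Thread worker : workers) {
            try {
                worker.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return sawCompletePhaseOne.get();
    }
}
```

No worker passes the barrier before all workers have finished the first phase, so BarrierDemo.run(4) returns 4.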
Independent operations - Shared Resources
Shared Resources
If all nodes need to be able to access a common data-structure
Read-only vs. read-write
If read-write, the complexity rises due to synchronisation issues
Apache Zookeeper
Apache Zookeeper is a framework/library to
Used by Yahoo!, LinkedIn, Facebook
Initially developed by Yahoo!
Now managed by Apache
Alternative approaches: Google Chubby, Microsoft Centrifuge
Components of Zookeeper
Coordination kernel
File-system like API
Synchronisation, Watches, Locks
Configuration
Shared data
Example of a Barrier with Zookeeper
    Barrier(String address, String name, int size)
            throws KeeperException, InterruptedException, UnknownHostException {
        super(address);
        this.root = name;
        this.size = size;

        // Create the barrier node if it does not exist yet.
        // 0 = no create flags (older ZooKeeper API; newer releases take a CreateMode here)
        Stat s = zk.exists(root, false);
        if (s == null)
            zk.create(root, new byte[0], Ids.OPEN_ACL_UNSAFE, 0);

        // My node name ("this." is required, the parameter shadows the field)
        this.name = InetAddress.getLocalHost().getCanonicalHostName();
    }
    boolean enter() throws KeeperException, InterruptedException {
        // Announce this node with an ephemeral child of the barrier node
        zk.create(root + "/" + name, new byte[0],
                  Ids.OPEN_ACL_UNSAFE, CreateFlags.EPHEMERAL);
        while (true) {
            synchronized (mutex) {
                // "true" registers a watch; the watcher is expected to notify mutex
                List<String> list = zk.getChildren(root, true);
                if (list.size() < size)
                    mutex.wait();
                else
                    return true;
            }
        }
    }
Example of a Queue with Zookeeper
    int consume() throws KeeperException, InterruptedException {
        int result = -1;
        Stat stat = null;
        while (true) { // Get the first element available
            synchronized (mutex) {
                List<String> list = zk.getChildren(root, true);
                if (!list.isEmpty()) {
                    // Node names are "element" + number; find the smallest number
                    String minNode = list.get(0);
                    int min = Integer.parseInt(minNode.substring(7));
                    for (String s : list) {
                        int tempValue = Integer.parseInt(s.substring(7));
                        if (tempValue < min) {
                            min = tempValue;
                            minNode = s;
                        }
                    }
                    // Use the node name itself, so zero-padded numbers still match
                    byte[] b = zk.getData(root + "/" + minNode, false, stat);
                    zk.delete(root + "/" + minNode, 0);
                    ByteBuffer buffer = ByteBuffer.wrap(b);
                    result = buffer.getInt();
                    return result;
                }
                mutex.wait(); // Going to wait
            }
        }
    }
Distributed operations
If the processing cannot be split into separate, independent operations
If the data is too big to fit on a single machine
Need for a distributed processing of a single operation
Contemporary Computing Environment
Hardware basics
Access to data in memory is much faster than access to data on disk (or online).
Disk seeks: No data is transferred from disk while the disk head is being positioned.
Therefore: Transferring one large chunk of data from disk to memory is faster than transferring many small chunks.
Disk I/O is block-based: Reading and writing of entire blocks (as opposed to smaller chunks).
Map/Reduce
Distributed indexing at Google
For web-scale indexing
Must use a distributed computing cluster
Individual machines are fault-prone
Can unpredictably slow down or fail
Based on distributed file system
Files are stored among different machines
Redundant storage
Information about storage is available to other components
MapReduce
MapReduce (Dean and Ghemawat 2004) is a robust and conceptually simple framework for distributed computing
Motivated by the indexing system at Google, which consists of a number of phases, each implemented in MapReduce
Approach: Bring the code to the data
Distributed computing without having to write code for the distribution part.
Google Infrastructure
Google data centres mainly contain commodity machines
Data centres are distributed around the world.
Estimate: a total of 1 million servers, 3 million processors/cores (Gartner 2007)
Estimate: Google installs 100,000 servers each quarter.
Based on expenditures of 200-250 million dollars per year
This would be 10% of the computing capacity of the world
Map/Reduce
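Figure: Map/Reduce data flow. The input data (Split1 to Split5) is read by the map workers (Map1 to Map3), which write intermediate data; the reduce workers (Reduce1, Reduce2) read the intermediate data and write the output data (Output1, Output2).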
Task of the mapper: read a chunk of the input data and generate intermediate keys plus values
Task of the reducer: process a tuple of an intermediate key plus its values and write the output
Note: Often a number of additional functions need to be provided as well
           Input            Output
Mapper     k1, v1           list(k2, v2)
Reducer    k2, list(v2)     list(k3, v3)
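The type signatures in the table can be written down as two generic Java interfaces. The sketch below also adds a small single-machine driver that groups the intermediate pairs by key between the two phases. All names (Mapper, Reducer, LocalMapReduce) are hypothetical and are not the API of Hadoop or of Google's framework; a real framework runs the same steps distributed over many workers.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.BiConsumer;

/** Mapper: (k1, v1) -> list(k2, v2), emitted through the collector. */
interface Mapper<K1, V1, K2, V2> {
    void map(K1 key, V1 value, BiConsumer<K2, V2> emit);
}

/** Reducer: (k2, list(v2)) -> list(k3, v3), emitted through the collector. */
interface Reducer<K2, V2, K3, V3> {
    void reduce(K2 key, List<V2> values, BiConsumer<K3, V3> emit);
}

/** Hypothetical single-machine driver for the two phases. */
class LocalMapReduce {
    static <K1, V1, K2 extends Comparable<K2>, V2, K3, V3> Map<K3, V3> run(
            Map<K1, V1> input,
            Mapper<K1, V1, K2, V2> mapper,
            Reducer<K2, V2, K3, V3> reducer) {
        // Map phase: every emitted pair is grouped under its key, keys kept sorted.
        Map<K2, List<V2>> intermediate = new TreeMap<>();
        input.forEach((k1, v1) -> mapper.map(k1, v1,
                (k2, v2) -> intermediate.computeIfAbsent(k2, k -> new ArrayList<>()).add(v2)));

        // Reduce phase: one call per intermediate key, in sorted key order.
        Map<K3, V3> output = new LinkedHashMap<>();
        intermediate.forEach((k2, values) -> reducer.reduce(k2, values, output::put));
        return output;
    }
}
```

For a word count, the mapper emits (word, 1) for every word of a document and the reducer emits (word, sum of the values).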
Example of a Mapper
    void countWordsOldSchool() {
        Map<String, Integer> wordToCountMap = new HashMap<String, Integer>();
        List<File> fileList = dir.listFiles();
        for (File file : fileList) {
            String content = IOUtils.readFileToString(file);
            List<String> wordList = tokenizeIntoWords(content);
            for (String word : wordList) {
                increment(word, 1);
            }
        }
        writeToFile(wordToCountMap);
    }
    void map(int documentId, String content) {
        List<String> wordList = tokenizeIntoWords(content);
        for (String word : wordList) {
            // Emit the intermediate pair (word, 1); named emit because
            // "yield" is a restricted identifier in newer Java versions
            emit(word, 1);
        }
    }
Example of a Reducer
    void reduce(String word, List<Integer> countList) {
        int counter = 0;
        for (Integer count : countList) {
            counter += count;
        }
        write(word, counter);
    }
Overview Inverted Index
Input: Documents to be indexed, input documents are parsed and text is extracted
    3 ↦ Friends, Romans, countrymen.
Tokenizer: Produces a token stream from the text
    3 ↦ Friends Romans countrymen
Linguistic models: Analyse and modify the tokens
    3 ↦ friends romans countrymen
Indexer: Collects the tokens and inverts the data structure
    countrymen ↦ 2 → 3 ...
    friends ↦ 1 → 3 → 7 ...
    romans ↦ 3 → 9 ...
Detail Inverted Index
Step 1: Build term-document table
Document 1
I did enact Julius Caesar I was killed in the Capitol; Brutus killed me.
Document 2
So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious
Term       Doc #
i          1
did        1
enact      1
julius     1
caesar     1
i          1
was        1
killed     1
in         1
the        1
capitol    1
brutus     1
killed     1
me         1
so         2
let        2
it         2
be         2
with       2
caesar     2
...        ...
Step 2: Sort by terms
Term       Doc #
ambitious  2
be         2
brutus     1
brutus     2
capitol    1
caesar     1
caesar     2
caesar     2
did        1
enact      1
hath       2
i          1
i          1
in         1
it         2
julius     1
killed     1
killed     1
let        2
me         1
...        ...
Step 3: Add term frequency, multiple entries from a single document get merged
Term       Doc #   TF
ambitious  2       1
be         2       1
brutus     1       1
brutus     2       1
capitol    1       1
caesar     1       1
caesar     2       2
did        1       1
enact      1       1
hath       2       1
i          1       2
in         1       1
it         2       1
julius     1       1
killed     1       2
let        2       1
me         1       1
noble      2       1
so         2       1
the        1       1
...        ...     ...
Step 4: Result is split into dictionary file and postings file.
Dictionary
#    Term       DF   CF
0    ambitious  1    1
1    be         1    1
2    brutus     2    2
3    capitol    1    1
4    caesar     2    3
5    did        1    1
6    enact      1    1
7    hath       1    1
8    i          1    2
...  ...        ...
Postings
Term# ↦ {Doc#, TF}
0 ↦ 2,1
1 ↦ 2,1
2 ↦ 1,1 → 2,1
3 ↦ 1,1
4 ↦ 1,1 → 2,2
5 ↦ 1,1
6 ↦ 1,1
7 ↦ 2,1
8 ↦ 1,2
...
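The four steps can be sketched on a single machine as follows. This is an illustrative sketch under simplifying assumptions: tokens are lower-cased runs of the letters a to z, documents are numbered from 1 in list order, and terms are ordered by natural string order (so term numbering may differ from the tables). SortBasedIndexer and its members are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sort-based indexer following steps 1 to 4. */
class SortBasedIndexer {
    /** One row of the term-document table. */
    static final class Pair {
        final String term;
        final int doc;
        Pair(String term, int doc) { this.term = term; this.doc = doc; }
    }

    /** One postings entry: Doc # plus term frequency (TF). */
    static final class Posting {
        final int doc;
        int tf = 1;
        Posting(int doc) { this.doc = doc; }
    }

    /** Step 1: parse the documents into (term, Doc #) rows. */
    static List<Pair> termDocumentTable(List<String> documents) {
        List<Pair> table = new ArrayList<>();
        for (int i = 0; i < documents.size(); i++) {
            for (String token : documents.get(i).toLowerCase().split("[^a-z]+")) {
                if (!token.isEmpty()) table.add(new Pair(token, i + 1));
            }
        }
        return table;
    }

    /** Steps 2 to 4: sort by term, merge duplicates into TF, collect postings. */
    static Map<String, List<Posting>> index(List<String> documents) {
        List<Pair> table = termDocumentTable(documents);
        table.sort(Comparator.comparing((Pair p) -> p.term).thenComparingInt(p -> p.doc));

        Map<String, List<Posting>> postings = new LinkedHashMap<>();
        for (Pair p : table) {
            List<Posting> list = postings.computeIfAbsent(p.term, t -> new ArrayList<>());
            Posting last = list.isEmpty() ? null : list.get(list.size() - 1);
            if (last != null && last.doc == p.doc) {
                last.tf++;                      // same term in the same document
            } else {
                list.add(new Posting(p.doc));   // first occurrence in this document
            }
        }
        return postings;
    }

    /** DF: number of documents that contain the term. */
    static int documentFrequency(List<Posting> list) {
        return list.size();
    }

    /** CF: number of occurrences of the term in the whole collection. */
    static int collectionFrequency(List<Posting> list) {
        int sum = 0;
        for (Posting p : list) sum += p.tf;
        return sum;
    }
}
```

For the two example documents, the postings of caesar are (1,1) → (2,2), which gives DF = 2 and CF = 3 as in the dictionary above.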
Index Construction
What is the role of the Map/Reduce framework when building such an index?
Recall step 1 of inverted index creation: the documents are parsed into a table of (term, Doc #) entries.
After all documents have been parsed, the inverted file is sorted by terms.
There might be many items to sort.
Map step: parse the documents and yield terms as keys
Framework: Sort the keys from the mappers
Reduce: Collect all keys and write out the inverted index
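The three steps above can be sketched on a single machine, with the framework's sort played by a sorted map. MapReduceIndexer and its methods are hypothetical names, and the tokenization (lower-cased runs of the letters a to z) is an assumption of this sketch. In a real framework the map calls run on many workers in parallel and each reducer only sees a subset of the keys.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Single-machine sketch of index construction with map, sort and reduce. */
class MapReduceIndexer {
    /** Map step: parse one document and emit one (term, docId) pair per token. */
    static List<Map.Entry<String, Integer>> map(int docId, String content) {
        List<Map.Entry<String, Integer>> emitted = new ArrayList<>();
        for (String token : content.toLowerCase().split("[^a-z]+")) {
            if (!token.isEmpty()) {
                emitted.add(new SimpleEntry<>(token, docId));
            }
        }
        return emitted;
    }

    /** Framework step: sort the emitted pairs by key and group their values. */
    static TreeMap<String, List<Integer>> sortAndGroup(List<Map.Entry<String, Integer>> pairs) {
        TreeMap<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    /** Reduce step: turn all docIds of one term into postings (docId -> TF). */
    static Map<Integer, Integer> reduce(String term, List<Integer> docIds) {
        Map<Integer, Integer> postings = new TreeMap<>();
        for (int docId : docIds) {
            postings.merge(docId, 1, Integer::sum);
        }
        return postings;
    }

    /** Runs map, sort and reduce in sequence over documents keyed by Doc #. */
    static Map<String, Map<Integer, Integer>> buildIndex(Map<Integer, String> documents) {
        List<Map.Entry<String, Integer>> emitted = new ArrayList<>();
        documents.forEach((docId, content) -> emitted.addAll(map(docId, content)));

        Map<String, Map<Integer, Integer>> index = new TreeMap<>();
        sortAndGroup(emitted).forEach((term, docIds) -> index.put(term, reduce(term, docIds)));
        return index;
    }
}
```

The programmer only supplies map and reduce; sorting and grouping between the two is the part the framework takes over, and it is also the part that involves moving data between machines.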
Map/Reduce Framework
Existing open-source framework: Apache Hadoop
Implemented in Java
Initially developed by Yahoo!
Now used by many companies and organisations
Summary
If the system needs to be scalable, it needs to be appropriately designed
In a simple scenario, the load is distributed via individual operations
For more demanding operations, specific approaches are necessary
The simple scenario
Scalability is often limited by dedicated central components
E.g. the master node
Performance bottlenecks for shared resources
No guarantee on execution order
Of limited suitability for interactive applications
The scenario with a complex operation
Scalability is very good
High complexity when implementing
Not suited for interactive applications
Questions?