A BigData Tour – HDFS, Ceph and MapReduce
These slides are possible thanks to these sources:
Jonathan Dursi - SciNet Toronto - Hadoop Tutorial; Amir Payberah - Course in
Data Intensive Computing - SICS; Yahoo! Developer Network MapReduce
Tutorial
• Data volumes are increasing massively
• Clusters and storage capacity are increasing massively
• Disk speeds are not keeping pace
• Seek speeds are even worse than read/write speeds
[Figure: disk bandwidth (MB/s) vs. CPU speed (MIPS) over time; CPU performance has outgrown disk bandwidth by roughly 1000x]
Data Intensive Computing
Scale-Out
• Disk streaming speed ~50 MB/s
• 3 TB = ~17.5 hrs to read
• 1 PB = ~8 months (see the quick check below)
• Scale-out (weak scaling): the filesystem distributes data on ingest
Jonathan Dursi https://github.com/ljdursi/hadoop-for-hpcers-tutorial
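A quick sanity check of those read times in plain Python (decimal TB/PB; the slide's figures are rounded):

    rate = 50e6                        # bytes/s: ~50 MB/s streaming read
    print(3e12 / rate / 3600)          # 3 TB  -> ~16.7 hours
    print(1e15 / rate / (86400 * 30))  # 1 PB  -> ~7.7 months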
Scale-Out
• Seeking is too slow
• ~10 ms for a seek
• Enough time to read half a megabyte at streaming speed
• Batch processing
• Go through the entire data set in one pass (or a small number of passes)
Jonathan Dursi https://github.com/ljdursi/hadoop-for-hpcers-tutorial
Combining results
• Each node pre-processes its local data
• Shuffles its data to a small number of other nodes
• Final processing and output are done there
Jonathan Dursi https://github.com/ljdursi/hadoop-for-hpcers-tutorial
Fault Tolerance
• Data is also replicated upon ingest
• The runtime watches for dead tasks and restarts them on live nodes
• Re-replicates data as needed
Jonathan Dursi https://github.com/ljdursi/hadoop-for-hpcers-tutorial
Data Distribution: Disk
• Hadoop and similar architectures handle the hardest part of parallelism for you: data distribution
• On disk: HDFS distributes and replicates data as it comes in
• It keeps track of the blocks; computations run local to the data
Jonathan Dursi https://github.com/ljdursi/hadoop-for-hpcers-tutorial
Data Distribution: Network
• On the network: MapReduce (for example) works in terms of key-value pairs
• The preprocessing (map) phase ingests data and emits (k,v) pairs
• The shuffle phase assigns reducers and gets all pairs with the same key onto that reducer (illustrated below)
• The programmer does not have to design communication patterns

Before the shuffle: (key1,83) (key2,9) (key1,99) (key2,12) (key1,17) (key5,23)
After the shuffle: (key1,[83,99,17]) (key2,[9,12]) (key5,[23])
Jonathan Dursi https://github.com/ljdursi/hadoop-for-hpcers-tutorial
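A minimal sketch of what the shuffle does with the pairs above, in plain Python (an illustration of the grouping semantics, not Hadoop code):

    from collections import defaultdict

    # (k,v) pairs emitted by the map phase across several nodes
    emitted = [("key1", 83), ("key2", 9), ("key1", 99),
               ("key2", 12), ("key1", 17), ("key5", 23)]

    # The shuffle routes all pairs with the same key to one reducer,
    # which then sees the key together with the list of its values
    groups = defaultdict(list)
    for key, value in emitted:
        groups[key].append(value)

    print(dict(groups))
    # {'key1': [83, 99, 17], 'key2': [9, 12], 'key5': [23]}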
Big Data Analytics Stack
Amir Payberah https://www.sics.se/~amir/dic.htm
Big Data – Storage (sans POSIX)
• Traditional filesystems are not well designed for large-scale data-processing systems.
• Efficiency has a higher priority than other features, e.g., directory service.
• The massive size of the data means it tends to be stored across multiple machines in a distributed way.
• HDFS, Amazon S3, ...
Amir Payberah https://www.sics.se/~amir/dic.htm
Big Data - Databases
• Relational Database Management Systems (RDBMS) were not designed to be distributed.
• NoSQL databases relax one or more of the ACID properties: BASE.
• Different data models: key/value, column-family, graph, document.
• Dynamo, Scalaris, BigTable, HBase, Cassandra, MongoDB, Voldemort, Riak, Neo4j, ...
Amir Payberah https://www.sics.se/~amir/dic.htm
Big Data – Resource Management
• Different frameworks require different computing resources.
• Large organizations need the ability to share data and resources between multiple frameworks.
• Resource management shares resources in a cluster between multiple frameworks while providing resource isolation.
• Mesos, YARN, Quincy, ...
Amir Payberah https://www.sics.se/~amir/dic.htm
Big Data – Execution Engine
• Scalable and fault-tolerant parallel data processing on clusters of unreliable machines.
• Data-parallel programming model for clusters of commodity machines.
• MapReduce, Spark, Stratosphere, Dryad, Hyracks, ...
Amir Payberah https://www.sics.se/~amir/dic.htm
Big Data – Query/Scripting Languages
• Low-level programming of execution engines, e.g., MapReduce, is not easy for end users.
• A high-level language is needed to improve the query capabilities of execution engines.
• It translates user-defined queries into the low-level API of the execution engine.
• Pig, Hive, Shark, Meteor, DryadLINQ, SCOPE, ...
Amir Payberah https://www.sics.se/~amir/dic.htm
Hadoop Ecosystem
• 2008 onwards: usage exploded
• Creation of many tools on top of the Hadoop infrastructure
The Need For Filesystems
What is a Filesystem?
• Controls how data is stored on and retrieved from disk.
Amir Payberah https://www.sics.se/~amir/dic.htm
Distributed Filesystems
• When data outgrows the storage capacity of a single machine: partition it across a number of separate machines.
• Distributed filesystems manage the storage across a network of machines.
Amir Payberah https://www.sics.se/~amir/dic.htm
What HDFS is Not Good For
• Low-latency reads
  • High throughput rather than low latency for small chunks of data.
  • HBase addresses this issue.
• Large amounts of small files
  • Better for millions of large files than for billions of small files.
• Multiple writers
  • Single writer per file.
  • Writes only at the end of a file; no support for writes at arbitrary offsets.
Amir Payberah https://www.sics.se/~amir/dic.htm
HDFS Architecture
• The Hadoop Distributed File System (HDFS)
• Offers a way to store large files across multiple machines, rather than requiring a single machine with disk capacity at least as large as the total size of the files
• HDFS is designed to be fault-tolerant
• Using data replication and distribution of data
• When a file is loaded into HDFS, it is replicated and broken up into "blocks" of data
• These blocks are stored across the cluster nodes designated for storage, a.k.a. DataNodes.
http://www.revelytix.com/?q=content/hadoop-ecosystem
Files and Blocks – 1/3
• Files are split into blocks.
• Blocks
  • Single unit of storage: a contiguous piece of information on a disk.
  • Transparent to the user.
  • Managed by the Namenode, stored by the Datanodes.
  • Blocks are traditionally either 64 MB or 128 MB: the default is 64 MB.
Amir Payberah https://www.sics.se/~amir/dic.htm
Files and Blocks – 2/3
• Why is a block in HDFS so large?
  • To minimize the cost of seeks.
• Time to read a block = seek time + transfer time
• Keeping the ratio of seek time to transfer time small means we read data from the disk almost as fast as the physical limit imposed by the disk.
• Example: if the seek time is 10 ms and the transfer rate is 100 MB/s, then to make the seek time 1% of the transfer time we need a block size of around 100 MB (re-derived below).
Amir Payberah https://www.sics.se/~amir/dic.htm
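To make the block-size arithmetic concrete, a quick re-derivation of the example in plain Python:

    seek_time = 0.010          # seconds (10 ms)
    transfer_rate = 100e6      # bytes/s (100 MB/s)
    target_ratio = 0.01        # seek time should be 1% of transfer time

    # transfer_time = block_size / transfer_rate, and we want
    # seek_time / transfer_time = target_ratio, hence:
    block_size = seek_time * transfer_rate / target_ratio
    print(block_size / 1e6, "MB")   # -> 100.0 MB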
Files and Blocks – 3/3
• The same block is replicated on multiple machines: the default is 3
  • Replica placement is rack-aware.
  • 1st replica on the local rack.
  • 2nd replica on the local rack but a different machine.
  • 3rd replica on a different rack.
• The Namenode determines replica placement (a toy version is sketched below).
Amir Payberah https://www.sics.se/~amir/dic.htm
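A toy version of the placement rule exactly as described on the slide (illustrative only; the real policy lives inside the Namenode):

    import random

    def place_replicas(local_node, racks):
        # racks maps a rack name to the list of nodes on that rack
        local_rack = next(r for r, nodes in racks.items() if local_node in nodes)
        first = local_node                              # 1st: the local machine
        second = random.choice(                         # 2nd: same rack, other machine
            [n for n in racks[local_rack] if n != local_node])
        other_rack = random.choice([r for r in racks if r != local_rack])
        third = random.choice(racks[other_rack])        # 3rd: a different rack
        return [first, second, third]

    print(place_replicas("n1", {"rackA": ["n1", "n2"], "rackB": ["n3", "n4"]}))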
HDFS Daemons
• An HDFS cluster is managed by three types of processes
• Namenode
  • Manages the filesystem, e.g., namespace, metadata, and file blocks
  • Metadata is stored in memory
• Datanode
  • Stores and retrieves data blocks
  • Reports to the Namenode
  • Runs on many machines
• Secondary Namenode
  • Only for checkpointing
  • Not a backup for the Namenode
Amir Payberah https://www.sics.se/~amir/dic.htm
Reading a file
• Reading a file is the shorter process (sketched in code below):
  1. Open the file
  2. Get the block locations from the Namenode
  3. Read the blocks from a replica
[Figure: the Namenode holds the namespace (/user/ljdursi/: diffuse, bigdata.dat); the client asks for the block locations of bigdata.dat to read lines 1...1000, then reads the blocks directly from datanode1, datanode2, datanode3]
Jonathan Dursi https://github.com/ljdursi/hadoop-for-hpcers-tutorial
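The same flow as a minimal Python sketch. The namenode/replica calls here are hypothetical stand-ins for HDFS's real ClientProtocol RPCs, just to make the three steps explicit:

    def read_file(path, namenode):
        # 1. Open: ask the Namenode where the file's blocks live
        blocks = namenode.get_block_locations(path)     # hypothetical RPC
        data = b""
        for block in blocks:
            # 2. Choose one replica per block (ideally the closest)
            replica = block.replicas[0]
            # 3. Read the block directly from that Datanode;
            #    the Namenode never touches the data itself
            data += replica.read_block(block.block_id)  # hypothetical RPC
        return data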
Writing a file
• Writing a file is a multi-stage process (sketched in code below):
  1. Create the file
  2. Get nodes for the blocks from the Namenode
  3. Start writing
  4. The Datanodes coordinate replication among themselves
  5. Get acks back (while writing)
  6. Complete
[Figure: client writes newdata.dat; the Namenode assigns Datanodes for each block, the client streams data to the first Datanode, and the Datanodes replicate down the pipeline (datanode1, datanode2, datanode3) and acknowledge back to the client]
Jonathan Dursi https://github.com/ljdursi/hadoop-for-hpcers-tutorial
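The write path in the same hypothetical style (again, the namenode/pipeline calls are illustrative stand-ins, not real HDFS API names; 64 MB is the classic default block size):

    def write_file(path, data, namenode, block_size=64 * 2**20):
        namenode.create(path)                          # 1. create
        for offset in range(0, len(data), block_size):
            chunk = data[offset:offset + block_size]
            pipeline = namenode.allocate_block(path)   # 2. get nodes (3 replicas)
            # 3. write to the first Datanode only; 4. it forwards the chunk
            #    down the replication pipeline to the other replicas
            ack = pipeline[0].write_block(chunk, forward_to=pipeline[1:])
            assert ack.ok                              # 5. ack comes back while writing
        namenode.complete(path)                        # 6. complete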
Communication Protocol
• All HDFS communication protocols are layered on top of the TCP/IP protocol
• A client establishes a connection to a configurable TCP port on the NameNode machine and uses the ClientProtocol
• DataNodes talk to the NameNode using the DataNode protocol
• A Remote Procedure Call (RPC) abstraction wraps both the ClientProtocol and the DataNode protocol
• The NameNode never initiates an RPC; it only responds to RPC requests issued by DataNodes or clients
Robustness
• The primary objective of HDFS is to store data reliably even during failures
• Three common types of failures: NameNode failure, DataNode failure, and network partitions
• Data disk failure
  • Heartbeat messages track the health of DataNodes (a toy monitor is sketched below)
  • The NameNode performs the necessary re-replication on DataNode unavailability, replica corruption, or disk fault
• Cluster rebalancing
  • Automatically move data between DataNodes if the free space on a DataNode falls below a threshold, or during sudden high demand
• Data integrity
  • Checksum checking on HDFS files, during file creation and retrieval
• Metadata disk failure
  • Manual intervention: no automatic recovery, restart, or failover
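A toy version of the heartbeat bookkeeping described above (plain Python, not Hadoop code; the ~10-minute dead-node timeout matches the order of magnitude of HDFS's default):

    import time

    HEARTBEAT_TIMEOUT = 10 * 60        # seconds; HDFS defaults to ~10 minutes

    last_seen = {}                     # DataNode id -> time of last heartbeat

    def on_heartbeat(node_id):
        last_seen[node_id] = time.time()

    def dead_nodes():
        now = time.time()
        return [n for n, t in last_seen.items() if now - t > HEARTBEAT_TIMEOUT]

    # Blocks whose replicas live on any node in dead_nodes()
    # get re-replicated onto healthy DataNodes.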
MAP-REDUCE
MapReduce
• A shared-nothing architecture for processing large data sets with a parallel/distributed algorithm on clusters.
What is it?
MapReduce Definition
• A programming model: to batch process large data sets (inspired by functional programming).
• An execution framework: to run parallel algorithms on clusters of commodity hardware.
Simplicity
• Don't worry about parallelization, fault tolerance, data distribution, or load balancing (MapReduce takes care of these).
• It hides system-level details from programmers.
Amir Payberah https://www.sics.se/~amir/dic.htm
MapReduce Dataflow
• map function: processes data and generates a set of intermediate key/value pairs.
• reduce function: merges all intermediate values associated with the same intermediate key.
Amir Payberah https://www.sics.se/~amir/dic.htm
Word Count
• Was used as an example in the original MapReduce paper
• Now basically the "hello world" of MapReduce
• Count the words in some set of documents
• A simple model of many actual web analytics problems

file01: Hello World Bye World
file02: Hello Hadoop Goodbye Hadoop
output/part-00000: Hello 2, World 2, Bye 1, Hadoop 2, Goodbye 1
Jonathan Dursi https://github.com/ljdursi/hadoop-for-hpcers-tutorial
High-Level Structure of a MR Program – 1/2
def mapper(filename, file_contents):
    # emit a (word, 1) pair for every word in the file
    for word in file_contents.split():
        yield (word, 1)

def reducer(word, values):
    # add up the 1's emitted for this word
    total = 0
    for value in values:
        total += value
    yield (word, total)
https://developer.yahoo.com/hadoop/tutorial
High-Level Structure of a MR Program – 2/2
• Several instances of the mapper function are created on the different machines in a Hadoop cluster
• Each instance receives a different input file (it is assumed that there are many such files)
• The mappers output (word, 1) pairs, which are then forwarded to the reducers
• Several instances of the reducer function are also instantiated on the different machines
• Each reducer is responsible for processing the list of values associated with a different word
• The list of values will be a list of 1's; the reducer sums those ones into a final count for its word, then emits the final (word, count) output, which is written to an output file (the whole pipeline is simulated below)
https://developer.yahoo.com/hadoop/tutorial
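Putting the pieces together: a minimal single-machine simulation of the whole map-shuffle-reduce pipeline, reusing the mapper and reducer defined above (plain Python, not Hadoop):

    from collections import defaultdict

    files = {"file01": "Hello World Bye World",
             "file02": "Hello Hadoop Goodbye Hadoop"}

    # Map phase: run the mapper over every input file
    pairs = [kv for name, text in files.items() for kv in mapper(name, text)]

    # Shuffle phase: group all emitted values by key
    groups = defaultdict(list)
    for word, count in pairs:
        groups[word].append(count)

    # Reduce phase: one reducer call per word
    for word in sorted(groups):
        for w, total in reducer(word, groups[word]):
            print(w, total)   # Bye 1, Goodbye 1, Hadoop 2, Hello 2, World 2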
Word Count
• How would you do this by hand with a huge document?
• Each time you see a word: if it is already in your list, add a tick mark beside it; otherwise add the new word with one tick
• ...but this is hard to parallelize (every update touches the shared list)
Jonathan Dursi https://github.com/ljdursi/hadoop-for-hpcers-tutorial
Word Count
• The MapReduce way: all the hard work is done by the shuffle, automatically
• Map: just emit a 1 for each word you see

file01: Hello World Bye World → (Hello,1) (World,1) (Bye,1) (World,1)
file02: Hello Hadoop Goodbye Hadoop → (Hello,1) (Hadoop,1) (Goodbye,1) (Hadoop,1)
Jonathan Dursi https://github.com/ljdursi/hadoop-for-hpcers-tutorial
Word Count
• The shuffle assigns keys (words) to reducers and sends each (k,v) pair to the appropriate reducer: it creates the list of values associated with each key
• The reducer just has to sum up the ones

Reducer 1 input: (Hello,[1,1]) (World,[1,1]) (Bye,[1]) → output: Hello 2, World 2, Bye 1
Reducer 2 input: (Hadoop,[1,1]) (Goodbye,[1]) → output: Hadoop 2, Goodbye 1

Amir Payberah https://www.sics.se/~amir/dic.htm
Jonathan Dursi https://github.com/ljdursi/hadoop-for-hpcers-tutorial
Hadoop MapReduce and HDFS
Amir Payberah https://www.sics.se/~amir/dic.htm