A BigData Tour – HDFS, Ceph and MapReduce
These slides are possible thanks to these sources:
Jonathan Dursi - SciNet Toronto - Hadoop Tutorial; Amir Payberah - Course in
Data Intensive Computing - SICS; Yahoo! Developer Network MapReduce
Tutorial
• Data volumes are increasing massively
• Clusters and storage capacity are increasing massively
• Disk speeds are not keeping pace
• Seek speeds are even worse than read/write speeds
[Figure: disk bandwidth (MB/s) vs. CPU speed (MIPS) over time; CPU performance has outgrown disk bandwidth by roughly 1000x]
Data Intensive Computing
Scale-Out
• Disk streaming speed ~50 MB/s
• 3 TB = ~17.5 hrs to read
• 1 PB = ~8 months (see the quick check below)
• Scale-out (weak scaling): the filesystem distributes data on ingest
Jonathan Dursi https://github.com/ljdursi/hadoop-for-hpcers-tutorial
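A quick sanity check of those read times in plain Python (decimal TB/PB; the slide's figures are rounded):

    rate = 50e6                        # bytes/s: ~50 MB/s streaming read
    print(3e12 / rate / 3600)          # 3 TB  -> ~16.7 hours
    print(1e15 / rate / (86400 * 30))  # 1 PB  -> ~7.7 months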
Scale-Out
• Seeking is too slow
• ~10 ms for a seek
• Enough time to read half a megabyte at streaming speed
• Batch processing
• Go through the entire data set in one pass (or a small number of passes)
Jonathan Dursi https://github.com/ljdursi/hadoop-for-hpcers-tutorial
Combining results
• Each node pre-processes its local data
• Shuffles its data to a small number of other nodes
• Final processing and output are done there
Jonathan Dursi https://github.com/ljdursi/hadoop-for-hpcers-tutorial
Fault Tolerance
• Data is also replicated upon ingest
• The runtime watches for dead tasks and restarts them on live nodes
• Re-replicates data as needed
Jonathan Dursi https://github.com/ljdursi/hadoop-for-hpcers-tutorial
Data Distribution: Disk
• Hadoop and similar architectures handle the hardest part of parallelism for you: data distribution
• On disk: HDFS distributes and replicates data as it comes in
• It keeps track of the blocks; computations run local to the data
Jonathan Dursi https://github.com/ljdursi/hadoop-for-hpcers-tutorial
Data Distribution: Network
• On the network: MapReduce (for example) works in terms of key-value pairs
• The preprocessing (map) phase ingests data and emits (k,v) pairs
• The shuffle phase assigns reducers and gets all pairs with the same key onto that reducer (illustrated below)
• The programmer does not have to design communication patterns

Before the shuffle: (key1,83) (key2,9) (key1,99) (key2,12) (key1,17) (key5,23)
After the shuffle: (key1,[83,99,17]) (key2,[9,12]) (key5,[23])
Jonathan Dursi https://github.com/ljdursi/hadoop-for-hpcers-tutorial
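A minimal sketch of what the shuffle does with the pairs above, in plain Python (an illustration of the grouping semantics, not Hadoop code):

    from collections import defaultdict

    # (k,v) pairs emitted by the map phase across several nodes
    emitted = [("key1", 83), ("key2", 9), ("key1", 99),
               ("key2", 12), ("key1", 17), ("key5", 23)]

    # The shuffle routes all pairs with the same key to one reducer,
    # which then sees the key together with the list of its values
    groups = defaultdict(list)
    for key, value in emitted:
        groups[key].append(value)

    print(dict(groups))
    # {'key1': [83, 99, 17], 'key2': [9, 12], 'key5': [23]}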
Big Data Analytics Stack
Amir Payberah https://www.sics.se/~amir/dic.htm
Big Data – Storage (sans POSIX)
• Traditional filesystems are not well designed for large-scale data-processing systems.
• Efficiency has a higher priority than other features, e.g., directory service.
• The massive size of the data means it tends to be stored across multiple machines in a distributed way.
• HDFS, Amazon S3, ...
Amir Payberah https://www.sics.se/~amir/dic.htm
Big Data - Databases
• Relational Database Management Systems (RDBMS) were not designed to be distributed.
• NoSQL databases relax one or more of the ACID properties: BASE.
• Different data models: key/value, column-family, graph, document.
• Dynamo, Scalaris, BigTable, HBase, Cassandra, MongoDB, Voldemort, Riak, Neo4j, ...
Amir Payberah https://www.sics.se/~amir/dic.htm
Big Data – Resource Management
• Different frameworks require different computing resources.
• Large organizations need the ability to share data and resources between multiple frameworks.
• Resource management shares resources in a cluster between multiple frameworks while providing resource isolation.
• Mesos, YARN, Quincy, ...
Amir Payberah https://www.sics.se/~amir/dic.htm
Big Data – Execution Engine
• Scalable and fault-tolerant parallel data processing on clusters of unreliable machines.
• Data-parallel programming model for clusters of commodity machines.
• MapReduce, Spark, Stratosphere, Dryad, Hyracks, ...
Amir Payberah https://www.sics.se/~amir/dic.htm
Big Data – Query/Scripting Languages
• Low-level programming of execution engines, e.g., MapReduce, is not easy for end users.
• A high-level language is needed to improve the query capabilities of execution engines.
• It translates user-defined queries into the low-level API of the execution engine.
• Pig, Hive, Shark, Meteor, DryadLINQ, SCOPE, ...
Amir Payberah https://www.sics.se/~amir/dic.htm
Hadoop Ecosystem
• 2008 onwards: usage exploded
• Creation of many tools on top of the Hadoop infrastructure
The Need For Filesystems
What is a Filesystem?
• Controls how data is stored on and retrieved from disk.
Amir Payberah https://www.sics.se/~amir/dic.htm
Distributed Filesystems
• When data outgrows the storage capacity of a single machine: partition it across a number of separate machines.
• Distributed filesystems manage the storage across a network of machines.
Amir Payberah https://www.sics.se/~amir/dic.htm
What HDFS is Not Good For
• Low-latency reads
  • High throughput rather than low latency for small chunks of data.
  • HBase addresses this issue.
• Large amounts of small files
  • Better for millions of large files than for billions of small files.
• Multiple writers
  • Single writer per file.
  • Writes only at the end of a file; no support for writes at arbitrary offsets.
Amir Payberah https://www.sics.se/~amir/dic.htm
HDFS Architecture
• The Hadoop Distributed File System (HDFS)
• Offers a way to store large files across multiple machines, rather than requiring a single machine with disk capacity at least as large as the total size of the files
• HDFS is designed to be fault-tolerant
• Using data replication and distribution of data
• When a file is loaded into HDFS, it is replicated and broken up into "blocks" of data
• These blocks are stored across the cluster nodes designated for storage, a.k.a. DataNodes.
http://www.revelytix.com/?q=content/hadoop-ecosystem
Files and Blocks – 1/3
• Files are split into blocks.
• Blocks
  • Single unit of storage: a contiguous piece of information on a disk.
  • Transparent to the user.
  • Managed by the Namenode, stored by the Datanodes.
  • Blocks are traditionally either 64 MB or 128 MB: the default is 64 MB.
Amir Payberah https://www.sics.se/~amir/dic.htm
Files and Blocks – 2/3
• Why is a block in HDFS so large?
  • To minimize the cost of seeks.
• Time to read a block = seek time + transfer time
• Keeping the ratio of seek time to transfer time small means we read data from the disk almost as fast as the physical limit imposed by the disk.
• Example: if the seek time is 10 ms and the transfer rate is 100 MB/s, then to make the seek time 1% of the transfer time we need a block size of around 100 MB (re-derived below).
Amir Payberah https://www.sics.se/~amir/dic.htm
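To make the block-size arithmetic concrete, a quick re-derivation of the example in plain Python:

    seek_time = 0.010          # seconds (10 ms)
    transfer_rate = 100e6      # bytes/s (100 MB/s)
    target_ratio = 0.01        # seek time should be 1% of transfer time

    # transfer_time = block_size / transfer_rate, and we want
    # seek_time / transfer_time = target_ratio, hence:
    block_size = seek_time * transfer_rate / target_ratio
    print(block_size / 1e6, "MB")   # -> 100.0 MB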
Files and Blocks – 3/3
• The same block is replicated on multiple machines: the default is 3
  • Replica placement is rack-aware.
  • 1st replica on the local rack.
  • 2nd replica on the local rack but a different machine.
  • 3rd replica on a different rack.
• The Namenode determines replica placement (a toy version is sketched below).
Amir Payberah https://www.sics.se/~amir/dic.htm
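A toy version of the placement rule exactly as described on the slide (illustrative only; the real policy lives inside the Namenode):

    import random

    def place_replicas(local_node, racks):
        # racks maps a rack name to the list of nodes on that rack
        local_rack = next(r for r, nodes in racks.items() if local_node in nodes)
        first = local_node                              # 1st: the local machine
        second = random.choice(                         # 2nd: same rack, other machine
            [n for n in racks[local_rack] if n != local_node])
        other_rack = random.choice([r for r in racks if r != local_rack])
        third = random.choice(racks[other_rack])        # 3rd: a different rack
        return [first, second, third]

    print(place_replicas("n1", {"rackA": ["n1", "n2"], "rackB": ["n3", "n4"]}))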
HDFS Daemons
• An HDFS cluster is managed by three types of processes
• Namenode
  • Manages the filesystem, e.g., namespace, metadata, and file blocks
  • Metadata is stored in memory
• Datanode
  • Stores and retrieves data blocks
  • Reports to the Namenode
  • Runs on many machines
• Secondary Namenode
  • Only for checkpointing
  • Not a backup for the Namenode
Amir Payberah https://www.sics.se/~amir/dic.htm
Reading a file
• Reading a file is the shorter process (sketched in code below):
  1. Open the file
  2. Get the block locations from the Namenode
  3. Read the blocks from a replica
[Figure: the Namenode holds the namespace (/user/ljdursi/: diffuse, bigdata.dat); the client asks for the block locations of bigdata.dat to read lines 1...1000, then reads the blocks directly from datanode1, datanode2, datanode3]
Jonathan Dursi https://github.com/ljdursi/hadoop-for-hpcers-tutorial
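The same flow as a minimal Python sketch. The namenode/replica calls here are hypothetical stand-ins for HDFS's real ClientProtocol RPCs, just to make the three steps explicit:

    def read_file(path, namenode):
        # 1. Open: ask the Namenode where the file's blocks live
        blocks = namenode.get_block_locations(path)     # hypothetical RPC
        data = b""
        for block in blocks:
            # 2. Choose one replica per block (ideally the closest)
            replica = block.replicas[0]
            # 3. Read the block directly from that Datanode;
            #    the Namenode never touches the data itself
            data += replica.read_block(block.block_id)  # hypothetical RPC
        return data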
Writing a file
• Writing a file is a multi-stage process (sketched in code below):
  1. Create the file
  2. Get nodes for the blocks from the Namenode
  3. Start writing
  4. The Datanodes coordinate replication among themselves
  5. Get acks back (while writing)
  6. Complete
[Figure: client writes newdata.dat; the Namenode assigns Datanodes for each block, the client streams data to the first Datanode, and the Datanodes replicate down the pipeline (datanode1, datanode2, datanode3) and acknowledge back to the client]
Jonathan Dursi https://github.com/ljdursi/hadoop-for-hpcers-tutorial
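The write path in the same hypothetical style (again, the namenode/pipeline calls are illustrative stand-ins, not real HDFS API names; 64 MB is the classic default block size):

    def write_file(path, data, namenode, block_size=64 * 2**20):
        namenode.create(path)                          # 1. create
        for offset in range(0, len(data), block_size):
            chunk = data[offset:offset + block_size]
            pipeline = namenode.allocate_block(path)   # 2. get nodes (3 replicas)
            # 3. write to the first Datanode only; 4. it forwards the chunk
            #    down the replication pipeline to the other replicas
            ack = pipeline[0].write_block(chunk, forward_to=pipeline[1:])
            assert ack.ok                              # 5. ack comes back while writing
        namenode.complete(path)                        # 6. complete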
Communication Protocol
• All HDFS communication protocols are layered on top of the TCP/IP protocol
• A client establishes a connection to a configurable TCP port on the NameNode machine and uses the ClientProtocol
• DataNodes talk to the NameNode using the DataNode protocol
• A Remote Procedure Call (RPC) abstraction wraps both the ClientProtocol and the DataNode protocol
• The NameNode never initiates an RPC; it only responds to RPC requests issued by DataNodes or clients
Robustness
• The primary objective of HDFS is to store data reliably even during failures
• Three common types of failures: NameNode failure, DataNode failure, and network partitions
• Data disk failure
  • Heartbeat messages track the health of DataNodes (a toy monitor is sketched below)
  • The NameNode performs the necessary re-replication on DataNode unavailability, replica corruption, or disk fault
• Cluster rebalancing
  • Automatically move data between DataNodes if the free space on a DataNode falls below a threshold, or during sudden high demand
• Data integrity
  • Checksum checking on HDFS files, during file creation and retrieval
• Metadata disk failure
  • Manual intervention: no automatic recovery, restart, or failover
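A toy version of the heartbeat bookkeeping described above (plain Python, not Hadoop code; the ~10-minute dead-node timeout matches the order of magnitude of HDFS's default):

    import time

    HEARTBEAT_TIMEOUT = 10 * 60        # seconds; HDFS defaults to ~10 minutes

    last_seen = {}                     # DataNode id -> time of last heartbeat

    def on_heartbeat(node_id):
        last_seen[node_id] = time.time()

    def dead_nodes():
        now = time.time()
        return [n for n, t in last_seen.items() if now - t > HEARTBEAT_TIMEOUT]

    # Blocks whose replicas live on any node in dead_nodes()
    # get re-replicated onto healthy DataNodes.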
MAP-REDUCE
MapReduce
• A shared-nothing architecture for processing large data sets with a parallel/distributed algorithm on clusters.
What is it?
MapReduce Definition
• A programming model: to batch process large data sets (inspired by functional programming).
• An execution framework: to run parallel algorithms on clusters of commodity hardware.
Simplicity
• Don't worry about parallelization, fault tolerance, data distribution, or load balancing (MapReduce takes care of these).
• It hides system-level details from programmers.
Amir Payberah https://www.sics.se/~amir/dic.htm
MapReduce Dataflow
• map function: processes data and generates a set of intermediate key/value pairs.
• reduce function: merges all intermediate values associated with the same intermediate key.
Amir Payberah https://www.sics.se/~amir/dic.htm
Word Count
• Was used as an example in the original MapReduce paper
• Now basically the "hello world" of MapReduce
• Count the words in some set of documents
• A simple model of many actual web analytics problems

file01: Hello World Bye World
file02: Hello Hadoop Goodbye Hadoop
output/part-00000: Hello 2, World 2, Bye 1, Hadoop 2, Goodbye 1
Jonathan Dursi https://github.com/ljdursi/hadoop-for-hpcers-tutorial
High-Level Structure of a MR Program – 1/2
def mapper(filename, file_contents):
    # emit a (word, 1) pair for every word in the file
    for word in file_contents.split():
        yield (word, 1)

def reducer(word, values):
    # add up the 1's emitted for this word
    total = 0
    for value in values:
        total += value
    yield (word, total)
https://developer.yahoo.com/hadoop/tutorial
High-Level Structure of a MR Program – 2/2
• Several instances of the mapper function are created on the different machines in a Hadoop cluster
• Each instance receives a different input file (it is assumed that there are many such files)
• The mappers output (word, 1) pairs, which are then forwarded to the reducers
• Several instances of the reducer function are also instantiated on the different machines
• Each reducer is responsible for processing the list of values associated with a different word
• The list of values will be a list of 1's; the reducer sums those ones into a final count for its word, then emits the final (word, count) output, which is written to an output file (the whole pipeline is simulated below)
https://developer.yahoo.com/hadoop/tutorial
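Putting the pieces together: a minimal single-machine simulation of the whole map-shuffle-reduce pipeline, reusing the mapper and reducer defined above (plain Python, not Hadoop):

    from collections import defaultdict

    files = {"file01": "Hello World Bye World",
             "file02": "Hello Hadoop Goodbye Hadoop"}

    # Map phase: run the mapper over every input file
    pairs = [kv for name, text in files.items() for kv in mapper(name, text)]

    # Shuffle phase: group all emitted values by key
    groups = defaultdict(list)
    for word, count in pairs:
        groups[word].append(count)

    # Reduce phase: one reducer call per word
    for word in sorted(groups):
        for w, total in reducer(word, groups[word]):
            print(w, total)   # Bye 1, Goodbye 1, Hadoop 2, Hello 2, World 2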
Word Count
• How would you do this by hand with a huge document?
• Each time you see a word: if it is already in your list, add a tick mark beside it; otherwise add the new word with one tick
• ...but this is hard to parallelize (every update touches the shared list)
Jonathan Dursi https://github.com/ljdursi/hadoop-for-hpcers-tutorial
Word Count
• The MapReduce way: all the hard work is done by the shuffle, automatically
• Map: just emit a 1 for each word you see

file01: Hello World Bye World → (Hello,1) (World,1) (Bye,1) (World,1)
file02: Hello Hadoop Goodbye Hadoop → (Hello,1) (Hadoop,1) (Goodbye,1) (Hadoop,1)
Jonathan Dursi https://github.com/ljdursi/hadoop-for-hpcers-tutorial
Word Count
• The shuffle assigns keys (words) to reducers and sends each (k,v) pair to the appropriate reducer: it creates the list of values associated with each key
• The reducer just has to sum up the ones

Reducer 1 input: (Hello,[1,1]) (World,[1,1]) (Bye,[1]) → output: Hello 2, World 2, Bye 1
Reducer 2 input: (Hadoop,[1,1]) (Goodbye,[1]) → output: Hadoop 2, Goodbye 1

Amir Payberah https://www.sics.se/~amir/dic.htm
Jonathan Dursi https://github.com/ljdursi/hadoop-for-hpcers-tutorial
Hadoop MapReduce and HDFS
Amir Payberah https://www.sics.se/~amir/dic.htm