Hadoop: The Elephant in the Room
TRANSCRIPT
Apache Hadoop
The elephant in the room
C. Aaron Cois, Ph.D.
The Problem
Large-Scale Computation
• Traditionally, large-scale computation focused on:
  – Complex, CPU-intensive calculations
  – Relatively small data sets
• Examples:
  – Solving complex differential equations
  – Calculating digits of Pi
Parallel Processing
• Distributed systems allow scalable computation (more processors, working simultaneously)
[Diagram: input flowing through parallel processors to output]
Data Storage
• Data is often stored on a SAN
• Data is copied to each compute node at compute time
• This works well for small amounts of data, but requires significant copy time for large data sets
[Diagram: data stored on the SAN, separate from the compute nodes]

[Diagram: compute nodes calculating]
You must first distribute the data each time you run a computation…
How much data?

• over 25 PB of data
• over 100 PB of data
The internet

IDC estimates[2] the internet contains at least:
1 Zettabyte, or
1,000 Exabytes, or
1,000,000 Petabytes

[2] http://www.emc.com/collateral/analyst-reports/expanding-digital-idc-white-paper.pdf (2007)
How much time?

Disk transfer rates[1]:
• Standard 7200 RPM drive: 128.75 MB/s
  ⇒ 7.7 secs/GB
  ⇒ 13 mins/100 GB
  ⇒ over 2 hours/TB
  ⇒ 90 days/PB

[1] http://en.wikipedia.org/wiki/Hard_disk_drive#Data_transfer_rate
How much time?

Fastest network transfer rate:
• iSCSI over 100 Gb Ethernet (theoretical)
  – 12.5 GB/s ⇒ 80 secs/TB, 1,333 mins/PB

OK, ignore the network bottleneck:
• HyperTransport bus
  – 51.2 GB/s ⇒ 19 secs/TB, 325 mins/PB

[1] http://en.wikipedia.org/wiki/List_of_device_bit_rates
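The figures above are plain division of data size by transfer rate. A quick sketch, using the rates quoted on the slides:

```python
# Sketch: reproduce the slides' transfer-time arithmetic.
# time = data size / transfer rate; rates below are from the slides.

def transfer_seconds(size_gb, rate_mb_per_s):
    """Seconds to move size_gb gigabytes at rate_mb_per_s MB/s."""
    return size_gb * 1000 / rate_mb_per_s

DISK = 128.75  # MB/s, standard 7200 RPM drive
print(f"{transfer_seconds(1, DISK):.1f} secs/GB")              # ~7.8 secs/GB
print(f"{transfer_seconds(1000, DISK) / 3600:.1f} hours/TB")   # ~2.2 hours/TB
print(f"{transfer_seconds(10**6, DISK) / 86400:.0f} days/PB")  # ~90 days/PB

# Even at a theoretical 12.5 GB/s network rate, a terabyte takes 80 seconds.
print(f"{transfer_seconds(1000, 12_500):.0f} secs/TB at 12.5 GB/s")  # 80 secs/TB
```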
We need a better plan

• Sending data to distributed processors is the bottleneck
• So what if we sent the processors to the data?

Core concept: Pre-distribute and store the data. Assign compute nodes to operate on local data.
The Solution

Distributed Data Servers
Distribute the Data

[Diagram: data blocks stored across the distributed data servers]
Send computation code to servers containing relevant data

[Diagram: computation code dispatched to the servers holding the relevant data blocks]
Hadoop Origin

• Hadoop was modeled after innovative systems created by Google
• Designed to handle massive (web-scale) amounts of data

Fun fact: Hadoop’s creator, Doug Cutting, named it after his son’s stuffed elephant
Hadoop Goals

• Store massive data sets
• Enable distributed computation
• Heavy focus on:
  – Fault tolerance
  – Data integrity
  – Commodity hardware
Hadoop System

| Google | Hadoop |
| --- | --- |
| GFS | HDFS |
| MapReduce | Hadoop MapReduce |
| BigTable | HBase |
Components
HDFS

• “Hadoop Distributed File System”
• Sits on top of the native filesystem (ext3, etc.)
• Stores data in files, replicated and distributed across data nodes
• Files are “write once”
• Performs best with millions of ~100 MB+ files
HDFS

• Files are split into blocks for storage
• Datanodes
  – Data blocks are distributed/replicated across datanodes
• Namenode
  – The master node
  – Keeps track of the location of data blocks
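As a toy illustration (not the real HDFS API), the namenode's bookkeeping can be sketched as splitting a file into fixed-size blocks and recording which datanodes hold each replica. The node names and round-robin placement are simplifying assumptions; real HDFS uses rack-aware replica placement.

```python
import itertools
import math

BLOCK_SIZE = 128 * 1024 * 1024  # bytes; 128 MB is a typical HDFS block size

def place_blocks(file_size, datanodes, replication=3):
    """Split a file into blocks and assign each block to `replication`
    datanodes, round-robin, like a namenode's block-location table."""
    n_blocks = math.ceil(file_size / BLOCK_SIZE)
    ring = itertools.cycle(datanodes)
    return {b: [next(ring) for _ in range(replication)] for b in range(n_blocks)}

# A 300 MB file becomes 3 blocks, each stored on 3 of the 4 datanodes.
table = place_blocks(300 * 1024 * 1024, ["dn1", "dn2", "dn3", "dn4"])
print(table[0])  # ['dn1', 'dn2', 'dn3']
```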
HDFS

[Diagram: multi-node cluster; the master runs the Name Node and each slave runs a Data Node]
MapReduce

A programming model:
– Designed to make it easy to program parallel computation over large, distributed data sets
– Each node processes data already residing on it (when possible)
– Inspired by the map and reduce functions of functional programming
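Those functional-programming roots are visible in Python's built-in `map` and `functools.reduce`, which express the same two phases on a single machine:

```python
from functools import reduce

nums = [1, 2, 3, 4]
squared = list(map(lambda x: x * x, nums))   # map: apply a function to each element
total = reduce(lambda a, b: a + b, squared)  # reduce: fold the results into one value
print(squared, total)  # [1, 4, 9, 16] 30
```

Hadoop scales this idea out: the map function runs in parallel on many nodes, each over its local slice of the data.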
MapReduce

• JobTracker
  – Runs on a master node
  – Clients submit jobs to the JobTracker
  – Assigns Map and Reduce tasks to slave nodes
• TaskTracker
  – Runs on every slave node
  – A daemon that instantiates Map or Reduce tasks and reports results to the JobTracker
MapReduce

[Diagram: multi-node cluster; the master runs the JobTracker and each slave runs a TaskTracker]
[Diagram: one multi-node cluster with two layers: the MapReduce layer (JobTracker on the master, TaskTrackers on the slaves) sits on top of the HDFS layer (NameNode on the master, DataNodes on the slaves)]
HBase

• Hadoop’s database
• Sits on top of HDFS
• Provides random read/write access to Very Large™ tables
  – Billions of rows, billions of columns
• Access via Java, Jython, Groovy, Scala, or a REST web service
A Typical Hadoop Cluster

• Consists entirely of commodity ~$5k servers
• 1 master, 1 to 1000+ slaves
• Scales linearly as more processing nodes are added
How it works

Traditional MapReduce

Image credit: http://en.wikipedia.org/wiki/MapReduce
Hadoop MapReduce
Image Credit: http://www.drdobbs.com/database/hadoop-the-lay-of-the-land/240150854
MapReduce Example

```
function map(String name, String document):
    for each word w in document:
        emit(w, 1)

function reduce(String word, Iterator partialCounts):
    sum = 0
    for each pc in partialCounts:
        sum += ParseInt(pc)
    emit(word, sum)
```
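The word-count pseudocode can be simulated in a single Python process. The function names and the explicit shuffle step here are illustrative assumptions; a real job would run through Hadoop's Java MapReduce API or Hadoop Streaming, with the framework performing the shuffle.

```python
from collections import defaultdict

def map_phase(name, document):
    # map: emit a (word, 1) pair for every word in the document
    return [(w, 1) for w in document.split()]

def shuffle(pairs):
    # the framework's shuffle: group intermediate values by key
    groups = defaultdict(list)
    for word, count in pairs:
        groups[word].append(count)
    return groups

def reduce_phase(word, partial_counts):
    # reduce: sum the partial counts for one word
    return word, sum(partial_counts)

docs = {"d1": "the quick brown fox", "d2": "the lazy dog"}
pairs = [p for name, doc in docs.items() for p in map_phase(name, doc)]
counts = dict(reduce_phase(w, pcs) for w, pcs in shuffle(pairs).items())
print(counts["the"], counts["fox"])  # 2 1
```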
What didn’t I worry about?

• Data distribution
• Node management
• Concurrency
• Error handling
• Node failure
• Load balancing
• Data replication/integrity
Demo
Try the demo yourself!
Go to:
https://github.com/cacois/vagrant-hadoop-cluster
Follow the instructions in the README