An Introduction to Hadoop
DESCRIPTION
An Introduction to Hadoop and the MapReduce paradigm. (A presentation that I gave in mid-2010.)

TRANSCRIPT
An introduction to Hadoop
Hello
• Processing against a 156-node cluster
• Certified Hadoop Developer
• Certified Hadoop System Administrator
Goals
• Why should you care?
• What is it?
• How does it work?
Data Everywhere

“Every two days now we create as much information as we did from the dawn of civilization up until 2003”
- Eric Schmidt, then CEO of Google, Aug 4, 2010
The Hadoop Project
• Originally based on papers published by Google in 2003 and 2004
• Hadoop development started in 2006 at Yahoo!
• Top-level Apache Software Foundation project
• Large, active user base and user groups
• Very active development, strong development team
Who Uses Hadoop?
Hadoop Components
HDFS (Storage): self-healing, high-bandwidth clustered storage

MapReduce (Processing): fault-tolerant, distributed processing
Typical Cluster
• 3-4,000 commodity servers
• Each server:
  • 2x quad-core CPUs
  • 16-24 GB RAM
  • 4-12 TB disk space
• 20-30 servers per rack
2 Kinds of Nodes
• Master Nodes
• Slave Nodes
Master Nodes
• NameNode
  • only 1 per cluster
  • metadata server and database
  • SecondaryNameNode helps with some housekeeping
• JobTracker
  • only 1 per cluster
  • job scheduler
Slave Nodes
• DataNodes
  • 1-4,000 per cluster
  • block data storage
• TaskTrackers
  • 1-4,000 per cluster
  • task execution
HDFS Basics
• HDFS is a filesystem written in Java
• Sits on top of a native filesystem
• Provides redundant storage for massive amounts of data
• Runs on cheap(ish), unreliable computers
HDFS Data
• Data is split into blocks and stored on multiple nodes in the cluster
• Each block is usually 64 MB or 128 MB (configurable)
• Each block is replicated multiple times (configurable)
• Replicas are stored on different DataNodes
• Designed for large files, 100 MB+
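The block splitting described above can be sketched in a few lines of Python. This is purely illustrative, not Hadoop's implementation; 128 MB is just the configurable default mentioned above.

```python
# Toy sketch of HDFS-style block splitting (illustrative only).
BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB default, configurable in real HDFS

def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return the sizes of the blocks a file of file_size bytes becomes."""
    sizes = []
    remaining = file_size
    while remaining > 0:
        sizes.append(min(block_size, remaining))
        remaining -= block_size
    return sizes

# A 300 MB file becomes three blocks: 128 MB, 128 MB, and a 44 MB remainder.
blocks = split_into_blocks(300 * 1024 * 1024)
print([b // (1024 * 1024) for b in blocks])  # [128, 128, 44]
```

Note that the last block only occupies as much space as it needs; a 44 MB tail block does not consume a full 128 MB.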
NameNode
• A single NameNode stores all metadata
• Filenames, locations on DataNodes of each block, owner, group, etc.
• All information is maintained in RAM for fast lookup
• Filesystem metadata size is therefore limited by the amount of RAM available on the NameNode
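To see why RAM becomes the limit, here is a back-of-envelope estimate. The figure of roughly 150 bytes per metadata object (file or block) is a commonly quoted rule of thumb, not an exact number:

```python
# Rough NameNode heap estimate. ~150 bytes per metadata object is a
# widely quoted rule of thumb, not an exact figure.
BYTES_PER_OBJECT = 150

def namenode_ram_gb(num_files, blocks_per_file):
    objects = num_files * (1 + blocks_per_file)  # one file entry + its blocks
    return objects * BYTES_PER_OBJECT / 1024**3

# 10 million files of 2 blocks each -> roughly 4 GB of heap just for metadata
print(round(namenode_ram_gb(10_000_000, 2), 1))  # 4.2
```

This is also why HDFS favors a small number of large files over many small ones: every file costs NameNode memory regardless of its size.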
SecondaryNameNode
• The SecondaryNameNode is not a failover NameNode
• Performs memory-intensive administrative functions on behalf of the NameNode
• Should run on a separate machine
DataNodes
• DataNodes store file contents
• Stored as opaque ‘blocks’ on the underlying filesystem
• Different blocks of the same file are stored on different DataNodes
• The same block is stored on three (or more) DataNodes for redundancy
Self-healing
• DataNodes send heartbeats to the NameNode
• After a period without any heartbeats, a DataNode is assumed to be lost
• The NameNode determines which blocks were on the lost node
• The NameNode finds other DataNodes with copies of these blocks
• These DataNodes are instructed to copy the blocks to other nodes
• Replication is actively maintained
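The re-replication steps above can be sketched as follows. This is illustrative only; the node and block names are hypothetical, and real HDFS is considerably more involved:

```python
# Sketch of the NameNode's re-replication decision described above
# (illustrative; not Hadoop's actual code).
REPLICATION = 3  # target replication factor, configurable in real HDFS

def blocks_to_re_replicate(block_locations, live_nodes):
    """For each block, count surviving replicas and report how many
    new copies are needed to return to the target replication."""
    needed = {}
    for block, nodes in block_locations.items():
        survivors = [n for n in nodes if n in live_nodes]
        if len(survivors) < REPLICATION:
            needed[block] = REPLICATION - len(survivors)
    return needed

locations = {
    "blk_1": ["node1", "node2", "node3"],
    "blk_2": ["node1", "node4", "node5"],
}
# node1 stopped sending heartbeats and is declared lost
live = {"node2", "node3", "node4", "node5"}
print(blocks_to_re_replicate(locations, live))  # {'blk_1': 1, 'blk_2': 1}
```

The NameNode would then instruct a surviving holder of each under-replicated block to copy it to another node.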
HDFS Data Storage
• The NameNode holds file metadata
• The DataNodes hold the actual data
• Block size is 64 MB, 128 MB, etc.
• Each block is replicated three times

NameNode metadata:
  foo.txt: blk_1, blk_2, blk_3
  bar.txt: blk_4, blk_5

DataNodes (one node per line):
  blk_1 blk_2 blk_3 blk_5
  blk_1 blk_3 blk_4
  blk_1 blk_4 blk_5
  blk_2 blk_4
  blk_2 blk_3 blk_5
What is MapReduce?
• MapReduce is a method for distributing a task across multiple nodes
• Automatic parallelization and distribution
• Each node processes data stored on that node (the processing goes to the data)
Features of MapReduce
• Fault tolerance
• Status and monitoring tools
• A clean abstraction for programmers
JobTracker
• MapReduce jobs are controlled by a software daemon known as the JobTracker
• The JobTracker resides on a master node
  • Assigns Map and Reduce tasks to other nodes on the cluster
  • These nodes each run a software daemon known as the TaskTracker
  • The TaskTracker is responsible for actually instantiating the Map or Reduce task and reporting progress back to the JobTracker
Two Parts
• Developer specifies two functions:
  • map()
  • reduce()
• The framework does the rest
map()
• The Mapper reads data in the form of key/value pairs
• It outputs zero or more key/value pairs

map(key_in, value_in) -> (key_out, value_out)
reduce()
• After the Map phase, all the intermediate values for a given intermediate key are combined into a list
• This list is given to one or more Reducers
• The Reducer outputs zero or more final key/value pairs
• These are written to HDFS
map() Word Count

map(String input_key, String input_value):
    foreach word w in input_value:
        emit(w, 1)

Input:
(1234, “to be or not to be”)
(5678, “to see or not to see”)

Output:
(“to”,1), (“be”,1), (“or”,1), (“not”,1), (“to”,1), (“be”,1),
(“to”,1), (“see”,1), (“or”,1), (“not”,1), (“to”,1), (“see”,1)
reduce() Word Count

reduce(String output_key, List intermediate_vals):
    set count = 0
    foreach v in intermediate_vals:
        count += v
    emit(output_key, count)

Input:
(“to”, [1,1,1,1]) (“be”, [1,1]) (“or”, [1,1]) (“not”, [1,1]) (“see”, [1,1])

Output:
(“to”, 4) (“be”, 2) (“or”, 2) (“not”, 2) (“see”, 2)
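The two word-count slides above can be simulated end to end in plain Python. This is a sketch of the MapReduce data flow (map, shuffle/group-by-key, reduce), not Hadoop's actual Java API:

```python
from collections import defaultdict

# Minimal in-process simulation of the MapReduce word count above.
# Real Hadoop distributes these phases across the cluster.

def map_fn(key, value):
    for word in value.split():
        yield (word, 1)

def reduce_fn(key, values):
    yield (key, sum(values))

def run_mapreduce(inputs, map_fn, reduce_fn):
    # Map phase: emit intermediate key/value pairs
    intermediate = defaultdict(list)
    for k, v in inputs:
        for out_k, out_v in map_fn(k, v):
            intermediate[out_k].append(out_v)  # shuffle: group values by key
    # Reduce phase: one reduce call per intermediate key
    results = {}
    for k, vals in intermediate.items():
        for out_k, out_v in reduce_fn(k, vals):
            results[out_k] = out_v
    return results

inputs = [(1234, "to be or not to be"), (5678, "to see or not to see")]
print(run_mapreduce(inputs, map_fn, reduce_fn))
# {'to': 4, 'be': 2, 'or': 2, 'not': 2, 'see': 2}
```

The defaultdict plays the role of the shuffle/sort step: it collects every value emitted for a key into one list before the reduce phase runs.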
Resources
http://hadoop.apache.org/
http://developer.yahoo.com/hadoop/
http://www.cloudera.com/resources/?media=Video
Questions?