Hadoop Distributed Computing Framework for Big Data
![Page 1: Hadoop distributed computing framework for big data](https://reader030.vdocuments.site/reader030/viewer/2022013100/54c67c854a7959a4368b4661/html5/thumbnails/1.jpg)
Hadoop: A Distributed Computing Framework for Big Data
http://www.cyanny.com/2013/12/05/hadoop-overview/
![Page 2: Hadoop distributed computing framework for big data](https://reader030.vdocuments.site/reader030/viewer/2022013100/54c67c854a7959a4368b4661/html5/thumbnails/2.jpg)
The Motivation for Hadoop
• Hadoop is an open-source distributed computing framework for processing large-scale data sets.
• Created by Doug Cutting; it originated in Apache Nutch and was split out of Nutch in 2006.
• Based on Google's GFS paper (2003) and MapReduce paper (Jeff Dean, 2004). Google runs 200 clusters, each with 1000+ nodes.
• Yahoo: 42000 nodes, LinkedIn: 4100 nodes, Facebook: 1400 nodes, eBay: 500 nodes, TaoBao: 2000 nodes (the biggest in China)
• Ecosystem: HBase, Hive, Pig, ZooKeeper, Oozie, Mahout, …
![Page 3: Hadoop distributed computing framework for big data](https://reader030.vdocuments.site/reader030/viewer/2022013100/54c67c854a7959a4368b4661/html5/thumbnails/3.jpg)
Why Hadoop?
• Problems in traditional big data processing (MPI, Grid Computing, Volunteer Computing):
✴ It’s difficult to deal with partial failures of the system.
✴ Bandwidth is finite and precious: data from different disks must be combined over the network, and transfer time is very slow for large data volumes.
✴ Data exchange requires synchronization.
✴ Temporal dependencies are complicated.
![Page 4: Hadoop distributed computing framework for big data](https://reader030.vdocuments.site/reader030/viewer/2022013100/54c67c854a7959a4368b4661/html5/thumbnails/4.jpg)
How Hadoop Saves Big Data
• Hadoop provides partial failure support. The Hadoop Distributed File System (HDFS) can store large data sets with high reliability and scalability.
• HDFS provides strong fault tolerance. A partial failure does not bring down the entire system, and HDFS can recover the data affected by a partial failure.
• Hadoop introduces MapReduce, which spares programmers from low-level details such as partial failure. The MapReduce framework detects failed tasks and reschedules them automatically.
• Hadoop provides data locality. The MapReduce framework tries to collocate the data with the compute nodes. Data is local, and tasks are separate, with no dependence on each other. This shared-nothing, data-local architecture saves bandwidth and avoids the complicated dependency problem.
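The data locality idea can be illustrated with a minimal sketch. It is not Hadoop's scheduler: the function name `assign_tasks`, the node and block names, and the one-task-per-node simplification are all assumptions made for illustration.

```python
def assign_tasks(block_locations, free_nodes):
    """Assign one task per block, preferring a node that already holds the block.

    block_locations: dict mapping block id -> list of nodes holding a replica.
    free_nodes: list of nodes with one free task slot each (a simplification).
    Returns a dict mapping block id -> (node, is_data_local).
    """
    available = list(free_nodes)
    assignment = {}
    # First pass: give each block a data-local node while one is still free.
    for block, holders in block_locations.items():
        local = next((node for node in holders if node in available), None)
        if local is not None:
            assignment[block] = (local, True)
            available.remove(local)
    # Second pass: remaining blocks go to any free node and read over the network.
    for block in block_locations:
        if block not in assignment and available:
            assignment[block] = (available.pop(0), False)
    return assignment


# Hypothetical cluster: three blocks, four nodes with a free slot.
blocks = {"b1": ["node1", "node2"], "b2": ["node2", "node3"], "b3": ["node1"]}
plan = assign_tasks(blocks, ["node1", "node2", "node3", "node4"])
for block, (node, is_local) in plan.items():
    print(block, node, "local" if is_local else "remote")
```

In this toy run, two of the three tasks read their block from local disk, and only the third has to pull data across the network.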
![Page 5: Hadoop distributed computing framework for big data](https://reader030.vdocuments.site/reader030/viewer/2022013100/54c67c854a7959a4368b4661/html5/thumbnails/5.jpg)
Hadoop Basic Concepts
• The core concept of Hadoop is to distribute the data as it is initially stored in the system. This is data locality.
• Applications are written in high-level code.
• Nodes depend on each other as little as possible.
• Data replication: data is spread among machines in advance.
![Page 6: Hadoop distributed computing framework for big data](https://reader030.vdocuments.site/reader030/viewer/2022013100/54c67c854a7959a4368b4661/html5/thumbnails/6.jpg)
Hadoop High-Level Overview
• HDFS (Hadoop Distributed File System) is a distributed file system designed to store large data sets and streaming data sets on commodity hardware with high scalability, reliability and availability.
• MapReduce is a parallel programming model and an associated implementation for processing and generating large data sets. It provides a clean abstraction for programmers.
![Page 7: Hadoop distributed computing framework for big data](https://reader030.vdocuments.site/reader030/viewer/2022013100/54c67c854a7959a4368b4661/html5/thumbnails/7.jpg)
Master-Slave Architecture
• NameNode: holds the HDFS namespace and metadata.
• Secondary NameNode: performs housekeeping functions for the NameNode; it is not a backup or hot standby for the NameNode.
• DataNode: stores the actual HDFS data blocks. In Hadoop, a large file is split into 64 MB or 128 MB blocks.
• JobTracker: manages MapReduce jobs and distributes individual tasks to the machines running TaskTrackers.
• TaskTracker: initiates and monitors individual Map and Reduce tasks.
Each daemon runs in its own JVM.
![Page 8: Hadoop distributed computing framework for big data](https://reader030.vdocuments.site/reader030/viewer/2022013100/54c67c854a7959a4368b4661/html5/thumbnails/8.jpg)
POSIX: Portable Operating System Interface
An Example
• WordCount
• bin/hadoop fs -copyFromLocal conf input
• bin/hadoop jar hadoop-examples-1.2.1.jar grep input output 'dfs[a-z.]+'
• bin/hadoop fs -cat output/*
• localhost:50030, check MapReduce status
• localhost:50070, check HDFS status
![Page 9: Hadoop distributed computing framework for big data](https://reader030.vdocuments.site/reader030/viewer/2022013100/54c67c854a7959a4368b4661/html5/thumbnails/9.jpg)
HDFS: Basic Concepts
• Highly fault-tolerant: handle partial failure
• Streaming data access: data stored in blocks (64 MB, 128 MB), “write once, read many times”
• Large data sets: GB, TB, PB
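To make the block sizes concrete, the sketch below computes how a file would be cut into fixed-size blocks. The function name `split_into_blocks` is hypothetical and the code only does the arithmetic; it does not talk to HDFS.

```python
MB = 1024 * 1024


def split_into_blocks(file_size, block_size=64 * MB):
    """Return a list of (offset, length) pairs, one per block of the file."""
    blocks = []
    offset = 0
    while offset < file_size:
        # Every block is full-sized except possibly the last one.
        length = min(block_size, file_size - offset)
        blocks.append((offset, length))
        offset += length
    return blocks


blocks_64 = split_into_blocks(200 * MB)               # 64 MB blocks
blocks_128 = split_into_blocks(200 * MB, 128 * MB)    # 128 MB blocks
print(len(blocks_64), blocks_64[-1][1] // MB)         # 4 blocks, last one 8 MB
print(len(blocks_128), blocks_128[-1][1] // MB)       # 2 blocks, last one 72 MB
```

The last block holds only the remainder of the file, so a 200 MB file with 64 MB blocks ends in an 8 MB block.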
![Page 10: Hadoop distributed computing framework for big data](https://reader030.vdocuments.site/reader030/viewer/2022013100/54c67c854a7959a4368b4661/html5/thumbnails/10.jpg)
HDFS Architecture
• NameNode: namespace tree (logical file location and physical location in RAM)
• DataNode: store actual data blocks
• Communication: RPC
![Page 11: Hadoop distributed computing framework for big data](https://reader030.vdocuments.site/reader030/viewer/2022013100/54c67c854a7959a4368b4661/html5/thumbnails/11.jpg)
Secondary NameNode
• NameNode data persistence: FSImage and EditLog
✤ The FSImage persists the filesystem tree, the mapping of files to blocks, and the filesystem properties
✤ Physical block locations are not persisted; they are kept in RAM
• Checkpoint: merge the EditLog into the FSImage
• Secondary NameNode housekeeping: performs checkpoints periodically
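The checkpoint step can be sketched as replaying a log of edits onto a snapshot. This is a toy model only: the dict-based image, the tuple-based edit records, and the names `apply_edit` and `checkpoint` are assumptions for illustration and do not reflect the real on-disk FSImage or EditLog formats.

```python
def apply_edit(namespace, edit):
    """Apply one logged operation to an in-memory namespace (path -> block ids)."""
    op, path, *args = edit
    if op == "create":
        namespace[path] = list(args[0])
    elif op == "delete":
        namespace.pop(path, None)
    elif op == "rename":
        namespace[args[0]] = namespace.pop(path)
    else:
        raise ValueError(f"unknown operation: {op}")


def checkpoint(fsimage, edit_log):
    """Merge the edit log into a copy of the image; return (new image, empty log)."""
    new_image = {path: list(blocks) for path, blocks in fsimage.items()}
    for edit in edit_log:
        apply_edit(new_image, edit)
    return new_image, []


# Hypothetical state: one file in the last image, three edits logged since then.
fsimage = {"/a.txt": ["blk_1"]}
edit_log = [
    ("create", "/b.txt", ["blk_2", "blk_3"]),
    ("rename", "/a.txt", "/c.txt"),
    ("delete", "/b.txt"),
]
new_image, new_log = checkpoint(fsimage, edit_log)
print(new_image)  # {'/c.txt': ['blk_1']}
```

After the merge the log can start empty again, which keeps it short and keeps a NameNode restart from replaying a long history.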
![Page 12: Hadoop distributed computing framework for big data](https://reader030.vdocuments.site/reader030/viewer/2022013100/54c67c854a7959a4368b4661/html5/thumbnails/12.jpg)
HDFS: Data Replica
• 3 replicas: high reliability
• one replica on one node in the local rack
• the second one on a node in a different remote rack
• the third one on a different node in the same remote rack.
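The three-replica placement rule above can be sketched as follows. The topology, the node names, and the function `place_replicas` are hypothetical, and using the writer's own node as the local-rack replica is a simplifying assumption; the real NameNode also weighs factors such as free space and load.

```python
import random


def rack_of(topology, node):
    """Return the rack that contains the given node."""
    for rack, nodes in topology.items():
        if node in nodes:
            return rack
    raise KeyError(node)


def place_replicas(topology, writer_node, rng=None):
    """Pick three nodes: one in the local rack, two distinct nodes in one remote rack."""
    rng = rng or random.Random()
    local_rack = rack_of(topology, writer_node)
    # Only remote racks with at least two nodes can host replicas two and three.
    candidates = [rack for rack, nodes in topology.items()
                  if rack != local_rack and len(nodes) >= 2]
    if not candidates:
        raise ValueError("need a remote rack with at least two nodes")
    remote_rack = rng.choice(candidates)
    second, third = rng.sample(topology[remote_rack], 2)
    return [writer_node, second, third]


topology = {
    "rack1": ["n1", "n2"],
    "rack2": ["n3", "n4", "n5"],
    "rack3": ["n6", "n7"],
}
replicas = place_replicas(topology, "n1", random.Random(0))
racks = [rack_of(topology, node) for node in replicas]
print(replicas, racks)
```

With this layout the block survives the loss of a whole rack, while only one copy has to cross racks twice during the write.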
![Page 13: Hadoop distributed computing framework for big data](https://reader030.vdocuments.site/reader030/viewer/2022013100/54c67c854a7959a4368b4661/html5/thumbnails/13.jpg)
SPOF: HDFS Federation
• Scale NameNode
• Each NameNode has Namespace Volume:
✴NameSpace
✴Block Pool
• DataNode: stores blocks from different NameNodes (NN).
SPOF: Single Point of Failure
![Page 14: Hadoop distributed computing framework for big data](https://reader030.vdocuments.site/reader030/viewer/2022013100/54c67c854a7959a4368b4661/html5/thumbnails/14.jpg)
SPOF: HDFS High Availability(HA)
• A hot standby NameNode
• The active NN writes its updates to shared NFS storage
• The standby NN pulls and merges the logs, keeping its in-memory state up to date
• DataNodes send block reports to both NNs
• Failover in tens of seconds
![Page 15: Hadoop distributed computing framework for big data](https://reader030.vdocuments.site/reader030/viewer/2022013100/54c67c854a7959a4368b4661/html5/thumbnails/15.jpg)
MapReduce
• A Map task processes a key/value pair to generate a set of intermediate key/value pairs.
✴ Input: the key is the offset of each line, the value is the line itself
✴ Output: <apple, 1> … <pear, 1>, <peach, 1>, written to local disk, not HDFS
• A Reduce task merges all intermediate values associated with the same intermediate key.
✴ Preceded by shuffle and sort
✴ Input: the output of the map tasks with the same key, like <apple, 1> … <apple, 1>
✴ Output: <apple, 5>, written to HDFS
• No reduce task can start until every map task has finished, so slow tasks are mitigated with speculative execution.
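The map, shuffle-and-sort, and reduce steps can be sketched as a single-process word count. This is plain Python, not the Hadoop API: the names `map_task`, `shuffle_and_sort`, `reduce_task` and `run_job` are illustrative, and offsets are counted in characters rather than bytes.

```python
from collections import defaultdict


def map_task(offset, line):
    """Map: key is the line offset, value is the line; emit <word, 1> pairs."""
    for word in line.split():
        yield (word.lower(), 1)


def shuffle_and_sort(pairs):
    """Group intermediate values by key and sort the keys."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reduce_task(key, values):
    """Reduce: merge all values that share the same intermediate key."""
    return (key, sum(values))


def run_job(text):
    intermediate = []
    offset = 0
    for line in text.splitlines(keepends=True):
        intermediate.extend(map_task(offset, line))
        offset += len(line)
    # Reducing starts only after every map call above has finished.
    return [reduce_task(key, values) for key, values in shuffle_and_sort(intermediate)]


result = run_job("apple pear\napple peach\napple\n")
print(result)  # [('apple', 3), ('peach', 1), ('pear', 1)]
```

In a real cluster the map calls run on many nodes and the grouped pairs travel over the network to the reducers, but the data flow is the same.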
![Page 16: Hadoop distributed computing framework for big data](https://reader030.vdocuments.site/reader030/viewer/2022013100/54c67c854a7959a4368b4661/html5/thumbnails/16.jpg)
MapReduce
![Page 17: Hadoop distributed computing framework for big data](https://reader030.vdocuments.site/reader030/viewer/2022013100/54c67c854a7959a4368b4661/html5/thumbnails/17.jpg)
MapReduce v1 Framework
![Page 18: Hadoop distributed computing framework for big data](https://reader030.vdocuments.site/reader030/viewer/2022013100/54c67c854a7959a4368b4661/html5/thumbnails/18.jpg)
![Page 19: Hadoop distributed computing framework for big data](https://reader030.vdocuments.site/reader030/viewer/2022013100/54c67c854a7959a4368b4661/html5/thumbnails/19.jpg)
MapReduce v2 Framework
YARN (Yet Another Resource Negotiator)
• ResourceManager components: Scheduler and Applications Manager
• Application Master: monitors tasks
![Page 20: Hadoop distributed computing framework for big data](https://reader030.vdocuments.site/reader030/viewer/2022013100/54c67c854a7959a4368b4661/html5/thumbnails/20.jpg)
YARN’s Beauty
• Fine-grained, dynamic memory allocation (1 GB to 10 GB) instead of fixed slots
• No JVM reuse: each task runs in its own JVM
• MapReduce is just one kind of application
• The Application Master, not the ResourceManager, aggregates job status
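The difference between fixed slots and memory-sized containers can be sketched with a small allocator. It is not YARN's scheduler: the function `allocate_containers`, the first-come-first-served policy, and the 16 GB node are assumptions for illustration; only the 1 GB to 10 GB range comes from the slide.

```python
def allocate_containers(node_memory_gb, requests_gb, min_gb=1, max_gb=10):
    """Grant memory-sized containers in request order until the node is full.

    Returns (granted sizes, requests left pending, free memory in GB).
    """
    free = node_memory_gb
    granted, pending = [], []
    for request in requests_gb:
        if not min_gb <= request <= max_gb:
            raise ValueError(f"request of {request} GB is outside {min_gb}-{max_gb} GB")
        if request <= free:
            granted.append(request)
            free -= request
        else:
            pending.append(request)
    return granted, pending, free


# Hypothetical 16 GB node receiving four container requests.
granted, pending, free = allocate_containers(16, [2, 8, 4, 4])
print(granted, pending, free)  # [2, 8, 4] [4] 2
```

With fixed slots, the 2 GB and 8 GB tasks would each occupy one identically sized slot; sizing containers by request lets small and large tasks share the node more tightly.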
![Page 21: Hadoop distributed computing framework for big data](https://reader030.vdocuments.site/reader030/viewer/2022013100/54c67c854a7959a4368b4661/html5/thumbnails/21.jpg)
![Page 22: Hadoop distributed computing framework for big data](https://reader030.vdocuments.site/reader030/viewer/2022013100/54c67c854a7959a4368b4661/html5/thumbnails/22.jpg)
When Not to Use Hadoop?
• Low-latency data access: for real-time needs, use HBase
• Structured data: use an RDBMS for ad-hoc SQL queries
• When the data isn’t that big: Hadoop is meant for TB and PB, not GB
• Too many small files
• Workloads that write more than they read
• MapReduce may not be the best choice: it suits data that has no dependencies and can be processed in parallel.
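The small-files problem can be made concrete with a rough estimate of NameNode memory. The figure of about 150 bytes per namespace object (file or block) is a commonly quoted rule of thumb, not something stated in these slides, and the function `namenode_memory_bytes` is hypothetical.

```python
MB = 1024 * 1024
TB = 1024 * 1024 * MB
BYTES_PER_OBJECT = 150  # assumed rule of thumb per file or block object


def namenode_memory_bytes(num_files, file_size, block_size=128 * MB):
    """Estimate NameNode heap used by the file and block objects of a data set."""
    blocks_per_file = max(1, -(-file_size // block_size))  # ceiling division
    objects = num_files * (1 + blocks_per_file)            # one file object plus its blocks
    return objects * BYTES_PER_OBJECT


# The same 1 TB of data stored as 1 MB files versus 128 MB files.
small = namenode_memory_bytes(TB // MB, 1 * MB)
large = namenode_memory_bytes(TB // (128 * MB), 128 * MB)
print(small // MB, "MB versus", round(large / MB, 2), "MB")  # 300 MB versus 2.34 MB
```

Under these assumptions the same terabyte costs the NameNode over a hundred times more memory when it arrives as small files, because metadata grows with the number of objects rather than with the data volume.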
![Page 23: Hadoop distributed computing framework for big data](https://reader030.vdocuments.site/reader030/viewer/2022013100/54c67c854a7959a4368b4661/html5/thumbnails/23.jpg)
Thank You!