hadoop distributed file system
Hadoop Distributed File System (HDFS)
Big Data Concepts
• Volume – No more GBs of data: TB, PB, EB, ZB
• Velocity – High-frequency data, like in stocks
• Variety – Structured and unstructured data
Challenges In Big Data
• Complex – No proper understanding of the underlying data
• Storage
– How to accommodate a large amount of data on a single physical machine
• Performance – How to process a large amount of data efficiently and effectively so as to increase performance
Challenges in Traditional Application
• Network – Limited bandwidth
• Data – Growth of data can’t be controlled
• Efficiency & Performance
– How fast data can be read
• Processing capacity of machine
– Processor, RAM is a bottleneck
Statistics
Application Size (MB)   Data Size         Total Round Trip Time (sec)
10                      10 MB             1 + 1 = 2
10                      100 MB            10 + 10 = 20
10                      1000 MB = 1 GB    100 + 100 = 200 (~3.3 min)
10                      1000 GB = 1 TB    100000 + 100000 = 200000 (~55 hours)
• Calculation is done under ideal conditions
• No processing time is taken into consideration
• Assuming N/W bandwidth is 10 MB/sec
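The round-trip figures can be reproduced with a few lines of Python. This is a minimal sketch under the slide's own assumptions (10 MB/sec bandwidth, no processing time, 1 TB taken as 1,000,000 MB); the function name is illustrative only.

```python
# Assumption: bandwidth is 10 MB/sec, which matches the slide's arithmetic.
BANDWIDTH_MB_PER_SEC = 10

def round_trip_seconds(data_size_mb, bandwidth=BANDWIDTH_MB_PER_SEC):
    """Time to move the data to the application and back, ignoring processing."""
    one_way = data_size_mb / bandwidth
    return 2 * one_way

# 10 MB, 100 MB, 1 GB and 1 TB, all expressed in MB.
for size_mb in (10, 100, 1_000, 1_000_000):
    secs = round_trip_seconds(size_mb)
    print(f"{size_mb:>9} MB -> {secs:>8.0f} sec ({secs / 3600:.1f} hours)")
```

The last row works out to 200000 seconds, which is where the ~55 hours comes from.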
Statistics (Contd.)
• How is data read?
• Line-by-line reading
• Depends on seek rate and disk latency
• Average data transfer rate = 75 MB/sec
• Total time to read 100 GB = 22 min
• Total time to read 1 TB = about 3.7 hours
• How much time does it take to sort 1 TB of data?
– Enough time to watch a movie while the data is being read
Observation
• A large amount of data takes a lot of time to read
• Data is moved back and forth over a low-bandwidth network to where the application is running
– 90% of the time is consumed in data transfer
• Application size is constant
Conclusion
• Achieving Data Localization
– Move application close to data
Or
– Move data close to application
Summary
• Storage is a problem
– Cannot store a large amount of data
– Upgrading the hard disk will also not solve the problem (hardware limitation)
• Performance degradation
– Upgrading RAM will not solve the problem (hardware limitation)
• Reading
– Larger data requires more time to read
Solution Approach
• Distributed Framework
– Storing the data across several machines
– Performing computation in parallel across several machines
• Should Support
– Partial failures
– Recoverability
– Data availability
– Consistency
– Data reliability
– Upgrading
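To see why spreading data across machines helps, here is a rough sketch of the read time when each machine reads only its own share in parallel. It assumes the 75 MB/sec disk rate quoted earlier, an even split, and no coordination cost; these are idealised assumptions, not measurements.

```python
DISK_RATE_MB_PER_SEC = 75  # average transfer rate quoted in the slides

def read_minutes(data_size_mb, machines=1, rate=DISK_RATE_MB_PER_SEC):
    """Minutes to read the data when it is split evenly across `machines`,
    each reading its own share in parallel (ideal case)."""
    share_mb = data_size_mb / machines
    return share_mb / rate / 60

ONE_TB_MB = 1_000_000  # 1 TB expressed in MB
for n in (1, 10, 100):
    print(f"{n:>3} machine(s): {read_minutes(ONE_TB_MB, n):.1f} min")
```

Under these assumptions, the read time falls in proportion to the number of machines.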
Introducing Hadoop
Distributed framework that provides scaling in:
• Storage
• Performance
• IO Bandwidth
What makes Hadoop special?
• No high-end or expensive systems are required
• Can run on Linux, Mac OS X, Windows, Solaris
• Fault-tolerant system
– Execution of the job continues even if nodes are failing
• Highly reliable and efficient storage system
• In-built intelligence to speed up the application
– Speculative execution
• Fit for a lot of applications:
– Web log processing
– Page indexing, page ranking
– Complex event processing
Features of Hadoop
• Partitions, replicates and distributes the data – Data availability, consistency
• Performs Computation closer to the data – Data Localization
• Performs computation across several hosts – MapReduce framework
Hadoop Components
• Hadoop is bundled with two independent components
– HDFS (Hadoop Distributed File System): designed for scaling in terms of storage and IO bandwidth
– MR framework (MapReduce): designed for scaling in terms of performance
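The MapReduce idea can be illustrated with a small in-process word count. This is only a sketch of the map, shuffle and reduce steps in plain Python; it does not use Hadoop's API, and the two input partitions are hypothetical stand-ins for data stored on two hosts.

```python
from collections import defaultdict

def map_phase(partition):
    """Emit (word, 1) pairs for one partition of the input."""
    for line in partition:
        for word in line.split():
            yield word.lower(), 1

def shuffle(pairs):
    """Group values by key, as a framework would between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Sum the counts collected for each word."""
    return {key: sum(values) for key, values in groups.items()}

# Hypothetical input: two partitions, as if held on two different hosts.
partitions = [["big data big"], ["data is big"]]
pairs = [pair for part in partitions for pair in map_phase(part)]
word_counts = reduce_phase(shuffle(pairs))
print(word_counts)
```

In a real cluster each partition would be mapped on the host that stores it, which is the data localization point made earlier.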
Understanding file structure
• 1 GB file
• File is split into blocks
• Each block is typically 64 MB
• Each block is stored as two files – one holding data and second for metadata, checksum
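The block layout can be sketched as follows. This is a simplified illustration: it assumes 64 MB blocks as on the slide and keeps one CRC32 checksum per block as the "metadata", which is a simplification chosen for the example rather than HDFS's actual on-disk format.

```python
import zlib

BLOCK_SIZE = 64 * 1024 * 1024  # 64 MB, the typical block size on the slide

def block_count(file_size, block_size=BLOCK_SIZE):
    """Number of blocks needed; the last block may be only partly full."""
    return -(-file_size // block_size)  # ceiling division

def split_into_blocks(data, block_size):
    """Yield (data, metadata) pairs, mimicking the two files kept per block."""
    for offset in range(0, len(data), block_size):
        chunk = data[offset:offset + block_size]
        # Assumption: one checksum per block, for illustration only.
        meta = {"length": len(chunk), "checksum": zlib.crc32(chunk)}
        yield chunk, meta

print(block_count(1024 * 1024 * 1024))  # blocks needed for a 1 GB file

# Tiny demonstration with a 10-byte block size so it runs instantly.
blocks = list(split_into_blocks(b"hello hadoop distributed fs", block_size=10))
for chunk, meta in blocks:
    print(chunk, meta["length"])
```

Note that the final block in the demonstration is shorter than the block size: a partly filled block only takes the space its data needs.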
Hadoop Processes
• Processes running on Hadoop
– NameNode
– DataNode
– Secondary NameNode
– Task Tracker
– Job Tracker
NameNode
• Single point of contact
• HDFS master
• Holds meta information
– List of files and directories
– Location of blocks
• Single node per cluster
– Cluster can have thousands of DataNodes and tens of thousands of HDFS clients
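The meta information held by the NameNode can be pictured as two lookup tables. The class below is a toy in-memory model, not Hadoop code; the class name, file path and block IDs are hypothetical, and the storage IDs simply follow the XYZ00n style used on the DataNode slides.

```python
class NameNodeMetadata:
    """Toy model: the file namespace plus the location of each block."""

    def __init__(self):
        self.file_blocks = {}      # file path -> ordered list of block IDs
        self.block_locations = {}  # block ID -> set of DataNode storage IDs

    def add_file(self, path, block_ids):
        self.file_blocks[path] = list(block_ids)

    def report_block(self, block_id, storage_id):
        """Record that a DataNode holds a replica of the block."""
        self.block_locations.setdefault(block_id, set()).add(storage_id)

    def locate(self, path):
        """Return each block of the file with the DataNodes that hold it."""
        return [(b, sorted(self.block_locations.get(b, ())))
                for b in self.file_blocks[path]]

meta = NameNodeMetadata()
meta.add_file("/logs/web.log", ["blk_1", "blk_2"])  # hypothetical file
meta.report_block("blk_1", "XYZ001")
meta.report_block("blk_1", "XYZ002")
meta.report_block("blk_2", "XYZ003")
print(meta.locate("/logs/web.log"))
```

A client would ask the NameNode where the blocks are and then read the data itself from the DataNodes listed.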
DataNode
• Can execute multiple tasks concurrently
• Holds actual data blocks, checksum and generation stamp
• If a block is half full, it needs only half of the space of a full block
• At start-up, connects to NameNode and performs handshake
• No binding to IP address or port, uses Storage ID
• Sends heartbeat to NameNode
Heartbeat
• Every 3 seconds, each DataNode (Storage ID: XYZ001, XYZ002, XYZ003) sends an “I AM ALIVE” heartbeat to the NameNode
• Communication
– Total storage capacity
– Fraction of storage in use
– Number of data transfers currently in progress
• Reply: NameNode instructs DataNode to
– Replicate block to other node
– Remove local block replica
– Send immediate block report
– Shut down the node
• No heartbeat for 10 minutes
– NameNode considers the DataNode out of service
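The heartbeat exchange can be sketched as a small monitor on the NameNode side. This is an illustrative model only: the class, method and command names are hypothetical, time is passed in explicitly as seconds so the example runs instantly, and the 3-second interval and 10-minute timeout are taken from the slide.

```python
HEARTBEAT_INTERVAL_SEC = 3       # "Every 3 seconds"
HEARTBEAT_TIMEOUT_SEC = 10 * 60  # "No heartbeat for 10 minutes"

class HeartbeatMonitor:
    def __init__(self):
        self.last_seen = {}  # storage ID -> time of last heartbeat
        self.stats = {}      # storage ID -> (capacity, used fraction, transfers)
        self.pending = {}    # storage ID -> instructions waiting to be sent

    def queue_command(self, storage_id, command):
        self.pending.setdefault(storage_id, []).append(command)

    def heartbeat(self, storage_id, now, capacity, used_fraction, transfers):
        """Record an "I AM ALIVE" message; the reply carries queued instructions."""
        self.last_seen[storage_id] = now
        self.stats[storage_id] = (capacity, used_fraction, transfers)
        return self.pending.pop(storage_id, [])

    def dead_nodes(self, now):
        """Storage IDs that have been silent for longer than the timeout."""
        return sorted(sid for sid, seen in self.last_seen.items()
                      if now - seen > HEARTBEAT_TIMEOUT_SEC)

monitor = HeartbeatMonitor()
monitor.queue_command("XYZ001", "send_block_report")  # hypothetical command name
reply = monitor.heartbeat("XYZ001", now=0, capacity=1000, used_fraction=0.4, transfers=2)

# XYZ002 keeps reporting every 3 seconds; XYZ001 goes silent after time 0.
for t in range(0, 601, HEARTBEAT_INTERVAL_SEC):
    monitor.heartbeat("XYZ002", now=t, capacity=1000, used_fraction=0.1, transfers=0)

print(reply)
print(monitor.dead_nodes(now=601))
```

Sending instructions in the heartbeat reply means the NameNode does not need to open connections to DataNodes itself; it only answers the messages they send.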
Overview of HDFS
HDFS Client