Hadoop Administration
Source: files.meetup.com/11583652/hadoop_presentation.pdf
TRANSCRIPT
Hadoop Administration
Case for Hadoop
Why Hadoop is needed
How Hadoop originated
What problems Hadoop solves
Why Hadoop ?
We are generating more data than ever before
Financial transactions
Sensor networks
Server logs
Analytics
Social Media
It’s not just the size of the data, but also the frequency: we are generating data faster than ever before.
We need to make sense out of data.
The 3 V's
Web logs
Images
Videos
Audios
Sensor Data
Volume Velocity Variety
Two Big problems at hand
Large scale data storage
Large scale data analysis
- Traditional approaches that move data to the compute node do not scale well at this size.
- More time is spent copying data than actually processing it.
What is Hadoop
Apache Hadoop is a framework that allows for the distributed processing of large data sets across clusters of commodity computers using a simple programming model.
It is an open-source data management framework with scale-out storage and distributed processing.
Hadoop Eco-System
Hadoop Components
It has two main components:
HDFS – Hadoop Distributed File System (Storage)
Distributed across “nodes”
Natively redundant
NameNode tracks locations.
MapReduce (Processing)
Splits a task across processors
“near” the data & assembles results
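The split/process/assemble flow above can be sketched in miniature. This is an illustrative in-process model in Python, not Hadoop's actual Java API; in real Hadoop the map and reduce phases run as distributed tasks on nodes near the data.

```python
from collections import defaultdict

# Illustrative sketch of the MapReduce flow using word count,
# the canonical example. All function names here are made up.

def map_phase(record):
    """Map: emit (word, 1) for every word in one input record."""
    for word in record.split():
        yield (word, 1)

def shuffle(mapped_pairs):
    """Shuffle: group values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in mapped_pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: sum the counts for one word."""
    return (key, sum(values))

records = ["hadoop stores data", "hadoop processes data"]
mapped = [pair for r in records for pair in map_phase(r)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
print(counts)  # {'hadoop': 2, 'stores': 1, 'data': 2, 'processes': 1}
```

The point of the pattern: each record can be mapped independently (so maps run in parallel on the nodes holding the data), and only the small intermediate pairs cross the network during shuffle.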
Main Components Of HDFS
NameNode
- Master Node
- Stores MetaData
DataNode
- Stores the Actual Data Blocks
- Serves Read/Write Requests
NameNode Metadata
Meta-data in Memory
- The entire metadata is in main memory
- No demand paging of FS meta-data
Types of Metadata
- List of files
- List of Blocks for each file
- List of DataNodes for each block
- File attributes, e.g. access time, replication factor
A Transaction Log
- Records file creations, file deletions, etc.
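The metadata structures above can be modeled with a small sketch. This is a toy Python model, not the real NameNode (which is a Java process); the class and field names are illustrative. It captures the key ideas: the namespace and block lists live in memory, namespace changes go to an append-only transaction log, and block locations are reported by DataNodes rather than logged.

```python
# Toy model of NameNode metadata; names and structure are illustrative.
class NameNodeMetadata:
    def __init__(self):
        self.files = {}            # file path -> list of block IDs
        self.block_locations = {}  # block ID -> list of DataNode IDs
        self.attributes = {}       # file path -> attrs, e.g. replication factor
        self.edit_log = []         # append-only transaction log of namespace changes

    def create_file(self, path, block_ids, replication=3):
        self.files[path] = list(block_ids)
        self.attributes[path] = {"replication": replication}
        self.edit_log.append(("CREATE", path))

    def delete_file(self, path):
        for block in self.files.pop(path, []):
            self.block_locations.pop(block, None)
        self.attributes.pop(path, None)
        self.edit_log.append(("DELETE", path))

    def report_block(self, block_id, datanode_id):
        """Block locations come from DataNode reports and stay in memory only."""
        self.block_locations.setdefault(block_id, []).append(datanode_id)

nn = NameNodeMetadata()
nn.create_file("/logs/app.log", ["blk_1", "blk_2"])
nn.report_block("blk_1", "dn1")
nn.report_block("blk_1", "dn2")
print(nn.files["/logs/app.log"])       # ['blk_1', 'blk_2']
print(nn.block_locations["blk_1"])     # ['dn1', 'dn2']
print(nn.edit_log)                     # [('CREATE', '/logs/app.log')]
```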
HDFS Architecture
File Split
Replication
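File split and replication can be sketched together: HDFS chops a file's byte stream into fixed-size blocks (128 MB by default in recent Hadoop versions) and stores each block on multiple DataNodes (replication factor 3 by default). The numbers below are shrunk to make the example readable.

```python
def split_into_blocks(data: bytes, block_size: int):
    """Split a file's bytes into fixed-size blocks; the last block may be shorter."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

# Tiny block size for illustration; HDFS's default is 128 MB.
BLOCK_SIZE = 4
REPLICATION = 3

data = b"abcdefghij"  # a 10-byte "file"
blocks = split_into_blocks(data, BLOCK_SIZE)
print(blocks)  # [b'abcd', b'efgh', b'ij']; the last block is partial

# With replication factor 3, each block lives on 3 different DataNodes,
# so total raw storage is roughly 3x the file size.
total_stored = sum(len(b) for b in blocks) * REPLICATION
print(total_stored)  # 30 bytes stored for a 10-byte file
```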
Write Operation
Rack Awareness
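HDFS's default rack-aware placement for a replication factor of 3 puts the first replica on the writer's node, the second on a node in a different rack, and the third on another node in that same remote rack, balancing write bandwidth against rack-failure tolerance. A sketch of that policy, with a made-up two-rack topology:

```python
import random

# Sketch of HDFS's default rack-aware placement policy (replication = 3).
# The topology and node names here are invented for illustration.
topology = {
    "rack1": ["dn1", "dn2", "dn3"],
    "rack2": ["dn4", "dn5", "dn6"],
}

def rack_of(node):
    return next(rack for rack, nodes in topology.items() if node in nodes)

def place_replicas(writer_node, rng=random):
    first = writer_node                       # 1st replica: local node
    remote_rack = rng.choice([r for r in topology if r != rack_of(first)])
    second = rng.choice(topology[remote_rack])  # 2nd replica: different rack
    third = rng.choice([n for n in topology[remote_rack] if n != second])
    return [first, second, third]             # 3rd replica: same rack as 2nd

replicas = place_replicas("dn1")
print(replicas)  # 1st on dn1; 2nd and 3rd on distinct nodes sharing a remote rack
```

Losing an entire rack still leaves at least one replica alive, yet two of the three replicas share a rack, so only one cross-rack transfer is needed per block write.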
Pipelined Write
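The pipelined write can be sketched as a chain: the client streams each packet to the first DataNode only, and each DataNode writes the packet locally while forwarding it to the next node in the pipeline, so the client's bandwidth is not split across all replicas. A toy Python model (class and names are illustrative, not Hadoop's):

```python
# Toy model of an HDFS write pipeline; real DataNodes forward packets
# over the network and send acknowledgements back up the chain.
class DataNode:
    def __init__(self, name, next_node=None):
        self.name = name
        self.next_node = next_node
        self.stored = []

    def receive(self, packet):
        self.stored.append(packet)  # write the packet locally...
        if self.next_node:          # ...and forward it downstream
            self.next_node.receive(packet)

# Build a 3-node pipeline: dn1 -> dn2 -> dn3
dn3 = DataNode("dn3")
dn2 = DataNode("dn2", next_node=dn3)
dn1 = DataNode("dn1", next_node=dn2)

# The client streams packets to dn1 only; the pipeline does the rest.
for packet in [b"pkt0", b"pkt1", b"pkt2"]:
    dn1.receive(packet)

print([len(dn.stored) for dn in (dn1, dn2, dn3)])  # [3, 3, 3]
```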
Thank You!