
Post on 14-Jun-2020


Hadoop Administration

Case for Hadoop

Why Hadoop is needed

How Hadoop originated

What problems Hadoop solves

Why Hadoop?

We are generating more data than ever before

Financial transactions

Sensor networks

Server logs

Analytics

Social Media

It’s not just about the size of the data, but also how frequently it arrives. We are generating data faster than ever before.

We need to make sense out of data.

The 3 V's

Volume, Velocity, Variety

Web logs

Images

Videos

Audio

Sensor data

Two Big problems at hand

Large scale data storage

Large scale data analysis

- The traditional approach of moving data to the compute node does not scale to this volume.

- More time is spent copying data than actually processing it.

What is Hadoop

Apache Hadoop is a framework that allows for the distributed processing of large data sets across clusters of commodity computers using a simple programming model.

It is an open-source data management framework with scale-out storage and distributed processing.

Hadoop Eco-System

Hadoop Components

It has two main components:

HDFS – Hadoop Distributed File System (Storage)

Distributed across “nodes”

Natively redundant

NameNode tracks locations.

MapReduce (Processing)

Splits a task across processors “near” the data & assembles results
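The split, process, and assemble idea can be sketched in a few lines of plain Python. This is a single-process illustration only, not the Hadoop API; the function names (map_phase, shuffle, reduce_phase, word_count) and the sample splits are made up for this example.

```python
from collections import defaultdict

def map_phase(split):
    """Emit a (word, 1) pair for every word in one input split."""
    return [(word.lower(), 1) for word in split.split()]

def shuffle(pairs):
    """Group the emitted values by key, between the map and reduce steps."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Sum the counts collected for each word."""
    return {word: sum(counts) for word, counts in grouped.items()}

def word_count(splits):
    mapped = []
    # On a real cluster each split would be handled by a task running
    # near the data; here the splits are simply processed in a loop.
    for split in splits:
        mapped.extend(map_phase(split))
    return reduce_phase(shuffle(mapped))

# Hypothetical input: two splits of one logical file.
splits = ["server logs sensor data", "sensor data social media"]
counts = word_count(splits)
```

Each split is mapped independently, which is what lets the work be spread across many nodes; only the much smaller intermediate pairs need to be brought together for the reduce step.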

Main Components Of HDFS

NameNode

- Master Node

- Stores MetaData

DataNode

- Stores the Actual Data Blocks

- Serves Read/Write Requests

NameNode Metadata

Meta-data in Memory

- The entire metadata is in main memory

- No demand paging of FS meta-data

Types of Metadata

- List of files

- List of Blocks for each file

- List of DataNodes for each block

- File attributes, e.g. access time, replication factor
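The metadata types listed above can be pictured as a pair of in-memory maps. The sketch below is an assumption-laden illustration in Python, not the NameNode's actual data structures; the class names (FileMeta, NameNodeMetadata), the block ids, and the DataNode names are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class FileMeta:
    """Per-file attributes plus the ordered list of block ids."""
    replication: int
    access_time: float
    block_ids: list = field(default_factory=list)

class NameNodeMetadata:
    """Toy model of the metadata held entirely in main memory."""

    def __init__(self):
        self.files = {}            # path -> FileMeta (list of files)
        self.block_locations = {}  # block id -> set of DataNode names

    def add_file(self, path, replication, access_time=0.0):
        self.files[path] = FileMeta(replication, access_time)

    def add_block(self, path, block_id, datanodes):
        self.files[path].block_ids.append(block_id)
        self.block_locations[block_id] = set(datanodes)

    def locate(self, path):
        """Return [(block_id, sorted DataNode names)] in block order."""
        return [(b, sorted(self.block_locations[b]))
                for b in self.files[path].block_ids]

meta = NameNodeMetadata()
meta.add_file("/logs/app.log", replication=3)
meta.add_block("/logs/app.log", "blk_1", ["dn1", "dn2", "dn3"])
meta.add_block("/logs/app.log", "blk_2", ["dn2", "dn3", "dn4"])
```

A client that wants to read /logs/app.log would ask for the equivalent of meta.locate("/logs/app.log") and then fetch each block directly from one of the listed DataNodes.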

A Transaction Log

- Records file creations, file deletions, etc.
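One way to see why a transaction log is useful: replaying it from the start rebuilds the namespace. The following is a minimal Python sketch under that assumption; the entry format (operation, path) and the function names are invented for illustration and do not reflect the on-disk format HDFS uses.

```python
def apply(namespace, entry):
    """Apply one logged operation to the in-memory set of file paths."""
    op, path = entry
    if op == "create":
        namespace.add(path)
    elif op == "delete":
        namespace.discard(path)
    else:
        raise ValueError(f"unknown operation: {op}")

def replay(log):
    """Rebuild the namespace by applying every entry in order."""
    namespace = set()
    for entry in log:
        apply(namespace, entry)
    return namespace

# Hypothetical log: two creations followed by one deletion.
edit_log = [("create", "/a.txt"), ("create", "/b.txt"), ("delete", "/a.txt")]
namespace = replay(edit_log)
```

Because every change is appended to the log before it is considered done, the in-memory metadata can be recovered after a restart by replaying the entries in order.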

HDFS Architecture

File Split
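As a rough sketch of what a file split means, the Python function below cuts a file of a given size into fixed-size blocks, with a shorter final block. The slides do not state a block size, so the 128 MB used here is an assumption; the function name split_into_blocks is also made up for this example.

```python
def split_into_blocks(file_size, block_size):
    """Return (offset, length) pairs that cover file_size bytes."""
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    blocks = []
    offset = 0
    while offset < file_size:
        # The last block holds only the remaining bytes.
        length = min(block_size, file_size - offset)
        blocks.append((offset, length))
        offset += length
    return blocks

MB = 1024 * 1024
# Assumed sizes: a 300 MB file and a 128 MB block size.
blocks = split_into_blocks(300 * MB, 128 * MB)
```

With these assumed sizes the file becomes three blocks: two full 128 MB blocks and a final 44 MB block.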

Replication

Write Operation

Rack Awareness
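A simplified sketch of rack-aware placement for three replicas, in Python. It follows the commonly described HDFS default of one replica on the writer's node, one on a node in a different rack, and one on another node in that same remote rack. The function name, the topology layout, and the deterministic "take the first candidate" choice are assumptions made for illustration; a real cluster chooses among candidates and handles racks with too few nodes, which this sketch does not.

```python
def place_replicas(writer, topology):
    """Pick three DataNodes for a block written from `writer`.

    topology maps rack name -> list of DataNode names.
    Assumes the writer is a DataNode and that some other rack
    has at least two nodes.
    """
    rack_of = {node: rack for rack, nodes in topology.items() for node in nodes}
    local_rack = rack_of[writer]

    first = writer
    # Second replica: a node on a different rack than the writer.
    remote_rack = next(rack for rack in topology if rack != local_rack)
    second = topology[remote_rack][0]
    # Third replica: a different node on the same rack as the second.
    third = next(node for node in topology[remote_rack] if node != second)
    return [first, second, third]

# Hypothetical two-rack cluster.
topology = {"rack1": ["dn1", "dn2"], "rack2": ["dn3", "dn4"]}
replicas = place_replicas("dn1", topology)
```

Spreading replicas over two racks means the block survives the loss of a whole rack, while keeping two replicas on one rack limits the traffic that has to cross racks during the write.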

Pipelined Write
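The pipelined write can be sketched as a chain: the client sends each packet to the first DataNode, which forwards it to the second, and so on, with the acknowledgement travelling back along the chain in reverse. The Python below is a single-process simulation under that assumption; the function name, packet labels, and DataNode names are hypothetical, and no networking or failure handling is modelled.

```python
def pipelined_write(packets, pipeline):
    """Push each packet through the DataNode chain.

    Returns (stored, acks): what each node ends up holding, and for
    each packet the order in which nodes acknowledge it.
    """
    stored = {node: [] for node in pipeline}
    acks = []
    for packet in packets:
        # Forward path: client -> first node -> second node -> ...
        for node in pipeline:
            stored[node].append(packet)
        # Ack path: last node first, back toward the client.
        acks.append((packet, list(reversed(pipeline))))
    return stored, acks

# Hypothetical pipeline of three DataNodes and two packets.
stored, acks = pipelined_write(["p1", "p2"], ["dn1", "dn3", "dn4"])
```

The client only ever sends each packet once; the DataNodes do the rest of the copying among themselves.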

Thank You!
