Hadoop BigData Overview

DESCRIPTION

Hadoop and BigData basics

TRANSCRIPT
Haritha K
Hadoop
What is BigData?
What are the attributes of BigData?
What are the attributes of BigData?…

- Velocity
- Variety
- Volume
Solution? Hadoop
Hadoop is an open source framework for writing and running distributed applications that process large amounts of data on clusters of commodity hardware using a simple programming model.
History:

- Google – 2004
- Apache and Yahoo – 2009
- Project creator: Doug Cutting, who named it "Hadoop" after his son's yellow toy elephant.
Who is using Hadoop?
Why distributed computing?
Why distributed computing?…
Hadoop Assumptions
Hadoop is written with large clusters of computers in mind and is built around the following assumptions:

- Hardware will fail.
- Processing will be run in batches.
- Applications that run on HDFS have large data sets; a typical file in HDFS is gigabytes to terabytes in size.
- It should provide high aggregate data bandwidth and scale to hundreds of nodes in a single cluster.
- It should support tens of millions of files in a single instance.
- Applications need a write-once-read-many access model.
- Moving computation is cheaper than moving data.
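Because files are large, HDFS stores each one as a sequence of fixed-size blocks spread across the cluster. A minimal sketch of the arithmetic, assuming the common 128 MB default block size (the exact default varies by Hadoop version):

```python
import math

BLOCK_SIZE = 128 * 1024 * 1024  # assumed default HDFS block size (128 MB)

def num_blocks(file_size_bytes: int) -> int:
    """Number of HDFS blocks needed to store a file of the given size."""
    return math.ceil(file_size_bytes / BLOCK_SIZE)

# A 1 GB file occupies 8 blocks; with the default 3x replication
# that means 24 block replicas stored across the cluster.
one_gb = 1024 * 1024 * 1024
print(num_blocks(one_gb))      # 8
print(num_blocks(one_gb) * 3)  # 24
```

The per-block granularity is what lets one file be read and processed by many machines in parallel.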
Hadoop Core Components
HDFS

- Hadoop Distributed File System
- Storage

MapReduce

- Execution engine
- Computation
Hadoop Architecture
Hadoop - Master/Slave
Hadoop is designed as a master/slave, shared-nothing architecture:

- Master node (a single node)
- Many slave nodes
HDFS Components
Name Node

- Master of the system
- Maintains and manages the blocks that are present on the data nodes

Data Nodes

- Slaves deployed on each machine
- Provide the actual storage
- Responsible for serving read and write requests from clients
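The division of labor can be pictured with a toy, in-memory sketch of the NameNode's bookkeeping: which blocks make up each file, and which DataNodes hold each block. (Illustrative only; the real NameNode is far more elaborate and persists its metadata in an edit log.)

```python
class NameNode:
    """Toy model of NameNode metadata: files -> blocks -> datanodes."""

    def __init__(self):
        self.file_blocks = {}      # file path -> ordered list of block ids
        self.block_locations = {}  # block id -> set of datanode hostnames

    def add_file(self, path, block_ids):
        self.file_blocks[path] = list(block_ids)

    def register_replica(self, block_id, datanode):
        # DataNodes report the blocks they store; the NameNode records them.
        self.block_locations.setdefault(block_id, set()).add(datanode)

    def locate(self, path):
        """For a read, the client asks which datanodes hold each block."""
        return [sorted(self.block_locations.get(b, set()))
                for b in self.file_blocks[path]]

nn = NameNode()
nn.add_file("/logs/day1", ["blk_1", "blk_2"])
for dn in ("dn1", "dn2", "dn3"):
    nn.register_replica("blk_1", dn)
nn.register_replica("blk_2", "dn2")
print(nn.locate("/logs/day1"))  # [['dn1', 'dn2', 'dn3'], ['dn2']]
```

Note that actual file data never flows through the NameNode; clients read and write blocks directly from the DataNodes.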
Rack Awareness
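HDFS places replicas with knowledge of the rack topology, so that a whole-rack failure cannot destroy all copies of a block. A simplified sketch of the default policy for a replication factor of 3 (first replica on the writer's own node, second on a node in a different rack, third on another node in that same remote rack; the real policy also weighs node load and free space):

```python
import random

def place_replicas(writer, topology):
    """Pick 3 nodes for a block's replicas.

    topology maps rack name -> list of node names; writer is one node.
    """
    writer_rack = next(rack for rack, nodes in topology.items()
                       if writer in nodes)
    first = writer                                 # replica 1: local node
    remote_rack = random.choice(
        [rack for rack in topology if rack != writer_rack])
    second = random.choice(topology[remote_rack])  # replica 2: off-rack
    third = random.choice(                         # replica 3: same remote rack
        [node for node in topology[remote_rack] if node != second])
    return [first, second, third]

topo = {"rack1": ["n1", "n2"], "rack2": ["n3", "n4"], "rack3": ["n5", "n6"]}
replicas = place_replicas("n1", topo)  # e.g. ['n1', 'n4', 'n3']
```

Two racks hold the three replicas: losing one rack still leaves at least one copy, while keeping two replicas on one rack limits cross-rack write traffic.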
Main Properties of HDFS
- Large: an HDFS instance may consist of thousands of server machines, each storing part of the file system's data
- Replication: each data block is replicated many times (the default is 3)
- Failure: failure is the norm rather than the exception
- Fault tolerance: detection of faults and quick, automatic recovery from them is a core architectural goal of HDFS
Map Reduce
- Programming model developed at Google
- Sort/merge-based distributed computing
- The underlying system takes care of partitioning the input data, scheduling the program's execution across several machines, handling machine failures, and managing the required inter-machine communication. (This is the key to Hadoop's success.)
Map Reduce Components
Job Tracker is the master node (runs with the NameNode)

- Receives the user's job
- Decides how many tasks will run (the number of mappers)
- Decides where to run each mapper (the concept of locality)

Task Tracker is the slave node (runs on each DataNode)

- Receives tasks from the Job Tracker
- Runs each task until completion (either a map or a reduce task)
- Stays in communication with the Job Tracker, reporting progress
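The locality decision above can be sketched as a tiny scheduler: prefer a node that already holds a task's input block, and fall back to any free node otherwise. (The names and data structures here are illustrative, not Hadoop's actual API.)

```python
def schedule(tasks, block_locations, free_nodes):
    """Assign each map task (one per input block) to a node.

    tasks: list of block ids; block_locations: block id -> set of nodes
    holding a replica; free_nodes: nodes with a free task slot.
    """
    assignment = {}
    free = set(free_nodes)
    for block in tasks:
        local = block_locations.get(block, set()) & free
        node = min(local) if local else min(free)  # prefer a data-local node
        assignment[block] = node
        free.discard(node)
    return assignment

locs = {"blk_1": {"n1", "n2"}, "blk_2": {"n3"}}
print(schedule(["blk_1", "blk_2"], locs, ["n1", "n2", "n3", "n4"]))
# {'blk_1': 'n1', 'blk_2': 'n3'}
```

Running the map task where its block already lives means the computation moves to the data, not the other way around.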
How Map Reduce works ?
- The runtime partitions the input and provides it to different Map instances: Map(key, value) → (key', value')
- The runtime collects the (key', value') pairs and distributes them to several Reduce functions so that each Reduce function gets all the pairs with the same key'.
- Each Reduce produces a single (or zero) file output.
- Map and Reduce are user-written functions.
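The steps above can be condensed into a single-process sketch of the map → shuffle → reduce pipeline, using word count as the job. (This mimics the dataflow only; it is not Hadoop's actual API, and a real cluster runs the phases on many machines.)

```python
from collections import defaultdict

def map_fn(_, line):          # Map(key, value) -> (key', value') pairs
    for word in line.split():
        yield word, 1

def reduce_fn(word, counts):  # Reduce(key', [value', ...]) -> output pairs
    yield word, sum(counts)

def run_job(inputs, map_fn, reduce_fn):
    shuffle = defaultdict(list)            # groups intermediate pairs by key'
    for key, value in inputs:
        for k2, v2 in map_fn(key, value):
            shuffle[k2].append(v2)
    output = {}
    for k2, values in sorted(shuffle.items()):  # reducers see sorted keys
        for k3, v3 in reduce_fn(k2, values):
            output[k3] = v3
    return output

print(run_job(enumerate(["red blue", "blue blue"]), map_fn, reduce_fn))
# {'blue': 3, 'red': 1}
```

The user supplies only `map_fn` and `reduce_fn`; everything `run_job` does here is what the Hadoop runtime handles transparently at scale.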
Map Reduce Phases
Deciding what will be the key and what will be the value is the developer's responsibility.
Example: Color Count

Job: count the number of occurrences of each color in a data set.

- Input: blocks of the data set on HDFS
- Map: consumes (k, v), produces (k', 1) — one pair per color occurrence
- Shuffle & sort: groups the intermediate pairs by color k'
- Reduce: consumes (k', [1, 1, 1, 1, 1, 1, …]), produces (k', count), e.g. (k', 100)
- Output: written as Part0001, Part0002, Part0003 — one part file per reducer, likely on 3 different machines
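An end-to-end toy version of the Color Count job, including a hash partitioner that splits reducer output into three part files, mirrors the Part0001–Part0003 output on the slide. (Illustrative only; names and structure are not Hadoop's API.)

```python
from collections import defaultdict

NUM_REDUCERS = 3  # one output part file per reducer

def color_map(record):
    yield record, 1                      # emit (color, 1) per occurrence

def color_reduce(color, ones):
    return color, sum(ones)             # total count for this color

def run(records):
    # One intermediate bucket per reducer; the partitioner (hash of the key,
    # modulo reducer count) decides which reducer handles each color.
    partitions = [defaultdict(list) for _ in range(NUM_REDUCERS)]
    for record in records:
        for color, one in color_map(record):
            partitions[hash(color) % NUM_REDUCERS][color].append(one)
    return {f"Part{i + 1:04d}": dict(color_reduce(c, v) for c, v in p.items())
            for i, p in enumerate(partitions)}

parts = run(["green", "red", "green", "blue", "green", "red"])
```

All occurrences of one color land in the same partition, so each color's total appears in exactly one part file; concatenating the parts yields the full result.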
Hadoop vs. Other Systems
| | Distributed Databases | Hadoop |
|---|---|---|
| Computing model | Notion of transactions; a transaction is the unit of work; ACID properties, concurrency control | Notion of jobs; a job is the unit of work; no concurrency control |
| Data model | Structured data with a known schema; read/write mode | Any data will fit in any format: (un)(semi)structured; read-only mode |
| Cost model | Expensive servers | Cheap commodity machines |
| Fault tolerance | Failures are rare; recovery mechanisms | Failures are common over thousands of machines; simple yet efficient fault tolerance |
| Key characteristics | Efficiency, optimizations, fine-tuning | Scalability, flexibility, fault tolerance |
Advantages
- A reliable shared storage
- Simple analysis system
- Distributed file system
- Tasks are independent
- Easy to handle partial failures: entire nodes can fail and restart
Disadvantages
- Lack of central data
- Single master node
- Managing the job flow isn't trivial when intermediate data should be kept
Thank You