Hadoop Module 3.2


  • Slide 1

  • Slide 2

    How It Works

    LIVE On-Line classes

    Class recordings in Learning Management System (LMS)

    Module-wise Quizzes and Coding Assignments

    24x7 on-demand technical support

    Project work on large Datasets

    Online certification exam

    Lifetime access to the LMS

    Complimentary Java Classes

  • Slide 3

    Module 1 Understanding Big Data

    Hadoop Architecture

    Module 2 Hadoop Cluster Configuration

    Data Loading Techniques

    Hadoop Project Environment

    Module 3 Hadoop MapReduce Framework

    Programming in MapReduce

    Module 4 Advanced MapReduce

    MRUnit testing framework

    Module 5 Analytics using Pig

    Understanding Pig Latin

    Module 6 Analytics using Hive

    Understanding HiveQL

    Module 7 Advanced Hive

    NoSQL Databases and HBase

    Module 8 Advanced HBase

    Zookeeper Service

    Module 9 Hadoop 2.0 New Features

    Programming in MRv2

    Module 10 Apache Oozie

    Real world Datasets and Analysis

    Project Discussion

    Course Topics

  • Slide 4

    What is Big Data?

    Limitations of the existing solutions

    Solving the problem with Hadoop

    Introduction to Hadoop

    Hadoop Eco-System

    Hadoop Core Components

    HDFS Architecture

    MapReduce Job Execution

    Anatomy of a File Write and Read

    Hadoop 2.0 (YARN or MRv2) Architecture

    Topics for Today

  • Slide 5

    Lots of Data (Terabytes or Petabytes)

    Big data is the term for a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications. The challenges include capture, curation, storage, search, sharing, transfer, analysis, and visualization.

    Systems and enterprises generate huge amounts of data, from terabytes to petabytes of information.

    NYSE generates about one terabyte of new trade data per day; stock-trading analytics are performed on it to determine trends for optimal trades.

    What Is Big Data?

  • Slide 6

    Unstructured Data is Exploding

  • Slide 7

    IBM's Definition of Big Data Characteristics: http://www-01.ibm.com/software/data/bigdata/

    Web logs

    Images

    Videos

    Audio

    Sensor Data

    Volume, Velocity, Variety

    IBM's Definition

  • Slide 8

    Hello There!! My name is Annie. I love quizzes and puzzles and I am here to make you guys think and answer my questions.

    Annie's Introduction

  • Slide 9

    Hello There!! My name is Annie. I love quizzes and puzzles and I am here to make you guys think and answer my questions.

    Annie's Question

    Map the following to corresponding data type:

    - XML Files

    - Word Docs, PDF files, Text files

    - E-Mail body

    - Data from Enterprise systems (ERP, CRM etc.)

  • Slide 10

    XML Files -> Semi-structured data

    Word Docs, PDF files, Text files -> Unstructured Data

    E-Mail body -> Unstructured Data

    Data from Enterprise systems (ERP, CRM etc.) -> Structured Data

    Annie's Answer

  • Slide 11

    More on Big Data: http://www.edureka.in/blog/the-hype-behind-big-data/

    Why Hadoop: http://www.edureka.in/blog/why-hadoop/

    Opportunities in Hadoop: http://www.edureka.in/blog/jobs-in-hadoop/

    Big Data: http://en.wikipedia.org/wiki/Big_Data

    IBM's Definition of Big Data Characteristics: http://www-01.ibm.com/software/data/bigdata/

    Further Reading

  • Slide 12

    Web and e-tailing: Recommendation Engines, Ad Targeting, Search Quality, Abuse and Click Fraud Detection

    Telecommunications: Customer Churn Prevention, Network Performance Optimization, Calling Data Record (CDR) Analysis, Analyzing Network to Predict Failure

    http://wiki.apache.org/hadoop/PoweredBy

    Common Big Data Customer Scenarios

  • Slide 13

    Government: Fraud Detection and Cyber Security, Welfare Schemes, Justice

    Healthcare & Life Sciences: Health Information Exchange, Gene Sequencing, Serialization, Healthcare Service Quality Improvements, Drug Safety

    http://wiki.apache.org/hadoop/PoweredBy

    Common Big Data Customer Scenarios (Contd.)

  • Slide 14

    Banks and Financial Services: Modeling True Risk, Threat Analysis, Fraud Detection, Trade Surveillance, Credit Scoring and Analysis

    Retail: Point-of-Sale Transaction Analysis, Customer Churn Analysis, Sentiment Analysis

    http://wiki.apache.org/hadoop/PoweredBy

    Common Big Data Customer Scenarios (Contd.)

  • Slide 15

    Insight into data can provide business advantage.

    Some key early indicators can mean fortunes to a business.

    More precise analysis comes with more data.

    Case Study: Sears Holding Corporation


    *Sears was using traditional systems such as Oracle Exadata, Teradata, and SAS to store and process customer activity and sales data.

    Hidden Treasure

  • Slide 16

    http://www.informationweek.com/it-leadership/why-sears-is-going-all-in-on-hadoop/d/d-id/1107038?

    [Diagram: the traditional data analytics architecture. Instrumentation -> Collection -> a storage-only grid holding the original raw data (mostly append), with an ETL compute grid loading aggregated data into an RDBMS that serves BI reports and interactive apps. 90% of the ~2 PB is archived; a meagre 10% of the ~2 PB of data is available for BI.]

    1. Can't explore original high-fidelity raw data
    2. Moving data to compute doesn't scale
    3. Premature data death

    Limitations of Existing Data Analytics Architecture

  • Slide 17

    *Sears moved to a 300-node Hadoop cluster to keep 100% of its data available for processing, rather than the meagre 10% available with the existing non-Hadoop solutions.

    [Diagram: the revised architecture with Hadoop as a combined storage + compute grid. Instrumentation -> Collection -> Hadoop (mostly append; both storage and processing; no data archiving) -> RDBMS (aggregated data) -> BI reports + interactive apps. The entire ~2 PB of data is available for processing.]

    1. Data exploration & advanced analytics
    2. Scalable throughput for ETL & aggregation
    3. Keep data alive forever

    Solution: A Combined Storage and Compute Layer

  • Slide 18

    Accessible

    Robust

    Simple

    Scalable

    Differentiating Factors

    Hadoop Differentiating Factors

  • Slide 19

    RDBMS (EDW, MPP)            |              | NoSQL (Hadoop)
    ----------------------------+--------------+------------------------------
    Structured                  | Data Types   | Multi and Unstructured
    Limited, No Data Processing | Processing   | Processing Coupled with Data
    Standards & Structured      | Governance   | Loosely Structured
    Required On Write           | Schema       | Required On Read
    Reads are Fast              | Speed        | Writes are Fast
    Software License            | Cost         | Support Only
    Known Entity                | Resources    | Growing, Complexities, Wide
    Interactive OLAP Analytics, | Best Fit Use | Data Discovery, Processing
    Complex ACID Transactions,  |              | Unstructured Data, Massive
    Operational Data Store      |              | Storage/Processing

    Hadoop: It's about Scale and Structure

  • Slide 20

    Read 1 TB of Data

    1 Machine: 4 I/O Channels, Each Channel 100 MB/s

    10 Machines: 4 I/O Channels, Each Channel 100 MB/s

    Why DFS?

  • Slide 21

    Read 1 TB of Data

    1 Machine: 4 I/O Channels, Each Channel 100 MB/s -> 45 Minutes

    10 Machines: 4 I/O Channels, Each Channel 100 MB/s

    Why DFS?

  • Slide 22

    Read 1 TB of Data

    1 Machine: 4 I/O Channels, Each Channel 100 MB/s -> 45 Minutes

    10 Machines: 4 I/O Channels, Each Channel 100 MB/s -> 4.5 Minutes

    Why DFS?
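    A quick back-of-the-envelope check of these timings (decimal units assumed):

        1 TB = 1,000,000 MB; each machine reads at 4 x 100 MB/s = 400 MB/s
        1 machine:   1,000,000 MB / 400 MB/s = 2,500 s, about 42 minutes (the slide rounds to 45)
        10 machines: 2,500 s / 10 = 250 s, about 4.2 minutes (the slide rounds to 4.5)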

  • Slide 23

    Apache Hadoop is a framework that allows for the distributed processing of large data sets across clusters of commodity computers using a simple programming model.

    It is an open-source data management platform with scale-out storage and distributed processing.

    What Is Hadoop?

  • Slide 24

    Reliable

    Economical

    Flexible

    Scalable

    Hadoop Features

    Hadoop Key Characteristics

  • Slide 25

    Hello There!! My name is Annie. I love quizzes and puzzles and I am here to make you guys think and answer my questions.

    Hadoop is a framework that allows for the distributed processing of:

    - Small Data Sets

    - Large Data Sets

    Annie's Question

  • Slide 26

    Annie's Answer

    Large Data Sets. It is also capable of processing small data sets; however, to experience the true power of Hadoop, one needs data in terabytes, because that is where an RDBMS takes hours and fails, whereas Hadoop does the same in a couple of minutes.

  • Slide 27

    Apache Oozie (Workflow)

    HDFS (Hadoop Distributed File System)

    Pig Latin (Data Analysis)

    Mahout (Machine Learning)

    Hive (DW System)

    MapReduce Framework

    HBase

    Flume and Sqoop (Import or Export: Flume handles unstructured or semi-structured data; Sqoop handles structured data)

    Hadoop Eco-System

  • Slide 28

    Hadoop and MapReduce magic in action

    Write intelligent applications using Apache Mahout

    https://cwiki.apache.org/confluence/display/MAHOUT/Powered+By+Mahout

    LinkedIn Recommendations

    Machine Learning with Mahout

  • Slide 29

    Hadoop is a system for large scale data processing.

    It has two main components:

    HDFS: Hadoop Distributed File System (Storage)

    Distributed across nodes

    Natively redundant

    NameNode tracks locations.

    MapReduce (Processing)

    Splits a task across processors near the data and assembles results

    Self-Healing, High Bandwidth, Clustered Storage

    JobTracker manages the TaskTrackers

    Hadoop Core Components
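    To make the "simple programming model" concrete, here is a minimal sketch of a mapper (a classic WordCount mapper written against the standard org.apache.hadoop.mapreduce API); it is an illustration added for this transcript, not part of the slides:

        import java.io.IOException;
        import java.util.StringTokenizer;

        import org.apache.hadoop.io.IntWritable;
        import org.apache.hadoop.io.LongWritable;
        import org.apache.hadoop.io.Text;
        import org.apache.hadoop.mapreduce.Mapper;

        // Runs near the data on each input split and emits (word, 1) pairs;
        // the framework groups the pairs and a reducer sums them per word.
        public class WordCountMapper
                extends Mapper<LongWritable, Text, Text, IntWritable> {

            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            protected void map(LongWritable offset, Text line, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer tokens = new StringTokenizer(line.toString());
                while (tokens.hasMoreTokens()) {
                    word.set(tokens.nextToken());
                    context.write(word, ONE);
                }
            }
        }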

  • Slide 30

    [Diagram: Hadoop core components on a cluster. The MapReduce engine (a JobTracker on the admin node managing TaskTrackers) runs on top of the HDFS cluster (a NameNode managing DataNodes); each worker machine runs a DataNode and TaskTracker pair.]

    Hadoop Core Components (Contd.)

  • Slide 31

    [Diagram: HDFS architecture. A client performs metadata ops (name, replicas, e.g. /home/foo/data, 3, ...) against the NameNode, and block ops (read, write) directly against DataNodes spread across Rack 1 and Rack 2; blocks are replicated between DataNodes.]

    HDFS Architecture

  • Slide 32

    NameNode:

    master of the system

    maintains and manages the blocks which are present on the DataNodes

    DataNodes:

    slaves which are deployed on each machine and provide the actual storage

    responsible for serving read and write requests for the clients

    Main Components Of HDFS

  • Slide 33

    Meta-data in Memory: the entire metadata is kept in main memory; there is no demand paging of FS metadata.

    Types of Metadata: list of files; list of blocks for each file; list of DataNodes for each block; file attributes, e.g. access time, replication factor.

    A Transaction Log: records file creations, file deletions, etc.

    Name Node (stores metadata only). METADATA: /user/doug/hinfo -> 1 3 5; /user/doug/pdetail -> 4 2

    Name Node: keeps track of the overall file directory structure and the placement of data blocks.

    NameNode Metadata

  • Slide 34

    Secondary NameNode:

    Not a hot standby for the NameNode

    Connects to NameNode every hour*

    Housekeeping, backup of NameNode metadata

    Saved metadata can be used to rebuild a failed NameNode

    "You give me metadata every hour, I will make it secure."

    [Diagram: the NameNode (a single point of failure) hands its metadata to the Secondary NameNode every hour.]

    Secondary Name Node
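    The hourly connection noted above is configurable. A minimal sketch, assuming the Hadoop 1.x-era property name fs.checkpoint.period (seconds between checkpoints) in core-site.xml:

        <property>
          <name>fs.checkpoint.period</name>
          <value>3600</value>  <!-- Secondary NameNode checkpoints once an hour -->
        </property>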

  • Slide 35

    Hello There!! My name is Annie. I love quizzes and puzzles and I am here to make you guys think and answer my questions.

    The NameNode:

    a) is the Single Point of Failure in a cluster

    b) runs on Enterprise-class hardware

    c) stores meta-data

    d) All of the above

    Annie's Question

  • Slide 36

    Annie's Answer

    All of the above. The NameNode stores metadata and runs on reliable, high-quality hardware because it is a single point of failure in a Hadoop cluster.

  • Slide 37

    Hello There!! My name is Annie. I love quizzes and puzzles and I am here to make you guys think and answer my questions.

    Annie's Question

    When the NameNode fails, the Secondary NameNode takes over instantly and prevents cluster failure:

    a) TRUE

    b) FALSE

  • Slide 38

    Hello There!! My name is Annie. I love quizzes and puzzles and I am here to make you guys think and answer my questions.

    Annie's Answer

    False. The Secondary NameNode is used for creating NameNode checkpoints. A NameNode can be manually recovered using the edits log and FSImage stored in the Secondary NameNode.

  • Slide 39

    [Diagram: job submission flow between the User, Client, DFS, and JobTracker.
    1. Copy input files (User -> DFS)
    2. Submit job (User -> Client)
    3. Get input files info (Client <- DFS)
    4. Create splits (Client)
    5. Upload job information, Job.xml and Job.jar (Client -> DFS)
    6. Submit job (Client -> JobTracker)]

    JobTracker

  • Slide 40

    [Diagram: job initialization at the JobTracker.
    6. Submit job (Client -> JobTracker)
    7. Initialize job in the job queue
    8. Read job files, Job.xml and Job.jar, from DFS
    9. Create maps and reduces (as many maps as there are input splits)]

    JobTracker (Contd.)

  • Slide 41

    [Diagram: task scheduling.
    10. TaskTrackers (H1 to H4) send heartbeats to the JobTracker
    11. The JobTracker picks tasks from the job queue (data-local if possible)
    12. The JobTracker assigns tasks to the TaskTrackers]

    JobTracker (Contd.)
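    For orientation, a sketch of what "Submit Job" looks like from the client side in the JobTracker-era org.apache.hadoop.mapred API (the mapper and reducer class names in the comments are hypothetical placeholders):

        import org.apache.hadoop.fs.Path;
        import org.apache.hadoop.io.IntWritable;
        import org.apache.hadoop.io.Text;
        import org.apache.hadoop.mapred.FileInputFormat;
        import org.apache.hadoop.mapred.FileOutputFormat;
        import org.apache.hadoop.mapred.JobClient;
        import org.apache.hadoop.mapred.JobConf;

        public class SubmitJob {
            public static void main(String[] args) throws Exception {
                JobConf conf = new JobConf(SubmitJob.class);
                conf.setJobName("wordcount");
                conf.setOutputKeyClass(Text.class);
                conf.setOutputValueClass(IntWritable.class);
                // conf.setMapperClass(MyMapper.class);    // hypothetical mapper
                // conf.setReducerClass(MyReducer.class);  // hypothetical reducer
                FileInputFormat.setInputPaths(conf, new Path(args[0]));
                FileOutputFormat.setOutputPath(conf, new Path(args[1]));

                // Packages job.xml and job.jar, uploads them to DFS, and submits
                // the job to the JobTracker (steps 1-6 on the preceding slides).
                JobClient.runJob(conf);
            }
        }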

  • Slide 42

    Which of the following daemons does the Hadoop framework pick for scheduling a task?

    a) namenode

    b) datanode

    c) task tracker

    d) job tracker

    Annie's Question

  • Slide 43

    Hello There!! My name is Annie. I love quizzes and puzzles and I am here to make you guys think and answer my questions.

    Annie's Answer

    (d) Job tracker. The JobTracker takes care of all the job scheduling and assigns tasks to the TaskTrackers.

  • Slide 44

    [Diagram: anatomy of a file write.
    1. Create (HDFS client -> Distributed File System)
    2. Create (Distributed File System -> NameNode)
    3. Write (client -> first DataNode in the pipeline)
    4. Write packet (forwarded DataNode to DataNode along the pipeline)
    5. Ack packet (returned back along the pipeline)
    6. Close (client)
    7. Complete (Distributed File System -> NameNode)]

    Anatomy of A File Write

  • Slide 45

    [Diagram: anatomy of a file read, from the client JVM on the client node.
    1. Open (HDFS client -> Distributed File System)
    2. Get block locations (Distributed File System -> NameNode)
    3. Read (client -> FSDataInputStream)
    4. Read (FSDataInputStream -> DataNode holding the first blocks)
    5. Read (the next DataNodes as needed)
    6. Close (FSDataInputStream)]

    Anatomy of A File Read
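    The client-side API behind both flows is small. A minimal sketch using the standard FileSystem client API (the path is a hypothetical example):

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.FSDataInputStream;
        import org.apache.hadoop.fs.FSDataOutputStream;
        import org.apache.hadoop.fs.FileSystem;
        import org.apache.hadoop.fs.Path;

        public class HdfsReadWrite {
            public static void main(String[] args) throws Exception {
                Configuration conf = new Configuration();       // reads core-site.xml
                FileSystem fs = FileSystem.get(conf);
                Path file = new Path("/user/demo/sample.txt");  // hypothetical path

                // Write: create() contacts the NameNode, then streams packets
                // through the DataNode pipeline (steps 1-7 above).
                try (FSDataOutputStream out = fs.create(file)) {
                    out.writeUTF("hello hdfs");
                }

                // Read: open() gets block locations from the NameNode, then reads
                // from the DataNodes holding each block (steps 1-6 above).
                try (FSDataInputStream in = fs.open(file)) {
                    System.out.println(in.readUTF());
                }
            }
        }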

  • Slide 46

    Replication and Rack Awareness
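    The slide itself is a diagram. As background: HDFS keeps multiple copies of every block (three by default) and places them rack-aware, so losing a whole rack does not lose data. A minimal hdfs-site.xml sketch using the standard property name:

        <property>
          <name>dfs.replication</name>
          <value>3</value>  <!-- copies kept per block; 3 is the default -->
        </property>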

  • Slide 47

    In HDFS, blocks of a file are written in parallel; however, the replication of the blocks is done sequentially:

    a) TRUE

    b) FALSE

    Annie's Question

  • Slide 48

    Annie's Answer

    True. A file is divided into blocks; these blocks are written in parallel, but the block replication happens in sequence.

  • Slide 49

    Annies Question

    A file of 400 MB is being copied to HDFS. The system has finished copying 250 MB. What happens if a client tries to access that file?

    a) Can read up to the block that's successfully written.

    b) Can read up to the last bit successfully written.

    c) Will throw an exception.

    d) Cannot see that file until it's finished copying.

  • Slide 50

    Annie's Answer

    The client can read up to the successfully written data block, so the answer is (a).

  • Slide 51

    Apache Hadoop and HDFS: http://www.edureka.in/blog/introduction-to-apache-hadoop-hdfs/

    Apache Hadoop HDFS Architecture: http://www.edureka.in/blog/apache-hadoop-hdfs-architecture/

    Further Reading

  • Slide 52

    Set up the Hadoop development environment using the documents present in the LMS.

    Hadoop Installation: Set up the Cloudera CDH3 Demo VM

    Execute Linux Basic Commands

    Execute HDFS hands-on commands (see the sample commands after this list)

    Attempt the Module-1 Assignments present in the LMS.

    Module-2 Pre-work
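    For reference, a few of the basic HDFS shell commands the hands-on exercises use (the paths are examples; hadoop fs is the standard entry point):

        hadoop fs -mkdir /user/demo             # create an HDFS directory
        hadoop fs -put data.txt /user/demo      # copy a local file into HDFS
        hadoop fs -ls /user/demo                # list directory contents
        hadoop fs -cat /user/demo/data.txt      # print a file stored in HDFS
        hadoop fs -rm /user/demo/data.txt       # delete a file from HDFS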

  • Slide 53

    What's within the LMS

    This section will give you an insight into the Big Data and Hadoop course.

    This section will give some prerequisites on Java to understand the course better.

    Old batch recordings, handy for you.

    This section will give an insight into Hadoop installation.

  • Slide 54

    What's within the LMS

    Click here to expand and view all the elements of this Module

  • Slide 55

    What's within the LMS

    Recording of the Class

    Presentation

    Hands-on Guides

  • Slide 56

    What's within the LMS

    Quiz

    Assignment

  • Thank You! See You in Class Next Week.