what is hadoop - cimec-120208170829-phpapp01

Upload: romanzotti

Post on 02-Apr-2018


  • 7/27/2019 What is Hadoop - Cimec-120208170829-Phpapp01

    1/28

    Hadoop: A Hands-on Introduction

    Claudio Martella, Elia Bruni

    9 November 2011

    Tuesday, November 8, 11


    Outline

    What is Hadoop

    Why is Hadoop

    How is Hadoop

    Hadoop & Python

    Some NLP code

    A more complicated problem: Eva


    A bit of Context

    2003: first MapReduce library @ Google

    2003: GFS paper

    2004: MapReduce paper

    2005: Apache Nutch uses MapReduce

    2006: Hadoop was born

    2007: first 1000-node cluster at Yahoo!


    An Ecosystem

    HDFS & MapReduce

    Zookeeper

    HBase

    Pig & Hive

    Mahout

    Giraph

    Nutch


    Traditional way

    Design a high-level Schema

    You store data in an RDBMS

    Which has very poor write throughput

    And doesn't scale well

    When you talk about terabytes of data

    Expensive Data Warehouse


    BigData & NoSQL

    Store first, think later

    Schema-less storage

    Analytics

    Petabyte scale

    Offline processing


    Vertical Scalability

    Extremely expensive

    Requires expertise in distributed systems and concurrent programming

    Lacks real fault tolerance


    Horizontal Scalability

    Built on top of commodity hardware

    Easy to use programming paradigms

    Fault-tolerance through replication


    1st Assumptions

    Data to process does not fit on one node.

    Each node is commodity hardware.

    Failure happens.

    Spread your data among your nodes

    and replicate it.


    2nd Assumptions

    Moving computation is cheap.

    Moving data is expensive.

    Distributed computing is hard.

    Move computation to data,

    with a simple paradigm.
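One way to picture "move computation to data" is a scheduler that prefers a worker already holding the block. This is a toy sketch: `pick_worker`, the block names, and the node names are hypothetical, and the list of idle workers is assumed to be non-empty.

```python
def pick_worker(block, block_locations, idle_workers):
    """Return (worker, is_local): prefer an idle worker that already stores the block."""
    holders = block_locations.get(block, set())
    for worker in idle_workers:
        if worker in holders:
            return worker, True      # computation moves to the data
    return idle_workers[0], False    # fallback: the block must travel over the network

block_locations = {"block-0": {"node2", "node3"}, "block-1": {"node1"}}
idle = ["node1", "node2"]

choice_0 = pick_worker("block-0", block_locations, idle)
choice_1 = pick_worker("block-1", block_locations, idle)
print(choice_0, choice_1)
```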


    3rd Assumptions

    Systems run on spinning hard disks.

    Disk seek >> disk scan.

    Many small files are expensive.

    Base the paradigm on scanning large files.
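A back-of-the-envelope estimate of why seeks dominate. The figures are assumptions chosen for illustration (10 ms per seek, 100 MB/s sequential transfer), not measurements from the talk.

```python
SEEK_S = 0.010          # assumed average seek time: 10 ms
TRANSFER_BPS = 100e6    # assumed sequential transfer rate: 100 MB/s

def read_time(total_bytes, chunk_bytes):
    """Seconds to read total_bytes when every chunk costs one seek plus a sequential scan."""
    chunks = -(-total_bytes // chunk_bytes)   # ceiling division
    return chunks * SEEK_S + total_bytes / TRANSFER_BPS

GB = 10**9
large = read_time(GB, 64 * 10**6)   # 1 GB stored as 64 MB chunks
small = read_time(GB, 4 * 10**3)    # 1 GB stored as 4 KB files

print(f"large chunks: {large:.2f} s, small files: {small:.0f} s")
```

Under these assumptions the same gigabyte takes roughly 10 seconds as a few large chunks and roughly 42 minutes as many small files, almost all of it spent seeking.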


    Typical Problem

    Collect and iterate over many records

    Filter and extract something from each

    Shuffle & sort these intermediate results

    Group-by and aggregate them

    Produce final output set


    Typical Problem

    Collect and iterate over many records

    Filter and extract something from each

    Shuffle & sort these intermediate results

    Group-by and aggregate them

    Produce final output set

    MAP

    REDUCE
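The five steps can be sketched on a single machine in plain Python. The record format (`"name number"`) and the sample data are made up for illustration.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical input records; one of them is malformed on purpose.
records = ["apple 3", "pear 5", "apple 4", "bad-record", "pear 1"]

def extract(record):
    """Map side: filter out malformed records and extract a (key, value) pair."""
    parts = record.split()
    if len(parts) == 2:
        yield parts[0], int(parts[1])

# Collect and iterate over the records, filtering and extracting from each.
intermediate = [pair for record in records for pair in extract(record)]

# Shuffle & sort: order the intermediate pairs by key.
intermediate.sort(key=itemgetter(0))

# Reduce side: group by key, aggregate the values, produce the final output.
output = {
    key: sum(value for _, value in group)
    for key, group in groupby(intermediate, key=itemgetter(0))
}
print(output)
```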


    Quick example

    127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /index.html HTTP/1.0" 200 2326 "http://www.example.com/start.html" "Mozilla/4.08 [en] (Win98; I ;Nav)"

    (frank, index.html)

    (index.html, 10/Oct/2000)

    (index.html, http://www.example.com/start.html)
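A possible map function for this example: parse one access-log line and emit the three pairs listed above. The regular expression and the names `LOG_PATTERN` and `map_log_line` are assumptions made for this sketch.

```python
import re

# Assumed pattern for an Apache-style "combined" log line.
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>[^\s\[]+)\s*'
    r'\[(?P<day>[^:]+):(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<proto>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)

def map_log_line(line):
    """Emit (user, page), (page, day) and (page, referrer) for one log line."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return []                    # skip lines that do not parse
    page = match.group("path").lstrip("/")
    return [
        (match.group("user"), page),
        (page, match.group("day")),
        (page, match.group("referrer")),
    ]

LINE = (
    '127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /index.html HTTP/1.0" 200 2326 '
    '"http://www.example.com/start.html" "Mozilla/4.08 [en] (Win98; I ;Nav)"'
)
for pair in map_log_line(LINE):
    print(pair)
```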


    MapReduce

    Programmers define two functions:

    map (key, value) -> (key, value)*

    reduce (key, [value+]) -> (key, value)*

    Can also define:

    combine (key, value) -> (key, value)*

    partitioner: k -> partition
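A single-process sketch of how the four functions fit together, using word count as the example. This is not the Hadoop API: `run_job` and the function names are hypothetical, the combiner is assumed to receive a list of values like the reducer, and the partitioner is a deterministic stand-in for a hash function.

```python
from collections import defaultdict

def map_fn(key, value):
    """(line offset, line text) -> (word, 1)*"""
    for word in value.split():
        yield word.lower(), 1

def combine_fn(key, values):
    """Map-side pre-aggregation: (word, [1, 1, ...]) -> (word, partial count)."""
    yield key, sum(values)

def partition_fn(key, num_partitions):
    """key -> partition index (deterministic stand-in for a hash partitioner)."""
    return sum(key.encode("utf-8")) % num_partitions

def reduce_fn(key, values):
    """(word, [partial counts]) -> (word, total count)"""
    yield key, sum(values)

def group(pairs):
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def run_job(splits, num_partitions=2):
    partitions = [[] for _ in range(num_partitions)]
    for split in splits:                        # one map task per input split
        mapped = [kv for key, value in split for kv in map_fn(key, value)]
        for key, values in group(mapped).items():   # combiner on each map output
            for ck, cv in combine_fn(key, values):
                partitions[partition_fn(ck, num_partitions)].append((ck, cv))
    result = {}
    for part in partitions:                     # one reduce task per partition
        for key, values in sorted(group(part).items()):
            for rk, rv in reduce_fn(key, values):
                result[rk] = rv
    return result

splits = [[(0, "the cat sat"), (12, "the cat")], [(0, "the dog")]]
counts = run_job(splits)
print(counts)
```

The partitioner guarantees that every partial count for a given word lands in the same partition, so exactly one reduce call sees all of them.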


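    Input: (k1, v1) (k2, v2) (k3, v3) (k4, v4) (k5, v5) (k6, v6)

    map   map   map   map

    Map outputs: (a, 1) (b, 2) | (c, 3) (c, 6) | (a, 5) (c, 2) | (b, 7) (c, 9)

    Shuffle and Sort: aggregate values by keys

    a -> [1, 5]    b -> [2, 7]    c -> [2, 3, 6, 9]

    reduce   reduce   reduce

    Outputs: (r1, s1) (r2, s2) (r3, s3)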
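The shuffle-and-sort step of the figure, reproduced on one machine. The pairs are read off the flattened figure, so their assignment to the four map tasks is a reconstruction. Sorting whole tuples is only done here to match the value order shown; Hadoop sorts by key and does not promise an order among the values of one key by default.

```python
from itertools import groupby
from operator import itemgetter

# Output of the four map tasks, as reconstructed from the figure.
map_outputs = [
    [("a", 1), ("b", 2)],
    [("c", 3), ("c", 6)],
    [("a", 5), ("c", 2)],
    [("b", 7), ("c", 9)],
]

def shuffle_and_sort(outputs):
    """Merge all map outputs, sort them, and collect the values of each key."""
    pairs = sorted(pair for output in outputs for pair in output)
    return {
        key: [value for _, value in group]
        for key, group in groupby(pairs, key=itemgetter(0))
    }

grouped = shuffle_and_sort(map_outputs)
print(grouped)
```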


    MapReduce daemons

    JobTracker: it's the Master; it schedules the jobs, assigns tasks to nodes, collects heartbeats from workers, and reschedules tasks for fault tolerance.

    TaskTracker: it's the Worker; it runs on each slave and runs (multiple) Mappers and Reducers, each in its own JVM.
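A toy model of heartbeat-based rescheduling, to make the JobTracker's role concrete. `ToyJobTracker`, the timeout value, and the worker and task names are all hypothetical; the real daemons are far more involved.

```python
class ToyJobTracker:
    """Tracks the last heartbeat per worker and reassigns tasks of silent workers."""

    def __init__(self, timeout):
        self.timeout = timeout
        self.last_seen = {}      # worker -> time of last heartbeat
        self.assignments = {}    # task -> worker

    def heartbeat(self, worker, now):
        self.last_seen[worker] = now

    def assign(self, task, worker):
        self.assignments[task] = worker

    def live_workers(self, now):
        return sorted(w for w, t in self.last_seen.items() if now - t <= self.timeout)

    def reschedule(self, now):
        """Move tasks away from workers whose last heartbeat is older than the timeout."""
        live = self.live_workers(now)
        moved = []
        if not live:
            return moved
        for i, (task, worker) in enumerate(sorted(self.assignments.items())):
            if worker not in live:
                self.assignments[task] = live[i % len(live)]
                moved.append(task)
        return moved

tracker = ToyJobTracker(timeout=10)
tracker.heartbeat("worker1", now=0)
tracker.heartbeat("worker2", now=0)
tracker.assign("map-0", "worker1")
tracker.assign("map-1", "worker2")
tracker.heartbeat("worker1", now=12)     # worker2 stays silent
moved = tracker.reschedule(now=15)
print(moved, tracker.assignments)
```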


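    [Figure: MapReduce execution overview]

    (1) The User Program forks the Master and the workers

    (2) The Master assigns map tasks and reduce tasks to workers

    (3) Map workers read the input splits (split 0 to split 4)

    (4) Map workers write intermediate files to local disk

    (5) Reduce workers read the intermediate files remotely

    (6) Reduce workers write the output files (output file 0, output file 1)

    Stages: Input files, Map phase, Intermediate files (on local disk), Reduce phase, Output files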


    HDFS daemons

    NameNode: it's the Master; it keeps the filesystem metadata (in memory) and the file-block-node mapping, decides replication and block placement, and collects heartbeats from nodes.

    DataNode: it's the Slave; it stores the blocks (64 MB) of the files and directly serves reads and writes.
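A sketch of the metadata the NameNode keeps, assuming 64 MB blocks as on the slide. `ToyNameNode`, the file path, the DataNode names, and the round-robin placement are hypothetical and only illustrate the file-block-node mapping.

```python
BLOCK_SIZE = 64 * 1024 * 1024   # 64 MB, the block size quoted on the slide

def split_into_blocks(file_size):
    """Sizes of the blocks that a file of file_size bytes occupies."""
    full, rest = divmod(file_size, BLOCK_SIZE)
    return [BLOCK_SIZE] * full + ([rest] if rest else [])

class ToyNameNode:
    """In-memory mapping: file -> blocks -> DataNodes (placement policy is assumed)."""

    def __init__(self, datanodes, replication=3):
        self.datanodes = datanodes
        self.replication = replication
        self.files = {}     # path -> list of block ids
        self.blocks = {}    # block id -> DataNodes holding a replica

    def add_file(self, path, file_size):
        block_ids = []
        for index in range(len(split_into_blocks(file_size))):
            block_id = f"{path}#{index}"
            start = len(self.blocks)
            self.blocks[block_id] = [
                self.datanodes[(start + r) % len(self.datanodes)]
                for r in range(self.replication)
            ]
            block_ids.append(block_id)
        self.files[path] = block_ids
        return block_ids

    def locate(self, path):
        """What a client asks for before reading directly from the DataNodes."""
        return [(block_id, self.blocks[block_id]) for block_id in self.files[path]]

namenode = ToyNameNode(["dn1", "dn2", "dn3", "dn4"])
block_ids = namenode.add_file("/logs/access.log", 150 * 1024 * 1024)
for block_id, replicas in namenode.locate("/logs/access.log"):
    print(block_id, replicas)
```

A 150 MB file occupies two full 64 MB blocks and one 22 MB block; the client then reads each block straight from one of the listed DataNodes.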


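    [Figure: GFS architecture]

    The GFS client sends (file name, chunk index) to the master

    The master holds the file namespace (e.g. /foo/bar -> chunk 2ef0) and replies with (chunk handle, chunk location)

    The master sends instructions to the chunkservers and receives chunkserver state

    The GFS client sends (chunk handle, byte range) to a GFS chunkserver and receives chunk data

    Each GFS chunkserver stores its chunks on a Linux file system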


    Take home recipe

    Scan-based computation (no random I/O)

    Big datasets

    Divide-and-conquer class algorithms

    No communication between tasks


    Not good for

    Real-time / Stream processing

    Graph processing

    Computation without locality

    Small datasets


    Questions?


    Our solution

    line format:[]*

    0 1.3 0 0 7.1 1.1

    1.2 0 0 0 0 3.4

    0 5.7 0 0 1.1 2

    5.1 0 0 4.6 0 10

    0 0 0 1.6 0 0

    1.3 7.1 1.1

    1.2 3.4

    5.7 1.1 2

    5.1 4.6 10

    1.6

    for example: cat12.131305.134.6510
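The figure appears to show a dense matrix being compacted to its non-zero entries. A minimal sketch of that idea follows; the exact line format is not recoverable from the slide, so keeping the column index next to each value (`nonzero_entries`) is an assumption.

```python
# The dense matrix as shown on the slide.
dense = [
    [0, 1.3, 0, 0, 7.1, 1.1],
    [1.2, 0, 0, 0, 0, 3.4],
    [0, 5.7, 0, 0, 1.1, 2],
    [5.1, 0, 0, 4.6, 0, 10],
    [0, 0, 0, 1.6, 0, 0],
]

def nonzero_values(row):
    """Drop the zeros of a row, keeping only the values."""
    return [value for value in row if value != 0]

def nonzero_entries(row):
    """(column index, value) pairs; storing the index is an assumption."""
    return [(index, value) for index, value in enumerate(row) if value != 0]

compact = [nonzero_values(row) for row in dense]
for row in compact:
    print(*row)
```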


    Benchmarking

    serial python (single-core): 7 minutes

    java+hadoop (single-core): 2 minutes

    serial python (big file): 18 days

    java+hadoop (parallel, big file): 8 hours

    it makes sense: 18 d / 3.5 ≈ 5.14 d, and 5.14 d / 14 ≈ 8.8 h, close to the measured 8 h
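The estimate can be checked step by step. The factor 3.5 is the single-core ratio of the first two measurements; the divisor 14 is taken from the slide as given and is presumably the degree of parallelism, which the slide does not state.

```python
SERIAL_PYTHON_MIN = 7        # serial python, single core
JAVA_HADOOP_MIN = 2          # java+hadoop, single core
SERIAL_BIG_FILE_DAYS = 18    # serial python, big file
PARALLEL_FACTOR = 14         # divisor used on the slide; presumably the parallelism

speedup = SERIAL_PYTHON_MIN / JAVA_HADOOP_MIN
days_single_core = SERIAL_BIG_FILE_DAYS / speedup
hours_parallel = days_single_core * 24 / PARALLEL_FACTOR

print(f"speedup {speedup}, single core {days_single_core:.2f} d, parallel {hours_parallel:.1f} h")
```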
