3.2-hadoopdb

Upload: surredd

Post on 07-Jul-2018


  • 8/18/2019 3.2-HadoopDB


    HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads

    Azza Abouzeid1, Kamil Bajda-Pawlikowski1, Daniel Abadi1, Avi Silberschatz1, Alexander Rasin2

    1Yale University, 2Brown University

    {azza,kbajda,dna,avi}@cs.yale.edu; alexr@cs.brown.edu

    Presented by Ying Yang

    10/01/2012


    Outline

    • Introduction, desired properties and background

    • HadoopDB Architecture

    • Results

    • Conclusions


    Introduction

    • Analytics are important today

    • The amount of data is exploding

    • Previous problem: DBMS on shared-nothing architectures (a collection of independent, possibly virtual, machines, each with local disk and local main memory, connected together on a high-speed network)


    Introduction

    • Approaches:

     – Parallel databases (analytical DBMS systems that deploy on a shared-nothing architecture)

     – Map/Reduce systems


    Introduction

    Sales Record Example

    • Consider a large data set of sales log records, each consisting of sales information including:

    (1) a date of sale, (2) a price

    • We would like to take the log records and generate a report showing the total sales for each year.

    • SELECT YEAR(date) AS year, SUM(price)

    • FROM sales GROUP BY year

    Question:

    • How do we generate this report efficiently and cheaply over massive data contained in a shared-nothing cluster of 1000s of machines?


    Introduction

    Sales Record Example using Hadoop:

    Query: Calculate total sales for each year.

    We write a MapReduce program:

    Map: Takes log records and extracts a key-value pair of year and sale price in dollars.
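The Map step described above can be sketched in plain Python. This is only an illustration, not Hadoop code: the record layout (a date plus an integer price), the function names, and the in-process "shuffle" are assumptions made for the example, and the Reduce step shown is the usual counterpart that sums the prices for each year.

```python
from collections import defaultdict
from datetime import date

def map_sale(record):
    """Map: extract a (year, price) key-value pair from one log record."""
    sale_date, price = record
    return sale_date.year, price

def reduce_sales(year, prices):
    """Reduce: sum all prices that share the same year key."""
    return year, sum(prices)

def run_job(records):
    # Shuffle: group the mapped values by key, as the framework would.
    groups = defaultdict(list)
    for record in records:
        year, price = map_sale(record)
        groups[year].append(price)
    return dict(reduce_sales(year, prices) for year, prices in groups.items())

# Hypothetical sample log: (date of sale, price in dollars).
sales_log = [
    (date(2008, 3, 1), 10),
    (date(2008, 7, 9), 5),
    (date(2009, 1, 2), 7),
]
totals = run_job(sales_log)
```

In a real cluster the map and reduce functions would run on different machines and the grouping would be done by the framework; the sketch only shows the data flow.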


    Introduction

    Parallel Databases:

    Parallel databases are like single-node databases except:

    Data is partitioned across nodes

    Individual relational operations can be executed in parallel

    SELECT YEAR(date) AS year, SUM(price) FROM sales GROUP BY year


    Introduction

    • Parallel Databases:

     – Fault tolerance: do not score so well

     – Assumption: dozens of nodes in clusters, failures are rare

    • MapReduce

     – Satisfies fault tolerance

     – Works in heterogeneous environments

     – Drawback: performance

       – No prior data modeling

       – No performance-enhancing techniques


    Introduction

    • In summary


    Desired properties

    • Performance (Parallel DBMS)

    • Fault tolerance (MapReduce)

    • Heterogeneous environments (MapReduce)

    • Flexible query interface (both)


    HadoopDB Architecture

    • Main goal: achieve the properties described before

    • Connect multiple single-node database systems

     – Hadoop is responsible for task coordination and the network layer

     – Queries are parallelized across the nodes

    • Fault tolerant and works on heterogeneous nodes

    • Parallel database performance

     – Query processing in the database engine


    HadoopDB's Components

    Database connector

    • Replaces the functionality of HDFS with the database connector, giving Mappers the ability to read the results of a database query

    • Responsibilities

     – Connect to the database

     – Execute the SQL query

     – Return the results as key-value pairs

    • Achieved goal

     – Data sources are similar to data blocks in HDFS
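The three responsibilities listed above (connect, execute, return key-value pairs) can be illustrated with a small Python sketch. It is not HadoopDB's connector: an in-memory SQLite database stands in for a node's local database, the helper name `read_key_values` is hypothetical, and treating the first column as the key is an assumption made for the example.

```python
import sqlite3

def read_key_values(conn, sql, params=()):
    """Execute a SQL query and yield each row as a (key, value) pair.

    Assumption of this sketch: the first column is the key and the
    remaining columns form the value tuple.
    """
    cursor = conn.execute(sql, params)
    for row in cursor:
        yield row[0], row[1:]

# An in-memory SQLite database stands in for one node's local database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (year INTEGER, price INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [(2008, 10), (2008, 5), (2009, 7)],
)

# A Mapper would consume these pairs instead of reading an HDFS block.
pairs = list(read_key_values(
    conn,
    "SELECT year, SUM(price) FROM sales GROUP BY year ORDER BY year",
))
conn.close()
```

The point of the design is that the Mapper sees the query result the same way it would see the records of a data block, so the rest of the MapReduce job does not need to know a database is involved.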


    HadoopDB's Components

    Catalog

    • Metadata about databases

    • Database location, driver class, credentials

    • Datasets in the cluster, replicas or partitioning

    • Catalog stored as an XML file in HDFS

    • Plan to deploy it as a separate service (similar to the NameNode)
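To make the kind of metadata listed above concrete, here is a minimal Python sketch that reads a catalog-like XML document. The element names, attribute names, hosts, and credentials are invented for illustration and do not reproduce HadoopDB's actual catalog file format; only the categories of information (location, driver class, credentials, partitioning) come from the slide.

```python
import xml.etree.ElementTree as ET

# Hypothetical catalog layout: every element and attribute name here is
# an assumption of this sketch, not HadoopDB's real schema.
CATALOG_XML = """
<catalog>
  <node host="node1" driver="org.postgresql.Driver"
        url="jdbc:postgresql://node1/db0" user="hdb" password="secret">
    <partition relation="sales" id="0"/>
  </node>
  <node host="node2" driver="org.postgresql.Driver"
        url="jdbc:postgresql://node2/db0" user="hdb" password="secret">
    <partition relation="sales" id="1"/>
  </node>
</catalog>
""".strip()

def partitions_for(catalog_xml, relation):
    """Return (host, url, partition id) for every partition of `relation`."""
    root = ET.fromstring(catalog_xml)
    found = []
    for node in root.findall("node"):
        for part in node.findall("partition"):
            if part.get("relation") == relation:
                found.append(
                    (node.get("host"), node.get("url"), int(part.get("id")))
                )
    return found

sales_locations = partitions_for(CATALOG_XML, "sales")
```

A planner could use a lookup like this to decide which node's database each Map task should query.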


    HadoopDB's Components

    Data loader

    • Responsibilities:

     – Globally repartitioning data

     – Breaking a single node's data into chunks

     – Bulk-loading data into the single-node database chunks

    • Two main components:

     – Global Hasher

       • Map/Reduce job that reads from HDFS and repartitions the data (splits data across nodes, i.e., relational database instances)

     – Local Hasher

       • Copies from HDFS to the local file system (subdivides data within each node)
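The two-level partitioning described above can be sketched in Python. This is an in-memory illustration under stated assumptions, not the loader itself: the function names, the CRC32-based hash, and the salt strings are choices made for the example, and a real loader would write files and bulk-load them rather than build lists.

```python
import zlib
from collections import defaultdict

def stable_hash(key, salt):
    """Deterministic hash; Python's built-in hash() is salted per process."""
    return zlib.crc32(f"{salt}:{key}".encode("utf-8"))

def global_hash(records, key_of, num_nodes):
    """Global Hasher: assign every record to one node by its partition key."""
    nodes = defaultdict(list)
    for record in records:
        nodes[stable_hash(key_of(record), "global") % num_nodes].append(record)
    return nodes

def local_hash(node_records, key_of, num_chunks):
    """Local Hasher: subdivide one node's records into smaller chunks.

    A different salt is used so the chunk choice is not correlated with
    the node choice made by the global step.
    """
    chunks = defaultdict(list)
    for record in node_records:
        chunks[stable_hash(key_of(record), "local") % num_chunks].append(record)
    return chunks

# Hypothetical records: (page URL, rank), partitioned on the URL.
records = [(f"url{i}", i) for i in range(100)]
key_of = lambda record: record[0]

nodes = global_hash(records, key_of, num_nodes=4)
layout = {
    node_id: local_hash(node_records, key_of, num_chunks=3)
    for node_id, node_records in nodes.items()
}
```

Because both steps hash on the same key, all records with equal keys end up in the same chunk of the same node, which is what later allows joins and aggregations on that key to run locally.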


    HadoopDB's Components

    Data loader

    [Figure: Global Hasher and Local Hasher. Recoverable labels: raw data files; blocks A and B; chunks Aa, Ab, Az.]


    HadoopDB's Components

    SMS (SQL to MapReduce to SQL) Planner

    • SMS planner extends Hive.

    • Hive processing steps:

     – Abstract syntax tree building

     – Semantic analyzer connects to the catalog

     – DAG (directed acyclic graph) of relational operators


    HadoopDB's Components

    SELECT YEAR(date) AS year, SUM(price) FROM sales GROUP BY year


    HadoopDB's Components

    SMS Planner extensions

    • Two phases before execution

     – Retrieve data fields to determine partitioning keys

     – Traverse the DAG (bottom up). Rule-based SQL generator


    HadoopDB's Components

    MapReduce job generated by SMS assuming sales is partitioned by YEAR(saleDate). This feature is still unsupported.


    HadoopDB's Components

    MapReduce job generated by SMS assuming no partitioning of sales
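The difference between the two generated jobs can be sketched in Python. This is a simplified model, not SMS planner output: in-memory SQLite databases stand in for the per-node databases, the table stores a plain `year` column instead of applying YEAR(saleDate), and the function names are hypothetical. The idea it illustrates is that when sales is partitioned by year the pushed-down SQL already yields final results, whereas without partitioning each node only produces partial sums that a Reduce phase must combine.

```python
import sqlite3
from collections import defaultdict

# SQL pushed into each node's database by the Map tasks.
PUSHED_SQL = "SELECT year, SUM(price) FROM sales GROUP BY year"

def make_node(rows):
    """Create an in-memory database standing in for one node's partition."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (year INTEGER, price INTEGER)")
    conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
    return conn

def map_phase(nodes):
    """Map: each task runs the pushed-down SQL on its local database."""
    for conn in nodes:
        yield from conn.execute(PUSHED_SQL)

def plan_partitioned(nodes):
    """Partitioned by year: every year lives on exactly one node, so the
    per-node results are final and no Reduce phase is needed."""
    return dict(map_phase(nodes))

def plan_unpartitioned(nodes):
    """No partitioning: per-node sums are partial; Reduce adds them up."""
    totals = defaultdict(int)
    for year, partial_sum in map_phase(nodes):
        totals[year] += partial_sum
    return dict(totals)

# Same three sales, laid out two different ways across two nodes.
by_year = [make_node([(2008, 10), (2008, 5)]), make_node([(2009, 7)])]
mixed = [make_node([(2008, 10), (2009, 7)]), make_node([(2008, 5)])]

result_partitioned = plan_partitioned(by_year)
result_unpartitioned = plan_unpartitioned(mixed)
```

Both layouts give the same totals; the partitioned layout simply avoids the shuffle and Reduce work.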


    Benchmarks

    Environment:

    • Amazon EC2 "large" instances

    • Each instance

     – 7.5 GB memory

     – 2 virtual cores

     – 850 GB storage

     – 64-bit Linux Fedora 8


    Benchmarks

    Benchmarked systems:

    • Hadoop

     – 256 MB data blocks

     – 1024 MB heap size

     – 200 MB sort buffer

    • HadoopDB

     – Similar to the Hadoop configuration

     – PostgreSQL 8.2.5

     – No data compression


    Benchmarks

    Benchmarked systems:

    • Vertica

     – New parallel database (column store)

     – Used a cloud edition

     – All data is compressed

    • DBMS-X

     – Commercial parallel row store

     – Run on EC2 (no cloud edition available)


    Benchmarks

    Used data:

    • HTTP log files, HTML pages, rankings

    • Sizes (per node):

     – 155 million user visits (~20 gigabytes)

     – 18 million rankings (~1 gigabyte)

     – Stored as plain text in HDFS


    Loading data


    Grep Task

    1) Full table scan, highly selective filter

    2) Random data, no room for indexing

    3) Hadoop overhead outweighs query processing time in single-node databases

    SELECT * FROM Data WHERE field LIKE '%xyz%';


    Selection Task

    Query:

    SELECT pageUrl, pageRank
    FROM Rankings
    WHERE pageRank > 10

    All systems except Hadoop used clustered indices on the pageRank column.

    In HadoopDB each chunk is 50 MB; the overhead of scheduling such small amounts of data leads to bad performance.


    Aggregation Task

    Smaller query

    SELECT SUBSTR(sourceIP, 1, 7), SUM(adRevenue)
    FROM UserVisits
    GROUP BY SUBSTR(sourceIP, 1, 7)

    Larger query

    SELECT sourceIP, SUM(adRevenue)
    FROM UserVisits
    GROUP BY sourceIP


    Join Task

    Query

    SELECT sourceIP, COUNT(pageRank), SUM(pageRank), SUM(adRevenue)
    FROM Rankings AS R, UserVisits AS UV
    WHERE R.pageURL = UV.destURL AND UV.visitDate BETWEEN '2000-01-15' AND '2000-01-22'
    GROUP BY UV.sourceIP

    No full table scan due to clustered indexing

    Hash partitioning and efficient join algorithm


    UDF Aggregation Task


    Summary

    ● HadoopDB approaches the performance of parallel databases in the absence of failures

     – PostgreSQL is not a column store

     – DBMS-X results are 15% overly optimistic

     – No compression in PostgreSQL data

     – Overhead in the interaction between Hadoop and PostgreSQL

    ● Outperforms Hadoop


    Benchmarks

    ● Fault tolerance and heterogeneous environments


    Fault tolerance and heterogeneous environments

    ● Fault tolerance is important


    Discussion

    ● Vertica is faster

    ● Vertica needs fewer nodes to achieve the same order of magnitude of performance

    ● Fault tolerance is important


    Conclusion

    ● Approaches the performance of parallel databases while keeping fault tolerance

    ● PostgreSQL is not a column store

    ● Hadoop and Hive are relatively new open source projects

    ● HadoopDB is flexible and extensible


    ● Thank you!