hive* – a petabyte scale data warehouse using hadoopkmsalem/courses/cs743/f14/slides/naoua… ·...

13
1 Hive –  A Petabyte Scale Data Warehouse Using Hadoop Presenter Malek NAOUACH, Nets&Dist Sys Authors Facebook Data Infrastructure Team Conference Data Engineering (ICDE), 2010 IEEE CS 743, Fall 2014 UNIVERSITY OF WATERLOO November 13 th  , 2014

Upload: others

Post on 25-Apr-2020

1 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Hive* – A Petabyte Scale Data Warehouse Using Hadoopkmsalem/courses/cs743/F14/slides/Naoua… · Hive* – A Petabyte Scale Data Warehouse Using Hadoop Presenter Malek NAOUACH,

1

Hive*  –  A Petabyte Scale Data Warehouse Using Hadoop

PresenterMalek NAOUACH, Nets&Dist Sys

AuthorsFacebook Data Infrastructure Team

ConferenceData Engineering (ICDE), 2010 IEEE

CS 743, Fall 2014

UNIVERSITY OF

WATERLOO

November 13th , 2014

Page 2: Hive* – A Petabyte Scale Data Warehouse Using Hadoopkmsalem/courses/cs743/F14/slides/Naoua… · Hive* – A Petabyte Scale Data Warehouse Using Hadoop Presenter Malek NAOUACH,

2

Overview*

MassivelyParallel

FaultTolerant

LinearlyScalable

Big DataProcessing

MapReduce

Familiarity

DecisionsMaking Hadoop

Hive

Page 3: Hive* – A Petabyte Scale Data Warehouse Using Hadoopkmsalem/courses/cs743/F14/slides/Naoua… · Hive* – A Petabyte Scale Data Warehouse Using Hadoop Presenter Malek NAOUACH,

3

Complex Data Types

Associative arrays | Lists | Structs

Hive Data Structure*Primitive Data Types

INT | TINYINT | SMALLINT | BIGINT | BOOLEAN | FLOAT

Complex Datatypes Compositionlist<map<string, struct<p1:int, p2:int>>>

Complex Schema Creation CREATE TABLE t1(st string, fl float, li list<map<string, struct<p1:int, p2:int>>>)

Hive Data Incorporation

+ SerDe Interface+ ObjectInspector Interface+ getObjectInspector method 

**Serialization

Process of translating data structures or object  state  into  a  format  that  can  be stored and reconstructed later.

Page 4: Hive* – A Petabyte Scale Data Warehouse Using Hadoopkmsalem/courses/cs743/F14/slides/Naoua… · Hive* – A Petabyte Scale Data Warehouse Using Hadoop Presenter Malek NAOUACH,

4

NOT HiveQL Semantics

INSERT | UPDATE | DELETE

Hive Query Language*HiveQL Semantics (SQL)

SUBQUERIES | INNER, LEFT & RIGHT OUTER JOINS | CARTESIAN PROD | GROUP By | AGGREGATION | UNION | CREATE TABLE HiveQL Supports Map­Red Programs

FROM (   MAP stocks USING 'python ce_mapper.py'  AS (company,value)  FROM stocksStat  CLUSTER BY value) aReduce company,value USING 'python ce_reduce.py'

HiveQL Data Insertion

INSERT OVERWRITE 

**HQL

Hibernate Query Language

Page 5: Hive* – A Petabyte Scale Data Warehouse Using Hadoopkmsalem/courses/cs743/F14/slides/Naoua… · Hive* – A Petabyte Scale Data Warehouse Using Hadoop Presenter Malek NAOUACH,

5

Data Storage*

Logical Partitioning

Buckets

HDFS

…...

MetaData

Stocks

Hive MetaStore

/hive/stocks//hive/stocks/2014­11­13//hive/stocks/2014­11­13/10/hive/stocks/2014­11­13/11/hive/stocks/2014­11­13/12

Prune/BucketData

Library

Schema

CREATE TABLE Stocks (Company STRING, val DOUBLE)PARTITIONED BY (day STRING, hr INT);

Page 6: Hive* – A Petabyte Scale Data Warehouse Using Hadoopkmsalem/courses/cs743/F14/slides/Naoua… · Hive* – A Petabyte Scale Data Warehouse Using Hadoop Presenter Malek NAOUACH,

6

System Architecture(1/3)*

HADOOP(MAP­REDUCE + HDFS)

Job Tracker Name Node

Data Node+

Task Tracker

Hive

MetaStoreDriver

(Compiler, Optimizer, Executor)

Thrift Server

ODBCJDBC

WebInterface

CLI

Page 7: Hive* – A Petabyte Scale Data Warehouse Using Hadoopkmsalem/courses/cs743/F14/slides/Naoua… · Hive* – A Petabyte Scale Data Warehouse Using Hadoop Presenter Malek NAOUACH,

7

System Architecture (2/3)*

**Interoperability

is  the  ability  of  a  system  to  work  with other  systems  without  special  effort  on the customer side.

**Logical/Physical Plan

Abstract  Syntax  Tree  (AST)  for  the query,  Query  Block  Tree,  Involved Interfaces, Directed Acyclic Graph

E.Client

Web UI

CLI

Thrift Interf.

ODBC Interf.

JDBC Interf.

Driver1. exeHiveQuery

7. fetchResults

QueryCompiler

2. getExePhysPlan

MetaStore3. getMetaData

4. sendMetaData

5. sendExePhysPlan

ExecutionEngine

6.1. metaDataOpsforDDLs

5. exePhysPlan

8. sendResults

6.1. exeJob

6.2. jobDone

Hive HADOOP

Page 8: Hive* – A Petabyte Scale Data Warehouse Using Hadoopkmsalem/courses/cs743/F14/slides/Naoua… · Hive* – A Petabyte Scale Data Warehouse Using Hadoop Presenter Malek NAOUACH,

8

System Architecture (3/3)*

JobTracker6.1. exeJob

6.2. jobDone

Map Op.Tree

SerDe

Map Op.Tree

SerDe

Task Trackers(MAP)

Task Trackers(Reduce)

MapReduceTasks

DataNodes

HADOOP

MapReduce

HDFS

Page 9: Hive* – A Petabyte Scale Data Warehouse Using Hadoopkmsalem/courses/cs743/F14/slides/Naoua… · Hive* – A Petabyte Scale Data Warehouse Using Hadoop Presenter Malek NAOUACH,

9

HiveQL to Phys. Plan Exp. (1/3)*

FROM(SELECT a.status, b.school, b.genderFROM status_updates a JOIN profiles bON (a.userid = b.userid AND a.ds='2009­03­20')) subq1

INSERT OVERWRITE TABLE gender_summary PARTITION (ds='2009­03­20')

SELECT subq1.gender, COUNT(1)GROUP BY subq1.gender

INSERT OVERWRITE TABLE school_summary PARTITION (ds='2009­03­20')

SELECT subq1.school, COUNT(1)GROUP BY subq1.school

Page 10: Hive* – A Petabyte Scale Data Warehouse Using Hadoopkmsalem/courses/cs743/F14/slides/Naoua… · Hive* – A Petabyte Scale Data Warehouse Using Hadoop Presenter Malek NAOUACH,

10

HiveQL to Phys. Plan Exp. (2/3)*

status_updates(userid, status, ds)

profiles(userid, school, gender)

Page 11: Hive* – A Petabyte Scale Data Warehouse Using Hadoopkmsalem/courses/cs743/F14/slides/Naoua… · Hive* – A Petabyte Scale Data Warehouse Using Hadoop Presenter Malek NAOUACH,

11

SELECT subq1.gender, COUNT(1)GROUP BY subq1.gender

SELECT subq1.school, COUNT(1)GROUP BY subq1.school

HiveQL to Phys. Plan Exp. (3/3)*

Page 12: Hive* – A Petabyte Scale Data Warehouse Using Hadoopkmsalem/courses/cs743/F14/slides/Naoua… · Hive* – A Petabyte Scale Data Warehouse Using Hadoop Presenter Malek NAOUACH,

12

Brief Recap.*

http://hive.apache.org/

http://hadoop.apache.org/

✔ Hive is created to simplify big data analysis. (1hour for new users to master)

✔ Hive is improving the performance of Hadoop. (+20% efficiency)

✔ Hive enables data processing at a fraction of the cost of more traditional WD.

✔ Hive is  working towards to subsume SQL syntax.

✔ Hive is enhancing the Query Complier and the interoperability. 

Page 13: Hive* – A Petabyte Scale Data Warehouse Using Hadoopkmsalem/courses/cs743/F14/slides/Naoua… · Hive* – A Petabyte Scale Data Warehouse Using Hadoop Presenter Malek NAOUACH,

13

Thanks!*

Questions?