Hive
Data Warehousing on Hadoop
Hadoop is great for large-data processing! But writing Java programs for everything is verbose and slow, and analysts don't want to (or can't) write Java. Solution: develop higher-level data processing languages.
Hive: HQL is like SQL
Pig: Pig Latin is a bit like Perl

Need for High-Level Languages
Problem: data, data, and more data
200 GB per day in March 2008; 2+ TB of (compressed) raw data per day today

The Hadoop experiment:
Availability and scalability much superior to commercial DBs
Efficiency not that great; required more hardware
Partial availability/resilience/scale more important than ACID

Problem: programmability and metadata
MapReduce is hard to program (users know SQL/bash/Python)
Need to publish data in well-known schemas

Why Hive?
HIVE: Components
Shell: allows interactive queries
Driver: session handles, fetch, execute
Compiler: parse, plan, optimize
Execution engine: DAG of stages (MR, HDFS, metadata)
Metastore: schema, location in HDFS, SerDe
Tables: typed columns (int, float, string, boolean); also list and map (for JSON-like data)
Partitions: for example, range-partition tables by date (command: PARTITIONED BY)
Buckets: hash partitions within ranges, useful for sampling and join optimization (command: CLUSTERED BY)

Data Model
Database: namespace containing a set of tables
Holds table definitions (column types, physical layout)
Holds partitioning information
Can be stored in Derby, MySQL, and many other relational databases

Metastore
Warehouse directory in HDFS, e.g., /user/hive/warehouse
Tables stored in subdirectories of the warehouse
Partitions form subdirectories of tables
Actual data stored in flat files: control character-delimited text, or SequenceFiles
With a custom SerDe, can use an arbitrary format

Physical Layout
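The layout above can be sketched as path construction. This is a minimal illustration, assuming a table named "sample" with a partition column "ds" (both are hypothetical names, matching the later DDL examples, not a live cluster):

```python
import posixpath

# Hive's warehouse root in HDFS (the default path cited above).
WAREHOUSE = "/user/hive/warehouse"

def table_dir(table):
    # Tables live in subdirectories of the warehouse directory.
    return posixpath.join(WAREHOUSE, table)

def partition_dir(table, **partition_values):
    # Partitions form "column=value" subdirectories of the table directory.
    spec = [f"{k}={v}" for k, v in partition_values.items()]
    return posixpath.join(table_dir(table), *spec)

print(table_dir("sample"))                       # /user/hive/warehouse/sample
print(partition_dir("sample", ds="2012-02-24"))  # /user/hive/warehouse/sample/ds=2012-02-24
```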
[Architecture diagram: the Hive CLI (DDL, queries, browsing), the Management Web UI, and the Thrift API submit HiveQL to the Parser and Planner; the Execution engine runs plans as Map Reduce jobs over HDFS; the MetaStore and the SerDe layer (Thrift, Jute, JSON, ...) describe how the data is stored.]

HIVE: Components
CREATE TABLE sample (foo INT, bar STRING) PARTITIONED BY (ds STRING);
SHOW TABLES '.*s';
DESCRIBE sample;
ALTER TABLE sample ADD COLUMNS (new_col INT);
DROP TABLE sample;

Examples – DDL Operations
LOAD DATA LOCAL INPATH './sample.txt' OVERWRITE INTO TABLE sample PARTITION (ds='2012-02-24');
LOAD DATA INPATH '/user/falvariz/hive/sample.txt' OVERWRITE INTO TABLE sample PARTITION (ds='2012-02-24');

Examples – DML Operations
SELECT * FROM (
  FROM pv_users
  SELECT TRANSFORM(pv_users.userid, pv_users.date)
  USING 'map_script' AS (dt, uid)
  CLUSTER BY (dt)) map
INSERT INTO TABLE pv_users_reduced
  SELECT TRANSFORM(map.dt, map.uid)
  USING 'reduce_script' AS (date, count);

Running Custom Map/Reduce Scripts
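Hive streams rows to such scripts as tab-delimited lines on stdin and reads their stdout the same way. A hedged sketch of what 'map_script' and 'reduce_script' from the query might do (the count-per-date logic is an illustrative assumption; the slide does not show the script bodies):

```python
from itertools import groupby

def map_rows(lines):
    # TRANSFORM(userid, date) -> (dt, uid): swap the two tab-delimited columns.
    for line in lines:
        userid, date = line.rstrip("\n").split("\t")
        yield f"{date}\t{userid}"

def reduce_rows(lines):
    # CLUSTER BY (dt) guarantees input arrives grouped by dt,
    # so a single pass with groupby is enough to count rows per date.
    pairs = (line.rstrip("\n").split("\t") for line in lines)
    for dt, group in groupby(pairs, key=lambda p: p[0]):
        yield f"{dt}\t{sum(1 for _ in group)}"

print(list(reduce_rows(sorted(map_rows(["111\t2008-03-03", "222\t2008-03-03"])))))
```

In a real deployment each function would read `sys.stdin` and print to stdout as a standalone script.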
[Diagram: two machines hold input pairs (<k1,v1>, <k2,v2>, <k3,v3> on one; <k4,v4>, <k5,v5>, <k6,v6> on the other). A local map emits new pairs <nk1,nv1> ... <nk1,nv6>; a global shuffle routes all pairs with the same key to one machine; a local sort groups them; a local reduce aggregates each group, e.g., <nk2, 3>, <nk1, 2>, <nk3, 1>.]

(Simplified) Map Reduce Review
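The map, shuffle, sort, and reduce phases above can be simulated in a few lines, using the same keys (nk1..nk3) and counting values per key as a stand-in for a generic reduce:

```python
from collections import defaultdict

# Output of the local map phase: (key, value) pairs from both machines.
mapped = [("nk1", "nv1"), ("nk2", "nv2"), ("nk3", "nv3"),
          ("nk2", "nv4"), ("nk2", "nv5"), ("nk1", "nv6")]

# Global shuffle: route every pair with the same key to the same bucket.
buckets = defaultdict(list)
for key, value in mapped:
    buckets[key].append(value)

# Local sort + reduce: process keys in order, aggregating each group.
reduced = {key: len(values) for key, values in sorted(buckets.items())}
print(reduced)  # {'nk1': 2, 'nk2': 3, 'nk3': 1}
```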
SQL:
INSERT INTO TABLE pv_users
SELECT pv.pageid, u.age
FROM page_view pv JOIN user u ON (pv.userid = u.userid);

page_view:
pageid | userid | time
1      | 111    | 9:08:01
2      | 111    | 9:08:13
1      | 222    | 9:08:14

user:
userid | age | gender
111    | 25  | female
222    | 32  | male

page_view X user = pv_users:
pageid | age
1      | 25
2      | 25
1      | 32

Hive QL – Join
Map output, keyed by userid and tagged by source table (<1, ...> = page_view, <2, ...> = user):

from page_view:
key | value
111 | <1,1>
111 | <1,2>
222 | <1,1>

from user:
key | value
111 | <2,25>
222 | <2,32>

Shuffle and sort bring all values for one key to one reducer:
key 111: <1,1>, <1,2>, <2,25>
key 222: <1,1>, <2,32>

Reduce pairs each page_view value with each user value per key:
pageid | age
1      | 25
2      | 25
1      | 32

Hive QL – Join in Map Reduce
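The reduce-side join above can be sketched in Python. The data matches the slide's example; tag 1 marks page_view rows and tag 2 marks user rows:

```python
from collections import defaultdict

page_view = [(1, 111), (2, 111), (1, 222)]   # (pageid, userid)
user = [(111, 25), (222, 32)]                # (userid, age)

# Map phase: emit (key=userid, value=(table_tag, payload)).
mapped = [(uid, (1, pageid)) for pageid, uid in page_view]
mapped += [(uid, (2, age)) for uid, age in user]

# Shuffle: group all values for one userid on one "reducer".
groups = defaultdict(list)
for key, value in mapped:
    groups[key].append(value)

# Reduce: cross-product page_view payloads with user payloads per key.
pv_users = []
for key, values in sorted(groups.items()):
    pages = [p for tag, p in values if tag == 1]
    ages = [a for tag, a in values if tag == 2]
    pv_users += [(pageid, age) for pageid in pages for age in ages]

print(sorted(pv_users))  # [(1, 25), (1, 32), (2, 25)]
```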
Outer Joins:
INSERT INTO TABLE pv_users
SELECT pv.*, u.gender, u.age
FROM page_view pv FULL OUTER JOIN user u ON (pv.userid = u.id)
WHERE pv.date = '2008-03-03';
Joins
Only equality joins with conjunctions are supported.
Future work:
Prune values sent from map to reduce on the basis of projections
Make the Cartesian product more memory-efficient
Map-side joins: hash joins if one of the tables is very small
Exploit pre-sorted data by doing a map-side merge join
Join To Map Reduce
SQL:
FROM (a JOIN b ON a.key = b.key) JOIN c ON a.key = c.key
SELECT …

A: key | av      B: key | bv      C: key | cv
   1   | 111        1   | 222        1   | 333

First Map Reduce job: A join B → AB (key | av | bv: 1 | 111 | 222)
Second Map Reduce job: AB join C → ABC (key | av | bv | cv: 1 | 111 | 222 | 333)

Because both joins use the same key (a.key), Hive can merge the sequential jobs into a single Map Reduce job.

Hive Optimizations – Merge Sequential Map Reduce Jobs
SELECT pageid, age, count(1)
FROM pv_users
GROUP BY pageid, age;

pv_users:
pageid | age
1      | 25
2      | 25
1      | 32
2      | 25

Result:
pageid | age | count
1      | 25  | 1
2      | 25  | 2
1      | 32  | 1

Hive QL – Group By
Map: each mapper emits key = <pageid, age>, value = 1 for its slice of pv_users:

Mapper 1 (rows (1,25), (2,25)): <1,25> → 1, <2,25> → 1
Mapper 2 (rows (1,32), (2,25)): <1,32> → 1, <2,25> → 1

Shuffle and sort route equal keys to the same reducer:

Reducer 1: <1,25> → 1; <1,32> → 1
Reducer 2: <2,25> → 1; <2,25> → 1

Reduce sums the values per key:

Reducer 1 output:
pageid | age | count
1      | 25  | 1
1      | 32  | 1

Reducer 2 output:
pageid | age | count
2      | 25  | 2

Hive QL – Group By in Map Reduce
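The same GROUP BY can be simulated directly, using the slide's pv_users rows. The shuffle and reduce phases collapse into one counting pass over the mapped pairs:

```python
from collections import Counter

pv_users = [(1, 25), (2, 25), (1, 32), (2, 25)]   # (pageid, age)

# Map: emit key=(pageid, age), value=1 for every row.
mapped = [((pageid, age), 1) for pageid, age in pv_users]

# Shuffle + reduce: sum the 1s for each key.
counts = Counter()
for key, one in mapped:
    counts[key] += one

result = sorted((pageid, age, n) for (pageid, age), n in counts.items())
print(result)  # [(1, 25, 1), (1, 32, 1), (2, 25, 2)]
```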
SELECT pageid, COUNT(DISTINCT userid)
FROM page_view
GROUP BY pageid;

page_view:
pageid | userid | time
1      | 111    | 9:08:01
2      | 111    | 9:08:13
1      | 222    | 9:08:14
2      | 111    | 9:08:20

Result:
pageid | count_distinct_userid
1      | 2
2      | 1

Hive QL – Group By with Distinct
Map: each mapper emits the pair <pageid, userid> as the key:

Mapper 1 (rows at 9:08:01 and 9:08:13): <1,111>, <2,111>
Mapper 2 (rows at 9:08:14 and 9:08:20): <1,222>, <2,111>

Shuffle and sort on <pageid, userid> bring duplicate pairs together (e.g., the two <2,111> pairs), so the reduce phase counts each distinct userid only once per pageid:

pageid | count
1      | 2
2      | 1

Hive QL – Group By with Distinct in Map Reduce
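A minimal simulation of the distinct count, again with the slide's page_view rows. Adding userids to a per-pageid set plays the role of the shuffle deduplicating <pageid, userid> pairs:

```python
from collections import defaultdict

page_view = [(1, 111), (2, 111), (1, 222), (2, 111)]   # (pageid, userid)

# Shuffle on the pair (pageid, userid): duplicates collapse into one entry.
distinct = defaultdict(set)
for pageid, userid in page_view:
    distinct[pageid].add(userid)

# Reduce: count the unique userids per pageid.
result = {pageid: len(users) for pageid, users in distinct.items()}
print(result)  # {1: 2, 2: 1}
```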
FROM pv_users
INSERT INTO TABLE pv_gender_sum
  SELECT pv_users.gender, count_distinct(pv_users.userid)
  GROUP BY pv_users.gender
INSERT INTO DIRECTORY '/user/facebook/tmp/pv_age_sum.dir'
  SELECT pv_users.age, count_distinct(pv_users.userid)
  GROUP BY pv_users.age
INSERT INTO LOCAL DIRECTORY '/home/me/pv_age_sum.dir'
  FIELDS TERMINATED BY ',' LINES TERMINATED BY \013
  SELECT pv_users.age, count_distinct(pv_users.userid)
  GROUP BY pv_users.age;

Inserts into Files, Tables and Local Files
Thank You