
Data Warehousing on Hadoop

HIVE

Hadoop is great for large-data processing! But writing Java programs for everything is verbose and slow, and analysts don't want to (or can't) write Java.

Solution: develop higher-level data processing languages.
Hive: HQL is like SQL.
Pig: Pig Latin is a bit like Perl.

Need for High-Level Languages

Problem: data, data, and more data
200 GB per day in March 2008; 2+ TB (compressed) of raw data per day today.

The Hadoop experiment
Availability and scalability much superior to commercial DBs.
Efficiency not that great; required more hardware.
Partial availability/resilience/scale more important than ACID.

Problem: programmability and metadata
Map-reduce is hard to program (users know SQL/bash/Python).
Need to publish data in well-known schemas.

Why Hive?


Shell: allows interactive queries.
Driver: session handles, fetch, execute.
Compiler: parse, plan, optimize.
Execution engine: DAG of stages (Map Reduce, HDFS, metadata).
Metastore: schema, location in HDFS, SerDe.

HIVE: Components

Tables
Typed columns (int, float, string, boolean).
Also list and map types (for JSON-like data).

Partitions
For example, range-partition tables by date.
Command: PARTITIONED BY

Buckets
Hash partitions within ranges (useful for sampling and join optimization).
Command: CLUSTERED BY (a combined DDL sketch follows this slide)

Data Model
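
A minimal DDL sketch combining partitioning and bucketing; the table and column names are illustrative, loosely modeled on the page_view table used later in the deck:

  -- Range-partition page views by date, and hash-bucket each partition
  -- by userid so sampling and joins can work on a subset of buckets.
  CREATE TABLE page_view (
    pageid INT,
    userid INT,
    view_time STRING
  )
  PARTITIONED BY (ds STRING)
  CLUSTERED BY (userid) INTO 32 BUCKETS;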

Database: namespace containing a set of tables.
Holds table definitions (column types, physical layout).
Holds partitioning information.
Can be stored in Derby, MySQL, and many other relational databases.

Metastore
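
Schema questions asked from the CLI are answered by the metastore. A small sketch of inspecting it (the sample table is the one defined in the DDL examples later in the deck):

  SHOW DATABASES;              -- namespaces known to the metastore
  DESCRIBE FORMATTED sample;   -- column types plus physical layout:
                               -- HDFS location, SerDe, partition keys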

Warehouse directory in HDFS, e.g., /user/hive/warehouse.
Tables are stored in subdirectories of the warehouse.
Partitions form subdirectories of tables.
Actual data is stored in flat files: control-character-delimited text, or SequenceFiles.
With a custom SerDe, Hive can use an arbitrary format.

Physical Layout
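
A sketch of how that layout looks for the partitioned sample table defined later in the deck (paths assume the default warehouse directory):

  /user/hive/warehouse/sample/                           table directory
  /user/hive/warehouse/sample/ds=2012-02-24/             one partition (ds is the partition column)
  /user/hive/warehouse/sample/ds=2012-02-24/sample.txt   flat data file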

[Architecture diagram: the Hive CLI (DDL, queries, browsing) and a management web UI sit above a Driver whose Parser, Planner, and Execution components compile Hive QL into Map Reduce jobs over HDFS; the MetaStore is reached through a Thrift API, and pluggable SerDes (Thrift, Jute, JSON, ...) handle serialization.]

HIVE: Components

-- Create a partitioned table
CREATE TABLE sample (foo INT, bar STRING) PARTITIONED BY (ds STRING);
-- List tables whose names end in 's'
SHOW TABLES '.*s';
-- Show the table's columns
DESCRIBE sample;
-- Add a column
ALTER TABLE sample ADD COLUMNS (new_col INT);
-- Drop the table
DROP TABLE sample;

Examples – DDL Operations

-- Load from the local filesystem into one partition
LOAD DATA LOCAL INPATH './sample.txt' OVERWRITE INTO TABLE sample PARTITION (ds='2012-02-24');
-- Load from HDFS (the file is moved into the warehouse, not copied)
LOAD DATA INPATH '/user/falvariz/hive/sample.txt' OVERWRITE INTO TABLE sample PARTITION (ds='2012-02-24');

Examples – DML Operations
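
After loading, filtering on the partition column prunes whole directories rather than scanning all the data. A follow-on query sketch (illustrative, not from the deck):

  SELECT foo, bar
  FROM sample
  WHERE ds = '2012-02-24'   -- only the ds=2012-02-24 subdirectory is read
  LIMIT 10;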

FROM (
  FROM pv_users
  SELECT TRANSFORM(pv_users.userid, pv_users.date)
  USING 'map_script' AS (dt, uid)
  CLUSTER BY (dt)
) map
INSERT INTO TABLE pv_users_reduced
  SELECT TRANSFORM(map.dt, map.uid)
  USING 'reduce_script' AS (date, count);

Running Custom Map/Reduce Scripts
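
The scripts named in USING must first be shipped to every node; the CLI's ADD FILE command does that. A sketch (the script names are the placeholders from the query above; their contents are not shown in the deck):

  ADD FILE map_script;
  ADD FILE reduce_script;
  -- Each script reads tab-separated rows on stdin and writes
  -- tab-separated rows on stdout, one record per line.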

[Figure: on each of two machines, a local map turns input pairs <k1, v1> ... <k6, v6> into new pairs <nk, nv>; a global shuffle routes each new key to one machine; a local sort groups equal keys together; a local reduce then aggregates each group into counts such as <nk2, 3> and <nk1, 2>, <nk3, 1>.]

(Simplified) Map Reduce Review

SQL:
INSERT INTO TABLE pv_users
SELECT pv.pageid, u.age
FROM page_view pv JOIN user u ON (pv.userid = u.userid);

page_view:                      user:
pageid  userid  time            userid  age  gender
1       111     9:08:01         111     25   female
2       111     9:08:13         222     32   male
1       222     9:08:14

page_view X user = pv_users:
pageid  age
1       25
2       25
1       32

Hive QL – Join

Map tags each row with its source table (1 = page_view, 2 = user) and emits the join key:

from page_view:  111 -> <1, 1>   111 -> <1, 2>   222 -> <1, 1>
from user:       111 -> <2, 25>  222 -> <2, 32>

Shuffle and sort group rows by join key:

key 111: <1, 1>, <1, 2>, <2, 25>
key 222: <1, 1>, <2, 32>

Reduce joins the tagged rows within each key group, producing pv_users:

pageid  age
1       25
2       25
1       32

Hive QL – Join in Map Reduce

Outer joins:
INSERT INTO TABLE pv_users
SELECT pv.*, u.gender, u.age
FROM page_view pv FULL OUTER JOIN user u ON (pv.userid = u.id)
WHERE pv.date = '2008-03-03';

Joins

Only equality joins with conjunctions are supported.

Future work:
Pruning of values sent from map to reduce on the basis of projections.
Making Cartesian products more memory-efficient.
Map-side joins: hash joins if one of the tables is very small; exploit pre-sorted data by doing a map-side merge join (a hint-based sketch follows this slide).

Join To Map Reduce
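
Map-side hash joins can be requested explicitly with a query hint. A hedged sketch (MAPJOIN is real HiveQL, though it may postdate these slides; the tables are those from the join example above):

  -- user is small: replicate it to every mapper as an in-memory
  -- hash table, so the join needs no shuffle or reduce phase.
  SELECT /*+ MAPJOIN(u) */ pv.pageid, u.age
  FROM page_view pv JOIN user u ON (pv.userid = u.userid);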

SQL:
FROM (a JOIN b ON a.key = b.key) JOIN c ON a.key = c.key
SELECT …

A:             B:             C:
key  av        key  bv        key  cv
1    111       1    222       1    333

A join B (Map Reduce) gives AB:       AB join C (Map Reduce) gives ABC:
key  av   bv                          key  av   bv   cv
1    111  222                         1    111  222  333

Because both joins use the same key (a.key), Hive merges the two sequential Map Reduce jobs into one.

Hive Optimizations – Merge Sequential Map Reduce Jobs

SELECT pageid, age, count(1)
FROM pv_users
GROUP BY pageid, age;

pv_users:          result:
pageid  age        pageid  age  count
1       25         1       25   1
2       25         2       25   2
1       32         1       32   1
2       25

Hive QL – Group By

Map emits the composite key <pageid, age> with value 1 for every input row:

machine 1 rows (1, 25), (2, 25):  <1,25> -> 1   <2,25> -> 1
machine 2 rows (1, 32), (2, 25):  <1,32> -> 1   <2,25> -> 1

Shuffle and sort route equal keys to the same reducer:

reducer 1: <1,25> -> 1   <1,32> -> 1
reducer 2: <2,25> -> 1   <2,25> -> 1

Reduce sums the values for each key:

pageid  age  count        pageid  age  count
1       25   1            2       25   2
1       32   1

Hive QL – Group By in Map Reduce
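
Shuffle traffic can be cut further by pre-aggregating inside the mappers. A hedged sketch (hive.map.aggr is a real Hive setting; its default has varied across versions):

  SET hive.map.aggr = true;   -- hash-based partial aggregation in the map phase
  SELECT pageid, age, count(1)
  FROM pv_users
  GROUP BY pageid, age;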

SELECT pageid, COUNT(DISTINCT userid)
FROM page_view
GROUP BY pageid;

page_view:                      result:
pageid  userid  time            pageid  count_distinct_userid
1       111     9:08:01         1       2
2       111     9:08:13         2       1
1       222     9:08:14
2       111     9:08:20

Hive QL – Group By with Distinct

Map emits the pair <pageid, userid> as the shuffle key, so duplicate pairs meet at the same reducer:

machine 1 rows (1, 111), (2, 111):  <1,111>  <2,111>
machine 2 rows (1, 222), (2, 111):  <1,222>  <2,111>

Shuffle and sort group equal <pageid, userid> keys together, collapsing duplicates; each reducer counts the distinct userids it sees per pageid:

reducer 1: pageid 1 -> count 1        reducer 2: pageid 1 -> count 1
                                                 pageid 2 -> count 1

which combine into the final result: pageid 1 has 2 distinct users, pageid 2 has 1.

Hive QL – Group By with Distinct in Map Reduce

FROM pv_users
INSERT INTO TABLE pv_gender_sum
  SELECT pv_users.gender, COUNT(DISTINCT pv_users.userid)
  GROUP BY pv_users.gender
INSERT OVERWRITE DIRECTORY '/user/facebook/tmp/pv_age_sum.dir'
  SELECT pv_users.age, COUNT(DISTINCT pv_users.userid)
  GROUP BY pv_users.age
INSERT OVERWRITE LOCAL DIRECTORY '/home/me/pv_age_sum.dir'
  ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\013'
  SELECT pv_users.age, COUNT(DISTINCT pv_users.userid)
  GROUP BY pv_users.age;

Inserts into Files, Tables and Local Files

Thank You
