Data Warehousing on Hadoop: HIVE

Uploaded by srinath-reddy on 12-May-2015

TRANSCRIPT

Page 1: Hive

Data Warehousing on Hadoop

HIVE

Page 2: Hive

Hadoop is great for large-data processing! But writing Java programs for everything is verbose and slow, and analysts don't want to (or can't) write Java. Solution: develop higher-level data-processing languages. Hive: HQL is like SQL. Pig: Pig Latin is a bit like Perl.

Need for High-Level Languages

Page 3: Hive

Problem: data, data and more data. 200 GB per day in March 2008; 2+ TB (compressed) of raw data per day today.

The Hadoop experiment: availability and scalability much superior to commercial DBs, but efficiency not that great and more hardware required. Partial availability, resilience, and scale matter more than ACID.

Problem: programmability and metadata. Map-reduce is hard to program (users know SQL/bash/Python), and data needs to be published in well-known schemas.

Why Hive?

Page 4: Hive

HIVE: Components

Page 5: Hive

Shell: allows interactive queries
Driver: session handles, fetch, execute
Compiler: parse, plan, optimize
Execution engine: DAG of stages (MR, HDFS, metadata)
Metastore: schema, location in HDFS, SerDe

HIVE: Components

Page 6: Hive

Tables: typed columns (int, float, string, boolean); also list and map (for JSON-like data)
Partitions: for example, range-partition tables by date. Command: PARTITIONED BY
Buckets: hash partitions within ranges (useful for sampling, join optimization). Command: CLUSTERED BY

Data Model
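To make bucketing concrete, here is a minimal Python sketch of how CLUSTERED BY assigns rows to a fixed number of buckets. Hive uses its own hash function; `zlib.crc32` and the sample key values below are stand-ins for illustration only.

```python
# Sketch: rows with the same clustering key always hash to the same bucket,
# which is what makes bucketed sampling and bucket map joins possible.
import zlib

NUM_BUCKETS = 4

def bucket_for(userid: int) -> int:
    """Deterministically map a clustering key to one of NUM_BUCKETS buckets."""
    return zlib.crc32(str(userid).encode()) % NUM_BUCKETS

rows = [111, 222, 111, 333]
buckets = {}
for uid in rows:
    buckets.setdefault(bucket_for(uid), []).append(uid)

# Both occurrences of key 111 land in the same bucket.
assert bucket_for(111) == bucket_for(111)
```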

Page 7: Hive

Database: namespace containing a set of tables
Holds table definitions (column types, physical layout)
Holds partitioning information
Can be stored in Derby, MySQL, and many other relational databases

Metastore

Page 8: Hive

Warehouse directory in HDFS, e.g., /user/hive/warehouse
Tables stored in subdirectories of the warehouse; partitions form subdirectories of tables
Actual data stored in flat files: control-character-delimited text, or SequenceFiles
With a custom SerDe, can use an arbitrary format

Physical Layout
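The directory nesting above can be sketched in a few lines of Python; the table and partition names reuse the `sample` example from the DDL slides.

```python
# Sketch of the warehouse -> table -> partition path layout: each partition
# of a table is a subdirectory named <partition_col>=<value>.
WAREHOUSE = "/user/hive/warehouse"

def partition_path(table: str, partition_col: str, value: str) -> str:
    """Build the HDFS directory that holds one partition's data files."""
    return f"{WAREHOUSE}/{table}/{partition_col}={value}"

print(partition_path("sample", "ds", "2012-02-24"))
# -> /user/hive/warehouse/sample/ds=2012-02-24
```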

Page 9: Hive

[Architecture diagram: a management web UI and the Hive CLI (DDL, queries, browsing) sit on top of the system; HiveQL passes through the Parser and Planner to the Execution engine, which runs Map Reduce jobs over HDFS; the MetaStore is reached via a Thrift API; pluggable SerDes (Thrift, Jute, JSON, ...) handle serialization.]

HIVE: Components

Page 10: Hive

CREATE TABLE sample (foo INT, bar STRING) PARTITIONED BY (ds STRING);
SHOW TABLES '.*s';
DESCRIBE sample;
ALTER TABLE sample ADD COLUMNS (new_col INT);
DROP TABLE sample;

Examples – DDL Operations

Page 11: Hive

LOAD DATA LOCAL INPATH './sample.txt' OVERWRITE INTO TABLE sample PARTITION (ds='2012-02-24');
LOAD DATA INPATH '/user/falvariz/hive/sample.txt' OVERWRITE INTO TABLE sample PARTITION (ds='2012-02-24');

Examples – DML Operations

Page 12: Hive

FROM (
  FROM pv_users
  SELECT TRANSFORM(pv_users.userid, pv_users.date)
  USING 'map_script' AS (dt, uid)
  CLUSTER BY dt
) map
INSERT INTO TABLE pv_users_reduced
SELECT TRANSFORM(map.dt, map.uid)
USING 'reduce_script' AS (date, count);

Running Custom Map/Reduce Scripts
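The contents of 'map_script' are not shown in the slides; the following is a hypothetical Python sketch of what such a script could look like. TRANSFORM streams rows to the script as tab-separated lines on stdin and reads tab-separated lines back on stdout; the column roles (userid, date in; dt, uid out) are taken from the query above.

```python
# Hypothetical 'map_script': reorder (userid, date) into (dt, uid),
# one tab-separated line per row, as the AS (dt, uid) clause expects.
import sys

def map_line(line: str) -> str:
    """Turn a 'userid<TAB>date' input line into a 'dt<TAB>uid' output line."""
    userid, date = line.rstrip("\n").split("\t")
    return f"{date}\t{userid}"

if __name__ == "__main__":
    for line in sys.stdin:
        print(map_line(line))
```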

Page 13: Hive

(Simplified) Map Reduce Review

Machine 1 holds <k1, v1>, <k2, v2>, <k3, v3>; Machine 2 holds <k4, v4>, <k5, v5>, <k6, v6>.

Local Map: Machine 1 emits <nk1, nv1>, <nk2, nv2>, <nk3, nv3>; Machine 2 emits <nk2, nv4>, <nk2, nv5>, <nk1, nv6>.

Global Shuffle: pairs with the same new key move to the same machine: <nk2, nv4>, <nk2, nv5>, <nk2, nv2> on one; <nk1, nv1>, <nk3, nv3>, <nk1, nv6> on the other.

Local Sort: each machine sorts its pairs by key: <nk2, nv4>, <nk2, nv5>, <nk2, nv2>; and <nk1, nv1>, <nk1, nv6>, <nk3, nv3>.

Local Reduce: each machine reduces runs of equal keys (here, counting values): <nk2, 3>; and <nk1, 2>, <nk3, 1>.
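The map/shuffle/sort/reduce flow above can be simulated in a few lines of Python. This is an illustration only, not Hadoop code; the nk/nv names mirror the slide, and the reduce step counts values per key as in the example.

```python
# Runnable sketch of map -> shuffle -> sort -> reduce, counting values per key.
from collections import defaultdict

# Local map: each machine has already turned its <k, v> pairs into <nk, nv>.
machine1 = [("nk1", "nv1"), ("nk2", "nv2"), ("nk3", "nv3")]
machine2 = [("nk2", "nv4"), ("nk2", "nv5"), ("nk1", "nv6")]

# Global shuffle: route every pair by its key, so all pairs with the
# same nk end up on the same (simulated) machine.
shuffled = defaultdict(list)
for nk, nv in machine1 + machine2:
    shuffled[nk].append(nv)

# Local sort + reduce: each run of equal keys collapses to a count.
counts = {nk: len(nvs) for nk, nvs in shuffled.items()}
print(counts)  # -> {'nk1': 2, 'nk2': 3, 'nk3': 1}
```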

Page 14: Hive

SQL:
INSERT INTO TABLE pv_users
SELECT pv.pageid, u.age
FROM page_view pv JOIN user u ON (pv.userid = u.userid);

page_view:
pageid  userid  time
1       111     9:08:01
2       111     9:08:13
1       222     9:08:14

user:
userid  age  gender
111     25   female
222     32   male

page_view X user = pv_users:
pageid  age
1       25
2       25
1       32

Hive QL – Join

Page 15: Hive

page_view:
pageid  userid  time
1       111     9:08:01
2       111     9:08:13
1       222     9:08:14

user:
userid  age  gender
111     25   female
222     32   male

Map: each row is emitted under its join key, tagged with the table it came from (tag 1 = page_view with its pageid, tag 2 = user with its age).

page_view maps to:
key  value
111  <1,1>
111  <1,2>
222  <1,1>

user maps to:
key  value
111  <2,25>
222  <2,32>

Shuffle/Sort groups the tagged values by key:
key  value
111  <1,1>
111  <1,2>
111  <2,25>

key  value
222  <1,1>
222  <2,32>

Reduce crosses the page_view values with the user values for each key:
pageid  age
1       25
2       25

pageid  age
1       32

Hive QL – Join in Map Reduce
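The reduce-side join above can be sketched as runnable Python. This is an illustrative simulation, not Hive's implementation; it reproduces the pv_users result from the slide's data.

```python
# Runnable sketch of a reduce-side join: map tags each row with
# (table_tag, payload), shuffle groups by userid, reduce crosses the
# page_view payloads with the user payloads per key.
from collections import defaultdict

page_view = [(1, 111), (2, 111), (1, 222)]   # (pageid, userid)
user = [(111, 25), (222, 32)]                # (userid, age)

# Map: emit <userid, (tag, payload)>; tag 1 = page_view, tag 2 = user.
mapped = [(uid, (1, pageid)) for pageid, uid in page_view]
mapped += [(uid, (2, age)) for uid, age in user]

# Shuffle/sort: group all tagged values by join key.
groups = defaultdict(list)
for key, value in mapped:
    groups[key].append(value)

# Reduce: join the two tables' payloads within each key group.
pv_users = []
for key, values in groups.items():
    pageids = [p for tag, p in values if tag == 1]
    ages = [a for tag, a in values if tag == 2]
    pv_users += [(pageid, age) for pageid in pageids for age in ages]

print(sorted(pv_users))  # -> [(1, 25), (1, 32), (2, 25)]
```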

Page 16: Hive

Outer Joins:
INSERT INTO TABLE pv_users
SELECT pv.*, u.gender, u.age
FROM page_view pv FULL OUTER JOIN user u ON (pv.userid = u.userid)
WHERE pv.date = '2008-03-03';

Joins

Page 17: Hive

Only equality joins with conjunctions are supported.

Future work:
Pruning of values sent from map to reduce on the basis of projections
Making the Cartesian product more memory efficient
Map-side joins: hash joins if one of the tables is very small; exploiting pre-sorted data with a map-side merge join

Join To Map Reduce

Page 18: Hive

SQL:
FROM (a JOIN b ON a.key = b.key) JOIN c ON a.key = c.key SELECT …

A:          B:          C:
key  av     key  bv     key  cv
1    111    1    222    1    333

Map Reduce (A join B):
AB:
key  av   bv
1    111  222

Map Reduce (AB join C):
ABC:
key  av   bv   cv
1    111  222  333

Because both joins use the same key a.key, Hive can merge the two sequential Map Reduce jobs into one.

Hive Optimizations – Merge Sequential Map Reduce Jobs

Page 19: Hive

SELECT pageid, age, count(1)
FROM pv_users
GROUP BY pageid, age;

pv_users:
pageid  age
1       25
2       25
1       32
2       25

Result:
pageid  age  count
1       25   1
2       25   2
1       32   1

Hive QL – Group By

Page 20: Hive

pv_users is split across two mappers:
pageid  age        pageid  age
1       25         1       32
2       25         2       25

Map emits <(pageid, age), 1> pairs:
key     value      key     value
<1,25>  1          <1,32>  1
<2,25>  1          <2,25>  1

Shuffle/Sort routes equal keys to the same reducer:
key     value      key     value
<1,25>  1          <2,25>  1
<1,32>  1          <2,25>  1

Reduce sums the values per key:
pageid  age  count      pageid  age  count
1       25   1          2       25   2
1       32   1

Hive QL – Group By in Map Reduce
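The same group-by flow, sketched as runnable Python. This is an illustration of the mechanics, not Hive code; it reproduces the counts from the slide's pv_users data.

```python
# Runnable sketch of GROUP BY pageid, age: map emits <(pageid, age), 1>,
# shuffle groups equal keys together, reduce sums the 1s per key.
from collections import defaultdict

pv_users = [(1, 25), (2, 25), (1, 32), (2, 25)]   # (pageid, age)

# Map: one <key, 1> pair per row.
mapped = [((pageid, age), 1) for pageid, age in pv_users]

# Shuffle/sort + reduce: sum the counts for each (pageid, age) key.
counts = defaultdict(int)
for key, one in mapped:
    counts[key] += one

print(dict(counts))  # -> {(1, 25): 1, (2, 25): 2, (1, 32): 1}
```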

Page 21: Hive

SELECT pageid, COUNT(DISTINCT userid)
FROM page_view
GROUP BY pageid;

page_view:
pageid  userid  time
1       111     9:08:01
2       111     9:08:13
1       222     9:08:14
2       111     9:08:20

Result:
pageid  count_distinct_userid
1       2
2       1

Hive QL – Group By with Distinct

Page 22: Hive

page_view is split across two mappers:
pageid  userid  time       pageid  userid  time
1       111     9:08:01    1       222     9:08:14
2       111     9:08:13    2       111     9:08:20

Map emits the whole <pageid, userid> pair as the shuffle key, so duplicates of the same pair meet at the same reducer:
key        key
<1,111>    <1,222>
<2,111>
<2,111>

Shuffle and Sort: Reducer 1 receives <1,111>, <2,111>, <2,111>; Reducer 2 receives <1,222>.

Reduce counts each distinct pair once:
pageid  count      pageid  count
1       1          1       1
2       1

The per-reducer partial counts sum to the final result (pageid 1 → 2, pageid 2 → 1).

Hive QL – Group By with Distinct in Map Reduce
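The distinct-count flow above can be sketched as runnable Python. An illustrative simulation only: shuffling on the full (pageid, userid) pair collapses duplicates, and the reduce step then counts pairs per page.

```python
# Runnable sketch of COUNT(DISTINCT userid) per pageid.
from collections import defaultdict

page_view = [(1, 111), (2, 111), (1, 222), (2, 111)]   # (pageid, userid)

# Map + shuffle: the key is the whole (pageid, userid) pair, so the
# duplicate (2, 111) rows land on the same reducer and count only once.
distinct_pairs = set(page_view)

# Reduce: count the distinct pairs for each pageid.
counts = defaultdict(int)
for pageid, userid in distinct_pairs:
    counts[pageid] += 1

print(dict(counts))  # key order may vary; the counts are {1: 2, 2: 1}
```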

Page 23: Hive

FROM pv_users
INSERT INTO TABLE pv_gender_sum
  SELECT pv_users.gender, COUNT(DISTINCT pv_users.userid)
  GROUP BY pv_users.gender
INSERT OVERWRITE DIRECTORY '/user/facebook/tmp/pv_age_sum.dir'
  SELECT pv_users.age, COUNT(DISTINCT pv_users.userid)
  GROUP BY pv_users.age
INSERT OVERWRITE LOCAL DIRECTORY '/home/me/pv_age_sum.dir'
  FIELDS TERMINATED BY ',' LINES TERMINATED BY '\013'
  SELECT pv_users.age, COUNT(DISTINCT pv_users.userid)
  GROUP BY pv_users.age;

Inserts into Files, Tables and Local Files

Page 24: Hive

Thank You