Hive
Data Warehousing on Hadoop
Hadoop is great for large-data processing! But writing Java programs for everything is verbose and slow, and analysts don't want to (or can't) write Java. Solution: develop higher-level data processing languages.
Hive: HQL is like SQL
Pig: Pig Latin is a bit like Perl

Need for High-Level Languages
Problem: data, data, and more data
200 GB per day in March 2008; 2+ TB of (compressed) raw data per day today

The Hadoop experiment:
Availability and scalability much superior to commercial DBs
Efficiency not that great; required more hardware
Partial availability/resilience/scale more important than ACID

Problem: programmability and metadata
MapReduce is hard to program (users know SQL/bash/Python)
Need to publish data in well-known schemas

Why Hive?
HIVE: Components
Shell: allows interactive queries
Driver: session handles, fetch, execute
Compiler: parse, plan, optimize
Execution engine: DAG of stages (MR, HDFS, metadata)
Metastore: schema, location in HDFS, SerDe
Tables: typed columns (int, float, string, boolean); also list and map (for JSON-like data)
Partitions: for example, range-partition tables by date (command: PARTITIONED BY)
Buckets: hash partitions within ranges, useful for sampling and join optimization (command: CLUSTERED BY)

Data Model
Database: namespace containing a set of tables
Holds table definitions (column types, physical layout)
Holds partitioning information
Can be stored in Derby, MySQL, and many other relational databases

Metastore
Warehouse directory in HDFS, e.g., /user/hive/warehouse
Tables stored in subdirectories of the warehouse
Partitions form subdirectories of tables
Actual data stored in flat files: control character-delimited text, or SequenceFiles
With a custom SerDe, can use an arbitrary format

Physical Layout
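The layout above can be sketched as path construction. This is a minimal illustration, assuming a table named "sample" with a partition column "ds" (both are hypothetical names, matching the later DDL examples, not a live cluster):

```python
import posixpath

# Hive's warehouse root in HDFS (the default path cited above).
WAREHOUSE = "/user/hive/warehouse"

def table_dir(table):
    # Tables live in subdirectories of the warehouse directory.
    return posixpath.join(WAREHOUSE, table)

def partition_dir(table, **partition_values):
    # Partitions form "column=value" subdirectories of the table directory.
    spec = [f"{k}={v}" for k, v in partition_values.items()]
    return posixpath.join(table_dir(table), *spec)

print(table_dir("sample"))                       # /user/hive/warehouse/sample
print(partition_dir("sample", ds="2012-02-24"))  # /user/hive/warehouse/sample/ds=2012-02-24
```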
[Architecture diagram: the Hive CLI (DDL, queries, browsing), the Management Web UI, and the Thrift API submit HiveQL to the Parser and Planner; the Execution engine runs plans as Map Reduce jobs over HDFS; the MetaStore and the SerDe layer (Thrift, Jute, JSON, ...) describe how the data is stored.]

HIVE: Components
CREATE TABLE sample (foo INT, bar STRING) PARTITIONED BY (ds STRING);
SHOW TABLES '.*s';
DESCRIBE sample;
ALTER TABLE sample ADD COLUMNS (new_col INT);
DROP TABLE sample;

Examples – DDL Operations
LOAD DATA LOCAL INPATH './sample.txt' OVERWRITE INTO TABLE sample PARTITION (ds='2012-02-24');
LOAD DATA INPATH '/user/falvariz/hive/sample.txt' OVERWRITE INTO TABLE sample PARTITION (ds='2012-02-24');

Examples – DML Operations
SELECT * FROM (
  FROM pv_users
  SELECT TRANSFORM(pv_users.userid, pv_users.date)
  USING 'map_script' AS (dt, uid)
  CLUSTER BY (dt)) map
INSERT INTO TABLE pv_users_reduced
  SELECT TRANSFORM(map.dt, map.uid)
  USING 'reduce_script' AS (date, count);

Running Custom Map/Reduce Scripts
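Hive streams rows to such scripts as tab-delimited lines on stdin and reads their stdout the same way. A hedged sketch of what 'map_script' and 'reduce_script' from the query might do (the count-per-date logic is an illustrative assumption; the slide does not show the script bodies):

```python
from itertools import groupby

def map_rows(lines):
    # TRANSFORM(userid, date) -> (dt, uid): swap the two tab-delimited columns.
    for line in lines:
        userid, date = line.rstrip("\n").split("\t")
        yield f"{date}\t{userid}"

def reduce_rows(lines):
    # CLUSTER BY (dt) guarantees input arrives grouped by dt,
    # so a single pass with groupby is enough to count rows per date.
    pairs = (line.rstrip("\n").split("\t") for line in lines)
    for dt, group in groupby(pairs, key=lambda p: p[0]):
        yield f"{dt}\t{sum(1 for _ in group)}"

print(list(reduce_rows(sorted(map_rows(["111\t2008-03-03", "222\t2008-03-03"])))))
```

In a real deployment each function would read `sys.stdin` and print to stdout as a standalone script.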
[Diagram: two machines hold input pairs (<k1,v1>, <k2,v2>, <k3,v3> on one; <k4,v4>, <k5,v5>, <k6,v6> on the other). A local map emits new pairs <nk1,nv1> ... <nk1,nv6>; a global shuffle routes all pairs with the same key to one machine; a local sort groups them; a local reduce aggregates each group, e.g., <nk2, 3>, <nk1, 2>, <nk3, 1>.]

(Simplified) Map Reduce Review
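The map, shuffle, sort, and reduce phases above can be simulated in a few lines, using the same keys (nk1..nk3) and counting values per key as a stand-in for a generic reduce:

```python
from collections import defaultdict

# Output of the local map phase: (key, value) pairs from both machines.
mapped = [("nk1", "nv1"), ("nk2", "nv2"), ("nk3", "nv3"),
          ("nk2", "nv4"), ("nk2", "nv5"), ("nk1", "nv6")]

# Global shuffle: route every pair with the same key to the same bucket.
buckets = defaultdict(list)
for key, value in mapped:
    buckets[key].append(value)

# Local sort + reduce: process keys in order, aggregating each group.
reduced = {key: len(values) for key, values in sorted(buckets.items())}
print(reduced)  # {'nk1': 2, 'nk2': 3, 'nk3': 1}
```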
SQL:
INSERT INTO TABLE pv_users
SELECT pv.pageid, u.age
FROM page_view pv JOIN user u ON (pv.userid = u.userid);

page_view:
pageid | userid | time
1      | 111    | 9:08:01
2      | 111    | 9:08:13
1      | 222    | 9:08:14

user:
userid | age | gender
111    | 25  | female
222    | 32  | male

page_view X user = pv_users:
pageid | age
1      | 25
2      | 25
1      | 32

Hive QL – Join
Map output, keyed by userid and tagged by source table (<1, ...> = page_view, <2, ...> = user):

from page_view:
key | value
111 | <1,1>
111 | <1,2>
222 | <1,1>

from user:
key | value
111 | <2,25>
222 | <2,32>

Shuffle and sort bring all values for one key to one reducer:
key 111: <1,1>, <1,2>, <2,25>
key 222: <1,1>, <2,32>

Reduce pairs each page_view value with each user value per key:
pageid | age
1      | 25
2      | 25
1      | 32

Hive QL – Join in Map Reduce
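The reduce-side join above can be sketched in Python. The data matches the slide's example; tag 1 marks page_view rows and tag 2 marks user rows:

```python
from collections import defaultdict

page_view = [(1, 111), (2, 111), (1, 222)]   # (pageid, userid)
user = [(111, 25), (222, 32)]                # (userid, age)

# Map phase: emit (key=userid, value=(table_tag, payload)).
mapped = [(uid, (1, pageid)) for pageid, uid in page_view]
mapped += [(uid, (2, age)) for uid, age in user]

# Shuffle: group all values for one userid on one "reducer".
groups = defaultdict(list)
for key, value in mapped:
    groups[key].append(value)

# Reduce: cross-product page_view payloads with user payloads per key.
pv_users = []
for key, values in sorted(groups.items()):
    pages = [p for tag, p in values if tag == 1]
    ages = [a for tag, a in values if tag == 2]
    pv_users += [(pageid, age) for pageid in pages for age in ages]

print(sorted(pv_users))  # [(1, 25), (1, 32), (2, 25)]
```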
Outer Joins:
INSERT INTO TABLE pv_users
SELECT pv.*, u.gender, u.age
FROM page_view pv FULL OUTER JOIN user u ON (pv.userid = u.id)
WHERE pv.date = '2008-03-03';
Joins
Only equality joins with conjunctions are supported.
Future work:
Prune values sent from map to reduce on the basis of projections
Make the Cartesian product more memory-efficient
Map-side joins: hash joins if one of the tables is very small
Exploit pre-sorted data by doing a map-side merge join
Join To Map Reduce
SQL:
FROM (a JOIN b ON a.key = b.key) JOIN c ON a.key = c.key
SELECT …

A: key | av      B: key | bv      C: key | cv
   1   | 111        1   | 222        1   | 333

First Map Reduce job: A join B → AB (key | av | bv: 1 | 111 | 222)
Second Map Reduce job: AB join C → ABC (key | av | bv | cv: 1 | 111 | 222 | 333)

Because both joins use the same key (a.key), Hive can merge the sequential jobs into a single Map Reduce job.

Hive Optimizations – Merge Sequential Map Reduce Jobs
SELECT pageid, age, count(1)
FROM pv_users
GROUP BY pageid, age;

pv_users:
pageid | age
1      | 25
2      | 25
1      | 32
2      | 25

Result:
pageid | age | count
1      | 25  | 1
2      | 25  | 2
1      | 32  | 1

Hive QL – Group By
Map: each mapper emits key = <pageid, age>, value = 1 for its slice of pv_users:

Mapper 1 (rows (1,25), (2,25)): <1,25> → 1, <2,25> → 1
Mapper 2 (rows (1,32), (2,25)): <1,32> → 1, <2,25> → 1

Shuffle and sort route equal keys to the same reducer:

Reducer 1: <1,25> → 1; <1,32> → 1
Reducer 2: <2,25> → 1; <2,25> → 1

Reduce sums the values per key:

Reducer 1 output:
pageid | age | count
1      | 25  | 1
1      | 32  | 1

Reducer 2 output:
pageid | age | count
2      | 25  | 2

Hive QL – Group By in Map Reduce
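The same GROUP BY can be simulated directly, using the slide's pv_users rows. The shuffle and reduce phases collapse into one counting pass over the mapped pairs:

```python
from collections import Counter

pv_users = [(1, 25), (2, 25), (1, 32), (2, 25)]   # (pageid, age)

# Map: emit key=(pageid, age), value=1 for every row.
mapped = [((pageid, age), 1) for pageid, age in pv_users]

# Shuffle + reduce: sum the 1s for each key.
counts = Counter()
for key, one in mapped:
    counts[key] += one

result = sorted((pageid, age, n) for (pageid, age), n in counts.items())
print(result)  # [(1, 25, 1), (1, 32, 1), (2, 25, 2)]
```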
SELECT pageid, COUNT(DISTINCT userid)
FROM page_view
GROUP BY pageid;

page_view:
pageid | userid | time
1      | 111    | 9:08:01
2      | 111    | 9:08:13
1      | 222    | 9:08:14
2      | 111    | 9:08:20

Result:
pageid | count_distinct_userid
1      | 2
2      | 1

Hive QL – Group By with Distinct
Map: each mapper emits the pair <pageid, userid> as the key:

Mapper 1 (rows at 9:08:01 and 9:08:13): <1,111>, <2,111>
Mapper 2 (rows at 9:08:14 and 9:08:20): <1,222>, <2,111>

Shuffle and sort on <pageid, userid> bring duplicate pairs together (e.g., the two <2,111> pairs), so the reduce phase counts each distinct userid only once per pageid:

pageid | count
1      | 2
2      | 1

Hive QL – Group By with Distinct in Map Reduce
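A minimal simulation of the distinct count, again with the slide's page_view rows. Adding userids to a per-pageid set plays the role of the shuffle deduplicating <pageid, userid> pairs:

```python
from collections import defaultdict

page_view = [(1, 111), (2, 111), (1, 222), (2, 111)]   # (pageid, userid)

# Shuffle on the pair (pageid, userid): duplicates collapse into one entry.
distinct = defaultdict(set)
for pageid, userid in page_view:
    distinct[pageid].add(userid)

# Reduce: count the unique userids per pageid.
result = {pageid: len(users) for pageid, users in distinct.items()}
print(result)  # {1: 2, 2: 1}
```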
FROM pv_users
INSERT INTO TABLE pv_gender_sum
  SELECT pv_users.gender, count_distinct(pv_users.userid)
  GROUP BY pv_users.gender
INSERT INTO DIRECTORY '/user/facebook/tmp/pv_age_sum.dir'
  SELECT pv_users.age, count_distinct(pv_users.userid)
  GROUP BY pv_users.age
INSERT INTO LOCAL DIRECTORY '/home/me/pv_age_sum.dir'
  FIELDS TERMINATED BY ',' LINES TERMINATED BY \013
  SELECT pv_users.age, count_distinct(pv_users.userid)
  GROUP BY pv_users.age;

Inserts into Files, Tables and Local Files
Thank You