introduction to apache hive - ut · –slow start-up and clean-up of mapreduce jobs •it takes...
TRANSCRIPT
Introduction to Apache Hive
Pelle Jakovits
14 Oct, 2015, Tartu
Outline
• What is Hive
• Why Hive over MapReduce or Pig?– Advantages and disadvantages
• Running Hive
• HiveQL language
• User Defined Functions
• Hive vs Pig
• Other projects in Hadoop Ecosystem
2
Hive
• Initially developed by Facebook
• Data warehousing on top of Hadoop.
• Designed for
– easy data summarization
– ad-hoc querying
– analysis of large volumes of data.
• HiveQL statements are automatically translated into MapReduce jobs
3
Advantages of Hive
• Higher level query language
– Simplifies working with large amounts of data
• Lower learning curve than Pig or MapReduce
– HiveQL is much closer to SQL than Pig
– Less trial-and-error than Pig
4
Disadvantages
• Updating data is complicated
– Mainly because of using HDFS
– Can add records
– Can overwrite partitions
• No real time access to data
– Use other means like Hbase or Impala
• High latency
5
Running Hive
• We will look at it more closely in the practicesession, but you can run hive from– Hive web interface
– Hive shell• $HIVE_HOME/bin/hive for interactive shell
• Or you can run queries directly: $HIVE_HOME/bin/hive -e ‘select a.col from tab1 a’
– JDBC – Java Database Connectivity• "jdbc:hive://host:port/dbname„
– Also possible to use hive directly in Python, C, C++, PHP
6
HiveQL
• Hive query language provides the basic SQL like operations.
• Some of these operations are:
– CREATE/DROP/ALTER TABLE, SHOW, DESCRIBE
– FILTER, SELECT, JOIN, INSERT, UNION, UPDATE(>= 0.14), DELETE (>= 0.14)
• Base HiveQL language is extended withUser Defined Functions (UDF)
7
Create Tables
CREATE TABLE page_view(viewTime INT, userid BIGINT,page_url STRING, referrer_url STRING,ip STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ’--’LINES TERMINATED BY ‘__’STORED AS SEQUENCEFILE;
8
Load Data
• There are multiple ways to load data into Hive tables.
• The user can create an external table that points to a specified location within HDFS.
• The user can copy a file into the specified location in HDFS and create a table pointing to this location with all the relevant row format information.
• Once this is done, the user can transform the data and insert them into any other Hive table orrun queries.
9
Describing data file
CREATE EXTERNAL TABLE page_view_stg(
viewTime INT,
userid BIGINT,
page_url STRING,
referrer_url STRING,
ip STRING,
country STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ’--’LINES TERMINATED BY ‘__’
STORED AS TEXTFILE LOCATION '/user/data/staging/page_view';
10
Loading from data file
FROM page_view_stg pvs
INSERT OVERWRITE TABLE page_viewPARTITION(dt='2008-06-08', country='US')
SELECT pvs.viewTime, pvs.userid, pvs.page_url, pvs.referrer_url, null, null, pvs.ip
WHERE pvs.country = 'US';
11
Directly loading ja Storing data
• Loading data directly:
LOAD DATA LOCAL INPATH ‘/tmp/pv_2008-06-08_us.txt’ INTO TABLE page_view;
• To write the output into a local disk, for example to load it into an excel spreadsheet later:
FROM pv_users
INSERT OVERWRITE LOCAL DIRECTORY '/user/pelle/pv_age_sum'
SELECT pv_users.age, count_distinct(pv_users.userid)
GROUP BY pv_users.age;
12
Data units
• Databases:
– Namespaces that separate tables and other data units from naming conflicts
• Tables:
– Homogeneous units of data which have the same schema
– Consists of specified columns accordingly to its schema
• Partitions:
– Each Table can have one or more partition Keys which determines how data is stored
– Partitions allow the user to efficiently identify the rows that satisfy a certain criteria
– It is the user's job to guarantee the relationship between partition name and data!
• Buckets (or Clusters):
– Data in each partition may in turn be divided into Buckets based on the value of a hash function of some column of the Table.
– For example the page_views table may be bucketed by userid to sample the data.
13
Partitions ja Buckets
CREATE TABLE order (
username STRING,
orderdate STRING,
amount DOUBLE,
tax DOUBLE,
) PARTITIONED BY (company STRING)
CLUSTERED BY (username) INTO 25 BUCKETS;
14
Types
• Types are associated with the columns in the tables. The following Primitive types are supported:– Integers
• TINYINT - 1 byte integer• SMALLINT - 2 byte integer• INT - 4 byte integer• BIGINT - 8 byte integer
– Boolean type• BOOLEAN
– Floating point numbers• FLOAT• DOUBLE
– String type• STRING
15
Built in functions
• Round(number), floor(number), ceil(number)
• Rand(seed)
• Concat(string), substr(string, length)
• Upper(string), lower(string), trim(string)
• Year(date), Month(date), Day (date)
• Count(*), sum(col), avg(col), max(col), min(col)
16
Display functions
• SHOW TABLES;
– Lists all tables
• SHOW PARTITIONS page_view;
– Lists partitions on a specific table
• DESCRIBE EXTENDED page_view;
– To show all metadata
– Mainly for debugging
17
INSERT
INSERT OVERWRITE TABLE xyz_com_page_views
SELECT page_views.*
FROM page_views
WHERE page_views.date >= '2008-03-01' AND page_views.date <= '2008-03-31' AND
page_views.referrer_url like '%xyz.com';
18
JOIN
• JOIN, {LEFT | RIGHT |FULL} OUTER JOIN, LEFT SEMI JOIN, CROSS JOIN
• Can join more than 2 tables at once• It is best to put the largest table on the rightmost side
of the join to get the best performance.• Only equality expressions allowed: expression =
expression
SELECT pv.*, u.gender, u.ageFROM user u JOIN page_view pvON (pv.userid = u.id) WHERE pv.date = '2008-03-03';
19
Sampling
• By number of rows:SELECT * FROM source_tableTABLESAMPLE (1m ROWS);
• By sampling rate;SELECT * FROM source_tableTABLESAMPLE(0.1 PERCENT) ;
• Forced random order sampling: select * FROM source_tabledistribute by rand() sort by rand() limit 1000;
20
Union
INSERT OVERWRITE TABLE actions_usersSELECT u.id, actions.dateFROM (
SELECT av.uid AS uidFROM action_video avWHERE av.date = '2008-06-03'
UNION ALL
SELECT ac.uid AS uidFROM action_comment acWHERE ac.date = '2008-06-03') actions JOIN users u ON(u.id = actions.uid);
21
Complex types
• Structs– the elements within the type can be accessed using the DOT (.)
notation. F– or example, for a column c of type STRUCT {a INT; b INT} the a field is
accessed by the expression c.a
• Maps (key-value tuples)– The elements are accessed using ['element name'] notation. – For example in a map M comprising of a mapping from 'group' -> gid
the gid value can be accessed using M['group']
• Arrays (indexable lists)– The elements in the array have to be in the same type. – Elements can be accessed using the [n] notation where n is an index
(zero-based) into the array. – For example for an array A having the elements ['a', 'b', 'c'], A[1]
retruns 'b'.
22
collect_list & sort_array
• SELECT city, sort_array(collect_list(balance)) as balances from bank_accounts GROUP BY city;
23
• Values inside field balances can be addressed as anarray: balances[0]
Complex Type operations
• size(array), array_contains(array, val),
• map_keys(map), map_values(map),
• greatest(array), least(array)
• explode(array)
• sentences(string text, string lang, string locale)
• ngrams(sentences, int gram, int num)
• histogram_numeric(field)
24
Hive WordCount
SELECT word, COUNT(*) FROM input
LATERAL VIEW explode(split(text, ' ')) lnputTable as word
GROUP BY word;
• A lateral view applies UDTF to each row of base table and then joins resulting output rows to the input rows to form a virtual table.
25
Java UDF
package com.example.hive.udf;
import org.apache.hadoop.hive.ql.exec.UDF;import org.apache.hadoop.io.Text;
public final class Lower extends UDF {public Text evaluate(final Text s) {
if (s == null) { return null; }return new Text(s.toString().toLowerCase());
}}
26
Using UDF
27
• create temporary function my_lower as 'com.example.hive.udf.Lower';
• hive> select my_lower(title), sum(freq) from titles group by my_lower(title);
Running custom mapreduce
FROM (FROM pv_usersMAP pv_users.userid, pv_users.dateUSING 'map_script.py'AS dt, uidCLUSTER BY dt) map_output
INSERT OVERWRITE TABLE pv_users_reducedREDUCE map_output.dt, map_output.uidUSING 'reduce_script.py'AS date, count;
28
Map Example
import sys
import datetime
for line in sys.stdin:
line = line.strip()
userid, unixtime = line.split('\t')
weekday = datetime.datetime.fromtimestamp(float(unixtime)).isoweekday()
print ','.join([userid, str(weekday)])
29
DEMO
30
Hive disadvantages
• Same disadvantages as MapReduce and Pig– Slow start-up and clean-up of MapReduce jobs
• It takes time for Hadoop to schedule MR jobs
– Not suitable for interactive OLAP Analytics• When results are expected in < 1 sec
• Designed for querying and not datatransformation– Limitations of the SQL language
– Tasks like co-grouping can get complicated
31
Pig vs Hive
Pig Hive
Purpose Data transformation Ad-Hoc querying
Language Something similiar to SQL SQL-like
Difficulty Medium (Trial-and-error) Low to medium
Schemas Yes (implicit) Yes (explicit)
Join (Distributed) Yes Yes
Shell Yes Yes
Streaming Yes Yes
Web interface Third-Party Yes
Partitions No Yes
UDF’s Yes Yes
32
Pig vs Hive
• SQL might not be the perfect language for expressing data transformation commands
• Pig
– Mainly for data transformations and processing
– Unstructured data
• Hive
– Mainly for warehousing and querying data
– Structured data
33
Big Picture
• Store large amounts of data to HDFS
• Process raw data: Pig
• Build schema using Hive
• Data querries: Hive
• Real time access
– access to data with Hbase
– real time queries with Impala
34
Is Hadoop enough?
• Why use Hadoop for large scale data processing?
– It is becoming a de facto standard in Big Data
– Collaboration among Top Companies instead ofvendor tool lock-in.
• Amazon, Apache, Facebook, Yahoo!, etc all contribute toopen source Hadoop
– There are tools from setting up Hadoop cluster inminutes and importing data from relational databasesto setting up workflows of MR, Pig and Hive.
35
More Hadoop projects
• Hbase
– Open-source distributed database ontop of HDFS
– Hbase tables only use a single key
– Tuned for real-time access to data
• Cloudera Impala™
– Simplified, real time queries over HDFS
– Bypass job schedulling, and remove everythingelse that makes MR slow.
36
Thats All
• This week`s practice session
– Processing data with Hive
– Similiar exercise as last two weeks, this time using Hive
• Next lecture: Spark
37