introduction to apache hive - ut · –slow start-up and clean-up of mapreduce jobs •it takes...

Introduction to Apache Hive

Pelle Jakovits

14 Oct, 2015, Tartu

Outline

• What is Hive

• Why Hive over MapReduce or Pig?– Advantages and disadvantages

• Running Hive

• HiveQL language

• User Defined Functions

• Hive vs Pig

• Other projects in Hadoop Ecosystem

2

Hive

• Initially developed by Facebook

• Data warehousing on top of Hadoop.

• Designed for

– easy data summarization

– ad-hoc querying

– analysis of large volumes of data.

• HiveQL statements are automatically translated into MapReduce jobs

3

Advantages of Hive

• Higher level query language

– Simplifies working with large amounts of data

• Lower learning curve than Pig or MapReduce

– HiveQL is much closer to SQL than Pig

– Less trial-and-error than Pig

4

Disadvantages

• Updating data is complicated

– Mainly because of using HDFS

– Can add records

– Can overwrite partitions

• No real time access to data

– Use other means like Hbase or Impala

• High latency

5

Running Hive

• We will look at it more closely in the practicesession, but you can run hive from– Hive web interface

– Hive shell• $HIVE_HOME/bin/hive for interactive shell

• Or you can run queries directly: $HIVE_HOME/bin/hive -e ‘select a.col from tab1 a’

– JDBC – Java Database Connectivity• "jdbc:hive://host:port/dbname„

– Also possible to use hive directly in Python, C, C++, PHP

6

HiveQL

• Hive query language provides the basic SQL like operations.

• Some of these operations are:

– CREATE/DROP/ALTER TABLE, SHOW, DESCRIBE

– FILTER, SELECT, JOIN, INSERT, UNION, UPDATE(>= 0.14), DELETE (>= 0.14)

• Base HiveQL language is extended withUser Defined Functions (UDF)

7

https://cwiki.apache.org/confluence/display/Hive/LanguageManual

Create Tables

CREATE TABLE page_view(viewTime INT, userid BIGINT,page_url STRING, referrer_url STRING,ip STRING)

ROW FORMAT DELIMITED FIELDS TERMINATED BY ’--’LINES TERMINATED BY ‘__’STORED AS SEQUENCEFILE;

8

Load Data

• There are multiple ways to load data into Hive tables.

• The user can create an external table that points to a specified location within HDFS.

• The user can copy a file into the specified location in HDFS and create a table pointing to this location with all the relevant row format information.

• Once this is done, the user can transform the data and insert them into any other Hive table orrun queries.

9

http://hadoop.apache.org/common/docs/current/hdfs_design.html

Describing data file

CREATE EXTERNAL TABLE page_view_stg(

viewTime INT,

userid BIGINT,

page_url STRING,

referrer_url STRING,

ip STRING,

country STRING)

ROW FORMAT DELIMITED FIELDS TERMINATED BY ’--’LINES TERMINATED BY ‘__’

STORED AS TEXTFILE LOCATION '/user/data/staging/page_view';

10

Loading from data file

FROM page_view_stg pvs

INSERT OVERWRITE TABLE page_viewPARTITION(dt='2008-06-08', country='US')

SELECT pvs.viewTime, pvs.userid, pvs.page_url, pvs.referrer_url, null, null, pvs.ip

WHERE pvs.country = 'US';

11

Directly loading ja Storing data

• Loading data directly:

LOAD DATA LOCAL INPATH ‘/tmp/pv_2008-06-08_us.txt’ INTO TABLE page_view;

• To write the output into a local disk, for example to load it into an excel spreadsheet later:

FROM pv_users

INSERT OVERWRITE LOCAL DIRECTORY '/user/pelle/pv_age_sum'

SELECT pv_users.age, count_distinct(pv_users.userid)

GROUP BY pv_users.age;

12

Data units

• Databases:

– Namespaces that separate tables and other data units from naming conflicts

• Tables:

– Homogeneous units of data which have the same schema

– Consists of specified columns accordingly to its schema

• Partitions:

– Each Table can have one or more partition Keys which determines how data is stored

– Partitions allow the user to efficiently identify the rows that satisfy a certain criteria

– It is the user's job to guarantee the relationship between partition name and data!

• Buckets (or Clusters):

– Data in each partition may in turn be divided into Buckets based on the value of a hash function of some column of the Table.

– For example the page_views table may be bucketed by userid to sample the data.

13

Partitions ja Buckets

CREATE TABLE order (

username STRING,

orderdate STRING,

amount DOUBLE,

tax DOUBLE,

) PARTITIONED BY (company STRING)

CLUSTERED BY (username) INTO 25 BUCKETS;

14

Types

• Types are associated with the columns in the tables. The following Primitive types are supported:– Integers

• TINYINT - 1 byte integer• SMALLINT - 2 byte integer• INT - 4 byte integer• BIGINT - 8 byte integer

– Boolean type• BOOLEAN

– Floating point numbers• FLOAT• DOUBLE

– String type• STRING

15

Built in functions

• Round(number), floor(number), ceil(number)

• Rand(seed)

• Concat(string), substr(string, length)

• Upper(string), lower(string), trim(string)

• Year(date), Month(date), Day (date)

• Count(*), sum(col), avg(col), max(col), min(col)

16

Display functions

• SHOW TABLES;

– Lists all tables

• SHOW PARTITIONS page_view;

– Lists partitions on a specific table

• DESCRIBE EXTENDED page_view;

– To show all metadata

– Mainly for debugging

17

INSERT

INSERT OVERWRITE TABLE xyz_com_page_views

SELECT page_views.*

FROM page_views

WHERE page_views.date >= '2008-03-01' AND page_views.date <= '2008-03-31' AND

page_views.referrer_url like '%xyz.com';

18

JOIN

• JOIN, {LEFT | RIGHT |FULL} OUTER JOIN, LEFT SEMI JOIN, CROSS JOIN

• Can join more than 2 tables at once• It is best to put the largest table on the rightmost side

of the join to get the best performance.• Only equality expressions allowed: expression =

expression

SELECT pv.*, u.gender, u.ageFROM user u JOIN page_view pvON (pv.userid = u.id) WHERE pv.date = '2008-03-03';

19

Sampling

• By number of rows:SELECT * FROM source_tableTABLESAMPLE (1m ROWS);

• By sampling rate;SELECT * FROM source_tableTABLESAMPLE(0.1 PERCENT) ;

• Forced random order sampling: select * FROM source_tabledistribute by rand() sort by rand() limit 1000;

20

Union

INSERT OVERWRITE TABLE actions_usersSELECT u.id, actions.dateFROM (

SELECT av.uid AS uidFROM action_video avWHERE av.date = '2008-06-03'

UNION ALL

SELECT ac.uid AS uidFROM action_comment acWHERE ac.date = '2008-06-03') actions JOIN users u ON(u.id = actions.uid);

21

Complex types

• Structs– the elements within the type can be accessed using the DOT (.)

notation. F– or example, for a column c of type STRUCT {a INT; b INT} the a field is

accessed by the expression c.a

• Maps (key-value tuples)– The elements are accessed using ['element name'] notation. – For example in a map M comprising of a mapping from 'group' -> gid

the gid value can be accessed using M['group']

• Arrays (indexable lists)– The elements in the array have to be in the same type. – Elements can be accessed using the [n] notation where n is an index

(zero-based) into the array. – For example for an array A having the elements ['a', 'b', 'c'], A[1]

retruns 'b'.

22

collect_list & sort_array

• SELECT city, sort_array(collect_list(balance)) as balances from bank_accounts GROUP BY city;

23

• Values inside field balances can be addressed as anarray: balances[0]

Complex Type operations

• size(array), array_contains(array, val),

• map_keys(map), map_values(map),

• greatest(array), least(array)

• explode(array)

• sentences(string text, string lang, string locale)

• ngrams(sentences, int gram, int num)

• histogram_numeric(field)

24

Hive WordCount

SELECT word, COUNT(*) FROM input

LATERAL VIEW explode(split(text, ' ')) lnputTable as word

GROUP BY word;

• A lateral view applies UDTF to each row of base table and then joins resulting output rows to the input rows to form a virtual table.

25

Java UDF

package com.example.hive.udf;

import org.apache.hadoop.hive.ql.exec.UDF;import org.apache.hadoop.io.Text;

public final class Lower extends UDF {public Text evaluate(final Text s) {

if (s == null) { return null; }return new Text(s.toString().toLowerCase());

}}

26

Using UDF

27

• create temporary function my_lower as 'com.example.hive.udf.Lower';

• hive> select my_lower(title), sum(freq) from titles group by my_lower(title);

Running custom mapreduce

FROM (FROM pv_usersMAP pv_users.userid, pv_users.dateUSING 'map_script.py'AS dt, uidCLUSTER BY dt) map_output

INSERT OVERWRITE TABLE pv_users_reducedREDUCE map_output.dt, map_output.uidUSING 'reduce_script.py'AS date, count;

28

Map Example

import sys

import datetime

for line in sys.stdin:

line = line.strip()

userid, unixtime = line.split('\t')

weekday = datetime.datetime.fromtimestamp(float(unixtime)).isoweekday()

print ','.join([userid, str(weekday)])

29

DEMO

30

Hive disadvantages

• Same disadvantages as MapReduce and Pig– Slow start-up and clean-up of MapReduce jobs

• It takes time for Hadoop to schedule MR jobs

– Not suitable for interactive OLAP Analytics• When results are expected in < 1 sec

• Designed for querying and not datatransformation– Limitations of the SQL language

– Tasks like co-grouping can get complicated

31

Pig vs Hive

Pig Hive

Purpose Data transformation Ad-Hoc querying

Language Something similiar to SQL SQL-like

Difficulty Medium (Trial-and-error) Low to medium

Schemas Yes (implicit) Yes (explicit)

Join (Distributed) Yes Yes

Shell Yes Yes

Streaming Yes Yes

Web interface Third-Party Yes

Partitions No Yes

UDF’s Yes Yes

32

Pig vs Hive

• SQL might not be the perfect language for expressing data transformation commands

• Pig

– Mainly for data transformations and processing

– Unstructured data

• Hive

– Mainly for warehousing and querying data

– Structured data

33

Big Picture

• Store large amounts of data to HDFS

• Process raw data: Pig

• Build schema using Hive

• Data querries: Hive

• Real time access

– access to data with Hbase

– real time queries with Impala

34

Is Hadoop enough?

• Why use Hadoop for large scale data processing?

– It is becoming a de facto standard in Big Data

– Collaboration among Top Companies instead ofvendor tool lock-in.

• Amazon, Apache, Facebook, Yahoo!, etc all contribute toopen source Hadoop

– There are tools from setting up Hadoop cluster inminutes and importing data from relational databasesto setting up workflows of MR, Pig and Hive.

35

More Hadoop projects

• Hbase

– Open-source distributed database ontop of HDFS

– Hbase tables only use a single key

– Tuned for real-time access to data

• Cloudera Impala™

– Simplified, real time queries over HDFS

– Bypass job schedulling, and remove everythingelse that makes MR slow.

36

Thats All

• This week`s practice session

– Processing data with Hive

– Similiar exercise as last two weeks, this time using Hive

• Next lecture: Spark

37

introduction to apache hive - ut · –slow start-up and clean-up of mapreduce jobs •it takes...

Documents