

COSC 6397

Big Data Analytics

Hadoop MapReduce Infrastructure:

Pig, Hive, and Mahout

Edgar Gabriel

Spring 2017

Pig

• Pig is a platform for analyzing large data sets

– An abstraction on top of Hadoop

– Provides a high-level programming language designed for data processing

– Programs are converted into MapReduce jobs and executed on Hadoop clusters


Why use Pig?

• MapReduce requires programmers

– Must think in terms of map and reduce functions

– More than likely will require Java programming

• Pig provides a high-level language that can be used by analysts and scientists

– Does not require know-how in parallel programming

• Pig’s Features (see the example after this list)

– Join Datasets

– Sort Datasets

– Filter

– Data Types

– Group By

– User Defined Functions

Slide based on lecture http://www.coreservlets.com/hadoop-tutorial/
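As a taste of the filter and group-by features, a minimal sketch on a hypothetical bag of (name, age) tuples; the file name, fields and threshold are made up for illustration:

users  = LOAD '/input/users.txt' USING PigStorage(',') AS (name:chararray, age:int);
adults = FILTER users BY age >= 18;
byAge  = GROUP adults BY age;
counts = FOREACH byAge GENERATE group, COUNT(adults);
DUMP counts;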

Pig Components

• Pig Latin

– Command-based language

– Designed specifically for data transformation and flow expression

• Execution Environment

– The environment in which Pig Latin commands are executed

– Supports local and Hadoop execution modes

• Pig compiler converts Pig Latin to MapReduce

– Applies automatic optimizations, whereas manual MapReduce code relies on user-level optimizations

Slide based on lecture http://www.coreservlets.com/hadoop-tutorial/


Running Pig

• Script

– Execute commands in a file

– $ pig scriptFile.pig

• Grunt

– Interactive shell for executing Pig commands

– Started when a script file is NOT provided

– Can execute scripts from Grunt via the run or exec commands

• Embedded

– Execute Pig commands from Java using the PigServer class (see the sketch below)

– Can have programmatic access to Grunt via the PigRunner class

Slide based on lecture http://www.coreservlets.com/hadoop-tutorial/
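A minimal sketch of the embedded mode, assuming local execution and the classic PigServer API; the file name and query are illustrative only:

import java.util.Iterator;

import org.apache.pig.ExecType;
import org.apache.pig.PigServer;
import org.apache.pig.data.Tuple;

public class EmbeddedPigExample {
    public static void main(String[] args) throws Exception {
        // Run Pig in local mode; ExecType.MAPREDUCE would submit to a Hadoop cluster
        PigServer pig = new PigServer(ExecType.LOCAL);

        // Register Pig Latin statements; nothing is executed yet
        pig.registerQuery("records = LOAD 'a.txt' AS (letter:chararray, count:int);");

        // Asking for an iterator over an alias triggers execution (like DUMP)
        Iterator<Tuple> it = pig.openIterator("records");
        while (it.hasNext()) {
            System.out.println(it.next());
        }
    }
}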

Pig Latin concepts

• Building blocks

– Field – a piece of data

– Tuple – an ordered set of fields, represented with “(” and “)”: (10.4, 5, word, 4, field1)

– Bag – a collection of tuples, represented with “{” and “}”: { (10.4, 5, word, 4, field1), (this, 1, blah) }

• Some similarities to relational databases

– A bag corresponds to a table in the database

– A tuple corresponds to a row in a table

Slide based on lecture http://www.coreservlets.com/hadoop-tutorial/


Simple Pig Latin example

$ pig

grunt> cat /input/pig/a.txt

a 1

d 4

c 9

k 6

grunt> records = LOAD '/input/a.txt' as (letter:chararray, count:int);

grunt> dump records;

...

org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 50% complete

2012-07-14 17:36:22,040 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 100% complete

...

(a,1)

(d,4)

(c,9)

(k,6)

grunt>

Load grunt in default map-reduce mode

grunt supports file system commands

Load contents of text file into a bag called records

Display records on screen

Slide based on lecture http://www.coreservlets.com/hadoop-tutorial/

Simple Pig Latin example

• No action is taken until the DUMP or STORE commands are encountered

– Pig will parse, validate and analyze statements but not execute them

• STORE – saves results (typically to a file)

• DUMP – displays the results on the screen

– It doesn't make sense to print large results to the screen

– For information and debugging purposes you can print a small subset to the screen

grunt> records = LOAD '/input/excite-small.log'

AS (userId:chararray, timestamp:long, query:chararray);

grunt> toPrint = LIMIT records 5;

grunt> DUMP toPrint;

Slide based on lecture http://www.coreservlets.com/hadoop-tutorial/


Simple Pig Latin example

LOAD 'data' [USING function] [AS schema];

• data – name of the directory or file

– Must be in single quotes

• USING – specifies the load function to use

– By default uses PigStorage, which parses each line into fields using a delimiter

– The default delimiter is tab ('\t')

– The delimiter can be customized using regular expressions

• AS – assigns a schema to the incoming data

– Assigns names and types to fields (alias:type)

– (name:chararray, age:int, gpa:float)

Slide based on lecture http://www.coreservlets.com/hadoop-tutorial/

records = LOAD '/input/excite-small.log' USING PigStorage() AS (userId:chararray, timestamp:long, query:chararray);

• int – signed 32-bit integer, e.g. 10

• long – signed 64-bit integer, e.g. 10L or 10l

• float – 32-bit floating point, e.g. 10.5F or 10.5f

• double – 64-bit floating point, e.g. 10.5 or 10.5e2 or 10.5E2

• chararray – character array (string) in Unicode UTF-8, e.g. hello world

• bytearray – byte array (blob)

• tuple – an ordered set of fields, e.g. (T: tuple (f1:int, f2:int))

• bag – a collection of tuples, e.g. (B: bag {T: tuple(t1:int, t2:int)})

Slide based on lecture http://www.coreservlets.com/hadoop-tutorial/


Pig Latin Diagnostic Tools

• Display the structure of a bag (see the example below)

– grunt> DESCRIBE <bag_name>;

• Display the execution plan

– Produces various reports, e.g. the logical plan and the MapReduce plan

– grunt> EXPLAIN <bag_name>;

• Illustrate how the Pig engine transforms the data

– grunt> ILLUSTRATE <bag_name>;

Slide based on lecture http://www.coreservlets.com/hadoop-tutorial/
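For the records bag loaded from the excite log above, DESCRIBE prints the schema roughly as follows (the exact formatting varies across Pig versions):

grunt> DESCRIBE records;
records: {userId: chararray,timestamp: long,query: chararray}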

Joining Two Data Sets

• Join steps

– Load records into a bag from input #1

– Load records into a bag from input #2

– Join the two data sets (bags) by the provided join key

• The default join is an inner join

– Rows are joined where the keys match

– Rows that do not have matches are not included in the result

(Figure: two overlapping sets, Set 1 and Set 2; the inner join is their overlap)

Slide based on lecture http://www.coreservlets.com/hadoop-tutorial/


Simple join example

1. Load records into a bag from input #1

posts = load '/input/user-posts.txt' using PigStorage(',')
        as (user:chararray, post:chararray, date:long);

2. Load records into a bag from input #2

likes = load '/input/user-likes.txt' using PigStorage(',')
        as (user:chararray, likes:int, date:long);

3. Join the data sets: when a key is equal in both data sets, the rows are joined into a new single row; in this case when the user name is equal

userInfo = join posts by user, likes by user;
dump userInfo;

$ hdfs dfs -cat /input/user-posts.txt

user1,Funny Story,1343182026191

user2,Cool Deal,1343182133839

user4,Interesting Post,1343182154633

user5,Yet Another Blog,13431839394

$ hdfs dfs -cat /input/user-likes.txt

user1,12,1343182026191

user2,7,1343182139394

user3,0,1343182154633

user4,50,1343182147364

$ pig /code/InnerJoin.pig

(user1,Funny Story,1343182026191,user1,12,1343182026191)

(user2,Cool Deal,1343182133839,user2,7,1343182139394)

(user4,Interesting Post,1343182154633,user4,50,1343182147364)

Slide based on lecture http://www.coreservlets.com/hadoop-tutorial/


Outer Join

• Records which do not join with the ‘other’ record set are still included in the result

• Left Outer (see the example below)

– Records from the first data set are included whether they have a match or not; fields from the unmatched (second) bag are set to null

• Right Outer

– The opposite of the left outer join: records from the second data set are included no matter what; fields from the unmatched (first) bag are set to null

• Full Outer

– Records from both sides are included; for unmatched records the fields from the ‘other’ bag are set to null

Slide based on lecture http://www.coreservlets.com/hadoop-tutorial/
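Using the posts and likes bags from the previous example, a left outer join would keep user5 (which has no entry in user-likes.txt), with the like fields set to null:

userInfo = JOIN posts BY user LEFT OUTER, likes BY user;
dump userInfo;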

Pig Use cases

• Loading large amounts of data

– Pig is built on top of Hadoop and therefore scales with the number of servers

– An alternative to manual bulk loading, e.g. into HBase

• Using different data sources, e.g.

– collect web server logs,

– use external programs to fetch geo-location data for the users' IP addresses,

– join the new set of geo-located web traffic with stored click maps

• Support for data sampling (see the example below)
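For instance, the SAMPLE operator draws a random subset of a bag; a minimal sketch using the posts bag from the join example (the 10% fraction is arbitrary):

sampled = SAMPLE posts 0.1;
dump sampled;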


Hive

• Data warehousing solution built on top of Hadoop

• Provides an SQL-like query language named HiveQL

– Minimal learning curve for people with SQL expertise

– Data analysts are the target audience

• Early Hive development work started at Facebook in 2007

• Translates HiveQL statements into a set of MapReduce jobs which are then executed on a Hadoop cluster

Slide based on lecture http://www.coreservlets.com/hadoop-tutorial/

Hive

• Ability to bring structure to various data formats

• Simple interface for ad hoc querying, analyzing and summarizing large amounts of data

• Access to files on various data stores such as HDFS and HBase

• Hive does NOT provide low-latency or real-time queries

– Even querying small amounts of data may take minutes

• Designed for scalability and ease of use rather than low-latency responses

Slide based on lecture http://www.coreservlets.com/hadoop-tutorial/


Hive

• To support features like schemas and data partitioning, Hive keeps its metadata in a relational database

– Packaged with Derby, a lightweight embedded SQL DB

• The default Derby-based metastore is good for evaluation and testing

• The schema is not shared between users, as each user has their own instance of embedded Derby

• It is stored in the metastore_db directory, which resides in the directory that Hive was started from

– Can easily switch to another SQL installation such as MySQL (see the configuration sketch below)

Slide based on lecture http://www.coreservlets.com/hadoop-tutorial/
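A hive-site.xml sketch for pointing the metastore at MySQL instead of the embedded Derby; host, database name and credentials are placeholders:

<configuration>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://dbhost/metastore?createDatabaseIfNotExist=true</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>hiveuser</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>hivepasswd</value>
  </property>
</configuration>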

Hive Architecture

Slide based on lecture http://www.coreservlets.com/hadoop-tutorial/


Hive Interface Options

• Command Line Interface (CLI)

• Hive Web Interface – https://cwiki.apache.org/confluence/display/Hive/HiveWebInterface

• Java Database Connectivity (JDBC) – https://cwiki.apache.org/confluence/display/Hive/HiveClient

• Concepts re-used from relational databases

– Database: set of tables, used for name conflict resolution

– Table: set of rows that have the same schema (the same columns)

– Row: a single record; a set of columns

– Column: provides value and type for a single value

Slide based on lecture http://www.coreservlets.com/hadoop-tutorial/

Hive: creating a table

hive> CREATE TABLE posts (user STRING, post STRING, time BIGINT)
    > ROW FORMAT DELIMITED
    > FIELDS TERMINATED BY ','
    > STORED AS TEXTFILE;
OK
Time taken: 10.606 seconds
hive> show tables;
OK
posts
Time taken: 0.221 seconds
hive> describe posts;
OK
user    string
post    string
time    bigint
Time taken: 0.212 seconds

CREATE TABLE creates a table with 3 columns; ROW FORMAT DELIMITED / FIELDS TERMINATED BY specifies how the underlying file should be parsed; describe displays the schema for the posts table.

Slide based on lecture http://www.coreservlets.com/hadoop-tutorial/
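Before querying, data has to be loaded into the table; a minimal example, assuming the user-posts.txt file from the Pig example is already in HDFS:

hive> LOAD DATA INPATH '/input/user-posts.txt' OVERWRITE INTO TABLE posts;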


Hive Query Data

hive> select * from posts where user="user2";
...
OK
user2   Cool Deal       1343182133839
Time taken: 12.184 seconds

hive> select * from posts where time<=1343182133839 limit 2;
...
OK
user1   Funny Story     1343182026191
user2   Cool Deal       1343182133839
Time taken: 12.003 seconds
hive>

Slide based on lecture http://www.coreservlets.com/hadoop-tutorial/

Partitions

• To increase performance Hive has the capability to partition data

– The values of the partition column divide a table into segments

– Entire partitions can be ignored at query time

– Similar to relational databases' indexes, but not as granular

• Partitions have to be properly created by users (see the example below)

– When inserting data you must specify a partition

• At query time, whenever appropriate, Hive will automatically filter out partitions

Slide based on lecture http://www.coreservlets.com/hadoop-tutorial/
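A sketch of a partitioned variant of the posts table; the country column, file name and partition value are made up for illustration. A query that filters on country only scans the matching partition:

hive> CREATE TABLE posts_by_country (user STRING, post STRING, time BIGINT)
    > PARTITIONED BY (country STRING)
    > ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
hive> LOAD DATA INPATH '/input/user-posts-US.txt'
    > INTO TABLE posts_by_country PARTITION (country='US');
hive> select * from posts_by_country where country='US';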


Bucketing

• A mechanism to query and examine random samples of data

• Breaks data into a set of buckets based on a hash function of a "bucket column"

– Provides the capability to execute queries on a random subset of the data

• Hive doesn't automatically enforce bucketing

– The user is required to specify the number of buckets by setting the number of reducers (see the sketch below)

Slide based on lecture http://www.coreservlets.com/hadoop-tutorial/
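A sketch of a bucketed variant of the posts table plus a query over a single bucket; hive.enforce.bucketing (available in Hive versions of that era) sets the reducer count to the number of buckets automatically, and the table name and bucket count are illustrative:

hive> CREATE TABLE posts_bucketed (user STRING, post STRING, time BIGINT)
    > CLUSTERED BY (user) INTO 4 BUCKETS;
hive> SET hive.enforce.bucketing = true;
hive> INSERT OVERWRITE TABLE posts_bucketed SELECT * FROM posts;
hive> SELECT * FROM posts_bucketed TABLESAMPLE(BUCKET 1 OUT OF 4 ON user);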

Joins

• Hive supports outer joins – left, right and full joins

• Can join multiple tables

• The default join is an inner join (see the example below)

– Rows are joined where the keys match

– Rows that do not have matches are not included in the result

Slide based on lecture http://www.coreservlets.com/hadoop-tutorial/
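Assuming a likes table created and loaded analogously to posts (user STRING, likes INT, time BIGINT), an inner join in HiveQL would look like this:

hive> SELECT p.user, p.post, l.likes
    > FROM posts p JOIN likes l ON (p.user = l.user);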


Pig vs. Hive

• Hive

– Uses an SQL-like query language called HiveQL (HQL)

– Gives non-programmers the ability to query and analyze data in Hadoop

• Pig

– Uses a workflow-driven scripting language

– You don't need to be an expert Java programmer, but you do need a few coding skills

– Can be used to convert unstructured data into a meaningful form

Mahout

• Scalable machine learning library

– Built with MapReduce and Hadoop in mind

– Written in Java

• Focuses on three application scenarios

– Recommendation systems

– Clustering

– Classifiers

• Multiple ways of utilizing Mahout

– Java interfaces

– Command line interfaces

• The newest Mahout releases target Spark, no longer MapReduce!


Classification

• Currently supported algorithms

– Naïve Bayes Classifier

– Hidden Markov Models

– Logistic Regression

– Random Forest

Clustering

• Currently supported algorithms

– Canopy clustering

– K-means clustering

– Fuzzy k-means clustering

– Spectral clustering

• Multiple tools available to support clustering

– clusterdump: utility to output the results of a clustering to a text file

– cluster visualization


Mahout input arguments

• Input data has to be sequence files and sequence vectors

– Sequence file: generic Hadoop concept for binary files containing a

• list of key/value pairs

• the classes used for the key and the value

– Sequence vector: binary file containing a list of key/(array of values) pairs

• For using Mahout algorithms, the key has to be Text and the value has to be of type VectorWritable (which is a Mahout class, not a Hadoop class)

Sequence Files

• Creating a sequence file using the command line

gabriel@shark> mahout seqdirectory -i /lastfm/input/ -o /lastfm/seqfiles

• Looking at the output of a sequence file

gabriel@shark> mahout seqdumper -i /lastfm/seqfiles/control-data.seq | more
Input Path: file:/lastfm/seqfiles/control-data.seq
Key class: class org.apache.hadoop.io.Text Value Class: class org.apache.mahout.math.VectorWritable
Key: 0: Value: {0:28.7812,1:34.4632,2:31.3381}
Key: 1: Value: {0:24.8923,1:25.741,2:27.5532}


Sequence File from Java

• Required if the original input file is not already structured in a manner that can be interpreted as key/value pairs

import java.io.BufferedReader;
import java.io.FileNotFoundException;
import java.io.FileReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.Scanner;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.mahout.math.DenseVector;
import org.apache.mahout.math.VectorWritable;

public class CreateSequenceFile {
    public static void main(String[] args) throws FileNotFoundException, IOException {
        String filename = "/home/gabriel/mahouttest/synthetic-control-data/input/synthetic-control.data";
        String outputfilename = "/home/gabriel/mahouttest/synthetic-control-data/seqfile/synthetic-control.seq";

        Path path = new Path(outputfilename);
        BufferedReader br = new BufferedReader(new FileReader(filename));
        String line;
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Writer producing <Text, VectorWritable> pairs, as required by the Mahout algorithms
        SequenceFile.Writer writer =
            new SequenceFile.Writer(fs, conf, path, Text.class, VectorWritable.class);

        Text key = new Text();
        long tempkey = 0;
        while ((line = br.readLine()) != null) {
            // Parse up to 64 doubles from the current line into a dense vector
            Scanner scanner = new Scanner(new StringReader(line));
            double[] values = new double[64];
            int i = 0;
            while (scanner.hasNextDouble() && i < 64) {
                values[i] = scanner.nextDouble();
                i++;
            }
            DenseVector val = new DenseVector(values);
            VectorWritable vec = new VectorWritable(val);

            // Use the line number as the key
            key = new Text(String.format("%d", tempkey));
            writer.append(key, vec);
            tempkey++;
        }
        writer.close();
    }
}


Using Mahout clustering

• Arguments of the k-means clustering call:

– The SequenceFile containing the input vectors

– The SequenceFile containing the initial cluster centers

– The similarity measure to be used

– The convergence threshold

– The number of iterations to be done

– The Vector implementation used in the input files
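A sketch of such a call using the 0.5-era KMeansDriver API; the paths are illustrative, and the exact signature differs in newer Mahout releases:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.mahout.clustering.kmeans.KMeansDriver;
import org.apache.mahout.common.distance.EuclideanDistanceMeasure;

public class RunKMeans {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        KMeansDriver.run(conf,
            new Path("/gabriel/clustering/points"),    // SequenceFile with the input vectors
            new Path("/gabriel/clustering/clusters"),  // SequenceFile with the initial cluster centers
            new Path("/gabriel/clustering/output"),    // output working directory
            new EuclideanDistanceMeasure(),            // distance (similarity) measure
            0.001,                                     // convergence threshold
            10,                                        // maximum number of iterations
            true,                                      // also assign the input vectors to clusters
            false);                                    // run as MapReduce rather than sequentially
    }
}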


Distance measures

Euclidean distance measure

Squared Euclidean distance measure

Manhattan distance measure

Distance measures

Cosine distance measure

Tanimoto distance measure
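For reference, the standard definitions of these measures for two vectors x and y are:

d_{Euclidean}(x, y) = \sqrt{\sum_i (x_i - y_i)^2}

d_{SquaredEuclidean}(x, y) = \sum_i (x_i - y_i)^2

d_{Manhattan}(x, y) = \sum_i |x_i - y_i|

d_{Cosine}(x, y) = 1 - \frac{x \cdot y}{\|x\| \, \|y\|}

d_{Tanimoto}(x, y) = 1 - \frac{x \cdot y}{\|x\|^2 + \|y\|^2 - x \cdot y}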


Running Mahout Clustering algorithms

bin/mahout kmeans

-i <input vectors directory> \

-c <input clusters directory> \

-o <output working directory> \

-k <optional no. of initial clusters> \

-dm <DistanceMeasure> \

-x <maximum number of iterations> \

-cd <optional convergence delta. Default is 0.5> \

-ow <overwrite output directory if present>

-cl <run input vector clustering after computing Canopies>

-xm <execution method: sequential or mapreduce>

mahout clusterdump -i /gabriel/clustering/canopy/clusters-0-final \
    --pointsDir /gabriel/clustering/canopy/clusteredPoints \
    -o /home/gabriel/mahouttest/synthetic-control-data/canopy.out