EPAM. Hadoop MR Streaming in Hive

Yauheni Yushyn, EPAM Systems – September 2014. Hadoop MR Streaming in Hive: Use case with Hive and Python from real life


DESCRIPTION

Streaming offers an alternative way to transform data. During a streaming job, the Hadoop Streaming API opens an I/O pipe to an external process. This presentation shows how to use the streaming feature in Hive to reduce code complexity, with an example from a real project.

TRANSCRIPT

Page 1: EPAM. Hadoop MR streaming in Hive

Yauheni Yushyn, EPAM Systems – September 2014

Hadoop MR Streaming in Hive: Use case with Hive and Python from real life

Page 2: EPAM. Hadoop MR streaming in Hive


Agenda

• Intro

• Pros and Cons

• Hive reference

• Use case from Real Life

• Possible solutions

• Hive Streaming: Architecture

• Hive Streaming: Realization

• Hive Streaming: Source code

• Hive Streaming: Debug

• Hive Streaming: Pitfalls

• Hive Streaming: Benchmarks

Page 3: EPAM. Hadoop MR streaming in Hive

SECTION: CONCEPTS

Page 4: EPAM. Hadoop MR streaming in Hive

Intro

Streaming offers an alternative way to transform data. During a streaming job, the Hadoop Streaming API opens an I/O pipe to an external process.

Unix-like interface:

• Streaming API opens an I/O pipe to an external process

• Process reads data from the standard input and writes the results out through the standard output

By default, INPUT for the user script:

• columns are transformed to STRING

• delimited by TAB

• NULL values are converted to the literal string \N (to differentiate NULL values from empty strings)

OUTPUT of the user script:

• treated as TAB-separated STRING columns

• \N will be re-interpreted as a NULL

• each resulting STRING column will be cast to the data type specified in the table declaration

These defaults can be overridden with ROW FORMAT
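For illustration, a minimal sketch of a streaming query that overrides the default TAB delimiter with ROW FORMAT; the table and script names (src, my_script.py) are placeholders, not objects from this project:

ADD FILE ./my_script.py;

SELECT TRANSFORM (col1, col2)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY ','   -- format of rows fed to the script
  USING 'my_script.py'
  AS (new_col1, new_col2)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY ','   -- format of rows read back from the script
FROM src;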

Page 5: EPAM. Hadoop MR streaming in Hive

Pros:

• Simplicity for the developer, dealing with stdin/stdout

• Schema-less model, treat values as needed

• Non-Java interface

Cons:

• Overhead for serialization/deserialization between processes

• Disallowed when "SQL standard based authorization" is configured (Hive 0.13.0 and later releases)

Pros and Cons

Page 6: EPAM. Hadoop MR streaming in Hive

• MAP()

• REDUCE()

• TRANSFORM()

Hive provides several clauses to use streaming: MAP(), REDUCE(), and TRANSFORM().

Note:

MAP() does not actually force streaming during the map phase, nor does REDUCE() force streaming to happen in the reduce phase. For this reason, the functionally equivalent yet more generic TRANSFORM() clause is suggested, to avoid misleading the reader of the query.
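As a quick sketch (table t and script.py are placeholder names), these two queries are functionally equivalent; MAP is only an alias and does not constrain which phase runs the script:

FROM t
MAP k, v
USING 'script.py'
AS k2, v2;

FROM t
SELECT TRANSFORM (k, v)
USING 'script.py'
AS (k2, v2);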

Hive reference

Page 7: EPAM. Hadoop MR streaming in Hive

SECTION: USE CASE

Page 8: EPAM. Hadoop MR streaming in Hive

Requirements:

There are 14 flags in the source table in Hive which control the output values for 4 new fields in the target table

Solutions:

• Hive "case … when" clause

• User Defined Function (UDF)

• Custom MR Job

• Hive Streaming

Use case from Real Life

Page 9: EPAM. Hadoop MR streaming in Hive

Use case from Real Life: Requirements

Page 10: EPAM. Hadoop MR streaming in Hive


Cons:

• More than 1,500 lines of code to map flags to new fields (the statement repeats for every output field)

• Complexity for debugging

Pros:

• Fast execution

• SQL-like syntax

• All logic in one place (hql script)

Hive "case … when" clause

Page 11: EPAM. Hadoop MR streaming in Hive

Cons:

• You are the single consumer of the UDF (in this particular case, custom logic for a single DataMart)

• Java code

Pros:

• Fast execution

• Pass only the needed flags into the UDF (in contrast with Hive Streaming)

• In the end: SQL-like syntax, all logic in one place

• Java code

UDF

Page 12: EPAM. Hadoop MR streaming in Hive

Cons:

• Slower execution (time for SerDe)

• Deals with all fields, not only flags (in contrast with UDF)

Pros:

• Reduced code complexity by using a scripting language

• Small code size

• Fast development

• Wide choice of programming languages

Hive Streaming

Page 13: EPAM. Hadoop MR streaming in Hive

SECTION: REALIZATION

Page 14: EPAM. Hadoop MR streaming in Hive

Hive Streaming: Architecture

Page 15: EPAM. Hadoop MR streaming in Hive

Python snippets:

• Create a matrix (e.g., a list of tuples) with flags and the related field values

• Loop through INPUT

• Split INPUT by TAB

• Split data fields and flags

• Compare with the matrix and take the best possible match

• Spill out data with new fields as TAB-separated text

Hive Streaming: Realization

Page 16: EPAM. Hadoop MR streaming in Hive

#!/usr/bin/env python
"""Mapper for Hive Streaming, using Python iterators and generators.
Spill out new fields in accordance with input flags."""

import sys
import logging


def read_input(file):
    """Read data from STDIN using a python generator"""
    # (sample TSV rows can be yielded here instead, for local debugging)
    for line in file:
        yield line.strip()


def compare_flags(source, target):
    """Compare flags from source and target lists. Src/trg should have the same size"""
    size = len(source)
    out = list()

    # Go through elements: add 0 to OUT for every matching position,
    # keep '-' for wildcard positions, give up on the first mismatch
    for i in xrange(size):
        if target[i] != '-':
            if target[i] == source[i]:
                out.append(0)
            else:
                return None
        else:
            out.append('-')

    return out


def main(separator='\t'):
    column_list = ["ORIGIN", "DESTINATION", "OND", "CARRIER", "LOS", "BKG_WINDOW",
                   "LOCAL_CURRENCY", "LOWEST_PRICE", "PAGE", "POSITION", "XP_RANK",
                   "XP_PRICE", "XP_COMPETED", "XP_PRICE_DIFF", "BML", "NUMBER_SELLERS",
                   "XP_IS_HERO", "ECPC_LOSS", "PRICE_LOSS",
                   "OTA_1", "OTA_1_PRICE", "OTA_2", "OTA_2_PRICE", "OTA_3", "OTA_3_PRICE",
                   "OTA_4", "OTA_4_PRICE", "OTA_5", "OTA_5_PRICE", "OTA_6", "OTA_6_PRICE",
                   "OTA_7", "OTA_7_PRICE", "OTA_8", "OTA_8_PRICE", "OTA_9", "OTA_9_PRICE",
                   "OTA_10", "OTA_10_PRICE", "OTA_11", "OTA_11_PRICE", "OTA_12", "OTA_12_PRICE",
                   "OTA_13", "OTA_13_PRICE", "OTA_14", "OTA_14_PRICE", "OTA_15", "OTA_15_PRICE",
                   "PARTNER_NAME", "RCXR", "DCXR", "SPLIT_TICKET", "DEPARTURE_DURATION",
                   "RETURN_DURATION", "DEPARTURE_STOPS", "RETURN_STOPS"]
    flag_list = ["exp_listed_on_route_flag", "exp_listed_on_carr_flag", "exp_lst_on_itin_flag",
                 "carr_is_seller_flag", "more_than_1_seller_flag", "split_ticket_flag",
                 "exp_in_hero_flag", "ota_in_hero_flag", "meta_in_hero_flag", "carr_in_hero_flag",
                 "cheapest_prc_is_unique_flag", "exp_prc_match_carr_flag",
                 "exp_prc_match_cheapest_flag", "cheapest_ota_meta_prc_match_carr_flag"]
    partition_list = ["SHOP_DATE", "PARTNER_POS"]

    logging.debug("Start specifying vocabulary matrix")
    # Each entry maps a pattern over the 14 flags ('-' = any value)
    # to the values of the 4 new output fields
    target = [
        (["Inventory","Epam not showing route","Epam Lost","Unknown"], ["0","-","0","-","-","-","-","-","-","-","-","-","-","-"]),
        (["Inventory","Epam not showing carrier","Epam Lost","Unknown"], ["1","0","0","0","-","0","-","-","-","-","-","-","-","-"]),
        (["Inventory","Epam not showing carrier","Epam Lost","Restricted carrier for Epam"], ["1","0","0","1","1","0","-","-","-","-","-","-","-","-"]),
        (["Inventory","Epam not showing carrier","Epam Lost","Restricted carrier on Meta"], ["1","0","0","1","0","0","-","-","-","-","-","-","-","-"]),
        (["Inventory","Epam not showing carrier","Epam Lost","Unknown"], ["1","0","0","0","-","-","-","-","-","-","-","-","-","-"]),
        (["Inventory","Epam not showing itinerary","Epam Lost","Unknown"], ["1","1","0","-","1","0","-","-","-","-","-","-","-","-"]),
        (["Inventory","Unique Inventory","Epam Lost","Split Ticket"], ["1","0","0","0","0","1","-","-","-","-","-","-","-","-"]),
        (["Inventory","Unique Inventory","Split Ticket","Epam Won"], ["1","1","1","0","-","1","1","0","0","0","-","-","-","-"]),
        (["Inventory","Unique Inventory","Epam Lost","Split Ticket"], ["1","1","0","0","-","1","-","1","0","0","-","-","-","-"]),
        (["Inventory","Unique Inventory","Epam Lost","Split Ticket"], ["1","1","0","0","-","1","-","0","1","0","-","-","-","-"]),
        (["Inventory","Unique Inventory","Unknown","Epam Won"], ["1","1","1","0","0","0","1","0","0","0","-","-","-","-"]),
        (["Inventory","Unique Inventory","Epam Lost","Suspected Carrier Restricted Content"], ["1","1","0","1","0","0","0","0","0","1","-","-","-","-"]),
        (["Inventory","Unique Inventory","Epam Lost","Unknown"], ["1","1","0","0","0","0","-","-","-","-","-","-","-","-"]),
        (["Price","Carrier more expensive","Undercutting carrier","Epam Won"], ["1","1","1","1","-","0","1","0","0","0","-","0","-","-"]),
        (["Price","Carrier more expensive","Epam Lost","Undercutting carrier"], ["1","1","1","1","1","0","0","0","1","0","-","-","0","0"]),
        (["Price","Carrier more expensive","Epam Lost","Undercutting carrier"], ["1","1","1","1","1","0","0","1","0","0","-","-","0","0"]),
        (["Price","Carrier cheapest","Epam Lost","Unknown"], ["1","1","1","1","1","0","0","-","-","1","-","0","0","0"]),
        (["Price","Carrier cheapest","Epam Lost","Carrier controlled pricing"], ["1","1","1","1","-","0","0","0","0","1","1","0","-","-"]),
        (["Price","Fees or charges","Epam Lost","Split Ticket"], ["1","1","1","0","-","1","0","0","1","0","-","-","0","-"]),
        (["Price","Fees or charges","Epam Lost","Split Ticket"], ["1","1","1","0","-","1","0","1","0","0","-","-","0","-"]),
        (["Price","Fees or charges","Split Ticket","Epam Won"], ["1","1","1","0","-","1","1","0","0","0","-","-","-","-"]),
        (["Price","Fees or charges","Epam Lost","Unknown"], ["1","1","1","0","-","0","0","0","1","0","-","-","0","-"]),
        (["Price","Fees or charges","Epam Lost","Unknown"], ["1","1","1","0","-","0","0","1","0","0","-","-","0","-"]),
        (["Price","Fees or charges","Unknown","Epam Won"], ["1","1","1","0","-","0","1","0","0","0","1","-","-","-"]),
        (["Price","Fees or charges","Epam Lost","Fees or charges"], ["1","1","1","1","1","0","0","1","0","0","0","-","0","1"]),
        (["Price","Fees or charges","Epam Lost","Fees or charges"], ["1","1","1","1","1","0","0","0","1","0","0","-","0","1"]),
        (["Price","Fees or charges","Epam Lost","Fees or charges"], ["1","1","1","1","1","0","0","0","0","1","0","0","-","1"]),
        (["Rank","Rank","ECPC","Epam Won"], ["1","1","1","-","1","-","1","0","0","0","0","-","1","-"]),
        (["Rank","Rank","Epam Lost","ECPC"], ["1","1","1","-","1","-","0","1","0","0","0","-","1","-"]),
        (["Rank","Rank","Epam Lost","ECPC"], ["1","1","1","-","1","-","0","0","1","0","0","-","1","-"]),
        (["Rank","Rank","Epam Lost","ECPC"], ["1","1","1","-","1","-","0","0","0","1","0","-","1","-"]),
    ]

    # Input comes from STDIN
    data = read_input(sys.stdin)

    header_list = column_list + flag_list + partition_list
    logging.debug("Header for input data: %s" % header_list)

    logging.debug("Start reading from STDIN")
    # Loop through STDIN
    for words in data:
        logging.debug("-----------")
        current_flags = list()

        words = words.split('\t')

        logging.debug("Input values from external process (STDIN): %s" % words)
        logging.debug("Input length: %s" % len(words))

        if len(header_list) != len(words):
            logging.error("Length of IN data (%i) not equal to Header length (%i)! Exit" % (len(words), len(header_list)))
            sys.exit(1)

        data_set = dict(zip(header_list, words))
        logging.debug("Parsing of STDIN: %s" % data_set)

        # Get flags
        for flag in flag_list:
            current_flags.append(data_set[flag])
        logging.debug("Found flags: %s" % current_flags)

        # Get list with results of src/trg comparison
        compared_list = list()
        logging.debug("Comparing flags with vocabulary...")
        for k, v in target:
            temp_out = compare_flags(current_flags, v)
            if not temp_out:
                continue
            logging.debug("Match is found: %s" % temp_out)
            compared_list.append((k, temp_out))

        logging.debug("Comparing flags with vocabulary finished. List of matches: %s" % compared_list)

        # Find max occurrence of src in trg (the match with the most zeros)
        max_zeros = 0
        out_fields = list()
        max_flag_from_trg = list()
        for k, v in compared_list:
            count_zero = v.count(0)
            if count_zero > max_zeros:
                max_zeros = count_zero  # remember the best score so far
                out_fields = k
                max_flag_from_trg = v

        if (not out_fields) or (not max_flag_from_trg):
            logging.warning("Can't find values in vocabulary. Set values to DEFAULT")
            logging.warning("Fields: %s" % out_fields)
            logging.warning("Flags: %s" % max_flag_from_trg)
            out_fields = ["DEFAULT" for x in xrange(len(target[0][0]))]
        else:
            logging.debug("Output fields found")
            logging.debug("Fields: %s" % out_fields)
            logging.debug("Flags: %s" % max_flag_from_trg)

        # Output data fields, then the new fields, then the partition columns, to STDOUT
        field_data = [data_set[x] for x in column_list]
        partition_date = [data_set[x] for x in partition_list]
        out_row = separator.join(field_data) + separator + separator.join(out_fields) + separator + separator.join(partition_date)
        logging.debug("Output string: %s" % out_row)
        print out_row


if __name__ == "__main__":
    logging.basicConfig(level=logging.DEBUG, stream=sys.stderr,
                        format='[%(asctime)s][%(filename)s][%(levelname)s] %(message)s')
    main()

Hive Streaming: Source code

Page 17: EPAM. Hadoop MR streaming in Hive

echo -e "val11\tval12\t…val1N\nval21\tval22\t…val2N" | ./script_name.py

Example:

Put 2 lines (TSV) in stdin:

echo -e "IAH\tCUN\tIAH-CUN\t01\t\t14\tUSD\t520.99\t4\t19\t\N\t\N\t0\t\N\tDID\t2\tDID\tDID\tDID\tCHEAPTICKETS\t520.99\tORBITZ\t520.99\t\N\t\N\t\N\t\N\t\N\t\N\t\N\t\N\t\N\t\N\t\N\t\N\t\N\t\N\t\N\t\N\t\N\t\N\t\N\t\N\t\N\t\N\t\N\t\N\t\N\t\N\tTRIPADVISOR - US\t01\t01\t0\t0\t0\t0\t0\t1\t0\t0\t0\t1\t0\t0\t0\t0\t0\t0\t0\t0\t0\t2014-01-01\tEPAM.COM\nIAH\tCUN\tIAH-CUN\t01\t\t14\tUSD\t520.99\t4\t19\t\N\t\N\t0\t\N\tDID\t2\tDID\tDID\tDID\tCHEAPTICKETS\t520.99\tORBITZ\t520.99\t\N\t\N\t\N\t\N\t\N\t\N\t\N\t\N\t\N\t\N\t\N\t\N\t\N\t\N\t\N\t\N\t\N\t\N\t\N\t\N\t\N\t\N\t\N\t\N\t\N\t\N\tTRIPADVISOR - US\t01\t01\t0\t0\t0\t0\t0\t1\t0\t0\t0\t1\t0\t0\t0\t0\t0\t0\t0\t0\t0\t2014-01-01\tEPAM.COM" | ./script_name.py

Get 2 lines with new fields (without flags) in stdout:

IAH CUN IAH-CUN 01 14 USD 520.99 4 19 \N \N 0 \N DID 2 DID DID DID CHEAPTICKETS 520.99 ORBITZ 520.99 \N \N \N \N \N \N \N \N \N \N \N \N \N \N \N \N \N \N \N \N \N \N \N \N \N \N TRIPADVISOR - US 01 01 0 0 0 0 0 Inventory Epam not showing carrier Epam Lost Unknown 2014-01-01 EPAM.COM

IAH CUN IAH-CUN 01 14 USD 520.99 4 19 \N \N 0 \N DID 2 DID DID DID CHEAPTICKETS 520.99 ORBITZ 520.99 \N \N \N \N \N \N \N \N \N \N \N \N \N \N \N \N \N \N \N \N \N \N \N \N \N \N TRIPADVISOR - US 01 01 0 0 0 0 0 Inventory Epam not showing carrier Epam Lost Unknown 2014-01-01 EPAM.COM

Hive Streaming: Debug

Page 18: EPAM. Hadoop MR streaming in Hive

• Add the script to the Distributed Cache before running a query with Hive Streaming (see the sketch below)

• Use the last columns of the select statement for Dynamic Partitioning

• Use a separator more robust than the default TAB to prevent data inconsistency
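A hedged sketch of the first two points (mapper.py, source, target and the column list are illustrative names, not this project's real DDL):

-- ship the script to every node via the Distributed Cache
ADD FILE /local/path/mapper.py;

-- allow dynamic partitions for the insert
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

-- partition columns (shop_date, partner_pos) must come last in the script output
INSERT OVERWRITE TABLE target PARTITION (shop_date, partner_pos)
SELECT TRANSFORM (*)
USING 'mapper.py'
AS (origin, destination, category, reason, shop_date, partner_pos)
FROM source;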

Note: always use iterator/generator functions (the pythonic way) instead of reading explicitly from stdin! It saves system resources and makes the script run much faster (more than 10 times in this case)

Example:

Hive Streaming: Pitfalls

# preferred: generator-based reading
def read_input(file):
    for line in file:
        # yield stripped lines lazily, one at a time
        yield line.strip()

data = read_input(sys.stdin)
for words in data:
    …

instead of reading stdin directly:

for words in sys.stdin:
    …

Page 19: EPAM. Hadoop MR streaming in Hive

SECTION: BENCHMARKS

Page 20: EPAM. Hadoop MR streaming in Hive

Hive Streaming: Benchmarks

Hive "case … when" clause

Source: MANAGED, Non-partitioned, 2M rows

Target: MANAGED, Non-Partitioned

Time spent: 2m39s

Page 21: EPAM. Hadoop MR streaming in Hive

Hive Streaming: Benchmarks

Hive Streaming

Source: MANAGED, Non-partitioned, 2M rows

Target: MANAGED, Non-Partitioned

Time spent: 4m53s

Note: no compression for the output, so "Number of bytes written" is significantly larger

Page 22: EPAM. Hadoop MR streaming in Hive

Hive Streaming: Benchmarks

Hive "case … when" clause

Source: MANAGED, Non-partitioned, 2M rows

Target: MANAGED, Partitioned by 2 columns

Time spent: 2m44s

Page 23: EPAM. Hadoop MR streaming in Hive

Hive Streaming: Benchmarks

Hive Streaming

Source: MANAGED, Non-partitioned, 2M rows

Target: MANAGED, Partitioned by 2 columns

Time spent: 5m12s
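Summarizing the four runs above (same 2M-row MANAGED, non-partitioned source in all cases):

Approach              Target                      Time
"case … when"         Non-partitioned             2m39s
Hive Streaming        Non-partitioned             4m53s
"case … when"         Partitioned by 2 columns    2m44s
Hive Streaming        Partitioned by 2 columns    5m12s

Streaming is roughly 1.8-1.9x slower in both configurations, the price of the extra serialization/deserialization noted earlier.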