TRANSCRIPT
1
HYPERLANE Data, Code, Lab, Factory
Munich | 23 June 2015 | Hadoop User Group
2
Assignment
• Cross-Media Audience Measurement System.
• Dozens of data sources. Varying quality.
• Distributed Data Science and Engineering Teams.
• Global Ops Hub.
3
Hyperlane ...but why?
[Diagram: Engineers and Scientists work on Data and Code in the Lab and the Factory, built on Configurable Job Flows, Configurable Engines, Business Metadata and Technical Metadata]
Sculley et al (2014):
Glue Code: Using self-contained solutions often results in a glue code system design pattern, in which a massive amount of supporting code is written to get data into and out of general-purpose packages.
Pipeline Jungles: The system for preparing data in an ML-friendly format may become a jungle of scrapes, joins, and sampling steps, often with intermediate files output.
Configuration Debt: In a mature system which is being actively developed, the number of lines of configuration can far exceed the number of lines of code.
Sculley et al (2014):
Undeclared Consumers: Consuming the output of a given prediction model as an input to another component of the system. Changes will very likely impact these other parts.
Unstable Data Dependencies: Some input signals are unstable, meaning that they qualitatively change behavior over time.
Static Analysis of Data: On teams with many engineers, or if there are multiple interacting teams, not everyone knows the status of every single feature.
4
User Stories • As a Data Provider..
• I want to provide either (time-based) increments or full snapshots of my data in order to avoid expensive transformations.
• I want to provide data using the most recent schema even if this schema evolves over time in order to avoid expensive schema mappings.
• I want to provide new versions of data that has already been delivered in order to reflect reworks.
• I want to tag and comment my data (-increments, -versions) in order to retain qualitative and business meta-data that has an impact on how the data can/should (not) be used. This can include lineage data encoded in tags.
• As a Data Consumer..
• I want to find and browse (sampled) data sets based on their business meta data, tags, comments etc. in order to find relevant data and understand the status this data has.
• I want to access data using a consistent schema and format whenever possible (no matter whether the data was written using evolving schemas or particular technical representations) in order to avoid writing expensive and error-prone mapping code.
• I want to have (automated) access to growing data sets (Feeds) filtered by tags / business meta data in order to feed it into data pipelines.
• I want to be exposed to immutable data while I'm working on it in order to avoid breaking processing jobs due to concurrent write/update operations (potentially using a different schema) during the run time of my job.
• I want to be able to use standard tooling (Hive, PIG, JDBC, etc.) to access data in order to avoid lock-in and to benefit from the momentum in the engineering and data science community.
• I want to understand in which data center a particular data set is physically stored in order to adhere to regulation and to optimize processing (e.g. avoid remote data access).
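The consumer stories above (tag-filtered feeds, immutable slice versions, latest version per slice) can be sketched in a few lines. This is purely illustrative: the names `SliceVersion` and `feed` are made up and are not Hyperlane's actual API.

```python
from dataclasses import dataclass, field

# Hypothetical model of the consumer-side "Feed" story: browse slice versions
# by tags and read only immutable data. frozen=True models immutability.
@dataclass(frozen=True)
class SliceVersion:
    dataset: str
    slice_start: str          # e.g. "2014-01-01"
    version: int
    schema: str
    tags: frozenset = field(default_factory=frozenset)

def feed(catalog, dataset, required_tags):
    """Yield the latest version of each time slice carrying all required tags."""
    per_slice = {}
    for sv in catalog:
        if sv.dataset != dataset or not required_tags <= sv.tags:
            continue
        best = per_slice.get(sv.slice_start)
        if best is None or sv.version > best.version:
            per_slice[sv.slice_start] = sv
    return [per_slice[k] for k in sorted(per_slice)]

catalog = [
    SliceVersion("PurchasesDE", "2014-01-01", 1, "cbda", frozenset({"approved"})),
    SliceVersion("PurchasesDE", "2014-01-01", 2, "cbda", frozenset({"approved"})),
    SliceVersion("PurchasesDE", "2014-02-01", 1, "abcd", frozenset()),
]
latest_approved = feed(catalog, "PurchasesDE", frozenset({"approved"}))
```

Under these assumptions, only the first slice has an "approved" version, and version 2 wins over version 1.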
5
Hyperlane – Functional Architecture
[Architecture diagram. Recoverable components:
• Platform "Hyperlane": Data Set Browser, Quality Gates, Data Set Publisher, Taxonomy Workbench, Job Flow Console, JobFlow API, Data Management API, Hue Query Jobs, Config Repo (Taxo Bundle, JF Template, JF Conf Bundle), Typo3 FAQ, Aggregated Logs, Metrics Browser
• Data Sets: Staged Data Sets and Published Data Sets, each organized into Time Slices and Slice Versions
• Engines: Weighting, Imputation, Mass Selection, Taxonomy, Import, Enrichment, Visualizer, ETL, Export
• Job Flow Templates and Job Flow Instances: Import, Enrichment and Visualizer flows per country (TR, RU, DE); data sets such as ULD, IRD, ATR, SEN, ATL per country
• Business App "GXL": EXASOL Cube, Data API, Dim API
• Business App "Visualizer": Visualizer API, Visualizer Frontend]
6
HYPERLANE DATA
Berlin | 16 March 2015 | Get Together
7
Hyperlane – Data Set Management
Staged DataSet: PurchasesDE
A data table that receives additions over time and may see evolving schemata. Business Meta Data includes Owner, Description, Lineage, Usage
Timeslice: 2014-01-01/P1M
Represents a complete slice of the data bounded by start date and duration. Used to render a suitable visual representation, e.g. in the Calendar View of the Data Set Browser.
• Version 1: Created = 2014-02-02, QG1 = Approved, Schema = cbda
• Version 2: Created = 2014-02-03, QG1 = Approved, Schema = cbda
• Version 3: Created = 2014-04-02, QG1 = Pending, Schema = dcba
Timeslice: 2014-02-01/P1M
• Version 1: Created = 2014-03-03, QG1 = Pending, Schema = abcd
• Version 2: Created = 2014-03-04, QG1 = Approved, Schema = bacd
Published Data Sets
(Mount jobs WRITE the published tables; consumers READ them.)

Table: ATRDE-Approved-S1
LastUpdate = 2015-03-04 09:29:00, ReadLock = False, Schema = bacd
Mount Job: MOUNT ApprovedLEO USING SCHEMA bacd WHERE QG1=Approved WHERE LAST(Created)

Table: ATRDE-Approved-S2
LastUpdate = 2015-03-04 09:29:00, ReadLock = False, Schema = cbda
Mount Job: MOUNT ApprovedLEO USING SCHEMA cbda WHERE QG1=Approved WHERE LAST(Created)

Table: ATRDE-Latest-S1
LastUpdate = 2015-03-04 09:29:00, ReadLock = False, Schema = bacd
Mount Job: MOUNT LatestLEO USING SCHEMA bacd WHERE LAST(Created)
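The MOUNT statements above select, per time slice, the version with the newest Created timestamp, optionally restricted to approved versions and a fixed schema. A minimal sketch of that selection logic, assuming a simple dict-based representation of staged slice versions (not Hyperlane code):

```python
# Sketch of the MOUNT semantics: LAST(Created) per slice, filtered by
# quality-gate status and schema. Field names are illustrative.
def mount(versions, schema, approved_only=True):
    """versions: list of dicts with keys slice, created, qg1, schema."""
    per_slice = {}
    for v in versions:
        if approved_only and v["qg1"] != "Approved":
            continue
        if v["schema"] != schema:
            continue
        best = per_slice.get(v["slice"])
        if best is None or v["created"] > best["created"]:  # LAST(Created)
            per_slice[v["slice"]] = v
    return per_slice

staged = [
    {"slice": "2014-01", "created": "2014-02-02", "qg1": "Approved", "schema": "cbda"},
    {"slice": "2014-01", "created": "2014-02-03", "qg1": "Approved", "schema": "cbda"},
    {"slice": "2014-01", "created": "2014-04-02", "qg1": "Pending",  "schema": "dcba"},
    {"slice": "2014-02", "created": "2014-03-04", "qg1": "Approved", "schema": "bacd"},
]
approved_cbda = mount(staged, schema="cbda")
```

With this toy input, the January slice mounts its newest approved cbda version; the February slice is excluded because its schema differs.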
Data Browser / Quality Gates
• Sign Off: Confirm Data Delivery and Reception
• Comment: Link qualitative information to specific Data Set Slice Versions.

Deep Analytics (Apache Hue)
• Query and Join: Filter, Aggregate and Join all Data Sets
• Do Data Science: Apply Logic Engines or custom code on all Data Sets. Store and Reuse results.
8
Meta Data Management
Challenges:
• Versioned Data, Quality Gates, Comments as Meta Data, Logging
• Wording: Dataset, Instance, Partition, Timeslice? → Dataset & Timeslice

Make or Buy?
• Cloudera Navigator: + Policy Engine; - Vendor Lock-In, inflexible
• Waterline Data: + very good for unstructured data, + profiling, + tag propagation, + lineage; - no quality gates, no key-value tags; - product fairly new
9
Example: Movement data
• Events: when, who, what
Dataset: online behaviour, DE
10
Movement data
• Events: when, who, what
Dataset: e.g. online behaviour, DE
Slice: December 2014
January 2015
February 2015
11
Movement data
• Events: when, who, what
Dataset: e.g. online behaviour, DE
Slice: December 2014 – versions 1.1, 1.2, 1.3
Slice: January 2015 – versions 1.1, 1.2
Slice: February 2015 – version 1.1
13
Data Access
1. Evolving Schema
2. Different versions – updates and reworks
3. Different access scenarios – e.g. latest approved version of each slice
Tool: Dataset Management
14
Metadata:
1. Semantics: What does it actually mean?
2. Quality: Checked and Approved
3. Validity: What can/can't we do with this data?
Tool: Data Set Browser + QGs
DEMO TIME!
15
HYPERLANE CODE
Berlin | 16 March 2015 | Get Together
16
Configurable Engines: a Pig Wrapper around Pig UDFs and R Code
• Business Logic is packaged as Pig Scripts
• Takes Parameters and invokes R or JAVA code
• Easy to re-use through Apache Hue and Apache Oozie
• No UI
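Since an engine is just a parameterized Pig script, invoking it from Hue/Oozie or the command line boils down to passing `-param` values. A hedged sketch of building such an invocation; the script name `enrichment.pig` and the parameter names are invented for illustration:

```python
# Build a `pig` CLI invocation for a parameterized engine script.
# `-param key=value` and `-f script` are standard Pig command-line options.
def pig_command(script, params):
    cmd = ["pig"]
    for key, value in sorted(params.items()):
        cmd += ["-param", f"{key}={value}"]
    cmd += ["-f", script]
    return cmd

cmd = pig_command("enrichment.pig",
                  {"country": "DE", "input": "/data/staged/purchases_de"})
```

The resulting list could be handed to a process runner or templated into an Oozie action; the engine itself stays a plain, reusable Pig script.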
17
Configurable Job Flows
• JobFlow Templates: layout.json, workflow.xml
• JobFlow Instances: de/config.json, uk/config.json, tur/config.json
• Roles: Scientists & Engineers, Operations Hub
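One way to read the template/instance split above: a job flow template carries the defaults, and a per-country file such as de/config.json overrides them to produce a concrete instance. A minimal sketch under that assumption (key names are made up):

```python
import json

# Overlay a per-country config on a job flow template's defaults.
def instantiate(template, country_config):
    instance = dict(template)        # copy template defaults
    instance.update(country_config)  # country settings win
    return instance

template = {"engine": "enrichment", "sample_rate": 1.0, "country": None}
de_config = json.loads('{"country": "DE", "sample_rate": 0.1}')
de_instance = instantiate(template, de_config)
```

Scientists and engineers would maintain the template; the operations hub only touches the small country configs.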
18
HYPERLANE WRAP UP
Berlin | 16 March 2015 | Get Together
19
Hyperlane. Wrap Up.
[Diagram: Engineers and Scientists work on Data and Code in the Lab and the Factory, built on Configurable Job Flows, Configurable Engines, Business Metadata and Technical Metadata]
More transparency for productionized Job Flows visually represented as interactive DAGs used for documentation, configuration and monitoring.
Easier fusion, imputation and weighting of data sets through reusable and extensible "Engines" implemented in PIG, JAVA and R.
Improved findability of relevant data sets using a data catalog that features rich meta data (context, lineage, evolving schemas).
Better collaboration and strict quality controls enforced through sign-offs and clear accountabilities for providers and consumers of data sets.