TRANSCRIPT
1
HYPERLANE Data, Code, Lab, Factory
Munich | 23 June 2015 | Hadoop User Group
2
Assignment
• Cross-Media Audience Measurement System.
• Dozens of data sources. Varying quality.
• Distributed Data Science and Engineering Teams.
• Global Ops Hub.
3
Hyperlane ...but why?
[Diagram: Engineers and Scientists work on Data and Code in the Lab and the Factory, built on Configurable Job Flows, Configurable Engines, Business Metadata and Technical Metadata]
Sculley et al (2014):
Glue Code: Using self-contained solutions often results in a glue code system design pattern, in which a massive amount of supporting code is written to get data into and out of general-purpose packages.
Pipeline Jungles: The system for preparing data in an ML-friendly format may become a jungle of scrapes, joins, and sampling steps, often with intermediate files output.
Configuration Debt: In a mature system which is being actively developed, the number of lines of configuration can far exceed the number of lines of code.
Sculley et al (2014):
Undeclared Consumers: Consuming the output of a given prediction model as an input to another component of the system. Changes will very likely impact these other parts.
Unstable Data Dependencies: Some input signals are unstable, meaning that they qualitatively change behavior over time.
Static Analysis of Data: On teams with many engineers, or if there are multiple interacting teams, not everyone knows the status of every single feature.
4
User Stories • As a Data Provider..
• I want to provide either (time-based) increments or full snapshots of my data in order to avoid expensive transformations.
• I want to provide data using the most recent schema even if this schema evolves over time in order to avoid expensive schema mappings.
• I want to provide new versions of data that has already been delivered in order to reflect reworks.
• I want to tag and comment my data (-increments, -versions) in order to retain qualitative and business meta-data that has an impact on how the data can/should (not) be used. This can include lineage data encoded in tags.
• As a Data Consumer..
• I want to find and browse (sampled) data sets based on their business meta data, tags, comments etc. in order to find relevant data and understand the status this data has.
• I want to access data using a consistent schema and format whenever possible (no matter whether the data was written using evolving schemas or particular technical representations) in order to avoid writing expensive and error-prone mapping code.
• I want to have (automated) access to growing data sets (Feeds) filtered by tags / business meta data in order to feed it into data pipelines.
• I want to be exposed to immutable data while I'm working on it in order to avoid breaking processing jobs due to concurrent write/update operations (potentially using a different schema) during the run time of my job.
• I want to be able to use standard tooling (Hive, PIG, JDBC, etc.) to access data in order to avoid lock-in and to benefit from the momentum in the engineering and data science community.
• I want to understand in which data center a particular data set is physically stored in order to adhere to regulation and to optimize processing (e.g. avoid remote data access).
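The consumer stories above (tag-filtered feeds, immutable slice versions, latest version per slice) can be sketched in a few lines. This is purely illustrative: the names `SliceVersion` and `feed` are made up and are not Hyperlane's actual API.

```python
from dataclasses import dataclass, field

# Hypothetical model of the consumer-side "Feed" story: browse slice versions
# by tags and read only immutable data. frozen=True models immutability.
@dataclass(frozen=True)
class SliceVersion:
    dataset: str
    slice_start: str          # e.g. "2014-01-01"
    version: int
    schema: str
    tags: frozenset = field(default_factory=frozenset)

def feed(catalog, dataset, required_tags):
    """Yield the latest version of each time slice carrying all required tags."""
    per_slice = {}
    for sv in catalog:
        if sv.dataset != dataset or not required_tags <= sv.tags:
            continue
        best = per_slice.get(sv.slice_start)
        if best is None or sv.version > best.version:
            per_slice[sv.slice_start] = sv
    return [per_slice[k] for k in sorted(per_slice)]

catalog = [
    SliceVersion("PurchasesDE", "2014-01-01", 1, "cbda", frozenset({"approved"})),
    SliceVersion("PurchasesDE", "2014-01-01", 2, "cbda", frozenset({"approved"})),
    SliceVersion("PurchasesDE", "2014-02-01", 1, "abcd", frozenset()),
]
latest_approved = feed(catalog, "PurchasesDE", frozenset({"approved"}))
```

Under these assumptions, only the first slice has an "approved" version, and version 2 wins over version 1.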
5
Hyperlane – Functional Architecture
[Architecture diagram. Recoverable components:
• Platform "Hyperlane": Data Set Browser, Quality Gates, Data Set Publisher, Taxonomy Workbench, Job Flow Console, JobFlow API, Data Management API, Hue Query Jobs, Config Repo (Taxo Bundle, JF Template, JF Conf Bundle), Typo3 FAQ, Aggregated Logs, Metrics Browser
• Data Sets: Staged Data Sets and Published Data Sets, each organized into Time Slices and Slice Versions
• Engines: Weighting, Imputation, Mass Selection, Taxonomy, Import, Enrichment, Visualizer, ETL, Export
• Job Flow Templates and Job Flow Instances: Import, Enrichment and Visualizer flows per country (TR, RU, DE); data sets such as ULD, IRD, ATR, SEN, ATL per country
• Business App "GXL": EXASOL Cube, Data API, Dim API
• Business App "Visualizer": Visualizer API, Visualizer Frontend]
6
HYPERLANE DATA
Berlin | 16 March 2015 | Get Together
7
Hyperlane – Data Set Management
Staged DataSet: PurchasesDE
A data table that receives additions over time and may see evolving schemata. Business Meta Data includes Owner, Description, Lineage, Usage
Timeslice: 2014-01-01/P1M
Represents a complete slice of the data bounded by start date and duration. Used to render a suitable visual representation, e.g. in the Calendar View of the Data Set Browser.
• Version 1: Created = 2014-02-02, QG1 = Approved, Schema = cbda
• Version 2: Created = 2014-02-03, QG1 = Approved, Schema = cbda
• Version 3: Created = 2014-04-02, QG1 = Pending, Schema = dcba
Timeslice: 2014-02-01/P1M
• Version 1: Created = 2014-03-03, QG1 = Pending, Schema = abcd
• Version 2: Created = 2014-03-04, QG1 = Approved, Schema = bacd
Published Data Sets
(Mount jobs WRITE the published tables; consumers READ them.)

Table: ATRDE-Approved-S1
LastUpdate = 2015-03-04 09:29:00, ReadLock = False, Schema = bacd
Mount Job: MOUNT ApprovedLEO USING SCHEMA bacd WHERE QG1=Approved WHERE LAST(Created)

Table: ATRDE-Approved-S2
LastUpdate = 2015-03-04 09:29:00, ReadLock = False, Schema = cbda
Mount Job: MOUNT ApprovedLEO USING SCHEMA cbda WHERE QG1=Approved WHERE LAST(Created)

Table: ATRDE-Latest-S1
LastUpdate = 2015-03-04 09:29:00, ReadLock = False, Schema = bacd
Mount Job: MOUNT LatestLEO USING SCHEMA bacd WHERE LAST(Created)
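The MOUNT statements above select, per time slice, the version with the newest Created timestamp, optionally restricted to approved versions and a fixed schema. A minimal sketch of that selection logic, assuming a simple dict-based representation of staged slice versions (not Hyperlane code):

```python
# Sketch of the MOUNT semantics: LAST(Created) per slice, filtered by
# quality-gate status and schema. Field names are illustrative.
def mount(versions, schema, approved_only=True):
    """versions: list of dicts with keys slice, created, qg1, schema."""
    per_slice = {}
    for v in versions:
        if approved_only and v["qg1"] != "Approved":
            continue
        if v["schema"] != schema:
            continue
        best = per_slice.get(v["slice"])
        if best is None or v["created"] > best["created"]:  # LAST(Created)
            per_slice[v["slice"]] = v
    return per_slice

staged = [
    {"slice": "2014-01", "created": "2014-02-02", "qg1": "Approved", "schema": "cbda"},
    {"slice": "2014-01", "created": "2014-02-03", "qg1": "Approved", "schema": "cbda"},
    {"slice": "2014-01", "created": "2014-04-02", "qg1": "Pending",  "schema": "dcba"},
    {"slice": "2014-02", "created": "2014-03-04", "qg1": "Approved", "schema": "bacd"},
]
approved_cbda = mount(staged, schema="cbda")
```

With this toy input, the January slice mounts its newest approved cbda version; the February slice is excluded because its schema differs.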
Data Browser / Quality Gates
• Sign Off: Confirm Data Delivery and Reception
• Comment: Link qualitative information to specific Data Set Slice Versions.

Deep Analytics (Apache Hue)
• Query and Join: Filter, Aggregate and Join all Data Sets
• Do Data Science: Apply Logic Engines or custom code on all Data Sets. Store and Reuse results.
8
Meta Data Management
Challenges:
• Versioned Data, Quality Gates, Comments as Meta Data, Logging
• Wording: Dataset, Instance, Partition, Timeslice? → Dataset & Timeslice

Make or Buy?
• Cloudera Navigator: + Policy Engine; - Vendor Lock-In, inflexible
• Waterline Data: + very good for unstructured data, + profiling, + tag propagation, + lineage; - no quality gates, no key-value tags; - product fairly new
9
Example: Movement data
• Events: when, who, what
Dataset: online behaviour, DE
10
Movement data
• Events: when, who, what
Dataset: e.g. online behaviour, DE
Slice: December 2014
January 2015
February 2015
11
Movement data
• Events: when, who, what
Dataset: e.g. online behaviour, DE
Slice: December 2014 – versions 1.1, 1.2, 1.3
Slice: January 2015 – versions 1.1, 1.2
Slice: February 2015 – version 1.1
13
Data Access
1. Evolving Schema
2. Different versions – updates and reworks
3. Different access scenarios – e.g. latest approved version of each slice
Tool: Dataset Management
14
Metadata:
1. Semantics: What does it actually mean?
2. Quality: Checked and Approved
3. Validity: What can/can't we do with this data?
Tool: Data Set Browser + QGs
DEMO TIME!
15
HYPERLANE CODE
Berlin | 16 March 2015 | Get Together
16
Configurable Engines: a Pig Wrapper around Pig UDFs and R Code
• Business Logic is packaged as Pig Scripts
• Takes Parameters and invokes R or JAVA code
• Easy to re-use through Apache Hue and Apache Oozie
• No UI
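Since an engine is just a parameterized Pig script, invoking it from Hue/Oozie or the command line boils down to passing `-param` values. A hedged sketch of building such an invocation; the script name `enrichment.pig` and the parameter names are invented for illustration:

```python
# Build a `pig` CLI invocation for a parameterized engine script.
# `-param key=value` and `-f script` are standard Pig command-line options.
def pig_command(script, params):
    cmd = ["pig"]
    for key, value in sorted(params.items()):
        cmd += ["-param", f"{key}={value}"]
    cmd += ["-f", script]
    return cmd

cmd = pig_command("enrichment.pig",
                  {"country": "DE", "input": "/data/staged/purchases_de"})
```

The resulting list could be handed to a process runner or templated into an Oozie action; the engine itself stays a plain, reusable Pig script.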
17
Configurable Job Flows
• JobFlow Templates: layout.json, workflow.xml
• JobFlow Instances: de/config.json, uk/config.json, tur/config.json
• Roles: Scientists & Engineers, Operations Hub
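One way to read the template/instance split above: a job flow template carries the defaults, and a per-country file such as de/config.json overrides them to produce a concrete instance. A minimal sketch under that assumption (key names are made up):

```python
import json

# Overlay a per-country config on a job flow template's defaults.
def instantiate(template, country_config):
    instance = dict(template)        # copy template defaults
    instance.update(country_config)  # country settings win
    return instance

template = {"engine": "enrichment", "sample_rate": 1.0, "country": None}
de_config = json.loads('{"country": "DE", "sample_rate": 0.1}')
de_instance = instantiate(template, de_config)
```

Scientists and engineers would maintain the template; the operations hub only touches the small country configs.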
18
HYPERLANE WRAP UP
Berlin | 16 March 2015 | Get Together
19
Hyperlane. Wrap Up.
[Diagram: Engineers and Scientists work on Data and Code in the Lab and the Factory, built on Configurable Job Flows, Configurable Engines, Business Metadata and Technical Metadata]
More transparency for productionized Job Flows visually represented as interactive DAGs used for documentation, configuration and monitoring.
Easier fusion, imputation and weighting of data sets through reusable and extensible "Engines" implemented in PIG, JAVA and R.
Improved findability of relevant data sets using a data catalog that features rich meta data (context, lineage, evolving schemas).
Better collaboration and strict quality controls enforced through sign-offs and clear accountabilities for providers and consumers of data sets.