presentation pdi data_vault_framework_meetup2012
TRANSCRIPT
Introduction
Data Vault Definition
Source: Dan Linstedt, http://www.tdan.com/view-articles/5054/
The Data Vault is a detail oriented, historical tracking and uniquely linked set of normalized tables that support one or more functional areas of business. It is a hybrid approach encompassing the best of breed between 3rd normal form (3NF) and star schema. The design is flexible, scalable, consistent and adaptable to the needs of the enterprise. It is a data model that is architected specifically to meet the needs of enterprise data warehouses.
Data Vault Building Blocks
Source: Dan Linstedt, http://www.slideshare.net/dlinstedt/introduction-to-data-vault-dama-oregon-2012
different sources/rate of change
Data Vault Fundamentals: Hub
Source: Data Vault Modeling Guide, Genesee Academy LLC, Hans Hultgren
Data Vault Fundamentals: Link
Data Vault Fundamentals: Satellite
Data Vault Fundamentals: Model
Data Vault ETL
Many objects to load, standardized procedures
This screams for a generic solution!
I don't want to:
throw the ETL tool away and code it all myself
manage too many ETL objects
connect similar columns in mappings by hand
I do want to:
generate ETL (Kettle) objects? No. Take it one step further: there is only one parameterised hub load object, so there is no need to know the XML structure of PDI objects
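The "one parameterised load object" idea can be sketched outside PDI as a SQL template plus a metadata record; in the framework itself this is a single Kettle transformation whose parameters are set per hub. All table and column names below are illustrative:

```python
# Minimal sketch: one generic hub-load statement, filled in per hub from
# metadata parameters instead of generating a separate ETL object per table.
HUB_LOAD_TEMPLATE = """
INSERT INTO {hub_table} ({business_key_column}, load_dts, record_source)
SELECT DISTINCT {source_bk_expression}, NOW(), '{record_source}'
FROM {source_table} s
LEFT JOIN {hub_table} h
  ON h.{business_key_column} = {source_bk_expression}
WHERE h.{business_key_column} IS NULL
""".strip()

def hub_load_sql(metadata):
    """Render the generic hub-load statement for one hub."""
    return HUB_LOAD_TEMPLATE.format(**metadata)

# Illustrative metadata record; a composite key could use an expression
# such as "CONCAT(s.order_nr, '|', s.line_nr)" instead of a single column.
customer_hub = {
    "hub_table": "hub_customer",
    "business_key_column": "customer_nr",
    "source_bk_expression": "s.customer_nr",
    "source_table": "stg_customers",
    "record_source": "erp",
}
print(hub_load_sql(customer_hub))
```

Only the metadata record changes per hub; the load logic is written once.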
Tools
Version Control
Database
Virtualization
Data Integration
Operating System
'Productivity'
Sql Development
Place of framework in architecture
Diagram: Sources (ERP, DBMS, MySQL, CSV files) → ETL process (Kettle Data Vault framework, staging area) → Data Warehouse (MySQL Data Vault, central DWH & data marts) → EUL
What has to be taken care of?
Data Vault designed and implemented in database
Staging tables and loading procedures in place (can also be generic, we use the PDI Metadata Injection step for loading files)
Mapping from source to Data Vault specified (now in an Excel sheet)
Framework components
PDI repository (file based), jobs and transformations
Configuration files: kettle.properties
shared.xml
repositories.xml
Excel sheet that contains the specifications
MySQL database for metadata
Virtual machine with Ubuntu 12.04 Server
Design decisions
Updateable views with generic column names
(MySQL more lenient than PostgreSQL)
Compare satellite attributes via string comparison (concatenate all columns, with | (pipe) as delimiter)
'inject' the metadata using Kettle parameters
Generate and use an error table for each Data Vault table
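The satellite-comparison design decision above (concatenate all attribute columns with a pipe delimiter, then compare one string) can be sketched as follows; the NULL-to-empty-string handling is an assumption, and in the framework this runs inside a Kettle transformation:

```python
def compare_string(row, columns, delimiter="|"):
    """Concatenate the satellite attribute values of one row,
    pipe-delimited, so a single string comparison detects any change."""
    return delimiter.join(
        "" if row.get(c) is None else str(row[c]) for c in columns
    )

# Illustrative rows: the existing satellite record vs. the incoming one.
existing = {"name": "Acme", "city": "Utrecht", "phone": None}
incoming = {"name": "Acme", "city": "Amsterdam", "phone": None}
cols = ["name", "city", "phone"]

changed = compare_string(existing, cols) != compare_string(incoming, cols)
print(changed)  # True: the city changed, so a new satellite row is needed
```

One string comparison replaces a per-column comparison over up to 200 attributes.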
Metadata tables
All have history tables
Metadata in Excel
Data Vault
connections
source systems
source tables
Metadata in Excel (hub + sat)
attribute columns × 200 (max)
Metadata in Excel (link)
link attributes
hub reference columns × 10 (max)
Metadata in Excel (link satellite)
hub columns × 10 (max)
key attributes × 5 (max)
attribute columns × 200 (max)
Last seen date
applicable for hubs and links
existing hubs and links: update 'last_seen_dts'!
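The last-seen rule above (a key that arrives again is not re-inserted, only its `last_seen_dts` is moved forward) can be sketched like this; the dict stands in for a hub or link table, and all names are illustrative:

```python
from datetime import date

# Stand-in for a hub table keyed by business key.
hub = {"C001": {"load_dts": date(2012, 1, 1), "last_seen_dts": date(2012, 1, 1)}}

def register_key(hub, business_key, run_date):
    """Insert a new business key, or only touch last_seen_dts for an existing one."""
    if business_key in hub:
        hub[business_key]["last_seen_dts"] = run_date  # existing: update only
    else:
        hub[business_key] = {"load_dts": run_date, "last_seen_dts": run_date}

register_key(hub, "C001", date(2012, 6, 1))  # existing key: touched
register_key(hub, "C002", date(2012, 6, 1))  # new key: inserted
print(hub["C001"])
```

`load_dts` stays at the first arrival; `last_seen_dts` always reflects the latest run in which the key appeared.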
Link validity satellite
Link has a 'business key': not all hub IDs
Loading the metadata
'design errors'
Checks to avoid debugging (compares design metadata with the Data Vault DB information_schema):
hubs, links, satellites that don't exist in the DV
key columns that do not exist in the DV
missing connection data (source db)
missing attribute columns
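The table-existence check above amounts to a set difference between the design sheet and what `information_schema` reports. A sketch, with hard-coded lists standing in for the metadata tables and the `information_schema.tables` query:

```python
# Illustrative inputs: table names from the design metadata vs. the names
# actually present in the Data Vault database (via information_schema).
designed_tables = {"hub_customer", "sat_customer", "link_order_customer"}
database_tables = {"hub_customer", "sat_customer"}

# Design errors: tables that are designed but do not exist in the DV.
missing_in_dv = sorted(designed_tables - database_tables)
if missing_in_dv:
    print("Not in the Data Vault:", missing_in_dv)
```

The same pattern covers the other checks (key columns, attribute columns, connection data) by differencing the relevant column sets.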
A complete run
Metadata needed for a hub
name
key column
business key column
source table
source table business key column (can be an expression, e.g. concatenation for a composite key)
Job for hub
Transformation for hub
Metadata needed for a link
name
key column
for each hub (maximum 10, can be a ref-table)
hub name
column name for the hub key in the link (roles!)
column in the source table → business key of hub
link 'attributes' (part of key, no hub, maximum 5)
link validity satellite needed?
last seen date needed?
source table
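The link metadata listed above could be captured as one record per link; the shape below is an illustrative sketch (all table and column names are made up), with the maxima from the slides expressed as assertions:

```python
# Illustrative link metadata record: up to 10 hub references (each with a
# role-specific key column in the link) plus up to 5 key attributes.
order_product_link = {
    "name": "link_order_product",
    "key_column": "link_order_product_id",
    "hubs": [
        {"hub": "hub_order",   "link_key_column": "order_id",
         "source_bk_column": "order_nr"},
        {"hub": "hub_product", "link_key_column": "product_id",
         "source_bk_column": "product_code"},
    ],
    "link_attributes": ["line_nr"],  # part of the key, no hub
    "validity_satellite": True,
    "last_seen_date": False,
    "source_table": "stg_order_lines",
}

assert len(order_product_link["hubs"]) <= 10, "at most 10 hubs per link"
assert len(order_product_link["link_attributes"]) <= 5, "at most 5 key attributes"
print(len(order_product_link["hubs"]))  # 2
```

A role appears as a separate hub reference with its own `link_key_column`, which is why the column name in the link is part of the metadata.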
Job for link
Transformation for link
Run table needed for validity satellite?
Lookup hubs
Remove columns not in link
Last seen?
Metadata needed for a hub satellite
name
key column
hub name
column in the source table → business key of hub
for each attribute (maximum 200)
source column → target column
source table
Job for hub satellite
Transformation for hub satellite
Metadata needed for a link satellite
name
key column
link name
for each hub of the link:
column in the source table → business key of hub
for each key attribute: source column
for each attribute: source column → target column
source table
Job for link satellite
Transformation for link satellite
Executing in a loop ..
.. and in parallel
Logging
Configuring log tables for concurrent access
PDI logging
Custom logging
Version Control: PDI objects
Version Control: database objects
Some points of interest
Easy to make a mistake in the design sheet
Generic → a bit harder to maintain and debug
Application/tool to maintain metadata?
Data Vault generators (e.g. Quipu)?
Spinoff using Informatica and Oracle: Sander Robijns
Thanks to: Jos van Dongen, Kasper de Graaf
Sourceforge!