
Data Warehousing 2016
Kent Graziano
Senior Technical Evangelist

2

Agenda
• Bio
• Data Warehousing: Historical Theory
• Data Warehousing: The Reality
• Data Warehousing: The Future
• Closing Thoughts

3

My Bio
• Senior Technical Evangelist, Snowflake Computing
• Oracle ACE Director (DW/BI)
• Certified Data Vault Master and DV 2.0 Practitioner
• Former Member: Boulder BI Brain Trust (#BBBT)
• Member: DAMA International
• Data Architecture and Data Warehouse Specialist
  • 30+ years in IT
  • 25+ years of Oracle-related work
  • 20+ years of data warehousing experience
• Co-Author of
  • The Business of Data Vault Modeling
  • The Data Model Resource Book (1st Edition)
• Blogger – The Data Warrior
• Past President of ODTUG and the Rocky Mountain Oracle User Group

4

What about you?
• Survey says…

Theoretical Architectures

6

What Is a Data Warehouse?

"A subject-oriented, integrated, time-variant, non-volatile collection of data in support of management's decision making process."
– W.H. Inmon

"The data warehouse is where we publish used data."
– Ralph Kimball

7

Data Warehouse
• What is it
  • Centralized location for data
  • "Single source of truth" or "single source of facts"
  • Source of data for reporting, analytics, and offline operational processes
• Who provides it
  • Capital "EDW":
    • Primary: Teradata, Oracle Exadata, IBM Pure Systems, …
    • Secondary: HP Vertica, Pivotal Greenplum
  • "Data warehouse": SQL Server, MySQL, Oracle, …


8

Datamarts
• What are they
  • Databases used to provide fast, independent access to a subset of data
  • Often created for departments, projects, users, …
• Comparison to the data warehouse
  • Similar technology
  • Subset of data
  • Relieves pressure on the EDW
  • Provides a "sandbox" for analysis / analysts


9

Data sources
Traditional
• OLTP databases
  • Oracle, Sybase, DB2, SQL Server, MySQL, Postgres, …
• Enterprise applications
  • ERP, CRM, HR, …
• Traditional third-party data
  • Consumer databases, stock trade data, …

Non-traditional
• Web applications
  • Website applications, mobile applications, …
• New third-party data
  • API data, Twitter, Facebook, Segment, weather, …
• Other
  • Sensors, devices, …


10

Transformation (ETL)
• What is it
  • Getting data from source form into a standard, clean, normalized form
• How it gets done
  • Third-party tools
  • Custom home-grown scripts
  • Hadoop
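As a minimal, hedged sketch of the kind of work such a transformation routine performs (the STG_CUSTOMER and DIM_CUSTOMER tables and every column name here are hypothetical, not from the deck):

  -- Standardize and de-duplicate raw customer rows from a staging table
  -- into a cleaned, conformed dimension table.
  INSERT INTO dim_customer (customer_id, customer_name, country_code, load_dts)
  SELECT
      src.customer_id,
      INITCAP(TRIM(src.customer_name))        AS customer_name,  -- normalize case and whitespace
      UPPER(COALESCE(src.country_code, 'US')) AS country_code,   -- default and standardize codes
      CURRENT_TIMESTAMP                       AS load_dts        -- when this row was loaded
  FROM stg_customer src
  LEFT JOIN dim_customer tgt
         ON tgt.customer_id = src.customer_id
  WHERE tgt.customer_id IS NULL;              -- insert only customers not already present

Real ETL tools add error handling, change detection, and scheduling around logic like this; the SQL is only meant to show the shape of the work.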


11

Direct Data Mart
[Diagram: Source 1, Source 2, and Source 3 feed transformation routines (ETL) directly into independent Sales, Financial, and Customer Service data marts.]

12

Basic "Inmon" Architected Data Warehouse
[Diagram: Source 1, Source 2, and Source 3 feed ETL routines into an Enterprise Data Warehouse; a second set of ETL routines then loads the Sales, Financial, and Customer Service data marts from the EDW.]

13

Corporate Information Factory
[Diagram: operational systems (external, ERP, Internet, legacy, other) feed data acquisition into CIF data management – an operational data store and the data warehouse – with data delivery out to an exploration warehouse, a data mining warehouse, OLAP data marts, and oper marts, plus information feedback to the sources. The factory is supported by metadata management, an information workshop (library & toolbox, workbench), and operation & administration (systems management, data acquisition management, change management, service management).]
© 2002, Intelligent Solutions, Inc. Courtesy of Intelligent Solutions, Inc.

14

DW 2.0™
• Next-generation data warehouse architecture from Bill Inmon
• Superseded the CIF (for some)
• Includes more accommodation and integration of metadata
• Includes integration of "unstructured" data

15

DW 2.0™

16

Data Vault
• Invented and developed by Daniel Linstedt
• New, hybrid modeling approach for enterprise data warehousing
• Introduced with TDAN articles in 2002
• Truly introduces an approach for agile, incremental DW model development
• Called "hyper-normalized" by some
• Methodology adapted from Scott Ambler's Disciplined Agile Delivery (DAD)

17

Data Vault Definition
The Data Vault is a detail-oriented, historical-tracking and uniquely linked set of normalized tables that support one or more functional areas of business.

It is a hybrid approach encompassing the best of breed between 3rd normal form (3NF) and star schema. The design is flexible, scalable, consistent, and adaptable to the needs of the enterprise.
– Dan Linstedt, "Defining the Data Vault," TDAN.com article

Architected specifically to meet the needs of today's enterprise data warehouses.

18

Where does a Data Vault Fit?
© LearnDataVault.com

19

Data Vault: 3 Simple Structures
© LearnDataVault.com

20

Standard Data Vault Model
• Hub: list of UNIQUE business keys
• Link: list of UNIQUE relationships
• Satellite: historical, descriptive data
[Diagram: Hubs for Email ID (Email Information), Bank ID (Bank Transactions), and Passenger ID (Airline Reservations), each with its own satellites; Links connect the hubs and record a history of the interaction. A dashed line indicates a possible new relationship.]
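To make the three structures concrete, here is a minimal DDL sketch in generic SQL; the table and column names are illustrative only, loosely based on the passenger example above, and are not prescribed by the Data Vault standard itself:

  -- Hub: unique list of business keys
  CREATE TABLE hub_passenger (
      passenger_hkey  VARCHAR(32) NOT NULL PRIMARY KEY,  -- hash of the business key
      passenger_id    VARCHAR(50) NOT NULL,              -- the business key itself
      load_dts        TIMESTAMP   NOT NULL,
      record_source   VARCHAR(50) NOT NULL
  );

  -- Link: unique list of relationships between hubs
  CREATE TABLE link_passenger_email (
      link_hkey       VARCHAR(32) NOT NULL PRIMARY KEY,
      passenger_hkey  VARCHAR(32) NOT NULL REFERENCES hub_passenger (passenger_hkey),
      email_hkey      VARCHAR(32) NOT NULL,              -- points to a hub_email table (not shown)
      load_dts        TIMESTAMP   NOT NULL,
      record_source   VARCHAR(50) NOT NULL
  );

  -- Satellite: historical, descriptive attributes hanging off a hub
  CREATE TABLE sat_passenger_details (
      passenger_hkey  VARCHAR(32) NOT NULL REFERENCES hub_passenger (passenger_hkey),
      load_dts        TIMESTAMP   NOT NULL,
      passenger_name  VARCHAR(200),
      loyalty_tier    VARCHAR(20),
      record_source   VARCHAR(50) NOT NULL,
      PRIMARY KEY (passenger_hkey, load_dts)             -- history is kept by load timestamp
  );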

21

Data Vault Extensibility
Adding new components to the EDW has NEAR-ZERO impact on:
• Existing loading processes
• Existing data model
• Existing reporting & BI functions
• Existing source systems
• Existing star schemas and data marts
© LearnDataVault.com
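Continuing the illustrative DDL above, the near-zero-impact claim looks roughly like this in practice: a new descriptive source is added as a brand-new satellite, with no changes to existing hubs, links, satellites, or the processes that load them (again, all names are hypothetical):

  -- New satellite for a newly acquired loyalty-program feed.
  -- Existing tables and existing load jobs are untouched.
  CREATE TABLE sat_passenger_loyalty (
      passenger_hkey  VARCHAR(32) NOT NULL REFERENCES hub_passenger (passenger_hkey),
      load_dts        TIMESTAMP   NOT NULL,
      loyalty_points  INTEGER,
      loyalty_status  VARCHAR(20),
      record_source   VARCHAR(50) NOT NULL,
      PRIMARY KEY (passenger_hkey, load_dts)
  );
  -- Only a new load process for this one table is added; downstream star schemas
  -- and reports keep running until they choose to consume the new attributes.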

Back in the Real World

23

What a Data Warehouse Isn't
• A panacea
• An IT-department-only endeavor
• An excuse to avoid user and IT communication
• A sure-fire way to reduce overhead and increase company / department profits
• The answer to all decision support and reporting needs
• "Just a reporting database"

24

Typical DW/BI environment
[Diagram: data sources (OLTP databases, enterprise applications, web applications, third-party, other) flow through ETL – often with Hadoop in the mix – into the EDW, out to datamarts, and on to BI / analytics tools.]

25

Lots of Hybrids
• Most organizations mix Inmon & Kimball
  • ODS feeding data marts
  • Data marts backed into an EDW
• Off-the-shelf models – customized to work!
• Canned BI apps
  • Oracle BI Apps
• Data Vaults inside a CIF
• Some using Hadoop for staging
• etc.

26

Example: Hybrid – Original Schema Architecture
[Diagram: sources of record (G2, MU, HI, KDW, CI SAS Routines, EDW V1, FDW / PMS, KDW Lite, Lynx, SFDC) feed the MSH EDW. Change data capture (Δ CDC) loads COMN Stage – full copies of the source data structures with additional "plumbing" fields to facilitate capturing subsequent data changes over time – and insert-once loads populate COMN Integration, an enterprise business-key model with key-mapping pointers back to the COMN_STG data. Just-in-time (JIT) transformation (virtual vs. physical) and aggregation build the COMN Presentation layer of star schemas and data marts for reporting (BOBJ, TBLU, Web).]
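For readers unfamiliar with that staging pattern, a hedged sketch of what such "plumbing" fields typically look like (the schema, table, and column names here are illustrative, not the actual MSH EDW design):

  -- A staging table holds a full copy of the source structure
  -- plus extra columns that make change tracking possible.
  CREATE TABLE stg_customer (
      -- columns copied 1:1 from the source system
      customer_id     VARCHAR(50),
      customer_name   VARCHAR(200),
      customer_status VARCHAR(20),
      -- "plumbing" columns added by the load process
      load_dts        TIMESTAMP   NOT NULL,  -- when this row arrived in staging
      record_source   VARCHAR(50) NOT NULL,  -- which source system it came from
      cdc_action      CHAR(1),               -- I/U/D flag from change data capture
      hash_diff       VARCHAR(32)            -- hash of the source columns, used to detect changes
  );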

27

Hoped-for Schema Architecture (Parallel Loads)
[Diagram: the same sources of record (G2, MU, HI, KDW, CI SAS Routines, EDW V1, FDW / PMS, KDW Lite, Lynx, SFDC, MKTG) load the MSH EDW through parallel subject-area pipelines – HI Stage, COMN Stage, and FIN Stage – passing COMN Validation and COMN Integration (FIN, HI, CLIN) on the way to the HI, COMN, and FIN Presentation layers used by BOBJ / BI / reporting.]

28

Actual Schema Architecture
[Diagram: largely the hoped-for layout, with COMN Validation implemented as a data quality (DQ) step; the same sources load HI Stage, COMN Stage, and FIN Stage, then COMN Integration (FIN, HI, CLIN), and finally the HI, COMN, and FIN Presentation layers for BOBJ / BI / reporting against the MSH EDW.]

The Future

30

Today's realities
• Data diversity: external data, machine-generated data, streaming data
• Complexity: complex systems, data pipelines, data silos (EDW, datamarts, Hadoop)
• Barriers to analytics: incomplete data, slow time to access, performance and concurrency barriers

31

Current architectures can't keep up

Data Warehousing
• Complex: manage hardware, data distribution, indexes, …
• Limited elasticity: forklift upgrades, data redistribution, downtime
• Costly: overprovisioning, significant care & feeding

Hadoop
• Complex: specialized skills, new tools
• Limited elasticity: data redistribution, resource contention
• Not a data warehouse: batch-oriented, limited optimization, incomplete security

32

Next Generation – Extended Data Warehouse Architecture (XDW)
[Diagram: operational systems and an operational real-time environment (real-time analysis platform, RT BI services), together with other internal & external structured & multi-structured data and real-time streaming data, flow through a data integration platform and a data refinery into the traditional EDW environment and an investigative computing platform, all feeding analytic tools & applications.]
Slide created by Colin White – BI Research, Inc.
Copyright Intelligent Solutions, Inc. 2015. All Rights Reserved. Used by Permission.

33

What we need to solve for
• Cost containment!
  • More data all the time & more complexity
  • Hard to keep up with infrastructure & skills
• Quicker time to delivery
  • See the data sooner!
• Elasticity
  • On-demand resources
  • True "grid" utility computing
• Security

34

New possibilities with the cloud
• More & more data "born in the cloud"
• Natural integration point for data
• Low-cost, scalable storage
• Capacity on demand

35

What is Snowflake?
• All-new SQL data warehouse
  • No legacy code or constraints
• Delivered as a service
  • Infrastructure, resiliency, optimization built in
• Designed for the cloud
  • Running in Amazon Web Services

36

Our vision: Reinvent the Data Warehouse

Data Warehousing…
• SQL relational database
• Optimized storage & processing
• Standard connectivity – BI, ETL, …

…for Everyone
• Existing SQL skills and tools
• "Load and go" ease of use
• Cloud-based elasticity to fit any scale
• From SQL users & tools to data scientists

37

Brings together diverse data

Structured data (e.g. CSV):

  Apple 101.12 250 FIH-2316
  Pear 56.22 202 IHO-6912
  Orange 98.21 600 WHQ-6090

Semi-structured data (e.g. JSON, Avro, XML):

  {
    "firstName": "John",
    "lastName": "Smith",
    "height_cm": 167.64,
    "address": {
      "streetAddress": "21 2nd Street",
      "city": "New York",
      "state": "NY",
      "postalCode": "10021-3100"
    },
    "phoneNumbers": [
      { "type": "home",   "number": "212 555-1234" },
      { "type": "office", "number": "646 555-4567" }
    ]
  }

• Optimized storage
• Flexible schema
• Relational processing
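As a sketch of what relational processing over that semi-structured record can look like – here using Snowflake's VARIANT type and path syntax, with hypothetical table and column names:

  -- Land the raw JSON in a single VARIANT column, no up-front transformation.
  CREATE TABLE customer_json (v VARIANT);

  -- Query the nested document with ordinary SQL.
  SELECT
      v:firstName::STRING     AS first_name,
      v:address.city::STRING  AS city,
      p.value:number::STRING  AS phone_number
  FROM customer_json,
       LATERAL FLATTEN(input => v:phoneNumbers) p   -- one row per phone number
  WHERE v:address.state::STRING = 'NY';

Structured data (the CSV-style rows above) loads into ordinary relational columns, so both kinds of data can be joined in the same statement.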

38

Designed for the cloud
• Low-cost, scalable cloud storage
  • Never worry about sizing for storage again
• Elastic compute, on demand
  • The exact amount of compute needed, exactly when needed
• Optimized for diverse data
  • Load and optimize semi-structured + structured data without transformation
• Software as a service
  • No knobs, tuning, or infrastructure management

39

A new architecture: multi-cluster, shared data
• Standard interfaces
• A cloud services layer coordinates across the service
• Independent compute clusters access the data
• Data centralized in enterprise-class cloud storage

40

Enabling multi-dimensional scaling
• Elastic scaling for storage: low-cost cloud storage, fully replicated and resilient
• Elastic scaling for compute: virtual warehouses scale up & down on the fly to support workload needs
• Elastic scaling for concurrency: scale concurrency using independent virtual warehouses (e.g. separate warehouses for Finance, Marketing, Operations, Sales, Loading / ETL, and Test / Dev)
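A short sketch of that elasticity using Snowflake's warehouse DDL (the warehouse names are illustrative):

  -- Separate virtual warehouses isolate workloads against the same shared data.
  CREATE WAREHOUSE load_wh WITH WAREHOUSE_SIZE = 'SMALL'  AUTO_SUSPEND = 60;
  CREATE WAREHOUSE bi_wh   WITH WAREHOUSE_SIZE = 'MEDIUM' AUTO_SUSPEND = 300;

  -- Scale a warehouse up for a heavy load window, then back down, on the fly.
  ALTER WAREHOUSE load_wh SET WAREHOUSE_SIZE = 'X-LARGE';
  ALTER WAREHOUSE load_wh SET WAREHOUSE_SIZE = 'SMALL';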

41

Delivered as a service: no infrastructure, knobs, or tuning
• Infrastructure management: virtual hardware and software managed by Snowflake
• Metadata management: automatic statistics collection, scaling, and redundancy
• Query optimization: dynamic optimization, parallelization, and concurrency management instead of manual tuning
• Data storage management: adaptive data distribution, automatic compression, automatic optimization

42

Fits with existing tools & processes
[Diagram: the earlier landscape – complex data infrastructure (complex systems, data pipelines, data silos: EDW, datamarts, Hadoop), data diversity challenges (external, machine-generated, streaming data), and barriers to analysis (incomplete data, delays in access, performance limitations).]

Conclusions?

44

What Have We Learned Over The Years?
• Need results soon
  • Multi-year projects are not acceptable any more
• Executive buy-in ($$$)
• Build incrementally, test, refactor
• Get user feedback RIGHT AWAY!
• Avoid over-analysis
• You will learn as you go

45

Critical Success Factors
A data warehouse will be considered a success if it:
• Can be loaded in a timely manner, regardless of the data type or source
• Can be accessed in an easy fashion, by both data scientists and business users
• Can be understood by the business community
• Is recognized as bringing value to the decision-making process
• Does all of the above for an acceptable TCO

46

An Option to Consider…
Snowflake is:
• …a team of accomplished data experts
  • Funded by top-tier VCs including Altimeter Capital, Redpoint Ventures, Sutter Hill Ventures, and Wing VC
• …who have developed a completely new data warehouse designed for the cloud
  • Data warehouse as a service
  • Multidimensional elasticity
  • Support for all business data – including semi-structured
  • Compelling price:performance

47

SHAMELESS PLUG:
Available on Amazon.com
http://www.amazon.com/Better-Data-Modeling-Introduction-Engineering-ebook/dp/B018BREV1C/

48

Contact Information
Kent Graziano, Snowflake
[email protected]
Twitter: @KentGraziano
More info at http://snowflake.net
Visit my blog at http://kentgraziano.com