
Data Warehousing 2016
Kent Graziano
Senior Technical Evangelist

2

Agenda
• Bio
• Data Warehousing: Historical Theory
• Data Warehousing: The Reality
• Data Warehousing: The Future
• Closing Thoughts

3

My Bio
• Senior Technical Evangelist, Snowflake Computing
• Oracle ACE Director (DW/BI)
• Certified Data Vault Master and DV 2.0 Practitioner
• Former Member: Boulder BI Brain Trust (#BBBT)
• Member: DAMA International
• Data Architecture and Data Warehouse Specialist
  • 30+ years in IT
  • 25+ years of Oracle-related work
  • 20+ years of data warehousing experience
• Co-Author of
  • The Business of Data Vault Modeling
  • The Data Model Resource Book (1st Edition)
• Blogger – The Data Warrior
• Past President of ODTUG and the Rocky Mountain Oracle User Group

4

What about you?
• Survey says…

Theoretical Architectures

6

What Is a Data Warehouse?

"A subject-oriented, integrated, time-variant, non-volatile collection of data in support of management's decision making process."
– W.H. Inmon

"The data warehouse is where we publish used data."
– Ralph Kimball

7

Data Warehouse
• What is it
  • Centralized location for data
  • "Single source of truth" or "single source of facts"
  • Source of data for reporting, analytics, and offline operational processes
• Who provides it
  • Capital "EDW":
    • Primary: Teradata, Oracle Exadata, IBM Pure Systems, …
    • Secondary: HP Vertica, Pivotal Greenplum
  • "Data warehouse": SQL Server, MySQL, Oracle, …


8

Datamarts
• What are they
  • Databases used to provide fast, independent access to a subset of data
  • Often created for departments, projects, users, …
• Comparison to the data warehouse
  • Similar technology
  • Subset of data
  • Relieves pressure on the EDW
  • Provides a "sandbox" for analysis / analysts


9

Data sources
Traditional
• OLTP databases
  • Oracle, Sybase, DB2, SQL Server, MySQL, Postgres, …
• Enterprise applications
  • ERP, CRM, HR, …
• Traditional third-party data
  • Consumer databases, stock trade data, …

Non-traditional
• Web applications
  • Website applications, mobile applications, …
• New third-party data
  • API data, Twitter, Facebook, Segment, weather, …
• Other
  • Sensors, devices, …


10

Transformation (ETL)
• What is it
  • Getting data from source form into a standard, clean, normalized form
• How it gets done
  • Third-party tools
  • Custom home-grown scripts
  • Hadoop
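As a minimal, hedged sketch of the kind of work such a transformation routine performs (the STG_CUSTOMER and DIM_CUSTOMER tables and every column name here are hypothetical, not from the deck):

  -- Standardize and de-duplicate raw customer rows from a staging table
  -- into a cleaned, conformed dimension table.
  INSERT INTO dim_customer (customer_id, customer_name, country_code, load_dts)
  SELECT
      src.customer_id,
      INITCAP(TRIM(src.customer_name))        AS customer_name,  -- normalize case and whitespace
      UPPER(COALESCE(src.country_code, 'US')) AS country_code,   -- default and standardize codes
      CURRENT_TIMESTAMP                       AS load_dts        -- when this row was loaded
  FROM stg_customer src
  LEFT JOIN dim_customer tgt
         ON tgt.customer_id = src.customer_id
  WHERE tgt.customer_id IS NULL;              -- insert only customers not already present

Real ETL tools add error handling, change detection, and scheduling around logic like this; the SQL is only meant to show the shape of the work.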


11

Direct Data Mart
[Diagram: Source 1, Source 2, and Source 3 feed transformation routines (ETL) directly into independent Sales, Financial, and Customer Service data marts.]

12

Basic "Inmon" Architected Data Warehouse
[Diagram: Source 1, Source 2, and Source 3 feed ETL routines into an Enterprise Data Warehouse; a second set of ETL routines then loads the Sales, Financial, and Customer Service data marts from the EDW.]

13

Corporate Information Factory
[Diagram: operational systems (external, ERP, Internet, legacy, other) feed data acquisition into CIF data management – an operational data store and the data warehouse – with data delivery out to an exploration warehouse, a data mining warehouse, OLAP data marts, and oper marts, plus information feedback to the sources. The factory is supported by metadata management, an information workshop (library & toolbox, workbench), and operation & administration (systems management, data acquisition management, change management, service management).]
© 2002, Intelligent Solutions, Inc. Courtesy of Intelligent Solutions, Inc.

14

DW 2.0™
• Next-generation data warehouse architecture from Bill Inmon
• Superseded the CIF (for some)
• Includes more accommodation and integration of metadata
• Includes integration of "unstructured" data

15

DW 2.0™

16

Data Vault
• Invented and developed by Daniel Linstedt
• New, hybrid modeling approach for enterprise data warehousing
• Introduced with TDAN articles in 2002
• Truly introduces an approach for agile, incremental DW model development
• Called "hyper-normalized" by some
• Methodology adapted from Scott Ambler's Disciplined Agile Delivery (DAD)

17

Data Vault Definition
The Data Vault is a detail-oriented, historical-tracking and uniquely linked set of normalized tables that support one or more functional areas of business.

It is a hybrid approach encompassing the best of breed between 3rd normal form (3NF) and star schema. The design is flexible, scalable, consistent, and adaptable to the needs of the enterprise.
– Dan Linstedt, "Defining the Data Vault," TDAN.com article

Architected specifically to meet the needs of today's enterprise data warehouses.

18

Where does a Data Vault Fit?
© LearnDataVault.com

19

Data Vault: 3 Simple Structures
© LearnDataVault.com

20

Standard Data Vault Model
• Hub: list of UNIQUE business keys
• Link: list of UNIQUE relationships
• Satellite: historical, descriptive data
[Diagram: Hubs for Email ID (Email Information), Bank ID (Bank Transactions), and Passenger ID (Airline Reservations), each with its own satellites; Links connect the hubs and record a history of the interaction. A dashed line indicates a possible new relationship.]
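To make the three structures concrete, here is a minimal DDL sketch in generic SQL; the table and column names are illustrative only, loosely based on the passenger example above, and are not prescribed by the Data Vault standard itself:

  -- Hub: unique list of business keys
  CREATE TABLE hub_passenger (
      passenger_hkey  VARCHAR(32) NOT NULL PRIMARY KEY,  -- hash of the business key
      passenger_id    VARCHAR(50) NOT NULL,              -- the business key itself
      load_dts        TIMESTAMP   NOT NULL,
      record_source   VARCHAR(50) NOT NULL
  );

  -- Link: unique list of relationships between hubs
  CREATE TABLE link_passenger_email (
      link_hkey       VARCHAR(32) NOT NULL PRIMARY KEY,
      passenger_hkey  VARCHAR(32) NOT NULL REFERENCES hub_passenger (passenger_hkey),
      email_hkey      VARCHAR(32) NOT NULL,              -- points to a hub_email table (not shown)
      load_dts        TIMESTAMP   NOT NULL,
      record_source   VARCHAR(50) NOT NULL
  );

  -- Satellite: historical, descriptive attributes hanging off a hub
  CREATE TABLE sat_passenger_details (
      passenger_hkey  VARCHAR(32) NOT NULL REFERENCES hub_passenger (passenger_hkey),
      load_dts        TIMESTAMP   NOT NULL,
      passenger_name  VARCHAR(200),
      loyalty_tier    VARCHAR(20),
      record_source   VARCHAR(50) NOT NULL,
      PRIMARY KEY (passenger_hkey, load_dts)             -- history is kept by load timestamp
  );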

21

Data Vault Extensibility
Adding new components to the EDW has NEAR-ZERO impact on:
• Existing loading processes
• Existing data model
• Existing reporting & BI functions
• Existing source systems
• Existing star schemas and data marts
© LearnDataVault.com
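Continuing the illustrative DDL above, the near-zero-impact claim looks roughly like this in practice: a new descriptive source is added as a brand-new satellite, with no changes to existing hubs, links, satellites, or the processes that load them (again, all names are hypothetical):

  -- New satellite for a newly acquired loyalty-program feed.
  -- Existing tables and existing load jobs are untouched.
  CREATE TABLE sat_passenger_loyalty (
      passenger_hkey  VARCHAR(32) NOT NULL REFERENCES hub_passenger (passenger_hkey),
      load_dts        TIMESTAMP   NOT NULL,
      loyalty_points  INTEGER,
      loyalty_status  VARCHAR(20),
      record_source   VARCHAR(50) NOT NULL,
      PRIMARY KEY (passenger_hkey, load_dts)
  );
  -- Only a new load process for this one table is added; downstream star schemas
  -- and reports keep running until they choose to consume the new attributes.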

Back in the Real World

23

What a Data Warehouse Isn't
• A panacea
• An IT-department-only endeavor
• An excuse to avoid user and IT communication
• A sure-fire way to reduce overhead and increase company / department profits
• The answer to all decision support and reporting needs
• "Just a reporting database"

24

Typical DW/BI environment
[Diagram: data sources (OLTP databases, enterprise applications, web applications, third-party, other) flow through ETL – often with Hadoop in the mix – into the EDW, out to datamarts, and on to BI / analytics tools.]

25

Lots of Hybrids
• Most organizations mix Inmon & Kimball
  • ODS feeding data marts
  • Data marts backed into an EDW
• Off-the-shelf models – customized to work!
• Canned BI apps
  • Oracle BI Apps
• Data Vaults inside a CIF
• Some using Hadoop for staging
• etc.

26

Example: Hybrid – Original Schema Architecture
[Diagram: sources of record (G2, MU, HI, KDW, CI SAS Routines, EDW V1, FDW / PMS, KDW Lite, Lynx, SFDC) feed the MSH EDW. Change data capture (Δ CDC) loads COMN Stage – full copies of the source data structures with additional "plumbing" fields to facilitate capturing subsequent data changes over time – and insert-once loads populate COMN Integration, an enterprise business-key model with key-mapping pointers back to the COMN_STG data. Just-in-time (JIT) transformation (virtual vs. physical) and aggregation build the COMN Presentation layer of star schemas and data marts for reporting (BOBJ, TBLU, Web).]
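For readers unfamiliar with that staging pattern, a hedged sketch of what such "plumbing" fields typically look like (the schema, table, and column names here are illustrative, not the actual MSH EDW design):

  -- A staging table holds a full copy of the source structure
  -- plus extra columns that make change tracking possible.
  CREATE TABLE stg_customer (
      -- columns copied 1:1 from the source system
      customer_id     VARCHAR(50),
      customer_name   VARCHAR(200),
      customer_status VARCHAR(20),
      -- "plumbing" columns added by the load process
      load_dts        TIMESTAMP   NOT NULL,  -- when this row arrived in staging
      record_source   VARCHAR(50) NOT NULL,  -- which source system it came from
      cdc_action      CHAR(1),               -- I/U/D flag from change data capture
      hash_diff       VARCHAR(32)            -- hash of the source columns, used to detect changes
  );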

27

Hoped-for Schema Architecture (Parallel Loads)
[Diagram: the same sources of record (G2, MU, HI, KDW, CI SAS Routines, EDW V1, FDW / PMS, KDW Lite, Lynx, SFDC, MKTG) load the MSH EDW through parallel subject-area pipelines – HI Stage, COMN Stage, and FIN Stage – passing COMN Validation and COMN Integration (FIN, HI, CLIN) on the way to the HI, COMN, and FIN Presentation layers used by BOBJ / BI / reporting.]

28

Actual Schema Architecture
[Diagram: largely the hoped-for layout, with COMN Validation implemented as a data quality (DQ) step; the same sources load HI Stage, COMN Stage, and FIN Stage, then COMN Integration (FIN, HI, CLIN), and finally the HI, COMN, and FIN Presentation layers for BOBJ / BI / reporting against the MSH EDW.]

The Future

30

Today's realities
• Data diversity: external data, machine-generated data, streaming data
• Complexity: complex systems, data pipelines, data silos (EDW, datamarts, Hadoop)
• Barriers to analytics: incomplete data, slow time to access, performance and concurrency barriers

31

Current architectures can't keep up

Data Warehousing
• Complex: manage hardware, data distribution, indexes, …
• Limited elasticity: forklift upgrades, data redistribution, downtime
• Costly: overprovisioning, significant care & feeding

Hadoop
• Complex: specialized skills, new tools
• Limited elasticity: data redistribution, resource contention
• Not a data warehouse: batch-oriented, limited optimization, incomplete security

32

Next Generation – Extended Data Warehouse Architecture (XDW)
[Diagram: operational systems and an operational real-time environment (real-time analysis platform, RT BI services), together with other internal & external structured & multi-structured data and real-time streaming data, flow through a data integration platform and a data refinery into the traditional EDW environment and an investigative computing platform, all feeding analytic tools & applications.]
Slide created by Colin White – BI Research, Inc.
Copyright Intelligent Solutions, Inc. 2015. All Rights Reserved. Used by Permission.

33

What we need to solve for
• Cost containment!
  • More data all the time & more complexity
  • Hard to keep up with infrastructure & skills
• Quicker time to delivery
  • See the data sooner!
• Elasticity
  • On-demand resources
  • True "grid" utility computing
• Security

34

New possibilities with the cloud
• More & more data "born in the cloud"
• Natural integration point for data
• Low-cost, scalable storage
• Capacity on demand

35

What is Snowflake?
• All-new SQL data warehouse
  • No legacy code or constraints
• Delivered as a service
  • Infrastructure, resiliency, optimization built in
• Designed for the cloud
  • Running in Amazon Web Services

36

Our vision: Reinvent the Data Warehouse

Data Warehousing…
• SQL relational database
• Optimized storage & processing
• Standard connectivity – BI, ETL, …

…for Everyone
• Existing SQL skills and tools
• "Load and go" ease of use
• Cloud-based elasticity to fit any scale
• From SQL users & tools to data scientists

37

Brings together diverse data

Structured data (e.g. CSV):

  Apple 101.12 250 FIH-2316
  Pear 56.22 202 IHO-6912
  Orange 98.21 600 WHQ-6090

Semi-structured data (e.g. JSON, Avro, XML):

  {
    "firstName": "John",
    "lastName": "Smith",
    "height_cm": 167.64,
    "address": {
      "streetAddress": "21 2nd Street",
      "city": "New York",
      "state": "NY",
      "postalCode": "10021-3100"
    },
    "phoneNumbers": [
      { "type": "home",   "number": "212 555-1234" },
      { "type": "office", "number": "646 555-4567" }
    ]
  }

• Optimized storage
• Flexible schema
• Relational processing
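As a sketch of what relational processing over that semi-structured record can look like – here using Snowflake's VARIANT type and path syntax, with hypothetical table and column names:

  -- Land the raw JSON in a single VARIANT column, no up-front transformation.
  CREATE TABLE customer_json (v VARIANT);

  -- Query the nested document with ordinary SQL.
  SELECT
      v:firstName::STRING     AS first_name,
      v:address.city::STRING  AS city,
      p.value:number::STRING  AS phone_number
  FROM customer_json,
       LATERAL FLATTEN(input => v:phoneNumbers) p   -- one row per phone number
  WHERE v:address.state::STRING = 'NY';

Structured data (the CSV-style rows above) loads into ordinary relational columns, so both kinds of data can be joined in the same statement.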

38

Designed for the cloud
• Low-cost, scalable cloud storage
  • Never worry about sizing for storage again
• Elastic compute, on demand
  • The exact amount of compute needed, exactly when needed
• Optimized for diverse data
  • Load and optimize semi-structured + structured data without transformation
• Software as a service
  • No knobs, tuning, or infrastructure management

39

A new architecture: multi-cluster, shared data
• Standard interfaces
• A cloud services layer coordinates across the service
• Independent compute clusters access the data
• Data centralized in enterprise-class cloud storage

40

Enabling multi-dimensional scaling
• Elastic scaling for storage: low-cost cloud storage, fully replicated and resilient
• Elastic scaling for compute: virtual warehouses scale up & down on the fly to support workload needs
• Elastic scaling for concurrency: scale concurrency using independent virtual warehouses (e.g. separate warehouses for Finance, Marketing, Operations, Sales, Loading / ETL, and Test / Dev)
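A short sketch of that elasticity using Snowflake's warehouse DDL (the warehouse names are illustrative):

  -- Separate virtual warehouses isolate workloads against the same shared data.
  CREATE WAREHOUSE load_wh WITH WAREHOUSE_SIZE = 'SMALL'  AUTO_SUSPEND = 60;
  CREATE WAREHOUSE bi_wh   WITH WAREHOUSE_SIZE = 'MEDIUM' AUTO_SUSPEND = 300;

  -- Scale a warehouse up for a heavy load window, then back down, on the fly.
  ALTER WAREHOUSE load_wh SET WAREHOUSE_SIZE = 'X-LARGE';
  ALTER WAREHOUSE load_wh SET WAREHOUSE_SIZE = 'SMALL';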

41

Delivered as a service: no infrastructure, knobs, or tuning
• Infrastructure management: virtual hardware and software managed by Snowflake
• Metadata management: automatic statistics collection, scaling, and redundancy
• Query optimization: dynamic optimization, parallelization, and concurrency management instead of manual tuning
• Data storage management: adaptive data distribution, automatic compression, automatic optimization

42

Fits with existing tools & processes
[Diagram: the earlier landscape – complex data infrastructure (complex systems, data pipelines, data silos: EDW, datamarts, Hadoop), data diversity challenges (external, machine-generated, streaming data), and barriers to analysis (incomplete data, delays in access, performance limitations).]

Conclusions?

44

What Have We Learned Over The Years?
• Need results soon
  • Multi-year projects are not acceptable any more
• Executive buy-in ($$$)
• Build incrementally, test, refactor
• Get user feedback RIGHT AWAY!
• Avoid over-analysis
• You will learn as you go

45

Critical Success Factors
A data warehouse will be considered a success if it:
• Can be loaded in a timely manner, regardless of the data type or source
• Can be accessed in an easy fashion, by both data scientists and business users
• Can be understood by the business community
• Is recognized as bringing value to the decision-making process
• Does all of the above for an acceptable TCO

46

An Option to Consider…
Snowflake is:
• …a team of accomplished data experts
  • Funded by top-tier VCs including Altimeter Capital, Redpoint Ventures, Sutter Hill Ventures, and Wing VC
• …who have developed a completely new data warehouse designed for the cloud
  • Data warehouse as a service
  • Multidimensional elasticity
  • Support for all business data – including semi-structured
  • Compelling price:performance

47

SHAMELESS PLUG:
Available on Amazon.com
http://www.amazon.com/Better-Data-Modeling-Introduction-Engineering-ebook/dp/B018BREV1C/

48

Contact Information
Kent Graziano, Snowflake
[email protected]
Twitter: @KentGraziano
More info at http://snowflake.net
Visit my blog at http://kentgraziano.com