data warehousing 2016
TRANSCRIPT
2
Agenda
• Bio
• Data Warehousing: Historical Theory
• Data Warehousing: The Reality
• Data Warehousing: The Future
• Closing Thoughts
3
My Bio
• Senior Technical Evangelist, Snowflake Computing
• Oracle ACE Director (DW/BI)
• Certified Data Vault Master and DV 2.0 Practitioner
• Former Member: Boulder BI Brain Trust (#BBBT)
• Member: DAMA International
• Data Architecture and Data Warehouse Specialist
  • 30+ years in IT
  • 25+ years of Oracle-related work
  • 20+ years of data warehousing experience
• Co-Author of
  • The Business of Data Vault Modeling
  • The Data Model Resource Book (1st Edition)
• Blogger – The Data Warrior
• Past-President of ODTUG and Rocky Mountain Oracle User Group
6
What Is a Data Warehouse?
“A subject-oriented, integrated, time-variant, non-volatile collection of data in support of management’s decision making process.”
W.H. Inmon
“The data warehouse is where we publish used data.”
Ralph Kimball
7
Data Warehouse
• What is it
  • Centralized location for data
  • “Single source of truth” or “Single source of facts”
  • Source of data for reporting, analytics, and offline operational processes
• Who is it
  • Capital ‘EDW’:
    • Primary: Teradata, Oracle Exadata, IBM Pure Systems, …
    • Secondary: HP Vertica, Pivotal Greenplum
  • “Data warehouse”: SQL Server, MySQL, Oracle, …
Proprietary and Confidential
8
Datamarts
• What are they
  • Databases used to provide fast, independent access to a subset of data
  • Often created for departments, projects, users, …
• Comparison to data warehouse
  • Similar technology
  • Subset of data
  • Relieves pressure on EDW
  • Provides “sandbox” for analysis / analysts
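To make the "subset of data" idea concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table names (`edw_orders`, `sales_mart`), columns, and rows are hypothetical and chosen only for illustration; a real data mart would usually live in a separate database or schema.

```python
import sqlite3

# Hypothetical warehouse table; names and rows are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE edw_orders (order_id INTEGER, department TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO edw_orders VALUES (?, ?, ?)",
    [(1, "sales", 120.0), (2, "finance", 75.5), (3, "sales", 42.0)],
)

# The "data mart": a separate table holding only the sales subset,
# so analysts can query it without touching the warehouse table.
conn.execute(
    "CREATE TABLE sales_mart AS "
    "SELECT order_id, amount FROM edw_orders WHERE department = 'sales'"
)

sales_rows = conn.execute(
    "SELECT order_id, amount FROM sales_mart ORDER BY order_id"
).fetchall()
print(sales_rows)
```

The mart uses the same technology as the warehouse table but carries only the rows and columns one department needs, which is the comparison the slide draws.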
9
Data sources
Traditional
• OLTP databases
  • Oracle, Sybase, DB2, SQL Server, MySQL, Postgres, …
• Enterprise applications
  • ERP, CRM, HR, …
• Traditional third-party data
  • Consumer databases, stock trade data, …
Non-traditional
• Web applications
  • Website applications, mobile applications, …
• New third-party data
  • API data, Twitter, Facebook, Segment, weather, …
• Other
  • Sensors, devices, …
10
Transformation (ETL)
• What is it
  • Getting data from source form into a standard, clean, normalized form
• How it gets done
  • Third-party tools
  • Custom home-grown scripts
  • Hadoop
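As a rough illustration of the "custom home-grown script" route, the sketch below cleans a couple of raw rows into a standard form. The field names, the MM/DD/YYYY source date format, and the currency formatting are all assumptions made for the example, not details from the slides.

```python
from datetime import datetime


def transform(raw_rows):
    """Turn raw source rows into a clean, standardized form."""
    clean = []
    for row in raw_rows:
        # Trim stray whitespace and standardize capitalization.
        name = row["name"].strip().title()
        # Normalize the date to ISO 8601 (assumes MM/DD/YYYY in the source).
        signup = datetime.strptime(row["signup"], "%m/%d/%Y").date().isoformat()
        # Strip currency formatting and convert to a number.
        amount = float(row["amount"].replace("$", "").replace(",", ""))
        clean.append({"name": name, "signup_date": signup, "amount": amount})
    return clean


# Hypothetical rows as they might arrive from a source system.
raw = [
    {"name": "  alice SMITH ", "signup": "01/15/2016", "amount": "$1,200.50"},
    {"name": "bob jones", "signup": "12/03/2015", "amount": "75"},
]
cleaned = transform(raw)
print(cleaned)
```

Third-party ETL tools and Hadoop jobs do the same kind of work at larger scale; the transformation logic itself stays this simple in spirit.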
11
Direct Data Mart
[Diagram: Source 1, Source 2, and Source 3 feed Transformation Routines (ETL), which load the Sales, Financial, and Customer Service data marts directly.]
12
Basic “Inmon” Architected Data Warehouse
[Diagram: Source 1, Source 2, and Source 3 feed ETL routines into the Enterprise Data Warehouse; further ETL routines load the Sales, Financial, and Customer Service data marts from it.]
13
Corporate Information Factory
[Diagram labels: Operational Systems (External, ERP, Internet, Legacy, Other); Data Acquisition; CIF Data Management; Operational Data Store; Data Warehouse; Data Delivery; Exploration Warehouse; Data Mining Warehouse; OLAP Data Mart; Oper Mart; Information Feedback; Information Workshop; Library & Toolbox; Workbench; Meta Data Management; Operation & Administration; Systems Management; Data Acquisition Management; Service Management; Change Management; interface labels API, DSI, TrI.]
© 2002, Intelligent Solutions, Inc.
Courtesy of Intelligent Solutions, Inc.
14
DW 2.0™
• Next Generation data warehouse architecture from Bill Inmon
• Superseded CIF (for some)
• Includes more accommodation and integration of meta data
• Includes integration of “unstructured” data
16
Data Vault
• Invented and developed by Daniel Linstedt
• New, hybrid modeling for enterprise data warehousing
• Introduced with TDAN articles in 2002
• Truly introduces an approach for agile, incremental DW model development
• Called “hyper normalized” by some
• Methodology adapted from Scott Ambler’s Disciplined Agile Delivery (DAD)
17
Data Vault Definition
The Data Vault is a detail oriented, historical tracking and uniquely linked set of normalized tables that support one or more functional areas of business.
It is a hybrid approach encompassing the best of breed between 3rd normal form (3NF) and star schema. The design is flexible, scalable, consistent and adaptable to the needs of the enterprise.
Dan Linstedt: Defining the Data Vault (TDAN.com article)
Architected specifically to meet the needs of today’s enterprise data warehouses
20
Standard Data Vault Model
• Hub: List of UNIQUE business keys.
• Link: List of UNIQUE relationships.
• Satellite: Historical descriptive data.
[Diagram: three hubs (Email ID, Bank ID, Passenger ID), each with several satellites, covering Email Information, Bank Transactions, and Airline Reservations. A link with its own satellite connects the hubs and records a history of the interaction. A dashed line marks a possible new relationship. Legend: Hub, Link, Satellite.]
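The three component types can be sketched with plain Python collections. This is a simplified illustration, not a full Data Vault implementation: it omits hash keys, record sources, and load metadata, and the passenger-to-email relationship, the `name` attribute, and the `load_passenger` helper are hypothetical choices made for the example.

```python
from datetime import date

hub_passenger = set()         # Hub: unique business keys
hub_email = set()             # Hub: unique business keys
link_passenger_email = set()  # Link: unique relationships between hub keys
sat_passenger = []            # Satellite: descriptive history, append-only


def load_passenger(passenger_id, email_id, name, load_date):
    """Load one source record into hubs, link, and satellite."""
    # Sets ignore duplicates, so keys and relationships stay unique.
    hub_passenger.add(passenger_id)
    hub_email.add(email_id)
    link_passenger_email.add((passenger_id, email_id))
    # Add a satellite row only when the descriptive data has changed.
    history = [r for r in sat_passenger if r["passenger_id"] == passenger_id]
    if not history or history[-1]["name"] != name:
        sat_passenger.append(
            {"passenger_id": passenger_id, "name": name, "load_date": load_date}
        )


load_passenger("P1", "E1", "J. Smith", date(2016, 1, 1))
load_passenger("P1", "E1", "J. Smith", date(2016, 1, 2))    # unchanged: no new row
load_passenger("P1", "E1", "John Smith", date(2016, 1, 3))  # changed: new history row
print(len(hub_passenger), len(link_passenger_email), len(sat_passenger))
```

Because history lives only in satellites, a new descriptive attribute or a new relationship can be added as a new satellite or link without altering the existing hubs.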
21
Data Vault Extensibility
Adding new components to the EDW has NEAR ZERO impact on:
• Existing Loading Processes
• Existing Data Model
• Existing Reporting & BI Functions
• Existing Source Systems
• Existing Star Schemas and Data Marts
(C) LearnDataVault.com
23
What a Data Warehouse Isn’t
• A panacea
• An IT department endeavor alone
• A reason to avoid communication between users and IT
• The sure-fire way to reduce overhead and increase company / department profits
• The answer to all decision support and reporting needs
• “Just a reporting database”
24
Typical DW/BI environment
[Diagram labels: Data sources (OLTP databases, Enterprise applications, Web applications, Third-party, Other); ETL; Hadoop; EDW; Datamarts; BI / Analytics.]
25
Lots of Hybrids
• Most organizations mix Inmon & Kimball
• ODS feeding Data Marts
• Data Marts backed into an EDW
• Off the Shelf models – customized to work!
• Canned BI apps
  • Oracle BI Apps
• Data Vaults inside a CIF
• Some using Hadoop for Staging
• etc.
26
Example: Hybrid - Original Schema Architecture (MSH EDW)
[Diagram, reconstructed from its labels:]
• Source(s) of Record: G2, MU, HI, KDW, CI SAS Routines, EDW V1, FDW / PMS, KDW Lite, Lynx, SFDC
• Δ CDC; Insert 1X only
• COMN Stage: full copies of source data structures with additional plumbing fields to facilitate capturing subsequent data changes over time
• COMN Integration: enterprise business key model with key mapping pointers to COMN_STG data
• JIT Transformation (Virtual v. Physical)
• COMN Presentation
• Star Schema(s); Data Marts
• Reporting: BOBJ, Web, TBLU
27
Hoped for Schema Architecture (Parallel Loads) (MSH EDW)
[Diagram, reconstructed from its labels:]
• Source(s) of Record: G2, MU, HI, KDW, CI SAS Routines, EDW V1, FDW / PMS, KDW Lite, Lynx, SFDC
• Stage: HI Stage, COMN Stage, FIN Stage
• COMN Validation
• COMN Integration
• Presentation: FIN Presentation, HI Presentation, COMN Presentation
• Other labels: FIN, HI, CLIN, MKTG
• BOBJ / BI / Reporting
28
Actual Schema Architecture (MSH EDW)
[Diagram, reconstructed from its labels:]
• Source(s) of Record: G2, MU, HI, KDW, CI SAS Routines, EDW V1, FDW / PMS, KDW Lite, Lynx, SFDC
• Stage: HI Stage, COMN Stage, FIN Stage
• COMN Validation (DQ)
• COMN Integration
• Presentation: FIN Presentation, HI Presentation, COMN Presentation
• Other labels: FIN, HI, CLIN, MKTG
• BOBJ / BI / Reporting
30
Today’s realities
• Data diversity: External data, machine-generated data, streaming data
• Complexity: Complex systems, data pipelines, data silos
• Barriers to analytics: Incomplete data, slow time to access, performance and concurrency barriers
[Diagram labels: EDW, Datamarts, Hadoop]
31
Current architectures can’t keep up
Data Warehousing
• Complex: manage hardware, data distribution, indexes, …
• Limited elasticity: forklift upgrades, data redistribution, downtime
• Costly: overprovisioning, significant care & feeding
Hadoop
• Complex: specialized skills, new tools
• Limited elasticity: data redistribution, resource contention
• Not a data warehouse: batch-oriented, limited optimization, incomplete security
32
Next Generation – Extended Data Warehouse Architecture (XDW)
[Diagram labels: Operational systems; Real-time streaming data; Other internal & external structured & multi-structured data; Data integration platform; Data refinery; Traditional EDW environment; Investigative computing platform; Analytic tools & applications; Operational real-time environment; RT analysis platform; RT BI services.]
Slide created by Colin White – BI Research, Inc.
Copyright Intelligent Solutions, Inc. 2015. All Rights Reserved. Used by Permission
33
What we need to solve for
• Cost Containment!
  • More data all the time & more complexity
  • Hard to keep up infrastructure & skills
• Quicker time to delivery
  • See the data sooner!
• Elasticity
  • On demand resources
  • True “grid” utility computing
• Security
34
New possibilities with the cloud
• More & more data “born in the cloud”
• Natural integration point for data
• Low-cost, scalable storage
• Capacity on demand
35
What is Snowflake?
• All-new SQL data warehouse: No legacy code or constraints
• Delivered as a service: Infrastructure, resiliency, optimization built in
• Designed for the cloud: Running in Amazon Web Services
36
Our vision: Reinvent the Data Warehouse
Data Warehousing…
• SQL relational database
• Optimized storage & processing
• Standard connectivity – BI, ETL, …
…for Everyone
• Existing SQL skills and tools
• “Load and go” ease of use
• Cloud-based elasticity to fit any scale
[Diagram labels: Data scientists, SQL users & tools]
37
Brings together diverse data
Structured data (e.g. CSV)
Apple 101.12 250 FIH-2316
Pear 56.22 202 IHO-6912
Orange 98.21 600 WHQ-6090
Semi-structured data (e.g. JSON, Avro, XML)
{
  "firstName": "John",
  "lastName": "Smith",
  "height_cm": 167.64,
  "address": {
    "streetAddress": "21 2nd Street",
    "city": "New York",
    "state": "NY",
    "postalCode": "10021-3100"
  },
  "phoneNumbers": [
    { "type": "home", "number": "212 555-1234" },
    { "type": "office", "number": "646 555-4567" }
  ]
}
• Optimized storage
• Flexible schema
• Relational processing
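To show what "relational processing" over semi-structured data can mean, the sketch below flattens the JSON record from this slide into one row per phone number using only Python's standard json module. It illustrates the general idea and makes no claim about how Snowflake stores or queries such data; the choice of output columns is an assumption for the example.

```python
import json

# The JSON document shown on the slide.
record = json.loads("""
{
  "firstName": "John", "lastName": "Smith", "height_cm": 167.64,
  "address": {"streetAddress": "21 2nd Street", "city": "New York",
              "state": "NY", "postalCode": "10021-3100"},
  "phoneNumbers": [
    {"type": "home", "number": "212 555-1234"},
    {"type": "office", "number": "646 555-4567"}
  ]
}
""")

# One relational row per phone number, with the parent fields repeated.
phone_rows = [
    (
        record["firstName"],
        record["lastName"],
        record["address"]["city"],  # nested object accessed by path
        phone["type"],
        phone["number"],
    )
    for phone in record["phoneNumbers"]  # nested array expanded into rows
]

for row in phone_rows:
    print(row)
```

Once flattened this way, the rows can be joined and aggregated alongside structured data such as the CSV rows above.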
38
Designed for the cloud
• Low-cost, scalable cloud storage: Never worry about sizing for storage again
• Elastic compute, on demand: Exact amount of compute needed, exactly when needed
• Optimized for diverse data: Load and optimize semi-structured + structured data without transformation
• Software as a service: No knobs, tuning, or infrastructure management
39
A new architecture: multi-cluster, shared data
• Standard interfaces
• Cloud services layer coordinates across the service
• Independent compute clusters access data
• Data centralized in enterprise-class cloud storage
40
Enabling multi-dimensional scaling
• Elastic scaling for storage: Low-cost cloud storage, fully replicated and resilient
• Elastic scaling for compute: Virtual warehouses scale up & down on the fly to support workload needs
• Elastic scaling for concurrency: Scale concurrency using independent virtual warehouses
[Diagram labels: Finance, Marketing, Operations, Loading / ETL, Sales, Test / Dev]
41
Delivered as a service: no infrastructure, knobs, or tuning
• Infrastructure management: Virtual hardware and software managed by Snowflake
• Metadata management: Automatic statistics collection, scaling, and redundancy
• Manual query optimization: Dynamic optimization, parallelization, and concurrency management
• Data storage management: Adaptive data distribution, automatic compression, automatic optimization
42
Fits with existing tools & processes
• Complex Data Infrastructure: Complex systems, data pipelines, data silos
• Data Diversity Challenges: External data, machine-generated data, streaming data
• Barriers to Analysis: Analysis limited by incomplete data, delays in access, performance limitations
[Diagram labels: EDW, Datamarts, Hadoop]
44
What Have We Learned Over The Years?
• Need results soon
  • Multi-year projects are no longer acceptable
• Executive buy-in ($$$)
• Build incrementally, test, refactor
• Get user feedback RIGHT AWAY!
• Avoid over-analysis
  • You will learn as you go
45
Critical Success Factors
• A data warehouse will be considered a success if it:
  • Can be loaded in a timely manner
    • Regardless of the data type or source
  • Can be accessed in an easy fashion
    • By both data scientists and business users
  • Can be understood by the business community
  • Is recognized as bringing value to the decision making process
    • For an acceptable TCO
46
An Option to Consider…
Snowflake is:
• …a team of accomplished data experts
  • Funded by top-tier VCs including Altimeter Capital, Redpoint Ventures, Sutter Hill Ventures, Wing VC
• …who have developed a completely new data warehouse designed for the cloud
  • Data warehouse as a service
  • Multidimensional elasticity
  • Support for all business data – including semi-structured
  • Compelling price:performance
47
SHAMELESS PLUG:
Available on Amazon.com
http://www.amazon.com/Better-Data-Modeling-Introduction-Engineering-ebook/dp/B018BREV1C/
48
Contact Information
Kent Graziano
Snowflake [email protected]
Twitter @KentGraziano
More info at http://snowflake.net
Visit my blog at http://kentgraziano.com