Healthcare Best Practices in Data Warehousing & Analytics
TRANSCRIPT
Data Warehousing
A Look Back, Moving Forward
Dale Sanders, June 2005
2
Introduction & Warnings
Why am I here?
  Teach: stimulate some thought; share some of my experiences and lessons
  Learn: from you, please… ask questions, challenge opinions, share your knowledge
I'll do my best to live up to my end of the bargain
Warnings: the pictures in this presentation may or may not have any relevance whatsoever to the topic or slide; they are mostly intended to break up the monotony
3
Expectation Management
What to expect: DW strengths (according to others)
  I know what not to do as much as I know what to do; I've seen and made all the big mistakes
  Vision, strategy, system architecture, data management & DW modeling, complex cultural issues, "leapfrog" problem solving
What not to expect: DW weaknesses
  My programming skills suck; I haven't written a decent line of code in four years! Some might say it's been 24 years…
  My knowledge of leading products is very rusty, though I'm beefing up on Microsoft and Cognos
Within these expectations, make no mistake about it… I know data warehousing
4
Today's Discussions
I am a good "idea guy," but ideas are worthless without someone to implement and enhance them
  Steve Barlow, Dan Lidgard, Jon Despain, Chuck Lyon, Laure Shull, Kris Mitchell, Peter Hess, Ron Gault, Rob Carpenter, my wife, and many others
My greatest strength and blessing: the ability to recognize, listen to, and hold onto good people (knock on wood)
My achievements in personal and professional life are more a function of those around me than a reflection on me
5
DW Best Practices: The Most Important Metrics
Employee satisfaction: without it, long-term customer satisfaction is impossible
Customer satisfaction: that's the nature of the Information Services career field; some people in our profession still don't get it. We are here to serve.
The Organizational Laugh Metric: how many times do you hear laughter in the day-to-day operations of your team? It is the single most important vital sign of organizational health and business success.
6
My Background
Three eight-year chapters
  Captain, Information Systems Engineer, US Air Force: nuclear warfare battle management; force status data integration; intelligence and attack warning data "fusion"
  Consultant in several industries
    TRW: CIA Data Center; TRW Credit Reporting Data Base
    National Security Agency (NSA)
    Intel: New Mexico Data Repository (NMDR)
    Air Force: Integrated Minuteman Data Base (IMDB); Peacekeeper Information Retrieval System (PIRS)
    Many others…
  Healthcare: Intermountain Health Care Enterprise Data Warehouse; consultant to other healthcare organizations' data warehouses; now at Northwestern University Medical System
7
Overview
Data warehousing history, according to Sanders
  Why and how did this become a sub-specialty in information systems? What have we learned so far?
My take on "Best Practices": key lessons learned
My thoughts on the most popular authors in the field: what they contribute, where they detract
8
Data Warehousing History
"Newspaper Rock," 100 B.C.
American Retail, 2005 A.D.
Lots of stuff happened
9
What Happened in the Cloud?
Stage 1: Laziness
  Operators grew tired of hanging tapes in response to requests for historical financial data, so they stored the data on-line, in "unauthorized" mainframe databases
Stage 2: End of the mainframe bully
  Computing moved out from finance to the rest of the business; Unix and relational databases; distributed computing created islands of information
Stage 2.1: The government gets involved
  Consolidating IRS and military databases to save money on mainframes: "Hey, look what I can do with this data…"
Stage 3: Deming comes along
  Push towards constant business "reengineering"; a cultural emphasis on "continuous quality improvement" and "business innovation" drives the need for data
Stage 4: Data warehousing has its own language
  Ralph Kimball publishes "The Data Warehouse Toolkit"
10
The Real Truth
Data warehousing is a symptom of a problem: our technological inability to deploy single-platform information systems that
  Capture data once and reuse it throughout an enterprise
  Support high transaction rates (single-record CREATE, SELECT, UPDATE, DELETE) and analytic queries on the same computing platform, with the same data, at the same time
Someday, maybe, we will address the root cause; until then, it's a good way to make a living
11
The "Ideal Library" Practice
  Stores all of the books and other reference material you need to conduct your research: the enterprise data warehouse
  A single place to visit: one database environment
  Contents are kept current and refreshed: timely, well-choreographed data loads
  Staffed with friendly, knowledgeable people who can help you find your way around: your data warehouse team
  Organized for easy navigation and use: metadata, data models, "user friendly" naming conventions
12
Cultural Detractors: the two biggies…
The business supported by the data warehouse must be motivated by a desire for constant improvement and fact-based decision making
The data warehouse team falls victim to the "Politics of Data," whether through naivety or through misguided motives of their own
13
Business Culture
Does your CEO…
  Talk about constant improvement, constantly?
  Drive corporate goals that are SMART? (Specific, Measurable, Attainable, Realistic, Tangible)
  Crave data to make better-informed decisions?
  Become visibly, buoyantly excited at a demo of a data cube? ("I love data!")
If so, the success of your data warehouse is right around the corner… sort of…
14
Political Best Practices
You will be called a "data thief"
  Get used to it; encourage life-cycle ownership of the OLTP data, even in the EDW
You will be called "dangerous"
  "You don't understand our data!" OLTP owners know their data better than you do; acknowledge it and leverage it
You will be blamed for poor data quality in the OLTP systems
  This is a natural reaction; data warehouses raise the visibility of poor data quality. Use the EDW as a tool for raising overall data quality
You will be called a "job robber"
  The EDW is perceived as a replacement for OLTP systems; educate people: the EDW depends on OLTP systems for its existence
Stick to your values and pure motives, and the politics will fade away
15
Data Quality Pitfall
  Taking accountability for data quality on behalf of the source system, spending gobs of time and money "cleansing" data before it's loaded into the DW
  It's a never-ending, never-win battle: you will always be one step behind data quality, and you will always be in the cross-hairs of blame
Best Practice
  Push accountability where it belongs: to the source system
  Use the data warehouse as a tool to reveal data quality, either good or bad
  Be prepared to weather the initial storm of blame
16
Measuring Data Quality
Data Quality = Completeness x Validity
Can it be measured objectively? Yes:
  Measuring "Completeness": count the null values in a column
  Measuring "Validity": cardinality is a simple way to measure validity
    "We only have four standard regions in the business, but we have 18 distinct values in the region column."
A SQL sketch of both measurements follows.
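A minimal sketch, assuming a hypothetical MEMBER table with a REGION column:

   -- Completeness: the fraction of rows with a non-null region
   SELECT COUNT(region) / COUNT(*) AS completeness
   FROM   member;

   -- Validity: distinct values observed vs. the four standard regions
   SELECT COUNT(DISTINCT region) AS distinct_regions
   FROM   member;

   -- List the values themselves, for follow-up with the data steward
   SELECT   region, COUNT(*) AS row_cnt
   FROM     member
   GROUP BY region
   ORDER BY row_cnt DESC;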
17
Business Validity: how can you measure it? You can't…
“I collect this data from our customers, but I have to guess sometimes because I don’t speak Spanish.”
“This data is valid for trend analysis decisions before 9/11/2001, but should not be used after that date, due to changes in security procedures.”
“You can’t use insurance billing and reimbursement data to make clinical, patient care decisions.”
"This customer purchased four copies of 'Zamfir, Master of the Pan Flute', therefore he loves everything about Zamfir." What Amazon didn't know: I bought them for my mom and her sewing circle.
Where do you capture subjective data quality? Metadata….
18
The Importance of Metadata
Maybe the most over-hyped, underserved area of data warehousing common sense
  Vendors want to charge you big $$$$$ for their tools
  Consultants would like you to think that it's the Holy Grail in disguise, and that only they can help you find it
  Authors who have never been in an operational environment would have you chasing your tail in pursuit of an esoteric, mythological Metadata Nirvana
Don't listen to the confusing messages! You know the answer… just listen to your common sense…
19
Metadata: Keep It Simple!
Ultimately, what are the most valuable business motives behind metadata?
  Make data more "understandable" to those who are not familiar with it: data quality issues; data timeliness and temporal issues; the context in which it was collected; translating physical names to natural language
  Make data more "findable" to those who don't know where it is: organize it; take a lesson from library science and the card catalog
20
Table Elements
  Required elements: Long Name (or English name), Description
  Semi-optional elements: Source, Example, Data Steward
21
Column Elements
  Required elements: Long Name, Description
  Optional elements: Value Range, Data Quality, Associated Lookup
22
The Data Model
TABLE_ENT
  TABLE_ENT_ID: NUMBER
  TABLE_ENT_DESC: VARCHAR2(4000)
  TABLE_ENT_SRC: VARCHAR2(50)
  TABLE_ENT_NAME: VARCHAR2(50)
  TABLE_TYPE: VARCHAR2(10)
  CREATE_DT: DATE
  LAST_LOAD_DT: DATE
  SCHEMA_ID: NUMBER

DATA_MART
  DATA_MART_ID: NUMBER
  DATA_MART_NAME: VARCHAR2(50)
  DATA_MART_DESC: VARCHAR2(4000)
  DATA_STEWARD: VARCHAR2(50)
  LAST_LOAD_DT: DATE
  UPDATE_FREQ: VARCHAR2(50)
  DATA_BEG_DT: DATE
  DATA_END_DT: DATE

DATA_MART_TABLE_ENT
  DATA_MART_ID: NUMBER
  TABLE_ENT_ID: NUMBER

FOLDER
  FOLDER_ID: NUMBER
  PARENT_FOLDER_ID: NUMBER
  FOLDER_NM: VARCHAR2(50)
  FOLDER_DSC: VARCHAR2(4000)
  CREATE_USER_ID: VARCHAR2(20)
  CREATE_DT: DATE

REPORT
  RPT_ID: NUMBER
  FOLDER_ID: NUMBER
  RPT_NM: VARCHAR2(250)
  RPT_LOC_TXT: VARCHAR2(1000)
  PURPOSE_TXT: VARCHAR2(4000)
  RUN_FREQ_TXT: VARCHAR2(1000)
  AUDIENCE_TXT: VARCHAR2(500)
  EDW_RPT_FLG: NUMBER
  DATA_SOURCE_TXT: VARCHAR2(4000)
  SELECT_CRITERIA_TXT: VARCHAR2(4000)
  STAT_METHODS_TXT: VARCHAR2(4000)
  RPT_TOOL_TXT: VARCHAR2(250)
  CODE_TXT: CLOB
  FORMULA_TXT: CLOB
  COMMENTARY_TXT: VARCHAR2(4000)
  AUTHOR_NM: VARCHAR2(500)
  AUTHOR_TITLE_TXT: VARCHAR2(500)
  AUTHOR_DEPT_TXT: VARCHAR2(500)
  AUTHOR_LOC_TXT: VARCHAR2(500)
  AUTHOR_PHONE_TXT: VARCHAR2(500)
  AUTHOR_EMAIL_TXT: VARCHAR2(500)
  BUSINESS_OWNER_TXT: VARCHAR2(500)
  METADATA_UPDATE_DT: DATE
  VALIDATION_DT: DATE
  CREATE_USER_ID: VARCHAR2(20)
  CREATE_DT: DATE

REPORT_TABLE_ENT_ASSOC
  RPT_ID: NUMBER
  TABLE_ENT_ID: NUMBER

ATTRIBUTE
  ATTRIBUTE_ID: NUMBER
  TABLE_ENT_ID: NUMBER
  ATTRIBUTE_DESC: VARCHAR2(4000)
  ATTRIBUTE_NAME: VARCHAR2(50)
  ATTRIBUTE_DATATYPE: VARCHAR2(50)
  SAMPLE_VALUE: VARCHAR2(100)
  INDEX_FLG: NUMBER
  PRIMARY_KEY_FLG: NUMBER
  TABLE_POSITION_NO: NUMBER

SCHEMA
  SCHEMA_ID: NUMBER
  SCHEMA_DESC: VARCHAR2(50)
23
Example Metadata Entry: LKUP.POSTAL_CD_MASTER Table
Long Name: Postal Code Master - IHC
Description: Contains postal (zip) codes for the IHC referral region and IHC-specific descriptions. These descriptions allow for specific IHC groupings used in various analyses.
Data Steward: Jim Allred, ext. 3518
25
Some Info Is Free
It can be collected from the database. For example: primary and foreign keys, indexed columns, table creation dates (an Oracle sketch follows).
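In Oracle, for instance, the data dictionary views already hold all three; adjust for your own platform:

   -- Table creation dates
   SELECT object_name, created
   FROM   user_objects
   WHERE  object_type = 'TABLE';

   -- Primary ('P') and foreign ('R') key constraints
   SELECT table_name, constraint_name, constraint_type
   FROM   user_constraints
   WHERE  constraint_type IN ('P', 'R');

   -- Indexed columns
   SELECT index_name, table_name, column_name
   FROM   user_ind_columns;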
26
Most Valuable Info Is Subjective: the human element
  Most metadata is not automatically collected by tools, because it does NOT exist in that form
  Interviews with data stewards are the key
  It can take months (and months and months) of effort to collect initial metadata
27
Holding Feet to the Fire
  Made data architects responsible for metadata in their subject areas
  Metadata completion reports in every staff meeting for a year (one way to generate such a report is sketched below)
  Standing rule: no new data marts go live without metadata
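The actual IHC report wasn't shown, but against the metadata model presented earlier, a completion report could be as simple as this sketch:

   -- Per-table metadata completeness: how many columns have descriptions?
   SELECT   t.table_ent_name,
            COUNT(*)                AS total_columns,
            COUNT(a.attribute_desc) AS described_columns
   FROM     table_ent t
            JOIN attribute a ON a.table_ent_id = t.table_ent_id
   GROUP BY t.table_ent_name
   ORDER BY COUNT(a.attribute_desc) / COUNT(*);  -- worst offenders first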
28
Is it all worth it?
Data analysts think so: "I couldn't do my job without it."
It will push the ROI of a ho-hum DW into the stratosphere
It does for DW'ing what the Yellow Pages did for the business ROI of the telephone
29
It Gets Used: At Intermountain Health Care
210 web hits on average each weekday (23,000 employees, $2B revenue)
[Chart: Avg Hits by Day of Week, April 2004 - Sep 2004: MON 189, TUE 217, WED 212, THU 240, FRI 188]
31
Report Quality
A function of…
  Data quality
  How well the report reflects the intent behind the question being asked
    "This report doesn't make sense. I'm trying to find out how many widgets we can produce next year, based on the last four years' production."
    "That's not what you asked for."
  SQL and other programming accuracy
  Statistical validity: the population size of the data
  Timeliness of the data relative to the decision
  Event correlation
Best Practice: an accompanying "meta-report" for every report that involves significant, high-risk decisions
32
Meta Report
A document, associated with a published report, which defines the report.
33
Repository
A central place for storing and sharing information about business reports
34
IHC Analyst Use of Meta Reports
[Chart, data collected Aug 2004, N=32: Read Others 37%, Search Duplication 89%, Search SQL 21%, Audience Request 95%]
35
Meta Report
  Core elements: Author Information, Report Name, Report Purpose, Data Source(s), Report Methods
  Recommended elements: Business Owner, Run Frequency, Intended Audience, Statistical Tests, Software Used, Source Code, Formulas, Relevant Issues & Commentary
38
• Selection Criteria
• Statistics
• Software
• Source Code
• Formulas
42
Operations Best Practices
Think: Mission Control
  Customized ETL library; schedule of operations; alerting tool; storage strategies / backups; development philosophy and environment; performance monitoring and tuning
43
IHC Architecture
  EDW: Oracle 9.2.0.3 on AIX 5.2; storage: IBM SAN ("Shark"), >3 TB
  ETL tools: Ascential DataStage; Korn shell (Unix), SQL scripts, PL/SQL scripting
  OLAP: Microsoft Analysis Services
  BI: Business Objects (Crystal Enterprise), with a cube presentation layer
  Dashboard: Visual Mining's NetCharts
  EDW team: ~16 FTEs, plus SAs and DBAs
45
History
One of our ETL programmers noticed he kept doing the same things over and over for all of his ETL jobs. Rather than copying and pasting this repetitive code, he created a library. Now we all use the ETL library.
We named the library EDW_UTIL (EDW Utilities).
46
Implementation
  Executes via Oracle stored procedures (a sketch of what the package might look like follows)
  Supported by associated tables to hold data when necessary: an error table, a QA table, an index table
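A minimal sketch of such a package specification; the procedure names come from the slides that follow, while the signatures are assumptions, not the actual IHC implementation:

   CREATE OR REPLACE PACKAGE edw_util AS
      -- index management
      PROCEDURE import_table_index_data (p_owner VARCHAR2, p_table VARCHAR2);
      PROCEDURE drop_table_indexes      (p_owner VARCHAR2, p_table VARCHAR2);
      PROCEDURE create_table_indexes    (p_owner VARCHAR2, p_table VARCHAR2);
      -- load timing, error capture, and QA
      PROCEDURE begin_job_time  (p_job VARCHAR2);
      PROCEDURE end_job_time    (p_job VARCHAR2);
      PROCEDURE load_time_error (p_job VARCHAR2, p_msg VARCHAR2);
   END edw_util;
   /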
47
Benefits
  Provides standardization
  Eliminates code rewrites
  Can hide complexities, such as the appropriate way to analyze and gather statistics on tables
  Very accessible to all of our ETL tools: simply an Oracle stored procedure call
48
Index Management
Past process:
  Drop the table's indexes with a script
  Load the table
  Create the indexes with a script
The past process resulted in messy scripts to manage and coordinate
49
Index Management
New process:
  Capture a table's existing index metadata
  Drop the table's indexes with a single procedure call
  Load the table
  Recreate the indexes with a single procedure call
There are no more messy scripts to manage and coordinate, and no more "lost" indexes, neglected when someone forgot to add them to a create-index script
50
Index Management Samples (a usage sketch follows)
  IMPORT_SCHEMA_INDEX_DATA, IMPORT_TABLE_INDEX_DATA, DROP_TABLE_INDEXES, CREATE_TABLE_INDEXES
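A hypothetical load job using these calls (signatures assumed, as above):

   BEGIN
      edw_util.import_table_index_data('EDW', 'CASEMIX');  -- capture the index metadata
      edw_util.drop_table_indexes('EDW', 'CASEMIX');       -- drop before the load
   END;
   /
   -- ... bulk load the table here ...
   BEGIN
      edw_util.create_table_indexes('EDW', 'CASEMIX');     -- recreate from the captured metadata
   END;
   /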
51
Background Loading of Tables
We often load data into tables which are not accessible to end users; a simple rename puts them into production.
The library helps transfer the identical attributes from the live table to the background table.
Samples (a combined sketch follows):
  COPY_TABLE_METADATA, TRANSFER_TABLE_PRIVS, DROP_TABLE_INDEXES, CREATE_TABLE_INDEXES (create on the background table, identical to the production table)
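A sketch of a background load and the rename swap; the signatures and table names are hypothetical:

   BEGIN
      edw_util.copy_table_metadata ('EDW', 'SALES', 'SALES_BG');  -- clone the structure
      -- ... load SALES_BG here, while SALES stays live ...
      edw_util.create_table_indexes('EDW', 'SALES_BG');           -- identical to production
      edw_util.transfer_table_privs('EDW', 'SALES', 'SALES_BG');
      EXECUTE IMMEDIATE 'ALTER TABLE sales RENAME TO sales_old';  -- the swap into production
      EXECUTE IMMEDIATE 'ALTER TABLE sales_bg RENAME TO sales';
   END;
   /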
52
Load Times, Errors, QA
We had no idea who was loading what and when; each staff member logged in their own way and for their own interest, so ETL error capturing and QA were difficult.
We can now capture errors and QA information in a somewhat standardized fashion.
53
Load Times, Errors, QA: Samples
  BEGIN_JOB_TIME (ex: CASEMIX)
  BEGIN_LOAD_TIME (ex: CASEMIX INDEX)
  END_LOAD_TIME
  END_JOB_TIME
  COMPLETE_LOAD_TIME (begin and end together)
  LOAD_TIME_ERROR (alert on these errors)
  LOAD_TIME_METRICS
  QA (row counts)
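An instrumented job might look like the sketch below; the CASEMIX examples come from the slide, while the exception handling is an assumption:

   BEGIN
      edw_util.begin_job_time('CASEMIX');
      edw_util.begin_load_time('CASEMIX INDEX');
      -- ... do the load work ...
      edw_util.end_load_time('CASEMIX INDEX');
      edw_util.end_job_time('CASEMIX');
   EXCEPTION
      WHEN OTHERS THEN
         edw_util.load_time_error('CASEMIX', SQLERRM);  -- the alerting tool watches these
         RAISE;
   END;
   /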
54
Miscellaneous Procedures
  Hide the "gory" details from the majority of the EDW team, such as Oracle's table analyze command
  Give us consistent application of system-wide parameters, such as a new box with a different number of CPUs (parallel slaves) or a new version of Oracle
  We populate some metadata too, such as the last load date
55
DW Schedule of Operations
  Some loads are ad hoc, not scheduled
  Users query in an ad hoc fashion
  We have a minimal service/application tier implemented (loss of control)
  A variety of ETL tools
  A variety of user categories: DBA, SA, ETL user, end users
  A variety of servers: production EDW, stage EDW, ETL servers, OLAP servers, presentation-layer servers
56
General Approach
  Focus on load jobs against the production EDW; still working on all the reporting aspects (a sample on the next slide)
  Pull this information out of the "load times" data captured by these ETL library calls: BEGIN_JOB_TIME, BEGIN_LOAD_TIME, END_LOAD_TIME, END_JOB_TIME, COMPLETE_LOAD_TIME
58
DW Alerting Tool
  The original idea: aggregate data alerts, such as "your average length of stay just crossed a certain threshold"
  A simple tool was created which sends a text email based on the existence of data returned from a query
  Primarily embraced by DW team members for internal DW operations, though the original intent hasn't been abandoned
59
Features
  Web based; open to all EDW users
  Runs daily, weekly, every two weeks, monthly, or quarterly (wakes every 5 minutes); this is passive polling
  Ability to enter the query in SQL
  Alerts (email) on three situations: the query returns data, the query returns no data, or always (a sketch of the core check follows)
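A sketch of the check logic, assuming a hypothetical ALERT_DEF table that holds each alert's SQL and firing rule, and a hypothetical SEND_ALERT_EMAIL procedure:

   DECLARE
      v_cnt NUMBER;
   BEGIN
      FOR a IN (SELECT alert_id, alert_sql, fire_when
                FROM   alert_def
                WHERE  next_run_dt <= SYSDATE) LOOP
         -- run the user's query and count the rows it returns
         EXECUTE IMMEDIATE
            'SELECT COUNT(*) FROM (' || a.alert_sql || ')' INTO v_cnt;
         IF (a.fire_when = 'DATA'    AND v_cnt > 0)
         OR (a.fire_when = 'NO_DATA' AND v_cnt = 0)
         OR  a.fire_when = 'ALWAYS' THEN
            send_alert_email(a.alert_id);
         END IF;
      END LOOP;
   END;
   /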
61
Examples
  ~100 alerts in use
  Live performance checks: every 4 hours, look for inactive sessions holding active slaves; daily, look for any active sessions older than 72 hours
  ETL monitoring, alerting only if there is a problem: alert on errors logged via the EDW_UTIL library (manage by exception); alert on the existence of "bad" records captured during ETL
62
Storage and Backup
Inherited state of affairs: running like any OLTP database
  High-end, expensive SANs (storage area networks)
  FULL nightly online backups
  Out of space? Just buy more
63
Nightmare in the Making
  Exponential growth: more data sources, more summary tables, more indexes; no data has yet been purged
  Relaxed attitude: "disk is cheap"; reality: disk management is expensive
64
Looming Crisis
  Backups often run 16 hours or more; performance degradation witnessed by users; good backups obtained less than 50% of the time
  Literally running out of space: gross underestimating, some reckless overuse
  Financial $$$$ cost: the system administrators (SAs) quadrupled the price of disk purchases from the previous budget year. Ouch! And the SAs roll in the price of tape drives, etc.
65
Major Changes in Operations
  Transfer some disk ownership AND backup responsibilities to the DW team, away from the SAs and DBAs
    The EDW team is more aware of upcoming space demands
    The EDW team is more in tune with which data sets are easily recreated from the source (and don't need a backup)
  Stop performing full daily backups
  Move towards a less expensive disk option: IBM offers a few levels of SANs
67
Changes to Backup Strategy
  Perform a full backup once monthly, during downtime
  Perform no data backup on DEV/STAGE environments
  Do back up DDL (all code) daily, in all environments
  Implement a daily "incremental" backup
68
Daily Incremental Backups
Easier said than done; we've resorted to a table-level backup (in Oracle, that's an EXPORT)
  The EDW team owns which tables are exported
  The EDW team populates a table, the "export table list," with each table's export frequency (populated via an application in development)
  The DBAs run an export based on the "export table list" (a sketch of the idea follows)
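The table shape and frequency scheme below are assumptions, since the implementation details weren't shown:

   CREATE TABLE export_table_list (
      owner        VARCHAR2(30),
      table_name   VARCHAR2(30),
      export_freq  VARCHAR2(10),  -- e.g., DAILY, WEEKLY, MONTHLY
      last_exp_dt  DATE
   );

   -- Tables due for tonight's export run
   SELECT owner, table_name
   FROM   export_table_list
   WHERE  (export_freq = 'DAILY')
      OR  (export_freq = 'WEEKLY'  AND last_exp_dt <= SYSDATE - 7)
      OR  (export_freq = 'MONTHLY' AND last_exp_dt <= ADD_MONTHS(SYSDATE, -1));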
69
Use Cheaper Disk
General practice: you can take greater risks with DW reliability and availability vs. OLTP systems; use that to your advantage
  Our SAN vendor (IBM) offers a few levels of SANs; the next level down is a big step down in price, a small step down in features
  Feature loss: read cache (referring to disk cache, not box memory); we rarely read the same thing twice anyway
  No "phone home" to IBM (auto paging)
  Mean time to failure is worse, but still acceptable
70
Performance Monitoring & Tuning
Err on the side of freedom and empowerment: how much harm can really be done? We'd rather not constrain our customers.
  "Pounding queries" do find their way to production
  An opportunity to educate users, and an opportunity for us to tune the underlying structures
71
The Focus Areas
  Indexing: well-defined criteria for when and how to apply indexes (is this a lost art?); big use of bitmap indexes; the composite-index trick (an index that covers the query acts like a narrow table)
  Partitioning for performance, rather than data management
  Exploiting Oracle's Direct Path INSERT feature
  Avoiding UPDATE and DELETE commands: copy with MINUS instead (sketched below)
  Implementing Oracle's Parallel Query
  Turning off referential integrity in the DW: a no-brainer; that's the job of the source system
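A sketch of the Direct Path INSERT and "copy with MINUS" techniques; the table names are hypothetical:

   -- Direct Path INSERT: writes blocks above the high-water mark,
   -- generating far less undo than a conventional insert
   INSERT /*+ APPEND */ INTO sales_fact
   SELECT * FROM sales_stage;

   -- Copy with MINUS instead of DELETE: rebuild the table without the
   -- unwanted rows, then swap the new table into production
   CREATE TABLE sales_fact_new AS
      SELECT * FROM sales_fact
      MINUS
      SELECT * FROM sales_fact_purge;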
72
DW Monitoring: Empowering End Users
Motive: too many calls from end users about their queries
  "Please kill it." "Is it still running, or is my PC locked up?" "Why is the DW so slow?"
Give them the insight and the tools; give them the ability to kill their own queries
  Still in the works
74
Tracking Long-Running Queries
We use Pinecone (from Ambeo) to monitor the duration, and the SQL, of all queries
Each week, we look at the top few. Typical outcomes?
  We'll add indexes; we'll denormalize; we'll contact the user and assist them with writing a better query
75
The DW Sandbox
More empowerment for customers
Motive
  Lots of little MS Access databases (with valuable data) spread all over the place
  They needed to be joined with DW data; costly to maintain; PC hogs
Solution
  Provide customers with their own "sandbox" on the DW, with DBA-like privileges
76
Features
  Web-based tool for creating tables and loading MS Access data to the DW; a simple, easy-to-use interface
  Privileges: users have full rights to the tables they create, and can grant rights to others
  A big, big victory for customer service and data "maturity"
  10% of DW customers use the Sandbox; about 600 tables in use now; about 2 GB of data
77
Design-Build Best Practices
Build vertically, design horizontally
  Start by building data marts that address analytic needs in one area of the business, with a fairly limited data set
  But design with the horizontal needs of the company in mind, so that you can eventually "tie" all of these vertical data marts together with a common semantic layer
79
For Example…
[Diagram: an integrated reporting model of cancer patient data. The oncology data integration strategy: top-down reporting requirements and data model, with disparate sources (Cancer Registry, Mammography, Radiology, Pathology, Laboratory, Continuing Care and Follow-Up, Quality of Life Survey, Radiation Therapy, Health Plans Claims, Ambulatory Casemix, Acute Care Casemix) "connected" semantically to the data bus]
80
The Logic Layer in Data Warehouses
[Diagram: Source System feeds the ETL Process, which feeds the Data Warehouse, which feeds Reports, spanning a Data Layer, a Logic Layer, and a Presentation Layer; transaction systems and analytic systems are distinguished, with "Here" and "Not Here" annotations marking where the business logic belongs]
81
Evidence of Business Process Alignment
1. Map out your high-level business processes
   Don't fall prey to analysis paralysis with endless business process modeling diagrams!
2. Identify and associate the transaction systems that support those processes
3. Identify the common, overlapping semantics/data attributes and their utilization rates
4. Build your data marts within an enterprise framework that is aligned with the processes you are trying to understand
82
For example…
[Diagram: a healthcare business process (Health Need, Diagnosis, Procedure, Results & Outcomes, Patient Perception, across an Episode of Care, plus AP/AR and Claims Processing), supported by non-integrated data in transaction systems (HELP, Lab, HPI, MC400, Survey, AS400, IDX, HDM, CIS/CDR, HNA, Rx), integrated in the Data Warehouse]
83
Event Correlation
A leading-edge Best Practice: the third dimension to rows and columns, overlaying the data that underlies a report or graph
  "In 2004, we experienced a drop in revenue as a result of the earthquake that destroyed our plant in the Philippines."
“In January of 2005, we saw a spike in the North America market for snow shovel sales that coincided with an increase in sales for pain relievers. This correlates to the record snowfall in that region and should not be considered a trend. Barring major product innovation, we consider the market for snow shovels in this area as saturated. Sales will be slow for the next several years.”
84
Standardizing Semantics
Sweet irony: there are many synonyms for "standard semantics": data dictionary, vocabulary, dimensions, data elements, data attributes
The bottom-line issue: standardizing the terms you use to describe key facts about your business
85
Standardizing "Names of Things"
You'd better do it within the first two months of your data warehouse project
  If you are beyond that point, you'd better stop and do it now, lest you pay a bigger price later
Don't…
  Push the standard onto the source systems, unless it's easy to accomplish; this was one of the common pitfalls behind early data warehousing project failures
  Try to standardize everything under the sun! Focus on the high-value facts
86
Where Are The "High Value" Semantics?
In the high-overlap, high-utilization areas…
[Diagram: overlapping Source Systems X, Y, and Z; their intersection is the highest-value area for standardizing semantics]
87
Another Perspective
[Chart: semantic utilization on one axis, semantic overlap on the other]
88
The Standard Semantic "Layer"
[Diagram: source systems feeding the data warehouse through Extract, Transform, Load, with semantic standards applied as a layer across the ETL process]
89
Data Modeling
  Star schemas are great and simple, but they aren't the end-all, be-all of analytic data modeling
    Best practice: do what makes sense; don't be a schema bigot. I've seen great analytic value from 3NF models
  Maintain data familiarity for your customers when meeting vertical needs
    Don't make massive changes to the way the model looks and feels, nor to the naming conventions; you will alienate existing users of the data
  Use views to achieve "new" or standards-compliant perspectives on the data when meeting horizontal needs (a sketch follows)
90
For Example…
[Diagram: the source perspective and the DW perspective share similar names & organization, serving the vertical data customer; a "standardized" view over the DW tables serves the horizontal data customer]
91
The Case For Timely Updates
[Chart: % of requests for data vs. data age, from today out to 1 and 2 years; utilization is concentrated in the freshest data]
Generally, to minimize Total Cost of Ownership (TCO), your update frequency should be no greater than the decision-making cycle associated with the data. But… everyone wants more timely data.
92
Best Practice: Measure Yourself
The Data Warehouse Dashboard:
  Employee satisfaction
  Customer satisfaction
  Average number of queries/month
  Number of queries above a threshold (30 minutes?)
  Average query response time
  Total number of records
  Total number of query-able tables
  Total number of query-able columns
  Number of "users"
  Average rows delivered per month
  Storage utilization
  CPU utilization
  Downtime per month, by data mart
93
Other Best Practices
  The data warehouse information systems team reports to the CIO; most data analysts can, and probably should, report to the business units
  Change management / service-level agreements with the source systems: no changes in the source systems unless they are coordinated with the data warehouse team
94
More Best Practices
Skills of the Data Warehouse IS Team
  An experienced chief architect/project manager
  Procedural/script programmers
  SQL/declarative programmers
  Data warehouse storage management architects
  Data warehouse hardware architects and system administrators
  Data architects/modelers
  DBAs
95
More Best Practices
Evidence of project collaboration (project = complex reports or a data mart)
  A cross-section of members and expertise from the data warehouse IS team
  Statisticians and data analysts who understand the business domain
  A customer who understands the process(es) being measured and can influence change
  A data steward: usually someone from the front lines who knows how the data is collected
96
More Best Practices
Whenever possible, extract as close to the source as possible
[Diagram: the Primary Source feeds Copy A, which feeds Copy B; the best-practice path feeds the Data Warehouse directly from the Primary Source]
97
The Most Popular Authors
I appreciate…
  The interest they stir
  The vocabulary (the semantics) of this new specialty that they helped create
The downside…
  The buzzwords that are more buzz than substance: "Corporate Information Factories"
  Endless, meaningless debate: "That's not an Operational Data Store!" "Do you follow Kimball or Inmon?"
Follow your own common sense; most of these authors have not had to build a data warehouse from scratch and live with their decisions through a complete lifecycle
98
ETL Operations
Besides the cultural risks and challenges, the Extract, Transform, and Load processes are the riskiest part of a data warehouse
  A good book: Westerman, "Data Warehousing: Using the Wal-Mart Model"
  Worthy of its own "Best Practices" discussion; suffice it to say, mitigate risks in this area carefully and deliberately
  The major design errors don't show up until late in the lifecycle, when the cost of repair is great
99
Two Essential ETL Functions
  Initial loads: how far back do we go in history?
  Maintenance loads: differential loads or total refresh? How often?
You will run and tune these processes several times before you go into production
  How many records are we dealing with? How long will this take to run? What's the impact on source system performance?
100
Maintenance Loads
Total refresh vs. incremental loads
  Total refresh: truncate and reload everything from the source system
  Incremental: load only the new and updated records
For small data sets, a total refresh strategy is the easiest to implement
  How do you define "small"? You'll know it when you see it.
  Sometimes it's the fastest strategy when you are trying to show quick results: grab and go…
101
Incremental Loads
How do we get a snapshot of the data that has changed since the last load?
  Many source systems will have an existing log file of some kind
  Take advantage of these when you can; otherwise, incremental loads can be complicated
102
File Transfer Formats
Design your extract so that it uses…
  A fixed, predetermined length for all records and fields; avoid variable length if at all possible
  A unique character that separates each field in a record, such as ~
  A standard format for header records across all source systems, such as the first three records in each file
    Include the name of the source system, the file, the record count, and the number of fields in the record; this will be handy for monitoring jobs and collecting load metadata
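One possible layout, purely illustrative (the slide doesn't prescribe the exact header fields):

   SOURCE~CASEMIX~EXTRACT_DT~20050601            (header record 1: source system)
   FILE~casemix_20050601.dat~RECORD_CNT~1048576  (header record 2: file name and record count)
   FIELDS~42                                     (header record 3: fields per record)
   000123456~20050531~SLC~...                    (data records follow)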
103
Benefits of a Standard File Transfer Format
  Compatible with standard database and operating system utilities
  Dynamically create initial and maintenance load scripts: read the table definitions (DDL), then merge that with the standard transfer file format
  Dynamically generate load monitoring data: read the header records, insert them into a "Load Status" table with a status of "Running," the number of records, and the start time; at EOF, change the status to "Complete" and capture the end-of-load time (sketched below)
I wish I had thought about this topic more, and earlier in my career
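A sketch of the Load Status bookkeeping, assuming a hypothetical LOAD_STATUS table and the header layout illustrated on the previous slide:

   -- On reading the header records:
   INSERT INTO load_status (source_system, file_name, record_cnt, status, start_time)
   VALUES ('CASEMIX', 'casemix_20050601.dat', 1048576, 'Running', SYSDATE);

   -- At end-of-file:
   UPDATE load_status
   SET    status = 'Complete', end_time = SYSDATE
   WHERE  file_name = 'casemix_20050601.dat';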
104
Westerman Makes a Good Point
My experience: ETL is the least tasteful and least productive use of a veteran EDW team member's time, so I like Westerman's insight on this topic
If you design for instantaneous updates from the beginning, it translates to less ETL maintenance and labor time for the EDW staff later
105
Messaging Applied to ETL
Basic concepts
  Use a load message queue for records that need to be updated, coming from the source systems
  When the EDW analytical processing workload is low (off-peak), pick the next message off the load queue and load the data
  Run this in parallel so that you can process several load messages at the same time while you have a window of opportunity
  Sometimes called "throttling": speed up and slow down based upon traffic conditions (a sketch follows)
Motive behind the concept
  Continuous updates in a mixed workload environment
  Mixed: analytical processing at the same time as transaction-oriented, constant updates, deletes, and inserts
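A sketch of the throttling loop; the LOAD_QUEUE table, the workload function, and the apply procedure are all hypothetical:

   BEGIN
      FOR m IN (SELECT msg_id
                FROM  (SELECT msg_id FROM load_queue ORDER BY queued_dt)
                WHERE  ROWNUM <= 4) LOOP       -- several messages per pass
         EXIT WHEN edw_workload_pct() > 70;    -- throttle: back off when analysts are busy
         apply_load_message(m.msg_id);         -- replays the source insert/update/delete
      END LOOP;
   END;
   /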
106
ETL Message Queue Process
[Diagram: source systems feed updates, inserts, and deletes into an ETL message queue; an ETL manager drains the queue into the EDW production tables, throttled by database workload and performance metrics]
107
Four Data Maintenance Processes
  Initial load: loading into an empty table
  Append load
  Update process
  Delete process
As much as practical, use your database utilities for these processes
  Study and know your database's utilities for data warehousing; they are getting better all the time
  I see some bad strategies in this area: companies spending time building their own utilities… aye cucumber!
108
A Few Planning Thoughts
  Understand the percentage of records that will be updated, deleted, or inserted; you'll probably develop a different process for 90% inserts vs. 90% updates
  Logging: in general, turn logging off during these processes, if logging was on at all
  Field- vs. record-level updates: some folks, in the interest of purity, will build complex update processes for passing only field (attribute) level changes. No-brainer: pass the whole record
109
Initial Load
  Every table will, at some time, require an initial load
  For some tables, it will be the best choice for ongoing data maintenance: a total data refresh, best for "small" tables
  A simple process to implement: simply delete (or truncate) and reload with fresh data
110
A Better Initial Load Process
Background load
  Safer: protects against corrupt files
  Higher availability to customers
Three or four steps… maybe six (sketched below):
  1. Create a temporary table
  2. Load the temporary table
  3. Run quality checks
  4. Rename the temporary table to the production table name
  5. Delete the old table
  6. Regrant rights, if necessary
Westerman: "You want to use as many initial load processes as possible." I agree!
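In Oracle terms, the six steps reduce to something like this sketch (names hypothetical):

   CREATE TABLE sales_tmp AS
      SELECT * FROM sales WHERE 1 = 0;      -- 1. empty copy of the production table
   -- 2. load SALES_TMP (SQL*Loader, INSERT /*+ APPEND */, etc.)
   -- 3. run quality checks (row counts, key spot checks)
   ALTER TABLE sales RENAME TO sales_old;   -- 4. swap the tables...
   ALTER TABLE sales_tmp RENAME TO sales;
   DROP TABLE sales_old;                    -- 5. delete the old table
   GRANT SELECT ON sales TO edw_users;      -- 6. regrant rights, if necessary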
111
Append Load
For larger tables that accumulate historical data
  There are no updates, just appends: a hard fact that will not change
  Examples: sales that are closed; lab results
112
Append Load Options
  Load a single part of a table
  Load a partition and "attach" it to the table: create a new, empty partition; load the new records; attach the partition to the table (an Oracle sketch follows)
  Look for a "LOAD APPEND" style command in your database
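In Oracle, the "attach" is EXCHANGE PARTITION, a near-instant metadata operation; the names here are hypothetical:

   -- Create the new, empty partition
   ALTER TABLE sales_fact ADD PARTITION p_2005_06
      VALUES LESS THAN (TO_DATE('2005-07-01', 'YYYY-MM-DD'));

   -- ... load the new records into the standalone table SALES_2005_06 ...

   -- Attach: swap the loaded table into the partition
   ALTER TABLE sales_fact EXCHANGE PARTITION p_2005_06
      WITH TABLE sales_2005_06;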
113
Another Append Option
  1. Create a temp table identical to the one you are loading
  2. Load the new records into the empty temp table
  3. Issue an INSERT/SELECT:

     INSERT INTO big_table (SELECT * FROM temp_big_table);

  4. Delete the temp table

IF the # of records in TEMP is much less than the # of records in BIG
THEN good technique
ELSE not good
114
Update Process
  The most difficult and risky process to build; use it only if the tables are too large for a complete-refresh ("initial load") process
  Updates affect data that changes over time, like purchase orders, hospital transactions, etc.; medical records too, if you treat the data maintenance at the macroscopic level
115
Update Process Options
A simple process:
  1. Separate the affected records into an update file, an insert file, or a delete file; do this on the source system, if possible
  2. Transfer the files to the data warehouse staging area
  3. Create and run two processes:
     A delete process for deleting the records in the production table that need to be updated or deleted
     An insert process for inserting the entirely new "updated" records into the production table, as well as the true inserts
Simple, but typically not very fast
116
Simple Process
[Diagram: the source system produces Updates, Deletes, and Inserts files in the EDW staging area; a Delete Process and an Insert Process apply them to the EDW production table in six steps]
1. The Delete Process identifies records for deletion from the production table based upon the contents of the Updates file.
2. The Delete Process identifies records for deletion from the production table based upon the contents of the Deletes file.
3. The Delete Process deletes records from the production table.
4. The Insert Process identifies records for insert into the production table based upon the contents of the Updates file.
5. The Insert Process identifies records for insert into the production table based upon the contents of the Inserts file.
6. The Insert Process inserts records into the production table.
117
When You Are Unsure
Sometimes, source system log and audit files make it difficult to know whether a record was updated or inserted (i.e., created)
Try this…
  1. Load the records into a temp table that is identical to the production table to be updated
  2. Delete the corresponding records from the production table:

     DELETE FROM prod_table
     WHERE  key_field IN (SELECT temp_key_field FROM temp_table);

  3. Insert all the records from the temp table into the production table
Most databases now support this pattern with an UPSERT (sketched below)
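In Oracle 9i, for example, the UPSERT is the MERGE statement; a sketch against the same temp-table pattern, with hypothetical column names:

   MERGE INTO prod_table p
   USING temp_table t
   ON   (p.key_field = t.key_field)
   WHEN MATCHED THEN
      UPDATE SET p.col_a = t.col_a, p.col_b = t.col_b
   WHEN NOT MATCHED THEN
      INSERT (key_field, col_a, col_b)
      VALUES (t.key_field, t.col_a, t.col_b);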
118
Massive Deletes
Just as with updates and inserts, the number of deletes you have to manage is inversely proportional to the frequency of your ETL processes: infrequent ETL means massive data operations
Partitions work well for this, again; e.g., keeping a 5-year window of data
  Insert the most recent year with a partition
  Delete the oldest year's partition
Blazing fast!
[Diagram: a five-partition rolling window; the newest year's partition is inserted as the oldest year's partition is deleted]
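The rolling window in Oracle terms (partition names hypothetical):

   ALTER TABLE sales_fact DROP PARTITION p_2000;  -- roll off the oldest year
   ALTER TABLE sales_fact ADD PARTITION p_2005    -- roll on the newest year
      VALUES LESS THAN (TO_DATE('2006-01-01', 'YYYY-MM-DD'));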
119
"Raw" Data Standards for ETL
Standards here make the process of communicating with your source system partners much easier
  Data types (e.g., the format for date-time stamps)
  File formats (ASCII vs. EBCDIC)
  Header records
  Control characters
Rule of thumb
  Never transfer data at the binary level unless you are transferring between binary-compatible computer systems; use only text-displayable characters
  The trade-off is less rework time vs. less storage space and faster transfer speed; storage and CPU time are cheap compared to labor
120
Last Thought… Indexing Strategies
  Define these early, practice them religiously, use them extensively
  This is "Database Design 101": don't fall prey to this most common performance problem!
121
My Thanks
  For being invited…
  For your time and attention
  For the many folks who have worked for and with me over the years and made me look better as a result
Please contact me if you have any questions: [email protected], PH: 312-695-8618