Healthcare Best Practices in Data Warehousing & Analytics
TRANSCRIPT
Data Warehousing
A Look Back, Moving Forward
Dale Sanders, June 2005
2
Introduction & Warnings
Why am I here?
  Teach: stimulate some thought; share some of my experiences and lessons
  Learn: from you, please… ask questions, challenge opinions, share your knowledge
I'll do my best to live up to my end of the bargain
Warnings: the pictures in this presentation may or may not have any relevance whatsoever to the topic or slide; they are mostly intended to break up the monotony
3
Expectation Management
What to expect: DW strengths (according to others)
  I know what not to do as much as I know what to do; I've seen and made all the big mistakes
  Vision, strategy, system architecture, data management & DW modeling, complex cultural issues, "leapfrog" problem solving
What not to expect: DW weaknesses
  My programming skills suck; I haven't written a decent line of code in four years! Some might say it's been 24 years…
  My knowledge of leading products is very rusty, though I'm beefing up on Microsoft and Cognos
Within these expectations, make no mistake about it… I know data warehousing
4
Today's Discussions
I am a good "idea guy," but ideas are worthless without someone to implement and enhance them
  Steve Barlow, Dan Lidgard, Jon Despain, Chuck Lyon, Laure Shull, Kris Mitchell, Peter Hess, Ron Gault, Rob Carpenter, my wife, and many others
My greatest strength and blessing: the ability to recognize, listen to, and hold onto good people (knock on wood)
My achievements in personal and professional life are more a function of those around me than a reflection on me
5
DW Best Practices: The Most Important Metrics
Employee satisfaction: without it, long-term customer satisfaction is impossible
Customer satisfaction: that's the nature of the Information Services career field; some people in our profession still don't get it. We are here to serve.
The Organizational Laugh Metric: how many times do you hear laughter in the day-to-day operations of your team? It is the single most important vital sign of organizational health and business success.
6
My Background
Three eight-year chapters
  Captain, Information Systems Engineer, US Air Force: nuclear warfare battle management; force status data integration; intelligence and attack warning data "fusion"
  Consultant in several industries
    TRW: CIA Data Center; TRW Credit Reporting Data Base
    National Security Agency (NSA)
    Intel: New Mexico Data Repository (NMDR)
    Air Force: Integrated Minuteman Data Base (IMDB); Peacekeeper Information Retrieval System (PIRS)
    Many others…
  Healthcare: Intermountain Health Care Enterprise Data Warehouse; consultant to other healthcare organizations' data warehouses; now at Northwestern University Medical System
7
Overview
Data warehousing history, according to Sanders
  Why and how did this become a sub-specialty in information systems? What have we learned so far?
My take on "Best Practices": key lessons learned
My thoughts on the most popular authors in the field: what they contribute, where they detract
8
Data Warehousing History
"Newspaper Rock," 100 B.C.
American Retail, 2005 A.D.
Lots of stuff happened
9
What Happened in the Cloud?
Stage 1: Laziness
  Operators grew tired of hanging tapes in response to requests for historical financial data, so they stored the data on-line, in "unauthorized" mainframe databases
Stage 2: End of the mainframe bully
  Computing moved out from finance to the rest of the business; Unix and relational databases; distributed computing created islands of information
Stage 2.1: The government gets involved
  Consolidating IRS and military databases to save money on mainframes: "Hey, look what I can do with this data…"
Stage 3: Deming comes along
  Push towards constant business "reengineering"; a cultural emphasis on "continuous quality improvement" and "business innovation" drives the need for data
Stage 4: Data warehousing has its own language
  Ralph Kimball publishes "The Data Warehouse Toolkit"
10
The Real Truth
Data warehousing is a symptom of a problem: our technological inability to deploy single-platform information systems that
  Capture data once and reuse it throughout an enterprise
  Support high transaction rates (single-record CREATE, SELECT, UPDATE, DELETE) and analytic queries on the same computing platform, with the same data, at the same time
Someday, maybe, we will address the root cause; until then, it's a good way to make a living
11
The "Ideal Library" Practice
  Stores all of the books and other reference material you need to conduct your research: the enterprise data warehouse
  A single place to visit: one database environment
  Contents are kept current and refreshed: timely, well-choreographed data loads
  Staffed with friendly, knowledgeable people who can help you find your way around: your data warehouse team
  Organized for easy navigation and use: metadata, data models, "user friendly" naming conventions
12
Cultural Detractors: the two biggies…
The business supported by the data warehouse must be motivated by a desire for constant improvement and fact-based decision making
The data warehouse team falls victim to the "Politics of Data," whether through naivety or through misguided motives of their own
13
Business Culture
Does your CEO…
  Talk about constant improvement, constantly?
  Drive corporate goals that are SMART? (Specific, Measurable, Attainable, Realistic, Tangible)
  Crave data to make better-informed decisions?
  Become visibly, buoyantly excited at a demo of a data cube? ("I love data!")
If so, the success of your data warehouse is right around the corner… sort of…
14
Political Best Practices
You will be called a "data thief"
  Get used to it; encourage life-cycle ownership of the OLTP data, even in the EDW
You will be called "dangerous"
  "You don't understand our data!" OLTP owners know their data better than you do; acknowledge it and leverage it
You will be blamed for poor data quality in the OLTP systems
  This is a natural reaction; data warehouses raise the visibility of poor data quality. Use the EDW as a tool for raising overall data quality
You will be called a "job robber"
  The EDW is perceived as a replacement for OLTP systems; educate people: the EDW depends on OLTP systems for its existence
Stick to your values and pure motives, and the politics will fade away
15
Data Quality Pitfall
  Taking accountability for data quality on behalf of the source system, spending gobs of time and money "cleansing" data before it's loaded into the DW
  It's a never-ending, never-win battle: you will always be one step behind data quality, and you will always be in the cross-hairs of blame
Best Practice
  Push accountability where it belongs: to the source system
  Use the data warehouse as a tool to reveal data quality, either good or bad
  Be prepared to weather the initial storm of blame
16
Measuring Data Quality
Data Quality = Completeness x Validity
Can it be measured objectively? Yes:
  Measuring "Completeness": count the null values in a column
  Measuring "Validity": cardinality is a simple way to measure validity
    "We only have four standard regions in the business, but we have 18 distinct values in the region column."
A SQL sketch of both measurements follows.
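A minimal sketch, assuming a hypothetical MEMBER table with a REGION column:

   -- Completeness: the fraction of rows with a non-null region
   SELECT COUNT(region) / COUNT(*) AS completeness
   FROM   member;

   -- Validity: distinct values observed vs. the four standard regions
   SELECT COUNT(DISTINCT region) AS distinct_regions
   FROM   member;

   -- List the values themselves, for follow-up with the data steward
   SELECT   region, COUNT(*) AS row_cnt
   FROM     member
   GROUP BY region
   ORDER BY row_cnt DESC;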
17
Business Validity: how can you measure it? You can't…
“I collect this data from our customers, but I have to guess sometimes because I don’t speak Spanish.”
“This data is valid for trend analysis decisions before 9/11/2001, but should not be used after that date, due to changes in security procedures.”
“You can’t use insurance billing and reimbursement data to make clinical, patient care decisions.”
"This customer purchased four copies of 'Zamfir, Master of the Pan Flute', therefore he loves everything about Zamfir." What Amazon didn't know: I bought them for my mom and her sewing circle.
Where do you capture subjective data quality? Metadata….
18
The Importance of Metadata
Maybe the most over-hyped, underserved area of data warehousing common sense
  Vendors want to charge you big $$$$$ for their tools
  Consultants would like you to think that it's the Holy Grail in disguise, and that only they can help you find it
  Authors who have never been in an operational environment would have you chasing your tail in pursuit of an esoteric, mythological Metadata Nirvana
Don't listen to the confusing messages! You know the answer… just listen to your common sense…
19
Metadata: Keep It Simple!
Ultimately, what are the most valuable business motives behind metadata?
  Make data more "understandable" to those who are not familiar with it: data quality issues; data timeliness and temporal issues; the context in which it was collected; translating physical names to natural language
  Make data more "findable" to those who don't know where it is: organize it; take a lesson from library science and the card catalog
20
Table Elements
  Required elements: Long Name (or English name), Description
  Semi-optional elements: Source, Example, Data Steward
21
Column Elements
  Required elements: Long Name, Description
  Optional elements: Value Range, Data Quality, Associated Lookup
22
The Data Model
TABLE_ENT
  TABLE_ENT_ID: NUMBER
  TABLE_ENT_DESC: VARCHAR2(4000)
  TABLE_ENT_SRC: VARCHAR2(50)
  TABLE_ENT_NAME: VARCHAR2(50)
  TABLE_TYPE: VARCHAR2(10)
  CREATE_DT: DATE
  LAST_LOAD_DT: DATE
  SCHEMA_ID: NUMBER

DATA_MART
  DATA_MART_ID: NUMBER
  DATA_MART_NAME: VARCHAR2(50)
  DATA_MART_DESC: VARCHAR2(4000)
  DATA_STEWARD: VARCHAR2(50)
  LAST_LOAD_DT: DATE
  UPDATE_FREQ: VARCHAR2(50)
  DATA_BEG_DT: DATE
  DATA_END_DT: DATE

DATA_MART_TABLE_ENT
  DATA_MART_ID: NUMBER
  TABLE_ENT_ID: NUMBER

FOLDER
  FOLDER_ID: NUMBER
  PARENT_FOLDER_ID: NUMBER
  FOLDER_NM: VARCHAR2(50)
  FOLDER_DSC: VARCHAR2(4000)
  CREATE_USER_ID: VARCHAR2(20)
  CREATE_DT: DATE

REPORT
  RPT_ID: NUMBER
  FOLDER_ID: NUMBER
  RPT_NM: VARCHAR2(250)
  RPT_LOC_TXT: VARCHAR2(1000)
  PURPOSE_TXT: VARCHAR2(4000)
  RUN_FREQ_TXT: VARCHAR2(1000)
  AUDIENCE_TXT: VARCHAR2(500)
  EDW_RPT_FLG: NUMBER
  DATA_SOURCE_TXT: VARCHAR2(4000)
  SELECT_CRITERIA_TXT: VARCHAR2(4000)
  STAT_METHODS_TXT: VARCHAR2(4000)
  RPT_TOOL_TXT: VARCHAR2(250)
  CODE_TXT: CLOB
  FORMULA_TXT: CLOB
  COMMENTARY_TXT: VARCHAR2(4000)
  AUTHOR_NM: VARCHAR2(500)
  AUTHOR_TITLE_TXT: VARCHAR2(500)
  AUTHOR_DEPT_TXT: VARCHAR2(500)
  AUTHOR_LOC_TXT: VARCHAR2(500)
  AUTHOR_PHONE_TXT: VARCHAR2(500)
  AUTHOR_EMAIL_TXT: VARCHAR2(500)
  BUSINESS_OWNER_TXT: VARCHAR2(500)
  METADATA_UPDATE_DT: DATE
  VALIDATION_DT: DATE
  CREATE_USER_ID: VARCHAR2(20)
  CREATE_DT: DATE

REPORT_TABLE_ENT_ASSOC
  RPT_ID: NUMBER
  TABLE_ENT_ID: NUMBER

ATTRIBUTE
  ATTRIBUTE_ID: NUMBER
  TABLE_ENT_ID: NUMBER
  ATTRIBUTE_DESC: VARCHAR2(4000)
  ATTRIBUTE_NAME: VARCHAR2(50)
  ATTRIBUTE_DATATYPE: VARCHAR2(50)
  SAMPLE_VALUE: VARCHAR2(100)
  INDEX_FLG: NUMBER
  PRIMARY_KEY_FLG: NUMBER
  TABLE_POSITION_NO: NUMBER

SCHEMA
  SCHEMA_ID: NUMBER
  SCHEMA_DESC: VARCHAR2(50)
23
Example Metadata Entry: LKUP.POSTAL_CD_MASTER Table
Long Name: Postal Code Master - IHC
Description: Contains postal (zip) codes for the IHC referral region and IHC-specific descriptions. These descriptions allow for specific IHC groupings used in various analyses.
Data Steward: Jim Allred, ext. 3518
25
Some Info Is Free
It can be collected from the database. For example: primary and foreign keys, indexed columns, table creation dates (an Oracle sketch follows).
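In Oracle, for instance, the data dictionary views already hold all three; adjust for your own platform:

   -- Table creation dates
   SELECT object_name, created
   FROM   user_objects
   WHERE  object_type = 'TABLE';

   -- Primary ('P') and foreign ('R') key constraints
   SELECT table_name, constraint_name, constraint_type
   FROM   user_constraints
   WHERE  constraint_type IN ('P', 'R');

   -- Indexed columns
   SELECT index_name, table_name, column_name
   FROM   user_ind_columns;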
26
Most Valuable Info Is Subjective: the human element
  Most metadata is not automatically collected by tools, because it does NOT exist in that form
  Interviews with data stewards are the key
  It can take months (and months and months) of effort to collect initial metadata
27
Holding Feet to the Fire
  Made data architects responsible for metadata in their subject areas
  Metadata completion reports in every staff meeting for a year (one way to generate such a report is sketched below)
  Standing rule: no new data marts go live without metadata
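The actual IHC report wasn't shown, but against the metadata model presented earlier, a completion report could be as simple as this sketch:

   -- Per-table metadata completeness: how many columns have descriptions?
   SELECT   t.table_ent_name,
            COUNT(*)                AS total_columns,
            COUNT(a.attribute_desc) AS described_columns
   FROM     table_ent t
            JOIN attribute a ON a.table_ent_id = t.table_ent_id
   GROUP BY t.table_ent_name
   ORDER BY COUNT(a.attribute_desc) / COUNT(*);  -- worst offenders first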
28
Is it all worth it?
Data analysts think so: "I couldn't do my job without it."
It will push the ROI of a ho-hum DW into the stratosphere
It does for DW'ing what the Yellow Pages did for the business ROI of the telephone
29
It Gets Used: At Intermountain Health Care
210 web hits on average each weekday (23,000 employees, $2B revenue)
[Chart: Avg Hits by Day of Week, April 2004 - Sep 2004: MON 189, TUE 217, WED 212, THU 240, FRI 188]
31
Report Quality
A function of…
  Data quality
  How well the report reflects the intent behind the question being asked
    "This report doesn't make sense. I'm trying to find out how many widgets we can produce next year, based on the last four years' production."
    "That's not what you asked for."
  SQL and other programming accuracy
  Statistical validity: the population size of the data
  Timeliness of the data relative to the decision
  Event correlation
Best Practice: an accompanying "meta-report" for every report that involves significant, high-risk decisions
32
Meta Report
A document, associated with a published report, which defines the report.
33
Repository
A central place for storing and sharing information about business reports
34
IHC Analyst Use of Meta Reports
[Chart, data collected Aug 2004, N=32: Read Others 37%, Search Duplication 89%, Search SQL 21%, Audience Request 95%]
35
Meta Report
  Core elements: Author Information, Report Name, Report Purpose, Data Source(s), Report Methods
  Recommended elements: Business Owner, Run Frequency, Intended Audience, Statistical Tests, Software Used, Source Code, Formulas, Relevant Issues & Commentary
38
• Selection Criteria
• Statistics
• Software
• Source Code
• Formulas
42
Operations Best Practices
Think: Mission Control
  Customized ETL library; schedule of operations; alerting tool; storage strategies / backups; development philosophy and environment; performance monitoring and tuning
43
IHC Architecture
  EDW: Oracle 9.2.0.3 on AIX 5.2; storage: IBM SAN ("Shark"), >3 TB
  ETL tools: Ascential DataStage; Korn shell (Unix), SQL scripts, PL/SQL scripting
  OLAP: Microsoft Analysis Services
  BI: Business Objects (Crystal Enterprise), with a cube presentation layer
  Dashboard: Visual Mining's NetCharts
  EDW team: ~16 FTEs, plus SAs and DBAs
45
History
One of our ETL programmers noticed he kept doing the same things over and over for all of his ETL jobs. Rather than copying and pasting this repetitive code, he created a library. Now we all use the ETL library.
We named the library EDW_UTIL (EDW Utilities).
46
Implementation
  Executes via Oracle stored procedures (a sketch of what the package might look like follows)
  Supported by associated tables to hold data when necessary: an error table, a QA table, an index table
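A minimal sketch of such a package specification; the procedure names come from the slides that follow, while the signatures are assumptions, not the actual IHC implementation:

   CREATE OR REPLACE PACKAGE edw_util AS
      -- index management
      PROCEDURE import_table_index_data (p_owner VARCHAR2, p_table VARCHAR2);
      PROCEDURE drop_table_indexes      (p_owner VARCHAR2, p_table VARCHAR2);
      PROCEDURE create_table_indexes    (p_owner VARCHAR2, p_table VARCHAR2);
      -- load timing, error capture, and QA
      PROCEDURE begin_job_time  (p_job VARCHAR2);
      PROCEDURE end_job_time    (p_job VARCHAR2);
      PROCEDURE load_time_error (p_job VARCHAR2, p_msg VARCHAR2);
   END edw_util;
   /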
47
Benefits
  Provides standardization
  Eliminates code rewrites
  Can hide complexities, such as the appropriate way to analyze and gather statistics on tables
  Very accessible to all of our ETL tools: simply an Oracle stored procedure call
48
Index Management
Past process:
  Drop the table's indexes with a script
  Load the table
  Create the indexes with a script
The past process resulted in messy scripts to manage and coordinate
49
Index Management
New process:
  Capture a table's existing index metadata
  Drop the table's indexes with a single procedure call
  Load the table
  Recreate the indexes with a single procedure call
There are no more messy scripts to manage and coordinate, and no more "lost" indexes, neglected when someone forgot to add them to a create-index script
50
Index Management Samples (a usage sketch follows)
  IMPORT_SCHEMA_INDEX_DATA, IMPORT_TABLE_INDEX_DATA, DROP_TABLE_INDEXES, CREATE_TABLE_INDEXES
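A hypothetical load job using these calls (signatures assumed, as above):

   BEGIN
      edw_util.import_table_index_data('EDW', 'CASEMIX');  -- capture the index metadata
      edw_util.drop_table_indexes('EDW', 'CASEMIX');       -- drop before the load
   END;
   /
   -- ... bulk load the table here ...
   BEGIN
      edw_util.create_table_indexes('EDW', 'CASEMIX');     -- recreate from the captured metadata
   END;
   /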
51
Background Loading of Tables
We often load data into tables which are not accessible to end users; a simple rename puts them into production.
The library helps transfer the identical attributes from the live table to the background table.
Samples (a combined sketch follows):
  COPY_TABLE_METADATA, TRANSFER_TABLE_PRIVS, DROP_TABLE_INDEXES, CREATE_TABLE_INDEXES (create on the background table, identical to the production table)
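A sketch of a background load and the rename swap; the signatures and table names are hypothetical:

   BEGIN
      edw_util.copy_table_metadata ('EDW', 'SALES', 'SALES_BG');  -- clone the structure
      -- ... load SALES_BG here, while SALES stays live ...
      edw_util.create_table_indexes('EDW', 'SALES_BG');           -- identical to production
      edw_util.transfer_table_privs('EDW', 'SALES', 'SALES_BG');
      EXECUTE IMMEDIATE 'ALTER TABLE sales RENAME TO sales_old';  -- the swap into production
      EXECUTE IMMEDIATE 'ALTER TABLE sales_bg RENAME TO sales';
   END;
   /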
52
Load Times, Errors, QA
We had no idea who was loading what and when; each staff member logged in their own way and for their own interest, so ETL error capturing and QA were difficult.
We can now capture errors and QA information in a somewhat standardized fashion.
53
Load Times, Errors, QA: Samples
  BEGIN_JOB_TIME (ex: CASEMIX)
  BEGIN_LOAD_TIME (ex: CASEMIX INDEX)
  END_LOAD_TIME
  END_JOB_TIME
  COMPLETE_LOAD_TIME (begin and end together)
  LOAD_TIME_ERROR (alert on these errors)
  LOAD_TIME_METRICS
  QA (row counts)
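An instrumented job might look like the sketch below; the CASEMIX examples come from the slide, while the exception handling is an assumption:

   BEGIN
      edw_util.begin_job_time('CASEMIX');
      edw_util.begin_load_time('CASEMIX INDEX');
      -- ... do the load work ...
      edw_util.end_load_time('CASEMIX INDEX');
      edw_util.end_job_time('CASEMIX');
   EXCEPTION
      WHEN OTHERS THEN
         edw_util.load_time_error('CASEMIX', SQLERRM);  -- the alerting tool watches these
         RAISE;
   END;
   /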
54
Miscellaneous Procedures
  Hide the "gory" details from the majority of the EDW team, such as Oracle's table analyze command
  Give us consistent application of system-wide parameters, such as a new box with a different number of CPUs (parallel slaves) or a new version of Oracle
  We populate some metadata too, such as the last load date
55
DW Schedule of Operations
  Some loads are ad hoc, not scheduled
  Users query in an ad hoc fashion
  We have a minimal service/application tier implemented (loss of control)
  A variety of ETL tools
  A variety of user categories: DBA, SA, ETL user, end users
  A variety of servers: production EDW, stage EDW, ETL servers, OLAP servers, presentation-layer servers
56
General Approach
  Focus on load jobs against the production EDW; still working on all the reporting aspects (a sample on the next slide)
  Pull this information out of the "load times" data captured by these ETL library calls: BEGIN_JOB_TIME, BEGIN_LOAD_TIME, END_LOAD_TIME, END_JOB_TIME, COMPLETE_LOAD_TIME
58
DW Alerting Tool
  The original idea: aggregate data alerts, such as "your average length of stay just crossed a certain threshold"
  A simple tool was created which sends a text email based on the existence of data returned from a query
  Primarily embraced by DW team members for internal DW operations, though the original intent hasn't been abandoned
59
Features
  Web based; open to all EDW users
  Runs daily, weekly, every two weeks, monthly, or quarterly (wakes every 5 minutes); this is passive polling
  Ability to enter the query in SQL
  Alerts (email) on three situations: the query returns data, the query returns no data, or always (a sketch of the core check follows)
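A sketch of the check logic, assuming a hypothetical ALERT_DEF table that holds each alert's SQL and firing rule, and a hypothetical SEND_ALERT_EMAIL procedure:

   DECLARE
      v_cnt NUMBER;
   BEGIN
      FOR a IN (SELECT alert_id, alert_sql, fire_when
                FROM   alert_def
                WHERE  next_run_dt <= SYSDATE) LOOP
         -- run the user's query and count the rows it returns
         EXECUTE IMMEDIATE
            'SELECT COUNT(*) FROM (' || a.alert_sql || ')' INTO v_cnt;
         IF (a.fire_when = 'DATA'    AND v_cnt > 0)
         OR (a.fire_when = 'NO_DATA' AND v_cnt = 0)
         OR  a.fire_when = 'ALWAYS' THEN
            send_alert_email(a.alert_id);
         END IF;
      END LOOP;
   END;
   /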
61
Examples
  ~100 alerts in use
  Live performance checks: every 4 hours, look for inactive sessions holding active slaves; daily, look for any active sessions older than 72 hours
  ETL monitoring, alerting only if there is a problem: alert on errors logged via the EDW_UTIL library (manage by exception); alert on the existence of "bad" records captured during ETL
62
Storage and Backup
Inherited state of affairs: running like any OLTP database
  High-end, expensive SANs (storage area networks)
  FULL nightly online backups
  Out of space? Just buy more
63
Nightmare in the Making
  Exponential growth: more data sources, more summary tables, more indexes; no data has yet been purged
  Relaxed attitude: "disk is cheap"; reality: disk management is expensive
64
Looming Crisis
  Backups often run 16 hours or more; performance degradation witnessed by users; good backups obtained less than 50% of the time
  Literally running out of space: gross underestimating, some reckless overuse
  Financial $$$$ cost: the system administrators (SAs) quadrupled the price of disk purchases from the previous budget year. Ouch! And the SAs roll in the price of tape drives, etc.
65
Major Changes in Operations
  Transfer some disk ownership AND backup responsibilities to the DW team, away from the SAs and DBAs
    The EDW team is more aware of upcoming space demands
    The EDW team is more in tune with which data sets are easily recreated from the source (and don't need a backup)
  Stop performing full daily backups
  Move towards a less expensive disk option: IBM offers a few levels of SANs
67
Changes to Backup Strategy
  Perform a full backup once monthly, during downtime
  Perform no data backup on DEV/STAGE environments
  Do back up DDL (all code) daily, in all environments
  Implement a daily "incremental" backup
68
Daily Incremental Backups
Easier said than done; we've resorted to a table-level backup (in Oracle, that's an EXPORT)
  The EDW team owns which tables are exported
  The EDW team populates a table, the "export table list," with each table's export frequency (populated via an application in development)
  The DBAs run an export based on the "export table list" (a sketch of the idea follows)
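The table shape and frequency scheme below are assumptions, since the implementation details weren't shown:

   CREATE TABLE export_table_list (
      owner        VARCHAR2(30),
      table_name   VARCHAR2(30),
      export_freq  VARCHAR2(10),  -- e.g., DAILY, WEEKLY, MONTHLY
      last_exp_dt  DATE
   );

   -- Tables due for tonight's export run
   SELECT owner, table_name
   FROM   export_table_list
   WHERE  (export_freq = 'DAILY')
      OR  (export_freq = 'WEEKLY'  AND last_exp_dt <= SYSDATE - 7)
      OR  (export_freq = 'MONTHLY' AND last_exp_dt <= ADD_MONTHS(SYSDATE, -1));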
69
Use Cheaper Disk
General practice: you can take greater risks with DW reliability and availability vs. OLTP systems; use that to your advantage
  Our SAN vendor (IBM) offers a few levels of SANs; the next level down is a big step down in price, a small step down in features
  Feature loss: read cache (referring to disk cache, not box memory); we rarely read the same thing twice anyway
  No "phone home" to IBM (auto paging)
  Mean time to failure is worse, but still acceptable
70
Performance Monitoring & Tuning
Err on the side of freedom and empowerment: how much harm can really be done? We'd rather not constrain our customers.
  "Pounding queries" do find their way to production
  An opportunity to educate users, and an opportunity for us to tune the underlying structures
71
The Focus Areas
  Indexing: well-defined criteria for when and how to apply indexes (is this a lost art?); big use of bitmap indexes; the composite-index trick (an index that covers the query acts like a narrow table)
  Partitioning for performance, rather than data management
  Exploiting Oracle's Direct Path INSERT feature
  Avoiding UPDATE and DELETE commands: copy with MINUS instead (sketched below)
  Implementing Oracle's Parallel Query
  Turning off referential integrity in the DW: a no-brainer; that's the job of the source system
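A sketch of the Direct Path INSERT and "copy with MINUS" techniques; the table names are hypothetical:

   -- Direct Path INSERT: writes blocks above the high-water mark,
   -- generating far less undo than a conventional insert
   INSERT /*+ APPEND */ INTO sales_fact
   SELECT * FROM sales_stage;

   -- Copy with MINUS instead of DELETE: rebuild the table without the
   -- unwanted rows, then swap the new table into production
   CREATE TABLE sales_fact_new AS
      SELECT * FROM sales_fact
      MINUS
      SELECT * FROM sales_fact_purge;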
72
DW Monitoring: Empowering End Users
Motive: too many calls from end users about their queries
  "Please kill it." "Is it still running, or is my PC locked up?" "Why is the DW so slow?"
Give them the insight and the tools; give them the ability to kill their own queries
  Still in the works
74
Tracking Long-Running Queries
We use Pinecone (from Ambeo) to monitor the duration, and the SQL, of all queries
Each week, we look at the top few. Typical outcomes?
  We'll add indexes; we'll denormalize; we'll contact the user and assist them with writing a better query
75
The DW Sandbox
More empowerment for customers
Motive
  Lots of little MS Access databases (with valuable data) spread all over the place
  They needed to be joined with DW data; costly to maintain; PC hogs
Solution
  Provide customers with their own "sandbox" on the DW, with DBA-like privileges
76
Features
  Web-based tool for creating tables and loading MS Access data to the DW; a simple, easy-to-use interface
  Privileges: users have full rights to the tables they create, and can grant rights to others
  A big, big victory for customer service and data "maturity"
  10% of DW customers use the Sandbox; about 600 tables in use now; about 2 GB of data
77
Design-Build Best Practices
Build vertically, design horizontally
  Start by building data marts that address analytic needs in one area of the business, with a fairly limited data set
  But design with the horizontal needs of the company in mind, so that you can eventually "tie" all of these vertical data marts together with a common semantic layer
79
For Example…
[Diagram: an integrated reporting model of cancer patient data. The oncology data integration strategy: top-down reporting requirements and data model, with disparate sources (Cancer Registry, Mammography, Radiology, Pathology, Laboratory, Continuing Care and Follow-Up, Quality of Life Survey, Radiation Therapy, Health Plans Claims, Ambulatory Casemix, Acute Care Casemix) "connected" semantically to the data bus]
80
The Logic Layer in Data Warehouses
[Diagram: Source System feeds the ETL Process, which feeds the Data Warehouse, which feeds Reports, spanning a Data Layer, a Logic Layer, and a Presentation Layer; transaction systems and analytic systems are distinguished, with "Here" and "Not Here" annotations marking where the business logic belongs]
81
Evidence of Business Process Alignment
1. Map out your high-level business processes
   Don't fall prey to analysis paralysis with endless business process modeling diagrams!
2. Identify and associate the transaction systems that support those processes
3. Identify the common, overlapping semantics/data attributes and their utilization rates
4. Build your data marts within an enterprise framework that is aligned with the processes you are trying to understand
82
For example…
[Diagram: a healthcare business process (Health Need, Diagnosis, Procedure, Results & Outcomes, Patient Perception, across an Episode of Care, plus AP/AR and Claims Processing), supported by non-integrated data in transaction systems (HELP, Lab, HPI, MC400, Survey, AS400, IDX, HDM, CIS/CDR, HNA, Rx), integrated in the Data Warehouse]
83
Event Correlation
A leading-edge Best Practice: the third dimension to rows and columns, overlaying the data that underlies a report or graph
  "In 2004, we experienced a drop in revenue as a result of the earthquake that destroyed our plant in the Philippines."
“In January of 2005, we saw a spike in the North America market for snow shovel sales that coincided with an increase in sales for pain relievers. This correlates to the record snowfall in that region and should not be considered a trend. Barring major product innovation, we consider the market for snow shovels in this area as saturated. Sales will be slow for the next several years.”
84
Standardizing Semantics
Sweet irony: there are many synonyms for "standard semantics": data dictionary, vocabulary, dimensions, data elements, data attributes
The bottom-line issue: standardizing the terms you use to describe key facts about your business
85
Standardizing "Names of Things"
You'd better do it within the first two months of your data warehouse project
  If you are beyond that point, you'd better stop and do it now, lest you pay a bigger price later
Don't…
  Push the standard onto the source systems, unless it's easy to accomplish; this was one of the common pitfalls behind early data warehousing project failures
  Try to standardize everything under the sun! Focus on the high-value facts
86
Where Are The "High Value" Semantics?
In the high-overlap, high-utilization areas…
[Diagram: overlapping Source Systems X, Y, and Z; their intersection is the highest-value area for standardizing semantics]
87
Another Perspective
[Chart: semantic utilization on one axis, semantic overlap on the other]
88
The Standard Semantic "Layer"
[Diagram: source systems feeding the data warehouse through Extract, Transform, Load, with semantic standards applied as a layer across the ETL process]
89
Data Modeling
  Star schemas are great and simple, but they aren't the end-all, be-all of analytic data modeling
    Best practice: do what makes sense; don't be a schema bigot. I've seen great analytic value from 3NF models
  Maintain data familiarity for your customers when meeting vertical needs
    Don't make massive changes to the way the model looks and feels, nor to the naming conventions; you will alienate existing users of the data
  Use views to achieve "new" or standards-compliant perspectives on the data when meeting horizontal needs (a sketch follows)
90
For Example…
[Diagram: the source perspective and the DW perspective share similar names & organization, serving the vertical data customer; a "standardized" view over the DW tables serves the horizontal data customer]
91
The Case For Timely Updates
[Chart: % of requests for data vs. data age, from today out to 1 and 2 years; utilization is concentrated in the freshest data]
Generally, to minimize Total Cost of Ownership (TCO), your update frequency should be no greater than the decision-making cycle associated with the data. But… everyone wants more timely data.
92
Best Practice: Measure Yourself
The Data Warehouse Dashboard:
  Employee satisfaction
  Customer satisfaction
  Average number of queries/month
  Number of queries above a threshold (30 minutes?)
  Average query response time
  Total number of records
  Total number of query-able tables
  Total number of query-able columns
  Number of "users"
  Average rows delivered per month
  Storage utilization
  CPU utilization
  Downtime per month, by data mart
93
Other Best Practices
  The data warehouse information systems team reports to the CIO; most data analysts can, and probably should, report to the business units
  Change management / service-level agreements with the source systems: no changes in the source systems unless they are coordinated with the data warehouse team
94
More Best Practices
Skills of the Data Warehouse IS Team
  An experienced chief architect/project manager
  Procedural/script programmers
  SQL/declarative programmers
  Data warehouse storage management architects
  Data warehouse hardware architects and system administrators
  Data architects/modelers
  DBAs
95
More Best Practices
Evidence of project collaboration (project = complex reports or a data mart)
  A cross-section of members and expertise from the data warehouse IS team
  Statisticians and data analysts who understand the business domain
  A customer who understands the process(es) being measured and can influence change
  A data steward: usually someone from the front lines who knows how the data is collected
96
More Best Practices
Whenever possible, extract as close to the source as possible
[Diagram: the Primary Source feeds Copy A, which feeds Copy B; the best-practice path feeds the Data Warehouse directly from the Primary Source]
97
The Most Popular Authors
I appreciate…
  The interest they stir
  The vocabulary (the semantics) of this new specialty that they helped create
The downside…
  The buzzwords that are more buzz than substance: "Corporate Information Factories"
  Endless, meaningless debate: "That's not an Operational Data Store!" "Do you follow Kimball or Inmon?"
Follow your own common sense; most of these authors have not had to build a data warehouse from scratch and live with their decisions through a complete lifecycle
98
ETL Operations
Besides the cultural risks and challenges, the Extract, Transform, and Load processes are the riskiest part of a data warehouse
  A good book: Westerman, "Data Warehousing: Using the Wal-Mart Model"
  Worthy of its own "Best Practices" discussion; suffice it to say, mitigate risks in this area carefully and deliberately
  The major design errors don't show up until late in the lifecycle, when the cost of repair is great
99
Two Essential ETL Functions
  Initial loads: how far back do we go in history?
  Maintenance loads: differential loads or total refresh? How often?
You will run and tune these processes several times before you go into production
  How many records are we dealing with? How long will this take to run? What's the impact on source system performance?
100
Maintenance Loads
Total refresh vs. incremental loads
  Total refresh: truncate and reload everything from the source system
  Incremental: load only the new and updated records
For small data sets, a total refresh strategy is the easiest to implement
  How do you define "small"? You'll know it when you see it.
  Sometimes it's the fastest strategy when you are trying to show quick results: grab and go…
101
Incremental Loads
How do we get a snapshot of the data that has changed since the last load?
  Many source systems will have an existing log file of some kind
  Take advantage of these when you can; otherwise, incremental loads can be complicated
102
File Transfer Formats
Design your extract so that it uses…
  A fixed, predetermined length for all records and fields; avoid variable length if at all possible
  A unique character that separates each field in a record, such as ~
  A standard format for header records across all source systems, such as the first three records in each file
    Include the name of the source system, the file, the record count, and the number of fields in the record; this will be handy for monitoring jobs and collecting load metadata
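One possible layout, purely illustrative (the slide doesn't prescribe the exact header fields):

   SOURCE~CASEMIX~EXTRACT_DT~20050601            (header record 1: source system)
   FILE~casemix_20050601.dat~RECORD_CNT~1048576  (header record 2: file name and record count)
   FIELDS~42                                     (header record 3: fields per record)
   000123456~20050531~SLC~...                    (data records follow)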
103
Benefits of a Standard File Transfer Format
  Compatible with standard database and operating system utilities
  Dynamically create initial and maintenance load scripts: read the table definitions (DDL), then merge that with the standard transfer file format
  Dynamically generate load monitoring data: read the header records, insert them into a "Load Status" table with a status of "Running," the number of records, and the start time; at EOF, change the status to "Complete" and capture the end-of-load time (sketched below)
I wish I had thought about this topic more, and earlier in my career
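A sketch of the Load Status bookkeeping, assuming a hypothetical LOAD_STATUS table and the header layout illustrated on the previous slide:

   -- On reading the header records:
   INSERT INTO load_status (source_system, file_name, record_cnt, status, start_time)
   VALUES ('CASEMIX', 'casemix_20050601.dat', 1048576, 'Running', SYSDATE);

   -- At end-of-file:
   UPDATE load_status
   SET    status = 'Complete', end_time = SYSDATE
   WHERE  file_name = 'casemix_20050601.dat';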
104
Westerman Makes a Good Point
My experience: ETL is the least tasteful and least productive use of a veteran EDW team member's time, so I like Westerman's insight on this topic
If you design for instantaneous updates from the beginning, it translates to less ETL maintenance and labor time for the EDW staff later
105
Messaging Applied to ETL
Basic concepts
  Use a load message queue for records that need to be updated, coming from the source systems
  When the EDW analytical processing workload is low (off-peak), pick the next message off the load queue and load the data
  Run this in parallel so that you can process several load messages at the same time while you have a window of opportunity
  Sometimes called "throttling": speed up and slow down based upon traffic conditions (a sketch follows)
Motive behind the concept
  Continuous updates in a mixed workload environment
  Mixed: analytical processing at the same time as transaction-oriented, constant updates, deletes, and inserts
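A sketch of the throttling loop; the LOAD_QUEUE table, the workload function, and the apply procedure are all hypothetical:

   BEGIN
      FOR m IN (SELECT msg_id
                FROM  (SELECT msg_id FROM load_queue ORDER BY queued_dt)
                WHERE  ROWNUM <= 4) LOOP       -- several messages per pass
         EXIT WHEN edw_workload_pct() > 70;    -- throttle: back off when analysts are busy
         apply_load_message(m.msg_id);         -- replays the source insert/update/delete
      END LOOP;
   END;
   /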
106
ETL Message Queue Process
[Diagram: source systems feed updates, inserts, and deletes into an ETL message queue; an ETL manager drains the queue into the EDW production tables, throttled by database workload and performance metrics]
107
Four Data Maintenance Processes
  Initial load: loading into an empty table
  Append load
  Update process
  Delete process
As much as practical, use your database utilities for these processes
  Study and know your database's utilities for data warehousing; they are getting better all the time
  I see some bad strategies in this area: companies spending time building their own utilities… aye cucumber!
108
A Few Planning Thoughts
  Understand the percentage of records that will be updated, deleted, or inserted; you'll probably develop a different process for 90% inserts vs. 90% updates
  Logging: in general, turn logging off during these processes, if logging was on at all
  Field- vs. record-level updates: some folks, in the interest of purity, will build complex update processes for passing only field (attribute) level changes. No-brainer: pass the whole record
109
Initial Load
  Every table will, at some time, require an initial load
  For some tables, it will be the best choice for ongoing data maintenance: a total data refresh, best for "small" tables
  A simple process to implement: simply delete (or truncate) and reload with fresh data
110
A Better Initial Load Process
Background load
  Safer: protects against corrupt files
  Higher availability to customers
Three or four steps… maybe six (sketched below):
  1. Create a temporary table
  2. Load the temporary table
  3. Run quality checks
  4. Rename the temporary table to the production table name
  5. Delete the old table
  6. Regrant rights, if necessary
Westerman: "You want to use as many initial load processes as possible." I agree!
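In Oracle terms, the six steps reduce to something like this sketch (names hypothetical):

   CREATE TABLE sales_tmp AS
      SELECT * FROM sales WHERE 1 = 0;      -- 1. empty copy of the production table
   -- 2. load SALES_TMP (SQL*Loader, INSERT /*+ APPEND */, etc.)
   -- 3. run quality checks (row counts, key spot checks)
   ALTER TABLE sales RENAME TO sales_old;   -- 4. swap the tables...
   ALTER TABLE sales_tmp RENAME TO sales;
   DROP TABLE sales_old;                    -- 5. delete the old table
   GRANT SELECT ON sales TO edw_users;      -- 6. regrant rights, if necessary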
111
Append Load
For larger tables that accumulate historical data
  There are no updates, just appends: a hard fact that will not change
  Examples: sales that are closed; lab results
112
Append Load Options
  Load a single part of a table
  Load a partition and "attach" it to the table: create a new, empty partition; load the new records; attach the partition to the table (an Oracle sketch follows)
  Look for a "LOAD APPEND" style command in your database
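In Oracle, the "attach" is EXCHANGE PARTITION, a near-instant metadata operation; the names here are hypothetical:

   -- Create the new, empty partition
   ALTER TABLE sales_fact ADD PARTITION p_2005_06
      VALUES LESS THAN (TO_DATE('2005-07-01', 'YYYY-MM-DD'));

   -- ... load the new records into the standalone table SALES_2005_06 ...

   -- Attach: swap the loaded table into the partition
   ALTER TABLE sales_fact EXCHANGE PARTITION p_2005_06
      WITH TABLE sales_2005_06;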
113
Another Append Option
  1. Create a temp table identical to the one you are loading
  2. Load the new records into the empty temp table
  3. Issue an INSERT/SELECT:

     INSERT INTO big_table (SELECT * FROM temp_big_table);

  4. Delete the temp table

IF the # of records in TEMP is much less than the # of records in BIG
THEN good technique
ELSE not good
114
Update Process
  The most difficult and risky process to build; use it only if the tables are too large for a complete-refresh ("initial load") process
  Updates affect data that changes over time, like purchase orders, hospital transactions, etc.; medical records too, if you treat the data maintenance at the macroscopic level
115
Update Process Options
A simple process:
  1. Separate the affected records into an update file, an insert file, or a delete file; do this on the source system, if possible
  2. Transfer the files to the data warehouse staging area
  3. Create and run two processes:
     A delete process for deleting the records in the production table that need to be updated or deleted
     An insert process for inserting the entirely new "updated" records into the production table, as well as the true inserts
Simple, but typically not very fast
116
Simple Process
[Diagram: the source system produces Updates, Deletes, and Inserts files in the EDW staging area; a Delete Process and an Insert Process apply them to the EDW production table in six steps]
1. The Delete Process identifies records for deletion from the production table based upon the contents of the Updates file.
2. The Delete Process identifies records for deletion from the production table based upon the contents of the Deletes file.
3. The Delete Process deletes records from the production table.
4. The Insert Process identifies records for insert into the production table based upon the contents of the Updates file.
5. The Insert Process identifies records for insert into the production table based upon the contents of the Inserts file.
6. The Insert Process inserts records into the production table.
117
When You Are Unsure
Sometimes, source system log and audit files make it difficult to know whether a record was updated or inserted (i.e., created)
Try this…
  1. Load the records into a temp table that is identical to the production table to be updated
  2. Delete the corresponding records from the production table:

     DELETE FROM prod_table
     WHERE  key_field IN (SELECT temp_key_field FROM temp_table);

  3. Insert all the records from the temp table into the production table
Most databases now support this pattern with an UPSERT (sketched below)
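In Oracle 9i, for example, the UPSERT is the MERGE statement; a sketch against the same temp-table pattern, with hypothetical column names:

   MERGE INTO prod_table p
   USING temp_table t
   ON   (p.key_field = t.key_field)
   WHEN MATCHED THEN
      UPDATE SET p.col_a = t.col_a, p.col_b = t.col_b
   WHEN NOT MATCHED THEN
      INSERT (key_field, col_a, col_b)
      VALUES (t.key_field, t.col_a, t.col_b);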
118
Massive Deletes
Just as with updates and inserts, the number of deletes you have to manage is inversely proportional to the frequency of your ETL processes: infrequent ETL means massive data operations
Partitions work well for this, again; e.g., keeping a 5-year window of data
  Insert the most recent year with a partition
  Delete the oldest year's partition
Blazing fast!
[Diagram: a five-partition rolling window; the newest year's partition is inserted as the oldest year's partition is deleted]
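The rolling window in Oracle terms (partition names hypothetical):

   ALTER TABLE sales_fact DROP PARTITION p_2000;  -- roll off the oldest year
   ALTER TABLE sales_fact ADD PARTITION p_2005    -- roll on the newest year
      VALUES LESS THAN (TO_DATE('2006-01-01', 'YYYY-MM-DD'));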
119
"Raw" Data Standards for ETL
Standards here make the process of communicating with your source system partners much easier
  Data types (e.g., the format for date-time stamps)
  File formats (ASCII vs. EBCDIC)
  Header records
  Control characters
Rule of thumb
  Never transfer data at the binary level unless you are transferring between binary-compatible computer systems; use only text-displayable characters
  The trade-off is less rework time vs. less storage space and faster transfer speed; storage and CPU time are cheap compared to labor
120
Last Thought… Indexing Strategies
  Define these early, practice them religiously, use them extensively
  This is "Database Design 101": don't fall prey to this most common performance problem!
121
My Thanks
  For being invited…
  For your time and attention
  For the many folks who have worked for and with me over the years and made me look better as a result
Please contact me if you have any questions: [email protected], PH: 312-695-8618