Transportation: Refreshing Warehouse Data (Chapter 13)
TRANSCRIPT
Developing a Refresh Strategy for Capturing Changed Data
- Consider load window
- Identify data volumes
- Identify cycle
- Know the technical infrastructure
- Plan a staging area
- Determine how to detect changes
[Figure: operational databases at times T1, T2, T3]
User Requirements and Assistance
- Users define the refresh cycle
- IT balances requirements against technical issues
- Document all tasks and processes
- Employ user skills
[Figure: operational databases at times T1, T2, T3]
Load Window
- Time available for the entire ETT process
- Plan
- Test
- Prove
- Monitor
[Figure: 24-hour timeline, midnight to midnight, with a load window on each side of the user access period]
Load Window
- Plan and build processes according to a strategy.
- Consider volumes of data.
- Identify technical infrastructure.
- Ensure currency of data.
- Consider user access requirements first.
- High availability requirements may mean a small load window.
[Figure: 24-hour timeline, midnight to midnight, showing the user access period]
Scheduling the Load Window
Midnight to 3 am:
1. Requirements
2. Load cycle
3. Receive data (File1 and File2, by FTP)
4. Open and read files to verify and analyze (control process)
Control file contents:
- File names
- File types
- Number of files
- Number of loads
- First-time load or refresh
- Date of file
- Data range
- Records in file (counts)
- Totals (amounts)
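The verify-and-analyze step can be sketched as a check of each received file against the counts and totals declared in the control file. This is a minimal sketch only: the `ControlEntry` layout, the CSV format, and the `amount` column are assumptions made for illustration, and the sample file contents are made up.

```python
import csv
import io
from dataclasses import dataclass
from decimal import Decimal


@dataclass
class ControlEntry:
    """Expected values for one data file (hypothetical layout)."""
    file_name: str
    record_count: int
    total_amount: Decimal


def verify_file(entry: ControlEntry, data: str) -> list[str]:
    """Return a list of discrepancies between a file and its control entry."""
    rows = list(csv.DictReader(io.StringIO(data)))
    problems = []
    if len(rows) != entry.record_count:
        problems.append(
            f"{entry.file_name}: expected {entry.record_count} records, got {len(rows)}"
        )
    total = sum((Decimal(r["amount"]) for r in rows), Decimal("0"))
    if total != entry.total_amount:
        problems.append(
            f"{entry.file_name}: expected total {entry.total_amount}, got {total}"
        )
    return problems


# Example: File1 as received (contents are made up).
file1 = "cust_id,amount\n1234,100.50\n5678,20.25\n"
ok = verify_file(ControlEntry("File1", 2, Decimal("120.75")), file1)
bad = verify_file(ControlEntry("File1", 3, Decimal("120.75")), file1)
```

An empty list means the file matches its control entry and can move on to the load step; any discrepancy would normally stop the load for that file.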
Scheduling the Load Window
3 am to 9 am:
5. Load into warehouse (File1 and File2, parallel load)
6. Verify, analyze, reapply
7. Index data
8. Create summaries
9. Update metadata
Scheduling the Load Window
6 am to 9 am:
10. Back up warehouse
11. Create views for specialized tools
12. Users access summary data
13. Publish
Then: user access
Capturing Changed Data for Refresh
- Capture new fact data
- Capture changed dimension data
- Determine method for capture of each
- Methods:
  - Wholesale data replacement
  - Comparison of database instances
  - Time stamping
  - Database triggers
  - Database log
- Hybrid techniques
Wholesale Data Replacement
- Expensive
- Limited historical data, if any
- Data mart implementations
- Time period replacement
[Figure: operational databases at times T1, T2, T3]
Comparison of Database Instances
- Simple to perform, but expensive in time and processing
- Delta file:
  - Changes to operational data since last refresh
  - Used by various techniques
[Figure: yesterday's and today's operational databases feed a database comparison; the delta file holds the changed data]
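The comparison can be sketched as a diff of two snapshots keyed by primary key. This is an illustrative sketch, not a production method: the snapshots are assumed to fit in memory as dictionaries, the function name is hypothetical, and the sample rows are made up.

```python
def compare_snapshots(yesterday: dict, today: dict) -> dict:
    """Build a delta from two snapshots keyed by primary key."""
    delta = {"inserted": {}, "updated": {}, "deleted": {}}
    for key, row in today.items():
        if key not in yesterday:
            delta["inserted"][key] = row
        elif yesterday[key] != row:
            delta["updated"][key] = row
    for key, row in yesterday.items():
        if key not in today:
            delta["deleted"][key] = row
    return delta


yesterday = {
    1: ("John Doe", "Single"),
    2: ("Comer", "1 Main Street"),
    3: ("Smith", "Married"),
}
today = {
    1: ("John Doe", "Married"),
    2: ("Comer", "1 Main Street"),
    4: ("Jones", "Single"),
}
delta = compare_snapshots(yesterday, today)
```

Every row in both snapshots is read, which is why the technique is simple but expensive in time and processing. Unlike time stamping, it does detect deleted rows.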
Time and Date Stamping
- Fast scanning for records changed since last extraction
- Date Updated field
- No detection of deleted data
[Figure: operational data is scanned; the delta file holds the changed data]
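A time-stamp scan can be sketched with an in-memory SQLite table standing in for the operational database. The table name, column names, and ISO 8601 text stamps are assumptions for illustration, and the rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customer (cust_id INTEGER PRIMARY KEY, name TEXT, date_updated TEXT)"
)
conn.executemany(
    "INSERT INTO customer VALUES (?, ?, ?)",
    [
        (1, "Doe", "1996-01-01T08:00:00"),
        (2, "Comer", "1996-01-03T09:30:00"),
        (3, "Smith", "1996-01-04T17:45:00"),
    ],
)


def changed_since(conn, last_extraction):
    """Return rows whose Date Updated stamp is later than the last extraction."""
    cur = conn.execute(
        "SELECT cust_id, name, date_updated FROM customer "
        "WHERE date_updated > ? ORDER BY cust_id",
        (last_extraction,),
    )
    return cur.fetchall()


delta = changed_since(conn, "1996-01-02T00:00:00")

# A deleted row leaves no stamp behind, so the scan cannot report it.
conn.execute("DELETE FROM customer WHERE cust_id = 1")
delta_after_delete = changed_since(conn, "1996-01-02T00:00:00")
```

The second scan returns the same rows as the first, which illustrates the limitation noted above: deletions are invisible to this technique.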
Database Triggers
- Changed data intercepted at the server level
- Extra I/O required
- Maintenance overhead
[Figure: triggers on the operational server (DBMS)]
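Trigger-based capture can be sketched with SQLite triggers that copy each change into a delta table. SQLite is used here only because it runs in memory; the table names, trigger names, and operation codes (`I`, `U`, `D`) are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer (cust_id INTEGER PRIMARY KEY, name TEXT, status TEXT);
CREATE TABLE customer_delta (
    change_id INTEGER PRIMARY KEY AUTOINCREMENT,
    operation TEXT, cust_id INTEGER, name TEXT, status TEXT
);
-- One trigger per operation; each writes an extra row (the extra I/O).
CREATE TRIGGER customer_ins AFTER INSERT ON customer BEGIN
    INSERT INTO customer_delta (operation, cust_id, name, status)
    VALUES ('I', NEW.cust_id, NEW.name, NEW.status);
END;
CREATE TRIGGER customer_upd AFTER UPDATE ON customer BEGIN
    INSERT INTO customer_delta (operation, cust_id, name, status)
    VALUES ('U', NEW.cust_id, NEW.name, NEW.status);
END;
CREATE TRIGGER customer_del AFTER DELETE ON customer BEGIN
    INSERT INTO customer_delta (operation, cust_id, name, status)
    VALUES ('D', OLD.cust_id, OLD.name, OLD.status);
END;
""")

conn.execute("INSERT INTO customer VALUES (1, 'John Doe', 'Single')")
conn.execute("UPDATE customer SET status = 'Married' WHERE cust_id = 1")
conn.execute("DELETE FROM customer WHERE cust_id = 1")

changes = conn.execute(
    "SELECT operation, cust_id, status FROM customer_delta ORDER BY change_id"
).fetchall()
```

Each captured table needs its own set of triggers, which is the maintenance overhead the slide refers to.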
Using a Database Log
- Contains before and after images
- Requires system checkpoint
- Common technique
[Figure: the operational server (DBMS) writes operational data and a log; log analysis and data extraction produce the delta file, which holds the changed data]
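Log analysis can be sketched as collapsing the before and after images written since a checkpoint into one net change per key. Real DBMS logs have vendor-specific formats; the `LogEntry` structure, the sequence numbers, and the sample entries below are all hypothetical.

```python
from typing import NamedTuple, Optional


class LogEntry(NamedTuple):
    seq: int                 # position in the log
    key: int                 # primary key of the affected row
    before: Optional[dict]   # None for an insert
    after: Optional[dict]    # None for a delete


def extract_delta(log, checkpoint_seq):
    """Net (before, after) image per key for entries after the checkpoint."""
    first_before, last_after = {}, {}
    for entry in sorted(log, key=lambda e: e.seq):
        if entry.seq <= checkpoint_seq:
            continue  # already extracted in an earlier refresh
        first_before.setdefault(entry.key, entry.before)
        last_after[entry.key] = entry.after
    # Drop keys whose net effect is no change (e.g. inserted, then deleted).
    return {
        key: (first_before[key], last_after[key])
        for key in last_after
        if first_before[key] != last_after[key]
    }


log = [
    LogEntry(1, 1, None, {"status": "Single"}),
    LogEntry(2, 1, {"status": "Single"}, {"status": "Married"}),
    LogEntry(3, 2, None, {"status": "Single"}),
    LogEntry(4, 2, {"status": "Single"}, None),
    LogEntry(5, 3, {"status": "Single"}, None),
]
delta = extract_delta(log, checkpoint_seq=1)
```

Because the log carries before images, this approach can report deletions (key 3 here), which time stamping cannot.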
Verdict
- Consider each method on merit.
- Consider a hybrid approach if one approach is not suitable.
- Consider current technical, existing operational, and current application issues.
Applying the Changes to Data
You have a choice of techniques:
- Overwrite a record
- Add a record
- Add a field
- Maintain history
- Add version numbers
Overwriting a Record
- Easy to implement
- Loses all history
- Not recommended
Before: Customer Id | John Doe | Single
After:  Customer Id | John Doe | Married
Adding a New Record
- History is preserved; dimensions grow.
- Time constraints are not required.
- Generalized key is created.
- Metadata tracks usage of keys.
Before: 1  | Customer Id | John Doe | Single
After:  1  | Customer Id | John Doe | Single
        1A | Customer Id | John Doe | Married
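Adding a new record can be sketched as a dimension that never overwrites: a changed attribute produces a new row under a new generalized key. The slide uses keys such as 1 and 1A; the integer keys, class names, and field names below are assumptions for illustration.

```python
from dataclasses import dataclass
from itertools import count


@dataclass(frozen=True)
class CustomerRow:
    warehouse_key: int    # generalized key, assigned by the warehouse
    customer_id: str      # key from the operational system
    name: str
    marital_status: str


class CustomerDimension:
    def __init__(self):
        self.rows = []
        self._next_key = count(1)

    def current(self, customer_id):
        """Return the most recently added row for this customer, if any."""
        matches = [r for r in self.rows if r.customer_id == customer_id]
        return matches[-1] if matches else None

    def apply(self, customer_id, name, marital_status):
        """Add a new row when attributes change; never overwrite."""
        cur = self.current(customer_id)
        if cur and (cur.name, cur.marital_status) == (name, marital_status):
            return cur  # nothing changed
        row = CustomerRow(next(self._next_key), customer_id, name, marital_status)
        self.rows.append(row)
        return row


dim = CustomerDimension()
dim.apply("1", "John Doe", "Single")
dim.apply("1", "John Doe", "Married")
```

Both rows remain in the dimension, so history is preserved at the cost of a growing table.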
Adding a Current Field
- Maintains some history
- Loses intermediate values
- Is enhanced by adding an Effective Date field
Before: Customer Id | John Doe | Single
After:  Customer Id | John Doe | Single | Married | 01-JAN-96
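Adding a current field can be sketched as a row that keeps its original value and overwrites only the current one. Reading the two columns as "original" and "current" is an assumption, as are the field names; the second change ("Divorced", 1999) is made up to show how an intermediate value is lost.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class CustomerRow:
    customer_id: str
    name: str
    original_status: str
    current_status: str
    effective_date: Optional[date] = None


def change_status(row: CustomerRow, new_status: str, effective: date) -> None:
    """Overwrite only the current field; the original value stays."""
    row.current_status = new_status
    row.effective_date = effective


row = CustomerRow("Customer Id", "John Doe", "Single", "Single")
change_status(row, "Married", date(1996, 1, 1))
# The row now reads: Single | Married | 1996-01-01

# A later change overwrites "Married": the intermediate value is lost.
change_status(row, "Divorced", date(1999, 3, 15))
```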
Limitations of Methods for Applying Changes
- Complete history impossible
- Dimensions may grow large
- Maintenance overload
Example records (the last column, where present, is the Effective Date):
1234    | Comer | 1 Main Street | 555-6789
1234    | Comer | 200 First Ave | 222-3211
1234    | Comer | 1 Main Street | 555-6789
1234    | Comer | 1 Main Street | 555-6789 | 01-Apr-93
1234-01 | Comer | 200 First Ave | 222-3211
1234-01 | Comer | 200 First Ave | 222-3212 | 01-Jun-97
Maintaining History
- One-to-many relationship
- Always retain current record
- Consistently able to refer to record history
[Figure: HIST_CUST and CUSTOMER tables, with Sales, Time, and Product]
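The one-to-many relationship can be sketched with two SQLite tables: CUSTOMER always holds the current record, and each change first copies the outgoing record into HIST_CUST. The column names, the `replaced_on` field, and the sample date are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer (cust_id INTEGER PRIMARY KEY, name TEXT, address TEXT);
CREATE TABLE hist_cust (
    cust_id INTEGER REFERENCES customer (cust_id),
    name TEXT, address TEXT, replaced_on TEXT
);
""")
conn.execute("INSERT INTO customer VALUES (1234, 'Comer', '1 Main Street')")


def update_address(conn, cust_id, new_address, changed_on):
    """Copy the current record to history, then update it in place."""
    with conn:  # commit both statements together, or roll back on error
        conn.execute(
            "INSERT INTO hist_cust "
            "SELECT cust_id, name, address, ? FROM customer WHERE cust_id = ?",
            (changed_on, cust_id),
        )
        conn.execute(
            "UPDATE customer SET address = ? WHERE cust_id = ?",
            (new_address, cust_id),
        )


update_address(conn, 1234, "200 First Ave", "1997-06-01")
current = conn.execute("SELECT address FROM customer WHERE cust_id = 1234").fetchone()
history = conn.execute("SELECT * FROM hist_cust").fetchall()
```

Queries that need only the present state read CUSTOMER alone; historical analysis joins to HIST_CUST on `cust_id`.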
History Preserved
- History enables realistic analysis.
- History retains context of data.
- History provides for realistic historical analysis.
  - Reflect business changes
  - Maintain context between fact and dimension data
  - Retain sufficient data to relate old to new
Version Numbering
- Avoid double counting
- Facts hold version number
Customer.CustId | Version | Customer Names
1234            | 1       | Comer
1234            | 2       | Comer
Customer.CustId | Version | Sales Facts
1234            | 1       | 11,000
1234            | 2       | 12,000
[Figure: Customer, Sales, Time, and Product tables]
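The double-counting problem can be shown with the figures from the slide. This sketch uses SQLite with assumed table and column names: joining on the customer ID alone matches every fact to every version of the customer, while joining on ID and version matches each fact once.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer (
    cust_id INTEGER, version INTEGER, name TEXT,
    PRIMARY KEY (cust_id, version)
);
CREATE TABLE sales (cust_id INTEGER, version INTEGER, amount INTEGER);
INSERT INTO customer VALUES (1234, 1, 'Comer'), (1234, 2, 'Comer');
INSERT INTO sales VALUES (1234, 1, 11000), (1234, 2, 12000);
""")

# Join on customer ID only: each fact matches both versions.
double_counted = conn.execute(
    "SELECT SUM(s.amount) FROM sales s "
    "JOIN customer c ON c.cust_id = s.cust_id"
).fetchone()[0]

# Join on ID and version: each fact matches exactly one dimension row.
correct = conn.execute(
    "SELECT SUM(s.amount) FROM sales s "
    "JOIN customer c ON c.cust_id = s.cust_id AND c.version = s.version"
).fetchone()[0]
```

Here `double_counted` is 46,000 and `correct` is 23,000, the true sum of the two sales facts.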
Purging and Archiving Data
- As data ages, its value depreciates.
- Remove old data from the warehouse:
  - Archive for later use
  - Purge without copy
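The two options can be sketched as one routine with a switch: copy aged rows to an archive table before deleting them, or delete them outright. SQLite stands in for the warehouse; the table names, the cutoff date, and the sample rows are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sales (sale_date TEXT, cust_id INTEGER, amount INTEGER);
CREATE TABLE sales_archive (sale_date TEXT, cust_id INTEGER, amount INTEGER);
INSERT INTO sales VALUES
    ('1993-04-01', 1234, 11000),
    ('1997-06-01', 1234, 12000);
""")


def purge(conn, cutoff, archive=True):
    """Remove rows older than cutoff; optionally copy them to the archive first."""
    with conn:  # commit both steps together, or roll back on error
        if archive:
            conn.execute(
                "INSERT INTO sales_archive SELECT * FROM sales WHERE sale_date < ?",
                (cutoff,),
            )
        return conn.execute(
            "DELETE FROM sales WHERE sale_date < ?", (cutoff,)
        ).rowcount


removed = purge(conn, "1995-01-01")
archived = conn.execute("SELECT COUNT(*) FROM sales_archive").fetchone()[0]
remaining = conn.execute("SELECT COUNT(*) FROM sales").fetchone()[0]
```

Calling `purge(conn, cutoff, archive=False)` corresponds to purging without a copy.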
Techniques for Purging Data
- TRUNCATE: Retains no rollback
- DELETE: Retains redo and rollback
- ALTER TABLE: Removes a partition
- PL/SQL: Uses database triggers
Techniques for Archiving Data
- Export to dump file from tables
- Import to tables from dump file
- ALTER TABLE EXCHANGE partitions
[Figure: EXP writes a .dmp file from one database; IMP reads it into another]
Final Tasks
- Update metadata
  - ETT
  - User
- Publish data
  - Availability
  - Changes
  - Subject area basis
- Use database roles to prevent and allow access
Publishing Data
- Control access using database roles
- 24-hour operation may be requested
- Compromise between load and access
- Consider
  - Staggering updates
  - Using temporary tables
  - Using separate tables
ETT Tool Selection Criteria
- Overlap with existing tools
- Availability of meta model
- Supported data sources
- Ease of modification and maintenance
- Required fine tuning of code
- Ease of change control
- Power of transformation logic
- Level of modularization
- Power of error, exception, resubmission features
- Intuitive documentation
- Performance of code
ETT Tool Selection Criteria
- Activity scheduling and sophistication
- Metadata generation
- Learning curve
- Flexibility
- Supported operating systems
- Cost
Transportation Tools
- Information: OpenBridge
- Oracle: SQL*Loader, Gateways, PL/SQL, Precompilers
- Platinum Technology: InfoPump
- Platinum: Info Transport
Gateways and Middleware
- Brio Technology: DataPrism
- Information Co.: OpenBridge
- Information Builders: EDA/SQL
- Oracle: Gateways
- Platinum Technology: InfoHub
- Prism: Prism Manager
- Software AG: Entire Transaction Propagator