Data Warehouse Chapter 11

Upload: mitchell-cummings

Post on 27-Dec-2015



Multiple Files Problem

- Added complexity of multiple source files
- Start simple

[Diagram: multiple source files -> logic to detect the correct source -> extracted data]
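The slide does not say how the correct source is detected. One minimal sketch, assuming each source is given a fixed trust ranking, keeps the record from the most trusted source for every key. The source names, the ranking, and the function name are hypothetical.

```python
# Hypothetical source names and trust ranking (lower number = more trusted).
# Neither comes from the slides.
SOURCE_PRIORITY = {"billing": 0, "crm": 1, "legacy": 2}


def pick_correct_source(records):
    """Return one record per key, taken from the highest-priority source."""
    chosen = {}
    for rec in records:
        best = chosen.get(rec["key"])
        if best is None or (
            SOURCE_PRIORITY[rec["source"]] < SOURCE_PRIORITY[best["source"]]
        ):
            chosen[rec["key"]] = rec
    return chosen


extracted = pick_correct_source([
    {"key": 1, "source": "legacy", "name": "ACME"},
    {"key": 1, "source": "billing", "name": "ACME Inc"},
    {"key": 2, "source": "crm", "name": "Bigtown Pizza"},
])
```

Other rules, such as preferring the most recently updated record, fit the same shape; only the comparison changes.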

Transforming Data from Multiple Files

[Diagram: multiple source files]

Missing Values Problem

Solution:
- Ignore
- Wait
- Mark rows
- Extract when time-stamped

Example: if NULL then Field = ‘A’
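The "if NULL then Field = 'A'" rule and the "mark rows" option can be combined in one pass. The sketch below assumes rows are plain dictionaries; the function name and the `was_missing` flag are illustrative, not from the slides.

```python
def fill_missing(rows, field, default="A"):
    """Replace None in `field` with `default` and mark the rows that changed."""
    cleaned = []
    for row in rows:
        row = dict(row)  # copy so the source rows stay untouched
        row["was_missing"] = row.get(field) is None
        if row["was_missing"]:
            row[field] = default
        cleaned.append(row)
    return cleaned


rows = fill_missing(
    [{"id": 1, "grade": None}, {"id": 2, "grade": "B"}],
    "grade",
)
```

Keeping the marker column lets later stages tell a defaulted value from one that was really present in the source.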

Duplicate Value Problem

Solution:
- SQL self-join techniques
- RDBMS constraint utilities

    SELECT …
    FROM table_a, table_b
    WHERE table_a.key(+) = table_b.key
    UNION
    SELECT …
    FROM table_a, table_b
    WHERE table_a.key = table_b.key(+)

[Diagram: the value "ACME Inc" repeated across several records]
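A self-join for finding duplicates can be sketched as follows. This uses SQLite through Python's `sqlite3` module rather than the Oracle syntax on the slide, and the `customer` table and its columns are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO customer (id, name) VALUES (?, ?)",
    [(1, "ACME Inc"), (2, "Bigtown Pizza"), (3, "ACME Inc")],
)

# Self-join: pair each row with any later row that carries the same name.
# The a.id < b.id condition stops a row matching itself and avoids
# reporting each pair twice.
duplicates = conn.execute(
    """
    SELECT a.id, b.id, a.name
    FROM customer a
    JOIN customer b ON a.name = b.name AND a.id < b.id
    ORDER BY a.id, b.id
    """
).fetchall()
```

Here `duplicates` holds one row per duplicated pair, which can then be reviewed or removed.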

Element Names Problem

Solution:
- CTAS
- SQL*Loader

[Diagram: source element names Customer, Client, Contact, and Name mapped to the single element Customer]
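The slide names CTAS and SQL*Loader as the tools. The renaming idea itself can be sketched independently of them, assuming a lookup table from source element names to one target name. The mapping and function name below are hypothetical.

```python
# Hypothetical mapping from source element names to the warehouse name.
# The target name "customer" follows the slide; the rest is illustrative.
NAME_MAP = {
    "customer": "customer",
    "client": "customer",
    "contact": "customer",
    "name": "customer",
}


def standardize_names(row):
    """Rename each column to its warehouse name, leaving unknown columns alone."""
    return {NAME_MAP.get(col.lower(), col.lower()): val for col, val in row.items()}


unified = [
    standardize_names(r)
    for r in ({"Client": "ACME Inc"}, {"Contact": "J. Smith"}, {"Name": "Doe"})
]
```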

Element Meaning Problem

- Avoid misinterpretation
- Complex solution
- Document meaning in metadata

[Diagram: one element interpreted three ways: customer's name, all customer details, all details except name]

Input Format Problem

- EBCDIC versus ASCII
- “123-73” versus 12373
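Both format differences on this slide can be handled with a decode step followed by a normalization step. The sketch assumes the EBCDIC source uses code page 037, one common variant that Python ships as the `cp037` codec; the real code page depends on the source system. The function names are illustrative.

```python
def decode_ebcdic(raw: bytes) -> str:
    # cp037 is an assumption; other EBCDIC code pages need a different codec.
    return raw.decode("cp037")


def normalize_number(text: str) -> int:
    """Turn a formatted value such as '123-73' into the integer 12373."""
    return int(text.replace("-", ""))


raw = "123-73".encode("cp037")  # simulated EBCDIC input
text = decode_ebcdic(raw)
value = normalize_number(text)
```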

Referential Integrity Problem

Solution:
- SQL anti-join
- Server constraints
- Dedicated tools

Department
10
20
30
40

Emp    Name    Department
1099   Smith   10
1269   Jones   20
1270   Doe     50
6787   Harris  60
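An anti-join over the two tables above finds the employees whose department does not exist. The sketch uses SQLite through Python's `sqlite3` module; the table and column names are assumptions, and the data is taken from the slide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE department (dept_id INTEGER PRIMARY KEY)")
conn.execute(
    "CREATE TABLE emp (emp_id INTEGER PRIMARY KEY, name TEXT, dept_id INTEGER)"
)
conn.executemany(
    "INSERT INTO department VALUES (?)", [(10,), (20,), (30,), (40,)]
)
conn.executemany(
    "INSERT INTO emp VALUES (?, ?, ?)",
    [(1099, "Smith", 10), (1269, "Jones", 20), (1270, "Doe", 50), (6787, "Harris", 60)],
)

# Anti-join: keep only employees with no matching department row.
orphans = conn.execute(
    """
    SELECT e.emp_id, e.name, e.dept_id
    FROM emp e
    WHERE NOT EXISTS (
        SELECT 1 FROM department d WHERE d.dept_id = e.dept_id
    )
    ORDER BY e.emp_id
    """
).fetchall()
```

With this data the query reports Doe (department 50) and Harris (department 60), the two rows that break referential integrity.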

Name and Address Problem

- No unique key
- Missing values
- Personal and commercial names mixed
- Different addresses for same member
- Different names and spelling for same number
- Many names on one line
- One name on two lines

Name and Address Problem

Single-field format:

Mr.J.Smith, 100 Main St., Bigtown, County Luth, 23565

Multiple-field format:

Name    Mr.J.Smith
Street  100 Main St.
Town    Bigtown
County  County Luth
Code    23565
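Moving from the single-field to the multiple-field format can be sketched as a split on commas. This only works when every address has exactly five comma-separated parts, which real data rarely guarantees; the function name and the lowercase field names are illustrative.

```python
FIELDS = ("name", "street", "town", "county", "code")


def split_address(line):
    """Split a single-field address into one value per field."""
    parts = [part.strip() for part in line.split(",")]
    if len(parts) != len(FIELDS):
        raise ValueError(
            f"expected {len(FIELDS)} comma-separated parts, got {len(parts)}"
        )
    return dict(zip(FIELDS, parts))


record = split_address("Mr.J.Smith, 100 Main St., Bigtown, County Luth, 23565")
```

Rows that raise `ValueError` are candidates for the more sophisticated name and address tools the chapter mentions.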

Clean and Organize

1. Create atomic values.
2. Standardize formats.
3. Verify data accuracy.
4. Match with other records.
5. Identify private and commercial addresses and inhabitants.
6. Document in metadata.

Requires sophisticated tools and techniques
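Steps 2 and 4 above, standardizing formats and matching records, can be sketched together: two values match when their standardized forms are equal. The abbreviation table is a hypothetical two-entry example; production tools use far larger dictionaries and fuzzy matching.

```python
import re

# Hypothetical abbreviation table, for illustration only.
ABBREVIATIONS = {"st": "street", "inc": "incorporated"}


def standardize(text):
    """Lowercase, drop punctuation, and expand known abbreviations."""
    words = re.sub(r"[^a-z0-9 ]", " ", text.lower()).split()
    return " ".join(ABBREVIATIONS.get(word, word) for word in words)


def same_record(a, b):
    """Treat two values as a match when their standardized forms are equal."""
    return standardize(a) == standardize(b)
```

For example, `same_record("100 Main St.", "100 MAIN STREET")` is true under these rules, while two different street names stay distinct.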

Merging Data

- Operational transactions do not usually map one-to-one with warehouse data
- Data for the warehouse is merged to provide information for analysis

Pizza sales/returns by day, hour, seconds:

Sale    1/2/98  12:00:01  Ham Pizza      $10.00
Sale    1/2/98  12:00:02  Cheese Pizza   $15.00
Sale    1/2/98  12:00:02  Anchovy Pizza  $12.00
Return  1/2/98  12:00:03  Ham Pizza      -$12.00
Sale    1/2/98  12:00:04  Sausage Pizza  $11.00

Merging Data

Operational transactions:

Sale    1/2/98  12:00:01  Ham Pizza      $10.00
Sale    1/2/98  12:00:02  Cheese Pizza   $15.00
Sale    1/2/98  12:00:02  Anchovy Pizza  $12.00
Return  1/2/98  12:00:03  Ham Pizza      -$12.00
Sale    1/2/98  12:00:04  Sausage Pizza  $11.00

Warehouse data:

Sale    1/2/98  12:00:01  Ham Pizza      $10.00
Sale    1/2/98  12:00:02  Cheese Pizza   $10.00
Sale    1/2/98  12:00:04  Sausage Pizza  $11.00

Adding a Date Stamp

- Enables time analysis
- Label loaded data with a date stamp
- Add time to fact and dimension data

Adding a Date Stamp

Sales Fact Table: Item_id, Store_id, Time_key, Sales_dollars, Sales_units
Store Table: Store_id, District_id, Time_key
Item Table: Item_id, Dept_id, Time_key
Time Table: Week_id, Period_id, Year_id, Time_key
Product Table: Product_id, Time_key, Product_desc

Adding a Date Stamp

- Fact table
  - Add triggers
  - Recode applications
  - Compare tables
- Dimension table
- Time representation
  - Point in time
  - Time span
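The "add triggers" option for the fact table can be sketched with a SQLite trigger that stamps each new row. This is a simplification: it stores the load date directly in `time_key` as text, whereas in the schema on the slides Time_key refers to the Time table. The trigger name is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE sales_fact (
        item_id INTEGER,
        store_id INTEGER,
        time_key TEXT,
        sales_dollars REAL,
        sales_units INTEGER
    );

    -- Stamp rows that arrive without a time_key with the current UTC date.
    CREATE TRIGGER stamp_sales_fact AFTER INSERT ON sales_fact
    WHEN NEW.time_key IS NULL
    BEGIN
        UPDATE sales_fact SET time_key = date('now') WHERE rowid = NEW.rowid;
    END;
    """
)

conn.execute(
    "INSERT INTO sales_fact (item_id, store_id, sales_dollars, sales_units) "
    "VALUES (1, 7, 10.0, 1)"
)
(time_key,) = conn.execute("SELECT time_key FROM sales_fact").fetchone()
```

A trigger keeps the stamping out of the loading application, which is the trade-off against the "recode applications" option on the same slide.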

Adding Keys to Data

Operational transactions:

#1    Sale  1/2/98  12:00:01  Ham Pizza      $10.00
#2    Sale  1/2/98  12:00:02  Cheese Pizza   $15.00
#3    Sale  1/2/98  12:00:02  Anchovy Pizza  $12.00
#4    Sale  1/2/98  12:00:03  Ham Pizza      -$12.00
#5    Sale  1/2/98  12:00:04  Sausage Pizza  $11.00

Warehouse data:

#dw1  Sale  1/2/98  12:00:01  Ham Pizza      $10.00
#dw2  Sale  1/2/98  12:00:02  Cheese Pizza   $10.00
#dw3  Sale  1/2/98  12:00:04  Sausage Pizza  $11.00

Data values or artificial keys
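Artificial keys in the `#dw1`, `#dw2`, `#dw3` style can be generated with a simple counter. The sketch assumes rows are dictionaries; the function name and the `dw_key` column are illustrative.

```python
from itertools import count


def add_warehouse_keys(rows, prefix="dw"):
    """Attach a sequential artificial key (dw1, dw2, ...) to each row."""
    counter = count(1)
    return [{"dw_key": f"{prefix}{next(counter)}", **row} for row in rows]


keyed = add_warehouse_keys([
    {"item": "Ham Pizza", "amount": 10.00},
    {"item": "Cheese Pizza", "amount": 10.00},
    {"item": "Sausage Pizza", "amount": 11.00},
])
```

Because the key does not depend on any data value, it stays stable when a source value such as the item name is later corrected.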

Summarizing Data

- During extraction on staging area
- After loading onto the warehouse server

[Diagram: operational databases -> staging area -> warehouse database]
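Summarizing in the staging area can be sketched as a GROUP BY over the staged transactions before they are loaded. The example uses SQLite through Python's `sqlite3` module and the pizza transactions from the earlier slides; the `txn` table layout is an assumption.

```python
import sqlite3

staging = sqlite3.connect(":memory:")
staging.execute(
    "CREATE TABLE txn (kind TEXT, day TEXT, time TEXT, item TEXT, amount REAL)"
)
staging.executemany(
    "INSERT INTO txn VALUES (?, ?, ?, ?, ?)",
    [
        ("Sale", "1/2/98", "12:00:01", "Ham Pizza", 10.00),
        ("Sale", "1/2/98", "12:00:02", "Cheese Pizza", 15.00),
        ("Sale", "1/2/98", "12:00:02", "Anchovy Pizza", 12.00),
        ("Return", "1/2/98", "12:00:03", "Ham Pizza", -12.00),
        ("Sale", "1/2/98", "12:00:04", "Sausage Pizza", 11.00),
    ],
)

# One summary row per day and transaction kind, ready to load.
daily_summary = staging.execute(
    """
    SELECT day, kind, COUNT(*), SUM(amount)
    FROM txn
    GROUP BY day, kind
    ORDER BY kind
    """
).fetchall()
```

The same query could instead run on the warehouse server after loading; the choice mainly shifts where the CPU and disk cost is paid.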

Maintaining Transformation Metadata

Contains transformation rules, algorithms, and routines

[Diagram labels: Sources, Stages, Rules, Publish, Extract, Transform, Load, Query]

Transformation Timing and Location

- Transformation is performed:
  - Before load
  - In parallel
- May be initiated at different points

[Diagram labels: Unlikely, Probable, Possible]

Choosing a Transformation Point

- Workload
- Environment
- CPU use
- Disk space
- Network bandwidth
- Parallel execution
- Load window time
- User information needs

Monitoring and Tracking

Transformations should:
- Be self-documenting
- Provide summary statistics
- Handle process exceptions
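The last two requirements, summary statistics and exception handling, can be sketched as a small wrapper around any row-level transformation. The function name, the statistic names, and the choice of which exceptions count as rejectable are assumptions.

```python
def run_transformation(rows, transform):
    """Apply `transform` to each row, collecting statistics and rejected rows."""
    stats = {"read": 0, "written": 0, "rejected": 0}
    output, exceptions = [], []
    for row in rows:
        stats["read"] += 1
        try:
            output.append(transform(row))
            stats["written"] += 1
        except (ValueError, KeyError) as exc:
            # Keep the bad row and the reason so the run can continue.
            exceptions.append((row, repr(exc)))
            stats["rejected"] += 1
    return output, exceptions, stats


output, exceptions, stats = run_transformation(
    [{"amount": "10.00"}, {"amount": "n/a"}, {}],
    lambda row: {"amount": float(row["amount"])},
)
```

Rejected rows are set aside with their error rather than stopping the load, and the counts give a quick check that read equals written plus rejected.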

Designing Transformation Processes

- Analysis:
  - Sources and target mappings, business rules
  - Key users, metadata, grain
- Design options: PL/SQL, replication, custom, third-party tools
- Design issues:
  - Performance
  - Size of the staging area
  - Exception handling, integrity maintenance

Transformation Tools

- Purchased
- SQL*Loader
- In-house developed

Data Management, Quality, and Auditing Tools

- Data management:
  - Innovative Systems
  - Postalsoft
  - Vality Technology
- Data quality and auditing:
  - Innovative Systems
  - Vality Technology

Summary

This lesson discussed the following topics:

- Importance of data quality
- Transformation processes
- Data transformation issues
- Data anomalies
- Name and address management
- Tools