Data Warehouse Part 01 - University of Houston, smiertsc/4397cis/Data_Warehouse_Part_01.pdf


Data Warehouse – Part 01

Based on Chapter 06, The Data Warehouse, in Data Mining: A Tutorial-Based Primer by Roiger and Geatz

What’s the Problem with Data?

http://www.techrepublic.com/whitepapers/surviving-the-data-explosion-through-data-reduction/1125783?tag=content;siu-container

Why Not Just Use Operational Dbs?

Operational
  Transactional
  OLTP
  Transaction-oriented, i.e., designed for quick processing of an individual transaction, e.g. a purchase

Decision Support Systems
  For example, systems that support decisions through data mining
  Subject-oriented

Review – Process for Building an Operational Db

First Step: data modeling – create the entity relationship diagram (ERD)
  The data model documents the structure of the data
  There is no consideration for use in the data model

ERDs

http://www.sqlservercentral.com/articles/Miscellaneous/designadatabaseusinganentityrelationshipdiagram/1159/

http://www.umsl.edu/~sauterv/analysis/er/er_intro.html

Entity - Relationship

Entity
  Concept
  Represents a class of persons, places, things
  May have attributes
  Some combination of attributes can uniquely identify each instance of an entity (the key)

Relationship
  Between two entities
  One-to-one: husband-to-wife in US culture and society (at any one time)
  One-to-many: father-to-child
  Many-to-many: student-to-teacher

Sample of Credit Card Promotion Data (from Table 2.3)

Income Range  Magazine Promo  Watch Promo  Life Ins Promo  CC Ins  Sex     Age
40-50K        Yes             No           No              No      Male    45
30-40K        Yes             Yes          Yes             No      Female  40
40-50K        No              No           No              No      Male    42
30-40K        Yes             Yes          Yes             Yes     Male    43
50-60K        Yes             No           Yes             No      Female  38
20-30K        No              No           No              No      Female  55
30-40K        Yes             No           Yes             Yes     Male    35
20-30K        No              Yes          No              No      Male    27
30-40K        Yes             No           No              No      Male    43
30-40K        Yes             Yes          Yes             No      Female  41

e.g., What is the cardinality of the relationship customer-to-promotion?
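One way to explore the cardinality question is to count how many promotions each customer accepted and how many customers accepted each promotion. The sketch below is only illustrative: the customer numbers (row positions 1 to 10) and the `cardinality` helper are assumptions, CC Ins is left out because it is not labeled as a promotion, and observed data can only show that "many" occurs, never prove that it cannot.

```python
from collections import Counter

PROMOS = ("Magazine", "Watch", "Life Ins")

# Promotion answers copied from the sample table, one tuple per row.
answers = [
    ("Yes", "No", "No"), ("Yes", "Yes", "Yes"), ("No", "No", "No"),
    ("Yes", "Yes", "Yes"), ("Yes", "No", "Yes"), ("No", "No", "No"),
    ("Yes", "No", "Yes"), ("No", "Yes", "No"), ("Yes", "No", "No"),
    ("Yes", "Yes", "Yes"),
]

# Hypothetical customer ids: the row position in the table, starting at 1.
accepted = {
    (customer_id, promo)
    for customer_id, row in enumerate(answers, start=1)
    for promo, answer in zip(PROMOS, row)
    if answer == "Yes"
}

def cardinality(pairs):
    """Classify a set of (left, right) pairs by the cardinality they exhibit."""
    per_left = Counter(left for left, _ in pairs)
    per_right = Counter(right for _, right in pairs)
    left_has_many = any(n > 1 for n in per_left.values())
    right_has_many = any(n > 1 for n in per_right.values())
    if left_has_many and right_has_many:
        return "many-to-many"
    if left_has_many:
        return "one-to-many"
    if right_has_many:
        return "many-to-one"
    return "one-to-one"

print(cardinality(accepted))
```

In this sample at least one customer accepted several promotions and at least one promotion was accepted by several customers, so the pairs behave as many-to-many.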

Review – Process for Building an Operational Db

First Step: data modeling – create the entity relationship diagram (ERD)
  The data model documents the structure of the data
  There is no consideration for use in the data model

Second Step: normalization (db normalization, not mathematical normalization)

Normalization

Reduces duplication of data within tables

Result is more tables with fewer columns per table

Effective Normalization

Improves data integrity/validity by reducing data redundancy

Faster sorting of data

Queries run efficiently

Can Have Too Much Normalization

Too many relationships

Too many slim, small tables

To retrieve one piece of information requires access to many tables through many joins
  Compromises performance
  Compromises maintenance

Normalization

A formal process

First Normal Form (1NF)

Eliminate any repeating groups of information (a row-column intersection contains only one value, not a list)

No duplicate rows (a primary key can be assigned)

(Problem) Table: Employee

Employee_ID  Last_Name   Children
100          Patel       Babaraj, Salleh, Sara
110          Washington  Martha, Ted
120          Cortez      Sam, Jorge
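A minimal sketch of one way to bring the Employee table into 1NF: move the repeating Children group into its own table with one child per row. The table name `employee_child` and the function name are assumptions made for illustration, not part of the slides.

```python
# Unnormalized rows from the slide: the Children column holds a list.
employee_raw = [
    (100, "Patel", "Babaraj, Salleh, Sara"),
    (110, "Washington", "Martha, Ted"),
    (120, "Cortez", "Sam, Jorge"),
]

def to_first_normal_form(rows):
    """Split the repeating Children group into one row per child."""
    employees = []       # (Employee_ID, Last_Name)
    employee_child = []  # (Employee_ID, Child); hypothetical table name
    for emp_id, last_name, children in rows:
        employees.append((emp_id, last_name))
        for child in children.split(","):
            employee_child.append((emp_id, child.strip()))
    return employees, employee_child

employees, employee_child = to_first_normal_form(employee_raw)
for row in employee_child:
    print(row)
```

Each row-column intersection now holds a single value, and (Employee_ID, Child) can serve as the key of the new table.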

Second Normal Form (2NF)

1NF plus

Each column within the table must depend on the whole primary key

(Problem) Table: Course_Details

Prefix  Course  Credits  College
CIS     3320    3        Technology
CIS     4380    3        Technology
CHEM    3505    5        NSM
MIS     3320    3        Business
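A sketch of a 2NF decomposition of Course_Details, assuming the primary key is the pair (Prefix, Course) and that College depends on Prefix alone. The slide does not spell out the dependency, so both assumptions and the names `course` and `prefix_college` are illustrative.

```python
course_details = [
    ("CIS", 3320, 3, "Technology"),
    ("CIS", 4380, 3, "Technology"),
    ("CHEM", 3505, 5, "NSM"),
    ("MIS", 3320, 3, "Business"),
]

def to_second_normal_form(rows):
    """Move College, which is assumed to depend only on Prefix, to its own table."""
    course = [(prefix, number, credits) for prefix, number, credits, _ in rows]
    prefix_college = {}
    for prefix, _, _, college in rows:
        known = prefix_college.setdefault(prefix, college)
        if known != college:
            # The assumed dependency Prefix -> College does not hold for this data.
            raise ValueError(f"{prefix} maps to both {known} and {college}")
    return course, prefix_college

course, prefix_college = to_second_normal_form(course_details)
print(course)
print(prefix_college)
```

After the split, every remaining column of `course` depends on the whole key, and each college is stored once per prefix.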

Third Normal Form (3NF)

2NF plus

No column is dependent on any other column within the table that is not defined as a key.

No data is derived from other data within the table.

(Problem) Table: Course_Section

Section  Instructor  Office   Phone Number
12345    100-4       M 355    5-7011
12467    101-6       B 424    5-6322
15083    100-4       M 355    5-7011
16078    210-8       B 434    5-3321
56701    101-6       B 424    5-6322
12554    100-12      M 201 A  5-7337
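A sketch of a 3NF decomposition of Course_Section, assuming Office and Phone Number depend on Instructor (a non-key column) rather than on Section. The slide does not name the dependency, so this reading and the variable names are assumptions.

```python
course_section_raw = [
    (12345, "100-4", "M 355", "5-7011"),
    (12467, "101-6", "B 424", "5-6322"),
    (15083, "100-4", "M 355", "5-7011"),
    (16078, "210-8", "B 434", "5-3321"),
    (56701, "101-6", "B 424", "5-6322"),
    (12554, "100-12", "M 201 A", "5-7337"),
]

# Section -> Instructor stays in the section table.
course_section = [(section, instr) for section, instr, _, _ in course_section_raw]

# Instructor -> (Office, Phone Number) moves to its own table, stored once.
instructor = {}
for _, instr, office, phone in course_section_raw:
    instructor[instr] = (office, phone)

print(course_section)
print(instructor)
```

An office or phone change now touches one row in `instructor` instead of every section that instructor teaches.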

Relational Model

Entities are realized as two-dimensional tables where the columns are the attributes of the entity and the rows are data instances of (examples of) the entity.

Relationships between entities are realized as a mapping from the primary key attribute set of one table to one or more columns of the related entity’s table.

Consider Transaction Orientation vs…

Table: Course Section

Section  Instructor
12451    100-4
19372    101-7
10029    100-12
12452    100-4

Table: Grade Record

Student  Section  Grade
1093456  12451    B
1184567  12452    B
2341100  10029    C
1972344  10029    D

Subject Orientation

Table: Grade_by_Instructor

Student  Grade  Section  Instructor
1093456  B      12451    100-4
1184567  B      12452    100-4
2341100  C      10029    100-12
1972344  D      10029    100-12
1093456  C      10029    100-12
1184567  C      10029    100-12
2341100  A      12451    100-4
1972344  B      12452    100-4
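A subject-oriented table like Grade_by_Instructor can be produced by joining the transaction-oriented tables. The sketch below uses Python's built-in `sqlite3` module and only the four Grade Record rows shown on the transaction-orientation slide, so it yields four rows rather than the eight listed above. The lowercase table and column names are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE course_section (section INTEGER PRIMARY KEY, instructor TEXT);
    CREATE TABLE grade_record (
        student INTEGER, section INTEGER, grade TEXT,
        PRIMARY KEY (student, section)
    );
""")
conn.executemany(
    "INSERT INTO course_section VALUES (?, ?)",
    [(12451, "100-4"), (19372, "101-7"), (10029, "100-12"), (12452, "100-4")],
)
conn.executemany(
    "INSERT INTO grade_record VALUES (?, ?, ?)",
    [(1093456, 12451, "B"), (1184567, 12452, "B"),
     (2341100, 10029, "C"), (1972344, 10029, "D")],
)

# Denormalize: repeat the instructor on every grade row.
grade_by_instructor = conn.execute("""
    SELECT g.student, g.grade, g.section, c.instructor
    FROM grade_record AS g
    JOIN course_section AS c ON c.section = g.section
    ORDER BY g.rowid
""").fetchall()

for row in grade_by_instructor:
    print(row)
```

The join is paid once when the table is built, so questions about grades per instructor no longer need it at query time.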

Data Warehouse Design

“A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management’s decision making process.”*

*Inmon, W. H. (1996). Building the Data Warehouse. New York: John Wiley and Sons, Inc.

OLTP vs. Data Warehouse

Data Warehouse                                      OLTP
Subject-oriented                                    Process-oriented (or transaction-oriented)
Denormalized, integrated                            Normalized, separated
Stores data to be reported on, analyzed, tested     Stores data to be processed, collected, managed
Data is historical, no longer used in operations    Data is necessary for day-to-day operations of the business
Data is static                                      Data will be updated
Granularity is a design issue                       Granularity to the most detailed level

Sources of DW Data

External data
  Data not specific to the organization
  Economic indicators, weather

Operational data
  From the OLTP system

Independent data mart
  Like a data warehouse, only focuses on one subject
  Belongs to the organization – but maybe to a different department

ETL

Extract – Transform – Load

A routine whereby data is brought into the data warehouse from other sources

Transform
  Data cleaning
  Resolve granularity issues
  Correct data inconsistencies
  Time-stamp data records
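The four Transform tasks can be sketched as a single function. Everything concrete here is hypothetical: the extracted records, the `SEX_CODES` mapping, the 10K income bands, and the load date are invented for illustration and do not come from the slides.

```python
from datetime import datetime, timezone

# Hypothetical extracted records with typical defects.
extracted = [
    {"customer": " 1001 ", "sex": "M", "income": "45000"},
    {"customer": "1002", "sex": "female", "income": "38,500"},
    {"customer": "1003", "sex": "F", "income": None},
]

SEX_CODES = {"m": "Male", "male": "Male", "f": "Female", "female": "Female"}

def income_range(income):
    """Resolve granularity: coarsen an exact income to a band such as '40-50K'."""
    low = (income // 10_000) * 10
    return f"{low}-{low + 10}K"

def transform(records, load_time):
    cleaned = []
    for rec in records:
        if rec["income"] is None:
            continue  # data cleaning: this sketch simply drops incomplete records
        income = int(str(rec["income"]).replace(",", ""))
        cleaned.append({
            "customer": int(rec["customer"].strip()),
            "sex": SEX_CODES[rec["sex"].strip().lower()],  # correct inconsistencies
            "income_range": income_range(income),
            "loaded_at": load_time.isoformat(),            # time-stamp the record
        })
    return cleaned

load_time = datetime(2018, 5, 1, tzinfo=timezone.utc)
warehouse_rows = transform(extracted, load_time)
for row in warehouse_rows:
    print(row)
```

Dropping incomplete records is only one cleaning policy; a real routine might instead impute or flag them.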

Data in a DW is Static

Once data is in the data warehouse it is read-only

Not always true

Data Warehouse Format

Multidimensional array of data – not based on relational model

Star schema – based on relational model

Star Schema

http://www.executionmih.com/data-warehouse/star-snowflake-schema.php

Fact Table

Defines the dimensions of the multi-dimensional space being created

Each record in a fact table contains two types of data
  Facts
  Dimension keys

Fact table key is a composite key made of keys for each dimension table

Dimension Table

Data specific to a dimension

One-to-many relation from dimension table to fact table
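A small star schema can be sketched with Python's built-in `sqlite3` module: one fact table whose composite primary key is made of the dimension keys, plus two dimension tables. The table names (`dim_time`, `dim_store`, `fact_sales`), the columns, and the data are all hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_time  (time_key INTEGER PRIMARY KEY, month TEXT, year INTEGER);
    CREATE TABLE dim_store (store_key INTEGER PRIMARY KEY, city TEXT);
    CREATE TABLE fact_sales (
        time_key  INTEGER REFERENCES dim_time(time_key),
        store_key INTEGER REFERENCES dim_store(store_key),
        amount    REAL,                      -- the fact (measure)
        PRIMARY KEY (time_key, store_key)    -- composite key of dimension keys
    );
""")
conn.executemany("INSERT INTO dim_time VALUES (?, ?, ?)",
                 [(1, "Jan", 2018), (2, "Feb", 2018)])
conn.executemany("INSERT INTO dim_store VALUES (?, ?)",
                 [(10, "Houston"), (20, "Austin")])
conn.executemany("INSERT INTO fact_sales VALUES (?, ?, ?)",
                 [(1, 10, 100.0), (1, 20, 50.0), (2, 10, 75.0)])

# Each dimension row relates to many fact rows; aggregate facts by a dimension attribute.
sales_by_city = conn.execute("""
    SELECT s.city, SUM(f.amount)
    FROM fact_sales AS f
    JOIN dim_store AS s ON s.store_key = f.store_key
    GROUP BY s.city
    ORDER BY s.city
""").fetchall()

print(sales_by_city)
```

Every analytical query follows the same shape: join the fact table to the dimensions of interest, then group by dimension attributes.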

Multidimensional Database

Definition

“A multidimensional database is structured around measures, dimensions, hierarchies, and cubes rather than tables, rows, columns, and relations.”

Larson, B. (2008). Delivering Business Intelligence with Microsoft SQL Server 2008. New York: McGraw-Hill Osborne.

http://gerardnico.com/wiki/database/database_multidimensional

[Data] Cube

Definition

“A cube is a structure that contains a value for one or more measures for each unique combination of the members of all its dimensions. These are detail, or leaf-level values. The cube also contains aggregated values formed by the dimension hierarchies or when one or more of the dimensions is left out of the hierarchy.”

Larson, B. (2008). Delivering Business Intelligence with Microsoft SQL Server 2008. New York: McGraw-Hill Osborne.

A Point Within a Cube is a Value of the Measure (the Fact)

The intersection of all dimensions is a point

That point represents a value of the measure for the particular unique combination of dimension values

The point is called a detail or leaf-level value
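A cube can be sketched as a mapping from one member per dimension to the measure, with aggregated values obtained by leaving dimensions out. The dimensions (month, city, product), the cell values, and the `aggregate` helper below are hypothetical.

```python
from collections import defaultdict

DIMENSIONS = ("month", "city", "product")

# Leaf-level cells: one measure value per unique combination of dimension members.
leaf_cells = {
    ("Jan", "Houston", "Watch"): 100.0,
    ("Jan", "Austin", "Watch"): 50.0,
    ("Feb", "Houston", "Watch"): 75.0,
    ("Feb", "Houston", "Magazine"): 20.0,
}

def aggregate(cells, keep):
    """Sum the measure over every dimension not named in `keep`."""
    positions = [DIMENSIONS.index(name) for name in keep]
    totals = defaultdict(float)
    for coords, value in cells.items():
        totals[tuple(coords[i] for i in positions)] += value
    return dict(totals)

print(leaf_cells[("Jan", "Houston", "Watch")])  # a single point in the cube
print(aggregate(leaf_cells, ("city",)))         # month and product left out
print(aggregate(leaf_cells, ()))                # all dimensions left out: grand total
```

A production cube would typically precompute and store such aggregates rather than summing on demand as this sketch does.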

Snowflake Schema

Dimension tables are normalized – hence they are broken down into two or more tables

Constellation Schema

More than one fact table
