data warehouse part 01 - university of houstonsmiertsc/4397cis/data_warehouse_part_01.pdfdata...

Post on 01-May-2018

Data Warehouse – Part 01

Based on Chapter 06, The Data Warehouse, in Data Mining: A Tutorial-Based Primer by Roiger and Geatz

What’s the Problem with Data?

http://www.techrepublic.com/whitepapers/surviving-the-data-explosion-through-data-reduction/1125783?tag=content;siu-container

Why Not Just Use Operational DBs?

Operational systems
  Transactional (OLTP)
  Transaction-oriented, i.e., designed for quick processing of an individual transaction, e.g., a purchase

Decision support systems
  For example, systems that support decisions through data mining
  Subject-oriented

Review – Process for Building an Operational Db

First Step: data modeling – create the entity relationship diagram (ERD)
  The data model documents the structure of the data
  There is no consideration for use in the data model

ERDs

http://www.sqlservercentral.com/articles/Miscellaneous/designadatabaseusinganentityrelationshipdiagram/1159/

http://www.umsl.edu/~sauterv/analysis/er/er_intro.html

Entity - Relationship

Entity
  Concept
  Represents a class of persons, places, things
  May have attributes
  Some combination of attributes can uniquely identify each instance of an entity
    Key

Relationship
  Between two entities
  One-to-one
    Husband-to-wife in US culture and society (at any one time)
  One-to-many
    Father-to-child
  Many-to-many
    Student-to-teacher

Sample of Credit Card Promotion Data (from Table 2.3)

Income Range  Magazine Promo  Watch Promo  Life Ins Promo  CC Ins  Sex     Age
40-50K        Yes             No           No              No      Male    45
30-40K        Yes             Yes          Yes             No      Female  40
40-50K        No              No           No              No      Male    42
30-40K        Yes             Yes          Yes             Yes     Male    43
50-60K        Yes             No           Yes             No      Female  38
20-30K        No              No           No              No      Female  55
30-40K        Yes             No           Yes             Yes     Male    35
20-30K        No              Yes          No              No      Male    27
30-40K        Yes             No           No              No      Male    43
30-40K        Yes             Yes          Yes             No      Female  41

e.g., What is the cardinality of the relationship customer-to-promotion?

Review – Process for Building an Operational Db

First Step: data modeling – create the entity relationship diagram (ERD)
  The data model documents the structure of the data
  There is no consideration for use in the data model

Second Step: normalization (db normalization, not mathematical normalization)

Normalization

Reduces duplication of data within tables
Result is more tables with fewer columns per table

Effective Normalization

Improves data integrity/validity by reducing data redundancy
Faster sorting of data
Queries run efficiently

Can Have Too Much Normalization

Too many relationships
Too many slim, small tables
To retrieve one piece of information requires access to many tables through many joins
  Compromises performance
  Compromised maintenance

Normalization

A formal process

First Normal Form (1NF)

Eliminate any repeating groups of information (a row-column intersection contains only one value, not a list)
No duplicate rows (a primary key can be assigned)

(Problem) Table: Employee

Employee_ID  Last_Name   Children
100          Patel       Babaraj, Salleh, Sara
110          Washington  Martha, Ted
120          Cortez      Sam, Jorge
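
One way to repair the Employee table above is to move the repeating Children group into its own table, one child per row. The sketch below is only an illustration: the function name `to_first_normal_form` and the table names `employees_1nf` and `employee_child` are assumptions, not names from the slides.

```python
# Rows of the (problem) Employee table; the Children column holds a list.
employee = [
    (100, "Patel", "Babaraj, Salleh, Sara"),
    (110, "Washington", "Martha, Ted"),
    (120, "Cortez", "Sam, Jorge"),
]


def to_first_normal_form(rows):
    """Split the repeating Children group into a separate table (hypothetical helper)."""
    # Employee table without the multi-valued column.
    employees = [(emp_id, last_name) for emp_id, last_name, _ in rows]
    # One row per (employee, child), so every cell holds a single value.
    children = [
        (emp_id, child.strip())
        for emp_id, _, child_list in rows
        for child in child_list.split(",")
    ]
    return employees, children


employees_1nf, employee_child = to_first_normal_form(employee)
```

Each row of `employee_child` now holds exactly one value per column, and the pair (Employee_ID, child name) can serve as its key.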

Second Normal Form (2NF)

1NF plus
Each column within the table must depend on the whole primary key

(Problem) Table: Course_Details

Prefix  Course  Credits  College
CIS     3320    3        Technology
CIS     4380    3        Technology
CHEM    3505    5        NSM
MIS     3320    3        Business
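
The slide does not say which columns form the key of Course_Details. The sketch below assumes the primary key is (Prefix, Course) and that College depends on Prefix alone, which would be the partial dependency that violates 2NF. Under that assumption the table splits in two; the function and variable names are hypothetical.

```python
# Rows of the (problem) Course_Details table: Prefix, Course, Credits, College.
course_details = [
    ("CIS", 3320, 3, "Technology"),
    ("CIS", 4380, 3, "Technology"),
    ("CHEM", 3505, 5, "NSM"),
    ("MIS", 3320, 3, "Business"),
]


def to_second_normal_form(rows):
    """Move College, assumed to depend only on Prefix, into its own table."""
    # Credits stays with the whole (assumed) key: Prefix + Course.
    courses = [(prefix, course, credits) for prefix, course, credits, _ in rows]
    colleges = {}
    for prefix, _, _, college in rows:
        # Verify the assumed dependency Prefix -> College while building the table.
        if colleges.setdefault(prefix, college) != college:
            raise ValueError(f"prefix {prefix} maps to more than one college")
    return courses, colleges


courses, prefix_college = to_second_normal_form(course_details)
```

After the split, the college for a prefix is stored once instead of once per course.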

Third Normal Form (3NF)

2NF plus
No column is dependent on any other column within the table that is not defined as a key.
No data is derived from other data within the table.

(Problem) Table: Course_Section

Section  Instructor  Office   Phone Number
12345    100-4       M 355    5-7011
12467    101-6       B 424    5-6322
15083    100-4       M 355    5-7011
16078    210-8       B 434    5-3321
56701    101-6       B 424    5-6322
12554    100-12      M 201 A  5-7337
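
In the Course_Section table above, Office and Phone Number appear to depend on Instructor rather than on the key, Section. A minimal sketch of the 3NF decomposition follows, assuming that reading of the columns (for example, that "M 355" is an office and "5-7011" a phone number). The function and variable names are hypothetical.

```python
# Rows of the (problem) Course_Section table: Section, Instructor, Office, Phone.
course_section = [
    (12345, "100-4", "M 355", "5-7011"),
    (12467, "101-6", "B 424", "5-6322"),
    (15083, "100-4", "M 355", "5-7011"),
    (16078, "210-8", "B 434", "5-3321"),
    (56701, "101-6", "B 424", "5-6322"),
    (12554, "100-12", "M 201 A", "5-7337"),
]


def to_third_normal_form(rows):
    """Separate instructor contact details from the section table."""
    # Section -> Instructor stays in the section table.
    section_instructor = [(section, instructor) for section, instructor, _, _ in rows]
    # Instructor -> (Office, Phone) moves to its own table, stored once per instructor.
    instructors = {}
    for _, instructor, office, phone in rows:
        if instructors.setdefault(instructor, (office, phone)) != (office, phone):
            raise ValueError(f"conflicting contact details for {instructor}")
    return section_instructor, instructors


section_instructor, instructors = to_third_normal_form(course_section)
```

An office or phone change now touches a single row in `instructors` instead of every section that instructor teaches.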

Relational Model

Entities are realized as two-dimensional tables where the columns are the attributes of the entity and the rows are data instances of (examples of) the entity.

A relationship between entities is realized by mapping the primary key attribute set of one table to one or more columns of the related entity’s table.

Consider Transaction Orientation vs…

Table: Course_Section

Section  Instructor
12451    100-4
19372    101-7
10029    100-12
12452    100-4

Table: Grade_Record

Student  Section  Grade
1093456  12451    B
1184567  12452    B
2341100  10029    C
1972344  10029    D

Subject Orientation

Table: Grade_by_Instructor

Student  Grade  Section  Instructor
1093456  B      12451    100-4
1184567  B      12452    100-4
2341100  C      10029    100-12
1972344  D      10029    100-12
1093456  C      10029    100-12
1184567  C      10029    100-12
2341100  A      12451    100-4
1972344  B      12452    100-4
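
A subject-oriented table such as Grade_by_Instructor can be derived by joining the transaction-oriented tables. The sketch below uses Python's built-in `sqlite3` module and loads only the four Grade_Record rows shown on the transaction-orientation slide, so it reproduces four of the eight rows above. The lowercase table and column names are illustrative choices.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE course_section (
        section    INTEGER PRIMARY KEY,
        instructor TEXT NOT NULL
    );
    CREATE TABLE grade_record (
        student INTEGER NOT NULL,
        section INTEGER NOT NULL REFERENCES course_section(section),
        grade   TEXT NOT NULL,
        PRIMARY KEY (student, section)
    );
""")
conn.executemany(
    "INSERT INTO course_section VALUES (?, ?)",
    [(12451, "100-4"), (19372, "101-7"), (10029, "100-12"), (12452, "100-4")],
)
conn.executemany(
    "INSERT INTO grade_record VALUES (?, ?, ?)",
    [(1093456, 12451, "B"), (1184567, 12452, "B"),
     (2341100, 10029, "C"), (1972344, 10029, "D")],
)

# Denormalize: attach the instructor to every grade so that questions about
# grades per instructor need no join at query time.
grade_by_instructor = conn.execute("""
    SELECT g.student, g.grade, g.section, c.instructor
    FROM grade_record AS g
    JOIN course_section AS c ON c.section = g.section
    ORDER BY g.student
""").fetchall()
```

The join is paid once, when the subject-oriented table is built, rather than on every analytical query.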

Data Warehouse Design

“A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management’s decision making process.”*

*Inmon, W. H. (1996). Building the Data Warehouse. New York: John Wiley and Sons, Inc.

OLTP vs. Data Warehouse

Data Warehouse                                    OLTP
Subject-oriented                                  Process-oriented (or transaction-oriented)
Denormalized, integrated                          Normalized, separated
Stores data to be reported on, analyzed, tested   Stores data to be processed, collected, managed
Data is historical, no longer used in operations  Data is necessary for day-to-day operations of the business
Data is static                                    Data will be updated
Granularity is a design issue                     Granularity to the most detailed level

Sources of DW Data

External data
  Data not specific to the organization
  Economic indicators, weather

Operational data
  From the OLTP system

Independent data mart
  Like a data warehouse, only focuses on one subject
  Belongs to the organization – but maybe to a different department

ETL

Extract – Transform – Load
A routine whereby data is brought into the data warehouse from other sources

Transform
  Data cleaning
  Resolve granularity issues
  Correct data inconsistencies
  Time-stamp data records
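
The four Transform tasks above can be sketched in a few lines of Python. Everything in this example is hypothetical: the record fields, the `SEX_CODES` mapping, the sample rows, and the choice of month as the target granularity are assumptions made for illustration, not details from the slides.

```python
from datetime import datetime, timezone

# Correct data inconsistencies: map the codes used by different sources to one value.
SEX_CODES = {"m": "Male", "male": "Male", "f": "Female", "female": "Female"}


def transform(records, loaded_at):
    """Clean, reconcile, aggregate to month level, and time-stamp source records."""
    totals = {}
    for rec in records:
        sex = SEX_CODES.get(rec["sex"].strip().lower())
        if sex is None:
            continue  # data cleaning: skip rows whose code cannot be resolved
        # Resolve granularity: daily transactions roll up to (customer, sex, month).
        key = (rec["customer"].strip(), sex, rec["date"][:7])
        totals[key] = totals.get(key, 0) + rec["amount"]
    # Time-stamp every record that will be loaded into the warehouse.
    return [
        {"customer": c, "sex": s, "month": m, "amount": a, "loaded_at": loaded_at}
        for (c, s, m), a in sorted(totals.items())
    ]


source = [
    {"customer": " C1 ", "sex": "M", "date": "2018-04-03", "amount": 20},
    {"customer": "C1", "sex": "male", "date": "2018-04-17", "amount": 5},
    {"customer": "C2", "sex": "F", "date": "2018-04-09", "amount": 12},
    {"customer": "C3", "sex": "?", "date": "2018-04-11", "amount": 7},
]
loaded_at = datetime(2018, 5, 1, tzinfo=timezone.utc).isoformat()
warehouse_rows = transform(source, loaded_at)
```

A real routine would more likely log or quarantine unresolvable rows than silently skip them; dropping them here just keeps the sketch short.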

Data in a DW is Static

Once data is in the data warehouse it is read-only
  Not always true

Data Warehouse Format

Multidimensional array of data – not based on relational model
Star schema – based on relational model

Star Schema

http://www.executionmih.com/data-warehouse/star-snowflake-schema.php

Fact Table

Defines the dimensions of the multi-dimensional space being created
Each record in a fact table contains two types of data
  Facts
  Dimension keys
Fact table key is a composite key made of keys for each dimension table

Dimension Table

Data specific to a dimension
One-to-many relation from dimension table to fact table
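
The fact and dimension tables described above can be sketched as a small star schema using Python's built-in `sqlite3` module. The table names (`dim_time`, `dim_customer`, `fact_purchase`), their columns, and all rows are hypothetical and chosen only to show the shape: one fact table whose composite key is built from the dimension keys.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Dimension tables: data specific to one dimension each.
    CREATE TABLE dim_time (
        time_key INTEGER PRIMARY KEY,
        month    TEXT NOT NULL,
        year     INTEGER NOT NULL
    );
    CREATE TABLE dim_customer (
        customer_key INTEGER PRIMARY KEY,
        income_range TEXT NOT NULL,
        sex          TEXT NOT NULL
    );
    -- Fact table: dimension keys plus the fact (amount). Its key is the
    -- composite of the dimension keys; each dimension row relates to many facts.
    CREATE TABLE fact_purchase (
        time_key     INTEGER NOT NULL REFERENCES dim_time(time_key),
        customer_key INTEGER NOT NULL REFERENCES dim_customer(customer_key),
        amount       INTEGER NOT NULL,
        PRIMARY KEY (time_key, customer_key)
    );
""")
conn.executemany("INSERT INTO dim_time VALUES (?, ?, ?)",
                 [(1, "Apr", 2018), (2, "May", 2018)])
conn.executemany("INSERT INTO dim_customer VALUES (?, ?, ?)",
                 [(1, "40-50K", "Male"), (2, "30-40K", "Female")])
conn.executemany("INSERT INTO fact_purchase VALUES (?, ?, ?)",
                 [(1, 1, 30), (1, 2, 45), (2, 1, 10)])

# A typical analytical query: join the fact table to one dimension and aggregate.
totals_by_income = conn.execute("""
    SELECT d.income_range, SUM(f.amount)
    FROM fact_purchase AS f
    JOIN dim_customer AS d ON d.customer_key = f.customer_key
    GROUP BY d.income_range
    ORDER BY d.income_range
""").fetchall()
```

Each query joins the fact table to only the dimensions it needs, which is the main appeal of the star layout.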

Multidimensional Database

Definition: “A multidimensional database is structured around measures, dimensions, hierarchies, and cubes rather than tables, rows, columns, and relations.”
Larson, B. (2008). Delivering Business Intelligence with Microsoft SQL Server 2008. New York: McGraw-Hill Osborne.

http://gerardnico.com/wiki/database/database_multidimensional

[Data] Cube

Definition: “A cube is a structure that contains a value for one or more measures for each unique combination of the members of all its dimensions. These are detail, or leaf-level values. The cube also contains aggregated values formed by the dimension hierarchies or when one or more of the dimensions is left out of the hierarchy.”
Larson, B. (2008). Delivering Business Intelligence with Microsoft SQL Server 2008. New York: McGraw-Hill Osborne.

A Point Within a Cube is a Value of the Measure (the Fact)

The intersection of all dimensions is a point
That point represents a value of the measure for the particular unique combination of dimension values
The point is called a detail, or leaf-level, value
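
A tiny cube can be modeled as a mapping from one member per dimension to a measure value, with aggregates computed by leaving dimensions out. This is a minimal sketch: the dimension names, members, and counts are made up, and `roll_up` is a hypothetical helper, not part of any OLAP product.

```python
from collections import defaultdict

# Order of the dimensions in each cell key (hypothetical dimensions).
DIMENSIONS = ("month", "income_range", "promotion")

# Leaf-level values: one measure value per unique combination of members.
leaf_values = {
    ("Apr", "30-40K", "Watch"): 3,
    ("Apr", "40-50K", "Watch"): 1,
    ("May", "30-40K", "Watch"): 2,
    ("May", "30-40K", "Magazine"): 4,
}


def roll_up(cells, keep):
    """Aggregate the measure over every dimension not listed in `keep`."""
    positions = [DIMENSIONS.index(name) for name in keep]
    totals = defaultdict(int)
    for members, value in cells.items():
        totals[tuple(members[i] for i in positions)] += value
    return dict(totals)


by_month = roll_up(leaf_values, keep=("month",))  # income_range and promotion left out
grand_total = roll_up(leaf_values, keep=())       # every dimension left out
```

Looking up `leaf_values[("Apr", "30-40K", "Watch")]` returns a single point of the cube, while `by_month` holds aggregated values of the kind the definition describes.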

Snowflake Schema

Dimension tables are normalized – hence they are broken down into two or more tables

Constellation Schema

More than one fact table
