Data Warehousing

Course Outline for Intro to Business Intelligence Unit 1 by Don Krapohl

• Overview
  – Basic Definitions
  – Normalization
    • Entity Relationship Diagrams (ERDs)
    • Normal Forms
    • Many to Many relationships
  – Warehouse Considerations
    • Dimension Tables
    • Fact Tables
    • Star Schema
    • Snowflake Schema
    • Further Warehouse Design Considerations
      – Changing Dimensions
      – Conformed Dimensions


• Data warehouse

– A data warehouse is a copy of transaction data specifically structured for querying and reporting.

– “a collection of computerized data that is organized to most optimally support reporting and analysis activity”

• OLTP - On-Line Transaction Processing

– OLTP describes a type of processing that databases are designed to support.

– OLTP applications need to support a high number of transactions per unit of time.

– A transaction is a set of Insert, Update, and sometimes Delete statements that must succeed or fail as a unit. Transactions typically perform such functions as recording orders, depleting inventory, etc.

– Electronic banking and order processing are common OLTP applications.

• OLAP - On-Line Analytical Processing

– In its broadest usage, the term "OLAP" is used as a synonym for "data warehousing".

– The term "On-Line Analytical Processing" was developed to distinguish data warehousing activities from On-Line Transaction Processing.

– In a narrower usage, the term OLAP is used to refer to the tools used for Multidimensional Analysis…


• Sample Star Schema:

… but when people speak of OLAP they may properly be referring to a schema like this one in a relational database.


• Database Normalization

– Normalization reduces redundant data storage by organizing data efficiently.

– There are many ways to normalize a database consistently within a set of business requirements.

– Normalization reduces the potential for anomalies during data manipulation operations.

– Non-normalized databases are vulnerable to data anomalies when they store data redundantly.

• If data is stored in two locations, but is later updated in only one location, then the data becomes inconsistent; this is referred to as an update anomaly.

• To avoid data anomalies, non-primary key data in a normalized database are stored in only one location.

– If you need a Department’s physical location, you should need to look only in the Department table.
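The update anomaly described above can be sketched in a few lines. This is a minimal illustration, not part of the original course material: the EmployeeFlat table, its columns, and the sample rows are hypothetical, and Python's sqlite3 module is used only so that the SQL can be run as written.

```python
import sqlite3

def demo_update_anomaly():
    """Return the distinct locations recorded for one department after a partial update."""
    conn = sqlite3.connect(":memory:")
    # Hypothetical unnormalized table: the department location is repeated on every employee row.
    conn.execute(
        "CREATE TABLE EmployeeFlat ("
        " EmployeeID INTEGER PRIMARY KEY,"
        " LastName TEXT, DeptName TEXT, DeptLocation TEXT)"
    )
    conn.executemany(
        "INSERT INTO EmployeeFlat VALUES (?, ?, ?, ?)",
        [(1, "Adams", "Sales", "Boston"), (2, "Baker", "Sales", "Boston")],
    )
    # Only one of the two redundant copies is updated.
    conn.execute("UPDATE EmployeeFlat SET DeptLocation = 'Chicago' WHERE EmployeeID = 1")
    rows = conn.execute(
        "SELECT DISTINCT DeptLocation FROM EmployeeFlat"
        " WHERE DeptName = 'Sales' ORDER BY DeptLocation"
    ).fetchall()
    conn.close()
    return [r[0] for r in rows]

locations = demo_update_anomaly()
```

After the partial update the Sales department has two different locations on record, which is the inconsistency that storing the location once, in a Department table, would prevent.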


• Unnormalized Table

• We could design a database so that each record about a given type of business object holds all the information we would typically need about that object.

But


• We could generate the information on the previous page with this query:

Select e.EmployeeID, e.LastName, e.FirstName, d.DeptID, d.Name, d.Location
From Department d
Inner Join Employee e
on d.DeptID = e.DeptID

This schema is more typical of a normalized database.
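As a rough sketch of the two tables that the query above reads from: the table and column names come from the query itself, while the column types, constraints, and sample rows are assumptions made for illustration. Python's sqlite3 module is used only to make the SQL runnable.

```python
import sqlite3

def employees_with_departments():
    """Build a small normalized Employee/Department schema and run the join from the slide."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        -- Each department's name and location are stored exactly once.
        CREATE TABLE Department (
            DeptID   INTEGER PRIMARY KEY,
            Name     TEXT NOT NULL,
            Location TEXT NOT NULL
        );
        -- Employees refer to their department by key only.
        CREATE TABLE Employee (
            EmployeeID INTEGER PRIMARY KEY,
            LastName   TEXT NOT NULL,
            FirstName  TEXT NOT NULL,
            DeptID     INTEGER NOT NULL REFERENCES Department (DeptID)
        );
        INSERT INTO Department VALUES (10, 'Sales', 'Boston');
        INSERT INTO Employee VALUES (1, 'Adams', 'Ann', 10), (2, 'Baker', 'Bob', 10);
    """)
    rows = conn.execute("""
        Select e.EmployeeID, e.LastName, e.FirstName, d.DeptID, d.Name, d.Location
        From Department d
        Inner Join Employee e
        on d.DeptID = e.DeptID
        Order By e.EmployeeID
    """).fetchall()
    conn.close()
    return rows

joined_rows = employees_with_departments()
```

The join rebuilds the wide, unnormalized view on demand, while the location itself lives in a single Department row.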


When we normalize, we’re building a logical hierarchy.


• Entities – classes of objects that are of interest from a business standpoint, about which information needs to be maintained

– In the process of modeling, they evolve into database tables.

• Entities are always nouns in business narratives (but not all nouns in business narratives are entities).

– Examples: Employee, Department, Project

• Entities must have attributes, or properties, that need to be known, which become columns.

– Employee:
• Name, Birth Date, Salary

– Department:
• Name, Number, Location

• Each entity is representative of a class of objects, and each instance of a well-formed entity will map to a row in a table.

• Each instance of an entity must be uniquely distinguishable from other instances of the same entity.

– An attribute or set of attributes that uniquely identifies an instance of an entity is called a Unique Identifier (UID).


• Relationship

– A bi-directional, significant association between two entities, or between an entity and itself

– Each (direction of a) relationship has:

• Name
• Optionality
– Either Must Be or May Be
• Degree/Cardinality/Ordinality
– 1:1 or 1:M (or M:M)
– Degree = 0 is expressed as “may be.”

• Each employee must be assigned to one and only one department.

• Each department may be responsible for one or more employees.

• Our definitions for entities, attributes and relationships must be equally valid for every instance, not only for the normal case.

• This point is critically important.


• First Normal Form

– 1NF requires that each attribute store only one value.
– There can be no repeating groups (= no “multivalued attributes”).
– Each attribute of the table is said to be “atomic”.

• For example, each record in the Home table below should have only one owner.

• Unnormalized Entity

What if some homes have more than three owners? How would we write stored procs to read from this table?

Each cell, which is the intersection of a row and a column, can contain only one value.

Mention the PK convention.


• To support multiple owners we need another entity:

– This will always be the case when an entity has a 1:M relationship with one of its attributes.

Both entities are now in 1NF.
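One way to picture the split into two entities, as a sketch only: the Home and Owner column names, the addresses, and the owner names below are hypothetical, and Python's sqlite3 module is used just to run the SQL. Because owners live in their own table, a home can have any number of them without adding columns.

```python
import sqlite3

def owners_per_home():
    """Count owners per home once owners are moved out of repeating columns."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE Home (
            HomeID  INTEGER PRIMARY KEY,
            Address TEXT NOT NULL
        );
        -- One row per owner replaces repeating columns such as Owner1, Owner2, Owner3.
        CREATE TABLE Owner (
            OwnerID INTEGER PRIMARY KEY,
            HomeID  INTEGER NOT NULL REFERENCES Home (HomeID),
            Name    TEXT NOT NULL
        );
        INSERT INTO Home VALUES (1, '12 Elm St'), (2, '34 Oak Ave');
        INSERT INTO Owner VALUES
            (1, 1, 'Ann'), (2, 1, 'Bob'), (3, 1, 'Cy'), (4, 1, 'Dee'),
            (5, 2, 'Eve');
    """)
    rows = conn.execute("""
        SELECT h.Address, COUNT(o.OwnerID) AS OwnerCount
        FROM Home h
        INNER JOIN Owner o ON o.HomeID = h.HomeID
        GROUP BY h.HomeID, h.Address
        ORDER BY h.HomeID
    """).fetchall()
    conn.close()
    return rows

owner_counts = owners_per_home()
```

The first home has four owners, a case the three-column layout could not hold, and the query that reads it needs no knowledge of how many owners exist.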


Second Normal Form

– To be in 2NF, a table must be in 1NF.
– In addition, each non-key attribute must be dependent on all parts of its primary key.
• There must be no “partial key dependencies”.

– In the previous example:
– The Home entity is not in 2NF.
» The Mayor attribute doesn’t depend on the entire primary key.
» We need a new entity.
– The Owner entity is not in 2NF.
» The Price of Tea does not depend on the Owner.
» “We decide not to track this attribute.”

• In normalizing to 2NF, we attempt to reduce the amount of redundant data in a table by extracting it, placing it in new tables, and creating relationships between tables.


Tables are now in 2NF.


• Third Normal Form

– To be in 3NF, a table must be in 2NF.

– Additionally, all attributes that are not wholly dependent upon the primary key must be remodeled.

• Each table attribute can depend on nothing other than its primary key.

• 3NF = “Every non-key attribute must depend on the key, the whole key, and nothing but the key.”

– In the previous example:
• Sun sign depends on birth date, so it should be stored in a different table.
• A general modeling principle we see here is that when an attribute depends on another attribute, a new table will be necessary to model the relationship.


Entities are now in 3NF.


• Modeling the M:M relationship

– How do we record the owners of individual homes?


• We need an intermediate table that has a M:1 relationship with each of its parent tables.


• The query below shows the name of each home’s owner(s).
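A query of this kind might look like the following sketch. The intermediate table name HomeOwner, all column names, and the sample rows are assumptions, not taken from the original slide, and Python's sqlite3 module is used only to run the SQL.

```python
import sqlite3

def home_owner_names():
    """List (address, owner name) pairs through a hypothetical HomeOwner intermediate table."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE Home  (HomeID  INTEGER PRIMARY KEY, Address TEXT NOT NULL);
        CREATE TABLE Owner (OwnerID INTEGER PRIMARY KEY, Name    TEXT NOT NULL);
        -- The intermediate table has a M:1 relationship with each parent table.
        CREATE TABLE HomeOwner (
            HomeID  INTEGER NOT NULL REFERENCES Home (HomeID),
            OwnerID INTEGER NOT NULL REFERENCES Owner (OwnerID),
            PRIMARY KEY (HomeID, OwnerID)
        );
        INSERT INTO Home  VALUES (1, '12 Elm St'), (2, '34 Oak Ave');
        INSERT INTO Owner VALUES (1, 'Ann'), (2, 'Bob');
        -- Ann co-owns the first home with Bob and owns the second home alone.
        INSERT INTO HomeOwner VALUES (1, 1), (1, 2), (2, 1);
    """)
    rows = conn.execute("""
        Select h.Address, o.Name
        From Home h
        Inner Join HomeOwner ho On ho.HomeID = h.HomeID
        Inner Join Owner o On o.OwnerID = ho.OwnerID
        Order By h.HomeID, o.Name
    """).fetchall()
    conn.close()
    return rows

pairs = home_owner_names()
```

Each owner and each home is stored once, and every ownership is one row in the intermediate table, so a home can have many owners and an owner can have many homes.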


• General Remarks:

• The definitions of normal forms provide guidelines for relational database design. Occasionally, it is necessary to stray from them to meet practical business requirements in an OLTP environment.

• There is not a single best way to normalize a database to conform with a specific set of business requirements.

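• Insert, Update, and Delete operations generally run more quickly in a normalized database, because each value is stored in only one place.

• Complex Select statements generally run more slowly, because they need more join operations.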


• Reasons to denormalize

• The fundamental reason to denormalize is to improve query performance.

• Consider the case of City, State, and CityStateZip tables.
– These tables can be designed to conform to the third normal form.
– But each time you need to write a query to extract Customer data, you will need to join data from four tables.

• If no valid business reason exists to divide city, state, and ZIP Code information into separate tables, then it may make sense to denormalize.

• Dimension tables in a star schema are intentionally denormalized.


• Normalized database:
– Many “narrow” tables (i.e. fewer columns)
– Optimized for Insert, Update, and Delete operations
– Slower Select statements because of the need for frequent join operations
– Few indexes
– Necessary for large OLTP applications

• Non-normalized database:
– Fewer (but “wider”) tables
– Faster Select statements because we don’t need to join as often
– Transactions are more problematic because of the need to maintain redundant instances of data during Insert, Update, and Delete operations
– Many indexes because data is relatively static
– Necessary for large relational OLAP applications


• Data Warehouses

• Data warehouses and data marts are storage mechanisms for read-only, historical, aggregated data.

• Consider this example: we sell 2 products, dog food and cat food. Each day, we record the sales of each product. Here is some sample OLTP data for a couple of days:


• Our data warehouse would usually not record this level of detail.

– Instead, in a warehouse we would summarize, or aggregate, the data to daily totals. Our records in the data warehouse might look something like this:

– Here we have reduced the number of records by aggregating the individual transaction records into daily records that show the number of each product purchased each day.

– We can certainly generate this data set from the OLTP system by running a query…


• … but if we want to view our data as aggregated numbers broken down along a series of criteria (i.e. so-called “by conditions”), then query performance will improve if we store data in a denormalized format.

– That’s exactly what we do when implementing a star schema.

• It’s important to realize that OLTP is not meant to be the basis of a decision support system. OLTP applications are optimized for activities such as recording (high numbers of) orders, etc.

• A system optimized for processing transactions is not optimized to perform complex analyses designed to uncover hidden trends.

• Therefore, rather than tie up our OLTP system by performing expensive queries, we should build a less normalized structure that conforms better to our query needs.
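The daily aggregation of the dog food and cat food sales described above can be sketched as follows. The Sale table, its columns, the dates, and the quantities are made up for illustration, and Python's sqlite3 module is used only to run the SQL.

```python
import sqlite3

def daily_totals():
    """Aggregate individual OLTP sale records into one row per day and product."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        -- Hypothetical OLTP table: one row per individual sale.
        CREATE TABLE Sale (
            SaleID   INTEGER PRIMARY KEY,
            SaleDate TEXT NOT NULL,
            Product  TEXT NOT NULL,
            Quantity INTEGER NOT NULL
        );
        INSERT INTO Sale VALUES
            (1, '2015-01-01', 'Dog Food', 2),
            (2, '2015-01-01', 'Dog Food', 1),
            (3, '2015-01-01', 'Cat Food', 4),
            (4, '2015-01-02', 'Dog Food', 3),
            (5, '2015-01-02', 'Cat Food', 1),
            (6, '2015-01-02', 'Cat Food', 2);
    """)
    rows = conn.execute("""
        SELECT SaleDate, Product, SUM(Quantity) AS QuantitySold
        FROM Sale
        GROUP BY SaleDate, Product
        ORDER BY SaleDate, Product
    """).fetchall()
    conn.close()
    return rows

totals = daily_totals()
```

Six transaction rows collapse into four daily rows here. A warehouse load would typically store the aggregated rows, so that reports read them directly instead of repeating the GROUP BY against the OLTP system.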


• The Warehouse

– Typical business questions that drive warehouse design:

• How many units did we sell last week?
• Are overall sales of individual products or product categories higher or lower this year than in previous years?
• On a quarterly or monthly basis, are sales for some products/categories cyclical?
• In what regions are sales down this year?
– What products/categories in those regions account for the greatest percentage of the decrease?

– Some characteristics of warehouse business questions:

• Many concern the element of time.
• Many questions require the aggregation of data; sums and counts are important in an OLAP environment, whereas individual transactions are important in an OLTP environment.
• Each question looks at data in terms of “by” conditions.
– “On a quarterly and then monthly basis, are Dairy Product sales cyclical?” = “We need to see total sales of Dairy Products by quarter and by month.”


These “by” conditions drive the design of our star schema.

Each “by” condition is represented by a Dimension table.


• Dimension Tables – General Remarks

• Product and Geography are common dimensions.

• Date/Time information is almost always stored in a Dimension table.

• If our data happen to start on a particular date, do we care what sales have been since that date, or do we care more about how one year’s sales compare to other years’?

– Comparing one year to another is a common form of trend analysis accomplished through the use of a star schema.


Dimension Table Structure

– Dimension tables should have a single-field primary key. This key is often an identity column.

• The value of the primary key is irrelevant; our information is stored in the other fields in the table.

• Because the fields hold full descriptions rather than codes, the dimension tables are often “wide”, i.e. they contain many large fields.

For example, if we have a Product dimension, then we’ll have fields in it that contain the description, the category name, the sub-category name, etc. These fields do not contain codes that link us to other tables.

Dimension tables are often small in terms of row count relative to Fact tables.


• Dimensional Hierarchies (Denormalization):

• In a star schema, the entire hierarchy for a dimension is stored in its corresponding Dimension table in the data warehouse.

• The product dimension, for example, contains individual products.
– Products are normally grouped into categories, and these categories may contain sub-categories.
• For example, a product with a product number of M1652 may be a refrigerator. Thus it belongs in the major appliance category, and in the refrigerator sub-category.
– We may have more levels of sub-categories to further classify each product.
– In an OLAP environment, it is preferable to maintain the product hierarchy in a single table, although this hierarchy would certainly be distributed among Product, Category, and SubCategory tables in an OLTP environment.

• This hierarchy allows us to perform “drill-down” functions on the data. We can run a query that calculates sums by category. We can then drill down into a category by calculating sums for its subcategories. We can then calculate the sums for the individual products in a particular subcategory.

– The actual sums we are calculating are based on numbers stored in the fact table.
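The drill-down sequence described above can be sketched with three queries against one denormalized dimension table. Product number M1652 comes from the example above; every other product number, the column names, and the sales figures are assumptions. Python's sqlite3 module is used only to run the SQL.

```python
import sqlite3

def drill_down():
    """Sum sales by category, then by subcategory, then by product."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        -- The whole product hierarchy sits in one dimension table.
        CREATE TABLE ProductDimension (
            ProductID     INTEGER PRIMARY KEY,
            ProductNumber TEXT NOT NULL,
            Category      TEXT NOT NULL,
            SubCategory   TEXT NOT NULL
        );
        CREATE TABLE SalesFact (
            ProductID    INTEGER NOT NULL REFERENCES ProductDimension (ProductID),
            SalesDollars INTEGER NOT NULL
        );
        INSERT INTO ProductDimension VALUES
            (1, 'M1652', 'Major Appliance', 'Refrigerator'),
            (2, 'M2001', 'Major Appliance', 'Refrigerator'),
            (3, 'W300',  'Major Appliance', 'Washer'),
            (4, 'T10',   'Small Appliance', 'Toaster');
        INSERT INTO SalesFact VALUES (1, 500), (1, 700), (2, 300), (3, 400), (4, 50);
    """)
    join = ("FROM SalesFact sf INNER JOIN ProductDimension pd"
            " ON pd.ProductID = sf.ProductID ")
    # Level 1: totals by category.
    by_category = conn.execute(
        "SELECT pd.Category, SUM(sf.SalesDollars) " + join +
        "GROUP BY pd.Category ORDER BY pd.Category"
    ).fetchall()
    # Level 2: drill into one category, totals by subcategory.
    by_subcategory = conn.execute(
        "SELECT pd.SubCategory, SUM(sf.SalesDollars) " + join +
        "WHERE pd.Category = ? GROUP BY pd.SubCategory ORDER BY pd.SubCategory",
        ("Major Appliance",),
    ).fetchall()
    # Level 3: drill into one subcategory, totals by product.
    by_product = conn.execute(
        "SELECT pd.ProductNumber, SUM(sf.SalesDollars) " + join +
        "WHERE pd.SubCategory = ? GROUP BY pd.ProductNumber ORDER BY pd.ProductNumber",
        ("Refrigerator",),
    ).fetchall()
    conn.close()
    return by_category, by_subcategory, by_product

by_category, by_subcategory, by_product = drill_down()
```

All three levels use the same single join between the fact table and the dimension table; only the grouping column and the filter change.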


– Fact tables

• When we talk about the way we want to look at data, we usually want to see some sort of aggregated data. These data are called measures.

• Measures are numeric values that are measurable and additive.

– Sales dollars are a very common measure.
– The Number of Customers we have is also a typical measure.
– We’d probably track both of these by day.

• Fact tables are used to store measures, or facts, which are numeric and additive across some or all dimensions.

• In the following star schema, sales dollars are numeric, and we can examine total sales in terms of product, category, and time period.

• Fact tables are “narrow” in the sense that they contain few (and numeric) columns, but they do contain large numbers of rows.

– Fact tables are responsible for most of the disk space used in a warehouse.


• Fact Table Granularity

– Granularity refers to the level of detail in a fact table and is one of the most important design decisions in data warehouse planning.

• Granularity is often determined by the time dimension.
– For example, you may elect to store only weekly or monthly totals for sales dollars.

• Granularity determines how far we can drill down without recourse to the source OLTP data.

– Many if not most OLAP systems have daily grain in the Time dimension.

• Selecting a finer grain results in more records in the fact table.

• Choose data types for fact table columns that keep the table as small as possible.


• Aggregations

– Fact table data consists of aggregations that are based on the fact table’s granularity.

– Frequently we’ll want to aggregate to a higher level.

• We may choose to keep total sales dollars at a quarterly or monthly level.

• We may be interested in only a particular product or category in this case.

• A better alternative is to build a cube structure…


Simple Star Schema:

• To obtain total sales for all major appliances during March of 1999:

Select Sum(sf.SalesDollars) as TotalSales
From SalesFact sf
Inner Join TimeDimension td
On td.TimeID = sf.TimeID
Inner Join ProductDimension pd
On pd.ProductID = sf.ProductID
Where pd.Category = 'Major Appliance'
And td.Month = 3
And td.Year = 1999


• Snowflake Schemas

• Sometimes dimension tables have hierarchies broken out into separate tables. This will result in a different schema type known as a snowflake.
– This is a more normalized structure, but leads to more difficult queries and slower response times.
– It does conserve more disk space than a star schema that contains the same data.
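To make the trade-off concrete, here is a sketch that answers the same question against a star-style and a snowflake-style product dimension. The names ProductStar, ProductSnow, and CategoryDim, along with all columns and rows, are hypothetical, and Python's sqlite3 module is used only to run the SQL.

```python
import sqlite3

def star_vs_snowflake_totals():
    """Total 'Major Appliance' sales via a star-style and a snowflake-style dimension."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE SalesFact (ProductID INTEGER NOT NULL, SalesDollars INTEGER NOT NULL);
        -- Star: the category name is repeated on every product row.
        CREATE TABLE ProductStar (
            ProductID INTEGER PRIMARY KEY, ProductName TEXT, Category TEXT
        );
        -- Snowflake: the category is broken out into its own table.
        CREATE TABLE CategoryDim (CategoryID INTEGER PRIMARY KEY, CategoryName TEXT);
        CREATE TABLE ProductSnow (
            ProductID INTEGER PRIMARY KEY, ProductName TEXT,
            CategoryID INTEGER REFERENCES CategoryDim (CategoryID)
        );
        INSERT INTO SalesFact VALUES (1, 500), (1, 700), (2, 50);
        INSERT INTO ProductStar VALUES
            (1, 'Refrigerator', 'Major Appliance'), (2, 'Toaster', 'Small Appliance');
        INSERT INTO CategoryDim VALUES (1, 'Major Appliance'), (2, 'Small Appliance');
        INSERT INTO ProductSnow VALUES (1, 'Refrigerator', 1), (2, 'Toaster', 2);
    """)
    # Star schema: one join from the fact table to the dimension.
    star_total = conn.execute("""
        SELECT SUM(sf.SalesDollars)
        FROM SalesFact sf
        INNER JOIN ProductStar pd ON pd.ProductID = sf.ProductID
        WHERE pd.Category = 'Major Appliance'
    """).fetchone()[0]
    # Snowflake schema: an extra join is needed to reach the category name.
    snowflake_total = conn.execute("""
        SELECT SUM(sf.SalesDollars)
        FROM SalesFact sf
        INNER JOIN ProductSnow pd ON pd.ProductID = sf.ProductID
        INNER JOIN CategoryDim cd ON cd.CategoryID = pd.CategoryID
        WHERE cd.CategoryName = 'Major Appliance'
    """).fetchone()[0]
    conn.close()
    return star_total, snowflake_total

star_total, snowflake_total = star_vs_snowflake_totals()
```

Both queries return the same total. The snowflake version stores each category name once but pays for it with one more join per hierarchy level.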


Graphical comparison of Star and Snowflake schemas

Star Schema

Snowflake Schema


• Further Warehouse Design Considerations

– Changing Dimensions
– In the schema below, consider a scenario in which we have realigned some of our stores, placing them in different territories and regions.


– In the StoreDimension table, we have each store in a particular region, territory, and zone.

– If we simply update the StoreDimension table with new territory/region information, and then examine historical sales for a region, the numbers will no longer be accurate.

– To address this issue, consider creating new records for affected stores.

• Each new record contains the store’s new region, while the old store records remain intact along with the old regional sales data.

• This approach, however, prevents us from comparing a store’s current sales to its historical sales unless we keep track of its previous StoreID. This may require an extra field called PreviousStoreID or something similar.

– There are no right and wrong answers. Each case may require a different solution.
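The new-record approach with a PreviousStoreID field can be sketched as below. StoreDimension and PreviousStoreID are named in the text above; the other columns, the sample store, and the use of a recursive query to follow the chain of earlier records are assumptions, offered as one possible solution among several. Python's sqlite3 module is used only to run the SQL.

```python
import sqlite3

def store_sales(current_store_id=2):
    """Return sales by region and the lifetime sales of one realigned store."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE StoreDimension (
            StoreID         INTEGER PRIMARY KEY,
            StoreName       TEXT NOT NULL,
            Region          TEXT NOT NULL,
            PreviousStoreID INTEGER REFERENCES StoreDimension (StoreID)
        );
        CREATE TABLE SalesFact (StoreID INTEGER NOT NULL, SalesDollars INTEGER NOT NULL);
        -- The store moved from East to West: a new record is added, the old one is kept.
        INSERT INTO StoreDimension VALUES
            (1, 'Store A', 'East', NULL),
            (2, 'Store A', 'West', 1);
        -- Sales before the realignment use StoreID 1, later sales use StoreID 2.
        INSERT INTO SalesFact VALUES (1, 100), (2, 150);
    """)
    # Historical regional totals stay accurate because old facts keep the old store record.
    by_region = conn.execute("""
        SELECT sd.Region, SUM(sf.SalesDollars)
        FROM SalesFact sf
        INNER JOIN StoreDimension sd ON sd.StoreID = sf.StoreID
        GROUP BY sd.Region
        ORDER BY sd.Region
    """).fetchall()
    # Follow PreviousStoreID back through earlier records to total one store's history.
    lifetime = conn.execute("""
        WITH RECURSIVE chain (StoreID) AS (
            SELECT ?
            UNION ALL
            SELECT s.PreviousStoreID
            FROM StoreDimension s
            INNER JOIN chain c ON s.StoreID = c.StoreID
            WHERE s.PreviousStoreID IS NOT NULL
        )
        SELECT SUM(f.SalesDollars)
        FROM SalesFact f
        WHERE f.StoreID IN (SELECT StoreID FROM chain)
    """, (current_store_id,)).fetchone()[0]
    conn.close()
    return by_region, lifetime

sales_by_region, lifetime_sales = store_sales()
```

Regional history is preserved (East keeps the sales made while the store belonged to it), and the PreviousStoreID link still lets us compare the store's current sales with its own past.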


• When building an enterprise warehouse from local data marts:

– It is necessary to produce a set of conformed dimensions.
• It will also be necessary to standardize the definitions of facts.

– A conformed dimension is a dimension that means the same thing with every possible fact table to which it can be joined.

– Generally, this means that a conformed dimension is identical in each data mart.

– The conformed Product dimension is the enterprise’s agreed-upon master list of products, including all product attributes and all product rollups such as category, subcategory, and department.

– The conformed Calendar dimension will almost always be a table of individual days, spanning a decade or more. Each day will have many useful attributes drawn from the legal calendars of the various states and countries the enterprise deals with, as well as special fiscal calendar periods and marketing seasons relevant only to internal managers.

– Most conformed dimensions will naturally be defined at the most granular level possible.

• The grain of the Customer dimension will be the individual customer.


Simplified Star Schema with Conformed Dimensions


• Permissible Variations of Conformed Dimensions

– It is possible to create a subset of a conformed dimension table for certain data marts if you know that the domain of the associated fact table only contains that subset.

– For example, the Product table for a specific data mart may be restricted so as to include only those products manufactured at that location, if the data mart in question pertains to that location only.
