data warehouse why data warehouse and olap? the intelligence stage of decision making needs correct...

60
Data Warehouse

Upload: philomena-dickerson

Post on 23-Dec-2015

218 views

Category:

Documents


0 download

TRANSCRIPT

Data Warehouse

Why Data Warehouse and OLAP?

The Intelligence Stage of Decision Making needs correct data

Data also should be clean, organized with fast accessOften the data should only be used for subsequent

stages and should not be changed (in other words decision makers are users of data, not generators of data)

The data for decision making comes from multiple sources OLTP databases, XML files, Flat files like CSV, PDF, etc.

They need to be combined for one version of the truthThese needs cannot be easily served by the traditional

data sources, hence the concept of data warehouse/data marts

MakingDecisions

Increasing potentialto support

business decisions End User

BusinessAnalyst

Data Analyst

DBAData Sources

Paper, Files, Information Providers, Database Systems, OLTP

Data ExplorationStatistical Analysis, Querying and Reporting

Data Warehouses / Data Marts OLAP, MDA

Data MiningInformation Discovery

Data PresentationVisualization Techniques

Business Intelligence Process

Data Warehouse: Definitions

Data warehouse There is no single definition, as it can encompass many

aspects of the data management. Some of the ones found in the literature are: A physical repository where relational data are specially

organized to provide enterprise-wide, cleansed data in a standardized format Defined in many different ways, but not rigorously.

A decision support database that is maintained separately from the organization’s operational database.

A consistent database source that bring together information from multiple sources for decision support queries

Support information processing by providing a solid platform of consolidated, historical data for analysis

Data warehousing A process by which data from multiple sources are

extracted, transformed and loaded into a data warehouse in a planned operation

Data Warehouse vs. Operational DBMS

• OLTP (on-line transaction processing) Major task of traditional relational DBMS Day-to-day operations: purchasing, inventory, banking,

manufacturing, payroll, registration, accounting, etc. Aims at reliable and efficient processing of a large number

of transaction and ensuring data consistency

• OLAP (on-line analytical processing) Major task of data warehouse system Data analysis and decision making Aims at efficient multidimensional processing of large data

volumes Fast, interactive answers to large aggregate queries

• Distinct features (OLTP vs. OLAP): User and system orientation: customer vs. market Data contents: current, detailed vs. historical, consolidated Database design: ER + application vs. star + subject View: current, local vs. evolutionary, integrated Access patterns: update vs. read-only but complex queries

OLTP vs. OLAP

User Clerk, IT Professional Knowledge worker

Function Day to day operations Decision support

DB Design Application-oriented Subject-oriented

Data Current, Isolated Historical, Consolidated

View Detailed, Flat relationalSummarized and

Multidimensional

Usage Structured, Repetitive Ad hoc

Unit of workShort, Simple

transaction Complex query

Access Read/write Read Mostly

Operations Index/hash on prim. Key Lots of Scans

# Rec. accessed Tens Millions

#Users Thousands Hundreds

Db size 100 MB-GB 100GB-TB

Metric Trans. throughput Query throughput, response

Data Warehousing Characteristics

Subject oriented Integrated Time variant (time series)Nonvolatile Web based Relational/multidimensional Client/server Real-time Include metadata

Data Warehouse Architecture

Data Warehouse Architecture Inmon (Bill Inmon) Model: EDW approach Kimball (Ralph Kimball) Model: Data mart

approach

Which model is best? There is no one-size-fits-all strategy to data

warehousing One alternative is the hosted warehouse

Alternative DW Architectures

Teradata’s DW Architecture

DW Architecture

Categories/Types of Data Warehouses

Data Mart A departmental data

warehouse that stores only relevant data

Dependent Data Mart A subset that is created

directly from a data warehouse

Independent Data Mart A small data warehouse

designed for a strategic business unit or a department

Operational Data Stores A type of database often

used as an interim area for a data warehouse, especially for customer information files

Operational Data Marts An operational data mart.

An operational data mart is a small-scale data mart typically used by a single department or functional area in an organization

Various OLAPs

ROLAP: Relational OLAP. Analysis cube supported by a Multi-Dimensional Data Warehouse (or Data Mart) based on Relational database.

MOLAP: Multidimensional OLAP. Analysis cube supported by a multi-dimensional data warehouse based on another multi-dimensional data storage.

HOLAP: Hybrid OLAP. Combination of the above two to optimize performance.

Your task: Search the web to find the relative advantages and disadvantages of each.

Data Warehousing Architectures

Issues to consider when deciding which architecture to use: Which database management system (DBMS) should

be used? Will parallel processing and/or partitioning be used? Will data migration tools be used to load the data

warehouse? What tools will be used to support data retrieval and

analysis?

Data Warehouse

Development and Implementation

Data Warehouse and BI Life CycleStructured data

Unstructured data

Data Warehouse(Enterprise)

Business IntelligenceReporting,

Ad-hoc QueryAnalysis

OLTP

Data Warehouse and BI Life Cycle

Source Data

Discover

ODS

Data Warehouse(Enterprise)

Extract -Transform - LoadData Integration

Develop

Deliver

Reporting Tools

Phase I

Data Warehouse and BI Life Cycle

Discover

DevelopDeliver

Multi-Dimensional Cubes(Subject wise)

Data Warehouse(Enterprise)

Data Marts(Departments) Business Intelligence

Phase II

Data Warehouse and BI Life Cycle

Data Warehouse OLAP Cube

From: http://oraclezine.blogspot.com/2009/01/data-warehousing-and-olap-cube.html, accessed January 20, 2011

Total Sales for Customer 3, Product C, for July.

Data Integration and the Extraction, Transformation, and Load (ETL) Process

Extraction reading data from a database

Transformation Converting the extracted data from its

previous form into the form in which it needs to be so that it can be placed into a data warehouse or simply another database

Load Putting the data into the data warehouse

Data Warehouse Schema

Dimensional Modeling The Star Schema

Dimension Tables that contain the Dimension for Analysis Example: Time, Region, Salesperson, etc.

Fact Tables that contains the measures and aggregates Example: Average sales, total commission, total sales,

etc. The Snowflake Schema

Very similar to Star-schema with a central fact table, but the dimensions are in hierarchical fashion. Example: Listing agent is a part of the listing company,

one city can have multiple zip code etc. Reduces the redundant data but can be inefficient for

queries that do not follow patterns.

Steps for DW OLP Design Decide on your Information Needs

I want to know the Total Sales by month, by region and by salesperson Decide the sources of data for each information need

Total Sales is not available, must be calculated from unit and quantity sold Most of the unit and qty sold are in OLTP, but some are also in CSV files for

some of the stores not connected to the database The Monthly information can be obtained from the date and is in OLTP The region information is also in OLTP transaction records vi a region code The Salesperson information is maintained in an Excel file and is coded by

region Declare the grain of the fact table (preferably at the most atomic level)

A grain is the lowest level of information you are interested in. The finer is the grain (tending towards atomic), the more are the dimensions.

Atomic grain is preferred as it can allow easy roll-ups. However, it can take a lot of space and processing.

Often the grains in the fact tables are much coarser with ability to drill down Example of Grains

Sales of Each Customer, for each day for each store. Monthly total sales for a particular region for all female customers

Add dimensions for "everything you know" about this grain Add numeric measured facts true to the grain.

Creating a Star* Schema

Identify the dimensions (typically this is the analysis by). In our real estate listing example, it can be the city name, bedrooms, listing agent, etc.

Identify the measures. In our case, it can be the average price/sq ft, total number of houses in a city, the average price, etc.

Identify the attributes of each dimension. Attributes are the properties of a dimension. For example, if the listing agent is a dimension, then the first name, last name, phone number etc, of a listing agent will be the attributes.

* Snowflake schema design is very similar with hierarchy of the dimensions separated in another table.

Star Schema continued.

Create the dimension tables with a surrogate Primary Key (PK). Include all the necessary attributes.

Decide on the measures and calculations. Those will be in the fact table.

For each PK in the dimension tables, create foreign keys (FK) in the star table.

Example of Star Schema

Example of Snowflake Schema

Implementing the Star Schema

1. Extract Data From Multiple Sources2. Integrate, Transform, and Restructure Data3. Load Data Into Dimension Tables and Fact Tables

The Star Schema Data Load

Northwind Northwind OLTPOLTP

Staging Area

Data WarehouseHeterogeneous

Data Sources

ExternalExternalFilesFiles

External External FilesFiles

Internal Internal FilesFiles

Inventory Inventory StarStar

Sales Sales StarStar

Extracting Data From Extracting Data From Transforming Loading the Transforming Loading the Heterogeneous SourcesHeterogeneous Sources Data Data Star Schema Star Schema

DTSDTS

DTSDTS DTSDTSFinancial Financial

DTSDTS

Verifying the Dimension Source Data

Verifying Accuracy of Source Data

Integrating data from multiple sources

Applying business rules Checking structural requirements

Managing Invalid Data Rejecting invalid data

Saving invalid data to a log

Correcting Invalid Data Transforming data

Reassigning data values

Dimension Data Load Examples:

buyer_namebuyer_namebuyer_namebuyer_name

Barr, AdamBarr, Adam

Chai, SeanChai, Sean

O’Melia, ErinO’Melia, Erin

......

reg_idreg_idreg_idreg_id

22

44

66

......

buyer_firstbuyer_firstbuyer_firstbuyer_first

AdamAdam

SeanSean

ErinErin

......

buyer_lastbuyer_lastbuyer_lastbuyer_last

BarrBarr

ChaiChai

O’MeliaO’Melia

......

reg_idreg_idreg_idreg_id

22

44

66

......DTSDTS

buyer_codebuyer_codebuyer_codebuyer_code

A123A123

B456B456

......

buyer_lastbuyer_lastbuyer_lastbuyer_last

BarrBarr

ChaiChai

O’MeliaO’Melia

......

reg_idreg_idreg_idreg_id

22

44

66

......

buyer_codebuyer_codebuyer_codebuyer_code

U999U999

A123A123

B456B456

......

buyer_lastbuyer_lastbuyer_lastbuyer_last

BarrBarr

ChaiChai

O’MeliaO’Melia

......

reg_idreg_idreg_idreg_id

22

44

66

......

buyer_namebuyer_namebuyer_namebuyer_name

Barr, AdamBarr, Adam

Chai, SeanChai, Sean

Smith, JaneSmith, Jane

Paper, AnnePaper, Anne

reg_idreg_idreg_idreg_id

22

44

22

44

DTSDTS

DTSDTS

buyer_namebuyer_namebuyer_namebuyer_name

Barr, AdamBarr, Adam

Chai, SeanChai, Sean

reg_idreg_idreg_idreg_id

IIII

IVIV

buyer_namebuyer_namebuyer_namebuyer_name

Smith, JaneSmith, Jane

Paper, AnnePaper, Anne

reg_idreg_idreg_idreg_id

22

44

Maintaining Integrity of the Dimension

Assigning a Surrogate Key to Each Record Defines the dimension’s primary key Relates to the foreign key fields of the fact

tableLoading One Record Per Application

Key Maintains uniqueness in the dimension Depends on how you manage changing

dimension data Maintains integrity of the fact table

Managing Changing Dimension Data

One of the problem in DW is to deal with the changing data on the reload and refresh. OLTP is operational and thus does not face this problem. These change areas are often called SCD or Slow Changing Dimensions

Dimensions with Changing Column Values Inserts of new data Updates of existing data

Slowly-Changing Dimension Design Solutions Type 1: Overwrite the dimension record Type 2: Write another dimension record Type 3: Add attributes to the dimension record

Examining the Star Schema

DimensionTables Dimension Table

Fact Table

Sales Star Schema

Type 1: Overwriting the Dimension Slide

Existing recordis changed

product keyproduct nameproduct sizeproduct packageproduct deptproduct catproduct subcat...

product keyproduct nameproduct sizeproduct packageproduct deptproduct catproduct subcat...

Product Dimension

001Rice Puffs10 oz.BagGroceryDry GoodsSnacks...

001Rice Puffs10 oz.BagGroceryDry GoodsSnacks...

Before After001Rice Puffs12 OzBagGroceryDry GoodsSnacks...

001Rice Puffs12 OzBagGroceryDry GoodsSnacks...

12 oz.

Type 2: Writing Another Dimension Record

Adds a new record

product keyproduct nameproduct sizeproduct packageproduct deptproduct catproduct subcateffective_date…

product keyproduct nameproduct sizeproduct packageproduct deptproduct catproduct subcateffective_date…

Product Dimension001Rice Puffs10 oz.BagGroceryDry GoodsSnacks05-01-1995...

001Rice Puffs10 oz.BagGroceryDry GoodsSnacks05-01-1995...

Before After001Rice Puffs10 OzBagGroceryDry GoodsSnacks05-01-1995...

001Rice Puffs10 OzBagGroceryDry GoodsSnacks05-01-1995...

10 oz. 12 oz.Rice Puffs12 OzBagGroceryDry GoodsSnacks10-15-1998...

Rice Puffs12 OzBagGroceryDry GoodsSnacks10-15-1998...

731

Type 3: Adding Attributes in the Dimension Record

Additional information is storedin an existing record

Product Dimension

product keyproduct nameproduct sizeproduct packageproduct deptproduct catproduct subcatcurrent product size dateprevious product sizeprevious product size date2nd previous product size2nd previous product size date...

product keyproduct nameproduct sizeproduct packageproduct deptproduct catproduct subcatcurrent product size dateprevious product sizeprevious product size date2nd previous product size2nd previous product size date...

product size

previous product sizeprevious product size date

Before001Rice Puffs10 OzBagGroceryDry GoodsSnacks05-01-199511 Oz03-20-1994(null)(null)...

001Rice Puffs10 OzBagGroceryDry GoodsSnacks05-01-199511 Oz03-20-1994(null)(null)...

10 oz.

11 oz.03-20-1994

After001Rice Puffs12 oz.BagGroceryDry GoodsSnacks10-15-199810 oz.05-01-199511 Oz03-20-1994...

001Rice Puffs12 oz.BagGroceryDry GoodsSnacks10-15-199810 oz.05-01-199511 Oz03-20-1994...

12 oz

10-15-1998

11 oz.03-20-1994

05-01-1995

Verifying the Fact Table Source Data

Verifying Accuracy of Source Data

Integrating data from multiple sources

Applying business rules

Checking structural requirements

Creating calculated fields

Managing Invalid Data Rejecting invalid data

Saving invalid data to a log

Correcting Invalid Data Transforming data

Reassigning data values

Assigning Foreign Keys

DimensionTables

DimensionTables

customer_dimcustomer_dimcustomer_dimcustomer_dim201 ALFI Alfreds201 ALFI Alfreds

product_dimproduct_dimproduct_dimproduct_dim 25 123 Chai 25 123 Chai

Source Data

customer idcustomer id

ALFI 123 1/1/2000 400

134 1/1/2000134 1/1/2000

time_dimtime_dimtime_dimtime_dim

product idproduct id order dateorder date quantity_salesquantity_sales amount_salesamount_sales

10,789123 1/1/2000 400 10,789

cust_keycust_key

123 1/1/2000 400

prod_keyprod_key time_keytime_key quantity_salesquantity_sales amount_salesamount_sales

25 134 400 10,789201

Sales Fact Data

Defining Measures

Loading Measures from the Source System

Calculating Additional Measures

Source System Data

Fact Table Data

customer_idcustomer_idcustomer_idcustomer_id

VINETVINET

ALFIALFI

HANARHANAR

......

product_idproduct_idproduct_idproduct_id

9GZ9GZ

1KJ1KJ

0ZA0ZA

......

pricepricepriceprice

.55.55

1.101.10

.98.98

......

qtyqtyqtyqty

3232

4848

99

......

customer_keycustomer_keycustomer_keycustomer_key

100100

238238

437437

......

product_keyproduct_keyproduct_keyproduct_key

512512

207207

338338

......

qtyqtyqtyqty

3232

4848

99

......

total_salestotal_salestotal_salestotal_sales

17.6017.60

52.8052.80

8.828.82

......

Maintaining Data Integrity

Adhering to the Fact Table Grain A fact table can only have one grain You must load a fact table with data at the same level

of detail as defined by the grain

Enforcing Column Constraints NOT NULL constraints FOREIGN KEY constraints

Implementing Staging Tables

Centralize and Integrate Source Data Break Up Complex Data

TransformationsFacilitate Error Recovery

Staging Area sales_stagesales_stage

inventory_stageinventory_stage

market_stagemarket_stage

shipments_stageshipments_stage

DTS Functionality

Accessing Heterogeneous Data SourcesImporting, Exporting, and Transforming DataCreating Reusable Transformations and

FunctionsAutomating Data LoadsManaging MetadataCustomizing and Extending Functionality

Defining DTS Packages

Identifies Data Sources and DestinationsDefines Tasks or ActionsImplements Transformation Logic Defines Order of Operations

Identifying Package Components

Connections: Access Data Sources and Destinations

Tasks: Describe Data Transformations or Functions

Steps: Define the Order of Task Operations or Workflow

Global Variables: Store Data that Can Be Shared Across Tasks

Creating Packages

Using the DTS Import / Export Wizard Perform ad-hoc table and data transfers Develop a prototype package

Using DTS Package Designer Edit packages created with the DTS Import/Export

Wizard Create packages with a wide range of functionality

Programming DTS Applications Directly access the functionality of the DTS Object

Model Requires Microsoft Visual Basic or Microsoft Visual

C++

Using DTS to Populate the Sales Star

Populating the Sales Star DimensionsPopulating the Sales Star Fact Table

Populating the Sales Star Dimensions

Product Product Tab DelimitedTab Delimited

FilesFiles

Northwind Northwind OLTPOLTP

DTSDTS

DTSDTS

time_dimtime_dim

customer_dimcustomer_dim

product_dimproduct_dim

SQL Server SQL Server Stored ProcedureStored Procedure

DTSDTS

Populating the Sales Star Fact Table

DTSDTS

sales_factDTSDTS

sales_stagesales_stage

time_dimtime_dimcustomer_dimcustomer_dim

product_dimproduct_dim sales_stagesales_stage

Sales DataSales DataFileFile

Why not Just Excel Pivot Tables?

For small project, Pivot Tables are excellentHowever, if you insert/delete records, you have to refresh the

pivot table and may have to change the table reference (though Table features make it easy)

Calculation cannot be done for one field that will be reflected for all records. You have to copy it for all the records!

No pre-processing can be done. The pivot table is refreshed and recalculated every time. Can lead to inefficiencies

No built-in integration mechanism with a data source (can be done by connecting to cube servers and ODBC compliant databases, but again an extra step and work)

Selective drill-down is difficult, mainly all or none (again grouping can help, but you have to be skilled in Pivot Tables)

Data Warehouse Implementation

Implementation issues Implementing a data warehouse is

generally a massive effort that must be planned and executed according to established methods

There are many facets to the project lifecycle, and no single person can be an expert in each area

Major Tasks for Successful Implementation of a DW

1. Establishment of service-level agreements and data-refresh requirements

2. Identification of data sources and their governance policies

3. Data quality planning4. Data model design5. ETL tool selection

6. Relational database software and platform selection

7. Data transport8. Data conversion9. Reconciliation

process10.Purge and archive

planning11.End-user support

From Solomon, 2005

Data Warehouse Implementation

Implementation factors that can be categorized into three criteria Organizational issues Project issues Technical issues

User participation in the development of data and access modeling is a critical success factor in data warehouse development

Data Warehouse Management

Data warehouse administrator (DWA) A person responsible for the administration and

management of a data warehouse Security Concerns

“All eggs in one basket” Unauthorized access

Usage Concerns Users need to understand how to use the data

correctly Why was data collected, how was it stored, how

do these affect the current use of the data

Data Warehouse Management

Effective security in a data warehouse should focus on four main areas: Establishing effective corporate and security

policies and procedures Implementing logical security procedures and

techniques to restrict access Limiting physical access to the data center

environment Establishing an effective internal control review

process with an emphasis on security and privacy

Data Warehouse Benefits

Direct benefits of a data warehouse Allows end users to perform extensive analysis Allows a consolidated view of corporate data Better and more timely information Enhanced system performance Simplification of data access

Data Warehouse Benefits

Indirect benefits result from end users using these direct benefits Enhance business knowledge Present competitive advantage Enhance customer service and satisfaction Facilitate decision making Help in reforming business processes

Data Warehouse Vendor Selection

Important guidelines for selecting a DW vendor Financial strength ERP linkages Qualified consultants Market share Industry experience Established partnerships

DW Development Best Practices

Project must fit with corporate strategy and business objectives

There must be complete buy-in to the project by executives, managers, and users

It is important to manage user expectations about the completed project

The data warehouse must be built incrementally

Build in adaptability

The project must be managed by both IT and business professionals

Develop a business/supplier relationship

Only load data that have been cleansed and are of a quality understood by the organization

Do not overlook training requirements

Be politically aware

From Weir, 2002

DW Development: What NOT to do Cultural issues being ignored Inappropriate architecture Unclear business objectives Missing information Unrealistic expectations Low levels of data

summarization Low data quality Believing promises of

performance, capacity, and scalability

Believing that your problems are over when the data warehouse is up and running

Believing that DW database design is the same as transactional database design

Choosing a DW manager who is technology oriented rather than user oriented

Focusing on traditional internal record-oriented data and ignoring the value of external data and of text, images, and, perhaps, sound and video

Focusing on ad hoc data mining and periodic reporting instead of alerts

Delivering data with overlapping and confusing definitions