TRANSCRIPT
Why Data Warehouse and OLAP?
The intelligence stage of decision making needs correct data.
The data should also be clean, organized, and fast to access. Often the data is used only for subsequent stages and should not be changed (in other words, decision makers are users of data, not generators of data).
The data for decision making comes from multiple sources: OLTP databases, XML files, and flat files such as CSV and PDF. These need to be combined into one version of the truth. Such needs cannot easily be served by traditional data sources, hence the concept of the data warehouse / data mart.
[Figure: the Business Intelligence Process pyramid. From bottom to top: Data Sources (paper, files, information providers, database systems, OLTP); Data Warehouses / Data Marts (OLAP, MDA); Data Exploration (statistical analysis, querying and reporting); Data Mining (information discovery); Data Presentation (visualization techniques); Making Decisions. The potential to support business decisions increases toward the top, and the roles involved shift from DBA through data analyst and business analyst to the end user.]
Data Warehouse: Definitions
Data warehouse: There is no single definition, as the term can encompass many aspects of data management. It is defined in many different ways, but not rigorously. Some definitions found in the literature:
A physical repository where relational data are specially organized to provide enterprise-wide, cleansed data in a standardized format
A decision support database that is maintained separately from the organization's operational database
A consistent database source that brings together information from multiple sources for decision support queries
A platform that supports information processing by providing consolidated, historical data for analysis
Data warehousing: A process by which data from multiple sources are extracted, transformed, and loaded into a data warehouse in a planned operation.
Data Warehouse vs. Operational DBMS
• OLTP (on-line transaction processing)
The major task of a traditional relational DBMS
Day-to-day operations: purchasing, inventory, banking, manufacturing, payroll, registration, accounting, etc.
Aims at reliable and efficient processing of a large number of transactions and at ensuring data consistency
• OLAP (on-line analytical processing)
The major task of a data warehouse system
Data analysis and decision making
Aims at efficient multidimensional processing of large data volumes: fast, interactive answers to large aggregate queries
• Distinct features (OLTP vs. OLAP):
User and system orientation: customer vs. market
Data contents: current, detailed vs. historical, consolidated
Database design: ER + application vs. star + subject
View: current, local vs. evolutionary, integrated
Access patterns: update vs. read-only but complex queries
OLTP vs. OLAP

Feature          OLTP                        OLAP
User             Clerk, IT professional      Knowledge worker
Function         Day-to-day operations       Decision support
DB design        Application-oriented        Subject-oriented
Data             Current, isolated           Historical, consolidated
View             Detailed, flat relational   Summarized, multidimensional
Usage            Structured, repetitive      Ad hoc
Unit of work     Short, simple transaction   Complex query
Access           Read/write                  Read mostly
Operations       Index/hash on primary key   Lots of scans
# Rec. accessed  Tens                        Millions
# Users          Thousands                   Hundreds
DB size          100 MB-GB                   100 GB-TB
Metric           Transaction throughput      Query throughput, response time
Data Warehousing Characteristics
Subject oriented
Integrated
Time variant (time series)
Nonvolatile
Web based
Relational/multidimensional
Client/server
Real-time
Includes metadata
Data Warehouse Architecture
Inmon (Bill Inmon) model: EDW (enterprise data warehouse) approach
Kimball (Ralph Kimball) model: data mart approach
Which model is best? There is no one-size-fits-all strategy to data warehousing. One alternative is the hosted warehouse.
Categories/Types of Data Warehouses
Data mart: A departmental data warehouse that stores only relevant data
Dependent data mart: A subset that is created directly from a data warehouse
Independent data mart: A small data warehouse designed for a strategic business unit or a department
Operational data store (ODS): A type of database often used as an interim area for a data warehouse, especially for customer information files
Operational data mart: A small-scale data mart typically used by a single department or functional area in an organization
Various OLAPs
ROLAP (relational OLAP): The analysis cube is supported by a multidimensional data warehouse (or data mart) built on a relational database.
MOLAP (multidimensional OLAP): The analysis cube is supported by a data warehouse built on native multidimensional data storage.
HOLAP (hybrid OLAP): A combination of the two, used to optimize performance.
Your task: Search the web to find the relative advantages and disadvantages of each.
Data Warehousing Architectures
Issues to consider when deciding which architecture to use:
Which database management system (DBMS) should be used?
Will parallel processing and/or partitioning be used?
Will data migration tools be used to load the data warehouse?
What tools will be used to support data retrieval and analysis?
Data Warehouse and BI Life Cycle
[Figure: structured data (from OLTP) and unstructured data flow into the enterprise data warehouse, which feeds business intelligence: reporting, ad-hoc query, and analysis.]
Data Warehouse and BI Life Cycle - Phase I
[Figure: source data is discovered, staged in an ODS, and moved into the enterprise data warehouse through extract-transform-load (ETL) data integration; reporting tools are then developed and delivered.]
Data Warehouse and BI Life Cycle - Phase II
[Figure: the discover-develop-deliver cycle continues; the enterprise data warehouse feeds subject-wise multidimensional cubes and departmental data marts, which support business intelligence.]
Data Warehouse OLAP Cube
[Figure: a three-dimensional OLAP cube (customer x product x time); one highlighted cell represents the total sales for Customer 3, Product C, for July.]
From: http://oraclezine.blogspot.com/2009/01/data-warehousing-and-olap-cube.html, accessed January 20, 2011
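The highlighted cell is simply an aggregate over the underlying fact rows. A minimal sketch in Python, with hypothetical fact data, showing how each cube cell holds the sum of the matching rows:

```python
from collections import defaultdict

# Hypothetical fact rows: (customer, product, month, sales_amount)
fact_rows = [
    (3, "C", "July", 100.0),
    (3, "C", "July", 250.0),
    (3, "A", "July", 75.0),
    (1, "C", "June", 40.0),
]

# Build the cube: every (customer, product, month) cell is total sales.
cube = defaultdict(float)
for customer, product, month, amount in fact_rows:
    cube[(customer, product, month)] += amount

def cell(customer, product, month):
    """Return the total-sales measure for one cube cell."""
    return cube[(customer, product, month)]
```

Here `cell(3, "C", "July")` returns the 350.0 total for that one cube cell, exactly the kind of lookup the highlighted cell represents.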
Data Integration and the Extraction, Transformation, and Load (ETL) Process
Extraction: reading data from a database
Transformation: converting the extracted data from its previous form into the form it needs to be in so that it can be placed into a data warehouse (or simply another database)
Load: putting the data into the data warehouse
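The three steps can be sketched end to end in a few lines of Python. The source CSV, table name, and column names here are hypothetical, and SQLite stands in for the warehouse database:

```python
import csv
import io
import sqlite3

# --- Extract: read source rows (an in-memory CSV for this sketch) ---
source_csv = "customer,amount\nALFI,100\nVINET,  250\n"
rows = list(csv.DictReader(io.StringIO(source_csv)))

# --- Transform: clean and convert into warehouse-ready form ---
clean = [(r["customer"].strip(), float(r["amount"])) for r in rows]

# --- Load: insert into the warehouse table ---
dw = sqlite3.connect(":memory:")
dw.execute("CREATE TABLE sales_fact (customer TEXT, amount REAL)")
dw.executemany("INSERT INTO sales_fact VALUES (?, ?)", clean)
total = dw.execute("SELECT SUM(amount) FROM sales_fact").fetchone()[0]
```

In a real pipeline each step is far larger (multiple sources, business rules, staging), but the extract-transform-load shape stays the same.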
Data Warehouse Schema
Dimensional Modeling
The Star Schema
Dimension tables contain the dimensions for analysis. Example: time, region, salesperson, etc.
Fact tables contain the measures and aggregates. Example: average sales, total commission, total sales, etc.
The Snowflake Schema
Very similar to the star schema, with a central fact table, but the dimensions are arranged in a hierarchy. Example: a listing agent is part of a listing company; one city can have multiple zip codes; etc.
Reduces redundant data, but can be inefficient for queries that do not follow the hierarchy's patterns.
Steps for DW OLAP Design
Decide on your information needs
Example: I want to know the total sales by month, by region, and by salesperson.
Decide the sources of data for each information need
Total sales is not available; it must be calculated from unit price and quantity sold.
Most of the unit and quantity data are in OLTP, but some are in CSV files for stores not connected to the database.
The monthly information can be derived from the date, which is in OLTP.
The region information is also in OLTP transaction records, via a region code.
The salesperson information is maintained in an Excel file and is coded by region.
Declare the grain of the fact table (preferably at the most atomic level)
The grain is the lowest level of information you are interested in. The finer the grain (tending toward atomic), the more dimensions there are.
An atomic grain is preferred, as it allows easy roll-ups; however, it can take a lot of space and processing.
Often the grains in fact tables are much coarser, with the ability to drill down.
Examples of grains: sales for each customer, for each day, for each store; monthly total sales for a particular region for all female customers.
Add dimensions for "everything you know" about this grain
Add numeric measured facts true to the grain
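Atomic-grain facts make roll-ups to coarser grains straightforward. A sketch with hypothetical rows, rolling daily per-store, per-customer sales up to monthly totals per region:

```python
from collections import defaultdict

# Hypothetical atomic-grain fact rows:
# (date "YYYY-MM-DD", store, region, customer, sales)
atomic = [
    ("2024-07-01", "S1", "West", "C1", 10.0),
    ("2024-07-15", "S1", "West", "C2", 20.0),
    ("2024-07-20", "S2", "East", "C1", 5.0),
    ("2024-08-02", "S2", "East", "C3", 7.5),
]

def monthly_by_region(rows):
    """Roll the atomic grain up to (month, region). This is only
    possible because the fact table keeps the finest grain."""
    out = defaultdict(float)
    for date, store, region, customer, sales in rows:
        month = date[:7]  # "YYYY-MM"
        out[(month, region)] += sales
    return dict(out)
```

Had the fact table been stored at the monthly grain instead, drilling back down to daily or per-customer figures would be impossible.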
Creating a Star* Schema
Identify the dimensions (typically what the analysis is "by"). In our real estate listing example, these could be the city name, bedrooms, listing agent, etc.
Identify the measures. In our case, these could be the average price per square foot, the total number of houses in a city, the average price, etc.
Identify the attributes of each dimension. Attributes are the properties of a dimension. For example, if the listing agent is a dimension, then the first name, last name, phone number, etc., of a listing agent are its attributes.
* The snowflake schema design is very similar, with the hierarchy of each dimension separated into another table.
Star Schema, continued
Create the dimension tables with a surrogate primary key (PK). Include all the necessary attributes.
Decide on the measures and calculations. Those will go in the fact table.
For each PK in the dimension tables, create a foreign key (FK) in the fact table.
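The steps above can be sketched as DDL. The table and column names are hypothetical, and SQLite stands in for the warehouse platform:

```python
import sqlite3

# Minimal star-schema DDL: surrogate PKs in the dimension tables,
# matching FKs plus the measures in the fact table.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE time_dim (
    time_key      INTEGER PRIMARY KEY,   -- surrogate key
    full_date     TEXT NOT NULL,
    month         TEXT NOT NULL
);
CREATE TABLE customer_dim (
    customer_key  INTEGER PRIMARY KEY,   -- surrogate key
    customer_id   TEXT NOT NULL,         -- natural/application key
    customer_name TEXT
);
CREATE TABLE sales_fact (
    customer_key INTEGER NOT NULL REFERENCES customer_dim(customer_key),
    time_key     INTEGER NOT NULL REFERENCES time_dim(time_key),
    quantity     INTEGER NOT NULL,       -- measure
    amount       REAL NOT NULL           -- measure
);
""")
tables = {r[0] for r in con.execute(
    "SELECT name FROM sqlite_master WHERE type='table'")}
```

The fact table carries no natural keys at all: every dimension is reached through its surrogate key, which is what insulates the warehouse from changes in source-system identifiers.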
Implementing the Star Schema
1. Extract data from multiple sources
2. Integrate, transform, and restructure data
3. Load data into dimension tables and fact tables
The Star Schema Data Load
[Figure: heterogeneous data sources (the Northwind OLTP database plus external and internal files, e.g., financial data) are extracted by DTS into a staging area, transformed, and loaded into the data warehouse's Inventory and Sales star schemas.]
Verifying the Dimension Source Data
Verifying accuracy of source data
  Integrating data from multiple sources
  Applying business rules
  Checking structural requirements
Managing invalid data
  Rejecting invalid data
  Saving invalid data to a log
Correcting invalid data
  Transforming data
  Reassigning data values
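The reject/log/correct choices above can be sketched as a small load routine. The validation rule and default value here are hypothetical examples:

```python
def load_dimension(rows):
    """Validate (name, reg_id) rows for a dimension load.
    Invalid rows are set aside (the 'log'); correctable ones are fixed."""
    loaded, rejected = [], []
    for row in rows:
        name, reg_id = row
        if not name:                 # structural requirement: name present
            rejected.append(row)     # save the invalid row to a log
            continue
        if reg_id is None:
            reg_id = -1              # correct: reassign a default value
        loaded.append((name.strip(), reg_id))
    return loaded, rejected

loaded, rejected = load_dimension(
    [("Barr, Adam ", 2), ("", 4), ("Chai, Sean", None)])
```

A real DTS/ETL tool applies the same pattern at scale: rules decide per row whether to reject, log, or transform.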
Dimension Data Load Examples:

Splitting a combined name column. The source holds buyer_name as "Last, First"; the dimension stores the first and last name separately:

Source                         Dimension (after DTS)
buyer_name      reg_id         buyer_first  buyer_last  reg_id
Barr, Adam      2              Adam         Barr        2
Chai, Sean      4              Sean         Chai        4
O'Melia, Erin   6              Erin         O'Melia     6
...             ...            ...          ...         ...

Assigning a default buyer code. The source is missing a code for one record; DTS assigns the default code U999 during the load:

Source                             Dimension (after DTS)
buyer_code  buyer_last  reg_id     buyer_code  buyer_last  reg_id
            Barr        2          U999        Barr        2
A123        Chai        4          A123        Chai        4
B456        O'Melia     6          B456        O'Melia     6
...         ...         ...        ...         ...         ...

Integrating two sources with different region-ID formats. Roman-numeral IDs are converted to the numeric IDs used in the dimension:

Source A                    Source B
buyer_name   reg_id         buyer_name   reg_id
Barr, Adam   II             Smith, Jane  2
Chai, Sean   IV             Paper, Anne  4

Dimension (after DTS)
buyer_name   reg_id
Barr, Adam   2
Chai, Sean   4
Smith, Jane  2
Paper, Anne  4
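The first and third transformations above can be sketched as small Python helpers (the roman-numeral mapping table is an assumption based on the sample IDs):

```python
def split_name(buyer_name):
    """Split a 'Last, First' source value into (first, last)."""
    last, first = [p.strip() for p in buyer_name.split(",", 1)]
    return first, last

# Assumed mapping from roman-numeral region IDs to numeric IDs.
ROMAN = {"I": 1, "II": 2, "III": 3, "IV": 4, "V": 5, "VI": 6}

def map_region(reg_id):
    """Accept either roman-numeral or numeric region IDs and
    return the numeric ID the dimension uses."""
    return ROMAN[reg_id] if reg_id in ROMAN else int(reg_id)
```

Each source system keeps its own format; the transformation layer is the one place that knows how to reconcile them.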
Maintaining Integrity of the Dimension
Assigning a surrogate key to each record
  Defines the dimension's primary key
  Relates to the foreign key fields of the fact table
Loading one record per application key
  Maintains uniqueness in the dimension
  Depends on how you manage changing dimension data
  Maintains integrity of the fact table
Managing Changing Dimension Data
One of the problems in a DW is dealing with changing data on reload and refresh. OLTP is operational and thus does not face this problem. These change areas are often called SCDs, or slowly changing dimensions.
Dimensions with changing column values
  Inserts of new data
  Updates of existing data
Slowly changing dimension design solutions
  Type 1: Overwrite the dimension record
  Type 2: Write another dimension record
  Type 3: Add attributes to the dimension record
Type 1: Overwriting the Dimension Record
The existing record is changed in place; the old value is lost.

Product Dimension (columns: product key, product name, product size, product package, product dept, product cat, product subcat, ...)

Before: 001, Rice Puffs, 10 oz., Bag, Grocery, Dry Goods, Snacks, ...
After:  001, Rice Puffs, 12 oz., Bag, Grocery, Dry Goods, Snacks, ...
Type 2: Writing Another Dimension Record
A new record is added; the old record is kept, so history is preserved.

Product Dimension (columns: product key, product name, product size, product package, product dept, product cat, product subcat, effective_date, ...)

Before: 001, Rice Puffs, 10 oz., Bag, Grocery, Dry Goods, Snacks, 05-01-1995, ...
After:  001, Rice Puffs, 10 oz., Bag, Grocery, Dry Goods, Snacks, 05-01-1995, ...
        731, Rice Puffs, 12 oz., Bag, Grocery, Dry Goods, Snacks, 10-15-1998, ...

(731 is the new surrogate key assigned to the added record.)
Type 3: Adding Attributes in the Dimension Record
Additional information is stored in the existing record: extra columns keep the previous values and their dates.

Product Dimension (columns: product key, product name, product size, product package, product dept, product cat, product subcat, current product size date, previous product size, previous product size date, 2nd previous product size, 2nd previous product size date, ...)

Before: 001, Rice Puffs, 10 oz., Bag, Grocery, Dry Goods, Snacks, 05-01-1995, 11 oz., 03-20-1994, (null), (null), ...
After:  001, Rice Puffs, 12 oz., Bag, Grocery, Dry Goods, Snacks, 10-15-1998, 10 oz., 05-01-1995, 11 oz., 03-20-1994, ...
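Of the three designs, Type 2 is the only one that preserves full history without a fixed limit. A sketch of the Rice Puffs size change as a Type 2 load, using SQLite and a hypothetical subset of the columns:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE product_dim (
    product_key    INTEGER PRIMARY KEY,  -- surrogate key
    product_id     TEXT,                 -- natural key
    product_name   TEXT,
    product_size   TEXT,
    effective_date TEXT)""")
con.execute("INSERT INTO product_dim VALUES "
            "(1, '001', 'Rice Puffs', '10 oz.', '1995-05-01')")

def scd_type2_change(con, product_id, new_size, effective_date):
    """Type 2: keep the old record and add a new one with a new
    surrogate key, so every historical version stays queryable."""
    old = con.execute(
        "SELECT product_name FROM product_dim WHERE product_id=? "
        "ORDER BY effective_date DESC LIMIT 1", (product_id,)).fetchone()
    con.execute(
        "INSERT INTO product_dim "
        "(product_id, product_name, product_size, effective_date) "
        "VALUES (?, ?, ?, ?)", (product_id, old[0], new_size, effective_date))

scd_type2_change(con, "001", "12 oz.", "1998-10-15")
versions = con.execute(
    "SELECT product_size FROM product_dim WHERE product_id='001' "
    "ORDER BY effective_date").fetchall()
```

Fact rows loaded before the change keep pointing at the old surrogate key, so historical queries still see the 10 oz. product.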
Verifying the Fact Table Source Data
Verifying accuracy of source data
  Integrating data from multiple sources
  Applying business rules
  Checking structural requirements
  Creating calculated fields
Managing invalid data
  Rejecting invalid data
  Saving invalid data to a log
Correcting invalid data
  Transforming data
  Reassigning data values
Assigning Foreign Keys
[Figure: each natural key in the source data is looked up in the dimension tables to find its surrogate key before the fact row is written.]

Source data: customer id ALFI, product id 123, order date 1/1/2000, quantity_sales 400, amount_sales 10,789

Dimension lookups:
  customer_dim: ALFI (Alfreds) -> cust_key 201
  product_dim: 123 (Chai) -> prod_key 25
  time_dim: 1/1/2000 -> time_key 134

Sales fact data: cust_key 201, prod_key 25, time_key 134, quantity_sales 400, amount_sales 10,789
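The lookup step can be sketched with plain dictionaries standing in for the dimension tables, using the sample values from the figure:

```python
# Dimension lookups: natural key -> surrogate key.
customer_dim = {"ALFI": 201}
product_dim  = {"123": 25}
time_dim     = {"1/1/2000": 134}

def to_fact_row(customer_id, product_id, order_date, qty, amount):
    """Replace every natural key with its surrogate key, producing
    the row that is actually written to the fact table."""
    return (customer_dim[customer_id],
            product_dim[product_id],
            time_dim[order_date],
            qty, amount)

fact_row = to_fact_row("ALFI", "123", "1/1/2000", 400, 10789)
```

A missing key raises an error here; in a real load, that is exactly the row you would reject or send to an "unknown member" record rather than fail silently.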
Defining Measures
Loading measures from the source system
Calculating additional measures

Source system data:
customer_id  product_id  price  qty
VINET        9GZ         .55    32
ALFI         1KJ         1.10   48
HANAR        0ZA         .98    9
...          ...         ...    ...

Fact table data (qty is loaded directly; total_sales = price x qty is calculated):
customer_key  product_key  qty  total_sales
100           512          32   17.60
238           207          48   52.80
437           338          9    8.82
...           ...          ...  ...
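The calculated measure in this example is total_sales = price x qty, computed during the load. A sketch using the sample rows (the surrogate-key lookup is omitted for brevity):

```python
# Source rows: (customer_id, product_id, price, qty)
source = [
    ("VINET", "9GZ", 0.55, 32),
    ("ALFI",  "1KJ", 1.10, 48),
    ("HANAR", "0ZA", 0.98, 9),
]

# Load qty straight through; calculate total_sales during the load,
# rounding to cents as a money measure.
fact = [(cust, prod, qty, round(price * qty, 2))
        for cust, prod, price, qty in source]
```

Precomputing the measure at load time is what lets OLAP queries sum total_sales directly instead of recalculating price x qty on every query.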
Maintaining Data Integrity
Adhering to the fact table grain
  A fact table can have only one grain
  You must load a fact table with data at the same level of detail as defined by the grain
Enforcing column constraints
  NOT NULL constraints
  FOREIGN KEY constraints
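Both constraint types can be enforced by the warehouse DBMS itself. A sketch using SQLite with hypothetical table names (note that SQLite enforces foreign keys only when the pragma is enabled):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("PRAGMA foreign_keys = ON")  # SQLite needs this per connection
con.execute("CREATE TABLE customer_dim (customer_key INTEGER PRIMARY KEY)")
con.execute("""CREATE TABLE sales_fact (
    customer_key INTEGER NOT NULL
        REFERENCES customer_dim(customer_key),
    amount REAL NOT NULL)""")
con.execute("INSERT INTO customer_dim VALUES (201)")

def try_insert(row):
    """Attempt a fact-row insert; the DBMS rejects rows that
    violate the NOT NULL or FOREIGN KEY constraints."""
    try:
        con.execute("INSERT INTO sales_fact VALUES (?, ?)", row)
        return "ok"
    except sqlite3.IntegrityError:
        return "rejected"
```

A valid row (existing customer key, non-null amount) loads; an unknown key or a null measure is rejected by the database rather than silently corrupting the fact table.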
Implementing Staging Tables
Centralize and integrate source data
Break up complex data transformations
Facilitate error recovery
[Figure: a staging area containing sales_stage, inventory_stage, market_stage, and shipments_stage tables.]
DTS Functionality
Accessing heterogeneous data sources
Importing, exporting, and transforming data
Creating reusable transformations and functions
Automating data loads
Managing metadata
Customizing and extending functionality
Defining DTS Packages
Identifies data sources and destinations
Defines tasks or actions
Implements transformation logic
Defines the order of operations
Identifying Package Components
Connections: Access Data Sources and Destinations
Tasks: Describe Data Transformations or Functions
Steps: Define the Order of Task Operations or Workflow
Global Variables: Store Data that Can Be Shared Across Tasks
Creating Packages
Using the DTS Import/Export Wizard
  Perform ad-hoc table and data transfers
  Develop a prototype package
Using DTS Package Designer
  Edit packages created with the DTS Import/Export Wizard
  Create packages with a wide range of functionality
Programming DTS applications
  Directly access the functionality of the DTS object model
  Requires Microsoft Visual Basic or Microsoft Visual C++
Using DTS to Populate the Sales Star
Populating the Sales Star dimensions
Populating the Sales Star fact table
Populating the Sales Star Dimensions
[Figure: product data in tab-delimited files and the Northwind OLTP database are loaded by DTS into product_dim, customer_dim, and time_dim; one load path uses a SQL Server stored procedure.]
Populating the Sales Star Fact Table
[Figure: a sales data file is loaded by DTS into the sales_stage staging table; DTS then joins the staged data with time_dim, customer_dim, and product_dim to populate sales_fact.]
Why Not Just Excel Pivot Tables?
For a small project, pivot tables are excellent. However:
If you insert or delete records, you have to refresh the pivot table and may have to change the table reference (though the Table feature makes this easy).
A calculation cannot be done once for one field and reflected across all records; you have to copy it to all the records.
No pre-processing can be done: the pivot table is refreshed and recalculated every time, which can lead to inefficiencies.
There is no built-in integration mechanism with a data source (this can be done by connecting to cube servers and ODBC-compliant databases, but that is an extra step and extra work).
Selective drill-down is difficult; it is mainly all or none (grouping can help, but you have to be skilled with pivot tables).
Data Warehouse Implementation
Implementation issues
Implementing a data warehouse is generally a massive effort that must be planned and executed according to established methods.
There are many facets to the project lifecycle, and no single person can be an expert in each area.
Major Tasks for Successful Implementation of a DW
1. Establishment of service-level agreements and data-refresh requirements
2. Identification of data sources and their governance policies
3. Data quality planning
4. Data model design
5. ETL tool selection
6. Relational database software and platform selection
7. Data transport
8. Data conversion
9. Reconciliation process
10. Purge and archive planning
11. End-user support
From Solomon, 2005
Data Warehouse Implementation
Implementation factors can be categorized into three groups: organizational issues, project issues, and technical issues.
User participation in the development of data and access modeling is a critical success factor in data warehouse development.
Data Warehouse Management
Data warehouse administrator (DWA): the person responsible for the administration and management of a data warehouse.
Security concerns
  "All eggs in one basket"
  Unauthorized access
Usage concerns
  Users need to understand how to use the data correctly
  Why was the data collected, how was it stored, and how do these affect the current use of the data?
Data Warehouse Management
Effective security in a data warehouse should focus on four main areas:
Establishing effective corporate and security policies and procedures
Implementing logical security procedures and techniques to restrict access
Limiting physical access to the data center environment
Establishing an effective internal control review process with an emphasis on security and privacy
Data Warehouse Benefits
Direct benefits of a data warehouse:
Allows end users to perform extensive analysis
Allows a consolidated view of corporate data
Better and more timely information
Enhanced system performance
Simplification of data access
Data Warehouse Benefits
Indirect benefits result from end users exploiting the direct benefits:
Enhance business knowledge
Present competitive advantage
Enhance customer service and satisfaction
Facilitate decision making
Help in reforming business processes
Data Warehouse Vendor Selection
Important guidelines for selecting a DW vendor:
Financial strength
ERP linkages
Qualified consultants
Market share
Industry experience
Established partnerships
DW Development Best Practices
Project must fit with corporate strategy and business objectives
There must be complete buy-in to the project by executives, managers, and users
It is important to manage user expectations about the completed project
The data warehouse must be built incrementally
Build in adaptability
The project must be managed by both IT and business professionals
Develop a business/supplier relationship
Only load data that have been cleansed and are of a quality understood by the organization
Do not overlook training requirements
Be politically aware
From Weir, 2002
DW Development: What NOT to Do
Ignoring cultural issues
Inappropriate architecture
Unclear business objectives
Missing information
Unrealistic expectations
Low levels of data summarization
Low data quality
Believing promises of performance, capacity, and scalability
Believing that your problems are over when the data warehouse is up and running
Believing that DW database design is the same as transactional database design
Choosing a DW manager who is technology oriented rather than user oriented
Focusing on traditional internal record-oriented data and ignoring the value of external data and of text, images, and, perhaps, sound and video
Focusing on ad hoc data mining and periodic reporting instead of alerts
Delivering data with overlapping and confusing definitions