Data Warehouse & OLAP Introduction
Session: 03-04
Lecturer: Danang Junaedi
2SPK – IF UTAMA
The Knowledge Discovery Process
Source: Fayyad, U., Piatetsky-Shapiro, G., Smyth, P., From Data Mining to Knowledge Discovery in Databases, AI Magazine, Fall 1996.
[Figure: Data Sources → Selection → Target Data → Preprocessing → Preprocessed Data → Transformation → Transformed Data → Data Mining → Patterns/Models → Interpretation/Evaluation → Knowledge]
3SPK – IF UTAMA
Data Sources
Relational Databases
Data Warehouses
WWW
Audio
Video
Printed Materials
4SPK – IF UTAMA
Relational Databases
5SPK – IF UTAMA
Multidimensional Data Cube
Han, J. & Kamber, M., Data Mining: Concepts and Techniques, Morgan Kaufmann, 2000
6SPK – IF UTAMA
Evolution of DSS
Transaction Processing Systems (TPS)
o Operational data stores and OLTP
o Batch reports; hard to find and analyze information; inflexible and expensive; reprogramming needed for every new request (circa 1960s)
MIS
o Management reporting from transactions in TPS
o Still inflexible, not integrated with desktop tools (circa 1970s)
DSS
o Combine data with analytic models or expert rules
o Integration with desktop tools (1980s)
Data Warehousing
o Data integrated (after cleaning and scrubbing) from multiple sources (both internal and external to the organization)
o OLAP is the technology used to study the data in terms of operations on a multi-dimensional data set
o Data warehousing also supports processing of data by analytic methods and permits data mining (1990s)
7SPK – IF UTAMA
Applications
Retail - inventory management, promotions
Manufacturing - order shipment
Insurance - policy and claims tracking
Telecommunications - call analysis
Financial - account tracking
CRM/eCRM - customer profiling, clickstream analysis
Healthcare - disease management, patient and physician profiling
8SPK – IF UTAMA
Databases for Decision Support
Transaction Processing systems are optimized for performance
Data they capture are too detailed to be of use for decision support purposes
Online Analytical Processing (OLAP) imposes very different demands on databases than does Online Transaction Processing (OLTP)
9SPK – IF UTAMA
Heterogeneous Database Integration
• Collects and combines information from disparate sources
• Provides integrated view, and a uniform user interface
• Supports sharing of data between entities
[Figure: an Integration System connecting the World Wide Web, Digital Libraries, Scientific Databases, and Personal Databases]
10SPK – IF UTAMA
Why look at data in this way?
What would be the demand for services (forecasting)?
Who are our key customers/patients, and
o What are the margins/outcomes? (profitable customers/satisfied patients)
o How do we market to them/treat them?
o What pricing/treatment strategy is desirable?
o What are their preferences?
o What type of customer/patient services are required?
o What services when packaged result in higher/better sales/revenues/margins/outcomes, efficient workflow?
Which promotion/patient education/counseling works or does not work and why?
What is the inventory/patient turnover?
Which channel/technology is more effective/profitable?
Why do margins/outcomes differ from one place to another or one patient to another?
11SPK – IF UTAMA
Data Warehousing and Industry
One of the hottest topics in IS.
Over 90% of larger companies either have a DW or are starting one.
Warehousing is big business
o $2 billion in 1995
o $3.5 billion in early 1997
o $8 billion in 1998 [Metagroup]
o over $200 billion over next 5 years.
12SPK – IF UTAMA
Data Warehousing and Industry (2)
A 1996 study of 62 data warehousing projects showed:
o An average return on investment of 321%, with an average payback period of 2.73 years.
WalMart has the largest warehouse
o 900-CPU, 2,700 disk, 23 TB Teradata system
o ~7TB in warehouse
o 40-50GB per day
13SPK – IF UTAMA
Why Data Warehousing?
Advance of information technology.
Data collected in huge amounts.
Need to make good use of data.
Architecture and tools to
o Bring together scattered information from multiple sources to provide a consistent data source for decision support.
o Support information processing by providing a solid platform of consolidated, historical data for analysis.
14SPK – IF UTAMA
A producer wants to know...
o Which are our lowest/highest margin customers?
o Who are my customers and what products are they buying?
o Which customers are most likely to go to the competition?
o What impact will new products/services have on revenue and margins?
o What product promotions have the biggest impact on revenue?
o What is the most effective distribution channel?
15SPK – IF UTAMA
Data, Data everywhere yet ...
I can’t find the data I need
o data is scattered over the network
o many versions, subtle differences
I can’t get the data I need
o need an expert to get the data
I can’t understand the data I found
o available data poorly documented
I can’t use the data I found
o results are unexpected
o data needs to be transformed from one form to other
16SPK – IF UTAMA
What are the users saying...
Data should be integrated across the enterprise
Summary data has a real value to the organization
Historical data holds the key to understanding data over time
What-if capabilities are required
17SPK – IF UTAMA
What is a Data Warehouse?
Defined in many different ways, but not rigorously.
o A DB for decision support.
o Maintained separately from an organization’s operational database.
“A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management’s decision-making process.” (W. H. Inmon)
90% of major organizations have or are building some kind of data warehouse.
18SPK – IF UTAMA
Data Warehouse
A data warehouse is a
o subject-oriented
o integrated
o time-varying
o non-volatile
collection of data that is used primarily in organizational decision making.
-- Bill Inmon, Building the Data Warehouse 1996
19SPK – IF UTAMA
Data Warehouse
Subject Oriented: The data is grouped under business headings, such as customers, products, and sales analysis reports. (This subject orientation is achieved through data modeling.)
Integrated: The contents of the data warehouse are defined such that they are valid across the enterprise and its operational and external data sources.
Time Dimensioned: All data in the data warehouse is time stamped at time of entry into the warehouse or when it is summarized.
Non-volatile: Once loaded into the data warehouse, the data is not updated. Thus it acts as a stable resource for consistent reporting and comparative analysis.
20SPK – IF UTAMA
What is Data Warehousing?
A process of transforming data into information and making it available to users in a timely enough manner to make a difference
[Forrester Research, April 1996]
Data
Information
21SPK – IF UTAMA
Data Warehousing -- It is a process
Technique for assembling and managing data from various sources for the purpose of answering business questions, thus making decisions that were not previously possible.
A decision support database maintained separately from the organization’s operational database.
22SPK – IF UTAMA
Explorers, Farmers and Tourists
Explorers: Seek out the unknown and previously unsuspected rewards hiding in the detailed data
Farmers: Harvest information from known access paths
Tourists: Browse information harvested by farmers
23SPK – IF UTAMA
Data Warehouse Architecture
[Figure: several data sources feed data extraction programs and data cleaning/scrubbing, which load a relational database (warehouse); the warehouse serves user queries, OLAP / decision support, data cubes, and data mining]
24SPK – IF UTAMA
Data Warehouse Architecture
[Figure: relational databases, legacy data, purchased data, and ERP systems pass through extraction and cleansing into an optimized loader and the data warehouse engine, backed by a metadata repository; users analyze and query the warehouse]
SPK – IF UTAMA
Data Warehouse Component
26SPK – IF UTAMA
Data Warehouse Architecture
[Figure: source systems (Clinical System, Payroll System, Billing System, Other Internal Data, External Data; platforms shown include AS400, Oracle Financials on HP 9000, Access, files, and industry reports) → Transformation/Integration → Data Warehouse with Meta-Data → OLAP servers → Excel, Web, other clients, and Data Mining Tools]
27SPK – IF UTAMA
[Figure: end-to-end data warehousing environment in five stages]
Data Sources: Transaction Data (Prod, Mkt, HR, Fin, Acctg; IBM IMS, VSAM, Oracle, Sybase); Other Internal Data (ERP, SAP); Web Data (Clickstream, Informix); External Data (Demographic, Harte-Hanks)
ETL Software: Extract, Clean/Scrub, Transform, Load, with a Staging Area and an Operational Data Store (Ascential, Sagent, SAS, First logic, Informatica)
Data Stores: Data Warehouse (Teradata, IBM), Meta Data, Data Marts (Finance, Marketing, Sales; Essbase, Microsoft)
Data Analysis Tools and Applications: SQL, Cognos, SAS, Micro Strategy, Siebel, Business Objects, Web Browser; Queries, Reporting, DSS/EIS, Data Mining
Users: Analysts, Managers, Executives, Operational Personnel, Customers/Suppliers
28SPK – IF UTAMA
Extraction, Transformation, & Load (ETL)
ETL is a set of tools and techniques used to populate a data warehouse
Extraction
o Extract data from sources (e.g., operational DBMSs, file systems, Web pages)
Transformation
o Clean data
o Convert from legacy/host format to warehouse format (e.g., convert “surname” to “last name”)
29SPK – IF UTAMA
Extraction, Transformation, & Load (ETL)
Load
o Sort, summarize, consolidate, compute views, check integrity, build indexes, partition
o Huge volumes of data to be loaded, yet small time window (usually at night) when the warehouse can be taken off-line
o Techniques: batch, sequential load often too slow; incremental, parallel loading techniques may be used
Refresh
o Propagate updates from sources to the warehouse
o When to refresh: on every update, periodically (e.g., every 24 hours), or after “significant” events
o How to refresh: full extract from base tables vs. incremental techniques
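The extract, transform, load, and refresh steps can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the source rows, the `updated` timestamp column, and the in-memory dictionary standing in for the warehouse are all hypothetical, and the "incremental technique" shown is a simple high-water-mark comparison, one of several possible approaches.

```python
from datetime import datetime

# Hypothetical source rows, as an operational system might export them.
SOURCE_ROWS = [
    {"id": 1, "surname": " doe ", "updated": "2024-01-01T09:00:00"},
    {"id": 2, "surname": "SMITH", "updated": "2024-01-02T10:30:00"},
    {"id": 3, "surname": "lee", "updated": "2024-01-03T08:15:00"},
]

def extract(rows, since=None):
    """Full extract when `since` is None, otherwise only rows changed after `since`."""
    if since is None:
        return list(rows)
    return [r for r in rows if datetime.fromisoformat(r["updated"]) > since]

def transform(row):
    """Clean the value and rename the legacy field to the warehouse field name."""
    return {
        "id": row["id"],
        "last_name": row["surname"].strip().title(),
        "updated": datetime.fromisoformat(row["updated"]),
    }

def load(warehouse, rows):
    """Store rows by key and return the new high-water mark (latest timestamp)."""
    for row in rows:
        warehouse[row["id"]] = row
    return max((r["updated"] for r in warehouse.values()), default=None)

warehouse = {}
# Initial load: full extract from the source.
watermark = load(warehouse, [transform(r) for r in extract(SOURCE_ROWS)])

# A new source row arrives later; the refresh extracts only what changed.
SOURCE_ROWS.append({"id": 4, "surname": "garcia", "updated": "2024-01-04T12:00:00"})
changed = extract(SOURCE_ROWS, since=watermark)
watermark = load(warehouse, [transform(r) for r in changed])
```

The second pass touches one row instead of four, which is the point of incremental refresh when the nightly load window is short.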
30SPK – IF UTAMA
Data Mart
• A data mart stores data for a limited number of subject areas, such as marketing and sales data. It is used to support specific applications.
• An independent data mart is created directly from source systems.
• A dependent data mart is populated from a data warehouse.
31SPK – IF UTAMA
Data Warehouse vs. Data Marts
Enterprise warehouse: collects all information about subjects (customers, products, sales, assets, personnel) that span the entire organization.
o Requires extensive business modeling
o May take years to design and build
Data Marts: departmental subsets that focus on selected subjects, e.g., a marketing data mart: customer, products, sales.
o Faster roll out, but complex integration in the long run.
32SPK – IF UTAMA
Data Warehouse for Decision Support & OLAP
Putting information technology to work to help the knowledge worker make faster and better decisions
o Which of my customers are most likely to go to the competition?
o What product promotions have the biggest impact on revenue?
o How did the share price of software companies correlate with profits over the last 10 years?
33SPK – IF UTAMA
Decision Support
Used to manage and control business
Data is historical or point-in-time
Optimized for inquiry rather than update
Use of the system is loosely defined and can be ad-hoc
Used by managers and end-users to understand the business and make judgements
34SPK – IF UTAMA
The Complete Decision Support System (Source: Franconi)
[Figure: three-tier architecture. Information sources (operational DBs, semistructured sources) are extracted, transformed, loaded, and refreshed into the Data Warehouse Server (Tier 1: data warehouse and data marts), which serves the OLAP Servers (Tier 2: e.g., MOLAP, ROLAP), which serve the Clients (Tier 3: analysis, query/reporting, data mining)]
35SPK – IF UTAMA
Data Mining works with Warehouse Data
Data Warehousing provides the Enterprise with a memory
Data Mining provides the Enterprise with intelligence
36SPK – IF UTAMA
We want to know ...
Given a database of 100,000 names, which persons are the least likely to default on their credit cards?
Which types of transactions are likely to be fraudulent given the demographics and transactional history of a particular customer?
If I raise the price of my product by Rs. 2, what is the effect on my ROI?
If I offer only 2,500 airline miles as an incentive to purchase rather than 5,000, how many lost responses will result?
If I emphasize ease-of-use of the product as opposed to its technical capabilities, what will be the net effect on my revenues?
Which of my customers are likely to be the most loyal?
Data Mining helps extract such information
37SPK – IF UTAMA
Why Separate Data Warehouse?
Performance
o Operational databases are designed and tuned for known transactions and workloads.
o Complex OLAP queries would degrade performance for operational transactions.
o Special data organization, access and implementation methods are needed for multidimensional views and queries.
Function
o Missing data: decision support requires historical data, which operational databases do not typically maintain.
38SPK – IF UTAMA
Data Warehouse vs. Heterogeneous DBMS
Traditional heterogeneous DB integration:
o Build wrappers/mediators on top of heterogeneous databases
o Query-driven approach
• When a query is posed to a client site, a meta-dictionary is used to translate the query into queries appropriate for the individual heterogeneous sites involved, and the results are integrated into a global answer set
• Complex information filtering; queries compete for resources
Data warehouse: update-driven, high performance
o Information from heterogeneous sources is integrated in advance and stored in warehouses for direct query and analysis
39SPK – IF UTAMA
Data Warehouse vs. Operational DBMS
OLTP (on-line transaction processing)
o Major task of traditional relational DBMS
o Day-to-day operations: purchasing, inventory, banking, manufacturing, payroll, registration, accounting, etc.
OLAP (on-line analytical processing)
o Major task of data warehouse system
o Data analysis and decision making
Distinct features (OLTP vs. OLAP):
o User and system orientation: customer vs. market
o Data contents: current, detailed vs. historical, consolidated
o Database design: ER + application vs. star + subject
o View: current, local vs. evolutionary, integrated
o Access patterns: update vs. read-only but complex queries
40SPK – IF UTAMA
RDBMS used for OLTP
Database Systems have been used traditionally for OLTP
o clerical data processing tasks
o detailed, up to date data
o structured repetitive tasks
o read/update a few records
o isolation, recovery and integrity are critical
41SPK – IF UTAMA
Operational Systems
Run the business in real time
Based on up-to-the-second data
Optimized to handle large numbers of simple read/write transactions
Optimized for fast response to predefined transactions
Used by people who deal with customers and products: clerks, salespeople, etc.
They are increasingly used by customers
42SPK – IF UTAMA
Examples of Operational Data
Data | Industry | Usage | Technology | Volumes
Customer File | All | Track customer details | Legacy application, flat files, mainframes | Small-medium
Account Balance | Finance | Control account activities | Legacy applications, hierarchical databases, mainframe | Large
Point-of-Sale data | Retail | Generate bills, manage stock | ERP, client/server, relational databases | Very large
Call Record | Telecommunications | Billing | Legacy application, hierarchical database, mainframe | Very large
Production Record | Manufacturing | Control production | ERP, relational databases, AS/400 | Medium
43SPK – IF UTAMA
Application-Orientation vs. Subject-Orientation
[Figure: application orientation vs. subject orientation]
Application-Orientation (Operational Database): Loans, Credit Card, Trust, Savings
Subject-Orientation (Data Warehouse): Customer, Vendor, Product, Activity
44SPK – IF UTAMA
OLTP vs. OLAP
 | OLTP | OLAP
users | clerk, IT professional | knowledge worker
function | day to day operations | decision support
DB design | application-oriented | subject-oriented
data | current, up-to-date, detailed, flat relational, isolated | historical, summarized, multidimensional, integrated, consolidated
usage | repetitive | ad-hoc
access | read/write, index/hash on prim. key | lots of scans
unit of work | short, simple transaction | complex query
# records accessed | tens | millions
# users | thousands | hundreds
DB size | 100MB-GB | 100GB-TB
metric | transaction throughput | query throughput, response
45SPK – IF UTAMA
OLTP vs. Data Warehouse
OLTP systems are tuned for known transactions and workloads, while the workload is not known a priori in a data warehouse
Special data organization, access methods and implementation methods are needed to support data warehouse queries (typically multidimensional queries)
o e.g., average amount spent on phone calls between 9AM-5PM in Pune during the month of December
46SPK – IF UTAMA
OLTP vs Data Warehouse
OLTP
o Application Oriented
o Used to run business
o Detailed data
o Current up to date
o Isolated Data
o Repetitive access
o Clerical User
Warehouse (DSS)
o Subject Oriented
o Used to analyze business
o Summarized and refined
o Snapshot data
o Integrated Data
o Ad-hoc access
o Knowledge User (Manager)
47SPK – IF UTAMA
OLTP vs Data Warehouse
OLTP
o Performance Sensitive
o Few Records accessed at a time (tens)
o Read/Update Access
o No data redundancy
o Database Size 100 MB - 100 GB
Data Warehouse
o Performance relaxed
o Large volumes accessed at a time (millions)
o Mostly Read (Batch Update)
o Redundancy present
o Database Size 100 GB - few terabytes
48SPK – IF UTAMA
OLTP vs Data Warehouse
OLTP
o Transaction throughput is the performance metric
o Thousands of users
o Managed in entirety
Data Warehouse
o Query throughput is the performance metric
o Hundreds of users
o Managed by subsets
49SPK – IF UTAMA
To summarize ...
OLTP Systems are used to “run” a business
The Data Warehouse helps to “optimize” the business
50SPK – IF UTAMA
From Tables and Spreadsheets to Data Cubes
A data warehouse is based on a multidimensional data model which views data in the form of a data cube
A data cube, such as sales, allows data to be modeled and viewed in multiple dimensions
o Dimension tables, such as item (item_name, brand, type), or time (day, week, month, quarter, year)
o Fact table contains measures (such as dollars_sold) and keys to each of the related dimension tables
In data warehousing literature, an n-D base cube is called a base cuboid. The topmost 0-D cuboid, which holds the highest level of summarization, is called the apex cuboid. The lattice of cuboids forms a data cube.
51SPK – IF UTAMA
A Sample Data Cube
[Figure: 3-D data cube with dimensions Date (1Qtr, 2Qtr, 3Qtr, 4Qtr, sum), Product (TV, VCR, PC, sum), and Country (U.S.A, Canada, Mexico, sum); the highlighted cell is the total annual sales of TVs in U.S.A.]
52SPK – IF UTAMA
Cube: A Lattice of Cuboids
0-D (apex) cuboid: all
1-D cuboids: time; item; location; supplier
2-D cuboids: time,item; time,location; time,supplier; item,location; item,supplier; location,supplier
3-D cuboids: time,item,location; time,item,supplier; time,location,supplier; item,location,supplier
4-D (base) cuboid: time, item, location, supplier
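The lattice above can be enumerated mechanically: each cuboid is one subset of the dimension list. The sketch below is illustrative only and assumes no concept hierarchies inside a dimension, so n dimensions yield 2^n cuboids; the function name `cuboids` is made up for this example.

```python
from itertools import combinations

def cuboids(dimensions):
    """Return every cuboid (subset of dimensions), grouped by its number of dimensions."""
    lattice = {}
    for k in range(len(dimensions) + 1):
        lattice[k] = list(combinations(dimensions, k))
    return lattice

lattice = cuboids(["time", "item", "location", "supplier"])
for k, level in lattice.items():
    print(f"{k}-D cuboids: {level}")

# Total number of cuboids in the lattice (2 ** 4 for four dimensions).
total = sum(len(level) for level in lattice.values())
```

The empty tuple at level 0 is the apex cuboid ("all"), and the single 4-tuple at level 4 is the base cuboid.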
53SPK – IF UTAMA
Data (Hyper) cubes
2-d to 3-d cube
Rotating the cube
54SPK – IF UTAMA
Conceptual Modeling of Data Warehouses
ER design techniques not appropriate
Modeling data warehouses: dimensions & measures
o Star schema: A fact table in the middle connected to a set of dimension tables
o Snowflake schema: A refinement of star schema where some dimensional hierarchy is normalized into a set of smaller dimension tables, forming a shape similar to a snowflake
o Fact constellation schema: Multiple fact tables share dimension tables; viewed as a collection of stars, it is therefore called a galaxy schema or fact constellation
Slide is modified from slides provided by Han, J. & Kamber, M., Data Mining: Concepts and Techniques, Morgan Kaufmann, 2000.
SPK – IF UTAMA
Problem with ER
ER models are NOT suitable for DW?
End user cannot understand or remember an ER Model
Many DWs have failed because of overly complex ER designs
Not optimized for complex, ad-hoc queries
Data retrieval becomes difficult due to normalization
Browsing becomes difficult
57SPK – IF UTAMA
Warehouse Models & Operators
Data Models
o relations
o stars & snowflakes
o cubes
Operators
o slice & dice
o roll-up, drill down
o pivoting
o other
58SPK – IF UTAMA
Multidimensional Data Model
Database is a set of facts (points) in a multidimensional space
A fact has a measure dimension
o quantity that is analyzed, e.g., sale, budget
A set of dimensions on which data is analyzed
o e.g., store, product, date associated with a sale amount
Dimensions form a sparsely populated coordinate system
Each dimension has a set of attributes
o e.g., owner, city and county of store
Attributes of a dimension may be related by partial order
o Hierarchy: e.g., street > city > county
o Lattice: e.g., date > month > year, date > week > year
59SPK – IF UTAMA
Example: Patient profiling
A healthcare organization needed a longitudinal view of patients, including trends of services to patients
Model
o Facts include Healthcare (e.g., diagnosis, procedure), Financial (e.g., amount billed, number of claims), Resources (e.g., number of bed-days, inpatient and outpatient visits)
o Dimensions include Time, Provider, Claim Type, Demographics, Encounter Type, Diagnosis and Procedure, Person, Organization
Questions answered by system:
o Which individuals are eligible for services but not obtaining them?
o Which individuals are registered for services, but not receiving preventive healthcare?
60SPK – IF UTAMA
Star Schema
A single fact table and a single table for each dimension
Every fact points to one tuple in each of the dimensions and has additional attributes
Does not capture hierarchies directly
Straightforward means of capturing a multiple dimension data model using relations
Slowly Changing Dimensions
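A star schema maps directly onto relational tables: one fact table whose foreign keys point at one row in each dimension table. The sketch below uses Python's built-in `sqlite3` module with an in-memory database; the table names (`sales_fact`, `time_dim`, `item_dim`), the reduced column set, and the sample rows are assumptions made for illustration, not data from the slides.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE time_dim (time_key INTEGER PRIMARY KEY, quarter TEXT, year INTEGER);
CREATE TABLE item_dim (item_key INTEGER PRIMARY KEY, item_name TEXT, type TEXT);
-- Fact table: one foreign key per dimension plus the measures.
CREATE TABLE sales_fact (
    time_key INTEGER REFERENCES time_dim(time_key),
    item_key INTEGER REFERENCES item_dim(item_key),
    units_sold INTEGER,
    dollars_sold REAL
);
INSERT INTO time_dim VALUES (1, 'Q1', 2000), (2, 'Q2', 2000);
INSERT INTO item_dim VALUES (1, 'TV', 'electronics'), (2, 'VCR', 'electronics');
INSERT INTO sales_fact VALUES (1, 1, 10, 5000.0), (1, 2, 4, 800.0), (2, 1, 7, 3500.0);
""")

# A typical star join: follow each key to its dimension, then aggregate the measure.
rows = conn.execute("""
    SELECT t.quarter, i.item_name, SUM(f.dollars_sold)
    FROM sales_fact AS f
    JOIN time_dim AS t ON t.time_key = f.time_key
    JOIN item_dim AS i ON i.item_key = f.item_key
    GROUP BY t.quarter, i.item_name
    ORDER BY t.quarter, i.item_name
""").fetchall()
```

Every analytical query has the same shape (fact table joined to the dimensions it needs), which is part of why the star layout is considered easy to browse.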
61SPK – IF UTAMA
Fig. 2.4 Example of Star Schema
Sales Fact Table: time_key, item_key, branch_key, location_key; measures: units_sold, dollars_sold, avg_sales
time: time_key, day, day_of_the_week, month, quarter, year
item: item_key, item_name, brand, type, supplier_type
branch: branch_key, branch_name, branch_type
location: location_key, street, city, province_or_state, country
Slide is modified from slides provided by Han, J. & Kamber, M., Data Mining: Concepts and Techniques, Morgan Kaufmann, 2000.
62SPK – IF UTAMA
Example of a Star Schema
Fact Table: OrderNO, SalespersonID, CustomerNO, ProdNo, DateKey, CityName, Quantity, Total Price
Order: Order No, Order Date
Customer: Customer No, Customer Name, Customer Address, City
Salesperson: SalespersonID, SalespersonName, City, Quota
Product: ProductNO, ProdName, ProdDescr, Category, CategoryDescription, UnitPrice
Date: DateKey, Date
City: CityName, State, Country
63SPK – IF UTAMA
[Figure: star schema tables; sample rows abbreviated]
Fact Table: Cardholder Key, Purchase Key, Location Key, Time Key, Amount (sample amounts: 14.50, 8.25, 22.40)
Cardholder Dimension: Cardholder Key, Name, Gender, Income Range (e.g., 1, John Doe, Male, 50 - 70,000; 2, Sara Smith, Female, 70 - 90,000)
Purchase Dimension: Purchase Key, Category (1 Supermarket; 2 Travel & Entertainment; 3 Auto & Vehicle; 4 Retail; 5 Restaurant; 6 Miscellaneous)
Location Dimension: Location Key, Street, City, State, Region (e.g., 10, 425 Church St, Charleston, SC, 3)
Time Dimension: Time Key, Day, Month, Quarter, Year
A star schema for credit card purchases
64SPK – IF UTAMA
Snowflake Schema
Represent dimensional hierarchy directly by normalizing the dimension tables
Easy to maintain
Saves storage, but may reduce effectiveness of browsing (Kimball)
65SPK – IF UTAMA
Fig. 2.5 Example of Snowflake Schema
Sales Fact Table: time_key, item_key, branch_key, location_key; measures: units_sold, dollars_sold, avg_sales
time: time_key, day, day_of_the_week, month, quarter, year
item: item_key, item_name, brand, type, supplier_key
supplier: supplier_key, supplier_type
branch: branch_key, branch_name, branch_type
location: location_key, street, city_key
city: city_key, city, province_or_state, country
Slide is modified from slides provided by Han, J. & Kamber, M., Data Mining: Concepts and Techniques, Morgan Kaufmann, 2000.
66SPK – IF UTAMA
Example of a Snowflake Schema
Fact Table: OrderNO, SalespersonID, CustomerNO, ProdNo, DateKey, CityName, Quantity, Total Price
Order: Order No, Order Date
Customer: Customer No, Customer Name, Customer Address, City
Salesperson: SalespersonID, SalespersonName, City, Quota
Product: ProductNO, ProdName, ProdDescr, Category, UnitPrice
Category: CategoryName, CategoryDescr
Date: DateKey, Date, Month
Month: Month, Year
Year: Year
City: CityName, State, Country
State: StateName, Country
67SPK – IF UTAMA
Fact Constellation Schema
Multiple fact tables share dimension tables.
This schema is viewed as a collection of stars, hence called galaxy schema or fact constellation.
Sophisticated applications require such a schema.
68SPK – IF UTAMA
Fig 2.6 Example of Fact Constellation
Sales Fact Table: time_key, item_key, branch_key, location_key; measures: units_sold, dollars_sold, avg_sales
Shipping Fact Table: time_key, item_key, shipper_key, from_location, to_location; measures: dollars_cost, units_shipped
time: time_key, day, day_of_the_week, month, quarter, year
item: item_key, item_name, brand, type, supplier_type
branch: branch_key, branch_name, branch_type
location: location_key, street, city, province_or_state, country
shipper: shipper_key, shipper_name, location_key, shipper_type
Slide is modified from slides provided by Han, J. & Kamber, M., Data Mining: Concepts and Techniques, Morgan Kaufmann, 2000.
69SPK – IF UTAMA
Example Fact Constellation Schema
Sales Fact Table: Store Key, Product Key, Period Key, Units, Price
Shipping Fact Table: Shipper Key, Store Key, Product Key, Period Key, Units, Price
Store Dimension: Store Key, Store Name, City, State, Region
Product Dimension: Product Key, Product Desc
70SPK – IF UTAMA
[Figure: constellation schema tables; sample rows abbreviated]
Purchase Fact Table: Cardholder Key, Purchase Key, Location Key, Time Key, Amount (sample amounts: 14.50, 8.25, 22.40)
Promotion Fact Table: Cardholder Key, Promotion Key, Time Key, Response (e.g., 1, 1, 5, Yes; 2, 1, 5, No)
Cardholder Dimension: Cardholder Key, Name, Gender, Income Range (e.g., 1, John Doe, Male, 50 - 70,000; 2, Sara Smith, Female, 70 - 90,000)
Purchase Dimension: Purchase Key, Category (1 Supermarket; 2 Travel & Entertainment; 3 Auto & Vehicle; 4 Retail; 5 Restaurant; 6 Miscellaneous)
Promotion Dimension: Promotion Key, Description, Cost (e.g., 1, watch promo, 15.25)
Location Dimension: Location Key, Street, City, State, Region (e.g., 5, 425 Church St, Charleston, SC, 3)
Time Dimension: Time Key, Day, Month, Quarter, Year
A constellation schema for credit card purchases and promotions
71SPK – IF UTAMA
What is OLAP?
Software tool providing multi-dimensional view of data for business analysis
Example of “Decision Support” or “Business Intelligence” tool
Fast data access and fast computations
Interactive, flexible user interface
“Slice, dice, drill-down”
Excel Pivot Table and Pivot Chart are examples of simple OLAP tools
72SPK – IF UTAMA
Defining OLAP - ANALYSIS
Business logic and statistical analysis relevant to end user
Should not require programming for everything
Analysis can be via vendors’ tools or link to generic analytical platform such as spreadsheet
Examples include time series analysis, cost allocation, currency translation, goal seeking, ad-hoc multi-dimensional structural changes (cube building), non-procedural modeling, exception alerting, and data mining.
Capabilities vary widely by vendor and market
73SPK – IF UTAMA
OLAP Operations
74SPK – IF UTAMA
Common cube operations
Pivot or Rotate – change which dimensions and/or levels within dimensions are shown on row and column axes
Roll-up – aggregate or combine cells within a dimension according to some mathematical operation
o Uses a hierarchy definition for the dimension
o Commonly this is summation or count
Drill down – examine data at a greater level of detail
o Add another row or column header which is further down the concept hierarchy
Slice – select a subset of a cube by constraining the value of some dimension
o Ex: Select cells for month = January in time dimension
Dice – select a subset of a cube by constraining two or more dimensions
Drill through – access atomic level detail data
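The operations listed above can be sketched on a tiny in-memory cube. Everything here is hypothetical and for illustration only: the cell values, the three dimensions, and the helper names `dice`, `slice_`, and `roll_up`. Roll-up is shown in its simplest form (summing a dimension away entirely) rather than climbing a concept hierarchy, and pivoting is shown as a reordering of the kept dimensions.

```python
from collections import defaultdict

# Hypothetical cells of a small sales cube: (month, city, product) -> units sold.
CUBE = {
    ("January", "Pune", "TV"): 5,
    ("January", "Pune", "PC"): 3,
    ("January", "Delhi", "TV"): 2,
    ("February", "Pune", "TV"): 4,
    ("February", "Delhi", "PC"): 6,
}
DIMS = ("month", "city", "product")

def dice(cube, **allowed):
    """Keep cells whose value on each named dimension is in the allowed set."""
    def keep(key):
        return all(key[DIMS.index(d)] in vals for d, vals in allowed.items())
    return {k: v for k, v in cube.items() if keep(k)}

def slice_(cube, dim, value):
    """A slice constrains a single dimension to one value."""
    return dice(cube, **{dim: {value}})

def roll_up(cube, keep_dims):
    """Sum away every dimension not listed in keep_dims."""
    idx = [DIMS.index(d) for d in keep_dims]
    out = defaultdict(int)
    for key, value in cube.items():
        out[tuple(key[i] for i in idx)] += value
    return dict(out)

january = slice_(CUBE, "month", "January")            # slice: month = January
pune_tv = dice(CUBE, city={"Pune"}, product={"TV"})   # dice: two dimensions constrained
by_month = roll_up(CUBE, ["month"])                   # roll-up to month totals
by_city_month = roll_up(CUBE, ["city", "month"])      # drill down and pivot: city first
```

Going from `by_month` to `by_city_month` is a drill down (more detail), and the reverse direction is a roll-up.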
75SPK – IF UTAMA
OLAP Operations & SQL Sample
76SPK – IF UTAMA
77SPK – IF UTAMA
78SPK – IF UTAMA
Cube Operation (SQL)
79SPK – IF UTAMA
A Few Products
Microsoft Analysis Services
o Part of SQL Server 2005
o Create OLAP cubes, 10 data mining algorithms
Tableau
o A new, pretty amazing pivoting tool
Cognos
o Recently bought by IBM
Hyperion Essbase
o Full suite of business intelligence developer and end user tools
o Purchased by Oracle
Business Objects (Crystal)
o Full suite of business intelligence developer and end user tools
Microstrategy
Oracle
Information Builders
o Home of WebFocus, a web based OLAP tool
Pentaho
o A new open source business intelligence project
o http://www.pentaho.org/
80SPK – IF UTAMA
A Data Mining Query Language, DMQL: Language Primitives
Cube Definition (Fact Table)
define cube <cube_name> [<dimension_list>]: <measure_list>
Dimension Definition (Dimension Table)
define dimension <dimension_name> as (<attribute_or_subdimension_list>)
Special Case (Shared Dimension Tables)
o First time as “cube definition”
o define dimension <dimension_name> as <dimension_name_first_time> in cube <cube_name_first_time>
81SPK – IF UTAMA
Defining a Star Schema in DMQL
define cube sales_star [time, item, branch, location]:
    dollars_sold = sum(sales_in_dollars), avg_sales = avg(sales_in_dollars), units_sold = count(*)
define dimension time as (time_key, day, day_of_week, month, quarter, year)
define dimension item as (item_key, item_name, brand, type, supplier_type)
define dimension branch as (branch_key, branch_name, branch_type)
define dimension location as (location_key, street, city, province_or_state, country)
82SPK – IF UTAMA
Defining a Snowflake Schema in DMQL
define cube sales_snowflake [time, item, branch, location]:
    dollars_sold = sum(sales_in_dollars), avg_sales = avg(sales_in_dollars), units_sold = count(*)
define dimension time as (time_key, day, day_of_week, month, quarter, year)
define dimension item as (item_key, item_name, brand, type, supplier(supplier_key, supplier_type))
define dimension branch as (branch_key, branch_name, branch_type)
define dimension location as (location_key, street, city(city_key, province_or_state, country))
83SPK – IF UTAMA
Defining a Fact Constellation in DMQL
define cube sales [time, item, branch, location]:
    dollars_sold = sum(sales_in_dollars), avg_sales = avg(sales_in_dollars), units_sold = count(*)
define dimension time as (time_key, day, day_of_week, month, quarter, year)
define dimension item as (item_key, item_name, brand, type, supplier_type)
define dimension branch as (branch_key, branch_name, branch_type)
define dimension location as (location_key, street, city, province_or_state, country)
define cube shipping [time, item, shipper, from_location, to_location]:
    dollar_cost = sum(cost_in_dollars), unit_shipped = count(*)
define dimension time as time in cube sales
define dimension item as item in cube sales
define dimension shipper as (shipper_key, shipper_name, location as location in cube sales, shipper_type)
define dimension from_location as location in cube sales
define dimension to_location as location in cube sales
84SPK – IF UTAMA
First Example
Enrollment Data
Source: Dr. Mourad YKHLEF, Decision Support System, King Saud University, 2009
Enroll Table
Class  Day  Time  Prof      Enrolled
1336   T    8     Aars            14
1430   M    2     Aars            28
1430   M    11    Booth           30
1430   T    11    Booth           26
1430   T    2     Booth           23
1430   T    2     Fry             27
1430   T    12    Aars            29
1440   M    1     Aars            11
2334   M    9     Fry             27
2350   M    10    Maurer          19
3101   T    12    Grabow          16
3303   M    11    Aars            11
3324   M    8     Gaitros         20
3330   T    11    Fry              5
3331   T    12    Aars            11
3334   M    11    Hamerly         20
3335   M    2     Donahoo         17
3336   T    8     Sturgill         9
3342   T    2     Aars            10
3439   T    9     Poucher         10
Group by rollup
Select Prof, Sum(Enrolled)
From enroll
Group by rollup (Prof)
Prof      Enrolled
Aars           114
Booth           79
Donahoo         17
Fry             59
Gaitros         20
Grabow          16
Hamerly         20
Maurer          19
Poucher         10
Sturgill         9
[NULL]         363
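To make the rollup result easier to check, here is a minimal Python sketch that emulates a rollup over Prof on the Enroll table. It is not how a database executes the query; the function name rollup_by_prof is hypothetical, and None stands in for the [NULL] of the grand-total row.

```python
from collections import defaultdict

# Rows of the Enroll table: (class, day, time, prof, enrolled).
ENROLL = [
    (1336, "T", 8, "Aars", 14), (1430, "M", 2, "Aars", 28),
    (1430, "M", 11, "Booth", 30), (1430, "T", 11, "Booth", 26),
    (1430, "T", 2, "Booth", 23), (1430, "T", 2, "Fry", 27),
    (1430, "T", 12, "Aars", 29), (1440, "M", 1, "Aars", 11),
    (2334, "M", 9, "Fry", 27), (2350, "M", 10, "Maurer", 19),
    (3101, "T", 12, "Grabow", 16), (3303, "M", 11, "Aars", 11),
    (3324, "M", 8, "Gaitros", 20), (3330, "T", 11, "Fry", 5),
    (3331, "T", 12, "Aars", 11), (3334, "M", 11, "Hamerly", 20),
    (3335, "M", 2, "Donahoo", 17), (3336, "T", 8, "Sturgill", 9),
    (3342, "T", 2, "Aars", 10), (3439, "T", 9, "Poucher", 10),
]


def rollup_by_prof(rows):
    """Emulate GROUP BY ROLLUP(prof): one row per professor, then a grand total."""
    totals = defaultdict(int)
    for _cls, _day, _time, prof, enrolled in rows:
        totals[prof] += enrolled
    result = sorted(totals.items())
    # The grand-total row carries None where SQL would show NULL.
    result.append((None, sum(totals.values())))
    return result


result = rollup_by_prof(ENROLL)
for prof, enrolled in result:
    print(prof, enrolled)
```

With a single grouping column, ROLLUP adds exactly one extra row: the total over all professors.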
Group by cube
Select Day, Time, Sum(Enrolled)
From enroll
Group By cube (Day, Time)
Note: rollup(Day, Time) would return all of these rows except those where Day is [NULL] and Time is not.
Day     Time    Enrolled
[NULL]  [NULL]       363
[NULL]  1             11
[NULL]  2            105
[NULL]  8             43
[NULL]  9             37
[NULL]  10            19
[NULL]  11            92
[NULL]  12            56
M       [NULL]       183
M       1             11
M       2             45
M       8             20
M       9             27
M       10            19
M       11            61
T       [NULL]       180
T       2             60
T       8             23
T       9             10
T       11            31
T       12            56
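The cube over Day and Time can be reproduced with a short Python sketch. It assumes that every subset of the grouping columns forms one grouping set, which is the usual meaning of CUBE; the function name cube is hypothetical, and None plays the role of [NULL].

```python
from collections import defaultdict
from itertools import combinations

# (day, time, enrolled) projected from the Enroll table.
ENROLL = [
    ("T", 8, 14), ("M", 2, 28), ("M", 11, 30), ("T", 11, 26), ("T", 2, 23),
    ("T", 2, 27), ("T", 12, 29), ("M", 1, 11), ("M", 9, 27), ("M", 10, 19),
    ("T", 12, 16), ("M", 11, 11), ("M", 8, 20), ("T", 11, 5), ("T", 12, 11),
    ("M", 11, 20), ("M", 2, 17), ("T", 8, 9), ("T", 2, 10), ("T", 9, 10),
]


def cube(rows, n_dims):
    """Emulate GROUP BY CUBE over the first n_dims columns; the last column is summed."""
    totals = defaultdict(int)
    for row in rows:
        dims, measure = row[:n_dims], row[-1]
        # Every subset of the dimensions is one grouping set.
        for k in range(n_dims + 1):
            for kept in combinations(range(n_dims), k):
                key = tuple(dims[i] if i in kept else None for i in range(n_dims))
                totals[key] += measure
    return dict(totals)


result = cube(ENROLL, 2)
for key, enrolled in result.items():
    print(key, enrolled)
```

Dropping the grouping sets that keep Time but not Day would turn this into the rollup over (Day, Time).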
Group by rollup
Select Day, Time, Prof, Sum(Enrolled)
From enroll
Group by rollup(Day, Time), rollup(Prof)
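When two rollups are listed side by side in GROUP BY, the resulting grouping sets are the cross product of the grouping sets of each rollup. The following sketch enumerates them; the helper names rollup_sets and concatenated are hypothetical.

```python
from itertools import product


def rollup_sets(columns):
    """Grouping sets of ROLLUP(c1, ..., cn): every prefix, longest first."""
    return [tuple(columns[:k]) for k in range(len(columns), -1, -1)]


def concatenated(*groupings):
    """Cross product of grouping-set lists, as in GROUP BY g1, g2."""
    return [sum(combo, ()) for combo in product(*groupings)]


sets = concatenated(rollup_sets(["day", "time"]), rollup_sets(["prof"]))
for grouping_set in sets:
    print(grouping_set)
```

The query therefore aggregates at 3 x 2 = 6 levels, from (day, time, prof) down to the grand total, but never by time alone.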
Second Example
Sales Data
Sales Cube
[Figure: a data cube with dimensions Market, Time, and Product]
Sales Example (Cont.)
Simple Cross-Tabular Report
1997: profit by Department (columns) and Region (rows)
Region    Cassette Sales Profit   Video Sales Profit   Total Profit
Central                  82,000               85,000        167,000
East                    101,000              137,000        238,000
West                     96,000               97,000        193,000
Total                   279,000              319,000        598,000
Sales Example (Cont.)
Roll up – the query
SELECT Time, Region, Department, SUM(Profit)
FROM Sales
GROUP BY ROLLUP(Time, Region, Department)
This query returns the following sets of rows:
• Regular aggregation rows that would be produced by GROUP BY without using ROLLUP.
• First-level subtotals aggregating across Department for each combination of Time and Region.
• Second-level subtotals aggregating across Region and Department for each Time value.
• A grand total row.
Sales Example (Cont.)
Roll up – the result of query
Time    Region   Dept              Profit
1996    Central  CassetteSales     75,000
1996    Central  VideoSales        74,000
1996    Central  [NULL]           149,000
1996    East     CassetteSales     89,000
1996    East     VideoSales       115,000
1996    East     [NULL]           204,000
1996    West     CassetteSales     87,000
1996    West     VideoSales        86,000
1996    West     [NULL]           173,000
1996    [NULL]   [NULL]           526,000
1997    Central  CassetteSales     82,000
1997    Central  VideoSales        85,000
1997    Central  [NULL]           167,000
1997    East     CassetteSales    101,000
1997    East     VideoSales       137,000
1997    East     [NULL]           238,000
1997    West     CassetteSales     96,000
1997    West     VideoSales        97,000
1997    West     [NULL]           193,000
1997    [NULL]   [NULL]           598,000
[NULL]  [NULL]   [NULL]         1,124,000
Sales Example (Cont.)
Roll up
Calculating Subtotals without ROLLUP
The result set could be generated by combining four SELECT statements with UNION ALL, as shown below. This is a subtotal across three dimensions. Notice that a complete set of ROLLUP-style subtotals in n dimensions would require n+1 SELECT statements linked with UNION ALL.
SELECT Time, Region, Department, SUM(Profit)
FROM Sales
GROUP BY Time, Region, Department
UNION ALL
SELECT Time, Region, '', SUM(Profit)
FROM Sales
GROUP BY Time, Region
UNION ALL
SELECT Time, '', '', SUM(Profit)
FROM Sales
GROUP BY Time
UNION ALL
SELECT '', '', '', SUM(Profit)
FROM Sales;
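The n+1 SELECT formulation can be mimicked in Python by running one plain group-by per prefix of the dimension list and concatenating the results. This is only a sketch: the detail rows are taken from the base-level rows of the result table shown on the slides, since the underlying Sales table is not given, and the function names are hypothetical.

```python
from collections import defaultdict

# Base-level rows (time, region, department, profit) from the slides' result table.
SALES = [
    (1996, "Central", "CassetteSales", 75_000), (1996, "Central", "VideoSales", 74_000),
    (1996, "East", "CassetteSales", 89_000), (1996, "East", "VideoSales", 115_000),
    (1996, "West", "CassetteSales", 87_000), (1996, "West", "VideoSales", 86_000),
    (1997, "Central", "CassetteSales", 82_000), (1997, "Central", "VideoSales", 85_000),
    (1997, "East", "CassetteSales", 101_000), (1997, "East", "VideoSales", 137_000),
    (1997, "West", "CassetteSales", 96_000), (1997, "West", "VideoSales", 97_000),
]


def group_by_prefix(rows, k, n_dims=3):
    """One plain GROUP BY on the first k dimensions; the rest become None."""
    totals = defaultdict(int)
    for row in rows:
        key = row[:k] + (None,) * (n_dims - k)
        totals[key] += row[-1]
    return [key + (profit,) for key, profit in totals.items()]


def rollup_via_union(rows, n_dims=3):
    """Concatenate n_dims + 1 group-bys, mirroring the UNION ALL formulation."""
    out = []
    for k in range(n_dims, -1, -1):
        out.extend(group_by_prefix(rows, k, n_dims))
    return out


result = rollup_via_union(SALES)
for row in result:
    print(row)
```

Each pass corresponds to one SELECT in the UNION ALL query, which is why the table is scanned four times here, whereas ROLLUP expresses the same result in a single statement.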
Sales Example (Cont.)
Cube – the query
SELECT Time, Region, Department, SUM(Profit)
FROM Sales
GROUP BY CUBE(Time, Region, Department)
Sales Example (Cont.)
Cube – the result of query
Time    Region   Dept              Profit
1996    Central  CassetteSales     75,000
1996    Central  VideoSales        74,000
1996    Central  [NULL]           149,000
1996    East     CassetteSales     89,000
1996    East     VideoSales       115,000
1996    East     [NULL]           204,000
1996    West     CassetteSales     87,000
1996    West     VideoSales        86,000
1996    West     [NULL]           173,000
1996    [NULL]   CassetteSales    251,000
1996    [NULL]   VideoSales       275,000
1996    [NULL]   [NULL]           526,000
1997    Central  CassetteSales     82,000
1997    Central  VideoSales        85,000
1997    Central  [NULL]           167,000
1997    East     CassetteSales    101,000
1997    East     VideoSales       137,000
1997    East     [NULL]           238,000
1997    West     CassetteSales     96,000
1997    West     VideoSales        97,000
1997    West     [NULL]           193,000
1997    [NULL]   CassetteSales    279,000
1997    [NULL]   VideoSales       319,000
1997    [NULL]   [NULL]           598,000
[NULL]  Central  CassetteSales    157,000
[NULL]  Central  VideoSales       159,000
[NULL]  Central  [NULL]           316,000
[NULL]  East     CassetteSales    190,000
[NULL]  East     VideoSales       252,000
[NULL]  East     [NULL]           442,000
[NULL]  West     CassetteSales    183,000
[NULL]  West     VideoSales       183,000
[NULL]  West     [NULL]           366,000
[NULL]  [NULL]   CassetteSales    530,000
[NULL]  [NULL]   VideoSales       594,000
[NULL]  [NULL]   [NULL]         1,124,000
Sales Example (Cont.)
Cube
Calculating Subtotals without CUBE
Just as for ROLLUP, multiple SELECT statements combined with UNION ALL could provide the same information gathered through CUBE. However, this may require many SELECT statements: for an n-dimensional cube, 2^n SELECT statements are needed. In our 3-dimension example, this would mean issuing 8 SELECT statements linked with UNION ALL.
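To see where the 2^n comes from, the sketch below builds the SELECT statements that a cube over n columns stands for, one per subset of the columns. It follows the '' placeholder convention used in the ROLLUP example; the function name cube_as_union is hypothetical, and the generated text is meant for reading, not as a recommended way to query.

```python
from itertools import combinations


def cube_as_union(table, dims, measure):
    """Build the 2**n SELECT statements that a CUBE over dims stands for."""
    selects = []
    for k in range(len(dims), -1, -1):
        for kept in combinations(dims, k):
            # Columns outside the grouping set are replaced by a placeholder.
            cols = ", ".join(d if d in kept else "''" for d in dims)
            sql = f"SELECT {cols}, SUM({measure}) FROM {table}"
            if kept:
                sql += " GROUP BY " + ", ".join(kept)
            selects.append(sql)
    return selects


statements = cube_as_union("Sales", ["Time", "Region", "Department"], "Profit")
print("\nUNION ALL\n".join(statements))
```

Adding a fourth dimension would double the list to 16 statements, which is the practical argument for the CUBE extension.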
Sales Example (Cont.)
Grouping
Two challenges arise with the use of ROLLUP and CUBE. First, how can we programmatically determine which result set rows are subtotals, and how do we find the exact level of aggregation of a given subtotal? We will often need to use subtotals in calculations such as percent-of-totals, so we need an easy way to determine which rows are the subtotals we seek. Second, what happens if query results contain both stored NULL values and "NULL" values created by a ROLLUP or CUBE? How does an application or developer differentiate between the two?
Sales Example (Cont.)
Grouping
To handle these issues, we have a function called GROUPING. Using a single column as its argument, GROUPING returns 1 when it encounters a NULL value created by a ROLLUP or CUBE operation. That is, if the NULL indicates the row is a subtotal, GROUPING returns 1. Any other type of value, including a stored NULL, returns 0.
Sales Example (Cont.)
Grouping – the query
SELECT Time, Region, Department, SUM(Profit),
    GROUPING (Time) AS T, GROUPING (Region) AS R, GROUPING (Department) AS D
FROM Sales
GROUP BY ROLLUP (Time, Region, Department)
Sales Example (Cont.)
Grouping – the result of query
Time    Region   Dept              Profit  T  R  D
1996    Central  CassetteSales     75,000  0  0  0
1996    Central  VideoSales        74,000  0  0  0
1996    Central  [NULL]           149,000  0  0  1
1996    East     CassetteSales     89,000  0  0  0
1996    East     VideoSales       115,000  0  0  0
1996    East     [NULL]           204,000  0  0  1
1996    West     CassetteSales     87,000  0  0  0
1996    West     VideoSales        86,000  0  0  0
1996    West     [NULL]           173,000  0  0  1
1996    [NULL]   [NULL]           526,000  0  1  1
1997    Central  CassetteSales     82,000  0  0  0
1997    Central  VideoSales        85,000  0  0  0
1997    Central  [NULL]           167,000  0  0  1
1997    East     CassetteSales    101,000  0  0  0
1997    East     VideoSales       137,000  0  0  0
1997    East     [NULL]           238,000  0  0  1
1997    West     CassetteSales     96,000  0  0  0
1997    West     VideoSales        97,000  0  0  0
1997    West     [NULL]           193,000  0  0  1
1997    [NULL]   [NULL]           598,000  0  1  1
[NULL]  [NULL]   [NULL]         1,124,000  1  1  1
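The GROUPING flags make it straightforward to pick out a particular subtotal level, for example to compute percent-of-total figures. The sketch below emulates the rollup together with its T, R, D flags and then derives each year's share of the grand total. The base rows come from the result table on the slides, and the function name rollup_with_grouping is hypothetical.

```python
from collections import defaultdict

# Base-level rows (time, region, department, profit) from the slides' result table.
SALES = [
    (1996, "Central", "CassetteSales", 75_000), (1996, "Central", "VideoSales", 74_000),
    (1996, "East", "CassetteSales", 89_000), (1996, "East", "VideoSales", 115_000),
    (1996, "West", "CassetteSales", 87_000), (1996, "West", "VideoSales", 86_000),
    (1997, "Central", "CassetteSales", 82_000), (1997, "Central", "VideoSales", 85_000),
    (1997, "East", "CassetteSales", 101_000), (1997, "East", "VideoSales", 137_000),
    (1997, "West", "CassetteSales", 96_000), (1997, "West", "VideoSales", 97_000),
]


def rollup_with_grouping(rows, n_dims=3):
    """Emulate ROLLUP plus GROUPING(): each output row is (dims, flags, total)."""
    totals = defaultdict(int)
    for row in rows:
        for k in range(n_dims, -1, -1):
            dims = row[:k] + (None,) * (n_dims - k)
            flags = (0,) * k + (1,) * (n_dims - k)  # 1 marks a rolled-up column
            totals[(dims, flags)] += row[-1]
    return [(dims, flags, total) for (dims, flags), total in totals.items()]


rows = rollup_with_grouping(SALES)
grand_total = next(total for _dims, flags, total in rows if flags == (1, 1, 1))
# Per-Time subtotals are exactly the rows flagged (T, R, D) = (0, 1, 1).
share = {dims[0]: total / grand_total for dims, flags, total in rows if flags == (0, 1, 1)}
for year, fraction in sorted(share.items()):
    print(year, f"{fraction:.1%}")
```

Selecting rows by flag pattern rather than by testing for NULL keeps the logic correct even if the data itself contains NULL values.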
Grouping
Time    Region   Profit
1996    East     200,000
1996    [NULL]   200,000
[NULL]  East     200,000
[NULL]  [NULL]   190,000
[NULL]  [NULL]   190,000
[NULL]  [NULL]   190,000
[NULL]  [NULL]   390,000
This table shows an ambiguous result set created using the CUBE extension.
Grouping (Cont.)
We can resolve the ambiguity by using GROUPING together with other functions, as in the code below.
SELECT
    decode(grouping(Time), 1, 'All Times', Time) AS Time,
    decode(grouping(Region), 1, 'All Regions', Region) AS Region,
    sum(Profit)
FROM Sales
GROUP BY CUBE(Time, Region)
Grouping (Cont.)
The code result
Time       Region        Profit
1996       East          200,000
1996       All Regions   200,000
All Times  East          200,000
[NULL]     [NULL]        190,000
[NULL]     All Regions   190,000
All Times  [NULL]        190,000
All Times  All Regions   390,000
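The effect of wrapping GROUPING in a decode can be emulated in Python. The sketch below assumes the Sales table behind this example holds just two rows, one of which has stored NULLs in both Time and Region; that assumption is inferred from the result above and is not stated on the slides. The function name cube_with_labels is hypothetical.

```python
from collections import defaultdict
from itertools import combinations

# Assumed base rows (time, region, profit); the second row has stored NULLs.
SALES = [(1996, "East", 200_000), (None, None, 190_000)]
LABELS = ("All Times", "All Regions")


def cube_with_labels(rows, labels):
    """CUBE over all dimensions; rolled-up columns show a label instead of NULL."""
    n = len(labels)
    totals = defaultdict(int)
    for row in rows:
        for k in range(n + 1):
            for kept in combinations(range(n), k):
                # A kept column shows its stored value (possibly None);
                # a rolled-up column shows its label, like DECODE on GROUPING.
                key = tuple(row[i] if i in kept else labels[i] for i in range(n))
                totals[key] += row[-1]
    return dict(totals)


result = cube_with_labels(SALES, LABELS)
for key, profit in result.items():
    print(key, profit)
```

After labelling, a remaining None can only be a stored NULL, so the three rows that previously looked identical are now distinguishable.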
Grouping (Cont.)
We can also use the GROUPING function to filter results. The query below retrieves a subset of the subtotals created by a CUBE and none of the base-level aggregations. The HAVING clause constrains the columns that use GROUPING functions.
SELECT Time, Region, Department, SUM(Profit) AS Profit,
    GROUPING (Time) AS T, GROUPING (Region) AS R, GROUPING (Department) AS D
FROM Sales
GROUP BY CUBE (Time, Region, Department)
HAVING (D=1 AND R=1 AND T=1)
    OR (R=1 AND D=1)
    OR (T=1 AND D=1)
Grouping (Cont.)
The query result
Time    Region   Department     Profit
1996    [NULL]   [NULL]        526,000
1997    [NULL]   [NULL]        598,000
[NULL]  Central  [NULL]        316,000
[NULL]  East     [NULL]        442,000
[NULL]  West     [NULL]        366,000
[NULL]  [NULL]   [NULL]      1,124,000
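The filtering by GROUPING flags can be checked with a small Python emulation. It computes the full cube with its (T, R, D) flags and then applies the same condition as the HAVING clause. The base rows are taken from the earlier result tables, and the function names cube_with_grouping and having are hypothetical.

```python
from collections import defaultdict
from itertools import product

# Base-level rows (time, region, department, profit) from the slides' result table.
SALES = [
    (1996, "Central", "CassetteSales", 75_000), (1996, "Central", "VideoSales", 74_000),
    (1996, "East", "CassetteSales", 89_000), (1996, "East", "VideoSales", 115_000),
    (1996, "West", "CassetteSales", 87_000), (1996, "West", "VideoSales", 86_000),
    (1997, "Central", "CassetteSales", 82_000), (1997, "Central", "VideoSales", 85_000),
    (1997, "East", "CassetteSales", 101_000), (1997, "East", "VideoSales", 137_000),
    (1997, "West", "CassetteSales", 96_000), (1997, "West", "VideoSales", 97_000),
]


def cube_with_grouping(rows, n_dims=3):
    """Emulate CUBE plus GROUPING(): each output row is (dims, flags, total)."""
    totals = defaultdict(int)
    for row in rows:
        # Each 0/1 flag pattern is one grouping set; 1 means "rolled up".
        for flags in product((0, 1), repeat=n_dims):
            dims = tuple(None if flag else value for value, flag in zip(row, flags))
            totals[(dims, flags)] += row[-1]
    return [(dims, flags, total) for (dims, flags), total in totals.items()]


def having(row):
    """Same condition as the HAVING clause of the query."""
    _dims, (t, r, d), _total = row
    return (d == 1 and r == 1 and t == 1) or (r == 1 and d == 1) or (t == 1 and d == 1)


subtotals = [row for row in cube_with_grouping(SALES) if having(row)]
for dims, flags, total in subtotals:
    print(dims, flags, total)
```

Only the per-Time totals, the per-Region totals, and the grand total survive the filter, matching the six rows of the query result.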
Roll up Example
SELECT Year, Quarter, Month, SUM(Profit) AS Profit
FROM sales GROUP BY ROLLUP(Year, Quarter, Month)
Roll up Example
The query result
Year  Quarter  Month         Profit
1997  Winter   Jan           55,000
1997  Winter   Feb           64,000
1997  Winter   March         71,000
1997  Winter   [NULL]       190,000
1997  Spring   April         75,000
1997  Spring   May           86,000
1997  Spring   June          88,000
1997  Spring   [NULL]       249,000
1997  Summer   July          91,000
1997  Summer   August        87,000
1997  Summer   September    101,000
1997  Summer   [NULL]       279,000
1997  Fall     October      109,000
1997  Fall     November     114,000
1997  Fall     December     133,000
1997  Fall     [NULL]       356,000
1997  [NULL]   [NULL]     1,074,000
References
1. Keith C.C. Chan, 2003, Data Warehousing & Data Mining, The Hong Kong Polytechnic University
2. Dr. Mourad YKHLEF, 2009, Decision Support System, King Saud University
3. -, -, Decision Support Technology, The Heinz School
4. Ahmed M. Zeki, 2004, Data Mining & Data Warehousing, INFO 6630
5. Dan St. Clair, 2002, Lect 1 – Intro. To Data Mining & Data Warehouses, University of Missouri-Rolla
6. S. Sudarshan; Krithi Ramamritham, -, Data Warehouse and Data Mining, IIT Bombay
7. Chris Clifton, 2004, Data Warehousing, Purdue University
8. Richard J. Roiger, -, The Data Warehouse, -
9. Hugh J. Watson, -, Recent Developments in Data Warehousing, http://www.terry.uga.edu/~hwatson/dw_tutorial.ppt, accessed 17-09-2010
10. Mark Isken, -, Data Warehousing and Online Analytical Processing (OLAP), -
11. Ari Cahyono, -, Introduction to Data Warehouse, Magister Teknologi Informasi UGM
Reference Library
BI Resources
The Data Warehousing Institute http://www.tdwi.org/
Kimball and Associates http://www.ralphkimball.com./html/articles.html
A Dimensional Modeling Manifesto – Kimball, R. http://www.dbmsmag.com/9708d15.html
DSS Resources http://dssresources.com/
Data Warehousing Information Center http://www.dwinfocenter.org/
Intelligent Enterprise http://www.intelligententerprise.com/
DM Review http://dmreview.com/
KDNuggets http://www.kdnuggets.com/
IT Toolbox http://www.ittoolbox.com/ http://businessintelligence.ittoolbox.com/ http://datawarehouse.ittoolbox.com/
OLAP Report http://www.olapreport.com/
o Some free stuff (nice history of OLAP and commentary on industry trends)
o Other stuff costs $
http://www.mosha.com/msolap/
o Awesome set of resources from the lead developer on the MS SQL Server Analysis Server team
Some Good Books and Articles
The Data Warehouse Toolkit – Kimball, R.
o Definitive, Microsoft SQL Server 2005 based 3rd edition now out
OLAP Solutions – Thomsen, E.
o Definitive, abstract and dense, good
MDX Solutions: With Microsoft SQL Server Analysis Services 2005 and Hyperion Essbase by George Spofford
o MDX is the “SQL” for cubes
Data Mining with SQL Server 2005 by ZhaoHui Tang, Jamie MacLennan
Data Warehouse Design Solutions – Adamson and Venerable
o Multi-D DW designs from lots of different industries
o Very practical, uses realistic situations to reinforce the concepts
Summers Rubber Company designs its data warehouse. Gorla, Narasimhaiah; Krehbiel, Steve. Interfaces; Mar/Apr 1999; 29, 2; ABI/INFORM Global
More AS Tutorials and Resources
http://www.mosha.com/msolap/
o This is the granddaddy of MS SQL Server Analysis Services resources. Mosha Pasumansky is the MS development lead on the AS engine.
o Site is a gold mine of information and links regarding AS and related software
o He participates in microsoft.public.sqlserver.olap
o Great blog at http://www.sqljunkies.com/WebLog/mosha/
Introduction to Analysis Services - by William Pearson (series of articles) http://www.databasejournal.com/article.php/1459531/
o Very nice series of MS AS tutorials
Best practices for Business Intelligence using the Microsoft Data Warehousing Framework
o A white paper from Microsoft