
Data Warehouse & OLAP Introduction

Session: 03-04
Lecturer: Danang Junaedi

2SPK – IF UTAMA

The Knowledge Discovery Process

Source: Fayyad, U., Piatetsky-Shapiro, G., Smyth, P., From Data Mining To Knowledge Discovery In Databases, AI Magazine, Fall 1996.

[Figure: the knowledge discovery process. Data Sources → (Selection) → Target Data → (Preprocessing) → Preprocessed Data → (Transformation) → Transformed Data → (Data Mining) → Patterns / Models → (Interpretation / Evaluation) → Knowledge]

Data Sources

• Relational Databases
• Data Warehouses
• WWW
• Audio
• Video
• Printed Materials
• ...

Relational Databases

Multidimensional Data Cube

Han, J. & Kamber, M., Data Mining: Concepts and Techniques, Morgan Kaufmann, 2000

Evolution of DSS

• Transaction Processing Systems (TPS)
  o Operational data stores and OLTP
  o Batch reports; hard to find and analyze information; inflexible and expensive; reprogram for every new request (circa 1960s)
• MIS
  o Management reporting from transactions in TPS
  o Still inflexible, not integrated with desktop tools (circa 1970s)
• DSS
  o Combine data with analytic models or expert rules
  o Integration with desktop tools (1980s)
• Data Warehousing
  o Data integrated (after cleaning and scrubbing) from multiple sources, both internal and external to the organization
  o OLAP is the technology used to study the data in terms of operations on a multi-dimensional data set
  o Data warehousing also supports processing of data by analytic methods and permits data mining (1990s)

Applications

• Retail - inventory management, promotions
• Manufacturing - order shipment
• Insurance - policy and claims tracking
• Telecommunications - call analysis
• Financial - account tracking
• CRM/eCRM - customer profiling, clickstream analysis
• Healthcare - disease management, patient and physician profiling

Databases for Decision Support

• Transaction Processing systems are optimized for performance
• Data they capture are too detailed to be of use for decision support purposes
• Online Analytical Processing (OLAP) imposes very different demands on databases than does Online Transaction Processing (OLTP)

Heterogeneous Database Integration

• Collects and combines information from disparate sources
• Provides integrated view, and a uniform user interface
• Supports sharing of data between entities

[Figure: an integration system connecting the World Wide Web, digital libraries, scientific databases, and personal databases]

Why look at data in this way?

• What would be the demand for services (forecasting)?
• Who are our key customers/patients, and
  o What are the margins/outcomes? (profitable customers/satisfied patients)
  o How do we market to them/treat them?
  o What pricing/treatment strategy is desirable?
  o What are their preferences?
  o What type of customer/patient services are required?
  o What services when packaged result in higher/better sales/revenues/margins/outcomes, efficient workflow?
• Which promotion/patient education/counseling works or does not work and why?
• What is the inventory/patient turnover?
• Which channel/technology is more effective/profitable?
• Why do margins/outcomes differ from one place to another or one patient to another?

Data Warehousing and Industry

• One of the hottest topics in IS.
• Over 90% of larger companies either have a DW or are starting one.
• Warehousing is big business
  o $2 billion in 1995
  o $3.5 billion in early 1997
  o $8 billion in 1998 [Metagroup]
  o over $200 billion over the next 5 years.

Data Warehousing and Industry (2)

• A 1996 study of 62 data warehousing projects showed:
  o An average return on investment of 321%, with an average payback period of 2.73 years.
• WalMart has largest warehouse
  o 900-CPU, 2,700 disk, 23 TB Teradata system
  o ~7TB in warehouse
  o 40-50GB per day

Why Data Warehousing?

• Advance of information technology.
• Data collected in huge amounts.
• Need to make good use of data?
• Architecture and tools to
  o Bring together scattered information from multiple sources to provide a consistent data source for decision support.
  o Support information processing by providing a solid platform of consolidated, historical data for analysis.

A producer wants to know...

• Which are our lowest/highest margin customers?
• Who are my customers and what products are they buying?
• Which customers are most likely to go to the competition?
• What impact will new products/services have on revenue and margins?
• What product promotions have the biggest impact on revenue?
• What is the most effective distribution channel?

Data, Data everywhere yet ...

• I can’t find the data I need
  o data is scattered over the network
  o many versions, subtle differences
• I can’t get the data I need
  o need an expert to get the data
• I can’t understand the data I found
  o available data poorly documented
• I can’t use the data I found
  o results are unexpected
  o data needs to be transformed from one form to other

What are the users saying...

• Data should be integrated across the enterprise
• Summary data has a real value to the organization
• Historical data holds the key to understanding data over time
• What-if capabilities are required

What is a Data Warehouse?

• Defined in many different ways, but not rigorously.
  o A DB for decision support.
  o Maintained separately from an organization’s operational database.
• "A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management’s decision-making process." (W. H. Inmon)
• 90% of major organizations have or are building some kind of data warehouse.
• A decision support database that is maintained separately from the organization’s operational databases.

Data Warehouse

A data warehouse is a
  o subject-oriented
  o integrated
  o time-varying
  o non-volatile
collection of data that is used primarily in organizational decision making.

-- Bill Inmon, Building the Data Warehouse, 1996

Data Warehouse

• Subject Oriented: The data is grouped under business headings, such as customers, products, and sales analysis reports. (This subject orientation is achieved through data modeling.)
• Integrated: The contents of the data warehouse are defined such that they are valid across the enterprise and its operational and external data sources.
• Time Dimensioned: All data in the data warehouse is time stamped at time of entry into the warehouse or when it is summarized.
• Non-volatile: Once loaded into the data warehouse, the data is not updated. Thus it acts as a stable resource for consistent reporting and comparative analysis.

What is Data Warehousing?

A process of transforming data into information and making it available to users in a timely enough manner to make a difference

[Forrester Research, April 1996]

[Figure: Data → Information]

Data Warehousing -- It is a process

• Technique for assembling and managing data from various sources for the purpose of answering business questions, thus making decisions that were not previously possible
• A decision support database maintained separately from the organization’s operational database

Explorers, Farmers and Tourists

• Explorers: Seek out the unknown and previously unsuspected rewards hiding in the detailed data
• Farmers: Harvest information from known access paths
• Tourists: Browse information harvested by farmers

Data Warehouse Architecture

[Figure: several data sources feed data extraction programs and data cleaning/scrubbing, which load a relational database (the warehouse). User queries, OLAP, decision support, data cubes, and data mining run against the warehouse.]

Data Warehouse Architecture

[Figure: relational databases, legacy data, purchased data, and ERP systems pass through extraction and cleansing into an optimized loader and the data warehouse engine, which is backed by a metadata repository. Users analyze and query the warehouse.]

Data Warehouse Component

Data Warehouse Architecture

[Figure: a clinical system, payroll system, billing system, other internal data (e.g., AS400, Oracle Financials on HP 9000), and external data (Access, files, industry reports) pass through transformation and integration into the data warehouse and its metadata. OLAP servers and data mining tools deliver results to Excel, the web, and other clients.]

[Figure: end-to-end data warehouse environment, in five columns.
 Data Sources: transaction data (Prod, Mkt, HR, Fin, Acctg; IBM, IMS, VSAM, Oracle, Sybase), other internal data (ERP, SAP; clickstream, Informix), web data, external data (demographic, Harte-Hanks).
 ETL Software: extract, clean/scrub, transform, load, via a staging area and an operational data store (Ascential, Sagent, SAS, Firstlogic, Informatica).
 Data Stores: data warehouse, metadata, and data marts for finance, marketing, and sales (Teradata, IBM, Essbase, Microsoft).
 Data Analysis Tools and Applications: SQL, Cognos, SAS, MicroStrategy, Siebel, Business Objects, web browser; queries, reporting, DSS/EIS, data mining.
 Users: analysts, managers, executives, operational personnel, customers/suppliers.]

Extraction, Transformation, & Load (ETL)

ETL is a set of tools and techniques used to populate a data warehouse

• Extraction
  o Extract data from sources (e.g., operational DBMSs, file systems, Web pages)
• Transformation
  o Clean data
  o Convert from legacy/host format to warehouse format (e.g., convert “surname” to “last name”)
• Load
  o Sort, summarize, consolidate, compute views, check integrity, build indexes, partition
  o Huge volumes of data to be loaded, yet small time window (usually at night) when the warehouse can be taken off-line
  o Techniques: batch, sequential load often too slow; incremental, parallel loading techniques may be used
• Refresh
  o Propagate updates from sources to the warehouse
  o When to refresh: on every update, periodically (e.g., every 24 hours), or after “significant” events
  o How to refresh: full extract from base tables vs. incremental techniques
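The ETL steps can be illustrated with a minimal Python sketch. It is only an illustration under stated assumptions: the source rows, the field names (`surname`, `amount`, `updated`), and the high-water-mark refresh rule are all hypothetical and not taken from the slides or from any particular ETL product.

```python
from datetime import datetime

# Hypothetical source rows, as they might come from an operational system.
SOURCE_ROWS = [
    {"id": 1, "surname": " Smith ", "amount": "14.50", "updated": "2002-01-13"},
    {"id": 2, "surname": "DOE", "amount": "8.25", "updated": "2002-01-15"},
    {"id": 3, "surname": "", "amount": "22.40", "updated": "2002-01-15"},
]


def extract(rows, since=None):
    """Full extract when `since` is None, otherwise an incremental extract."""
    if since is None:
        return list(rows)
    return [r for r in rows if r["updated"] > since]


def transform(rows):
    """Clean the data and convert it to the (assumed) warehouse format."""
    cleaned = []
    for r in rows:
        name = r["surname"].strip().title()
        if not name:  # drop records that fail a basic quality check
            continue
        cleaned.append({
            "id": r["id"],
            "last_name": name,  # "surname" becomes "last name"
            "amount": float(r["amount"]),
            "updated": r["updated"],
        })
    return cleaned


def load(warehouse, rows):
    """Append rows with a load timestamp; existing rows are never modified."""
    loaded_at = datetime.now().isoformat(timespec="seconds")
    for r in rows:
        warehouse.append({**r, "loaded_at": loaded_at})
    return len(rows)


warehouse = []
first = load(warehouse, transform(extract(SOURCE_ROWS)))  # initial full load

# Later, a new row appears at the source; refresh incrementally.
SOURCE_ROWS.append({"id": 4, "surname": "lee", "amount": "5.00", "updated": "2002-01-16"})
high_water = max(r["updated"] for r in warehouse)
second = load(warehouse, transform(extract(SOURCE_ROWS, since=high_water)))

print(first, second, [r["last_name"] for r in warehouse])
# 2 1 ['Smith', 'Doe', 'Lee']
```

The incremental refresh only pulls rows newer than the latest timestamp already in the warehouse, which is one simple way to avoid a full extract from the base tables on every refresh.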

Data Mart

• A data mart stores data for a limited number of subject areas, such as marketing and sales data. It is used to support specific applications.
• An independent data mart is created directly from source systems.
• A dependent data mart is populated from a data warehouse.

Data Warehouse vs. Data Marts

• Enterprise warehouse: collects all information about subjects (customers, products, sales, assets, personnel) that span the entire organization.
  o Requires extensive business modeling
  o May take years to design and build
• Data Marts: departmental subsets that focus on selected subjects, e.g., a marketing data mart: customer, products, sales.
  o Faster roll out, but complex integration in the long run.

Data Warehouse for Decision Support & OLAP

Putting information technology to work to help the knowledge worker make faster and better decisions
  o Which of my customers are most likely to go to the competition?
  o What product promotions have the biggest impact on revenue?
  o How did the share price of software companies correlate with profits over the last 10 years?

Decision Support

Used to manage and control business

Data is historical or point-in-time

Optimized for inquiry rather than update

Use of the system is loosely defined and can be ad-hoc

Used by managers and end-users to understand the business and make judgements

The Complete Decision Support System (Source: Franconi)

[Figure: information sources (operational DBs, semistructured sources) are extracted, transformed, loaded, and refreshed into the data warehouse server (Tier 1: data warehouse and data marts). The warehouse serves OLAP servers (Tier 2: e.g., MOLAP, ROLAP), which serve clients (Tier 3: analysis, query/reporting, data mining).]

Data Mining works with Warehouse Data

Data Warehousing provides the Enterprise with a memory

Data Mining provides the Enterprise with intelligence

We want to know ...

• Given a database of 100,000 names, which persons are the least likely to default on their credit cards?
• Which types of transactions are likely to be fraudulent given the demographics and transactional history of a particular customer?
• If I raise the price of my product by Rs. 2, what is the effect on my ROI?
• If I offer only 2,500 airline miles as an incentive to purchase rather than 5,000, how many lost responses will result?
• If I emphasize ease-of-use of the product as opposed to its technical capabilities, what will be the net effect on my revenues?
• Which of my customers are likely to be the most loyal?

Data Mining helps extract such information

Why Separate Data Warehouse?

• Performance
  o Operational databases are designed and tuned for known transactions and workloads.
  o Complex OLAP queries would degrade performance for operational transactions.
  o Special data organization, access, and implementation methods are needed for multidimensional views and queries.
• Function
  o Missing data: Decision support requires historical data, which operational databases do not typically maintain.

Data Warehouse vs. Heterogeneous DBMS

• Traditional heterogeneous DB integration:
  o Build wrappers/mediators on top of heterogeneous databases
  o Query driven approach
    • When a query is posed to a client site, a meta-dictionary is used to translate the query into queries appropriate for the individual heterogeneous sites involved, and the results are integrated into a global answer set
    • Complex information filtering, compete for resources
• Data warehouse: update-driven, high performance
  o Information from heterogeneous sources is integrated in advance and stored in warehouses for direct query and analysis

Data Warehouse vs. Operational DBMS

• OLTP (on-line transaction processing)
  o Major task of traditional relational DBMS
  o Day-to-day operations: purchasing, inventory, banking, manufacturing, payroll, registration, accounting, etc.
• OLAP (on-line analytical processing)
  o Major task of data warehouse system
  o Data analysis and decision making
• Distinct features (OLTP vs. OLAP):
  o User and system orientation: customer vs. market
  o Data contents: current, detailed vs. historical, consolidated
  o Database design: ER + application vs. star + subject
  o View: current, local vs. evolutionary, integrated
  o Access patterns: update vs. read-only but complex queries

RDBMS used for OLTP

Database Systems have been used traditionally for OLTP
  o clerical data processing tasks
  o detailed, up to date data
  o structured repetitive tasks
  o read/update a few records
  o isolation, recovery and integrity are critical

Operational Systems

• Run the business in real time
• Based on up-to-the-second data
• Optimized to handle large numbers of simple read/write transactions
• Optimized for fast response to predefined transactions
• Used by people who deal with customers, products -- clerks, salespeople etc.
• They are increasingly used by customers

Examples of Operational Data

Data | Industry | Usage | Technology | Volumes
Customer File | All | Track customer details | Legacy application, flat files, mainframes | Small-medium
Account Balance | Finance | Control account activities | Legacy applications, hierarchical databases, mainframe | Large
Point-of-Sale data | Retail | Generate bills, manage stock | ERP, client/server, relational databases | Very large
Call Record | Telecommunications | Billing | Legacy application, hierarchical database, mainframe | Very large
Production Record | Manufacturing | Control production | ERP, relational databases, AS/400 | Medium

Application-Orientation vs. Subject-Orientation

[Figure: application orientation vs. subject orientation. An operational database is organized around applications (Loans, Credit Card, Trust, Savings); a data warehouse is organized around subjects (Customer, Vendor, Product, Activity).]

OLTP vs. OLAP

 | OLTP | OLAP
users | clerk, IT professional | knowledge worker
function | day-to-day operations | decision support
DB design | application-oriented | subject-oriented
data | current, up-to-date; detailed, flat relational; isolated | historical, summarized, multidimensional; integrated, consolidated
usage | repetitive | ad-hoc
access | read/write; index/hash on primary key | lots of scans
unit of work | short, simple transaction | complex query
# records accessed | tens | millions
# users | thousands | hundreds
DB size | 100MB-GB | 100GB-TB
metric | transaction throughput | query throughput, response

OLTP vs. Data Warehouse

• OLTP systems are tuned for known transactions and workloads, while the workload is not known a priori in a data warehouse
• Special data organization, access methods and implementation methods are needed to support data warehouse queries (typically multidimensional queries)
  o e.g., average amount spent on phone calls between 9AM-5PM in Pune during the month of December
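The example query constrains several dimensions at once (location, time of day, month) and then aggregates a measure. A minimal Python sketch of that idea follows; the call records and the `average_amount` helper are hypothetical and stand in for what a warehouse query engine would do over far larger data.

```python
from statistics import mean

# Hypothetical call records: (city, month, hour_of_day, amount)
CALLS = [
    ("Pune", 12, 10, 4.0),
    ("Pune", 12, 16, 6.0),
    ("Pune", 12, 20, 9.0),    # outside 9AM-5PM
    ("Pune", 11, 11, 3.0),    # not December
    ("Mumbai", 12, 12, 7.0),  # not Pune
]


def average_amount(calls, city, month, start_hour, end_hour):
    """Average the measure over calls that match every dimension constraint."""
    selected = [
        amount
        for c, m, h, amount in calls
        if c == city and m == month and start_hour <= h < end_hour
    ]
    return mean(selected) if selected else None


result = average_amount(CALLS, "Pune", 12, 9, 17)
print(result)  # 5.0
```

A scan like this touches every record, which is why warehouses rely on special data organization and access methods rather than the key-based lookups that suit OLTP.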

OLTP vs Data Warehouse

• OLTP
  o Application oriented
  o Used to run business
  o Detailed data
  o Current, up to date
  o Isolated data
  o Repetitive access
  o Clerical user
• Warehouse (DSS)
  o Subject oriented
  o Used to analyze business
  o Summarized and refined
  o Snapshot data
  o Integrated data
  o Ad-hoc access
  o Knowledge user (manager)

OLTP vs Data Warehouse (2)

• OLTP
  o Performance sensitive
  o Few records accessed at a time (tens)
  o Read/update access
  o No data redundancy
  o Database size 100 MB - 100 GB
• Data Warehouse
  o Performance relaxed
  o Large volumes accessed at a time (millions)
  o Mostly read (batch update)
  o Redundancy present
  o Database size 100 GB - few terabytes

OLTP vs Data Warehouse (3)

• OLTP
  o Transaction throughput is the performance metric
  o Thousands of users
  o Managed in entirety
• Data Warehouse
  o Query throughput is the performance metric
  o Hundreds of users
  o Managed by subsets

To summarize ...

OLTP Systems are used to “run” a business

The Data Warehouse helps to “optimize” the business

From Tables and Spreadsheets to Data Cubes

• A data warehouse is based on a multidimensional data model which views data in the form of a data cube
• A data cube, such as sales, allows data to be modeled and viewed in multiple dimensions
  o Dimension tables, such as item (item_name, brand, type), or time (day, week, month, quarter, year)
  o Fact table contains measures (such as dollars_sold) and keys to each of the related dimension tables
• In data warehousing literature, an n-D base cube is called a base cuboid. The topmost 0-D cuboid, which holds the highest level of summarization, is called the apex cuboid. The lattice of cuboids forms a data cube.

A Sample Data Cube

[Figure: a 3-D sales cube with dimensions Date (1Qtr, 2Qtr, 3Qtr, 4Qtr, sum), Product (TV, VCR, PC, sum), and Country (U.S.A., Canada, Mexico, sum). One highlighted cell holds the total annual sales of TVs in the U.S.A.]

Cube: A Lattice of Cuboids

• 0-D (apex) cuboid: all
• 1-D cuboids: time; item; location; supplier
• 2-D cuboids: (time, item); (time, location); (time, supplier); (item, location); (item, supplier); (location, supplier)
• 3-D cuboids: (time, item, location); (time, item, supplier); (time, location, supplier); (item, location, supplier)
• 4-D (base) cuboid: (time, item, location, supplier)
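The lattice can be enumerated mechanically: each cuboid corresponds to one subset of the dimensions. The sketch below is an illustration only; the `cuboids` helper is a hypothetical name, and it ignores concept hierarchies within a dimension.

```python
from itertools import combinations

DIMENSIONS = ("time", "item", "location", "supplier")


def cuboids(dimensions):
    """Map k to the list of k-D cuboids (one per subset of k dimensions)."""
    return {
        k: list(combinations(dimensions, k))
        for k in range(len(dimensions) + 1)
    }


lattice = cuboids(DIMENSIONS)
for k, groups in lattice.items():
    print(f"{k}-D:", groups)

total = sum(len(groups) for groups in lattice.values())
print(total)  # 16, i.e. 2**4
```

With n dimensions and no hierarchies there are 2^n cuboids, from the apex (the empty group-by) to the base cuboid (all n dimensions).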


Data (Hyper) cubes

2-d to 3-d cube

Rotating the cube

Conceptual Modeling of Data Warehouses

• ER design techniques not appropriate
• Modeling data warehouses: dimensions & measures
  o Star schema: A fact table in the middle connected to a set of dimension tables
  o Snowflake schema: A refinement of star schema where some dimensional hierarchy is normalized into a set of smaller dimension tables, forming a shape similar to a snowflake
  o Fact constellation schema: Multiple fact tables share dimension tables; viewed as a collection of stars, it is therefore called a galaxy schema or fact constellation

Slide is modified from slides provided by Han, J. & Kamber, M., Data Mining: Concepts and Techniques, Morgan Kaufmann, 2000.

Problem with ER

ER models are NOT suitable for DW?
• End user cannot understand or remember an ER model
• Many DWs have failed because of overly complex ER designs
• Not optimized for complex, ad-hoc queries
• Data retrieval becomes difficult due to normalization
• Browsing becomes difficult

Warehouse Models & Operators

• Data Models
  o relations
  o stars & snowflakes
  o cubes
• Operators
  o slice & dice
  o roll-up, drill down
  o pivoting
  o other

Multidimensional Data Model

• Database is a set of facts (points) in a multidimensional space
• A fact has a measure dimension
  o quantity that is analyzed, e.g., sale, budget
• A set of dimensions on which data is analyzed
  o e.g., store, product, date associated with a sale amount
• Dimensions form a sparsely populated coordinate system
• Each dimension has a set of attributes
  o e.g., owner, city and county of store
• Attributes of a dimension may be related by partial order
  o Hierarchy: e.g., street > county > city
  o Lattice: e.g., date > month > year, date > week > year

Example: Patient profiling

• A healthcare organization needed a longitudinal view of patients, including trends of services to patients
• Model
  o Facts include Healthcare (e.g., diagnosis, procedure), Financial (e.g., amount billed, number of claims), Resources (e.g., number of bed-days, inpatient and outpatient visits)
  o Dimensions include Time, Provider, Claim Type, Demographics, Encounter Type, Diagnosis and Procedure, Person, Organization
• Questions answered by system:
  o Which individuals are eligible for services but not obtaining them?
  o Which individuals are registered for services, but not receiving preventive healthcare?

Star Schema

• A single fact table and a single table for each dimension
• Every fact points to one tuple in each of the dimensions and has additional attributes
• Does not capture hierarchies directly
• Straightforward means of capturing a multiple dimension data model using relations
• Slowly Changing Dimensions

Fig. 2.4 Example of Star Schema

• Sales Fact Table
  o Keys: time_key, item_key, branch_key, location_key
  o Measures: units_sold, dollars_sold, avg_sales
• Dimension tables
  o time: time_key, day, day_of_the_week, month, quarter, year
  o item: item_key, item_name, brand, type, supplier_type
  o branch: branch_key, branch_name, branch_type
  o location: location_key, street, city, province_or_state, country

Slide is modified from slides provided by Han, J. & Kamber, M., Data Mining: Concepts and Techniques, Morgan Kaufmann, 2000.
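A star schema of this shape can be tried out with Python's built-in `sqlite3` module. This is a sketch under assumptions: only two of the four dimensions are included, the table names carry `_dim`/`_fact` suffixes chosen here for clarity, and the rows are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE time_dim (time_key INTEGER PRIMARY KEY, quarter TEXT, year INTEGER);
CREATE TABLE item_dim (item_key INTEGER PRIMARY KEY, item_name TEXT, brand TEXT);
-- The fact table holds a foreign key per dimension plus the measures.
CREATE TABLE sales_fact (
    time_key INTEGER REFERENCES time_dim(time_key),
    item_key INTEGER REFERENCES item_dim(item_key),
    units_sold INTEGER,
    dollars_sold REAL
);
INSERT INTO time_dim VALUES (1, 'Q1', 2000), (2, 'Q2', 2000);
INSERT INTO item_dim VALUES (1, 'TV', 'BrandA'), (2, 'VCR', 'BrandB');
INSERT INTO sales_fact VALUES
    (1, 1, 2, 800.0), (1, 2, 1, 150.0), (2, 1, 3, 1200.0), (2, 1, 1, 400.0);
""")

# Join the fact table to each dimension it needs, then aggregate the measure.
rows = conn.execute("""
    SELECT t.quarter, i.item_name, SUM(s.dollars_sold)
    FROM sales_fact AS s
    JOIN time_dim AS t ON s.time_key = t.time_key
    JOIN item_dim AS i ON s.item_key = i.item_key
    GROUP BY t.quarter, i.item_name
    ORDER BY t.quarter, i.item_name
""").fetchall()
print(rows)
# [('Q1', 'TV', 800.0), ('Q1', 'VCR', 150.0), ('Q2', 'TV', 1600.0)]
```

Every analytical query follows the same pattern: one join per dimension used, directly from the fact table, with no joins between dimensions.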

Example of a Star Schema

• Fact Table: OrderNO, SalespersonID, CustomerNO, ProdNo, DateKey, CityName, Quantity, Total Price
• Dimension tables
  o Order: Order No, Order Date
  o Customer: Customer No, Customer Name, Customer Address, City
  o Salesperson: SalespersonID, SalespersonName, City, Quota
  o Product: ProductNO, ProdName, ProdDescr, Category, CategoryDescription, UnitPrice
  o Date: DateKey, Date
  o City: CityName, State, Country

A star schema for credit card purchases

• Fact Table: Cardholder Key, Purchase Key, Location Key, Time Key, Amount
• Cardholder Dimension: Cardholder Key, Name, Gender, Income Range
  o e.g., 1, John Doe, Male, 50 - 70,000; 2, Sara Smith, Female, 70 - 90,000
• Purchase Dimension: Purchase Key, Category
  o 1 Supermarket; 2 Travel & Entertainment; 3 Auto & Vehicle; 4 Retail; 5 Restaurant; 6 Miscellaneous
• Location Dimension: Location Key, Street, City, State, Region
  o e.g., 10, 425 Church St, Charleston, SC, 3
• Time Dimension: Time Key, Month, Day, Quarter, Year
  o e.g., 10, Jan, 15, ..., 2002

Snowflake Schema

• Represent dimensional hierarchy directly by normalizing the dimension tables
• Easy to maintain
• Saves storage, but may reduce effectiveness of browsing (Kimball)

Fig. 2.5 Example of Snowflake Schema

• Sales Fact Table
  o Keys: time_key, item_key, branch_key, location_key
  o Measures: units_sold, dollars_sold, avg_sales
• Dimension tables
  o time: time_key, day, day_of_the_week, month, quarter, year
  o item: item_key, item_name, brand, type, supplier_key
  o supplier: supplier_key, supplier_type
  o branch: branch_key, branch_name, branch_type
  o location: location_key, street, city_key
  o city: city_key, city, province_or_state, country

Slide is modified from slides provided by Han, J. & Kamber, M., Data Mining: Concepts and Techniques, Morgan Kaufmann, 2000.
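The practical effect of snowflaking is one extra lookup per normalized level. The following Python sketch uses plain dictionaries as stand-ins for the item and supplier tables; the rows and the `dollars_by_supplier_type` helper are hypothetical.

```python
# Hypothetical rows. In a star schema the item dimension would hold
# supplier_type directly; in the snowflake schema it holds only supplier_key.
SUPPLIER = {
    10: {"supplier_type": "wholesale"},
    20: {"supplier_type": "retail"},
}
ITEM = {
    1: {"item_name": "TV", "brand": "BrandA", "supplier_key": 10},
    2: {"item_name": "VCR", "brand": "BrandB", "supplier_key": 20},
    3: {"item_name": "PC", "brand": "BrandC", "supplier_key": 10},
}
SALES = [  # (item_key, dollars_sold)
    (1, 800.0),
    (2, 150.0),
    (3, 1000.0),
    (1, 400.0),
]


def dollars_by_supplier_type(sales, item, supplier):
    """Total dollars_sold per supplier_type, following fact -> item -> supplier."""
    totals = {}
    for item_key, dollars in sales:
        supplier_key = item[item_key]["supplier_key"]     # fact -> item
        s_type = supplier[supplier_key]["supplier_type"]  # item -> supplier
        totals[s_type] = totals.get(s_type, 0.0) + dollars
    return totals


totals = dollars_by_supplier_type(SALES, ITEM, SUPPLIER)
print(totals)  # {'wholesale': 2200.0, 'retail': 150.0}
```

Each supplier_type value is stored once in the supplier table rather than once per item, which is the storage saving; the price is the second hop on every query that groups by it.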

Example of a Snowflake Schema

• Fact Table: OrderNO, SalespersonID, CustomerNO, ProdNo, DateKey, CityName, Quantity, Total Price
• Dimension tables
  o Order: Order No, Order Date
  o Customer: Customer No, Customer Name, Customer Address, City
  o Salesperson: SalespersonID, SalespersonName, City, Quota
  o Product: ProductNO, ProdName, ProdDescr, Category, UnitPrice
  o Date: DateKey, Date, Month
  o City: CityName, State, Country
• Normalized (snowflaked) tables
  o Category: CategoryName, CategoryDescr
  o Month: Month, Year
  o Year: Year
  o State: StateName, Country

Fact Constellation Schema

• Multiple fact tables share dimension tables.
• This schema is viewed as a collection of stars, hence called galaxy schema or fact constellation.
• Sophisticated applications require such a schema.

Fig 2.6 Example of Fact Constellation

• Sales Fact Table
  o Keys: time_key, item_key, branch_key, location_key
  o Measures: units_sold, dollars_sold, avg_sales
• Shipping Fact Table
  o Keys: time_key, item_key, shipper_key, from_location, to_location
  o Measures: dollars_cost, units_shipped
• Dimension tables
  o time: time_key, day, day_of_the_week, month, quarter, year
  o item: item_key, item_name, brand, type, supplier_type
  o branch: branch_key, branch_name, branch_type
  o location: location_key, street, city, province_or_state, country
  o shipper: shipper_key, shipper_name, location_key, shipper_type

Slide is modified from slides provided by Han, J. & Kamber, M., Data Mining: Concepts and Techniques, Morgan Kaufmann, 2000.

Example Fact Constellation Schema

• Sales Fact Table: Store Key, Product Key, Period Key, Units, Price
• Shipping Fact Table: Shipper Key, Store Key, Product Key, Period Key, Units, Price
• Store Dimension: Store Key, Store Name, City, State, Region
• Product Dimension: Product Key, Product Desc

A constellation schema for credit card purchases and promotions

• Purchase Fact Table: Cardholder Key, Purchase Key, Location Key, Time Key, Amount
• Promotion Fact Table: Cardholder Key, Promotion Key, Time Key, Response
  o e.g., 1, 1, 5, Yes; 2, 1, 5, No
• Cardholder Dimension: Cardholder Key, Name, Gender, Income Range
  o e.g., 1, John Doe, Male, 50 - 70,000; 2, Sara Smith, Female, 70 - 90,000
• Purchase Dimension: Purchase Key, Category
  o 1 Supermarket; 2 Travel & Entertainment; 3 Auto & Vehicle; 4 Retail; 5 Restaurant; 6 Miscellaneous
• Promotion Dimension: Promotion Key, Description, Cost
  o e.g., 1, watch promo, 15.25
• Location Dimension: Location Key, Street, City, State, Region
  o e.g., 5, 425 Church St, Charleston, SC, 3
• Time Dimension: Time Key, Month, Day, Quarter, Year
  o e.g., 8, Jan, 13, ..., 2002; 10, Jan, 15, ..., 2002

What is OLAP?

• Software tool providing multi-dimensional view of data for business analysis
• Example of “Decision Support” or “Business Intelligence” tool
• Fast data access and fast computations
• Interactive, flexible user interface
• “Slice, dice, drill-down”
• Excel Pivot Table and Pivot Chart are examples of simple OLAP tools

Defining OLAP - ANALYSIS

• Business logic and statistical analysis relevant to end user
• Should not require programming for everything
• Analysis can be via vendors’ tools or link to generic analytical platform such as spreadsheet
• Examples include time series analysis, cost allocation, currency translation, goal seeking, ad-hoc multi-dimensional structural changes (cube building), non-procedural modeling, exception alerting, and data mining.
• Capabilities vary widely by vendor and market

OLAP Operations

Common cube operations

• Pivot or Rotate: change which dimensions and/or levels within dimensions are shown on row and column axes
• Roll-up: aggregate or combine cells within a dimension according to some mathematical operation
  o Uses a hierarchy definition for the dimension
  o Commonly this is summation or count
• Drill down: examine data at a greater level of detail
  o Add another row or column header which is further down the concept hierarchy
• Slice: select a subset of a cube by constraining the value of some dimension
  o Ex: Select cells for month = January in time dimension
• Dice: select a subset of a cube by constraining two or more dimensions
• Drill through: access atomic level detail data
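These operations can be sketched in a few lines of Python over a tiny cube stored as a dictionary from coordinates to a measure. Everything here is an assumption made for illustration: the cell values, the month-to-quarter hierarchy, and the helper names are hypothetical, not any OLAP product's API.

```python
from collections import defaultdict

# Hypothetical cube cells: (month, product, country) -> units sold
DIMS = ("month", "product", "country")
CELLS = {
    ("Jan", "TV", "USA"): 10,
    ("Jan", "PC", "USA"): 5,
    ("Feb", "TV", "USA"): 7,
    ("Feb", "TV", "Canada"): 3,
    ("Apr", "PC", "Canada"): 4,
}
QUARTER = {"Jan": "Q1", "Feb": "Q1", "Mar": "Q1", "Apr": "Q2"}  # month -> quarter


def slice_(cells, dim, value):
    """Slice: constrain a single dimension to one value."""
    i = DIMS.index(dim)
    return {k: v for k, v in cells.items() if k[i] == value}


def dice(cells, **constraints):
    """Dice: constrain two or more dimensions to sets of allowed values."""
    def keep(key):
        return all(key[DIMS.index(d)] in allowed for d, allowed in constraints.items())
    return {k: v for k, v in cells.items() if keep(k)}


def roll_up(cells, dim, mapping):
    """Roll-up: replace a level by its parent in the hierarchy and sum."""
    i = DIMS.index(dim)
    out = defaultdict(int)
    for k, v in cells.items():
        out[k[:i] + (mapping[k[i]],) + k[i + 1:]] += v
    return dict(out)


def pivot(cells, row_dim, col_dim):
    """Pivot: show two chosen dimensions on the row and column axes."""
    r, c = DIMS.index(row_dim), DIMS.index(col_dim)
    table = defaultdict(lambda: defaultdict(int))
    for k, v in cells.items():
        table[k[r]][k[c]] += v
    return {row: dict(cols) for row, cols in table.items()}


january = slice_(CELLS, "month", "Jan")
tv_in_usa = dice(CELLS, product={"TV"}, country={"USA"})
by_quarter = roll_up(CELLS, "month", QUARTER)
product_by_country = pivot(CELLS, "product", "country")
print(by_quarter[("Q1", "TV", "USA")])  # 17
print(product_by_country["TV"])         # {'USA': 17, 'Canada': 3}
```

Drill down is the inverse of roll-up: it returns from the quarter level in `by_quarter` to the month-level cells in `CELLS`, and drill through would go further to the individual transactions behind each cell.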

OLAP Operations & SQL Sample

Cube Operation (SQL)

A Few Products

Microsoft Analysis Serviceso Part of SQL Server 2005o Create OLAP cubes, 10 data mining algorithms

Tableauo A new, pretty amazing pivoting tool

Cognoso Recently bought by IBM

Hyperion Essbaseo Full suite of business intelligence developer and end user toolso Purchased by Oracle

Business Objects (Crystal)o Full suite of business intelligence developer and end user tools

MicrostrategyOracleInformation Builders

o Home of WebFocus, a web based OLAP toolPentaho

o A new open source business intelligence projecto http://www.pentaho.org/


A Data Mining Query Language, DMQL: Language Primitives

Cube Definition (Fact Table)
define cube <cube_name> [<dimension_list>]: <measure_list>

Dimension Definition (Dimension Table)
define dimension <dimension_name> as (<attribute_or_subdimension_list>)

Special Case (Shared Dimension Tables)
o First time as “cube definition”
o define dimension <dimension_name> as <dimension_name_first_time> in cube <cube_name_first_time>


Defining a Star Schema in DMQL

define cube sales_star [time, item, branch, location]:
    dollars_sold = sum(sales_in_dollars), avg_sales = avg(sales_in_dollars), units_sold = count(*)

define dimension time as (time_key, day, day_of_week, month, quarter, year)

define dimension item as (item_key, item_name, brand, type, supplier_type)

define dimension branch as (branch_key, branch_name, branch_type)

define dimension location as (location_key, street, city, province_or_state, country)
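To make the star schema definition more concrete, here is a rough Python sketch of one fact table joined to one dimension table, computing the same three measures (dollars_sold, avg_sales, units_sold). The rows, key values and the helper name `measures_by_item_type` are hypothetical; only the column and measure names come from the DMQL above.

```python
# Hypothetical dimension table: item_key -> attributes (values are invented).
item_dim = {
    1: {"item_name": "VCR", "brand": "Acme", "type": "video"},
    2: {"item_name": "Tape", "brand": "Acme", "type": "cassette"},
}

# Hypothetical fact table rows:
# (time_key, item_key, branch_key, location_key, sales_in_dollars)
sales_fact = [
    (101, 1, 10, 1000, 250.0),
    (101, 2, 10, 1000, 20.0),
    (102, 1, 11, 1001, 230.0),
]


def measures_by_item_type(facts, items):
    """Compute dollars_sold, avg_sales and units_sold per item type."""
    groups = {}
    for _time, item_key, _branch, _location, dollars in facts:
        item_type = items[item_key]["type"]  # join the fact row to its dimension row
        groups.setdefault(item_type, []).append(dollars)
    return {
        item_type: {
            "dollars_sold": sum(values),               # sum(sales_in_dollars)
            "avg_sales": sum(values) / len(values),    # avg(sales_in_dollars)
            "units_sold": len(values),                 # count(*)
        }
        for item_type, values in groups.items()
    }


result = measures_by_item_type(sales_fact, item_dim)
```

The fact table holds only keys and the raw measure; every descriptive attribute is looked up in a dimension table, which is the defining property of the star layout.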


Defining a Snowflake Schema in DMQL

define cube sales_snowflake [time, item, branch, location]:
    dollars_sold = sum(sales_in_dollars), avg_sales = avg(sales_in_dollars), units_sold = count(*)

define dimension time as (time_key, day, day_of_week, month, quarter, year)

define dimension item as (item_key, item_name, brand, type, supplier(supplier_key, supplier_type))

define dimension branch as (branch_key, branch_name, branch_type)

define dimension location as (location_key, street, city(city_key, province_or_state, country))


Defining a Fact Constellation in DMQL

define cube sales [time, item, branch, location]:
    dollars_sold = sum(sales_in_dollars), avg_sales = avg(sales_in_dollars), units_sold = count(*)

define dimension time as (time_key, day, day_of_week, month, quarter, year)

define dimension item as (item_key, item_name, brand, type, supplier_type)

define dimension branch as (branch_key, branch_name, branch_type)

define dimension location as (location_key, street, city, province_or_state, country)

define cube shipping [time, item, shipper, from_location, to_location]:
    dollar_cost = sum(cost_in_dollars), unit_shipped = count(*)

define dimension time as time in cube sales

define dimension item as item in cube sales

define dimension shipper as (shipper_key, shipper_name, location as location in cube sales, shipper_type)

define dimension from_location as location in cube sales

define dimension to_location as location in cube sales

First Example

Enrollment Data

Source: Dr. Mourad YKHLEF, Decision Support System, King Saud University, 2009


Class  Day  Time  Prof      Enrolled
1336   T    8     Aars      14
1430   M    2     Aars      28
1430   M    11    Booth     30
1430   T    11    Booth     26
1430   T    2     Booth     23
1430   T    2     Fry       27
1430   T    12    Aars      29
1440   M    1     Aars      11
2334   M    9     Fry       27
2350   M    10    Maurer    19
3101   T    12    Grabow    16
3303   M    11    Aars      11
3324   M    8     Gaitros   20
3330   T    11    Fry       5
3331   T    12    Aars      11
3334   M    11    Hamerly   20
3335   M    2     Donahoo   17
3336   T    8     Sturgill  9
3342   T    2     Aars      10
3439   T    9     Poucher   10

Enroll Table


Group by rollup

Select Prof, Sum(Enrolled)
From enroll
Group by rollup (Prof)

Prof      Enrolled
Aars      114
Booth     79
Donahoo   17
Fry       59
Gaitros   20
Grabow    16
Hamerly   20
Maurer    19
Poucher   10
Sturgill  9
[NULL]    363
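As a cross-check of the totals above, the same roll-up can be emulated in a few lines of Python. This is only a sketch of what the query computes, not how a database executes it; the function name `rollup_by_prof` is hypothetical and the rows are copied from the Enroll table.

```python
from collections import defaultdict

# Rows of the Enroll table: (class, day, time, prof, enrolled)
ENROLL = [
    (1336, "T", 8, "Aars", 14), (1430, "M", 2, "Aars", 28),
    (1430, "M", 11, "Booth", 30), (1430, "T", 11, "Booth", 26),
    (1430, "T", 2, "Booth", 23), (1430, "T", 2, "Fry", 27),
    (1430, "T", 12, "Aars", 29), (1440, "M", 1, "Aars", 11),
    (2334, "M", 9, "Fry", 27), (2350, "M", 10, "Maurer", 19),
    (3101, "T", 12, "Grabow", 16), (3303, "M", 11, "Aars", 11),
    (3324, "M", 8, "Gaitros", 20), (3330, "T", 11, "Fry", 5),
    (3331, "T", 12, "Aars", 11), (3334, "M", 11, "Hamerly", 20),
    (3335, "M", 2, "Donahoo", 17), (3336, "T", 8, "Sturgill", 9),
    (3342, "T", 2, "Aars", 10), (3439, "T", 9, "Poucher", 10),
]


def rollup_by_prof(rows):
    """Emulate GROUP BY ROLLUP(Prof): one row per professor plus a grand total."""
    totals = defaultdict(int)
    for _class, _day, _time, prof, enrolled in rows:
        totals[prof] += enrolled
    result = sorted(totals.items())
    # The grand-total row has no professor, which SQL reports as NULL.
    result.append((None, sum(totals.values())))
    return result


result = rollup_by_prof(ENROLL)
```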


Group by cube

Select Day, Time, Sum(Enrolled)
From enroll
Group By cube (Day, Time)

{rollup(Day, Time) would return all of these rows except the ([NULL], Time) rows}

Day     Time    Enrolled
[NULL]  [NULL]  363
[NULL]  1       11
[NULL]  2       105
[NULL]  8       43
[NULL]  9       37
[NULL]  10      19
[NULL]  11      92
[NULL]  12      56
M       [NULL]  183
M       1       11
M       2       45
M       8       20
M       9       27
M       10      19
M       11      61
T       [NULL]  180
T       2       60
T       8       23
T       9       10
T       11      31
T       12      56


Group by rollup

Select Day, Time, Prof, Sum(Enrolled)
From enroll
Group by rollup(Day, Time), rollup(Prof)


Second Example

Sales Data


Sales Cube

[Figure: sales cube with dimensions Market, Time, and Product]


Sales Example (Cont.)

Simple Cross-Tabular Report

1997
         Department
Region   CassetteSales Profit  VideoSales Profit  Total Profit
Central  82,000                85,000             167,000
East     101,000               137,000            238,000
West     96,000                97,000             193,000
Total    279,000               319,000            598,000


Sales Example (Cont.)

Roll up – the query

SELECT Time, Region, Department, SUM(Profit)
FROM sales
GROUP BY ROLLUP(Time, Region, Department)

This query returns the following sets of rows:

• Regular aggregation rows that would be produced by GROUP BY without using ROLLUP.
• First-level subtotals aggregating across Department for each combination of Time and Region.
• Second-level subtotals aggregating across Region and Department for each Time value.
• A grand total row.
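The four sets of rows can be reproduced with a small Python emulation. This is a sketch under assumptions: the base rows are the 1996 and 1997 profit figures used in this sales example, the function name `rollup` is hypothetical, and the row order differs from what a database would return.

```python
from collections import defaultdict

# Base rows: (Time, Region, Department, Profit)
SALES = [
    (1996, "Central", "CassetteSales", 75_000), (1996, "Central", "VideoSales", 74_000),
    (1996, "East", "CassetteSales", 89_000), (1996, "East", "VideoSales", 115_000),
    (1996, "West", "CassetteSales", 87_000), (1996, "West", "VideoSales", 86_000),
    (1997, "Central", "CassetteSales", 82_000), (1997, "Central", "VideoSales", 85_000),
    (1997, "East", "CassetteSales", 101_000), (1997, "East", "VideoSales", 137_000),
    (1997, "West", "CassetteSales", 96_000), (1997, "West", "VideoSales", 97_000),
]


def rollup(rows, n_dims):
    """Emulate GROUP BY ROLLUP over the first n_dims columns of each row.

    Level by level, the rightmost remaining dimension is replaced by None
    (SQL NULL), which gives n_dims + 1 grouping levels in total.
    """
    out = []
    for keep in range(n_dims, -1, -1):
        totals = defaultdict(int)
        for row in rows:
            key = row[:keep] + (None,) * (n_dims - keep)
            totals[key] += row[-1]
        out.extend((*key, total) for key, total in totals.items())
    return out


result = rollup(SALES, 3)
```

With three dimensions the loop runs four times: 12 regular rows, 6 first-level subtotals, 2 second-level subtotals and 1 grand total, 21 rows in all.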


Sales Example (Cont.)

Roll up – the result of query

Time    Region   Dept           Profit
1996    Central  CassetteSales  75,000
1996    Central  VideoSales     74,000
1996    Central  [NULL]         149,000
1996    East     CassetteSales  89,000
1996    East     VideoSales     115,000
1996    East     [NULL]         204,000
1996    West     CassetteSales  87,000
1996    West     VideoSales     86,000
1996    West     [NULL]         173,000
1996    [NULL]   [NULL]         526,000
1997    Central  CassetteSales  82,000
1997    Central  VideoSales     85,000
1997    Central  [NULL]         167,000
1997    East     CassetteSales  101,000
1997    East     VideoSales     137,000
1997    East     [NULL]         238,000
1997    West     CassetteSales  96,000
1997    West     VideoSales     97,000
1997    West     [NULL]         193,000
1997    [NULL]   [NULL]         598,000
[NULL]  [NULL]   [NULL]         1,124,000


Sales Example (Cont.)

Roll up

Calculating Subtotals without ROLLUP
The result set could be generated by the UNION of four SELECT statements, as shown below. This is a subtotal across three dimensions. Notice that a complete set of ROLLUP-style subtotals in n dimensions would require n+1 SELECT statements linked with UNION ALL.

SELECT Time, Region, Department, SUM(Profit)
FROM Sales
GROUP BY Time, Region, Department
UNION ALL
SELECT Time, Region, '', SUM(Profit)
FROM Sales
GROUP BY Time, Region
UNION ALL
SELECT Time, '', '', SUM(Profit)
FROM Sales
GROUP BY Time
UNION ALL
SELECT '', '', '', SUM(Profit)
FROM Sales;


Sales Example (Cont.)

Cube – the query

SELECT Time, Region, Department, SUM(Profit)
FROM sales
GROUP BY CUBE(Time, Region, Department)


Sales Example (Cont.)

Cube – the result of query

Time    Region   Dept           Profit
1996    Central  CassetteSales  75,000
1996    Central  VideoSales     74,000
1996    Central  [NULL]         149,000
1996    East     CassetteSales  89,000
1996    East     VideoSales     115,000
1996    East     [NULL]         204,000
1996    West     CassetteSales  87,000
1996    West     VideoSales     86,000
1996    West     [NULL]         173,000
1996    [NULL]   CassetteSales  251,000
1996    [NULL]   VideoSales     275,000
1996    [NULL]   [NULL]         526,000
1997    Central  CassetteSales  82,000
1997    Central  VideoSales     85,000
1997    Central  [NULL]         167,000
1997    East     CassetteSales  101,000
1997    East     VideoSales     137,000
1997    East     [NULL]         238,000
1997    West     CassetteSales  96,000
1997    West     VideoSales     97,000
1997    West     [NULL]         193,000
1997    [NULL]   CassetteSales  279,000
1997    [NULL]   VideoSales     319,000
1997    [NULL]   [NULL]         598,000
[NULL]  Central  CassetteSales  157,000
[NULL]  Central  VideoSales     159,000
[NULL]  Central  [NULL]         316,000
[NULL]  East     CassetteSales  190,000
[NULL]  East     VideoSales     252,000
[NULL]  East     [NULL]         442,000
[NULL]  West     CassetteSales  183,000
[NULL]  West     VideoSales     183,000
[NULL]  West     [NULL]         366,000
[NULL]  [NULL]   CassetteSales  530,000
[NULL]  [NULL]   VideoSales     594,000
[NULL]  [NULL]   [NULL]         1,124,000


Sales Example (Cont.)

Cube

Calculating Subtotals without CUBE
Just as for ROLLUP, multiple SELECT statements combined with UNION statements could provide the same information gathered through CUBE. However, this may require many SELECT statements: for an n-dimensional cube, 2^n SELECT statements are needed. In our 3-dimension example, this would mean issuing 8 SELECTs linked with UNION ALL.
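The 2^n count comes from the fact that CUBE aggregates over every subset of the dimensions. A minimal Python sketch, assuming the same 1996 and 1997 profit rows as the sales example and using hypothetical helper names (`grouping_sets`, `cube`), enumerates those subsets and computes one aggregation per subset:

```python
from collections import defaultdict
from itertools import combinations

DIMS = ("Time", "Region", "Department")
RAW = [
    (1996, "Central", "CassetteSales", 75_000), (1996, "Central", "VideoSales", 74_000),
    (1996, "East", "CassetteSales", 89_000), (1996, "East", "VideoSales", 115_000),
    (1996, "West", "CassetteSales", 87_000), (1996, "West", "VideoSales", 86_000),
    (1997, "Central", "CassetteSales", 82_000), (1997, "Central", "VideoSales", 85_000),
    (1997, "East", "CassetteSales", 101_000), (1997, "East", "VideoSales", 137_000),
    (1997, "West", "CassetteSales", 96_000), (1997, "West", "VideoSales", 97_000),
]
SALES = [dict(zip(DIMS + ("Profit",), row)) for row in RAW]


def grouping_sets(dims):
    """Every subset of the dimensions: 2**n of them, most detailed first."""
    return [
        subset
        for size in range(len(dims), -1, -1)
        for subset in combinations(dims, size)
    ]


def cube(rows, dims, measure):
    """Emulate GROUP BY CUBE: one GROUP BY per grouping set, results concatenated."""
    out = []
    for kept in grouping_sets(dims):
        totals = defaultdict(int)
        for row in rows:
            # Dimensions outside the grouping set are reported as None (SQL NULL).
            key = tuple(row[d] if d in kept else None for d in dims)
            totals[key] += row[measure]
        out.extend((*key, total) for key, total in totals.items())
    return out


sets = grouping_sets(DIMS)
result = cube(SALES, DIMS, "Profit")
```

Each grouping set plays the role of one SELECT in the UNION ALL approach, so three dimensions give eight grouping sets and, for this data, 36 result rows.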


Sales Example (Cont.)

Grouping

Two challenges arise with the use of ROLLUP and CUBE. First, how can we programmatically determine which result set rows are subtotals, and how do we find the exact level of aggregation of a given subtotal? We will often need to use subtotals in calculations such as percent-of-totals, so we need an easy way to determine which rows are the subtotals we seek. Second, what happens if query results contain both stored NULL values and "NULL" values created by a ROLLUP or CUBE? How does an application or developer differentiate between the two?


Sales Example (Cont.)

Grouping

To handle these issues, we have a function called GROUPING. Using a single column as its argument, GROUPING returns 1 when it encounters a NULL value created by a ROLLUP or CUBE operation. That is, if the NULL indicates the row is a subtotal, GROUPING returns 1. Any other type of value, including a stored NULL, returns 0.


Sales Example (Cont.)

Grouping – the query

SELECT Time, Region, Department, SUM(Profit),
    GROUPING (Time) AS T, GROUPING (Region) AS R, GROUPING (Department) AS D
FROM Sales
GROUP BY ROLLUP (Time, Region, Department)


Sales Example (Cont.)

Grouping – the result of query

Time    Region   Dept           Profit     T  R  D
1996    Central  CassetteSales  75,000     0  0  0
1996    Central  VideoSales     74,000     0  0  0
1996    Central  [NULL]         149,000    0  0  1
1996    East     CassetteSales  89,000     0  0  0
1996    East     VideoSales     115,000    0  0  0
1996    East     [NULL]         204,000    0  0  1
1996    West     CassetteSales  87,000     0  0  0
1996    West     VideoSales     86,000     0  0  0
1996    West     [NULL]         173,000    0  0  1
1996    [NULL]   [NULL]         526,000    0  1  1
1997    Central  CassetteSales  82,000     0  0  0
1997    Central  VideoSales     85,000     0  0  0
1997    Central  [NULL]         167,000    0  0  1
1997    East     CassetteSales  101,000    0  0  0
1997    East     VideoSales     137,000    0  0  0
1997    East     [NULL]         238,000    0  0  1
1997    West     CassetteSales  96,000     0  0  0
1997    West     VideoSales     97,000     0  0  0
1997    West     [NULL]         193,000    0  0  1
1997    [NULL]   [NULL]         598,000    0  1  1
[NULL]  [NULL]   [NULL]         1,124,000  1  1  1


Grouping

Time    Region  Profit
1996    East    200,000
1996    [NULL]  200,000
[NULL]  East    200,000
[NULL]  [NULL]  190,000
[NULL]  [NULL]  190,000
[NULL]  [NULL]  190,000
[NULL]  [NULL]  390,000

This table shows an ambiguous result set created using the CUBE extension.


Grouping (Cont.)

We can resolve the ambiguity by using the GROUPING and other functions in the code below

SELECT
    decode(grouping(Time), 1, 'All Times', Time) AS Time,
    decode(grouping(Region), 1, 'All Regions', Region) AS Region,
    sum(Profit)
FROM Sales
GROUP BY CUBE(Time, Region)


Grouping (Cont.)

The code result

Time       Region       Profit
1996       East         200,000
1996       All Regions  200,000
All Times  East         200,000
[NULL]     [NULL]       190,000
[NULL]     All Regions  190,000
All Times  [NULL]       190,000
All Times  All Regions  390,000
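The effect of combining GROUPING with decode can be imitated in Python. The sketch below assumes two base rows that would produce the ambiguous result (one row for 1996/East and one row whose Time and Region are stored NULLs); that base data is inferred from the result table, and the helper names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

# Assumed base rows: (Time, Region, Profit). None stands for a stored NULL.
ROWS = [(1996, "East", 200_000), (None, None, 190_000)]
LABELS = ("All Times", "All Regions")


def cube_with_grouping(rows, n_dims=2):
    """Emulate CUBE and also return the GROUPING() flag of every dimension."""
    out = []
    for size in range(n_dims, -1, -1):
        for kept in combinations(range(n_dims), size):
            totals = defaultdict(int)
            for row in rows:
                key = tuple(row[i] if i in kept else None for i in range(n_dims))
                totals[key] += row[-1]
            # Flag is 1 only when the None was created by the aggregation itself.
            flags = tuple(0 if i in kept else 1 for i in range(n_dims))
            out.extend((key, flags, total) for key, total in totals.items())
    return out


def decode_labels(key, flags):
    """Mimic decode(grouping(col), 1, label, col) for each dimension."""
    return tuple(
        LABELS[i] if flag else value
        for i, (value, flag) in enumerate(zip(key, flags))
    )


raw = cube_with_grouping(ROWS)
labelled = [(*decode_labels(key, flags), total) for key, flags, total in raw]
```

In `raw`, four rows share the key `(None, None)` and cannot be told apart without the flags; in `labelled`, only the base row with stored NULLs still shows `None` in both columns.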


Grouping (Cont.)

We can also use the GROUPING function to filter results. The query below retrieves a subset of the subtotals created by a CUBE and none of the base-level aggregations. The HAVING clause constrains columns which use GROUPING functions.

SELECT Time, Region, Department, SUM(Profit) AS Profit,
    GROUPING (Time) AS T, GROUPING (Region) AS R, GROUPING (Department) AS D
FROM Sales
GROUP BY CUBE (Time, Region, Department)
HAVING (D=1 AND R=1 AND T=1)
    OR (R=1 AND D=1)
    OR (T=1 AND D=1)


Grouping (Cont.)

The query result

Time    Region   Department  Profit
1996    [NULL]   [NULL]      526,000
1997    [NULL]   [NULL]      598,000
[NULL]  Central  [NULL]      316,000
[NULL]  East     [NULL]      442,000
[NULL]  West     [NULL]      366,000
[NULL]  [NULL]   [NULL]      1,124,000
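The filtering step can be sketched in Python as well: compute the full cube together with its grouping flags, then keep only the rows that satisfy the same condition as the HAVING clause. The base rows are the 1996 and 1997 profit figures from the sales example; the function names `cube_with_flags` and `having` are hypothetical.

```python
from collections import defaultdict
from itertools import product

# Base rows: (Time, Region, Department, Profit)
SALES = [
    (1996, "Central", "CassetteSales", 75_000), (1996, "Central", "VideoSales", 74_000),
    (1996, "East", "CassetteSales", 89_000), (1996, "East", "VideoSales", 115_000),
    (1996, "West", "CassetteSales", 87_000), (1996, "West", "VideoSales", 86_000),
    (1997, "Central", "CassetteSales", 82_000), (1997, "Central", "VideoSales", 85_000),
    (1997, "East", "CassetteSales", 101_000), (1997, "East", "VideoSales", 137_000),
    (1997, "West", "CassetteSales", 96_000), (1997, "West", "VideoSales", 97_000),
]


def cube_with_flags(rows, n_dims=3):
    """Emulate CUBE; each result row carries its (T, R, D) grouping flags."""
    out = []
    for flags in product((0, 1), repeat=n_dims):  # 1 = dimension aggregated away
        totals = defaultdict(int)
        for row in rows:
            key = tuple(None if flag else row[i] for i, flag in enumerate(flags))
            totals[key] += row[-1]
        out.extend((key, flags, total) for key, total in totals.items())
    return out


def having(flags):
    """Same condition as the HAVING clause in the query above."""
    t, r, d = flags
    return (d == 1 and r == 1 and t == 1) or (r == 1 and d == 1) or (t == 1 and d == 1)


subtotals = [
    (*key, total)
    for key, flags, total in cube_with_flags(SALES)
    if having(flags)
]
```

Every branch of the condition requires D=1, so base-level rows and every subtotal that still shows a department are dropped, leaving the per-year totals, the per-region totals and the grand total.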


Roll up Example

SELECT Year, Quarter, Month, SUM(Profit) AS Profit

FROM sales GROUP BY ROLLUP(Year, Quarter, Month)


Roll up Example

The query result

Year  Quarter  Month      Profit
1997  Winter   Jan        55,000
1997  Winter   Feb        64,000
1997  Winter   March      71,000
1997  Winter   [NULL]     190,000
1997  Spring   April      75,000
1997  Spring   May        86,000
1997  Spring   June       88,000
1997  Spring   [NULL]     249,000
1997  Summer   July       91,000
1997  Summer   August     87,000
1997  Summer   September  101,000
1997  Summer   [NULL]     279,000
1997  Fall     October    109,000
1997  Fall     November   114,000
1997  Fall     December   133,000
1997  Fall     [NULL]     356,000
1997  [NULL]   [NULL]     1,074,000


References
1. Keith C.C. Chan, 2003, Data Warehousing & Data Mining, The Hong Kong Polytechnic University
2. Dr. Mourad YKHLEF, 2009, Decision Support System, King Saud University
3. -, -, Decision Support Technology, The Heinz School
4. Ahmed M. Zeki, 2004, Data Mining & Data Warehousing, INFO 6630
5. Dan St. Clair, 2002, Lect 1 – Intro. To Data Mining & Data Warehouses, University of Missouri-Rolla
6. S. Sudarshan; Krithi Ramamritham, -, Data Warehouse and Data Mining, IIT Bombay
7. Chris Clifton, 2004, Data Warehousing, Purdue University
8. Richard J. Roiger, -, The Data Warehouse, -
9. Hugh J. Watson, -, Recent Developments in Data Warehousing, http://www.terry.uga.edu/~hwatson/dw_tutorial.ppt, accessed 17-09-2010
10. Mark Isken, -, Data Warehousing and Online Analytical Processing (OLAP), -
11. Ari Cahyono, -, Introduction to Data Warehouse, Magister Teknologi Informasi UGM


Reference Library


BI Resources
The Data Warehousing Institute http://www.tdwi.org/
Kimball and Associates http://www.ralphkimball.com./html/articles.html
A Dimensional Modeling Manifesto – Kimball, R. http://www.dbmsmag.com/9708d15.html
DSS Resources http://dssresources.com/
Data Warehousing Information Center http://www.dwinfocenter.org/
Intelligent Enterprise http://www.intelligententerprise.com/
DM Review http://dmreview.com/
KDNuggets http://www.kdnuggets.com/
IT Toolbox http://www.ittoolbox.com/ http://businessintelligence.ittoolbox.com/ http://datawarehouse.ittoolbox.com/
OLAP Report http://www.olapreport.com/
o Some free stuff (nice history of OLAP and commentary on industry trends)
o Other stuff costs $
http://www.mosha.com/msolap/
o Awesome set of resources from the lead developer on MS SQL Server Analysis Server team


Some Good Books and Articles

The Data Warehouse Toolkit – Kimball, R.
o Definitive, Microsoft SQL Server 2005 based 3rd edition now out
OLAP Solutions – Thomsen, E.
o Definitive, abstract and dense, good
MDX Solutions: With Microsoft SQL Server Analysis Services 2005 and Hyperion Essbase by George Spofford
o MDX is the “SQL” for cubes
Data Mining with SQL Server 2005 (Paperback) by ZhaoHui Tang (Author), Jamie MacLennan
Data Warehouse Design Solutions – Adamson and Venerable
o Multi-D DW designs from lots of different industries
o Very practical, uses realistic situations to reinforce the concepts
Summers Rubber Company designs its data warehouse; Gorla, Narasimhaiah; Krehbiel, Steve; Interfaces; Mar/Apr 1999; 29, 2; ABI/INFORM Global


More AS Tutorials and Resources

http://www.mosha.com/msolap/
o This is the granddaddy of MS SQL Server Analysis Services resources. Mosha Pasumansky is the MS development lead on AS engine.
o Site is gold mine of information and links regarding AS and related software
o He participates in microsoft.public.sqlserver.olap
o Great Blog at http://www.sqljunkies.com/WebLog/mosha/
Introduction to Analysis Services - by William Pearson (series of articles) http://www.databasejournal.com/article.php/1459531/
o Very nice series of MS AS tutorials
Best practices for Business Intelligence using the Microsoft Data Warehousing Framework
o A white paper from Microsoft