![Page 1: Data Mining and Data Warehousing Henryk Maciejewski Data](https://reader030.vdocuments.site/reader030/viewer/2022012011/613d17ed736caf36b7593c36/html5/thumbnails/1.jpg)
Data Mining and Data Warehousing
Henryk Maciejewski
Data Warehousing and OLAP
![Page 2: Data Mining and Data Warehousing Henryk Maciejewski Data](https://reader030.vdocuments.site/reader030/viewer/2022012011/613d17ed736caf36b7593c36/html5/thumbnails/2.jpg)
Part II Data Warehousing – Contents
• OLAP Approach to Data Analysis
• Database for OLAP = Data Warehouse
  – Logical model
  – Physical models (ROLAP, MOLAP, HOLAP)
• Querying multidimensional data
• DW project methodologies
![Page 3: Data Mining and Data Warehousing Henryk Maciejewski Data](https://reader030.vdocuments.site/reader030/viewer/2022012011/613d17ed736caf36b7593c36/html5/thumbnails/3.jpg)
Further Reading
• J. Han, M. Kamber: Data Mining: Concepts and Techniques, Second Edition, Elsevier 2006.
• W. Inmon: Building the Data Warehouse, Wiley 2005.
• F. Silvers: Building and Maintaining a Data Warehouse, CRC Press 2008.
• www.information-management.com
![Page 4: Data Mining and Data Warehousing Henryk Maciejewski Data](https://reader030.vdocuments.site/reader030/viewer/2022012011/613d17ed736caf36b7593c36/html5/thumbnails/4.jpg)
From DBMS to Analytical Systems...
• The 1960s: first IT systems
• The 1970s:
  • DBMS systems
  • On-line transactional processing systems (OLTP)
• The 1990s:
  • On-line analytical processing (OLAP), data warehousing, data mining – Business Intelligence (BI), DSS
![Page 5: Data Mining and Data Warehousing Henryk Maciejewski Data](https://reader030.vdocuments.site/reader030/viewer/2022012011/613d17ed736caf36b7593c36/html5/thumbnails/5.jpg)
IT Systems Generate Data Deluge
• IT Systems in:
  • Retail trade – bar codes, credit cards, …
  • Banking, insurance, telecoms, healthcare, etc.
  • Science (biology, weather/Earth monitoring, sky surveys, ...)
• Data Deluge
  • WalMart: 20 million transactions per day
  • Mobil: ca. 100 TB of data (exploration of oil reserves)
  • Human Genome Project: ~GB of data
  • NASA Earth Observing System: 50 GB per hour (!)
  • DISS solar energy plant monitoring: ~800 numbers / 5 secs
![Page 6: Data Mining and Data Warehousing Henryk Maciejewski Data](https://reader030.vdocuments.site/reader030/viewer/2022012011/613d17ed736caf36b7593c36/html5/thumbnails/6.jpg)
How to Get Information out of Data
• Efficient technologies available to gather and store data
• Simple approaches to data analysis prove inefficient
  • Spreadsheet based, SQL query based, ...
• Technologies + tools needed for efficient data analysis / knowledge extraction from data
  • Hence OLAP, KDD (Knowledge Discovery in Databases), DM emerged
• Information – data in context; data that have meaning, relevance and purpose
![Page 7: Data Mining and Data Warehousing Henryk Maciejewski Data](https://reader030.vdocuments.site/reader030/viewer/2022012011/613d17ed736caf36b7593c36/html5/thumbnails/7.jpg)
Various Approaches to Data Analysis
• SQL
  – SQL queries to "raw" data
• Data Warehouse / OLAP
  – Multidimensional data model: y(w1, w2, ..., wn)
  – Database for OLAP
  – Integrated data (ETL – Extract-Transform-Load)
• Data Mining
  – Discovering relationships in data
  – E.g., customer profiles, models to assess credit risk, etc.
![Page 8: Data Mining and Data Warehousing Henryk Maciejewski Data](https://reader030.vdocuments.site/reader030/viewer/2022012011/613d17ed736caf36b7593c36/html5/thumbnails/8.jpg)
Data Analysis Techniques – SQL Queries
[Diagram: a programmer / DB admin writes an SQL program that queries several data sources to answer a "cross-sectional" question and produce a report]
Drawbacks:
• Considerable coding effort
• Heavy load on OLTP servers
• Multiple versions of the truth …
![Page 9: Data Mining and Data Warehousing Henryk Maciejewski Data](https://reader030.vdocuments.site/reader030/viewer/2022012011/613d17ed736caf36b7593c36/html5/thumbnails/9.jpg)
Data Warehouse (W. Inmon 1992)
[Diagram: source data (several sources) → ETL → Data Warehouse / Data Mart → OLAP / DSS]
• ETL: data access, data integration (cleaning, transformation)
• Data Warehouse / Data Mart: specific structure of database optimized for OLAP (MDDB, "snowflake", "star schema", ROLAP, MOLAP, HOLAP)
![Page 10: Data Mining and Data Warehousing Henryk Maciejewski Data](https://reader030.vdocuments.site/reader030/viewer/2022012011/613d17ed736caf36b7593c36/html5/thumbnails/10.jpg)
Why OLAP Technology Is Becoming Indispensable
• Getting information out of historical data
• Integration of data sources in the enterprise
• "Cross-sectional" analyses of enterprise data
  → discovering relationships / patterns in large amounts of data
  → trend analysis
  → data mining
![Page 11: Data Mining and Data Warehousing Henryk Maciejewski Data](https://reader030.vdocuments.site/reader030/viewer/2022012011/613d17ed736caf36b7593c36/html5/thumbnails/11.jpg)
OLAP/Data Warehouse – Key Design Issues
• Data organization
  • Multidimensional data model (facts seen as a function of dimensions)
  • Physical data storage that allows for fast (online) analysis of vast data volumes
• Data integration
  • Ensure high quality of analytical data
  • "Taming the data chaos"
  • Single version of the truth
![Page 12: Data Mining and Data Warehousing Henryk Maciejewski Data](https://reader030.vdocuments.site/reader030/viewer/2022012011/613d17ed736caf36b7593c36/html5/thumbnails/12.jpg)
OLAP vs. OLTP – Different Applications and Data Model
• OLTP
  – operational data
  – automation of day-to-day operations of organization:
    → phone-call billing, orders / invoices processing, banking / credit card transactions, etc.
• OLAP
  – analytical data
  – getting information for decision support
    → Who are our best customers (characteristics)?
    → Churn analysis
    → How does increase in sales correlate with quality of service?
![Page 13: Data Mining and Data Warehousing Henryk Maciejewski Data](https://reader030.vdocuments.site/reader030/viewer/2022012011/613d17ed736caf36b7593c36/html5/thumbnails/13.jpg)
OLAP vs. OLTP – Summary

| Problem | OLTP | OLAP |
|---|---|---|
| Main applications | Automation of operations of organization: entering data on routine day-to-day transactions; fixed-structure reports / summaries created on a regular basis (daily, monthly, etc.) | Decision support: multidimensional statistical analyses, forecasting, ad hoc queries; advanced reporting |
| Time horizon for data retention | Usually short term (90 days, 1 year) | Long-term data retention, to support historical data analyses, comparative reports, trend analysis over time |
| Data updates | "On the fly", during individual transactions | Static data, updated on a regular basis (e.g., monthly); data collected over time (time-stamped) |
| Data access | Frequent access to small portions of data (a few or tens of records); simple, well-structured queries | Rare access involving large amounts of data; complex, ad hoc queries |
![Page 14: Data Mining and Data Warehousing Henryk Maciejewski Data](https://reader030.vdocuments.site/reader030/viewer/2022012011/613d17ed736caf36b7593c36/html5/thumbnails/14.jpg)
Schedule
• OLAP approach to data analysis
  – OLAP vs. OLTP
  – OLAP – data integration
• Database for OLAP = Data Warehouse
  – Logical data model – multidimensionality
  – Physical data models (ROLAP, MOLAP, HOLAP)
![Page 15: Data Mining and Data Warehousing Henryk Maciejewski Data](https://reader030.vdocuments.site/reader030/viewer/2022012011/613d17ed736caf36b7593c36/html5/thumbnails/15.jpg)
"Data Chaos" – Why It Is Hard to Run Analytics Based on OLTP
• Main obstacles to building successful OLAP "on top" of transactional data:
  – Data awareness
  – Data understanding
  – Data variability
  – Data redundancy (and hence consistency)
• "Data islands" in disparate transactional systems
![Page 16: Data Mining and Data Warehousing Henryk Maciejewski Data](https://reader030.vdocuments.site/reader030/viewer/2022012011/613d17ed736caf36b7593c36/html5/thumbnails/16.jpg)
Data Chaos – Example
[Diagram: separate databases in two faculties feeding a data warehouse]
• Faculty of EE: Courses DB, Teachers DB, Notes DB, Recruitment DB
• Faculty of Architecture: Tutors DB, Exam results DB, Courses DB
Problems / difficulties on the way to the data warehouse:
→ how to find data
→ how to extract data
→ understand the meaning
→ clean the data
![Page 17: Data Mining and Data Warehousing Henryk Maciejewski Data](https://reader030.vdocuments.site/reader030/viewer/2022012011/613d17ed736caf36b7593c36/html5/thumbnails/17.jpg)
Business Intelligence Based on OLTP?
• How to get to the data in the DB?
• How to locate the right table / column?
• How to understand the meaning of the data?
• How to clean the data?
![Page 18: Data Mining and Data Warehousing Henryk Maciejewski Data](https://reader030.vdocuments.site/reader030/viewer/2022012011/613d17ed736caf36b7593c36/html5/thumbnails/18.jpg)
Dedicated System for BI (OLAP)
• ETL (Extract Transform Load)
  – Connect to source DB
  – Integrate / clean
  – Transform to the multidimensional model
• Multidimensional model of data (facts vs. dimensions)
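The three ETL steps can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the two source lists, the attribute names and the sex-code mapping are hypothetical and do not come from the lab system.

```python
from collections import defaultdict

# Two hypothetical source systems that code the same attribute differently.
SOURCE_A = [  # (student_id, sex, course, grade)
    (1, "male", "Databases", 4.0),
    (2, "female", "Databases", 5.0),
]
SOURCE_B = [
    (3, "f", "Statistics", 3.5),
    (4, "m", "Statistics", 4.5),
    (5, "?", "Statistics", 5.0),  # dirty record: unknown code
]

# Assumed target convention for the integrated data.
SEX_CODES = {"male": "M", "m": "M", "female": "F", "f": "F"}


def extract():
    """Extract: read rows from every source."""
    yield from SOURCE_A
    yield from SOURCE_B


def transform(rows):
    """Integrate / clean: map sex codes to one convention, drop bad rows."""
    for _student_id, sex, course, grade in rows:
        code = SEX_CODES.get(sex.lower())
        if code is None:
            continue  # a real pipeline would log or quarantine the row
        yield {"sex": code, "course": course, "grade": grade}


def load(records):
    """Load: store the measure (grade) keyed by its dimensions."""
    cube = defaultdict(list)
    for record in records:
        cube[(record["sex"], record["course"])].append(record["grade"])
    return dict(cube)


warehouse = load(transform(extract()))
```

The result is keyed by dimension values (sex, course), which is already the facts-vs-dimensions view described on the slide.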
![Page 19: Data Mining and Data Warehousing Henryk Maciejewski Data](https://reader030.vdocuments.site/reader030/viewer/2022012011/613d17ed736caf36b7593c36/html5/thumbnails/19.jpg)
Example: Multidimensional Model
Cubes: Over-hours, Availability, Fuel consumption
![Page 20: Data Mining and Data Warehousing Henryk Maciejewski Data](https://reader030.vdocuments.site/reader030/viewer/2022012011/613d17ed736caf36b7593c36/html5/thumbnails/20.jpg)
Example: ETL Process
ETL for the cube Availability
![Page 21: Data Mining and Data Warehousing Henryk Maciejewski Data](https://reader030.vdocuments.site/reader030/viewer/2022012011/613d17ed736caf36b7593c36/html5/thumbnails/21.jpg)
Data Warehouse – Definition
Data Warehouse – subject-oriented, integrated, time-varying, non-volatile collection of data that is used primarily in organizational decision making.
Subject-oriented – data is organized around subjects of interest to the data analyst (e.g., customer, product, supplier); transactional systems are process-oriented (e.g., order processing).
Integrated – data warehouse integrates data from several data sources; data characteristics (attributes) must be coded in a consistent way (e.g., consistent coding of SEX: 'male'/'female', 'm'/'f', 0/1).
Non-volatile – data loaded into the data warehouse is a "snapshot" of operational data at a specific point in time; once loaded, data in the warehouse cannot be changed.
Time-varying – data elements in the warehouse are time-stamped to facilitate analysis of changes / trends over time.
![Page 22: Data Mining and Data Warehousing Henryk Maciejewski Data](https://reader030.vdocuments.site/reader030/viewer/2022012011/613d17ed736caf36b7593c36/html5/thumbnails/22.jpg)
Summary of This Part
• Concept of OLTP and OLAP
  – Different use, different requirements for
    • Data organization (data model)
    • Database design
• Need for data integration
  – Overcoming "data chaos"
  – Ensuring high quality of analytical data in warehouse
![Page 23: Data Mining and Data Warehousing Henryk Maciejewski Data](https://reader030.vdocuments.site/reader030/viewer/2022012011/613d17ed736caf36b7593c36/html5/thumbnails/23.jpg)
Example: OLAP for Student Notes
![Page 24: Data Mining and Data Warehousing Henryk Maciejewski Data](https://reader030.vdocuments.site/reader030/viewer/2022012011/613d17ed736caf36b7593c36/html5/thumbnails/24.jpg)
Example: OLAP for Student Notes
![Page 25: Data Mining and Data Warehousing Henryk Maciejewski Data](https://reader030.vdocuments.site/reader030/viewer/2022012011/613d17ed736caf36b7593c36/html5/thumbnails/25.jpg)
Example: OLAP for Student Notes
![Page 26: Data Mining and Data Warehousing Henryk Maciejewski Data](https://reader030.vdocuments.site/reader030/viewer/2022012011/613d17ed736caf36b7593c36/html5/thumbnails/26.jpg)
Example: IBM Tivoli Monitoring Data Warehouse
• Monitoring agents – keep 24 h of detailed data
• Data Warehouse – aggregated, time-stamped data drawn from agents
![Page 27: Data Mining and Data Warehousing Henryk Maciejewski Data](https://reader030.vdocuments.site/reader030/viewer/2022012011/613d17ed736caf36b7593c36/html5/thumbnails/27.jpg)
Example: IBM Tivoli Monitoring Data Warehouse

| Agent | Default attribute groups |
|---|---|
| Monitoring Agent for Windows OS | Network_Interface, NT_Processor, NT_Logical_Disk, NT_Memory, NT_Physical_Disk, NT_Server, NT_System |
| Monitoring Agent for UNIX | Disk, System |
| Monitoring Agent for Linux | Linux_CPU, Linux_CPU_Averages, Linux_CPU_Config, Linux_Disk, Linux_Disk_IO, Linux_Disk_Usage_Trends, Linux_IO_Ext, Linux_Network, Linux_NFS_Statistics |
| Monitoring Agent for DB2 | KUDDBASEGROUP00, KUDDBASEGROUP01, KUDBUFFERPOOL00, KUDINFO00, KUDTABSPACE |
![Page 28: Data Mining and Data Warehousing Henryk Maciejewski Data](https://reader030.vdocuments.site/reader030/viewer/2022012011/613d17ed736caf36b7593c36/html5/thumbnails/28.jpg)
Schedule
• Multidimensional Model of OLAP Data
• Why OLAP Doesn’t Like Normalized DB
• Relational OLAP (ROLAP)
• Multidimensional OLAP (MOLAP)
• Hybrid OLAP (HOLAP)
![Page 29: Data Mining and Data Warehousing Henryk Maciejewski Data](https://reader030.vdocuments.site/reader030/viewer/2022012011/613d17ed736caf36b7593c36/html5/thumbnails/29.jpg)
OLAP: Multidimensional Model of Data
• OLAP = multidimensional analysis of data
• Multidimensional model of data:
  – Measure as a value in multidimensional space of dimensions
  – Numeric measures – objects of analysis, also referred to as facts
  – Dimensions – variables on which the measure depends / that uniquely determine the measure
• E.g., measure: sales [$]; dimensions: product, shop, date
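The idea of a measure as a function of its dimensions can be sketched in Python as a mapping from dimension values to the measure. All numbers and member names below are made up for illustration.

```python
# sales [$] as a function of (product, shop, date); all values are invented.
sales = {
    ("milk", "shop_1", "2001-01-21"): 120.0,
    ("milk", "shop_2", "2001-01-21"): 80.0,
    ("bread", "shop_1", "2001-01-21"): 45.5,
    ("bread", "shop_1", "2001-01-22"): 50.0,
}


def measure(product, shop, date):
    """Value of the measure at one point of the dimension space."""
    return sales.get((product, shop, date), 0.0)


def total(product=None, shop=None, date=None):
    """Sum the measure over every dimension that is left unspecified."""
    wanted = (product, shop, date)
    return sum(
        value
        for key, value in sales.items()
        if all(w is None or w == k for w, k in zip(wanted, key))
    )
```

For example, `total(product="milk")` sums over all shops and dates, which is the kind of question an OLAP tool answers interactively.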
![Page 30: Data Mining and Data Warehousing Henryk Maciejewski Data](https://reader030.vdocuments.site/reader030/viewer/2022012011/613d17ed736caf36b7593c36/html5/thumbnails/30.jpg)
OLAP: Multidimensional Model of Data
• Dimension hierarchies, e.g.,
  – Geographical hierarchy: shop – city – region – country
  – Time hierarchy: day of week – week – month – year
  – Product hierarchy: item – type – group
![Page 31: Data Mining and Data Warehousing Henryk Maciejewski Data](https://reader030.vdocuments.site/reader030/viewer/2022012011/613d17ed736caf36b7593c36/html5/thumbnails/31.jpg)
Example – Model Built in Lab
• Multidimensional model for analysis of students’ notes:
  – Measure: student’s grade (note)
  – Dimensions:
    • Characteristics of students
    • Characteristics of teachers
    • Characteristics of courses (group of courses, type of courses, etc.)
    • Time hierarchy: calendar semester – year
    • Workload of students / teachers, etc.
  – Various statistics will be of interest, e.g., average grade, number of grades, std deviation, distribution, ...
![Page 32: Data Mining and Data Warehousing Henryk Maciejewski Data](https://reader030.vdocuments.site/reader030/viewer/2022012011/613d17ed736caf36b7593c36/html5/thumbnails/32.jpg)
Useful Concepts
– Aggregation: e.g., computing total sales by year based on more detailed data
– Drill-down: create more detailed view (i.e., decrease level of aggregation)
– Rollup: increase level of aggregation
– Slice-and-dice: reduce dimensionality of data: fix values of some dimensions and observe how data depends on the remaining dimensions
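These operations can be illustrated on a tiny fact table. The sketch below is a simplification with invented data; `aggregate` and `slice_` are hypothetical helper names, not part of any OLAP product.

```python
from collections import defaultdict

# Fact rows: (shop, month, year, product, sales); values are invented.
FACTS = [
    ("shop_1", 1, 2001, "milk", 10),
    ("shop_1", 2, 2001, "milk", 20),
    ("shop_2", 1, 2001, "bread", 5),
    ("shop_2", 1, 2002, "milk", 7),
]
DIMS = ("shop", "month", "year", "product")


def aggregate(rows, by):
    """Aggregation: total sales grouped by the dimensions listed in `by`."""
    idx = [DIMS.index(d) for d in by]
    out = defaultdict(int)
    for row in rows:
        out[tuple(row[i] for i in idx)] += row[-1]
    return dict(out)


def slice_(rows, **fixed):
    """Slice: keep only rows where the given dimensions have fixed values."""
    return [
        r for r in rows
        if all(r[DIMS.index(d)] == v for d, v in fixed.items())
    ]


by_year_month = aggregate(FACTS, by=("year", "month"))  # drill-down view
by_year = aggregate(FACTS, by=("year",))                # rollup of the above
milk_by_shop = aggregate(slice_(FACTS, product="milk"), by=("shop",))
```

Going from `by_year` to `by_year_month` is a drill-down, the opposite direction is a rollup, and `milk_by_shop` fixes one dimension and looks at another.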
![Page 34: Data Mining and Data Warehousing Henryk Maciejewski Data](https://reader030.vdocuments.site/reader030/viewer/2022012011/613d17ed736caf36b7593c36/html5/thumbnails/34.jpg)
Normalized DB (a Reminder)
• Database design for OLTP uses Entity Relationship diagrams and normalization techniques
• Normalized DB:
  – No data redundancy
  – Many tables with many-to-one relationships
  – Optimized for easy / fast updates of data
  – Efficient for constantly changing data
  – Efficient for OLTP
![Page 35: Data Mining and Data Warehousing Henryk Maciejewski Data](https://reader030.vdocuments.site/reader030/viewer/2022012011/613d17ed736caf36b7593c36/html5/thumbnails/35.jpg)
Normalized DB – Example
[Entity-relationship diagram with the following tables]
• Contact (Contact ID, Customer ID, Contact name, Contact type)
• Shipment (Shipment ID, Status, Order ID, Order item ID, Customer ID)
• Customer (Customer ID, Customer name, Address, City, ...)
• Order (Order ID, Customer ID, Order date, Sales rep ID)
• Sales rep (Sales rep ID, Sales rep name, District ID)
• District (District ID, District name, Manager)
• Order item (Order ID, Order item ID, Product ID, Quantity)
• Product (Product ID, Product name, Product type, ...)
Task – answer the following OLAP query:
Which products were sold to a particular group of customers within a specified time frame?
![Page 36: Data Mining and Data Warehousing Henryk Maciejewski Data](https://reader030.vdocuments.site/reader030/viewer/2022012011/613d17ed736caf36b7593c36/html5/thumbnails/36.jpg)
Normalized DB – Problems with OLAP Queries
• Many "join" operations on tables → low efficiency of SQL queries
• "Circular join paths" – a query can be answered in two different ways → different results possible
• Complicated database schema → SQL code difficult to build / maintain
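To make the join problem concrete, here is a minimal sketch that answers the task query on a cut-down version of the normalized schema using SQLite. The sample rows, the lower-case column names and the choice of `city` as the "customer group" are assumptions made for the example; the table is called `orders` because ORDER is a reserved word.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE customer   (customer_id INTEGER PRIMARY KEY, customer_name TEXT, city TEXT);
CREATE TABLE orders     (order_id INTEGER PRIMARY KEY, customer_id INTEGER, order_date TEXT);
CREATE TABLE order_item (order_id INTEGER, order_item_id INTEGER, product_id INTEGER, quantity INTEGER);
CREATE TABLE product    (product_id INTEGER PRIMARY KEY, product_name TEXT);
-- invented sample data
INSERT INTO customer   VALUES (1, 'Ann', 'Wroclaw'), (2, 'Bob', 'Berlin');
INSERT INTO orders     VALUES (10, 1, '2001-01-21'), (11, 2, '2001-01-22'), (12, 1, '2001-03-01');
INSERT INTO order_item VALUES (10, 1, 100, 2), (11, 1, 101, 1), (12, 1, 101, 5);
INSERT INTO product    VALUES (100, 'Milk'), (101, 'Bread');
""")

# Four tables must be joined to relate products to customers and dates.
QUERY = """
SELECT DISTINCT p.product_name
FROM customer c
JOIN orders o      ON o.customer_id = c.customer_id
JOIN order_item oi ON oi.order_id   = o.order_id
JOIN product p     ON p.product_id  = oi.product_id
WHERE c.city = ?
  AND o.order_date BETWEEN ? AND ?
ORDER BY p.product_name
"""
products = [
    row[0]
    for row in con.execute(QUERY, ("Wroclaw", "2001-01-01", "2001-01-31"))
]
```

Even this reduced schema needs three joins for a simple question; adding sales rep or district criteria would lengthen the join path further.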
![Page 37: Data Mining and Data Warehousing Henryk Maciejewski Data](https://reader030.vdocuments.site/reader030/viewer/2022012011/613d17ed736caf36b7593c36/html5/thumbnails/37.jpg)
OLAP: Requirements for Database Design
• Simplicity of database schema
• Efficiency of multidimensional queries
• Consistency and accuracy of data
• Database schemas to meet these requirements
  – Relational OLAP (ROLAP)
  – Multidimensional OLAP (MOLAP)
  – Hybrid OLAP (HOLAP)
![Page 39: Data Mining and Data Warehousing Henryk Maciejewski Data](https://reader030.vdocuments.site/reader030/viewer/2022012011/613d17ed736caf36b7593c36/html5/thumbnails/39.jpg)
Relational OLAP
• Warehouse data stored using a relational database server
• Multidimensional data model represented by a star-schema database or snowflake-schema database
• Star schema:
  – Single fact table
  – Single table for each dimension
  – A fact table entry consists of:
    • Aggregate value of the measure
    • Foreign keys to dimension tables (composite key of the fact table)
![Page 40: Data Mining and Data Warehousing Henryk Maciejewski Data](https://reader030.vdocuments.site/reader030/viewer/2022012011/613d17ed736caf36b7593c36/html5/thumbnails/40.jpg)
Relational OLAP
• Warehouse data stored using a relational database server
• Multidimensional data model represented by a star-schema database or snowflake-schema database
• Snowflake schema:
  – Variant of star schema with (some) dimension tables normalized (for easier maintenance of dimension data)
![Page 41: Data Mining and Data Warehousing Henryk Maciejewski Data](https://reader030.vdocuments.site/reader030/viewer/2022012011/613d17ed736caf36b7593c36/html5/thumbnails/41.jpg)
Example – Star Schema
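[Star schema diagram: one fact table linked to four dimension tables]
• Sales (fact table) – keys: Sales person ID, Product ID, Date ID, Customer ID; measures: Number sold, Amount
• Product – Product ID; Prod code, Prod name, Prod type, Prod category
• Customer – Customer ID; Name, Sex, Age, Job name
• Sales person – Sales person ID; Name, Region, Division, Office
• Date – Date ID; Date, Year, Month, Day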
![Page 42: Data Mining and Data Warehousing Henryk Maciejewski Data](https://reader030.vdocuments.site/reader030/viewer/2022012011/613d17ed736caf36b7593c36/html5/thumbnails/42.jpg)
Example – Snowflake Schema
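[Snowflake schema diagram: the star schema with the Customer dimension normalized]
• Sales (fact table) – keys: Sales person ID, Product ID, Date ID, Customer ID; measures: Number sold, Amount
• Product – Product ID; Prod code, Prod name, Prod type, Prod category
• Customer – Customer ID; Name, Sex, Age, Job ID
• Job Code – Job ID; Job name, Job category, …
• Sales person – Sales person ID; Name, Region, Division, Office
• Date – Date ID; Date, Year, Month, Day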
![Page 43: Data Mining and Data Warehousing Henryk Maciejewski Data](https://reader030.vdocuments.site/reader030/viewer/2022012011/613d17ed736caf36b7593c36/html5/thumbnails/43.jpg)
ROLAP – Example of OLAP Query
• OLAP query: How many products were sold to a specific group of customers in a given time frame?
Translates into the following SQL query (the date is written as a SAS date literal):
    select sum(number_sold) as number_sold
    from fact_sales a,
         dimension_date b,
         dimension_customer c
    where b.date = '21jan2001'd
      and c.sex = 'F'
      and a.dateID = b.dateID
      and a.customerID = c.customerID;
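The same query can be tried against a small star schema in SQLite. This is a sketch under assumptions: the table contents are invented, only two of the dimension tables are created, and the date is stored as ISO text instead of a SAS date literal.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE dimension_date (
    dateID INTEGER PRIMARY KEY, date TEXT, year INTEGER, month INTEGER, day INTEGER);
CREATE TABLE dimension_customer (
    customerID INTEGER PRIMARY KEY, name TEXT, sex TEXT, age INTEGER);
CREATE TABLE fact_sales (
    dateID      INTEGER REFERENCES dimension_date(dateID),
    customerID  INTEGER REFERENCES dimension_customer(customerID),
    number_sold INTEGER,
    amount      REAL);
-- invented sample data
INSERT INTO dimension_date VALUES
    (1, '2001-01-21', 2001, 1, 21), (2, '2001-01-22', 2001, 1, 22);
INSERT INTO dimension_customer VALUES
    (1, 'Ann', 'F', 30), (2, 'Bob', 'M', 41), (3, 'Eve', 'F', 25);
INSERT INTO fact_sales VALUES
    (1, 1, 3, 30.0), (1, 2, 5, 50.0), (1, 3, 2, 20.0), (2, 1, 7, 70.0);
""")

# Every dimension is exactly one join away from the fact table.
(number_sold,) = con.execute("""
    SELECT SUM(a.number_sold)
    FROM fact_sales a
    JOIN dimension_date b     ON a.dateID = b.dateID
    JOIN dimension_customer c ON a.customerID = c.customerID
    WHERE b.date = '2001-01-21'
      AND c.sex = 'F'
""").fetchone()
```

Compared with a normalized schema, the join path is short and has the same shape for any combination of dimensions.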
![Page 45: Data Mining and Data Warehousing Henryk Maciejewski Data](https://reader030.vdocuments.site/reader030/viewer/2022012011/613d17ed736caf36b7593c36/html5/thumbnails/45.jpg)
Multidimensional OLAP
• Warehouse data stored in a multidimensional database (MDDB)
• MDDB
  – Specialized storage facility that directly reflects the multidimensional model of data
  – MDDB can be viewed as an N-dimensional (hyper)cube in which values of the numerical measure (object of analysis) are stored
  – Data stored in MDDB is presummarized, i.e., values stored in cross sections of dimensions have been aggregated at the MDDB build time (thus performance of multidimensional (OLAP) queries is high)
![Page 46: Data Mining and Data Warehousing Henryk Maciejewski Data](https://reader030.vdocuments.site/reader030/viewer/2022012011/613d17ed736caf36b7593c36/html5/thumbnails/46.jpg)
MDDB – Idea
• Sample base table:
  – Analysis variable (fact): note
  – Classification variables (dimensions): attributes of students, attributes of teachers, semester, year, faculty, etc.
![Page 47: Data Mining and Data Warehousing Henryk Maciejewski Data](https://reader030.vdocuments.site/reader030/viewer/2022012011/613d17ed736caf36b7593c36/html5/thumbnails/47.jpg)
MDDB – Idea
    select sum(note) as SUM, count(*) as N, spec, semester, year
    from base_table
    where spec='INF' and semester=8 and year=2001
    group by spec, semester, year
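What "presummarized" means can be sketched in Python: the aggregation done by the SQL statement above is performed once, for every crossing, at build time, and a query becomes a lookup. The base-table rows are invented and `build_cube` is a hypothetical helper, not how any particular MDDB engine is implemented.

```python
from collections import defaultdict

# Base table rows: (spec, semester, year, note); values are invented.
BASE_TABLE = [
    ("INF", 8, 2001, 4.0),
    ("INF", 8, 2001, 5.0),
    ("INF", 7, 2001, 3.0),
    ("EKA", 8, 2001, 4.5),
]


def build_cube(rows):
    """Aggregate N and SUM for every (spec, semester, year) crossing."""
    cells = defaultdict(lambda: {"N": 0, "SUM": 0.0})
    for spec, semester, year, note in rows:
        cell = cells[(spec, semester, year)]
        cell["N"] += 1
        cell["SUM"] += note
    return dict(cells)


cube = build_cube(BASE_TABLE)   # done once, at MDDB build time
cell = cube[("INF", 8, 2001)]   # query time: a lookup, no table scan
```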
![Page 48: Data Mining and Data Warehousing Henryk Maciejewski Data](https://reader030.vdocuments.site/reader030/viewer/2022012011/613d17ed736caf36b7593c36/html5/thumbnails/48.jpg)
MDDB – Data Aggregation
• Each crossing of the cube contains specified statistics for the analysis variable(s)
• Distributive measures can be stored in cube, such as N, SUM, SUMWGT, UWSUM, NMISS, USS, MIN, MAX
• Algebraic measures can be computed from stored measures, such as AVG=SUM/N
![Page 49: Data Mining and Data Warehousing Henryk Maciejewski Data](https://reader030.vdocuments.site/reader030/viewer/2022012011/613d17ed736caf36b7593c36/html5/thumbnails/49.jpg)
MDDB – Data Aggregation
• Problem with holistic measures, i.e., measures for which no algebraic aggregate function exists, e.g., MEDIAN
• In large cube applications approximate values of holistic measures are computed using algebraic measures
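The difference between the three kinds of measures can be checked on a toy example. In this sketch (invented numbers, hypothetical helper names) two partial cells are merged: the distributive statistics merge exactly, AVG follows algebraically from SUM and N, while the median of partial medians does not equal the true median.

```python
from statistics import median


def summarize(values):
    """Distributive statistics stored in one cube cell."""
    return {"N": len(values), "SUM": sum(values),
            "MIN": min(values), "MAX": max(values)}


def merge(a, b):
    """Combine two cells without going back to the detailed data."""
    return {"N": a["N"] + b["N"], "SUM": a["SUM"] + b["SUM"],
            "MIN": min(a["MIN"], b["MIN"]), "MAX": max(a["MAX"], b["MAX"])}


def avg(cell):
    """Algebraic measure derived from stored distributive measures."""
    return cell["SUM"] / cell["N"]


part_1 = [1, 2, 9]
part_2 = [3, 4]

merged = merge(summarize(part_1), summarize(part_2))
true_median = median(part_1 + part_2)                                # 3
median_of_medians = median([median(part_1), median(part_2)])         # 2.75
```

Here `merged` equals the summary of the combined data, but `median_of_medians` differs from `true_median`, which is why holistic measures need the detailed data or an approximation.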
![Page 50: Data Mining and Data Warehousing Henryk Maciejewski Data](https://reader030.vdocuments.site/reader030/viewer/2022012011/613d17ed736caf36b7593c36/html5/thumbnails/50.jpg)
Cubes and Subcubes
• OLAP queries related to a subset of dimensions
  – Result is aggregated at query time from the NWAY cube
  – E.g., report on sales of all products over subsequent years – sum for all products and all months needs to be computed at run time
  – If there are many dimensions with high cardinality, this can be lengthy
• Subcubes are used to speed up performance for queries (related to subsets of dimensions) that users are likely to ask most frequently
![Page 51: Data Mining and Data Warehousing Henryk Maciejewski Data](https://reader030.vdocuments.site/reader030/viewer/2022012011/613d17ed736caf36b7593c36/html5/thumbnails/51.jpg)
Which Subcubes to Store?
Idea: find categories which will be used most frequently, with smallest cardinality
Starnet (spiral) model: put categories in ascending order of cardinality
Draw spiral starting with YEAR (most frequent use anticipated, lowest cardinality) ⇒ lists of categories = subcubes
YEAR SECTOR REGION GRP_SUPP MONTH GRP SHOP SUPPLIER FAMILY DAY ARTICLE
YEAR SECTOR REGION GRP_SUPP MONTH GRP SHOP SUPPLIER FAMILY DAY
...
YEAR SECTOR
YEAR
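The lists above are prefixes of the category ordering, which can be generated mechanically. The sketch below is a simplification: it orders categories by cardinality only (anticipated frequency of use is not modelled), and the cardinalities are hypothetical values chosen to reproduce the ordering on the slide.

```python
# Hypothetical cardinalities; only their relative order matters here.
CARDINALITY = {
    "YEAR": 5, "SECTOR": 8, "REGION": 16, "GRP_SUPP": 30, "MONTH": 60,
    "GRP": 100, "SHOP": 250, "SUPPLIER": 600, "FAMILY": 1500,
    "DAY": 1826, "ARTICLE": 20000,
}


def starnet_subcubes(cardinality):
    """Order categories by ascending cardinality and return every prefix,
    longest first, as a candidate subcube."""
    order = sorted(cardinality, key=cardinality.get)
    return [order[:k] for k in range(len(order), 0, -1)]


subcubes = starnet_subcubes(CARDINALITY)
```

The first entry is the full category list, the last two are `["YEAR", "SECTOR"]` and `["YEAR"]`, matching the lists shown above.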
![Page 52: Data Mining and Data Warehousing Henryk Maciejewski Data](https://reader030.vdocuments.site/reader030/viewer/2022012011/613d17ed736caf36b7593c36/html5/thumbnails/52.jpg)
Example: Building MDDB (SAS)
    proc mddb data=grades out=grades_mddb
              label='MDDB for analysis of grade data';
      class year sem sex faculty institute exam type id_title;
      var note / n sum min max;
      hierarchy year sem / name="Time Hierarchy";
      hierarchy faculty institute / name="Affiliation Hierarchy";
    run;

    NOTE: SAS/MDDB(R) Server Software has been initialized.
    NOTE: N-way complete cells=1455.
    NOTE: "Time Hierarchy" computed from "NWAY" cells=10.
    NOTE: "Affiliation Hierarchy" computed from "NWAY" cells=26.
    NOTE: PROCEDURE MDDB used:
          real time 1:26.54
          cpu time 1:19.82
![Page 53: Data Mining and Data Warehousing Henryk Maciejewski Data](https://reader030.vdocuments.site/reader030/viewer/2022012011/613d17ed736caf36b7593c36/html5/thumbnails/53.jpg)
Example: Building MDDB (SAS)
• DATA – specify base table for the MDDB
• CLASS statement – specify classification variables (i.e., NWAY cube dimensions)
• VAR statement – specify analysis variables (with statistics to be stored in MDDB – distributive aggregate functions)
• HIERARCHY statements – specify subcubes to include in MDDB
• Subcubes can be added / removed (ADDHIER, REMOVEHIER statements)
![Page 54: Data Mining and Data Warehousing Henryk Maciejewski Data](https://reader030.vdocuments.site/reader030/viewer/2022012011/613d17ed736caf36b7593c36/html5/thumbnails/54.jpg)
ROLAP vs. MOLAP

| MOLAP | ROLAP |
|---|---|
| Very high query performance | Very scalable |
| Easy maintenance | Lower query performance |
| Less scalable (fixed max size of a cube) | Design and maintenance more difficult |
| Problem with dimensions with very high cardinality | |
| Problem with constantly growing database | |

"Rule of thumb": use MOLAP as long as possible, then switch to ... HOLAP
![Page 56: Data Mining and Data Warehousing Henryk Maciejewski Data](https://reader030.vdocuments.site/reader030/viewer/2022012011/613d17ed736caf36b7593c36/html5/thumbnails/56.jpg)
HOLAP Data Model
[Diagram: viewer → Multidimensional data provider (MDP) with cache → MDDB and relational DB (star schema)]
Viewer (OLAP applications) sees a logical MDDB (or a proxy or virtual MDDB) which is presented by the MDP
![Page 57: Data Mining and Data Warehousing Henryk Maciejewski Data](https://reader030.vdocuments.site/reader030/viewer/2022012011/613d17ed736caf36b7593c36/html5/thumbnails/57.jpg)
HOLAP Techniques
• "Racking" – individual MDDBs for different values of one dimension (e.g., separate MDDBs for subsequent years)
• "Stacking" – different subcubes stored in separate MDDBs or tables (e.g., YEAR*COUNTRY*PRODUCT – local MDDB, YEAR*COUNTRY*PRODUCT*MONTH – on remote server)
[Diagram: Multidimensional data provider (MDP) on top of separate MDDBs for year = 2003, 2004, 2005, 2006]
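A rough sketch of racking: one cube per year, with a provider object that presents them as a single logical cube. The class name, its `query` method and all data are hypothetical and only illustrate the routing idea, not any vendor's MDP.

```python
class MultidimensionalDataProvider:
    """Presents several per-year cubes ("racks") as one logical cube."""

    def __init__(self, racks):
        # racks: year -> {(country, product): value}
        self.racks = racks

    def query(self, country, product, years=None):
        """Sum the measure over the requested years (all years if None),
        touching only the racks that are actually needed."""
        wanted = self.racks.keys() if years is None else years
        return sum(
            self.racks[year].get((country, product), 0)
            for year in wanted
            if year in self.racks
        )


# Invented sample data: one small cube per year.
mdp = MultidimensionalDataProvider({
    2003: {("PL", "milk"): 10, ("DE", "milk"): 4},
    2004: {("PL", "milk"): 12},
    2005: {("PL", "milk"): 15, ("PL", "bread"): 3},
})

all_years = mdp.query("PL", "milk")
recent = mdp.query("PL", "milk", years=[2004, 2005])
```

The viewer only calls `query`; which physical cube holds the data is hidden behind the provider.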
![Page 58: Data Mining and Data Warehousing Henryk Maciejewski Data](https://reader030.vdocuments.site/reader030/viewer/2022012011/613d17ed736caf36b7593c36/html5/thumbnails/58.jpg)
When to Use HOLAP?
• Too much data for one MDDB
• Access to existing ROLAP solutions
• Ensuring scalability with growing data volume
• Flexible integration of distributed data sources
• Improved performance – distributed processing of queries
• Price: HOLAP metadata must be maintained
![Page 59: Data Mining and Data Warehousing Henryk Maciejewski Data](https://reader030.vdocuments.site/reader030/viewer/2022012011/613d17ed736caf36b7593c36/html5/thumbnails/59.jpg)
DW Architectures – MOLAP
[Architecture diagram]
• OLTP data sources: RDB, ERP, flat files
• ETL into DW (ODS)
• Layers: Data Layer, OLAP Application Layer, Presentation Layer
• Components: RDBMS Server, MDDBS Server, MOLAP Engine (creates / stores cubes in MDDBs)
• Client access: MDX, XML/A
![Page 60: Data Mining and Data Warehousing Henryk Maciejewski Data](https://reader030.vdocuments.site/reader030/viewer/2022012011/613d17ed736caf36b7593c36/html5/thumbnails/60.jpg)
DW Architectures – ROLAP
[Architecture diagram]
• OLTP data sources: RDB, ERP, flat files
• ETL into DW (ODS)
• Layers: Data Layer, OLAP Application Layer, Presentation Layer
• Components: RDBMS Server, Analytical Server (issues complex SQL queries)
• Client access: MDX, XML/A
![Page 61: Data Mining and Data Warehousing Henryk Maciejewski Data](https://reader030.vdocuments.site/reader030/viewer/2022012011/613d17ed736caf36b7593c36/html5/thumbnails/61.jpg)
MS SQL Storage Settings
• Proactive caching
  – MOLAP – best performance; possible data latency (recent data changes not seen)
  – ROLAP – recent changes in data seen immediately; price – poor performance
  – Proactive caching: build MOLAP cache to boost performance
    • How frequently should the MOLAP cube be rebuilt?
    • Should the outdated MOLAP cube be queried while the cube is rebuilt?
    • Rebuild cubes on schedule or based on changes in data?
  – Minimize latency vs. maximize performance
• Partitions
  – Vertical: cubes based on subsets of rows in fact table
  – Horizontal: cubes based on separate fact tables (e.g., for subsequent years)
![Page 62: Data Mining and Data Warehousing Henryk Maciejewski Data](https://reader030.vdocuments.site/reader030/viewer/2022012011/613d17ed736caf36b7593c36/html5/thumbnails/62.jpg)
MS SQL Server Analysis Services Storage Settings
![Page 63: Data Mining and Data Warehousing Henryk Maciejewski Data](https://reader030.vdocuments.site/reader030/viewer/2022012011/613d17ed736caf36b7593c36/html5/thumbnails/63.jpg)
Standardizing Access to OLAP Data Sources – XML/A
• XML for Analysis (XML/A)
• Standard API between OLAP client and OLAP data provider
• Design goals:
  – Based on open standards, not bound to any language or technology
  – Optimized for the Web: stateless, with a minimal number of round trips
• Client and server communicate using XML, HTTP, SOAP
![Page 64: Data Mining and Data Warehousing Henryk Maciejewski Data](https://reader030.vdocuments.site/reader030/viewer/2022012011/613d17ed736caf36b7593c36/html5/thumbnails/64.jpg)
Standardizing Access to OLAP Data Sources – XML/A
• XML/A methods:
  – Discover – retrieve information (metadata) from provider, such as list of available cubes and their properties
  – Execute – request execution of a command by the server (MDX language command, e.g., OLAP MDX SELECT)
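As an illustration of the Execute method, the sketch below assembles a SOAP request body in Python. The element layout (Execute / Command / Statement plus a PropertyList) and the namespace URIs follow the XML for Analysis specification as far as I recall it and should be checked against the provider's documentation; the cube name `[Grades]` and catalog name `GradesDW` are hypothetical. Nothing is sent over the network.

```python
import xml.etree.ElementTree as ET

SOAP_NS = "http://schemas.xmlsoap.org/soap/envelope/"
XMLA_NS = "urn:schemas-microsoft-com:xml-analysis"


def build_execute_request(mdx, catalog):
    """Return the XML text of an XML/A Execute call carrying an MDX statement."""
    envelope = ET.Element(f"{{{SOAP_NS}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_NS}}}Body")
    execute = ET.SubElement(body, f"{{{XMLA_NS}}}Execute")

    # The command to run on the server.
    command = ET.SubElement(execute, f"{{{XMLA_NS}}}Command")
    ET.SubElement(command, f"{{{XMLA_NS}}}Statement").text = mdx

    # Properties that qualify the request (assumed minimal set).
    properties = ET.SubElement(execute, f"{{{XMLA_NS}}}Properties")
    plist = ET.SubElement(properties, f"{{{XMLA_NS}}}PropertyList")
    ET.SubElement(plist, f"{{{XMLA_NS}}}Catalog").text = catalog
    ET.SubElement(plist, f"{{{XMLA_NS}}}Format").text = "Multidimensional"

    return ET.tostring(envelope, encoding="unicode")


MDX = "SELECT [Measures].[Note Count] ON COLUMNS FROM [Grades]"
request_xml = build_execute_request(MDX, catalog="GradesDW")
```

A client would POST `request_xml` over HTTP to the provider's XML/A endpoint and parse the XML result; a Discover call has the same envelope with a different body.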
![Page 65: Data Mining and Data Warehousing Henryk Maciejewski Data](https://reader030.vdocuments.site/reader030/viewer/2022012011/613d17ed736caf36b7593c36/html5/thumbnails/65.jpg)
Multidimensional Expressions Language (MDX)
• Introduced by Microsoft in OLE DB for OLAP
• Now considered de facto standard for querying multidimensional data in OLAP cubes
• Simple form of MDX query expression:
    SELECT axis_specs ON COLUMNS,
           axis_specs ON ROWS
    FROM cube
    WHERE slicer_specs
![Page 66: Data Mining and Data Warehousing Henryk Maciejewski Data](https://reader030.vdocuments.site/reader030/viewer/2022012011/613d17ed736caf36b7593c36/html5/thumbnails/66.jpg)
MDX – By Examples
• Examples based on cube built in lab
• A tuple
  – uniquely identifies a cell in a cube
  – is defined by a combination of attribute members for different attributes
  – if some attribute is not specified, its All (default) member is used
  – if measure is not specified, the first (default) measure defined in the cube is used
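The defaulting rules for tuples can be emulated in plain Python. This is not MDX and not how an OLAP server works internally; it is a sketch with invented counts, a hypothetical two-attribute cube (sex, student group) and a single additive measure, so that the All member can be computed by summing.

```python
ATTRIBUTES = ("sex", "group")
DEFAULT_MEASURE = "Note Count"   # first measure defined in this toy cube

# Detailed cells keyed by (sex, group, measure); counts are invented.
CELLS = {
    ("M", "A", "Note Count"): 40,
    ("F", "A", "Note Count"): 60,
    ("M", "B", "Note Count"): 30,
    ("F", "B", "Note Count"): 20,
}


def cell_value(measure=DEFAULT_MEASURE, **members):
    """Resolve a partially specified tuple to one value.

    Attributes that are not given default to their All member, which for
    an additive measure is the sum over all members of that attribute.
    """
    coords = [members.get(attr, "All") for attr in ATTRIBUTES]
    return sum(
        value
        for key, value in CELLS.items()
        if key[-1] == measure
        and all(c in ("All", k) for c, k in zip(coords, key[:-1]))
    )


grand_total = cell_value()                    # like ([Measures].[Note Count])
males_in_a = cell_value(sex="M", group="A")   # fully specified tuple
all_males = cell_value(sex="M")               # group defaults to All
```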
![Page 67: Data Mining and Data Warehousing Henryk Maciejewski Data](https://reader030.vdocuments.site/reader030/viewer/2022012011/613d17ed736caf36b7593c36/html5/thumbnails/67.jpg)
MDX – Tuples
• [Measures].[Note Count] is a tuple
• To identify a cell, the All member of other attributes was used
![Page 68: Data Mining and Data Warehousing Henryk Maciejewski Data](https://reader030.vdocuments.site/reader030/viewer/2022012011/613d17ed736caf36b7593c36/html5/thumbnails/68.jpg)
MDX – Tuples
• Tuple points to male (M) students in Student Group (Studiengang) A
• Use ( ) to identify a tuple
![Page 69: Data Mining and Data Warehousing Henryk Maciejewski Data](https://reader030.vdocuments.site/reader030/viewer/2022012011/613d17ed736caf36b7593c36/html5/thumbnails/69.jpg)
MDX – Sets of Tuples
• Two tuples (Note Avg and Note Count) form a set
• Use { } to identify a set of tuples
![Page 70: Data Mining and Data Warehousing Henryk Maciejewski Data](https://reader030.vdocuments.site/reader030/viewer/2022012011/613d17ed736caf36b7593c36/html5/thumbnails/70.jpg)
MDX – Cartesian Products
• .Members MDX function lists members of an attribute
• on columns – axis 0
• on rows – axis 1
• (up to 128 axes)
• More axes → Cartesian product
![Page 71: Data Mining and Data Warehousing Henryk Maciejewski Data](https://reader030.vdocuments.site/reader030/viewer/2022012011/613d17ed736caf36b7593c36/html5/thumbnails/71.jpg)
MDX – Cartesian Products
• Now set of tuples is used in Axis 0 (columns) specification
• Each cell is produced as an intersection of its attribute members
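The way axes combine can be sketched with `itertools.product`: every cell of the result corresponds to one combination of a row member and a column member. The member names below are taken from the slides where possible (`M`, `A`, `Note Avg`, `Note Count`); `F` and `B` are assumed for illustration.

```python
from itertools import product


def crossjoin(*sets):
    """Cartesian product of member sets, returned as a list of tuples."""
    return list(product(*sets))


def grid(column_set, row_set):
    """Coordinates of every result cell: one (row, column) pair per cell."""
    return [[(row, col) for col in column_set] for row in row_set]


if __name__ == "__main__":
    # Two attributes crossed into a single set of tuples.
    print(crossjoin(["M", "F"], ["A", "B"]))
    # A set of measures on axis 0 (columns), gender members on axis 1 (rows).
    for row in grid(["Note Avg", "Note Count"], ["M", "F"]):
        print(row)
```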
![Page 72: Data Mining and Data Warehousing Henryk Maciejewski Data](https://reader030.vdocuments.site/reader030/viewer/2022012011/613d17ed736caf36b7593c36/html5/thumbnails/72.jpg)
MDX – Slicer Axis (WHERE)
• WHERE clause – used to specify a set, tuple or member that restricts the data returned for rows and columns
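The effect of the slicer can be sketched as a filter applied before the row members are aggregated. The fact rows are invented, and the sketch departs from a real OLAP server in one respect: row members with no matching facts simply do not appear in the result.

```python
# Invented fact rows: one row per note given to a student.
FACTS = [
    {"Gender": "M", "Group": "A"},
    {"Gender": "M", "Group": "A"},
    {"Gender": "F", "Group": "A"},
    {"Gender": "M", "Group": "B"},
]


def note_count_by(rows_attr, slicer=None):
    """Count notes per member of `rows_attr`, restricted by the slicer.

    slicer -- mapping attribute -> member that every counted fact must match
    """
    slicer = slicer or {}
    counts = {}
    for fact in FACTS:
        if all(fact[attr] == member for attr, member in slicer.items()):
            counts[fact[rows_attr]] = counts.get(fact[rows_attr], 0) + 1
    return counts


if __name__ == "__main__":
    print(note_count_by("Gender"))                  # no slicer: whole cube
    print(note_count_by("Gender", {"Group": "A"}))  # sliced to Student Group A
```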
![Page 73: Data Mining and Data Warehousing Henryk Maciejewski Data](https://reader030.vdocuments.site/reader030/viewer/2022012011/613d17ed736caf36b7593c36/html5/thumbnails/73.jpg)
![Page 74: Data Mining and Data Warehousing Henryk Maciejewski Data](https://reader030.vdocuments.site/reader030/viewer/2022012011/613d17ed736caf36b7593c36/html5/thumbnails/74.jpg)
![Page 75: Data Mining and Data Warehousing Henryk Maciejewski Data](https://reader030.vdocuments.site/reader030/viewer/2022012011/613d17ed736caf36b7593c36/html5/thumbnails/75.jpg)
Data Warehouse Project Methodology(-ies)
• SAS Rapid Data Warehouse Methodology
• IBM DW / BI Project Methodology
• …
• Purpose:
– Ensure a disciplined, iterative approach in the management and implementation of data warehousing projects
– Enable successful business and technical implementation of the data warehouse
![Page 76: Data Mining and Data Warehousing Henryk Maciejewski Data](https://reader030.vdocuments.site/reader030/viewer/2022012011/613d17ed736caf36b7593c36/html5/thumbnails/76.jpg)
DW Project Methodology - Phases
• Assessment
– Determine whether there is a realistic need and opportunity to develop a successful DW
– Project definition stage (team, sponsor, criteria for success, expectations)
– Initial assessment of IT infrastructure (is the project feasible?)
– Outcome: formal document
• Requirements
– Requirements gathering (in-depth interviews with business people)
– Reconciliation stage (analyze the gap between expectations and IT capabilities)
– Outcome: Requirements Definition Document (logical and physical data model; data extraction paths from source OLTP systems; transformations required; DW update schedule)
• Design / Implementation / Deployment
– Implement logical data model
– Build ETL processes (validate, clean, integrate)
– Load data to DW
– Design, implement data analysis interfaces
• Train users
• Review
![Page 77: Data Mining and Data Warehousing Henryk Maciejewski Data](https://reader030.vdocuments.site/reader030/viewer/2022012011/613d17ed736caf36b7593c36/html5/thumbnails/77.jpg)
DW Specific Requirements – Remarks
• Analytical needs in the company
– Types of reports, time schedules (daily / weekly etc.)
– Hierarchies of data / hierarchies of reports
– Identification of data sources
• Updates of data in DW
– Data integration rules; handling missing / wrong data
– Time schedule for DW updates
• Data latency / performance
– Recent changes in OLTP seen immediately in OLAP?
– What latency is acceptable?
– OLAP query performance
![Page 78: Data Mining and Data Warehousing Henryk Maciejewski Data](https://reader030.vdocuments.site/reader030/viewer/2022012011/613d17ed736caf36b7593c36/html5/thumbnails/78.jpg)
Data Integration
• Analyze source OLTP systems
– Determine DBMS systems / data formats
– Select most appropriate sources / columns (cleanest)
• Analyze required integration
– Ensure the same coding conventions (‘m-w’, ‘male-female’, ‘0-1’)
– Identify synonyms, homonyms, analogies
– Ensure data quality (integrity, accuracy, completeness)
• data value integrity
• data structure integrity
– Define exception handling rules / missing data handling / default values
– Finally, define a data integration rule/algorithm for each variable
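A data integration rule for a single variable can be sketched as a per-source code table plus an exception-handling rule. The source names, the canonical codes `M` / `F` / `U`, and the meaning assigned to `0` and `1` are assumptions made for illustration only.

```python
# Per-source code tables: source name -> {raw code -> canonical code}.
GENDER_CODES = {
    "source_a": {"m": "M", "w": "F"},            # 'm-w' convention
    "source_b": {"male": "M", "female": "F"},    # 'male-female' convention
    "source_c": {"0": "M", "1": "F"},            # '0-1' convention (assumed meaning)
}
UNKNOWN = "U"  # default value used when a raw code cannot be mapped


def integrate_gender(source, raw, exceptions):
    """Map a raw gender code to the warehouse convention.

    Unmappable or missing values get the default and are appended to
    `exceptions` so that they can be reviewed later.
    """
    key = "" if raw is None else str(raw).strip().lower()
    code = GENDER_CODES[source].get(key)
    if code is None:
        exceptions.append((source, raw))
        return UNKNOWN
    return code


if __name__ == "__main__":
    errors = []
    rows = [("source_a", " W "), ("source_b", "male"), ("source_c", 0), ("source_b", None)]
    print([integrate_gender(src, raw, errors) for src, raw in rows])
    print(errors)
```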
![Page 79: Data Mining and Data Warehousing Henryk Maciejewski Data](https://reader030.vdocuments.site/reader030/viewer/2022012011/613d17ed736caf36b7593c36/html5/thumbnails/79.jpg)
Example – Synonyms, Homonyms, Analogies
• Define how to resolve name conflicts between data sources / columns:
– Homonyms: same name but different meaning, e.g., Type in one source refers to the model of a car (“AURIS”, “CLIO”, etc.), and in another source to its category (“pickup”, “truck”, “passenger”, etc.)
– Synonyms: different names but the same meaning, e.g., PersonID in one source, EmployeeCode in another
– Analogies: attributes describe the same object, but differently, e.g., PaymentMethod in one source refers to “cash”, “check”, “credit card”, and in another to “VISA”, “MasterCard”, “USD” etc.
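Resolving synonyms and homonyms often comes down to a per-source column mapping, sketched below. The source names (`hr_system`, `payroll`, `dealer_db`, `fleet_db`) and the warehouse column names are hypothetical; only `PersonID`, `EmployeeCode` and `Type` come from the example above. Analogies such as PaymentMethod usually need a value-level mapping as well, which this sketch does not cover.

```python
# Source name -> {source column -> warehouse column}.
COLUMN_MAP = {
    # Synonyms: different names, same meaning -> one warehouse column.
    "hr_system": {"PersonID": "employee_id"},
    "payroll": {"EmployeeCode": "employee_id"},
    # Homonyms: same name, different meaning -> distinct warehouse columns.
    "dealer_db": {"Type": "car_model"},
    "fleet_db": {"Type": "vehicle_category"},
}


def to_warehouse_columns(source, record):
    """Rename the columns of one source record to warehouse names.

    Columns without an entry in the mapping keep their original name.
    """
    mapping = COLUMN_MAP[source]
    return {mapping.get(column, column): value for column, value in record.items()}


if __name__ == "__main__":
    print(to_warehouse_columns("hr_system", {"PersonID": 17, "Name": "Kowalski"}))
    print(to_warehouse_columns("payroll", {"EmployeeCode": 17}))
    print(to_warehouse_columns("dealer_db", {"Type": "AURIS"}))
    print(to_warehouse_columns("fleet_db", {"Type": "truck"}))
```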
![Page 80: Data Mining and Data Warehousing Henryk Maciejewski Data](https://reader030.vdocuments.site/reader030/viewer/2022012011/613d17ed736caf36b7593c36/html5/thumbnails/80.jpg)
Example – Data Integrity
Specify legal relationships between data values
• Number of values in a relationship
– Student can have 0, 1 or n diplomas
– ‘Undergraduate’: 0
– ‘Graduate’: 1 or n

| Employee | Temporary | Permanent |
|---|---|---|
| Name | + | + |
| Date of birth | + | + |
| Contract final date | + | -- |
| Anniversary date | o | + |

(+ required; -- not allowed; o optional)
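The employee rules above can be checked mechanically. The sketch below encodes them with the same symbols (`+` required, `--` not allowed, `o` optional); the function name `violations` and the sample records are made up for illustration.

```python
# Integrity rules per employee type: "+" required, "--" not allowed, "o" optional.
RULES = {
    "Temporary": {
        "Name": "+",
        "Date of birth": "+",
        "Contract final date": "+",
        "Anniversary date": "o",
    },
    "Permanent": {
        "Name": "+",
        "Date of birth": "+",
        "Contract final date": "--",
        "Anniversary date": "+",
    },
}


def violations(employee_type, record):
    """Return a list of rule violations for one employee record."""
    problems = []
    for field, rule in RULES[employee_type].items():
        present = record.get(field) not in (None, "")
        if rule == "+" and not present:
            problems.append(f"{field} is required")
        elif rule == "--" and present:
            problems.append(f"{field} is not allowed")
    return problems


if __name__ == "__main__":
    temp = {"Name": "A. Nowak", "Date of birth": "1990-05-01",
            "Contract final date": "2025-12-31"}
    perm = {"Name": "B. Lis", "Date of birth": "1985-02-10",
            "Contract final date": "2026-01-31"}
    print(violations("Temporary", temp))
    print(violations("Permanent", perm))
```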
![Page 81: Data Mining and Data Warehousing Henryk Maciejewski Data](https://reader030.vdocuments.site/reader030/viewer/2022012011/613d17ed736caf36b7593c36/html5/thumbnails/81.jpg)
Summary
• Build a dedicated database for OLAP – data mart / warehouse
– Data integration
– Data quality assurance
• Database organization
– Multidimensional model of data
– Physical data organization
• Denormalization
• Aggregation
• Benefits from the user’s perspective
– Integrated overall picture of the enterprise
– Easy access to historical data
– Trustworthy information returned (single version of the truth)
– DSS queries with no impact on transactional systems
• DW Methodology to ensure successful implementation