![Page 1: Data Warehouse and Data Mining - WordPress.com · warehouse data • OLAP middleware to support missing pieces – Multidimensional OLAP (MOLAP) • Array-based storage structures](https://reader033.vdocuments.site/reader033/viewer/2022050418/5f8dc6eb1915904661554fa7/html5/thumbnails/1.jpg)
Naeem Ahmed
Department of Software Engineering, Mehran University of Engineering and Technology, Jamshoro
Email: [email protected]
Data Warehouse and Data Mining Lecture No. 08-15
OLAP and Multi-Dimensional Data
![Page 2: Data Warehouse and Data Mining - WordPress.com · warehouse data • OLAP middleware to support missing pieces – Multidimensional OLAP (MOLAP) • Array-based storage structures](https://reader033.vdocuments.site/reader033/viewer/2022050418/5f8dc6eb1915904661554fa7/html5/thumbnails/2.jpg)
On-Line Analytical Processing
• A decision support system (DSS) that supports ad hoc querying, i.e., enables managers and analysts to manipulate data interactively
• Analysis of information in a database for the purpose of making management decisions
• The idea is to let users easily and quickly manipulate and visualize data through multidimensional views (i.e., different perspectives)
• OLAP analyzes historical data (terabytes) using complex queries
![Page 3: Data Warehouse and Data Mining - WordPress.com · warehouse data • OLAP middleware to support missing pieces – Multidimensional OLAP (MOLAP) • Array-based storage structures](https://reader033.vdocuments.site/reader033/viewer/2022050418/5f8dc6eb1915904661554fa7/html5/thumbnails/3.jpg)
On-Line Analytical Processing
• OLAP Council definition:
– A category of software technology that enables analysts, managers and executives to gain insight into data through fast, consistent, interactive access to a wide variety of possible views of information that has been transformed from raw data to reflect the real dimensionality of the enterprise as understood by the user
• OLAP is implemented in a multi-user client/server mode and offers consistently rapid response to queries, regardless of database size and complexity.
![Page 4: Data Warehouse and Data Mining - WordPress.com · warehouse data • OLAP middleware to support missing pieces – Multidimensional OLAP (MOLAP) • Array-based storage structures](https://reader033.vdocuments.site/reader033/viewer/2022050418/5f8dc6eb1915904661554fa7/html5/thumbnails/4.jpg)
On-Line Analytical Processing
• OLAP primarily involves aggregating large amounts of diverse data
• OLAP functionality provides dynamic multi-dimensional analysis, supporting analytical and navigational activities
• OLAP functionality is provided by the OLAP Server
• OLAP Council defines OLAP Server as:
– ‘A high capacity, multi-user data manipulation engine specifically designed to support and operate on multi-dimensional data structures.’
![Page 5: Data Warehouse and Data Mining - WordPress.com · warehouse data • OLAP middleware to support missing pieces – Multidimensional OLAP (MOLAP) • Array-based storage structures](https://reader033.vdocuments.site/reader033/viewer/2022050418/5f8dc6eb1915904661554fa7/html5/thumbnails/5.jpg)
Data Dimensionality
![Page 6: Data Warehouse and Data Mining - WordPress.com · warehouse data • OLAP middleware to support missing pieces – Multidimensional OLAP (MOLAP) • Array-based storage structures](https://reader033.vdocuments.site/reader033/viewer/2022050418/5f8dc6eb1915904661554fa7/html5/thumbnails/6.jpg)
Data Dimensionality: Cube
[Figure: sales cube with axes Date (1Qtr, 2Qtr, 3Qtr, 4Qtr, sum), Country (Pakistan, China, India, sum) and product (TV, VCR, PC, sum). Labeled cells: total annual sales of TV in Pakistan; 1st Qtr sales of TV in Pakistan; total annual sales of TV, PC & VCR in India.]
Cube: A group of data cells arranged by the dimensions of the data.
![Page 7: Data Warehouse and Data Mining - WordPress.com · warehouse data • OLAP middleware to support missing pieces – Multidimensional OLAP (MOLAP) • Array-based storage structures](https://reader033.vdocuments.site/reader033/viewer/2022050418/5f8dc6eb1915904661554fa7/html5/thumbnails/7.jpg)
Data Dimensionality
Possible Views of Sale
• How many Products were sold at a specific Time to specific Customer(s)?
• How many Customers bought the Product(s) at a specific Time?
• At which Time(s) did the Customer(s) buy the specific Product(s)?
[Figure: Sale cube with dimensions Products, Time and Customers]
![Page 8: Data Warehouse and Data Mining - WordPress.com · warehouse data • OLAP middleware to support missing pieces – Multidimensional OLAP (MOLAP) • Array-based storage structures](https://reader033.vdocuments.site/reader033/viewer/2022050418/5f8dc6eb1915904661554fa7/html5/thumbnails/8.jpg)
Multi-dimensional Data
• Measures - numerical data being tracked
• Dimensions - business parameters that define a transaction
• Example: an analyst may want to view sales data (measure) by geography, by time, and by product (dimensions)
• Dimensional modeling is a technique for structuring data around business concepts
• ER models describe “entities” and “relationships”
• Dimensional models describe “measures” and “dimensions”
![Page 9: Data Warehouse and Data Mining - WordPress.com · warehouse data • OLAP middleware to support missing pieces – Multidimensional OLAP (MOLAP) • Array-based storage structures](https://reader033.vdocuments.site/reader033/viewer/2022050418/5f8dc6eb1915904661554fa7/html5/thumbnails/9.jpg)
Multi-dimensional Model
“Sales by product line over the past six months”
“Sales by store between 1990 and 1995”
[Figure: fact table for measures with columns Prod Code, Time Code, Store Code (key columns joining the fact table to the dimension tables) and Sales Qty (numerical measure); dimension tables: Store Info, Product Info, Time Info, . . .]
![Page 10: Data Warehouse and Data Mining - WordPress.com · warehouse data • OLAP middleware to support missing pieces – Multidimensional OLAP (MOLAP) • Array-based storage structures](https://reader033.vdocuments.site/reader033/viewer/2022050418/5f8dc6eb1915904661554fa7/html5/thumbnails/10.jpg)
Multi-dimensional Model
• Every dimensional model (DM) is composed of one table with a composite primary key, called the fact table, and a set of smaller tables called dimension tables
• Forms a ‘star-like’ structure, which is called a star schema or star join
• Dimensions are organized into hierarchies
– E.g., Time dimension: days → weeks → quarters
– E.g., Product dimension: product → product line → brand
• Dimensions have attributes
– E.g., owner city and county of store
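The fact table and dimension tables described above can be sketched with Python's built-in `sqlite3` module. This is only an illustration: the table names, column names and rows are hypothetical, and the time and store dimension tables are kept minimal or left out for brevity. It answers a "sales by product line" question by joining the fact table to the product dimension.

```python
import sqlite3

# Hypothetical star schema: one fact table (sales) plus product and store dimensions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE product (prod_code TEXT PRIMARY KEY, product_line TEXT, brand TEXT);
CREATE TABLE store   (store_code TEXT PRIMARY KEY, city TEXT);
CREATE TABLE sales (
    prod_code  TEXT REFERENCES product(prod_code),
    time_code  INTEGER,
    store_code TEXT REFERENCES store(store_code),
    sales_qty  INTEGER,
    PRIMARY KEY (prod_code, time_code, store_code)  -- composite key of the fact table
);
INSERT INTO product VALUES ('p1', 'beverages', 'BrandX'), ('p2', 'snacks', 'BrandY');
INSERT INTO store VALUES ('s1', 'Karachi'), ('s2', 'Jamshoro');
INSERT INTO sales VALUES ('p1', 1, 's1', 12), ('p2', 1, 's1', 11),
                         ('p1', 2, 's1', 44), ('p1', 2, 's2', 4);
""")

# "Sales by product line": join the fact table to a dimension, then aggregate.
sales_by_line = conn.execute("""
    SELECT p.product_line, SUM(f.sales_qty)
    FROM sales AS f JOIN product AS p ON p.prod_code = f.prod_code
    GROUP BY p.product_line
    ORDER BY p.product_line
""").fetchall()
print(sales_by_line)  # [('beverages', 60), ('snacks', 11)]
```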
![Page 11: Data Warehouse and Data Mining - WordPress.com · warehouse data • OLAP middleware to support missing pieces – Multidimensional OLAP (MOLAP) • Array-based storage structures](https://reader033.vdocuments.site/reader033/viewer/2022050418/5f8dc6eb1915904661554fa7/html5/thumbnails/11.jpg)
Dimension Hierarchies
[Figure: two hierarchies. Store Dimension: Stores → District → Region → Total. Product Dimension: Products → Brand → Manufacturer → Total.]
![Page 12: Data Warehouse and Data Mining - WordPress.com · warehouse data • OLAP middleware to support missing pieces – Multidimensional OLAP (MOLAP) • Array-based storage structures](https://reader033.vdocuments.site/reader033/viewer/2022050418/5f8dc6eb1915904661554fa7/html5/thumbnails/12.jpg)
Operations in Multidimensional Data Model
• Aggregation (roll-up)
– dimension reduction: e.g., total sales by city
– summarization over aggregate hierarchy: e.g., total sales by city and year → total sales by region and by year
• Selection (slice) defines a sub-cube
– e.g., sales where city = Palo Alto and date = 20/1/2014
• Navigation to detailed data (drill-down)
– e.g., (sales - expense) by city, top 3% of cities by average income
• Visualization Operations (e.g., Pivot)
![Page 13: Data Warehouse and Data Mining - WordPress.com · warehouse data • OLAP middleware to support missing pieces – Multidimensional OLAP (MOLAP) • Array-based storage structures](https://reader033.vdocuments.site/reader033/viewer/2022050418/5f8dc6eb1915904661554fa7/html5/thumbnails/13.jpg)
A Visual Operation: Pivot
[Figure: cube with axes Product (Juice, Cola, Milk, Cream), Date (3/1, 3/2, 3/3, 3/4) and Region; sample cell values 10, 47, 30, 12]
A pivot is a two-dimensional layout of the summary data
The x and y axes are the dimensions, and the intersection cell for any two dimension values contains the value of the measure
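A minimal sketch of a pivot in plain Python, assuming facts arrive as (product, date, units) tuples. The rows below are made up for illustration (the figure's cell positions are not recoverable), and `pivot` is a hypothetical helper, not part of any OLAP product.

```python
from collections import defaultdict

# Illustrative facts (product, date, units); values and placement are assumptions.
facts = [
    ("Juice", "3/1", 10), ("Cola", "3/1", 47),
    ("Juice", "3/2", 30), ("Milk", "3/2", 12),
    ("Juice", "3/1", 5),
]

def pivot(rows, row_dim, col_dim):
    """Lay out summed measures on a 2-D grid keyed by two dimension positions."""
    grid = defaultdict(lambda: defaultdict(int))
    for row in rows:
        grid[row[row_dim]][row[col_dim]] += row[-1]  # last field is the measure
    return {r: dict(cols) for r, cols in grid.items()}

by_product = pivot(facts, 0, 1)  # products as rows, dates as columns
by_date = pivot(facts, 1, 0)     # same data with the axes swapped
print(by_product["Juice"])  # {'3/1': 15, '3/2': 30}
print(by_date["3/1"])       # {'Juice': 15, 'Cola': 47}
```

Swapping the two axis arguments is all it takes to rotate the view; the underlying facts are untouched.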
![Page 14: Data Warehouse and Data Mining - WordPress.com · warehouse data • OLAP middleware to support missing pieces – Multidimensional OLAP (MOLAP) • Array-based storage structures](https://reader033.vdocuments.site/reader033/viewer/2022050418/5f8dc6eb1915904661554fa7/html5/thumbnails/14.jpg)
Drill-Down and Roll-Up
![Page 15: Data Warehouse and Data Mining - WordPress.com · warehouse data • OLAP middleware to support missing pieces – Multidimensional OLAP (MOLAP) • Array-based storage structures](https://reader033.vdocuments.site/reader033/viewer/2022050418/5f8dc6eb1915904661554fa7/html5/thumbnails/15.jpg)
Multi-dimensionality: Cube
![Page 16: Data Warehouse and Data Mining - WordPress.com · warehouse data • OLAP middleware to support missing pieces – Multidimensional OLAP (MOLAP) • Array-based storage structures](https://reader033.vdocuments.site/reader033/viewer/2022050418/5f8dc6eb1915904661554fa7/html5/thumbnails/16.jpg)
Multi-dimensionality
![Page 17: Data Warehouse and Data Mining - WordPress.com · warehouse data • OLAP middleware to support missing pieces – Multidimensional OLAP (MOLAP) • Array-based storage structures](https://reader033.vdocuments.site/reader033/viewer/2022050418/5f8dc6eb1915904661554fa7/html5/thumbnails/17.jpg)
On-Line Analytical Processing
• OLAP tools are market driven; that is, no standards exist, either academic or from a standards organization
• A common modeling approach is to use Star or Snowflake Database Schemata (common in Data Warehouse Modeling)
• End users look for the following characteristics, independent of tool architecture or vendor:
![Page 18: Data Warehouse and Data Mining - WordPress.com · warehouse data • OLAP middleware to support missing pieces – Multidimensional OLAP (MOLAP) • Array-based storage structures](https://reader033.vdocuments.site/reader033/viewer/2022050418/5f8dc6eb1915904661554fa7/html5/thumbnails/18.jpg)
On-Line Analytical Processing
• Interactivity – How easily can the end user interact with the tool?
• Customization – How easily can the end user change the data representation provided by the tool?
• Security – How well does the tool keep the end user from accessing unauthorized data?
• Visualization – How easily can the tool provide multi-dimensional graphical representations?
![Page 19: Data Warehouse and Data Mining - WordPress.com · warehouse data • OLAP middleware to support missing pieces – Multidimensional OLAP (MOLAP) • Array-based storage structures](https://reader033.vdocuments.site/reader033/viewer/2022050418/5f8dc6eb1915904661554fa7/html5/thumbnails/19.jpg)
OLAP Servers
• Two possibilities for OLAP servers
– Relational OLAP (ROLAP)
• Relational and specialized relational DBMS to store and manage warehouse data
• OLAP middleware to support missing pieces
– Multidimensional OLAP (MOLAP)
• Array-based storage structures
• Direct access to array data structures
• No SQL (Structured Query Language)
– Special language provided by the vendor (e.g., Multidimensional Expressions (MDX) of Microsoft)
![Page 20: Data Warehouse and Data Mining - WordPress.com · warehouse data • OLAP middleware to support missing pieces – Multidimensional OLAP (MOLAP) • Array-based storage structures](https://reader033.vdocuments.site/reader033/viewer/2022050418/5f8dc6eb1915904661554fa7/html5/thumbnails/20.jpg)
OLAP Taxonomy
• Multi-dimensional OLAP (MOLAP)
– ‘A k-dimensional matrix based on a non relational storage structure.’ Agrawal et al.
• Relational OLAP (ROLAP)
– ‘A relational back-end wherein operations of the data are translated to relational queries.’ Agrawal et al.
• Hybrid OLAP (HOLAP)
– Integration of MOLAP and ROLAP
• Desktop OLAP (DOLAP)
– Provides a specific cube for analysis. Simplified version of MOLAP or ROLAP
![Page 21: Data Warehouse and Data Mining - WordPress.com · warehouse data • OLAP middleware to support missing pieces – Multidimensional OLAP (MOLAP) • Array-based storage structures](https://reader033.vdocuments.site/reader033/viewer/2022050418/5f8dc6eb1915904661554fa7/html5/thumbnails/21.jpg)
Multi-dimensional OLAP
• Multi-dimensional data management in Multi-Dimensional Database Management Systems (MDDBMS)
• A special-purpose server that directly implements multidimensional data and operations
• Advantages: Fast data access, many dimensions, performance
• Further Research on storage techniques and realization of transactional concepts
![Page 22: Data Warehouse and Data Mining - WordPress.com · warehouse data • OLAP middleware to support missing pieces – Multidimensional OLAP (MOLAP) • Array-based storage structures](https://reader033.vdocuments.site/reader033/viewer/2022050418/5f8dc6eb1915904661554fa7/html5/thumbnails/22.jpg)
MOLAP: Dimensional Modeling Using the Multi Dimensional Model
• MDDB: a special-purpose data model, MOLAP = “Cubes”
• Facts stored in multi-dimensional arrays
• The database system builds most of the aggregates within a non-relational data store
• Dimensions used to index array
• Sometimes on top of relational DB
• Products: Pilot, Arbor Essbase, Gentia
• Limitations: Memory
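To make "dimensions used to index array" concrete, here is a small sketch in plain Python, using the sale rows that appear in this deck's MOLAP Cube slide. The nested list stands in for an array-based store; real MOLAP servers use their own storage formats, so treat the layout as an assumption. Counting empty cells hints at why memory is listed as the limitation: a dense array reserves a cell for every combination, sold or not.

```python
# Dimension members; their positions serve as array indexes.
products = ["p1", "p2"]
stores = ["s1", "s2", "s3"]
prod_index = {p: i for i, p in enumerate(products)}
store_index = {s: i for i, s in enumerate(stores)}

# Dense 2-D array with one cell per (product, store) pair; None marks an empty cell.
cube = [[None] * len(stores) for _ in products]

facts = [("p1", "s1", 12), ("p2", "s1", 11), ("p1", "s3", 50), ("p2", "s2", 8)]
for prod, store, amt in facts:
    cube[prod_index[prod]][store_index[store]] = amt

def lookup(prod, store):
    """Direct cell access: two index lookups, no scan over the facts."""
    return cube[prod_index[prod]][store_index[store]]

total_cells = len(products) * len(stores)
empty_cells = sum(cell is None for row in cube for cell in row)
print(lookup("p1", "s3"), total_cells, empty_cells)  # 50 6 2
```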
![Page 23: Data Warehouse and Data Mining - WordPress.com · warehouse data • OLAP middleware to support missing pieces – Multidimensional OLAP (MOLAP) • Array-based storage structures](https://reader033.vdocuments.site/reader033/viewer/2022050418/5f8dc6eb1915904661554fa7/html5/thumbnails/23.jpg)
Relational OLAP
• A multi-dimensional user view on relational data storage using Star or Snowflake Database Schemata
[Figure: Star Schema: a Sales fact table linked to Product, Time, Region and Customer dimensions. Snowflake Schema: a Sales fact table linked to Product, Year, Country and Customer dimensions, with further tables Product Kind, Month, Region and Customer Characteristics.]
![Page 24: Data Warehouse and Data Mining - WordPress.com · warehouse data • OLAP middleware to support missing pieces – Multidimensional OLAP (MOLAP) • Array-based storage structures](https://reader033.vdocuments.site/reader033/viewer/2022050418/5f8dc6eb1915904661554fa7/html5/thumbnails/24.jpg)
Relational OLAP
• An extended relational DBMS that maps operations on multidimensional data to standard relational operators (i.e., iterators such as joins, loops, nested joins, etc.)
• Fact tables are often too big to query directly, so ROLAP incorporates aggregate tables
– Aggregate tables are built by running summarizing queries that join the fact table with one or more dimensions and saving the result set
– Users don’t need to specify the aggregate table; vendors provide automatic support for aggregate tables in the data warehouse
• Advantages: Easy to understand, easy to model, easy to implement
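A rough sketch of how an aggregate table might be built, using Python's `sqlite3` module and the sale rows from this deck's examples. The `store` dimension table, its `region` column and the name `agg_sale_by_region` are assumptions made for illustration. Unlike the vendor tools described above, this sketch queries the aggregate table explicitly instead of being redirected to it automatically.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sale (prodId TEXT, storeId TEXT, date INTEGER, amt INTEGER);
INSERT INTO sale VALUES ('p1','s1',1,12), ('p2','s1',1,11), ('p1','s3',1,50),
                        ('p2','s2',1,8),  ('p1','s1',2,44), ('p1','s2',2,4);
CREATE TABLE store (storeId TEXT PRIMARY KEY, region TEXT);
INSERT INTO store VALUES ('s1','A'), ('s2','B'), ('s3','B');

-- Aggregate table: summarize the fact table joined with the store dimension
-- and save the result set.
CREATE TABLE agg_sale_by_region AS
    SELECT s.region AS region, f.prodId AS prodId, SUM(f.amt) AS amt
    FROM sale AS f JOIN store AS s ON s.storeId = f.storeId
    GROUP BY s.region, f.prodId;
""")

# A region-level question is now answered from the small aggregate table.
region_totals = conn.execute(
    "SELECT region, SUM(amt) FROM agg_sale_by_region GROUP BY region ORDER BY region"
).fetchall()
print(region_totals)  # [('A', 67), ('B', 62)]
```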
![Page 25: Data Warehouse and Data Mining - WordPress.com · warehouse data • OLAP middleware to support missing pieces – Multidimensional OLAP (MOLAP) • Array-based storage structures](https://reader033.vdocuments.site/reader033/viewer/2022050418/5f8dc6eb1915904661554fa7/html5/thumbnails/25.jpg)
ROLAP: Dimensional Modeling Using Relational DBMS
• Special schema design: star, snowflake
• Special indexes: bitmap, multi-table join
• Special tuning: maximize query throughput
• Proven technology (relational model, DBMS), tends to outperform specialized MDDB, especially on large data sets
• Products – IBM DB2, Oracle, Sybase IQ, RedBrick, Informix
• Limitations: Maintenance, Performance
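The bitmap index mentioned above can be illustrated in a few lines of Python. This is a toy sketch, not how any of the listed products implement it: each distinct column value gets an integer whose set bits mark the matching rows, and `build_bitmap_index` is a hypothetical helper. The rows reuse this deck's sale data without the date column.

```python
# Fact rows as (prodId, storeId, amt); row position = bit position.
rows = [("p1", "s1", 12), ("p2", "s1", 11), ("p1", "s3", 50),
        ("p2", "s2", 8), ("p1", "s1", 44), ("p1", "s2", 4)]

def build_bitmap_index(rows, column):
    """Map each distinct value of a column to an int whose set bits mark matching rows."""
    index = {}
    for pos, row in enumerate(rows):
        index[row[column]] = index.get(row[column], 0) | (1 << pos)
    return index

prod_bitmaps = build_bitmap_index(rows, 0)
store_bitmaps = build_bitmap_index(rows, 1)

# WHERE prodId = 'p1' AND storeId = 's1' becomes a bitwise AND of two bitmaps.
match = prod_bitmaps["p1"] & store_bitmaps["s1"]
matching_rows = [rows[pos] for pos in range(len(rows)) if (match >> pos) & 1]
matched_amount = sum(amt for _, _, amt in matching_rows)
print(bin(match), matched_amount)  # 0b10001 56
```

Bitmaps suit low-cardinality dimension columns, where a handful of bitmaps covers every row and predicates combine with cheap bitwise operations.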
![Page 26: Data Warehouse and Data Mining - WordPress.com · warehouse data • OLAP middleware to support missing pieces – Multidimensional OLAP (MOLAP) • Array-based storage structures](https://reader033.vdocuments.site/reader033/viewer/2022050418/5f8dc6eb1915904661554fa7/html5/thumbnails/26.jpg)
Hybrid OLAP
• ‘A system, which supports (and integrates) multi-dimensional and relational storage for data in an equivalent manner in order to benefit from the corresponding characteristics and optimization techniques.’ Dinter et al.
• Advantages: use of best techniques introduced on MOLAP and ROLAP, transparency between MOLAP and ROLAP systems
• Further Research on storage systems, on global multi-dimensional schema, on common interface and mutual integration of MOLAP and ROLAP
![Page 27: Data Warehouse and Data Mining - WordPress.com · warehouse data • OLAP middleware to support missing pieces – Multidimensional OLAP (MOLAP) • Array-based storage structures](https://reader033.vdocuments.site/reader033/viewer/2022050418/5f8dc6eb1915904661554fa7/html5/thumbnails/27.jpg)
Desktop OLAP (DOLAP)
• All processing work is done on the desktop
– E.g., bring data into Excel and build a pivot table
• DOLAP can be inexpensive, easy and fast to set up, but only on small data sets (thousands of rows)
![Page 28: Data Warehouse and Data Mining - WordPress.com · warehouse data • OLAP middleware to support missing pieces – Multidimensional OLAP (MOLAP) • Array-based storage structures](https://reader033.vdocuments.site/reader033/viewer/2022050418/5f8dc6eb1915904661554fa7/html5/thumbnails/28.jpg)
OLTP versus OLAP

| OLTP | OLAP |
|---|---|
| Operational processing | Informational processing |
| Transaction-oriented | Analysis-oriented |
| For operational staff | For managers, executives & analysts |
| Daily operations | Decision support |
| Current, up-to-date data | Historical data |
| Primitive, highly detailed data | Summarized, consolidated data |
| Detailed, flat relational views | Summarized, multi-dimensional views |
| Short, simple transactions | Complex aggregate queries |
| Read/write | Mostly read only |
| Index on keys | Many scans |
| Many users | Small number of users |
| Large databases | Very large databases |
![Page 29: Data Warehouse and Data Mining - WordPress.com · warehouse data • OLAP middleware to support missing pieces – Multidimensional OLAP (MOLAP) • Array-based storage structures](https://reader033.vdocuments.site/reader033/viewer/2022050418/5f8dc6eb1915904661554fa7/html5/thumbnails/29.jpg)
OLTP versus OLAP
• On-Line Transaction Processing
– Transfer a $100 balance from my savings account to my checking account
• On-Line Analytical Processing
– What is the average balance of accounts by customer groups, account types, areas, account managers, and their combinations?
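The two workloads can be put side by side in a short `sqlite3` sketch. The `account` table, its columns and its rows are hypothetical, and only two of the grouping attributes from the question above are modeled. The OLTP part updates two rows by key inside one transaction; the OLAP part scans every account to compute averages per group.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE account (id INTEGER PRIMARY KEY, owner TEXT, type TEXT,
                      customer_group TEXT, balance INTEGER);
INSERT INTO account VALUES
    (1, 'me',    'savings',  'retail',   500),
    (2, 'me',    'checking', 'retail',   200),
    (3, 'other', 'savings',  'business', 900);
""")

# OLTP: a short read/write transaction touching two rows by key.
with conn:  # commits on success, rolls back on error
    conn.execute("UPDATE account SET balance = balance - 100 WHERE id = 1")
    conn.execute("UPDATE account SET balance = balance + 100 WHERE id = 2")

# OLAP: a read-only aggregate query that scans every account.
avg_by_group = conn.execute("""
    SELECT customer_group, type, AVG(balance)
    FROM account
    GROUP BY customer_group, type
    ORDER BY customer_group, type
""").fetchall()
print(avg_by_group)
```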
![Page 30: Data Warehouse and Data Mining - WordPress.com · warehouse data • OLAP middleware to support missing pieces – Multidimensional OLAP (MOLAP) • Array-based storage structures](https://reader033.vdocuments.site/reader033/viewer/2022050418/5f8dc6eb1915904661554fa7/html5/thumbnails/30.jpg)
Aggregate
• A whole formed or calculated by the combination of many separate units or items – Total
• Operators: sum, count, max, min, median, avg
– Example: Add up amounts by day
– Example in SQL: SELECT date, sum(amt) FROM SALE GROUP BY date

sale:

| prodId | storeId | date | amt |
|---|---|---|---|
| p1 | s1 | 1 | 12 |
| p2 | s1 | 1 | 11 |
| p1 | s3 | 1 | 50 |
| p2 | s2 | 1 | 8 |
| p1 | s1 | 2 | 44 |
| p1 | s2 | 2 | 4 |

ans:

| date | sum |
|---|---|
| 1 | 81 |
| 2 | 48 |
![Page 31: Data Warehouse and Data Mining - WordPress.com · warehouse data • OLAP middleware to support missing pieces – Multidimensional OLAP (MOLAP) • Array-based storage structures](https://reader033.vdocuments.site/reader033/viewer/2022050418/5f8dc6eb1915904661554fa7/html5/thumbnails/31.jpg)
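Aggregate
• Add up amounts by day, product
• SQL: SELECT prodId, date, sum(amt) FROM SALE GROUP BY date, prodId

sale:

| prodId | storeId | date | amt |
|---|---|---|---|
| p1 | s1 | 1 | 12 |
| p2 | s1 | 1 | 11 |
| p1 | s3 | 1 | 50 |
| p2 | s2 | 1 | 8 |
| p1 | s1 | 2 | 44 |
| p1 | s2 | 2 | 4 |

sale (aggregated by date, prodId):

| prodId | date | amt |
|---|---|---|
| p1 | 1 | 62 |
| p2 | 1 | 19 |
| p1 | 2 | 48 |

[Figure: arrows between the two tables; Roll-up goes from the detailed sale table to the aggregated table, Drill-down goes back]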
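Both aggregate queries can be checked against the sale table with Python's `sqlite3` module. This is a minimal sketch: the second query selects `prodId` as well, so that each output row is labeled with its product, and the `ORDER BY` clauses are added only to make the output order predictable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sale (prodId TEXT, storeId TEXT, date INTEGER, amt INTEGER)")
conn.executemany("INSERT INTO sale VALUES (?, ?, ?, ?)", [
    ("p1", "s1", 1, 12), ("p2", "s1", 1, 11), ("p1", "s3", 1, 50),
    ("p2", "s2", 1, 8), ("p1", "s1", 2, 44), ("p1", "s2", 2, 4),
])

# Roll-up to day level: one row per date.
by_day = conn.execute(
    "SELECT date, SUM(amt) FROM sale GROUP BY date ORDER BY date"
).fetchall()

# Drill-down one level: one row per (date, product) pair.
by_day_product = conn.execute(
    "SELECT prodId, date, SUM(amt) FROM sale "
    "GROUP BY date, prodId ORDER BY date, prodId"
).fetchall()

print(by_day)          # [(1, 81), (2, 48)]
print(by_day_product)  # [('p1', 1, 62), ('p2', 1, 19), ('p1', 2, 48)]
```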
![Page 32: Data Warehouse and Data Mining - WordPress.com · warehouse data • OLAP middleware to support missing pieces – Multidimensional OLAP (MOLAP) • Array-based storage structures](https://reader033.vdocuments.site/reader033/viewer/2022050418/5f8dc6eb1915904661554fa7/html5/thumbnails/32.jpg)
MOLAP Cube
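dimensions = 2

Fact table view (sale):

| prodId | storeId | amt |
|---|---|---|
| p1 | s1 | 12 |
| p2 | s1 | 11 |
| p1 | s3 | 50 |
| p2 | s2 | 8 |

Multi-dimensional cube:

|  | s1 | s2 | s3 |
|---|---|---|---|
| p1 | 12 |  | 50 |
| p2 | 11 | 8 |  |

dimensions = 3

Fact table view (sale):

| prodId | storeId | date | amt |
|---|---|---|---|
| p1 | s1 | 1 | 12 |
| p2 | s1 | 1 | 11 |
| p1 | s3 | 1 | 50 |
| p2 | s2 | 1 | 8 |
| p1 | s1 | 2 | 44 |
| p1 | s2 | 2 | 4 |

Multi-dimensional cube, day 1:

|  | s1 | s2 | s3 |
|---|---|---|---|
| p1 | 12 |  | 50 |
| p2 | 11 | 8 |  |

Multi-dimensional cube, day 2:

|  | s1 | s2 | s3 |
|---|---|---|---|
| p1 | 44 | 4 |  |
| p2 |  |  |  |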
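The step from the fact table view to the cube view can be sketched in Python. Here the cube is a dictionary keyed by (prodId, storeId, date), which stores only the non-empty cells; this sparse layout is an illustrative choice, and `render` is a hypothetical helper that prints one day's 2-D grid.

```python
# Fact rows from the sale table: (prodId, storeId, date, amt).
sale = [("p1", "s1", 1, 12), ("p2", "s1", 1, 11), ("p1", "s3", 1, 50),
        ("p2", "s2", 1, 8), ("p1", "s1", 2, 44), ("p1", "s2", 2, 4)]

# Sparse 3-D cube: one entry per non-empty cell, keyed by its coordinates.
cube = {(prod, store, date): amt for prod, store, date, amt in sale}

def render(cube, date, products=("p1", "p2"), stores=("s1", "s2", "s3")):
    """Return the 2-D grid for one day as rows of cells; None marks an empty cell."""
    return [[cube.get((p, s, date)) for s in stores] for p in products]

day1 = render(cube, 1)
day2 = render(cube, 2)
print(day1)  # [[12, None, 50], [11, 8, None]]
print(day2)  # [[44, 4, None], [None, None, None]]
```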
![Page 33: Data Warehouse and Data Mining - WordPress.com · warehouse data • OLAP middleware to support missing pieces – Multidimensional OLAP (MOLAP) • Array-based storage structures](https://reader033.vdocuments.site/reader033/viewer/2022050418/5f8dc6eb1915904661554fa7/html5/thumbnails/33.jpg)
Example: Cube
[Figure: cube with axes Product (Juice, Milk, Coke, Cream, Soap, Bread), Time (M, T, W, Th, F, S, S) and Store (NY, SF, LA); sample cell values 10, 34, 56, 32, 12, 56. Highlighted cell: 56 units of bread sold in LA on M. Arrows: roll-up to week, roll-up to brand, roll-up to region.]
Dimensions: Time, Product, Store
Attributes: Product (upc, price, …), Store … …
Hierarchies: Product → Brand → …; Day → Week → Quarter; Store → Region → Country
![Page 34: Data Warehouse and Data Mining - WordPress.com · warehouse data • OLAP middleware to support missing pieces – Multidimensional OLAP (MOLAP) • Array-based storage structures](https://reader033.vdocuments.site/reader033/viewer/2022050418/5f8dc6eb1915904661554fa7/html5/thumbnails/34.jpg)
Cube Aggregation
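Example: computing sums

day 1:

|  | s1 | s2 | s3 |
|---|---|---|---|
| p1 | 12 |  | 50 |
| p2 | 11 | 8 |  |

day 2:

|  | s1 | s2 | s3 |
|---|---|---|---|
| p1 | 44 | 4 |  |
| p2 |  |  |  |

Summed over days:

|  | s1 | s2 | s3 |
|---|---|---|---|
| p1 | 56 | 4 | 50 |
| p2 | 11 | 8 |  |

Summed over days and products:

|  | s1 | s2 | s3 |
|---|---|---|---|
| sum | 67 | 12 | 50 |

Summed over days and stores:

|  | sum |
|---|---|
| p1 | 110 |
| p2 | 19 |

Grand total: 129

. . .

[Figure: arrows labeled Roll-up (toward the totals) and Drill-down (back toward the detailed cube)]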
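The sums on this slide can be reproduced with a small roll-up function over a dictionary-based cube. This is a sketch under the assumption that cells are keyed by (prodId, storeId, date); `roll_up` is a hypothetical helper that keeps the listed dimension positions and sums over the rest.

```python
from collections import defaultdict

# Cube cells keyed by (prodId, storeId, date).
cube = {("p1", "s1", 1): 12, ("p2", "s1", 1): 11, ("p1", "s3", 1): 50,
        ("p2", "s2", 1): 8, ("p1", "s1", 2): 44, ("p1", "s2", 2): 4}

def roll_up(cube, keep):
    """Sum the measure over every dimension whose position is not listed in `keep`."""
    out = defaultdict(int)
    for coords, amt in cube.items():
        out[tuple(coords[i] for i in keep)] += amt
    return dict(out)

by_prod_store = roll_up(cube, keep=(0, 1))  # sum over days
by_store = roll_up(cube, keep=(1,))         # sum over days and products
by_prod = roll_up(cube, keep=(0,))          # sum over days and stores
total = roll_up(cube, keep=())              # grand total

print(by_prod_store[("p1", "s1")])  # 56
print(by_store)                     # {('s1',): 67, ('s3',): 50, ('s2',): 12}
print(by_prod)                      # {('p1',): 110, ('p2',): 19}
print(total)                        # {(): 129}
```

Each call drops one or more dimensions, which is the "dimension reduction" form of roll-up; drilling down means going back to a result that keeps more dimensions.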
![Page 35: Data Warehouse and Data Mining - WordPress.com · warehouse data • OLAP middleware to support missing pieces – Multidimensional OLAP (MOLAP) • Array-based storage structures](https://reader033.vdocuments.site/reader033/viewer/2022050418/5f8dc6eb1915904661554fa7/html5/thumbnails/35.jpg)
Aggregation Using Hierarchies
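Hierarchy: store → region → country
(store s1 in Region A; stores s2, s3 in Region B)

day 1:

|  | s1 | s2 | s3 |
|---|---|---|---|
| p1 | 12 |  | 50 |
| p2 | 11 | 8 |  |

day 2:

|  | s1 | s2 | s3 |
|---|---|---|---|
| p1 | 44 | 4 |  |
| p2 |  |  |  |

Rolled up from store to region (summed over days):

|  | region A | region B |
|---|---|---|
| p1 | 56 | 54 |
| p2 | 11 | 8 |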
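Rolling up along a hierarchy replaces each member by its parent before summing. The sketch below uses the store-to-region assignment stated on the slide; the dictionary-based cube keyed by (prodId, storeId, date) is an assumed representation.

```python
from collections import defaultdict

# Cube cells keyed by (prodId, storeId, date).
cube = {("p1", "s1", 1): 12, ("p2", "s1", 1): 11, ("p1", "s3", 1): 50,
        ("p2", "s2", 1): 8, ("p1", "s1", 2): 44, ("p1", "s2", 2): 4}

# Store → region level of the hierarchy, as stated on the slide.
region_of = {"s1": "A", "s2": "B", "s3": "B"}

# Replace each store by its region, drop the date, and sum the amounts.
totals = defaultdict(int)
for (prod, store, _date), amt in cube.items():
    totals[(prod, region_of[store])] += amt
by_region = dict(totals)

print(by_region)
# {('p1', 'A'): 56, ('p2', 'A'): 11, ('p1', 'B'): 54, ('p2', 'B'): 8}
```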
![Page 36: Data Warehouse and Data Mining - WordPress.com · warehouse data • OLAP middleware to support missing pieces – Multidimensional OLAP (MOLAP) • Array-based storage structures](https://reader033.vdocuments.site/reader033/viewer/2022050418/5f8dc6eb1915904661554fa7/html5/thumbnails/36.jpg)
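Slicing
• Slicing means taking a slice out of a cube by fixing the value of one or more selected dimensions
– e.g., sales where city = ‘Karachi’ and date = ‘20/1/2014’

day 1:

|  | s1 | s2 | s3 |
|---|---|---|---|
| p1 | 12 |  | 50 |
| p2 | 11 | 8 |  |

day 2:

|  | s1 | s2 | s3 |
|---|---|---|---|
| p1 | 44 | 4 |  |
| p2 |  |  |  |

Slice for TIME = day 1:

|  | s1 | s2 | s3 |
|---|---|---|---|
| p1 | 12 |  | 50 |
| p2 | 11 | 8 |  |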
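The TIME = day 1 slice can be expressed as a filter that fixes one dimension and removes it from the coordinates. A minimal sketch, assuming cells keyed by (prodId, storeId, date); `slice_cube` is a hypothetical helper name.

```python
# Cube cells keyed by (prodId, storeId, date).
cube = {("p1", "s1", 1): 12, ("p2", "s1", 1): 11, ("p1", "s3", 1): 50,
        ("p2", "s2", 1): 8, ("p1", "s1", 2): 44, ("p1", "s2", 2): 4}

def slice_cube(cube, dim, value):
    """Fix one dimension (by position) to a value and drop it from the coordinates."""
    return {coords[:dim] + coords[dim + 1:]: amt
            for coords, amt in cube.items() if coords[dim] == value}

day1 = slice_cube(cube, dim=2, value=1)
print(day1)
# {('p1', 's1'): 12, ('p2', 's1'): 11, ('p1', 's3'): 50, ('p2', 's2'): 8}
```

The result has one dimension fewer than the input, which matches the 2-D grid shown for day 1.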
![Page 37: Data Warehouse and Data Mining - WordPress.com · warehouse data • OLAP middleware to support missing pieces – Multidimensional OLAP (MOLAP) • Array-based storage structures](https://reader033.vdocuments.site/reader033/viewer/2022050418/5f8dc6eb1915904661554fa7/html5/thumbnails/37.jpg)
Dicing
• Dicing means viewing the slices from different angles
– Example: revenue for different products within a given state, or revenue for different states for a given product
• Dicing is more of a zoom feature that selects a subset over all the dimensions, but for specific values of each dimension
• One form of slicing and dicing is called pivoting
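Read as "select a subset over all the dimensions for specific values", dicing can be sketched as a filter that keeps every dimension but restricts each one to a set of allowed members. The cube layout (cells keyed by (prodId, storeId, date), taken from this deck's sale data) and the helper name `dice` are assumptions for illustration.

```python
# Cube cells keyed by (prodId, storeId, date).
cube = {("p1", "s1", 1): 12, ("p2", "s1", 1): 11, ("p1", "s3", 1): 50,
        ("p2", "s2", 1): 8, ("p1", "s1", 2): 44, ("p1", "s2", 2): 4}

def dice(cube, allowed):
    """Keep cells whose coordinate in each listed dimension is in the allowed set."""
    return {coords: amt for coords, amt in cube.items()
            if all(coords[dim] in values for dim, values in allowed.items())}

# Product p1 only, stores s1 and s2, both days: a smaller cube with all 3 dimensions.
sub_cube = dice(cube, {0: {"p1"}, 1: {"s1", "s2"}, 2: {1, 2}})
print(sub_cube)
# {('p1', 's1', 1): 12, ('p1', 's1', 2): 44, ('p1', 's2', 2): 4}
```

In contrast to a slice, the result keeps all three coordinates, so it can still be rolled up, sliced or pivoted further.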
![Page 38: Data Warehouse and Data Mining - WordPress.com · warehouse data • OLAP middleware to support missing pieces – Multidimensional OLAP (MOLAP) • Array-based storage structures](https://reader033.vdocuments.site/reader033/viewer/2022050418/5f8dc6eb1915904661554fa7/html5/thumbnails/38.jpg)
Dicing
![Page 39: Data Warehouse and Data Mining - WordPress.com · warehouse data • OLAP middleware to support missing pieces – Multidimensional OLAP (MOLAP) • Array-based storage structures](https://reader033.vdocuments.site/reader033/viewer/2022050418/5f8dc6eb1915904661554fa7/html5/thumbnails/39.jpg)