![Page 1: Sunita Sarawagi IIT Bombay sunita Analyzing large multidimensional databases](https://reader035.vdocuments.site/reader035/viewer/2022062408/56649f2b5503460f94c461b5/html5/thumbnails/1.jpg)
Sunita Sarawagi, IIT Bombay
http://www.it.iitb.ac.in/~sunita
Analyzing large multidimensional databases
![Page 2: Sunita Sarawagi IIT Bombay sunita Analyzing large multidimensional databases](https://reader035.vdocuments.site/reader035/viewer/2022062408/56649f2b5503460f94c461b5/html5/thumbnails/2.jpg)
Data Analysis in decision support systems
Data analysis: understanding the effect of various factors on a target variable
- Factors: region, time, sales channel
- Target: profit, sales volume

Tools for data analysis:
- SQL queries/reports: slow, manual, painful
- Multidimensional tools: OLAP, very popular
- Statistical packages: sophisticated
- Data mining: automated, a lot of interest and hype, but several hurdles to meaningful adoption
![Page 3: Sunita Sarawagi IIT Bombay sunita Analyzing large multidimensional databases](https://reader035.vdocuments.site/reader035/viewer/2022062408/56649f2b5503460f94c461b5/html5/thumbnails/3.jpg)
OLAP (Online Analytical Processing)

Data is viewed as a multidimensional cube, where
- factors are dimensions with hierarchies
- targets are values within cells

Analysis happens through fast interactive browsing of aggregates.

Market share in 2002: US $3.5 billion

Vendors: Microsoft, Hyperion, Cognos, Business Objects; about 30 such vendors
![Page 4: Sunita Sarawagi IIT Bombay sunita Analyzing large multidimensional databases](https://reader035.vdocuments.site/reader035/viewer/2022062408/56649f2b5503460f94c461b5/html5/thumbnails/4.jpg)
Multidimensional Data analysis
Sales volume as a function of product, month, and region
[Figure: data cube with axes Product, Region, and Month]

Dimensions: Product, Location, Time

Hierarchical summarization paths:
- Product: Industry > Category > Product
- Location: Region > Country > City > Office
- Time: Year > Quarter > Month / Week > Day
![Page 5: Sunita Sarawagi IIT Bombay sunita Analyzing large multidimensional databases](https://reader035.vdocuments.site/reader035/viewer/2022062408/56649f2b5503460f94c461b5/html5/thumbnails/5.jpg)
Typical OLAP Operations
- Roll up (drill-up): summarize data by climbing up a hierarchy or by dimension reduction
- Drill down (roll down): the reverse of roll-up; go from a higher-level summary to a lower-level summary or detailed data, or introduce new dimensions
- Slice and dice: project and select
- Pivot (rotate): reorient the cube for visualization, e.g. 3D to a series of 2D planes
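The operations above can be illustrated with a minimal Python sketch over a tiny in-memory fact table. The rows, the `CITY_TO_COUNTRY` hierarchy, and all function names are hypothetical and chosen only for illustration; a real OLAP engine would work from precomputed aggregates rather than scanning rows.

```python
from collections import defaultdict

# Hypothetical fact table: (product, city, month, sales). All values are made up.
FACTS = [
    ("OS", "Mumbai", "Jan", 10),
    ("OS", "Pune", "Jan", 5),
    ("DBMS", "Mumbai", "Feb", 7),
    ("DBMS", "London", "Jan", 3),
    ("OS", "London", "Feb", 4),
]

# Hypothetical hierarchy level: City rolls up to Country.
CITY_TO_COUNTRY = {"Mumbai": "India", "Pune": "India", "London": "UK"}


def aggregate(rows, key):
    """Sum sales over rows, grouped by key(product, city, month)."""
    totals = defaultdict(int)
    for product, city, month, sales in rows:
        totals[key(product, city, month)] += sales
    return dict(totals)


def roll_up_city_to_country(rows):
    """Roll up: climb the location hierarchy from City to Country."""
    return aggregate(rows, lambda p, c, m: (p, CITY_TO_COUNTRY[c], m))


def drop_month(rows):
    """Roll up by dimension reduction: aggregate the Month dimension away."""
    return aggregate(rows, lambda p, c, m: (p, c))


def slice_month(rows, month):
    """Slice: select only the rows for one month."""
    return [row for row in rows if row[2] == month]


def pivot(totals):
    """Pivot {(row_key, col_key): value} into {col_key: {row_key: value}}."""
    table = defaultdict(dict)
    for (row_key, col_key), value in totals.items():
        table[col_key][row_key] = value
    return {k: dict(v) for k, v in table.items()}


by_country = roll_up_city_to_country(FACTS)
by_product_city = drop_month(FACTS)
january = slice_month(FACTS, "Jan")
city_by_product = pivot(by_product_city)
```

Drill down is the reverse direction: it needs the finer-grained rows (here `FACTS`) that a roll-up has summarized away.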
![Page 6: Sunita Sarawagi IIT Bombay sunita Analyzing large multidimensional databases](https://reader035.vdocuments.site/reader035/viewer/2022062408/56649f2b5503460f94c461b5/html5/thumbnails/6.jpg)
Limitation of OLAP-based analysis
OLAP products provide a minimal set of tools for analysis:
- simple aggregates
- selects/drill-downs/roll-ups on the multidimensional structure

Heavy reliance on manual operations for analysis:
- tedious on large data with multiple dimensions and levels of hierarchy
![Page 7: Sunita Sarawagi IIT Bombay sunita Analyzing large multidimensional databases](https://reader035.vdocuments.site/reader035/viewer/2022062408/56649f2b5503460f94c461b5/html5/thumbnails/7.jpg)
I3: Intelligent, Interactive Investigation of multidimensional data
Lightweight automation of tedious multi-step tasks

Three examples:
- Diff: for specific "why" questions at the aggregate level; represent the answer as compactly as possible so that the user can quickly assimilate it
- Generalize: from detailed data to more general cases; expand the scope of a problem case as far out as possible
- Inform: of interesting regions in the data
![Page 8: Sunita Sarawagi IIT Bombay sunita Analyzing large multidimensional databases](https://reader035.vdocuments.site/reader035/viewer/2022062408/56649f2b5503460f94c461b5/html5/thumbnails/8.jpg)
The Diff operator
![Page 9: Sunita Sarawagi IIT Bombay sunita Analyzing large multidimensional databases](https://reader035.vdocuments.site/reader035/viewer/2022062408/56649f2b5503460f94c461b5/html5/thumbnails/9.jpg)
Unravel aggregate data
Total sales dropped 30% in Europe. Why?
What is the most compact answer that user can quickly assimilate?
![Page 10: Sunita Sarawagi IIT Bombay sunita Analyzing large multidimensional databases](https://reader035.vdocuments.site/reader035/viewer/2022062408/56649f2b5503460f94c461b5/html5/thumbnails/10.jpg)
Solution
A new DIFF operator added to OLAP systems that
- provides the answer in a single step
- is easy to assimilate and compact (configurable by the user)

Obviates the lengthy manual search for reasons in large multidimensional data.
![Page 11: Sunita Sarawagi IIT Bombay sunita Analyzing large multidimensional databases](https://reader035.vdocuments.site/reader035/viewer/2022062408/56649f2b5503460f94c461b5/html5/thumbnails/11.jpg)
Example query
Filters: Plat_User (All), Plat_Type (All), Platform (All), Prod_Group (All), Prod_Category (All), Product (All)

Sum of Revenue by Geography and Year:

| Geography | 1990 | 1991 | 1992 | 1993 | 1994 |
|---|---|---|---|---|---|
| Asia/Pacific | 1440 | 1947 | 3454 | 5576 | 6310 |
| Rest of World | 2170 | 2154 | 4577 | 5204 | 5510 |
| United States | 6545 | 7524 | 10947 | 13545 | 15817 |
| Western Europe | 4552 | 6061 | 10053 | 12578 | 13501 |
![Page 12: Sunita Sarawagi IIT Bombay sunita Analyzing large multidimensional databases](https://reader035.vdocuments.site/reader035/viewer/2022062408/56649f2b5503460f94c461b5/html5/thumbnails/12.jpg)
Compact answer

| PRODUCT | PLAT_USER | PLAT_TYPE | PLATFORM | 1990 | 1991 | RATIO | ERROR |
|---|---|---|---|---|---|---|---|
| (All)- | (All)- | (All) | (All) | 1620 | 1820 | 1.1 | 34 |
| Operating Systems | Multi | (All)- | (All) | 254 | 198 | 0.8 | 23 |
| Operating Systems | Multi | Other M. | Multiuser Mainframe IBM | 98 | 2 | 0.0 | 0 |
| Operating Systems | Single | Wn16 | (All) | 94 | 11 | 0.1 | 0 |
| *Middleware & Oth. Utilities | Multi | Other M. | Multiuser Mainframe IBM | 101 | 10 | 0.1 | 0 |
| EDA | Multi | Unix M. | (All) | 0.4 | 76 | 211.7 | 0 |
| EDA | Single | Unix S. | (All) | 0.1 | 13 | 210.8 | 0 |
![Page 13: Sunita Sarawagi IIT Bombay sunita Analyzing large multidimensional databases](https://reader035.vdocuments.site/reader035/viewer/2022062408/56649f2b5503460f94c461b5/html5/thumbnails/13.jpg)
Example: explaining increases

Filters: Plat_Type (All), Geography (All), Prod_Group Soln

Sum of Revenue by Prod_Category and Year (year columns: 1990, 1991, 1992, 1993, 1994):
- Cross Ind. Apps: 1975, 2484, 4564, 7407, 8150
- Home software: 294, 575
- Other Apps: 843, 1172, 3436
- Vertical Apps: 898, 1461, 2827, 7947, 8663
![Page 14: Sunita Sarawagi IIT Bombay sunita Analyzing large multidimensional databases](https://reader035.vdocuments.site/reader035/viewer/2022062408/56649f2b5503460f94c461b5/html5/thumbnails/14.jpg)
Compact answer

| PRODUCT | GEOGRAPHY | PLAT_TYPE | PLATFORM | 1992 | 1993 | RATIO | ERROR |
|---|---|---|---|---|---|---|---|
| (All)- | (All)- | (All)- | (All) | 2113 | 2763 | 1.3 | 200 |
| Manufacturing - Process | (All) | (All) | (All) | 26 | 702 | 27.1 | 250 |
| Other Vertical Apps | (All)- | (All)- | (All) | 20 | 1858 | 91.4 | 251 |
| Other Vertical Apps | United States | Unix S. | (All) | 8 | 77 | 9.6 | 0 |
| Other Vertical Apps | Western Europe | Unix S. | (All) | 7 | 96 | 13.2 | 0 |
| Manufacturing - Discrete | (All) | (All) | (All) | | 1135 | | 0 |
| Health Care | (All)- | (All)- | (All)- | 7 | 820 | 118.2 | 98 |
| Banking/Finance | United States | Other M. | (All) | 341 | 239 | 0.7 | 60 |
| Mechanical CAD | United States | (All) | (All) | 328 | 243 | 0.7 | 34 |
![Page 15: Sunita Sarawagi IIT Bombay sunita Analyzing large multidimensional databases](https://reader035.vdocuments.site/reader035/viewer/2022062408/56649f2b5503460f94c461b5/html5/thumbnails/15.jpg)
Model for summarization
The two aggregated values correspond to two subcubes in detailed data.
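Aggregate view: Products (All), Geography (All); Year 1990, 1991, 1992; values 2000, 1800.

Cube-A (Year 1990), Geography by Products:

| Geography | OS | DBMS | Prog |
|---|---|---|---|
| Asia | 100 | 80 | 80 |
| USA | 100 | 200 | 400 |
| UK | 140 | 100 | 56 |

Cube-B (Year 1991), Geography by Products:

| Geography | OS | DBMS | Prog |
|---|---|---|---|
| Asia | 80 | 90 | 70 |
| USA | 120 | 240 | 480 |
| UK | 140 | 60 | 56 |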
![Page 16: Sunita Sarawagi IIT Bombay sunita Analyzing large multidimensional databases](https://reader035.vdocuments.site/reader035/viewer/2022062408/56649f2b5503460f94c461b5/html5/thumbnails/16.jpg)
Detailed answers

| PRODUCT | GEOGRAPHY | PLATFORM | Y1992 | Y1993 | RATIO |
|---|---|---|---|---|---|
| Other Vertical Apps | Western Europe | Multiuser Minicomputer OpenVMS | | 99.9 | |
| Other Vertical Apps | Asia/Pacific | Single-user MAC OS | | 92.5 | |
| Other Vertical Apps | Rest of World | Multiuser Mainframe IBM | | 88.1 | |
| Other Vertical Apps | Western Europe | Single-user UNIX | 7.3 | 96.3 | 13.2 |
| Other Vertical Apps | United States | Multiuser Minicomputer Other | | 97.2 | |
| Other Vertical Apps | United States | Multiuser Minicomputer OS/400 | | 99.5 | |
| Other Vertical Apps | Asia/Pacific | Multiuser Minicomputer OS/400 | | 99.6 | |
| EDA | Western Europe | Multiuser UNIX | 192.6 | 277.8 | 1.4 |
| Manufacturing - Discrete | United States | Multiuser Mainframe IBM | | 88.4 | |

The detailed rows explain only 15% of the total difference, as against 90% with the compact answer.
![Page 17: Sunita Sarawagi IIT Bombay sunita Analyzing large multidimensional databases](https://reader035.vdocuments.site/reader035/viewer/2022062408/56649f2b5503460f94c461b5/html5/thumbnails/17.jpg)
Summarizing similar changes
Product: Manufacturing - Process

Sum of Revenue by Plat_Type and Year (values listed for 1992, 1993):
- Other M.: 10, 473
- Other S.: 0, 22
- Unix M.: 7, 105
- Unix S.: 1, 17
- Wn16: 3, 85
- Wn32: 0

| PRODUCT | GEOGRAPHY | PLAT_TYPE | PLATFORM | YEAR_1992 | YEAR_1993 | RATIO | ERROR |
|---|---|---|---|---|---|---|---|
| (All)- | (All)- | (All)- | (All) | 2113.0 | 2763.5 | 1.3 | 200 |
| Manufacturing - Process | (All) | (All) | (All) | 25.9 | 702.5 | 27.1 | 250 |
| Other Vertical Apps | (All)- | (All)- | (All) | 20.3 | 1858.4 | 91.4 | 251 |
| Other Vertical Apps | United States | Unix S. | (All) | 8.1 | 77.5 | 9.6 | 0 |
| Other Vertical Apps | Western Europe | Unix S. | (All) | 7.3 | 96.3 | 13.2 | 0 |
| Manufacturing - Discrete | (All) | (All) | (All) | | 1135.2 | | 0 |
| Health Care | (All)- | (All)- | (All)- | 6.9 | 820.4 | 118.2 | 98 |
| Banking/Finance | United States | Other M. | (All) | 341.3 | 239.3 | 0.7 | 60 |
| Mechanical CAD | United States | (All) | (All) | 327.8 | 243.4 | 0.7 | 34 |
![Page 18: Sunita Sarawagi IIT Bombay sunita Analyzing large multidimensional databases](https://reader035.vdocuments.site/reader035/viewer/2022062408/56649f2b5503460f94c461b5/html5/thumbnails/18.jpg)
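MDL model for summarization

Given N, find the best N rows of answer such that, if the user knows cube-A and the answer, the number of bits needed to send cube-B is minimized.

Cube-A (Year 1990), Geography by Products:

| Geography | OS | DBMS | Prog |
|---|---|---|---|
| Asia | 100 | 80 | 80 |
| USA | 100 | 200 | 400 |
| UK | 140 | 100 | 56 |

Cube-B (Year 1991), Geography by Products:

| Geography | OS | DBMS | Prog |
|---|---|---|---|
| Asia | 90 | 80 | 70 |
| USA | 89 | 39 | 67 |
| UK | 140 | 60 | 56 |

[Figure: Cube-A together with the N-row answer is used to encode Cube-B]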
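To make the idea of "one answer row standing in for many detailed cells" concrete, here is a small Python sketch. It uses the Cube-A and Cube-B values from the earlier "Model for summarization" slide and describes a group of cells by a single ratio. The error measure (total absolute deviation from the ratio) is an assumed stand-in for the bit cost in the MDL formulation, and the function names are hypothetical; this is not the algorithm from the talk.

```python
# Cube-A (1990) and Cube-B (1991); columns are OS, DBMS, Prog.
CUBE_A = {"Asia": [100, 80, 80], "USA": [100, 200, 400], "UK": [140, 100, 56]}
CUBE_B = {"Asia": [80, 90, 70], "USA": [120, 240, 480], "UK": [140, 60, 56]}


def cells(regions):
    """Yield (a, b) value pairs for every detailed cell in the given regions."""
    for region in regions:
        yield from zip(CUBE_A[region], CUBE_B[region])


def summarize(regions):
    """Describe a group of cells by one ratio, plus how badly that ratio fits.

    The error is the total absolute deviation |b - ratio * a|, an assumed
    simplification of the bit-cost error used by the MDL model.
    """
    pairs = list(cells(regions))
    total_a = sum(a for a, _ in pairs)
    total_b = sum(b for _, b in pairs)
    ratio = total_b / total_a
    error = sum(abs(b - ratio * a) for a, b in pairs)
    return total_a, total_b, ratio, error


overall = summarize(["Asia", "USA", "UK"])  # one row for the whole cube
usa_only = summarize(["USA"])               # a separate row for USA
rest = summarize(["Asia", "UK"])            # everything not covered by the USA row
```

In this data every USA cell grows by the same factor, so a single USA row describes three detailed cells with essentially zero error, and splitting it out lowers the error left over for the remaining cells. That is the kind of trade-off the N-row answer is optimizing.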
![Page 19: Sunita Sarawagi IIT Bombay sunita Analyzing large multidimensional databases](https://reader035.vdocuments.site/reader035/viewer/2022062408/56649f2b5503460f94c461b5/html5/thumbnails/19.jpg)
Integration
- Single pass over the data, with all indexing and sorting done in the DBMS, so the operator stays interactive.
- Low memory usage, independent of the number of tuples: O(NL).
- Easy to package as a stored procedure on the data server side.
- When the detailed subcube is too large: work off aggregated data.
![Page 20: Sunita Sarawagi IIT Bombay sunita Analyzing large multidimensional databases](https://reader035.vdocuments.site/reader035/viewer/2022062408/56649f2b5503460f94c461b5/html5/thumbnails/20.jpg)
The Relax operator
![Page 21: Sunita Sarawagi IIT Bombay sunita Analyzing large multidimensional databases](https://reader035.vdocuments.site/reader035/viewer/2022062408/56649f2b5503460f94c461b5/html5/thumbnails/21.jpg)
Example query: generalizing drops
![Page 22: Sunita Sarawagi IIT Bombay sunita Analyzing large multidimensional databases](https://reader035.vdocuments.site/reader035/viewer/2022062408/56649f2b5503460f94c461b5/html5/thumbnails/22.jpg)
![Page 23: Sunita Sarawagi IIT Bombay sunita Analyzing large multidimensional databases](https://reader035.vdocuments.site/reader035/viewer/2022062408/56649f2b5503460f94c461b5/html5/thumbnails/23.jpg)
Ratio generalization
![Page 24: Sunita Sarawagi IIT Bombay sunita Analyzing large multidimensional databases](https://reader035.vdocuments.site/reader035/viewer/2022062408/56649f2b5503460f94c461b5/html5/thumbnails/24.jpg)
The Inform operator
![Page 25: Sunita Sarawagi IIT Bombay sunita Analyzing large multidimensional databases](https://reader035.vdocuments.site/reader035/viewer/2022062408/56649f2b5503460f94c461b5/html5/thumbnails/25.jpg)
User-cognizant data exploration: overview
- Monitor to find regions of data the user has visited
- Model the user’s expectation of unseen values
- Report the most informative unseen values

How to
- model expected values?
- define information content?
![Page 26: Sunita Sarawagi IIT Bombay sunita Analyzing large multidimensional databases](https://reader035.vdocuments.site/reader035/viewer/2022062408/56649f2b5503460f94c461b5/html5/thumbnails/26.jpg)
Modeling expected values
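Database (hidden from user):

| | OS | DB | Word | Prog |
|---|---|---|---|---|
| Asia | 4 | 8 | 7 | 1 |
| Afric | 10 | 20 | 30 | 20 |
| USA | 5 | 9 | 1 | 33 |
| UK | 1 | 3 | 2 | 6 |

Views seen by user:
- Grand total: All 160
- Totals by region: Asia 20, Afric 80, USA 48, UK 12
- Totals by product: OS 20, DB 40, Word 40, Prog 60

Expected values given only the grand total:

| | OS | DB | Word | Prog |
|---|---|---|---|---|
| Asia | 10 | 10 | 10 | 10 |
| Afric | 10 | 10 | 10 | 10 |
| USA | 10 | 10 | 10 | 10 |
| UK | 10 | 10 | 10 | 10 |

Expected values given the totals by region:

| | OS | DB | Word | Prog |
|---|---|---|---|---|
| Asia | 5 | 5 | 5 | 5 |
| Afric | 20 | 20 | 20 | 20 |
| USA | 12 | 12 | 12 | 12 |
| UK | 3 | 3 | 3 | 3 |

[Empty grid: OS, DB, Word, Prog by Asia, Afric, USA, UK]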
![Page 27: Sunita Sarawagi IIT Bombay sunita Analyzing large multidimensional databases](https://reader035.vdocuments.site/reader035/viewer/2022062408/56649f2b5503460f94c461b5/html5/thumbnails/27.jpg)
The Maximum Entropy Principle

Choose the most uniform distribution while adhering to all the constraints.

E. T. Jaynes [1990]: "it agrees with everything that is known but carefully avoids assuming anything that is not known. It is transcription into mathematics of an ancient principle of wisdom…"

Characterizing uniformity:

$$H(p) = -\sum_i p_i \log p_i$$

which is maximum when all $p_i$ are equal.

Solve the constrained optimization problem: maximize $H(p)$ subject to $k$ constraints.
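As a quick check of the claim that entropy is largest for the uniform distribution, the following minimal Python sketch evaluates H(p) = -Σ p_i log p_i for two hand-picked distributions. The example distributions are made up for illustration.

```python
import math


def entropy(p):
    """H(p) = -sum_i p_i log p_i, treating 0 log 0 as 0."""
    return -sum(x * math.log(x) for x in p if x > 0)


uniform = [0.25, 0.25, 0.25, 0.25]  # all p_i equal
skewed = [0.7, 0.1, 0.1, 0.1]       # same support, less uniform

h_uniform = entropy(uniform)
h_skewed = entropy(skewed)
```

With four outcomes the uniform distribution reaches log 4, and any other distribution over the same four outcomes scores lower.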
![Page 28: Sunita Sarawagi IIT Bombay sunita Analyzing large multidimensional databases](https://reader035.vdocuments.site/reader035/viewer/2022062408/56649f2b5503460f94c461b5/html5/thumbnails/28.jpg)
Modeling expected values
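Database:

| | OS | DB | Word | Prog |
|---|---|---|---|---|
| Asia | 4 | 8 | 7 | 1 |
| Afric | 10 | 20 | 30 | 20 |
| USA | 5 | 9 | 1 | 33 |
| UK | 1 | 3 | 2 | 6 |

Visited views, each followed by the expected values it leads to.

Grand total: All 160.

| | OS | DB | Word | Prog |
|---|---|---|---|---|
| Asia | 10 | 10 | 10 | 10 |
| Afric | 10 | 10 | 10 | 10 |
| USA | 10 | 10 | 10 | 10 |
| UK | 10 | 10 | 10 | 10 |

Totals by region: Asia 20, Afric 80, USA 48, UK 12.

| | OS | DB | Word | Prog |
|---|---|---|---|---|
| Asia | 5 | 5 | 5 | 5 |
| Afric | 20 | 20 | 20 | 20 |
| USA | 12 | 12 | 12 | 12 |
| UK | 3 | 3 | 3 | 3 |

Totals by product: OS 20, DB 40, Word 40, Prog 60.

| | OS | DB | Word | Prog |
|---|---|---|---|---|
| Asia | 2.5 | 5 | 5 | 7.5 |
| Afric | 10 | 20 | 20 | 30 |
| USA | 6 | 12 | 12 | 18 |
| UK | 1.5 | 3 | 3 | 4.5 |

Single cell: USA, Prog = 33.

| | OS | DB | Word | Prog |
|---|---|---|---|---|
| Asia | 3 | 6 | 6 | 4.8 |
| Afric | 12 | 24 | 24.3 | 19.2 |
| USA | 3 | 6 | 6 | 33 |
| UK | 2 | 4 | 3.6 | 3 |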
![Page 29: Sunita Sarawagi IIT Bombay sunita Analyzing large multidimensional databases](https://reader035.vdocuments.site/reader035/viewer/2022062408/56649f2b5503460f94c461b5/html5/thumbnails/29.jpg)
Change in entropy
[Chart: entropy (vertical axis, 2 to 2.8) for View 1, View 2, View 3, View 4, and Data; horizontal axis: visited views]
![Page 30: Sunita Sarawagi IIT Bombay sunita Analyzing large multidimensional databases](https://reader035.vdocuments.site/reader035/viewer/2022062408/56649f2b5503460f94c461b5/html5/thumbnails/30.jpg)
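Finding expected values

Solve the constrained optimization problem: maximize $H(p)$ subject to $k$ constraints.
- Each constraint is of the form: sum of an arbitrary set of values.
- Expected values can be expressed as a product of $k$ coefficients, one from each of the $k$ constraints:

$$p_i = \prod_{j=0}^{k} \mu_j^{I_j(i)}$$

where $\mu_j$ is the coefficient of constraint $j$ and $I_j(i)$ indicates whether value $i$ is included in constraint $j$.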
![Page 31: Sunita Sarawagi IIT Bombay sunita Analyzing large multidimensional databases](https://reader035.vdocuments.site/reader035/viewer/2022062408/56649f2b5503460f94c461b5/html5/thumbnails/31.jpg)
Iterative scaling algorithm
- Initially all p values are the same.
- While convergence is not reached:
  - For each constraint $C_i$ in turn:
    - Scale the p values included in $C_i$ by $\tilde{p}(C_i) / p(C_i)$, the ratio of the actual to the currently expected sum over $C_i$.

Converges to the optimal solution when all constraints are consistent.
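The iterative scaling loop can be sketched in a few lines of Python. The example below uses the constraints from the "Modeling expected values" slides (region totals, product totals, and the single cell USA/Prog = 33). The function names, the convergence tolerance, and the sweep limit are assumptions made for this sketch rather than details from the talk.

```python
ROWS = ["Asia", "Afric", "USA", "UK"]
COLS = ["OS", "DB", "Word", "Prog"]


def build_constraints():
    """Each constraint is (cells it covers, actual sum over those cells)."""
    row_totals = {"Asia": 20, "Afric": 80, "USA": 48, "UK": 12}
    col_totals = {"OS": 20, "DB": 40, "Word": 40, "Prog": 60}
    constraints = []
    for r in ROWS:
        constraints.append(([(r, c) for c in COLS], row_totals[r]))
    for c in COLS:
        constraints.append(([(r, c) for r in ROWS], col_totals[c]))
    constraints.append(([("USA", "Prog")], 33))  # one visited detailed cell
    return constraints


def iterative_scaling(constraints, total=160.0, tol=1e-9, max_sweeps=1000):
    """Scale expected values until every constraint sum matches its target."""
    # Initially all values are the same (here 160 / 16 = 10).
    p = {(r, c): total / (len(ROWS) * len(COLS)) for r in ROWS for c in COLS}
    for _ in range(max_sweeps):
        worst = 0.0
        for cells, target in constraints:
            current = sum(p[cell] for cell in cells)
            worst = max(worst, abs(current - target))
            factor = target / current  # actual sum / currently expected sum
            for cell in cells:
                p[cell] *= factor
        if worst < tol:  # no constraint needed a noticeable correction
            break
    return p


expected = iterative_scaling(build_constraints())
```

Because these constraints are consistent with one another (they all come from the same database), the loop settles on values that satisfy all of them at once; for instance the USA row ends up close to 3, 6, 6, 33.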
![Page 32: Sunita Sarawagi IIT Bombay sunita Analyzing large multidimensional databases](https://reader035.vdocuments.site/reader035/viewer/2022062408/56649f2b5503460f94c461b5/html5/thumbnails/32.jpg)
Worked example of iterative scaling. Targets from the visited views: region totals Asia 20, Afric 80, USA 48, UK 12; product totals OS 20, DB 40, Word 40, Prog 60; cell USA/Prog 33.

Start with all expected values equal. Current region totals are 40 each and product totals are 40 each.

| | OS | DB | Word | Prog |
|---|---|---|---|---|
| Asia | 10 | 10 | 10 | 10 |
| Afric | 10 | 10 | 10 | 10 |
| USA | 10 | 10 | 10 | 10 |
| UK | 10 | 10 | 10 | 10 |

After scaling to the region totals: region totals are 20, 80, 48, 12; product totals are still 40 each; USA/Prog is 12.

| | OS | DB | Word | Prog |
|---|---|---|---|---|
| Asia | 5 | 5 | 5 | 5 |
| Afric | 20 | 20 | 20 | 20 |
| USA | 12 | 12 | 12 | 12 |
| UK | 3 | 3 | 3 | 3 |

After scaling to the product totals: product totals are 20, 40, 40, 60; region totals remain 20, 80, 48, 12; USA/Prog is 18.

| | OS | DB | Word | Prog |
|---|---|---|---|---|
| Asia | 3 | 5 | 5 | 7.5 |
| Afric | 10 | 20 | 20 | 30 |
| USA | 6 | 12 | 12 | 18 |
| UK | 2 | 3 | 3 | 4.5 |

After scaling USA/Prog to 33: the USA total becomes 63 (target 48) and the Prog total becomes 75 (target 60).

| | OS | DB | Word | Prog |
|---|---|---|---|---|
| Asia | 3 | 5 | 5 | 7.5 |
| Afric | 10 | 20 | 20 | 30 |
| USA | 6 | 12 | 12 | 33 |
| UK | 2 | 3 | 3 | 4.5 |

After rescaling the USA row to 48: product totals are 19, 37, 37, 67 and USA/Prog is 25 (target 33), so the scaling continues.

| | OS | DB | Word | Prog |
|---|---|---|---|---|
| Asia | 3 | 5 | 5 | 7.5 |
| Afric | 10 | 20 | 20 | 30 |
| USA | 5 | 9 | 9 | 25 |
| UK | 2 | 3 | 3 | 4.5 |
![Page 33: Sunita Sarawagi IIT Bombay sunita Analyzing large multidimensional databases](https://reader035.vdocuments.site/reader035/viewer/2022062408/56649f2b5503460f94c461b5/html5/thumbnails/33.jpg)
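Information content of an unvisited cell

Defined as how much adding it as a constraint will reduce the distance between actual and expected values.

Distance between actual ($\tilde{p}$) and expected ($p^k$, after $k$ constraints):

$$D(\tilde{p}, p^k) = \sum_i \tilde{p}_i \log \frac{\tilde{p}_i}{p^k_i}$$

Information content of the $(k+1)$th constraint $C_{k+1}$:

$$D(\tilde{p}, p^k) - D(\tilde{p}, p^{k+1})$$

Can be approximated as:

$$\tilde{p}(C_{k+1}) \log \frac{\tilde{p}(C_{k+1})}{p^k(C_{k+1})}$$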
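The distance and the per-cell approximation can be tried out on the running example. The Python sketch below scores every cell using the actual database values and the expected values reached after the four visited views. Normalizing each table by its own total so that both are proper distributions is an assumption of this sketch, as are the function names; the scores are not meant to reproduce the numbers plotted in the talk.

```python
import math

REGIONS = ["Asia", "Afric", "USA", "UK"]
PRODUCTS = ["OS", "DB", "Word", "Prog"]

# Actual database values and expected values after the visited views.
ACTUAL = [[4, 8, 7, 1], [10, 20, 30, 20], [5, 9, 1, 33], [1, 3, 2, 6]]
EXPECTED = [[3, 6, 6, 4.8], [12, 24, 24.3, 19.2], [3, 6, 6, 33], [2, 4, 3.6, 3]]


def to_distribution(table):
    """Turn a table of values into {(region, product): probability}."""
    total = sum(sum(row) for row in table)
    return {
        (r, c): table[i][j] / total
        for i, r in enumerate(REGIONS)
        for j, c in enumerate(PRODUCTS)
    }


def distance(actual, expected):
    """D(actual, expected) = sum_i a_i log(a_i / e_i)."""
    return sum(a * math.log(a / expected[cell]) for cell, a in actual.items() if a > 0)


def cell_information(actual, expected):
    """Approximate information content of each single-cell constraint."""
    return {cell: a * math.log(a / expected[cell]) for cell, a in actual.items() if a > 0}


p_actual = to_distribution(ACTUAL)
p_expected = to_distribution(EXPECTED)
scores = cell_information(p_actual, p_expected)
most_informative = max(scores, key=scores.get)
```

A cell whose actual value already matches the expectation, such as USA/Prog, scores close to zero, while cells that are far above expectation score highest.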
![Page 34: Sunita Sarawagi IIT Bombay sunita Analyzing large multidimensional databases](https://reader035.vdocuments.site/reader035/viewer/2022062408/56649f2b5503460f94c461b5/html5/thumbnails/34.jpg)
Information content of unseen data
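[Chart: information content of each unseen cell (vertical axis, 0 to 0.03), by product (OS, DB, Word, Prog) and region (Asia, Afric, USA, UK)]

Actual values:

| | OS | DB | Word | Prog |
|---|---|---|---|---|
| Asia | 4 | 8 | 7 | 1 |
| Afric | 10 | 20 | 30 | 20 |
| USA | 5 | 9 | 1 | 33 |
| UK | 1 | 3 | 2 | 6 |

Expected values:

| | OS | DB | Word | Prog |
|---|---|---|---|---|
| Asia | 3 | 6 | 6 | 4.8 |
| Afric | 12 | 24 | 24.3 | 19.2 |
| USA | 3 | 6 | 6 | 33 |
| UK | 2 | 4 | 3.6 | 3 |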
![Page 35: Sunita Sarawagi IIT Bombay sunita Analyzing large multidimensional databases](https://reader035.vdocuments.site/reader035/viewer/2022062408/56649f2b5503460f94c461b5/html5/thumbnails/35.jpg)
Finding N most informative cells
In general, the most informative cells can be values from any level of aggregation.

A single-pass algorithm finds the cells with the best difference between actual and expected values [Diff algorithm].
![Page 36: Sunita Sarawagi IIT Bombay sunita Analyzing large multidimensional databases](https://reader035.vdocuments.site/reader035/viewer/2022062408/56649f2b5503460f94c461b5/html5/thumbnails/36.jpg)
Information gain with focussed exploration
[Charts: relative square error (vertical axis) against constraint number (horizontal axis), comparing Random and MaxEntropy. Two panels cover constraints 0 to 50 with error from 0 to 1; a third covers constraints 0 to 15 with error from 0 to 0.4.]
![Page 37: Sunita Sarawagi IIT Bombay sunita Analyzing large multidimensional databases](https://reader035.vdocuments.site/reader035/viewer/2022062408/56649f2b5503460f94c461b5/html5/thumbnails/37.jpg)
Illustration from Student enrollment data
Dimensions (number of members): Student Category (9), Sex (2), Program Name (10), Department Name (28), Year (10), Category (3)

Sum: 8206

Share of total by dimension:
- PROGRAM: 2 Yr M.Sc 9.21%, B.Tech 33.87%, M.Tech 37.31%, Ph.D 11.60%, Others 1.60%
- SEX: F 10.25%, M 89.75%
- CATEGORY: Full time Sponsored 5.28%, Indian 81.55%, Others 1.90%
- Year: 1989 19.00%, Others 9.00%
35% of information in data captured in 12 out of 4560 cells: 0.25% of data
![Page 38: Sunita Sarawagi IIT Bombay sunita Analyzing large multidimensional databases](https://reader035.vdocuments.site/reader035/viewer/2022062408/56649f2b5503460f94c461b5/html5/thumbnails/38.jpg)
Top few surprising values

| Category | Sex | Program | Dept |
|---|---|---|---|
| Indian | M | | Computer Science & Engineering |
| Indian | M | | Metallurgical Engineering & Mat.Sc. |
| | | M.Mgnt. | School of Management |
| Indian | M | M.Tech | Civil Engineering |
| Indian | M | M.Tech | Chemical Engineering |
| | M | | Bio-Technology |
80% of information in data captured in 50 out of 4560 cells: 1% of data
![Page 39: Sunita Sarawagi IIT Bombay sunita Analyzing large multidimensional databases](https://reader035.vdocuments.site/reader035/viewer/2022062408/56649f2b5503460f94c461b5/html5/thumbnails/39.jpg)
Summary
Our goal: enhance OLAP with a suite of operations that are
- richer than simple OLAP and SQL queries
- more interactive than conventional mining

...and thus reduce the need for manual analysis.

Proposed three new operators: Diff, Relax, Inform
- Formulations with theoretical basis
- Efficient algorithms for online answering
- Integrates smoothly with existing systems