A Visualization Framework for Exploring Correlations among Attributes of a Large
Dataset and Its Applications in Data Mining
This thesis is
presented to the
School of Computer Science & Software Engineering
for the degree of
Doctor of Philosophy
of
The University of Western Australia
By
Kesaraporn Techapichetvanich
2005
© Copyright 2005
by
Kesaraporn Techapichetvanich
Abstract
Many databases in scientific and business applications have grown exponentially
in size in recent years. Accessing and using databases is no longer a specialized
activity as more and more ordinary users without any specialized knowledge are
trying to gain information from databases. Both expert and ordinary users face
significant challenges in understanding the information stored in databases. The
databases are so large in most cases that it is impossible to gain useful information
by inspecting data tables, which are the most common form of storing data
in relational databases. Visualization has emerged as one of the most important
techniques for exploring data stored in large databases. Appropriate visualization
techniques can reveal trends, correlations and associations in data that are very difficult
to understand from a textual representation of the data. This thesis presents
several new frameworks for data visualization and visual data mining.
The first technique, VisEx, is useful for visual exploration of large multi-attribute
datasets and especially for exploring the correlations among the attributes in such
datasets. Most previous visualization techniques can display correlations among only two
or three attributes at a time without excessive screen clutter. Many data
exploration tasks require examining correlations among four or more attributes, but
previous visualization tools support this only indirectly. In contrast, the
technique developed in this thesis allows the user to explore correlations among any
number of attributes seamlessly. This technique is also completely scalable in the
sense that it can handle small as well as very large datasets.
Many organizations are increasingly using data mining tools to discover important
associations in data stored in large data warehouses. Although many algorithms for
mining association rules have been researched extensively, they do not incorporate
users in the process and most of them generate a large number of association rules.
It is often difficult for the user to analyze so many rules and identify the
small subset that is of real importance. In this thesis I present a
framework for the user to interactively mine association rules visually.
Another challenging task in data mining is to understand the correlations among
the mined association rules. It is often difficult to identify a relevant subset of
association rules from a large number of mined rules. A further contribution of this
thesis is a simple framework in the VisAR system that allows the user to explore a
large number of association rules visually.
A variety of businesses have adopted new technologies for storing large amounts
of data. Analysis of historical data quite often offers new insights into business
processes that may increase productivity and profit. On-line analytical processing
(OLAP) has become a powerful tool for business analysts to explore historical
data. Effective visualization techniques are very important for supporting OLAP
technology. A new technique for the visual exploration of OLAP data cubes is also
presented in this thesis.
Preface
Much of the work presented in this thesis has been published as follows. The first
two papers are related to the material in Chapter 3. The third to fifth papers are
related to Chapter 4 and the last paper is related to the material in Chapter 5.
• K. Techapichetvanich, A. Datta and R. Owens. HDDV: Hierarchical dynamic
dimensional visualization. In Proceedings of IASTED International Conference
on Databases and Applications, pages 157-162, 2004.
• K. Techapichetvanich and A. Datta. VisEx: A visualization framework for
exploring correlations among attributes in large multidimensional datasets,
Information Visualization, under review.
• K. Techapichetvanich and A. Datta. Visual mining of market basket association
rules. In Proceedings of ICCSA 2004: International Conference on
Computational Science and Its Applications, Volume 3046 of Lecture Notes in
Computer Science, pages 479-488. Springer, 2004.
• K. Techapichetvanich and A. Datta. VisAR: A new technique for visualizing
mined association rules. In Proceedings of the First International Conference
on Advanced Data Mining and Applications (ADMA 2005), Volume 3584 of
Lecture Notes in Computer Science, pages 88-95. Springer, 2005.
• K. Techapichetvanich and A. Datta. Visual data mining for discovering association
rules. In K. E. Voges and N. K. Ll. Pope (editors), Business Application
and Computational Intelligence, Chapter 11. Idea Group Publishing, 2005.
• K. Techapichetvanich and A. Datta. Interactive visualization for OLAP. In
Proceedings of ICCSA 2005: International Conference on Computational Science
and Its Applications, Volume 3482 of Lecture Notes in Computer Science,
pages 206-215. Springer, 2005.
Although this thesis and the published papers largely overlap, the thesis describes the
structure and details of each system more thoroughly. The author of this thesis is
responsible for the originality of the presented research and is also the primary
author of each of these publications.
Acknowledgements
First and foremost, I would like to thank Associate Professor Amitava Datta. He
has been my supervisor for the last two years of my candidature. Over that time,
he has provided invaluable motivation, inspiration, and guidance. I am glad
to have had him as a supervisor. Many thanks are also extended to Professor Robyn
Owens for her help.
During the first period of my candidature, I also benefitted from the tremendous
support of Dr. Sato Juniper. Margaret Jones provided English support
through both teaching and proofreading.
General thanks go to all staff at the School of Computer Science & Software Engineering
at the University of Western Australia. Specific thanks also go to Dr.
Nick Spadaccini, the head of school during the last two years of my candidature. I
have also had the good fortune to be surrounded by a great number of postgraduates
both in the school and outside.
Last, but not least, I would like to offer my special thanks to my family for their
support and encouragement. Thanks also go to the Pocathikorn family for their
support and for caring for me as one of the family.
Contents
Abstract v
Preface vii
Acknowledgements ix
1 Introduction 1
1.1 Contributions of the thesis . . . . . . . . . . . . . . . . . . . . . . . 5
1.1.1 Visual exploration of large multidimensional datasets . . . . 5
1.1.2 Visual data mining and visualization of association rules . . 7
1.1.3 Interactive visualization for OLAP . . . . . . . . . . . . . . 10
1.2 Structure of the thesis . . . . . . . . . . . . . . . . . . . . . . . . . 10
2 Previous Work 13
2.1 Information Visualization Techniques . . . . . . . . . . . . . . . . . 13
2.1.1 Geometric techniques . . . . . . . . . . . . . . . . . . . . . . 14
2.1.2 Iconographic techniques . . . . . . . . . . . . . . . . . . . . 17
2.1.3 Hierarchical techniques . . . . . . . . . . . . . . . . . . . . . 20
2.1.4 Pixel-based techniques . . . . . . . . . . . . . . . . . . . . . 26
2.1.5 Table-based techniques . . . . . . . . . . . . . . . . . . . . . 28
2.2 The dynamic query framework . . . . . . . . . . . . . . . . . . . . . 30
2.3 Data Mining . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
2.3.1 Association rules . . . . . . . . . . . . . . . . . . . . . . . . 35
2.4 Visualization for OLAP . . . . . . . . . . . . . . . . . . . . . . . . 40
2.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
3 A New Technique for Visual Exploration of Large Datasets 43
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
3.2 Terminology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
3.3 VisEx system design . . . . . . . . . . . . . . . . . . . . . . . . . . 48
3.4 VisEx system architecture and implementation . . . . . . . . . . . . 50
3.4.1 Connection and Transformation in VisEx . . . . . . . . . . . 50
3.4.2 Visualizing multiple attributes in VisEx . . . . . . . . . . . 52
3.4.3 Querying in VisEx . . . . . . . . . . . . . . . . . . . . . . . 55
3.4.4 User interaction . . . . . . . . . . . . . . . . . . . . . . . . . 61
3.5 Analysis scenarios . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
3.5.1 Analysis 1: 1990 U.S. Census Data . . . . . . . . . . . . . . 65
3.5.2 Analysis 2: 1985 The Current Population Survey . . . . . . 68
3.6 User study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
3.6.1 Experimental methodology . . . . . . . . . . . . . . . . . . . 69
3.6.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
3.7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
4 Visualization for Association Rule Mining 79
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
4.2 Terminology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
4.3 The model for interactive association rule mining . . . . . . . . . . 82
4.3.1 Identifying Frequent Itemsets . . . . . . . . . . . . . . . . . 85
4.3.2 Selecting Interesting Association Rules . . . . . . . . . . . . 87
4.3.3 Visualizing Association Rules . . . . . . . . . . . . . . . . . 87
4.4 Data Structure used in VisDM . . . . . . . . . . . . . . . . . . . . . 89
4.5 A user study of VisDM . . . . . . . . . . . . . . . . . . . . . . . . . 90
4.5.1 Experimental methodology . . . . . . . . . . . . . . . . . . . 90
4.5.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
4.6 Visualization of many association rules . . . . . . . . . . . . . . . . 92
4.6.1 The VisAR system . . . . . . . . . . . . . . . . . . . . . . . 96
4.6.2 The advantages of VisAR . . . . . . . . . . . . . . . . . . . 102
4.7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
5 Interactive Visualization for On-line Analytical Processing 105
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
5.2 Terminology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
5.3 VisOLAP system architecture and implementation . . . . . . . . . . 111
5.3.1 System connection . . . . . . . . . . . . . . . . . . . . . . . 111
5.3.2 Visualizing OLAP data cubes . . . . . . . . . . . . . . . . . 113
5.4 Interaction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
5.5 Visual Exploration and MDX query . . . . . . . . . . . . . . . . . . 119
5.6 Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122
5.7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123
6 Conclusion 127
6.1 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127
6.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130
Bibliography 131
Appendices 141
A 141
A.1 Tasks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141
A.1.1 Tutorial Tasks . . . . . . . . . . . . . . . . . . . . . . . . . . 141
A.1.2 Experimental Tasks . . . . . . . . . . . . . . . . . . . . . . . 142
A.2 Questionnaire . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143
B 147
B.1 Tasks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147
B.1.1 Tasks from Dataset1 . . . . . . . . . . . . . . . . . . . . . . 147
B.1.2 Tasks from Dataset2 . . . . . . . . . . . . . . . . . . . . . . 148
B.2 Questionnaire . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148
List of Figures
1 The KDD process overview. . . . . . . . . . . . . . . . . . . . . . . 8
2 Scatterplot matrix visualization. . . . . . . . . . . . . . . . . . . . . 15
3 Parallel coordinates visualization. . . . . . . . . . . . . . . . . . . . 16
4 Star coordinates visualization. . . . . . . . . . . . . . . . . . . . . . 17
5 Chernoff-face visualization. . . . . . . . . . . . . . . . . . . . . . . . 18
6 Star glyphs visualization. . . . . . . . . . . . . . . . . . . . . . . . . 19
7 Stick figure visualization. . . . . . . . . . . . . . . . . . . . . . . . . 20
8 Worlds within worlds visualization. . . . . . . . . . . . . . . . . . . 22
9 Hierarchical axis visualization. . . . . . . . . . . . . . . . . . . . . . 23
10 Hyperbolic browser visualization. . . . . . . . . . . . . . . . . . . . . 24
11 Cone trees visualization. . . . . . . . . . . . . . . . . . . . . . . . . 24
12 An example of tree-maps. . . . . . . . . . . . . . . . . . . . . . . . . 25
13 An example of information slices. . . . . . . . . . . . . . . . . . . . 26
14 Spiral and axes query dependent visualization. . . . . . . . . . . . . 27
15 Circle segment visualization. . . . . . . . . . . . . . . . . . . . . . . 28
16 Table lens visualization. . . . . . . . . . . . . . . . . . . . . . . . . 29
17 An example of candidate and frequent itemsets. . . . . . . . . . . . 37
18 An example of visualizing association rules for text mining. . . . . . 39
19 An example of visualizing association rules with Mosaic plots. . . . 40
20 An Anchored Measures approach of ADVIZOR. . . . . . . . . . . . 41
21 An example of barstick visualization in VisEx. . . . . . . . . . . . . 47
22 VisEx System architecture . . . . . . . . . . . . . . . . . . . . . . . 51
23 A screenshot of the user interface with four barsticks queried in VisEx. 56
24 An example of fixed mode exploration. . . . . . . . . . . . . . . . . 58
25 An example result of five queried attributes. . . . . . . . . . . . . . 59
26 Display of the relationship of six queried attributes by Comparison
techniques. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
27 An example result from Exploration techniques. . . . . . . . . . . . 62
28 An example result of four queried attributes. . . . . . . . . . . . . . 63
29 An example result of Selection techniques in barsticks. . . . . . . . 64
30 Display of the relationship of four queried attributes with equal-
height bar chart. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
31 An example of an analysis scenario with four selected attributes. . . 66
32 An example analysis of three selected attributes with the comparison
of the sex attribute. . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
33 An example analysis shows the relationships of five selected attributes
including Total personal incomes, Years of schooling, Occupations,
Class of worker, and Industry. . . . . . . . . . . . . . . . . . . . . . 68
34 An example analysis shows the relationships of five selected attributes:
Total personal incomes, Occupations, Age, Retirement Income, and
Social Security Income. . . . . . . . . . . . . . . . . . . . . . . . . . 69
35 A comparison of five selected attributes including Occupation, Sex,
Education, Race, and Wage. . . . . . . . . . . . . . . . . . . . . . . 70
36 An example analysis with four selected attributes: Education, Expe-
rience, Age, and Wage. . . . . . . . . . . . . . . . . . . . . . . . . . 71
37 The mean time for completing each task. . . . . . . . . . . . . . . . 73
38 The correctness of each task. . . . . . . . . . . . . . . . . . . . . . . 73
39 The results from questionnaires in different categories: (a) Usability
(b) Visualization (c) Interaction (d) Information . . . . . . . . . . . 77
40 A model of the technique for mining association rules. . . . . . . . . 84
41 A screenshot and user interface of identifying frequent itemsets. . . 86
42 A screenshot and user interface of selecting interesting association
rules. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
43 A screenshot and user interface of visualizing association rules. . . . 89
44 The results from questionnaires in different categories: (a) Usability
(b) Visualization (c) Interaction (d) Information . . . . . . . . . . . 94
45 (a) The mean time of completing each task. (b) The correctness of
each task in each dataset. . . . . . . . . . . . . . . . . . . . . . . . 95
46 A diagram of the system for visualizing mined association rules. . . 97
47 A user interface of visualization for mined association rules. . . . . 99
48 Visualization of association rules with AND operation and support
sorting. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
49 Visualization of association rules with AND operation and confidence
sorting. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
50 Visualization of association rules from the selected items of interest
in Figure 47 but sorted according to confidence values. . . . . . . . 101
51 A diagram of ADOMD object model [29]. . . . . . . . . . . . . . . . 110
52 VisOLAP system architecture . . . . . . . . . . . . . . . . . . . . . 112
53 A framework of visualizing OLAP data cubes in VisOLAP. . . . . . . 114
54 A user interface of VisOLAP. . . . . . . . . . . . . . . . . . . . . . 114
55 A framework of the Drill down function in VisOLAP. . . . . . . . . . 116
56 A framework of the Slice function in VisOLAP. . . . . . . . . . . . . 117
57 Examples of OLAP functionalities including drilling down, rolling
up, and slicing on multidimensional data. . . . . . . . . . . . . . . . 118
58 Visualization of the exploration in Product Family, Store Type, Year,
and Quarter. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
59 Visualization of the drill down operation into Product Department
on Product Family. . . . . . . . . . . . . . . . . . . . . . . . . . . . 120
60 Visualization for exploring Promotion media, Store type, and Unit
sales. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124
61 An example of visualization for exploring alcoholic beverage sales of
small groceries in Year 1997. . . . . . . . . . . . . . . . . . . . . . . 125
A Visualization Framework for ExploringCorrelations among Attributes of a Large
Dataset and Its Applications in DataMining
This thesis is
presented to the
School of Computer Science & Software Engineering
for the degree of
Doctor of Philosophy
of
The University of Western Australia
By
Kesaraporn Techapichetvanich
2005
c© Copyright 2005
by
Kesaraporn Techapichetvanich
iii
iv
Abstract
Many databases in scientific and business applications have grown exponentially
in size in recent years. Accessing and using databases is no longer a specialized
activity as more and more ordinary users without any specialized knowledge are
trying to gain information from databases. Both expert and ordinary users face
significant challenges in understanding the information stored in databases. The
databases are so large in most cases that it is impossible to gain useful informa-
tion by inspecting data tables, which are the most common form of storing data
in relational databases. Visualization has emerged as one of the most important
techniques for exploring data stored in large databases. Appropriate visualization
techniques can reveal trends, correlations and associations in data that are very dif-
ficult to understand from a textual representation of the data. This thesis presents
several new frameworks for data visualization and visual data mining.
The first technique, VisEx, is useful for visual exploration of large multi-attribute
datasets and especially for exploring the correlations among the attributes in such
datasets. Most previous visualization techniques can display correlations among two
or three attributes at a time without excessive screen clutter. Though many data
exploration tasks require examining correlations among four or more attributes,
this can be done only indirectly using previous visualization tools. However, the
technique developed in this thesis allows the user to explore correlations among any
number of attributes seamlessly. This technique is also completely scalable in the
sense that it can handle small as well as very large datasets.
v
Many organizations are increasingly using data mining tools to discover important
associations in data stored in large data warehouses. Although many algorithms for
mining association rules have been researched extensively, they do not incorporate
users in the process and most of them generate a large number of association rules.
It is quite often difficult for the user to analyze a large number of rules to identify
a small subset of rules that is of importance to the user. In this thesis I present a
framework for the user to interactively mine association rules visually.
Another challenging task in data mining is to understand the correlations among
the mined association rules. It is often difficult to identify a relevant subset of
association rules from a large number of mined rules. A further contribution of this
thesis is a simple framework in the VisAR system that allows the user to explore a
large number of association rules visually.
A variety of businesses have adopted new technologies for storing large amounts
of data. Analysis of historical data quite often offers new insights into business
processes that may increase productivity and profit. On-line analytical process-
ing (OLAP) has become a powerful tool for business analysts to explore historical
data. Effective visualization techniques are very important for supporting OLAP
technology. A new technique for the visual exploration of OLAP data cubes is also
presented in this thesis.
vi
Preface
Much of the work presented in this thesis has been published as follows. The first
two papers are related to the material in Chapter 3. The third to fifth papers are
related to Chapter 4 and the last paper is related to the material in Chapter 5.
• K. Techapichetvanich, A. Datta and R. Owens. HDDV: Hierarchical dynamic
dimensional visualization. In Proceedings of IASTED International Confer-
ence on Databases and Applications, pages 157-162, 2004.
• K. Techapichetvanich and A. Datta. VisEx: A visualization framework for
exploring correlations among attributes in large multidimensional datasets,
Information Visualization, under review.
• K. Techapichetvanich and A. Datta. Visual mining of market basket asso-
ciation rules, In Proceedings of ICCSA 2004: International Conference on
Computational Science and Its Applications, Volume 3046 of Lecture Notes in
Computer Science, pages 479-488. Springer, 2004.
• K. Techapichetvanich and A. Datta. VisAR: A new technique for visualizing
mined association rules, In Proceedings of the First International Conference
on Advanced Data Mining and Applications (ADMA 2005), Volume 3584 of
Lecture Notes in Computer Science, pages 88-95. Springer, 2005.
• K. Techapichetvanich and A. Datta. Visual data mining for discovering asso-
ciation rules, In K. E. Voges and N. K. Ll.Pope (editors), Business Application
and Computational Intelligence, Chapter 11. Idea Group Publishing, 2005.
vii
• K. Techapichetvanich and A. Datta. Interactive visualization for OLAP, In
Proceedings of ICCSA 2005: International Conference on Computational Sci-
ence and Its Applications, Volume 3482 of Lecture Notes in Computer Science,
pages 206-215. Springer, 2005.
Though this thesis and all published papers are mainly similar, the structure and
all details of individual systems have been described in this thesis in more details
and thoroughly. The author of this thesis is responsible for the originality of the
presented research and is also the primary author for each of these publications.
viii
Acknowledgements
First and foremost, I would like to thank Associate Professor Amitava Datta. He
has been my supervisor for the last two years of my candidature. Over the last two
years, he has provided invaluable motivation, inspiration, and guidance. I am glad
to have you as a supervisor. Many thanks are also extended to Professor Robyn
Owens for her help.
During the first period of my candidature, I have also benefitted from the tremen-
dous support of Dr. Sato Juniper, and Margaret Jones has provided English support
by both teaching and proof reading.
General thanks go to all staff at the School of Computer Science & Software Engi-
neering at the University of Western Australia. Specifically thanks also go to Dr.
Nick Spadaccini, the head of school during the last two years of my candidature. I
have also had the good fortune to be surrounded by a great number of postgraduates
both in the school and outside.
Last, but not least, I would like to offer my special thanks to my family for their
support and encouragement. Thanks also go to the Pocathikorn family for their
support and for caring for me as one of the family.
ix
x
Contents
Abstract v
Preface vii
Acknowledgements ix
1 Introduction 1
1.1 Contributions of the thesis . . . . . . . . . . . . . . . . . . . . . . . 5
1.1.1 Visual exploration of large multidimensional datasets . . . . 5
1.1.2 Visual data mining and visualization of association rules . . 7
1.1.3 Interactive visualization for OLAP . . . . . . . . . . . . . . 10
1.2 Structure of the thesis . . . . . . . . . . . . . . . . . . . . . . . . . 10
2 Previous Work 13
2.1 Information Visualization Techniques . . . . . . . . . . . . . . . . . 13
2.1.1 Geometric techniques . . . . . . . . . . . . . . . . . . . . . . 14
2.1.2 Iconographic techniques . . . . . . . . . . . . . . . . . . . . 17
2.1.3 Hierarchical techniques . . . . . . . . . . . . . . . . . . . . . 20
2.1.4 Pixel-based techniques . . . . . . . . . . . . . . . . . . . . . 26
2.1.5 Table-based techniques . . . . . . . . . . . . . . . . . . . . . 28
xi
2.2 The dynamic query framework . . . . . . . . . . . . . . . . . . . . . 30
2.3 Data Mining . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
2.3.1 Association rules . . . . . . . . . . . . . . . . . . . . . . . . 35
2.4 Visualization for OLAP . . . . . . . . . . . . . . . . . . . . . . . . 40
2.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
3 A New Technique for Visual Exploration of Large Datasets 43
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
3.2 Terminology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
3.3 VisEx system design . . . . . . . . . . . . . . . . . . . . . . . . . . 48
3.4 VisEx system architecture and implementation . . . . . . . . . . . . 50
3.4.1 Connection and Transformation in VisEx . . . . . . . . . . . 50
3.4.2 Visualizing multiple attributes in VisEx . . . . . . . . . . . 52
3.4.3 Querying in VisEx . . . . . . . . . . . . . . . . . . . . . . . 55
3.4.4 User interaction . . . . . . . . . . . . . . . . . . . . . . . . . 61
3.5 Analysis scenarios . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
3.5.1 Analysis 1: 1990 U.S. Census Data . . . . . . . . . . . . . . 65
3.5.2 Analysis 2: 1985 The Current Population Survey . . . . . . 68
3.6 User study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
3.6.1 Experimental methodology . . . . . . . . . . . . . . . . . . . 69
3.6.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
3.7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
4 Visualization for Association Rule Mining 79
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
xii
4.2 Terminology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
4.3 The model for interactive association rule mining . . . . . . . . . . 82
4.3.1 Identifying Frequent Itemsets . . . . . . . . . . . . . . . . . 85
4.3.2 Selecting Interesting Association Rules . . . . . . . . . . . . 87
4.3.3 Visualizing Association Rules . . . . . . . . . . . . . . . . . 87
4.4 Data Structure used in VisDM . . . . . . . . . . . . . . . . . . . . . 89
4.5 A user study of VisDM . . . . . . . . . . . . . . . . . . . . . . . . . 90
4.5.1 Experimental methodology . . . . . . . . . . . . . . . . . . . 90
4.5.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
4.6 Visualization of many association rules . . . . . . . . . . . . . . . . 92
4.6.1 The VisAR system . . . . . . . . . . . . . . . . . . . . . . . 96
4.6.2 The advantages of VisAR . . . . . . . . . . . . . . . . . . . 102
4.7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
5 Interactive Visualization for On-line Analytical Processing 105
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
5.2 Terminology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
5.3 VisOLAP system architecture and implementation . . . . . . . . . . 111
5.3.1 System connection . . . . . . . . . . . . . . . . . . . . . . . 111
5.3.2 Visualizing OLAP data cubes . . . . . . . . . . . . . . . . . 113
5.4 Interaction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
5.5 Visual Exploration and MDX query . . . . . . . . . . . . . . . . . . 119
5.6 Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122
5.7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123
xiii
6 Conclusion 127
6.1 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127
6.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130
Bibliography 131
Appendices 141
A 141
A.1 Tasks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141
A.1.1 Tutorial Tasks . . . . . . . . . . . . . . . . . . . . . . . . . . 141
A.1.2 Experimental Tasks . . . . . . . . . . . . . . . . . . . . . . . 142
A.2 Questionnaire . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143
B 147
B.1 Tasks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147
B.1.1 Tasks from Dataset1 . . . . . . . . . . . . . . . . . . . . . . 147
B.1.2 Tasks from Dataset2 . . . . . . . . . . . . . . . . . . . . . . 148
B.2 Questionnaire . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148
xiv
List of Figures
1 The KDD process overview. . . . . . . . . . . . . . . . . . . . . . . 8
2 Scatterplot matrix visualization. . . . . . . . . . . . . . . . . . . . . 15
3 Parallel coordinates visualization. . . . . . . . . . . . . . . . . . . . 16
4 Star coordinates visualization. . . . . . . . . . . . . . . . . . . . . . 17
5 Chernoff-face visualization. . . . . . . . . . . . . . . . . . . . . . . . 18
6 Star glyphs visualization. . . . . . . . . . . . . . . . . . . . . . . . . 19
7 Stick figure visualization. . . . . . . . . . . . . . . . . . . . . . . . . 20
8 Worlds within worlds visualization. . . . . . . . . . . . . . . . . . . 22
9 Hierarchical axis visualization. . . . . . . . . . . . . . . . . . . . . . 23
10 Hyperbolic browser visualization. . . . . . . . . . . . . . . . . . . . . 24
11 Cone trees visualization. . . . . . . . . . . . . . . . . . . . . . . . . 24
12 An example of tree-maps. . . . . . . . . . . . . . . . . . . . . . . . . 25
13 An example of information slices. . . . . . . . . . . . . . . . . . . . 26
14 Spiral and axes query dependent visualization. . . . . . . . . . . . . 27
15 Circle segment visualization. . . . . . . . . . . . . . . . . . . . . . . 28
16 Table lens visualization. . . . . . . . . . . . . . . . . . . . . . . . . 29
17 An example of candidate and frequent itemsets. . . . . . . . . . . . 37
18 An example of visualizing association rules for text mining. . . . . . 39
xv
19 An example of visualizing association rules with Mosaic plots.
20 An Anchored Measures approach of ADVIZOR.
21 An example of barstick visualization in VisEx.
22 VisEx System architecture
23 A screenshot of the user interface with four barsticks queried in VisEx.
24 An example of fixed mode exploration.
25 An example result of five queried attributes.
26 Display of the relationship of six queried attributes by Comparison techniques.
27 An example result from Exploration techniques.
28 An example result of four queried attributes.
29 An example result of Selection techniques in barsticks.
30 Display of the relationship of four queried attributes with equal-height bar chart.
31 An example of an analysis scenario with four selected attributes.
32 An example analysis of three selected attributes with the comparison of the sex attribute.
33 An example analysis shows the relationships of five selected attributes including Total personal incomes, Years of schooling, Occupations, Class of worker, and Industry.
34 An example analysis shows the relationships of five selected attributes: Total personal incomes, Occupations, Age, Retirement Income, and Social Security Income.
35 A comparison of five selected attributes including Occupation, Sex, Education, Race, and Wage.
36 An example analysis with four selected attributes: Education, Experience, Age, and Wage.
37 The mean time for completing each task.
38 The correctness of each task.
39 The results from questionnaires in different categories: (a) Usability (b) Visualization (c) Interaction (d) Information
40 A model of the technique for mining association rules.
41 A screenshot and user interface of identifying frequent itemsets.
42 A screenshot and user interface of selecting interesting association rules.
43 A screenshot and user interface of visualizing association rules.
44 The results from questionnaires in different categories: (a) Usability (b) Visualization (c) Interaction (d) Information
45 (a) The mean time of completing each task. (b) The correctness of each task in each dataset.
46 A diagram of the system for visualizing mined association rules.
47 A user interface of visualization for mined association rules.
48 Visualization of association rules with AND operation and support sorting.
49 Visualization of association rules with AND operation and confidence sorting.
50 Visualization of association rules from the selected items of interest in Figure 47 but sorted according to confidence values.
51 A diagram of ADOMD object model [29].
52 VisOLAP system architecture
53 A framework of visualizing OLAP data cubes in VisOLAP.
54 A user interface of VisOLAP.
55 A framework of the Drill down function in VisOLAP.
56 A framework of the Slice function in VisOLAP.
57 Examples of OLAP functionalities including drilling down, rolling up, and slicing on multidimensional data.
58 Visualization of the exploration in Product Family, Store Type, Year, and Quarter.
59 Visualization of the drill down operation into Product Department on Product Family.
60 Visualization for exploring Promotion media, Store type, and Unit sales.
61 An example of visualization for exploring alcoholic beverage sales of small groceries in Year 1997.
Chapter 1
Introduction
In the last few decades, increased computer usage has led to the generation and
collection of huge quantities of complex data in many areas, including engineering,
health, scientific, and business areas. Moreover, databases and data warehouses
have become integral parts of business activities and scientific research. A National
Science Foundation (NSF) report [49] defined visualization as a way of adapting
the numerical abilities of computers to human perception by transforming
and preprocessing raw data into visual images. To understand huge quantities
of data clearly and within a reasonable time frame, an efficient and effective
visualization tool or application is needed to present the data in graphical form.
Visualization combines computer graphics, computer science, visual arts, image
processing, and user-interface methodology to enhance insights into data. It can
be used to help scientists, analysts, and researchers gain knowledge from data,
reduce the time taken to interpret information, and reveal
relationships and hidden phenomena in large datasets.
Information visualization can be defined as the field of visualization that represents
abstract or non-physical data, such as financial data, that cannot be obviously
mapped onto physical space. Scientific visualization, on the other hand, deals with
data that often have an inherent geometry and relate to mathematical structures
and models. It tends to use
physical data containing spatial mapping from fields such as meteorology, medical
imaging, and space exploration [16]. For example, in meteorology, visualization of the
data representing the density of cloud cover in the atmosphere is based on a
three-dimensional representation of the earth. In medical imaging,
magnetic resonance imaging (MRI) scans or computerized tomography (CT) scans
show anatomical structures or parts of the body that could not be viewed by any
other means. Information visualization usually requires less computational power
than scientific visualization and so can be done on personal computers. Information
visualization deals with the problem of identifying and displaying visibly important
portions of the data and effectively mapping non-spatial information onto visual
forms.
Both scientific and information visualization deal with the encoding and mapping of
data to visual form in geometric space. In scientific visualization, mapping physical
data to geometric space is important. In contrast, abstract data have no inherent
geometry, so the geometric space carries no intrinsic meaning in information
visualization. One common way to classify the spaces used to encode abstract data
is into the following four groups [9]. The four spaces are distinguished by their
mapping techniques and data characteristics, although the first space might
sometimes be regarded as a subset of the second, and the third
as a subset of the fourth.
• 1-D, 2-D, and 3-D orthogonal axes or xyz Cartesian coordinates space is used
to encode data.
• Multiple dimensions > 3 are used to encode complex data dimensions onto a
limited screen space.
• Trees are used to display connections between multiple levels to encode rela-
tionships in data.
• Networks or graphs are used to encode relationships in data.
Multidimensional visualization is one of the most important techniques used in in-
formation visualization [49]. The purpose of multidimensional visualization is not
only to understand data but also to understand the underlying hidden relationships
present in the data. A variety of multidimensional visualization techniques have
been developed by researchers in many areas such as statistics, finance, medical
research, and mathematics. Multidimensional visualization uses understandable
graphical forms to represent relationships in multidimensional datasets (multiple
variables or parameters and their relationships) by mapping n-dimensional coor-
dinates (n ≥ 3) onto low dimensional coordinate spaces. There are various ap-
proaches to address the problem depending on the purpose of the visualization (i.e.
what relationships users want to look for). For example, the Worlds within Worlds
technique [19] by Beshers and Feiner visualizes multidimensional data by placing a
coordinate space inside another in a hierarchical manner. Techniques developed by
Healey [24, 25] focus on performing exploratory data analysis rapidly and accurately
by using preprocessing.
An important challenge in developing a visualization tool for exploring multidimensional
data is the need for simplicity, so that the tool is easy to understand and training time is short. It
is easy for a human observer to understand visual information presented as a two
dimensional bar chart where one attribute is displayed against another. It is still
possible to understand three dimensional bar charts or surface plots. However, plot-
ting one attribute against another one or two attributes does not scale beyond three
dimensions. Another problem is a lack of efficiency in using screen space resulting
in occlusion. Hence, a variety of techniques have been developed for visualizing
multidimensional data, and these are reviewed in Chapter 2.
Although the well known parallel coordinates [34] and scatterplot matrix [13] can
visualize multidimensional datasets, they generate occlusion when visualizing large
datasets. Since the encoding and mapping of a large amount of multidimensional
data is a difficult design task, occlusion occurs easily. It is also well
known that we can only understand a limited amount of information visually at
a time [30]. Hence, the design of an appropriate visualization technique should
avoid this problem or at least reduce it as much as possible.
Visualization should be able to convey important information through efficient use of
limited screen space, simplicity of use, and clear, understandable representations.
Simplicity [81] can be interpreted as the way of using friendly and intuitive input
structures and providing an easily interpretable output. In addition, a good visual-
ization method should improve users’ ability to extract or discover interesting and
hidden relationships from large multidimensional datasets. Although pixel-based
techniques [38, 41] address these problems, users might need a lot of training time
to use these visualization techniques.
Typically, some basic requirements should be considered in designing an informa-
tion visualization system so that the system can effectively convey important and
interesting information to users [9]. These requirements are as follows.
• Perception and Cognitive amplification: Humans have a limited ability
to perceive visual information. Information visualization relies on human
perception and hence, we need to consider the load on human perception in
designing an effective visualization system. An effective visualization system
should take advantage of normal human cognitive abilities.
• Comprehension: The visualization should convey comprehensive informa-
tion to users.
• Visual structure: The visualization should be clear enough to preserve data
and convey information through visual representation.
• Computational Cost: A good visualization should minimize its computa-
tion time, which refers to the issues of algorithm optimization and real-time
response during interactions.
1.1 Contributions of the thesis
In this thesis I investigate new visualization techniques for exploring large datasets
to help users gain insight into their data and to discover relationships, trends,
distributions, and patterns among data attributes. I categorize the visualization
techniques in this thesis into three major topics, namely, visual exploration of large
multidimensional datasets, visual data mining and visualization of association rules,
and interactive visualization for OLAP. All of these visualization techniques are de-
signed to reduce the occlusion when displaying a large number of records by reduc-
ing the number of graphic primitives representing records on the screen. Moreover,
the intention has been to design these techniques to be reliable, flexible, simple to
understand and to meet the requirements mentioned in the previous section.
1.1.1 Visual exploration of large multidimensional datasets
Simple tools for data visualization are needed for giving users who are not experts
in database technologies access to large databases. Such users require intuitive tools
that they can use without any prior knowledge of either any database technology or
the nature of the underlying database. Shneiderman and his co-workers recognized
the need for such intuitive tools in a series of papers almost a decade ago [61, 4,
3, 79]. Their main aim was to give the user sufficient freedom for database query.
Since most database query languages are relatively time consuming to learn, it is
unrealistic to expect that a user who wants to access a database for a specific need
will first learn a query language. Moreover, it may not be enough to know a query
language for extracting meaningful information effectively. Quite often a query may
produce either no records or a large number of records. Neither case
helps the user, who usually wants a small number of records that can be effectively
examined further.
Williamson and Shneiderman [79] have presented a very interesting scenario where
a user wants to find meaningful information from a large database without any prior
knowledge of a query language or the details of the underlying database. In their
dynamic home finder system, a user wants to find a suitable home depending on
the location of her workplace and other factors such as price, number of bedrooms, and
area of the plot. The usual way of choosing a house is to go through the brochures
from a real estate agent. However, in the system developed by Williamson and
Shneiderman [79], the user can progressively narrow down her search through visual
queries. Shneiderman [61], Ahlberg et al. [4], Williamson and Shneiderman [79],
and Ahlberg and Shneiderman [3] have proposed the dynamic query framework for
such visualization tasks.
Most large databases store records that have several attributes. For example, a car
database contains car records that have attributes like make, model, color, engine
capacity and price. A census database stores records that usually have many more
attributes, such as age, gender, years of education, occupation, and
ethnic background. Each record in a U.S. census database has 72 attributes.
One of the major tasks both experts and non-experts face during exploration of a
database is the understanding of the correlations between the different attributes.
In many cases the user can focus the search for a particular item or a small group of
items by restricting the different attributes progressively. This is often the strategy
used by shoppers at internet shopping sites. Wittenburg et al. [80], Lanning et al.
[43] and Tweedie et al. [75] have extensively considered this scenario in the dynamic
query framework. In the EZChooser system, Wittenburg et al. [80] use a visual-
ization system called parallel bargrams for progressively narrowing down the search
by restricting attribute values. The Attribute Explorer [75, 66] and MultiNav [43]
tools are also based on similar strategies. As the ranges for different attributes are
progressively chosen, the number of objects satisfying these restrictions reduces.
The tools show the objects that satisfy all the constraints in a bottom panel of the
screen. The user can select or deselect different attributes and this allows the user
to experiment with different attribute restrictions.
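The progressive narrowing described above can be sketched in a few lines of Python. This is only a minimal illustration of the dynamic query idea, not the implementation of EZChooser, Attribute Explorer, or MultiNav; the record fields, the restrict helper, and the sample values are hypothetical.

```python
# Hypothetical car records; attribute names and values are illustrative only.
cars = [
    {"make": "A", "price": 12000, "engine": 1.6},
    {"make": "B", "price": 18000, "engine": 2.0},
    {"make": "C", "price": 25000, "engine": 2.0},
    {"make": "D", "price": 31000, "engine": 3.0},
]

def restrict(records, attribute, low, high):
    """Keep only the records whose attribute value lies in [low, high]."""
    return [r for r in records if low <= r[attribute] <= high]

# Each added restriction can only keep or shrink the set of matching records.
step1 = restrict(cars, "price", 15000, 32000)
step2 = restrict(step1, "engine", 1.5, 2.5)
print(len(cars), len(step1), len(step2))
```

Because each restriction is applied to the result of the previous one, deselecting an attribute amounts to recomputing the chain without that step.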
This thesis presents a new visualization system called VisEx as further detailed in
Chapter 3. VisEx is based on the dynamic query framework in the sense that it
allows even novice users to experiment with a large multi-attribute database and
frame meaningful queries. The user interaction in VisEx has some similarities with
the user interaction in EZChooser [80], Attribute Explorer [75] and MultiNav [43].
However, VisEx overcomes some of the key restrictions in these systems. VisEx is
a completely scalable system in the sense that it can handle small as well as very
large multi-attribute databases through a quantitative estimation of records. We
change the granularity at which the user views the records in a database depending
on the size of the database.
The main aim of the VisEx system is slightly different from the aims of the systems
in [80, 75, 43]. The aim of these systems is to allow the user to zoom in to specific
items by restricting the ranges of the different attributes. The user can experiment
with different choices by constraining different ranges for the different attributes
so that she can have a better choice of an item at the end. Hence the focus in
these systems is to allow the user to choose an item or a few items from a large
collection according to the user’s specifications. VisEx can be used in a way similar
to the Attribute Explorer [75] and EZChooser [80] systems; however, VisEx is a
more versatile system for exploring correlations among attributes of large datasets.
In VisEx, our main aim is to give the user the facility to experiment with different
ranges for different attributes and see the effect of these restrictions on the other
attributes. It also provides a bar graph which allows users to view the proportions
of data values in each attribute. A user study is presented to judge the effectiveness
of the VisEx system.
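The idea of restricting one attribute and observing the effect on another can be sketched as follows. This is an illustration under stated assumptions rather than the VisEx implementation, which is described in Chapter 3: the proportions helper and the sample records are hypothetical.

```python
from collections import Counter

# Hypothetical census-like records; values are illustrative only.
records = [
    {"sex": "F", "education": "school", "income": "low"},
    {"sex": "F", "education": "degree", "income": "high"},
    {"sex": "M", "education": "school", "income": "low"},
    {"sex": "M", "education": "degree", "income": "high"},
    {"sex": "M", "education": "degree", "income": "low"},
]

def proportions(rows, attribute):
    """Return the fraction of rows taking each value of the attribute."""
    counts = Counter(r[attribute] for r in rows)
    total = sum(counts.values())
    return {value: n / total for value, n in counts.items()}

# Distribution of income over all records, then after restricting education.
overall = proportions(records, "income")
degree_only = [r for r in records if r["education"] == "degree"]
restricted = proportions(degree_only, "income")
print(overall, restricted)
```

Comparing the two dictionaries shows how the proportions of one attribute shift when another is restricted, which is the kind of information a bar graph of proportions would convey.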
1.1.2 Visual data mining and visualization of association
rules
Data mining is a core process of knowledge discovery from large databases. Data
mining can be illustrated through the KDD mechanism [18]. KDD is the
process of extracting knowledge by identifying valid, potentially useful, and understandable
patterns from data sources. The KDD process (as shown in Figure 1)
consists of a number of basic steps including data selection, data preprocessing, data
transformation, data mining, pattern discovery, and pattern evaluation. Data
mining tasks have different goals according to the kinds of knowledge to be mined.
An example of a data mining task is the generation of a mailing list of purchasing
customers. The data mining process helps managers to extract the mailing list of
loyal customers.
Figure 1: The KDD process overview.
Even though some research has been done in visual data mining, its definition is
still unclear. Visual data mining can be described as the integration of visualization
into the data mining process. The integration combines the human ability of iden-
tifying patterns visually and the ability of a computer to do large scale numerical
computations rapidly. The main problem with automatic data mining algorithms
is that they often mine a large number of association rules, not all of which are
equally important. On the other hand, it is possible for a human expert to participate
in the mining process for extracting a small set of interesting association rules.
Visualization is an important tool for a human expert to participate in the mining
process. Visual data mining can be categorized into three groups based on how the
visualization is integrated in the data mining process.
• Pre-applying visualization into data mining for exploring datasets: data is
firstly visualized to generate initial views before applying data mining algo-
rithms.
• Post-applying visualization into data mining for conveying the mining results:
the data mining algorithms extract patterns in data and then the extracted
patterns are visualized.
• Intermediate-application of visualization in the mining process: Users can
apply their domain knowledge to support knowledge extraction through the
mining process. This approach has been described [81] as a tight coupling of the
human and computer in the mining process.
In the first two techniques, human experts cannot participate and apply their knowl-
edge in data mining. In the third technique, humans can examine information
through visualization and apply their knowledge at each step of the mining process.
This technique helps users to efficiently extract interesting patterns hidden in their
data and learn more through visual interaction. Surprisingly, this technique has
not been used so far in the literature for mining association rules. In this thesis, I
introduce a new tight coupling technique which enables users to apply their domain
knowledge to improve the quality of data mining approaches through visualization.
I also provide a new technique for visualizing a large number of mined association
rules.
1.1.3 Interactive visualization for OLAP
On-line analytical processing (OLAP) has become an important tool for interactive
analysis of multidimensional databases such as data warehouses. Many businesses
have adopted data warehouses as the preferred mode of data storage in order to
manage the explosive growth of their databases [11]. OLAP helps analysts to ex-
plore, analyze, and extract interesting patterns from massive amounts of data stored
in multidimensional databases and data warehouses. Since most multidimensional
databases contain hierarchical structures, it is difficult for users to explore multi-
dimensional data with a tool providing only overviews of data. It is important for
users to be able to explore their multidimensional databases interactively to refine
their views. Interactive textual displays, such as a PivotTable, are not enough for
understanding or extracting patterns from multidimensional databases.
Chapter 5 presents a new interactive visualization technique called VisOLAP. The
aim of this technique is to assist analysts to improve their performance in explor-
ing, analyzing, and understanding large databases through interactive visualization.
The tool incorporates visualization into an OLAP service, which enables analysts to explore
high-level overviews of the data and drill down into levels of detail in each
dimension directly. The incorporation of both visualization and OLAP not only
helps users to extract interesting patterns but also helps them to interpret and
analyze the extracted information faster.
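The notions of a high-level overview and a drill down into a dimension can be sketched with plain aggregation. This is a minimal illustration, not the VisOLAP implementation or any OLAP service API; the fact rows, their values, and the aggregate helper are hypothetical, with dimension names borrowed from the figure captions.

```python
from collections import defaultdict

# Hypothetical fact rows: (product family, product department, unit sales).
facts = [
    ("Drink", "Beverages", 10),
    ("Drink", "Dairy", 4),
    ("Food", "Bakery", 7),
    ("Food", "Produce", 9),
    ("Drink", "Beverages", 6),
]

def aggregate(rows, key):
    """Sum unit sales over rows grouped by key(row)."""
    totals = defaultdict(int)
    for row in rows:
        totals[key(row)] += row[2]
    return dict(totals)

# Overview at the coarse level of the hierarchy (roll-up).
by_family = aggregate(facts, key=lambda r: r[0])
# Drill down into one family to see its departments.
drink_rows = [r for r in facts if r[0] == "Drink"]
drink_detail = aggregate(drink_rows, key=lambda r: r[1])
print(by_family, drink_detail)
```

The totals at the detailed level add up to the corresponding total at the coarse level, which is what lets a user move between levels without losing track of the overview.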
1.2 Structure of the thesis
The remainder of this thesis is organized as follows. In Chapter 2, I discuss the
current state of research in information visualization. In particular, developments
of some visualization techniques related to the research presented in this thesis are
outlined. The application of visualization in other areas such as visualization for
mining association rules and visualization for other applications including OLAP is
also considered.
In Chapter 3, a new technique for visual exploration of correlations among attributes
in large multidimensional datasets is presented. Although there are some visual-
ization techniques for comparing attributes in multidimensional datasets, most of
these techniques work only for a small number of dimensions or attributes, in most
cases only three. This new technique on the other hand is completely scalable and
can handle any number of attributes. Also, most techniques [41, 40] are quite
complex and users need time to understand and use the techniques. The system
presented in this chapter helps analysts and users to extract hidden relationships,
correlations, and trends and it addresses the occlusion problem effectively. In ad-
dition, this technique is highly interactive so that users can understand and gain
insight into datasets. A user study is presented for evaluating the system.
Chapter 4 describes the integration of a visualization technique into association
rule mining algorithms in a new framework called VisDM. This technique is a com-
promise between completely manual mining by users and purely automatic mining
algorithms. Again, a user study is described for this system. In addition, a new
technique called VisAR, for visualizing a large number of mined association rules is
presented. This technique improves visualization of a large number of association
rules generated through data mining algorithms.
The VisOLAP system for interactive visualization of OLAP data cubes is presented
in Chapter 5. I show how users can visually explore large data cubes with many
dimensions without any prior knowledge of OLAP technology.
Chapter 6 concludes the thesis with a discussion of the contributions presented
in this thesis. It highlights the limitations that remain in data visualization, and
points to future research.
Chapter 2
Previous Work
This chapter reviews existing research in information visualization, visual data min-
ing and visualization of OLAP data cubes. An overview of some general techniques
is given before concentrating on the techniques that are closely related to this thesis.
2.1 Information Visualization Techniques
Visualization tools have become important in helping users to discover and interpret
useful information from a large amount of data. A considerable amount of research
has been done on information visualization techniques in the past decades. This
research can be broadly categorized into several groups as discussed below [37].
• Geometric techniques involve geometric transformation and projection of data.
This category includes techniques like scatterplot matrix, parallel coordinates,
and star coordinates. These techniques are discussed in detail in Section 2.1.1.
• Iconographic techniques use features of icons or glyphs to represent data.
Some examples of techniques in this category are Chernoff faces, star glyphs,
and stick-figure icons. These techniques are explained in Section 2.1.2.
• Hierarchical techniques map variables into different recursive or hierarchical
levels. Worlds within worlds, hierarchical axis, hyperbolic browser, cone trees
and tree-maps are examples of techniques in this category. The details of these
techniques are described in Section 2.1.3.
• Pixel-based techniques [41, 38, 40] try to represent individual data records by
pixels, and characteristics of a record are denoted by coloring the correspond-
ing pixel using a color map. VisDB and pixel bar charts are some examples
of techniques in this category. An overview of these techniques is provided in
Section 2.1.4.
• Finally, table-based techniques like table lens, FOCUS and Polaris employ
table features to visualize different characteristics of data. More details of
these techniques are explained in Section 2.1.5.
2.1.1 Geometric techniques
Geometric techniques project and geometrically map datasets, especially multidi-
mensional datasets, onto the display device. One of the earlier approaches in infor-
mation visualization is the scatterplot [13], in which two variables are projected and
mapped onto x-y Cartesian coordinates. The scatterplot matrix is a combination of
several scatterplots, and a different pair of variables is used in each scatterplot. In a
scatterplot matrix (Figure 2), individual variables are arranged along the diagonal
of a matrix and each display panel illustrates relationships or correlations between
variables. The number of variables in a dataset that can be sensibly visualized
simultaneously by scatterplots is limited by the size of the display device, so often
only a subset of the data is visualized at any particular time.
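The screen-space limit of a scatterplot matrix follows from the number of variable pairs it must show. The sketch below enumerates the distinct pairs for the six car variables named in Figure 2; the panel_pairs helper is a hypothetical name used only for illustration.

```python
from itertools import combinations

def panel_pairs(variables):
    """List the distinct variable pairs, one per panel on one side of the diagonal."""
    return list(combinations(variables, 2))

variables = ["mpg", "weight", "drive ratio", "horse power", "displacement", "cylinders"]
pairs = panel_pairs(variables)
# For n variables there are n * (n - 1) / 2 distinct pairs.
print(len(pairs))
```

Since the number of pairs grows quadratically with the number of variables, each panel shrinks quickly on a display of fixed size.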
Figure 2: Scatterplot matrix representing six dimensions of cars: mpg, weight, drive ratio, horse power, displacement, and cylinders. The figure is taken from Basalaj [7].
The best-known geometric technique is the parallel coordinates technique proposed
for visualizing multidimensional data by Inselberg and Dimsdale [34]. In this approach,
the dimensions are represented by parallel vertical lines, which are perpendicular
to and uniformly distributed along a horizontal line (Figure 3). Each
variable, attribute or dimension is assigned to a specific parallel axis. A record
is represented by plotting a zigzag line that connects its attribute values on the
different axes. The relationships among attributes that are represented by nearby
vertical lines are easy to perceive. However, it gets harder to perceive relationships
among attributes that are represented by widely separated vertical lines. Hence,
the initial choice and ordering of the axes has a big effect on the visualization. The
other limitations of this method are the restriction of the horizontal axis and screen
space. If the number of data points becomes large, the plotted result becomes a
solid blob of color and the correlation among attributes represented by distant axes
is hard to understand or explore.
Figure 3: Parallel coordinates from Inselberg and Dimsdale, 1987 [34].
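The mapping of a record to its zigzag line can be sketched as follows. This is a minimal illustration assuming evenly spaced vertical axes and linear scaling of each value between a known minimum and maximum (with the maximum strictly greater than the minimum); the polyline function and its parameters are hypothetical names.

```python
def polyline(record, ranges, width=1.0, height=1.0):
    """Map one record to the (x, y) vertices of its parallel-coordinates line.

    record: list of attribute values, one per axis
    ranges: list of (min, max) pairs, one per axis
    """
    n = len(record)
    points = []
    for i, (value, (lo, hi)) in enumerate(zip(record, ranges)):
        x = width * i / (n - 1) if n > 1 else 0.0   # axes evenly spaced
        y = height * (value - lo) / (hi - lo)       # value scaled along its axis
        points.append((x, y))
    return points

pts = polyline([20, 3000, 4], [(10, 30), (2000, 4000), (4, 8)])
print(pts)
```

Every record contributes one such polyline, so with many records the lines overlap heavily, which is the occlusion problem noted above.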
Star coordinates [36] is another approach to project multidimensional data onto a
two-dimensional plane. As the name suggests, different attributes of the dataset
are represented by a set of radial axes that emanate from the center of a circle.
As shown in Figure 4, nine attributes (namely horse power, mpg, origin, cylinders,
model year, name, acceleration, displacement and weight) are represented by nine
axes. In contrast to the parallel coordinates technique, the star coordinates tech-
nique transforms each data item and displays it as a point. Similar to the parallel
coordinates technique, this approach is based on the projection and geometrical
mapping of datasets. When a large amount of multidimensional data is visualized, the
display can become a solid blob of color, which makes it hard to interpret the
correlation of attributes. Because this technique places an excessive
number of graphic primitives representing data records on the screen, the
primitives occlude one another.
Figure 4: Star coordinates from Kandogan, 2001 [36].
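The projection of a record to a single point can be sketched as below. The sketch assumes a commonly used formulation in which the point is the sum of the axis direction vectors, each scaled by the normalized attribute value, with axes placed at equal angles; this is an assumption for illustration and may differ in detail from [36]. The star_point function is a hypothetical name.

```python
import math

def star_point(record, ranges):
    """Project one record to a 2-D point using evenly spaced radial axes."""
    n = len(record)
    x = y = 0.0
    for i, (value, (lo, hi)) in enumerate(zip(record, ranges)):
        angle = 2 * math.pi * i / n          # direction of axis i
        scaled = (value - lo) / (hi - lo)    # normalize value to [0, 1]
        x += scaled * math.cos(angle)
        y += scaled * math.sin(angle)
    return x, y

unit = [(0, 1)] * 4
p_low = star_point([0, 0, 0, 0], unit)
p_high = star_point([1, 1, 1, 1], unit)
print(p_low, p_high)
```

Under this formulation the two very different records above land on practically the same point, which illustrates why points from distinct records can overlap on screen.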
2.1.2 Iconographic techniques
In iconographic displays, icons or glyphs are used to visualize multidimensional data.
The common implementation of these techniques is through mapping dimensions
of data to graphical attributes such as size, color, shape, and orientation.
One of the first iconographic techniques was developed by Chernoff [12] as shown
in Figure 5. Multidimensional data are represented in the form of a human face.
The design of this technique was based on the ability of humans to recognize and
differentiate human faces and therefore to perceive regions of clustered data and
outliers. Each data variable is assigned to facial features such as eyes, eyebrows,
facial area, nose, and mouth, which have different attributes of shape, location,
orientation, length, and size to represent the values in each dimension. Depending
on the order in which variables are assigned to the different features, similar or
indistinguishable faces can appear.
Figure 5: Chernoff-faces from Chernoff, 1973 [12].
A star glyph [62] represents multiple attributes by line segments, like the radii of
a circle, where each line emanates from the center of the glyph and its length
represents the value of one dimension (as shown in Figure 6). The number of
line segments depends on the number of dimensions: n dimensions require
n line segments. A star glyph represents all selected dimensions
(i.e. a row of a data table) of a data point. The ability to display many dimensions
depends on the uniformly mapped angles. However, the glyphs become cluttered
with many dimensions or attributes. Also, it is difficult to display many glyphs at
a time corresponding to many data points due to the limitation of screen space.
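As a rough illustration of the star glyph geometry, the sketch below computes the end point of each line segment for one data record. The uniform angular spacing follows the description above; the scaling of each value by a per-dimension maximum and all names are assumptions made for the example.

```python
import math

def star_glyph(values, max_values, center=(0.0, 0.0), radius=1.0):
    """Return the end point of each ray of a star glyph for one record.

    Ray i leaves the center at angle 2*pi*i/n; its length is the value of
    dimension i scaled to [0, radius] (assumed scaling: value / maximum).
    """
    n = len(values)
    cx, cy = center
    points = []
    for i, (value, vmax) in enumerate(zip(values, max_values)):
        length = radius * (value / vmax if vmax else 0.0)
        angle = 2 * math.pi * i / n
        points.append((cx + length * math.cos(angle),
                       cy + length * math.sin(angle)))
    return points
```

With many dimensions the angle between neighboring rays shrinks, which is one way to see why the glyph becomes cluttered.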
Pickett and Grinstein have developed a technique to map multidimensional data
onto a two-dimensional plane [57] such as a computer screen. The applications
of the technique have been focused on spatially or temporally coherent data such
as multispectral imagery datasets. The original approach of Pickett and Grinstein
Figure 6: Star glyphs represent six dimensions, mpg, weight, drive ratio, horse power, displacement and cylinders of cars from Seigel et al., 1972 [62]. The figure is taken from Basalaj [7].
displays multispectral imagery data by using colors so that each dimension of data
controls the intensity of a primary color: red, green, and blue. The changes in
colors show relationships among datasets.
The method by Pickett and Grinstein [57] uses icons, called stick-figure icons,
to represent data elements. Each stick-figure icon is composed of five connected
line segments. Four are limbs and the other is the body of the icon. The first
four dimensions from the data can be mapped onto the four limbs with each value
controlling the angle of a limb. The last dimension controls the orientation of the
body (as shown in Figure 7). In addition, color, thickness, or length can be encoded
to the limbs and body to represent higher dimensionality.
Figure 7: Examples of stick figure icons from Pickett and Grinstein, 1988 [57].
After data elements have been mapped to icons, the icons are displayed in two-
dimensions. Data that have close values are clustered into the same groups and
have similar icon shapes. When these icons are displayed as groups, they form a
texture pattern in the image. The boundary of each group can be noticed because
each group generates different patterns of textures. Pickett and Grinstein have
investigated the possibilities of dynamic icon coding, of users interacting with dynamic
icons, and of dynamic icons interacting with each other. Since one glyph visualizes one
data object, most techniques in this category are limited by the small number of
dimensions and the number of data records that can be displayed.
2.1.3 Hierarchical techniques
Hierarchical visualization techniques represent datasets by partitioning space
hierarchically into subspaces. Some techniques in this group are based on recursively
embedding dimensions, which stacks subspaces onto each other. Examples are
Hierarchical Axes [50, 52, 51] and worlds within worlds [19, 10], which are discussed
later in this section. Each subspace has a relationship with its parent subspace (i.e.,
an inner subspace has a relationship with an outer space). Some techniques used in
information visualization such as cone trees [59], tree-maps [35], and the hyperbolic
browser [42] are structured node links in which child nodes are extended from their
parents. Most of these information visualization techniques are used to visualize
and interact with data sets with large hierarchies.
Worlds within worlds, also called a nested heterogeneous coordinate system [19], is
a three-dimensional hierarchical space technique, in which lower dimensions (inner
worlds) are recursively placed in higher dimensions (outer worlds) as shown in
Figure 8. A height field or vertical axis of inner worlds represents the value of a
function and all remaining variables, and is used to code the constant value of the
outer world (at most three variables at each level). The positions of the outer worlds
are related to the inner world’s origin. Moving or translating an inner world changes
the values of the outer-world variables it represents, so the height field of the inner world
is adjusted, but not vice versa. In order to increase flexibility in manipulating the
relationships of multivariate data to be represented, an extension of worlds within
worlds called AutoVisual [10], has been developed. The zooming and selection
tools in this technique allow users to perceive interesting areas of a dataset more
accurately.
Hierarchical axis methods [50, 52, 51] use one-dimensional subspace embedding and
aim at visualizing high dimensionality on two-dimensional graphics space. One of
the techniques used is to plot scalar fields on an n-dimensional lattice by categorizing
the data into two sets of dependent and independent variables. In Figure 9, the
former are mapped on the vertical axis, and the latter are recursively mapped onto a
single horizontal axis. The term “speed” was introduced to this method and colors
were used to distinguish the values of each parameter in the hierarchically horizontal
axis from the others. The first mapped variable is termed the “fastest”, the next the
Figure 8: Example of worlds within worlds by Beshers and Feiner, 1990 [19].
“second fastest”, and so forth. Three classification techniques [51] were described
for individual functions. For common multidimensional analysis, three rules [52]
were applied to determine the vertical plot including minimum/maximum, sum,
and mean or standard deviation methods. A histogram, or a binned matrix, using
a mean function to gain better visual perception, was compared to a traditional
scatterplot matrix [13]. In addition, the zoom and clone tools were developed to gain
a large number of data representations by allowing users to display the subspace of
an interesting area.
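The recursive mapping of several independent variables onto one horizontal axis can be read as a mixed-radix index, with the "fastest" variable changing most rapidly along the axis. The sketch below is only an interpretation of that idea; the function name and the assumption that each variable has already been binned into an integer index are hypothetical.

```python
def horizontal_position(indices, sizes):
    """Map per-variable bin indices (fastest variable first) to one slot.

    The fastest variable advances by one slot per step; each slower
    variable advances by the product of the sizes of all faster ones.
    """
    position = 0
    stride = 1
    for index, size in zip(indices, sizes):
        if not 0 <= index < size:
            raise ValueError("index out of range")
        position += index * stride
        stride *= size
    return position
```

For example, with three years, two subjects and two classes (as in Figure 9), the axis has 3 * 2 * 2 = 12 slots, and a step in the slowest variable moves the position by six slots.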
The hyperbolic browser [42] uses a hierarchical technique with a Focus+Context (fisheye)
approach applied for interaction. The display of the hyperbolic browser is a tree,
as shown in Figure 10. The root is initially placed at the center. The Focus+Context
technique allows viewers to focus on the details of small areas or other nodes while
retaining the context of the entire hierarchy. The hyperbolic browser approach lays out
the hierarchy uniformly on a hyperbolic plane and then maps this plane onto a circular
disk on the display. While a tree is being laid out on the hyperbolic plane, a recalculation
checks whether the node in focus has changed. Transformation of space is
used to magnify a region at the center of focus while the rest of the region shrinks.
This allows users to explore or browse selected regions of interest in more detail,
but moving a node in hyperbolic browser affects the orientation of its children and
Figure 9: Example of hierarchical axis method in which years is the fastest axis, the second fastest axis is subject, and class is the slowest axis from Mihalisin et al., 1991 [51].
the viewer can become disoriented while the children are rotated.
Other hierarchical techniques are cone trees and tree-maps. Cone trees [59] are an
animated three-dimensional visualization technique, an alternative to the two-dimensional
techniques usually used with hierarchical data structures. Sub-trees or child nodes of cone
trees (shown in Figure 11) are evenly expanded in a circle at a lower level around
the apex of the cone. In the first implementation, each level of cone trees was
implemented with the same height. Diameters of cone trees at each level were
reduced so that they fitted into the display space, called a room. For the visual
and interactive aspects, some nodes are labeled as transparent to avoid occlusion,
and viewers rotate cone trees to explore the data, and can adjust the cone radius
and height, name levels of the cone trees, and shift perspective angles between a
parent and child node of the cone trees. The problem with cone trees is that the
user experience deteriorates with the density of the data. Also, occlusion becomes
a problem with dense datasets.
Tree-maps (a space-filling approach) [35] illustrated in Figure 12 is a visualization
technique used to map the entire hierarchical information (which is both structural
Figure 10: Hyperbolic browser from Lamping, 1995 [42].
Figure 11: Cone trees from Card and Mackinlay, 1997 [59].
Figure 12: Tree-maps from Shneiderman, 1998 [35].
and content information) to a rectangular partitioned display space. In this way,
large hierarchical information structures are presented on a two-dimensional dis-
play. Each partitioned rectangle is assigned a weight based on the size of the node.
This weight determines the area of the associated rectangle. In the original imple-
mentation of this visualization tool, hard disk drives with large directory structures
were used as datasets. Tree-maps allowed viewers to visualize the entire hierarchy
simultaneously and to set display properties such as colors and borders to enhance
the image.
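The way a weight determines the area of each rectangle can be sketched with a simple slice-and-dice subdivision, in which the split direction alternates at each level of the hierarchy. This is an illustrative layout under that assumption; the dictionary keys ("name", "size", "children") and the function names are hypothetical.

```python
def weight(node):
    """Weight of a node: its own size for a leaf, else the sum over its children."""
    children = node.get("children")
    if not children:
        return node["size"]
    return sum(weight(child) for child in children)

def slice_and_dice(node, x, y, w, h, depth=0, out=None):
    """Give each leaf a rectangle (x, y, w, h) whose area is proportional to its weight."""
    if out is None:
        out = {}
    children = node.get("children")
    if not children:
        out[node["name"]] = (x, y, w, h)
        return out
    total = weight(node)
    offset = 0.0
    for child in children:
        share = weight(child) / total if total else 0.0
        if depth % 2 == 0:
            # Even depth: divide the width among the children.
            slice_and_dice(child, x + offset * w, y, share * w, h, depth + 1, out)
        else:
            # Odd depth: divide the height among the children.
            slice_and_dice(child, x, y + offset * h, w, share * h, depth + 1, out)
        offset += share
    return out
```

For a directory tree, "size" would be the file size, so large files and large directories occupy visibly larger areas.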
The information slices technique [5] is another visualization approach. It uses semi-circular
discs to represent large hierarchies in two-dimensional space by dividing
the disc into multiple levels as shown in Figure 13. Deeper hierarchies can be
viewed by expanding a series of semi-discs from each section of each level.
Most hierarchical techniques are more suitable for representing dense data than
Figure 13: Information slices from Andrews, 1998 [5].
some of the visualization techniques previously mentioned. Viewers can see and
compare the closer groups of datasets rather than distributed datasets on the same
screen. Hierarchical techniques are not straightforward, as they require appropriate
mapping of data in order to interpret data efficiently.
2.1.4 Pixel-based techniques
Pixel-based techniques aim to display as many data items as possible. Each data
value is mapped onto a pixel, and each pixel is colored from a fixed range of colors
according to where its value falls within the range of the attribute.
In VisDB [41], each data record is mapped to individual pixels on a screen after
sorting and arranging the relevant data according to a query. The colors are chosen
by considering relevance factors. The VisDB system uses visualization approaches
to provide feedback on query results. In VisDB, there are two main techniques:
query independent and query dependent display. The query independent technique
employs line ordering or column ordering, using space-filling curves and recursive
pattern approaches to order data items based on an attribute. The query dependent
approach, including Spiral and Axes techniques, arranges the closest results from
queried data items, mapping them to colors in a color ramp onto the center of the
display. The query dependent approaches display only the region of data items
within a certain distance to the reference point as shown in Figure 14. The other
variables are represented in different windows and the distances are distinguished
by different colors in each dimension. The different parts of the database can be
visualized by changing the reference point.
Figure 14: Pixel-based visualization of query dependent techniques (Spiral and Axes) from left to right, from Keim, 1996 [41].
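The idea behind the Spiral arrangement, placing the closest results at the center and less relevant items further out, can be sketched as follows. This is not the VisDB implementation: the rectangular spiral, the caller-supplied distance function, and both function names are assumptions chosen for illustration, and the color mapping is left out.

```python
def spiral_positions(count):
    """Yield `count` grid cells (x, y) spiralling outward from (0, 0)."""
    x = y = 0
    dx, dy = 1, 0
    step_length = 1
    produced = 0
    while produced < count:
        for _ in range(2):                  # two legs share each step length
            for _ in range(step_length):
                if produced == count:
                    return
                yield (x, y)
                produced += 1
                x, y = x + dx, y + dy
            dx, dy = -dy, dx                # turn by 90 degrees
        step_length += 1

def arrange_by_relevance(items, distance):
    """Place items on the spiral so that the smallest distance sits at the center."""
    ranked = sorted(items, key=distance)
    return dict(zip(spiral_positions(len(ranked)), ranked))
```

Changing the reference point changes the distances, so the same items are re-ranked and a different part of the database moves to the center of the display.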
The circle segments technique [6] is a pixel-based technique that maps data
attributes onto circle segments. Each attribute is sorted independently and arranged
line-by-line from the center to the border of the circle segment. Figure 15 shows
the pixel arrangement into circle segments of four attributes.
Pixel bar charts [39] apply pixel-based display and x-y plotting
to traditional bar charts. The bars are used to represent categorical data while
x-y plotting and color coding inside the bars are used to represent numerical data.
Although the techniques use pixels to represent data objects for efficient space
Figure 15: The representation of the circle segments arrangement of data items onto pixels from Ankerst et al., 1996 [6].
usage, users might need considerable training to use and understand the outcome
of the visualization. In addition, sometimes users might be overwhelmed by the
mixing of colors.
2.1.5 Table-based techniques
The techniques in this group employ table characteristics such as rows and columns
to visualize datasets. Some table-based techniques integrate interaction techniques
such as Focus+Context to make the table interactive and use graphical
representations to display data attributes.
Table lens [58], shown in Figure 16, is a visualization technique based on Focus+Context,
or the fisheye technique, that displays multidimensional data in a tabular style. It
displays a dataset in a table by using horizontal bar charts and a Focus+Context
technique rather than text. The system can be used for
visualization of large datasets represented by compressed tables. Users can also
zoom into specific areas of the table to see the distribution of specific attributes
visually.
InfoZoom [67], developed from FOCUS [68], represents attributes along rows and data
records along columns of the table. Similar to the table lens technique, InfoZoom
allows the users to gain a flexible overview of an object-attribute table through the
Figure 16: Table lens technique from Inxight, 1994 [58].
fisheye technique. The goal of this technique is to present and compare products on
the Internet. The user can progressively explore specific areas of the table through
formulation of interactive queries. However, this technique is predominantly textual
with only limited visual feedback to the user for comparing different attributes.
Neither table lens nor InfoZoom supports instant viewing of individual attributes
in the specific ranges across other attributes and the entire dataset.
Polaris [72] is a table-based visualization technique which allows users to explore
multidimensional databases. A table in Polaris comprises rows, columns, and lay-
ers. The system treats nominal and ordinal data as independent variables called
dimensions, and all quantitative data as dependent variables called measures. Rows
and columns of Polaris represent the data attributes which may contain nested di-
mensions. To generate a graphical display of the table, the system uses table algebra
to specify table configuration and types of graphical display such as a bar chart or
line chart. It maps a set of records retrieved by database queries to each pane of
the table through the graphical representation. The graphical encoding employs
retinal properties [8] such as size, shape, and colors as graphical display of markers
on the pane.
An example of a technique that is not included in the five groups discussed above
is XmdvTool [78]. XmdvTool is a brushing technique, in which data points can be
selected to display interesting areas of data. It integrates several
multidimensional visualization methods for projecting data onto a two-dimensional
screen. XmdvTool supports scatterplots, star glyphs, parallel coordinates, and dimensional
stacking approaches for displaying multivariate data. N-dimensional brushing
in XmdvTool allows users to change, highlight, select, or delete a subset of graphically
displayed objects using suitable input devices. In addition, n-dimensional brushes
have characteristics such as shape, size, boundary, position, motion and display, which
help the user perceive relationships in the n-space of selected data
points. Linking, which is a method associated with brushing, enables multiple views
of the same data to be displayed simultaneously.
2.2 The dynamic query framework
About a decade ago, Shneiderman [61] argued strongly for an intuitive and visual
mechanism for accessing and experimenting with databases. He argued that there
are two main difficulties in using a database query language for retrieving records
from a database. First, many users do not know such a query language and second,
in many situations the user does not have sufficient information about the underly-
ing database. Quite often database queries result in either no records matching the
query or too many records matching the query. The result is not helpful for the user
in either case. On the other hand, a visual query mechanism can provide the user
with useful information about the underlying database, so that the user can frame
queries visually and also see the results of these queries visually. Shneiderman calls
such visual interfaces direct manipulation interfaces.
Williamson and Shneiderman [79] mention four criteria to judge the quality of a
direct manipulation interface.
• Continuous visual representation of objects and actions of interest,
• Physical actions or labeled button presses instead of complex query syntax,
• Rapid, incremental, reversible operations whose results are immediately visi-
ble, and
• Layered or spiral approaches to learning that permit usage with minimal
knowledge.
Williamson and Shneiderman [79] illustrate these four criteria in their dynamic
home finder system. In a typical scenario, a user wants to purchase a suitable
home within an affordable price range, with a required number of rooms and in
a convenient locality. The user does not have any knowledge of the underlying
database of available homes, and she can experiment with her requirements by
relaxing them if necessary. The system allows the user to narrow down the search
progressively by gradually restricting the attributes for her search. The primary
focus of the work by Shneiderman and his co-workers [61, 4, 3, 79] is to provide
the user complete flexibility in changing the search criteria and rapid feedback when
the attribute ranges are changed.
Spence and Tweedie [66] argue that the traditional approach of information retrieval
through a database query language works only for a small fraction of real world
problems. In most situations the user needs to have a clear idea about the structure
of the underlying database for framing meaningful queries. Moreover, database
queries retrieve only the records that exactly match the query. Hence, the user does
not get any idea about the records that might be just outside the query range, but of
interest to the user. Spence and Tweedie [66] put forward the idea of information
synthesis rather than information retrieval. In their opinion, problem or query
formulation is as important an activity as the retrieval of records. They emphasize
the need for the user to learn about the structure of the underlying database through
a visual query mechanism. The underlying philosophy of their Attribute Explorer
system can be described in the following sentence [66].
Given a collection of objects, each described by the values associated with a set
of attributes, find the most acceptable such object or, perhaps, a small number of
candidate objects suited to more detailed consideration.
The Attribute Explorer system [66, 75] allows user interaction satisfying the require-
ments of the dynamic query framework [79]. The user can set the upper and lower
limits for each attribute. Each attribute is displayed as a histogram in a separate
window. The x-axis for the histogram is the selected range and the y-axis is the
number of items satisfying a particular attribute value. Since the main aim of
Attribute Explorer is to help the user to narrow down the search for a particular
item or items, it is important to show the items individually. Hence, each object is
displayed separately in each histogram as a small rectangular box. A bar of the his-
togram is a stack of such boxes. When the user specifies the range for an attribute,
all the objects satisfying this range are marked with a specific color according to a
color coding scheme.
Another strong feature of the system is attribute interaction. The system colors
an object with the same color when the object satisfies the current selected range
for one of the attributes. This helps the user to judge the position of the object
in different attribute windows and the interrelation between different attributes.
For example, if the object is a house, and the user chooses a price range between
$200,000 and $300,000, all the houses satisfying this constraint are coded with the
same color in the other attribute windows. Suppose another attribute is ‘number of
bedrooms’. The user now can see clearly the distribution of number of bedrooms in
the houses within this price range. In the information synthesis scenario of Spence
and Tweedie [66], the user may want to revise the initial choice of the price range
if the number of bedrooms is not adequate for her need. The system also colors
objects that fail one or more attribute limits specified by the user. The purpose is
to inform the user that a change of one or more attributes may bring these objects
back within the limits of all the attributes chosen by the user. The main aim of
the system is to guide the user to a small number of objects that satisfy the user
requirements expressed as attribute ranges.
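The color coding described above depends on how many attribute limits an object fails. A minimal sketch of that bookkeeping is given below; the record layout (one dictionary per object), the inclusive limits, and the function names are assumptions for illustration and do not describe the Attribute Explorer code.

```python
def count_failed_limits(record, limits):
    """Number of attribute limits (low, high) that the record violates."""
    return sum(
        1
        for attribute, (low, high) in limits.items()
        if not low <= record[attribute] <= high
    )

def classify(records, limits):
    """Group records by the number of limits they fail (0 means a full match)."""
    groups = {}
    for record in records:
        failed = count_failed_limits(record, limits)
        groups.setdefault(failed, []).append(record)
    return groups
```

A display could then assign one color to group 0 and progressively different colors to groups 1, 2, and so on, so that objects failing a single limit remain visible as near misses.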
The EZChooser system designed by Wittenburg et al. [80] follows a strategy similar
to the Attribute Explorer system. This approach has been further illustrated in
the paper by Lanning et al. [43]. The main focus is to use the dynamic query
mechanism for providing the user complete freedom in choosing the attribute ranges.
The user can choose the range for one attribute and see the effect of that choice
on other attributes. However, the visualization framework in the EZChooser system
is quite different from that of the Attribute Explorer. Instead of histograms, Wittenburg
et al. use parallel bargrams to show the attribute ranges and the user selections.
A parallel bargram is a horizontal histogram which shows all the objects in the
database according to the increasing values of a specific attribute.
Wittenburg et al. [80] illustrate the use of the EZChooser system through a vehicle
choosing interface for prospective buyers. Categorical attributes like car make or
model are converted into ordinal attributes by assigning an ordering to the nominal
fields. The display for EZChooser has two frames. The upper frame contains the
bargrams for all the attributes in the underlying database. The lower frame displays
the objects that satisfy the constraints specified by the user in the bargrams. The
bargrams for different attributes are displayed in parallel, one below the other,
according to increasing attribute values. For example, cars with lower prices appear
to the left of a bargram and cars with higher prices to the right of the bargram
specified for showing the price attribute. As the user chooses attribute ranges
progressively, all the cars that satisfy these ranges are highlighted through coloring
in the different bargrams. As an example, if the user chooses a price range of
$20,000 to $22,000, the cars that satisfy this price range will be highlighted in the
bargrams for all other attributes. Moreover, all the cars satisfying the constraints
will be displayed in the lower frame through icons or pictures of specific cars.
Both the Attribute Explorer and the EZChooser systems emphasize the importance
of displaying individual items in the database. This is important since the user needs
to view how many objects are selected due to the restriction of a specific attribute.
The objects are displayed as rectangles in the Attribute Explorer system and as
icons in the EZChooser system. However, this requirement imposes a constraint
on how many objects can be displayed in a histogram in Attribute Explorer or in
a bargram in EZChooser. Spence and Tweedie [66] do not address the issue of
scalability as all the datasets they work with are small. Wittenburg et al. [80]
mention the scalability issue. They aggregate values into bins in
each bargram. In other words, each bin shows a collection of items satisfying a
range of values. As the user progressively narrows down the ranges for successive
attributes, the bins show a smaller and smaller number of items and eventually
individual items. Recall that the lower frame in EZChooser shows the individual
items that satisfy the user-selected ranges on all attributes. There is a problem
with scalability in showing these items in EZChooser as the smallest representation
is a single pixel for an item. Hence, the number of items that can be shown is limited
by the number of pixels available at the screen resolution. However, items are given larger and larger space
as the user narrows down the search and eventually small icons are shown when
the selected items form a small enough set. Wittenburg et al. [80] conclude that
the EZChooser system allows users to explore datasets interactively for item sets
consisting of up to 1000 items when each item has about 10-20 attributes.
Spence [65] has emphasized the importance of sensitivity encoding to support navi-
gation in information space. According to Spence, the exploration of an information
space consists of four interrelated activities: interpretation, decision, browsing and
modeling. The user interprets the data in order to take a decision on the directions
of movement in an information space. The user creates an internal model of the
underlying data through browsing. Spence [65] defines sensitivity as a specific trans-
lation in information space and the related action required to achieve it. Spence has
discussed in detail how the systems like Attribute Explorer and EZChooser fit into
this framework of sensitivity encoding.
Although there are systems in the dynamic query framework that help the user to
narrow down the search for a particular item or items, no practical system addresses
the issue of size scalability of datasets or lets the user search for correlations
among attributes of large datasets over different ranges. In Chapter 3
I base my work on the dynamic query framework and present a novel visualiza-
tion framework called VisEx for exploring correlations among attributes in large
multidimensional datasets. The VisEx system also follows the paradigm of sensi-
tivity encoding through a particular choice of sensitivity encoding for comparing
attributes. The other visualization frameworks presented in this thesis like VisDM
and VisAR in Chapter 4 and VisOlap in Chapter 5 are also based on the dynamic
query framework.
The next section gives a brief overview of data mining and the knowledge
discovery process, including a review of data mining tasks.
2.3 Data Mining
Data mining is a process for extracting knowledge or useful information from a huge
amount of data. It can be considered as a knowledge discovery process [18]. As
briefly mentioned in the previous chapter, data mining tasks have different targets
for both gaining insight into data and/or predicting trends in the data. The data
mining results can be primarily categorized as one of the following [21]: association
rules, classification, regression, prediction, data processing and clustering. Data
mining methods are not reviewed in detail in this section, except for association
rule mining which is relevant to this thesis.
2.3.1 Association rules
Association rule mining is one of the data mining methods which focuses on explor-
ing relationships among items in datasets such as transactional databases. Associ-
ation rules contain discovered patterns or conditions under which the data records
frequently occur together. For example, an association rule might show which
products are frequently purchased together or the purchase of a particular item
may imply (with some probability) the purchase of other items. These types of
association rules are also called market basket association rules. Store managers
or marketing officers can use an analysis of the market basket association rules to
learn purchasing behavior of their customers and to promote product sales or to
improve their marketing plans.
The earliest well-known algorithms for generating association rules are AIS [1],
SETM [28], Apriori, and AprioriTid [2]. The Apriori algorithm constructs frequent
itemsets by generating candidate k-itemsets (Ck) and then determining the support
of each candidate itemset. The process of generating the candidate k-itemsets is
also known as joining and pruning. The first iteration through the transactional
database is done to count the number of appearances of individual items. Each
subsequent iteration generates candidate k-itemsets from the frequent (k-1)-itemsets
found in the previous iteration (the joining step) and then checks their support.
The joining step combines two frequent (k-1)-itemsets that have k-2 items in common.
The candidate k-itemsets whose count meets the minimum support
are the frequent k-itemsets (the pruning step). The algorithm stops when there
is no new frequent itemset. Figure 17 illustrates an example of generating candidate
itemsets and finding frequent itemsets.
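The level-wise search described above can be sketched compactly. The code below is a simplified illustration rather than the algorithm exactly as published in [2]: it joins any two frequent (k-1)-itemsets whose union has k items, counts support by rescanning the transactions, and uses an absolute support count; the function name and parameters are hypothetical.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return {frequent itemset (frozenset): support count}."""
    transactions = [frozenset(t) for t in transactions]
    # First pass: count the appearances of individual items.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {s: c for s, c in counts.items() if c >= min_support}
    result = dict(frequent)
    k = 2
    while frequent:
        # Joining: two frequent (k-1)-itemsets sharing k-2 items give a k-itemset.
        candidates = {a | b for a, b in combinations(list(frequent), 2)
                      if len(a | b) == k}
        # Support counting, then keep candidates that meet the minimum support.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        frequent = {s: c for s, c in counts.items() if c >= min_support}
        result.update(frequent)
        k += 1
    return result
```

The loop ends when an iteration produces no new frequent itemsets, which matches the stopping condition stated above.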
The FP-tree (Frequent Pattern tree) algorithm [22] searches for frequent itemsets without
generating candidate itemsets. Similar to Apriori, the FP-tree algorithm obtains the
frequent 1-itemsets by scanning the database. The frequent items in each transaction are
sorted according to their frequency of occurrence. The algorithm then scans through
the database again to construct the FP-tree. To generate frequent itemsets from
FP-tree, the algorithm proceeds along three major steps: constructing conditional
pattern bases (sets of items of each node when their parent node exists) by traversing
the FP-tree based on the order of the frequency table, constructing FP-trees (called
the conditional FP-tree) from the conditional pattern bases, and then recursively
mining the conditional FP-trees.
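The two database scans and the extraction of a conditional pattern base can be sketched as follows. This is a partial illustration only: it omits the header table and the recursive mining of conditional FP-trees, breaks frequency ties by comparing item values (an assumption), and uses hypothetical class and function names.

```python
from collections import Counter

class Node:
    def __init__(self, item, parent=None):
        self.item = item
        self.count = 0
        self.parent = parent
        self.children = {}

def build_fp_tree(transactions, min_support):
    """Scan once to find frequent items, then once more to build the tree."""
    counts = Counter(item for t in transactions for item in set(t))
    frequent = {i: c for i, c in counts.items() if c >= min_support}
    root = Node(None)
    for t in transactions:
        # Keep frequent items only, most frequent first.
        items = sorted((i for i in set(t) if i in frequent),
                       key=lambda i: (-frequent[i], i))
        node = root
        for item in items:
            if item not in node.children:
                node.children[item] = Node(item, node)
            node = node.children[item]
            node.count += 1
    return root, frequent

def conditional_pattern_base(root, item):
    """Collect (prefix path, count) for every tree node labelled `item`."""
    base = []
    stack = [root]
    while stack:
        node = stack.pop()
        stack.extend(node.children.values())
        if node.item == item:
            path = []
            parent = node.parent
            while parent is not None and parent.item is not None:
                path.append(parent.item)
                parent = parent.parent
            if path:
                base.append((path[::-1], node.count))
    return base
```

Shared prefixes are stored once with a count, which is why the tree is usually much smaller than the list of transactions it summarizes.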
Not all discovered association rules that satisfy the user-defined minimum support and
minimum confidence are interesting. Interestingness of association rules has been
Figure 17: An example of candidate itemsets and frequent itemsets generated from the Apriori algorithm [2].
researched in [64, 56]. Since the final process of determining the interestingness of
the association rules depends on users, visualization for association rules has been
researched in recent years.
Visualization techniques have been integrated into data mining to help users in
understanding datasets, discovering associations and patterns in their data. Various
methodologies have been developed to visualize association rules generated by data
mining algorithms. Prior research can be categorized into three main groups: Table-
based, Matrix-based, and Graph-based.
First, Table-based techniques are the most common and traditional approaches to
visualize association rules in the form of a table. In general, the columns of a rule
table represent the items, the number of antecedents and consequents, the support,
and the confidence of association rules. Each row represents an association rule.
Some examples of Table-based techniques are included in SAS Enterprise Miner [32]
and DBMiner [21].
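One row of such a rule table can be computed directly from the transactions. The sketch below uses the usual definitions, support as the fraction of transactions containing both sides of the rule and confidence as that count divided by the number of transactions containing the antecedent; the function name and the dictionary layout are hypothetical and do not reproduce the output format of any of the tools mentioned.

```python
def rule_row(transactions, antecedent, consequent):
    """Build one rule-table row: items, item counts, support and confidence."""
    antecedent, consequent = frozenset(antecedent), frozenset(consequent)
    both = antecedent | consequent
    n_antecedent = sum(1 for t in transactions if antecedent <= set(t))
    n_both = sum(1 for t in transactions if both <= set(t))
    return {
        "antecedent": sorted(antecedent),
        "consequent": sorted(consequent),
        "n_antecedent_items": len(antecedent),
        "n_consequent_items": len(consequent),
        "support": n_both / len(transactions),
        "confidence": n_both / n_antecedent if n_antecedent else 0.0,
    }
```

Sorting the resulting rows by support or confidence gives the kind of ranked textual listing that table-based tools present to the user.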
Second, Matrix-based techniques such as MineSet [33] (2-D matrix), 3-D matrix [81],
and grid represent the antecedent and consequent on a square grid based on the
coordinate axes. In 3-D matrix, the height and color of columns are used to represent
the properties of the association rules such as support and confidence. Similar to
2-D matrix, the grid techniques relying on frame display represent antecedents and
consequents by a square matrix. A cell with color and brightness is used to represent
the confidence and support of an association rule. For example, MineSet [33] uses
a 2-D matrix technique to visualize a large number of association rules. Wong et
al. [81] use the 3-D matrix in which both of the antecedents and consequents are
represented by a matrix based on x-y coordinates, but its 2-D matrix tiles represent
the relationships of rule-to-item rather than item-to-item. In this technique, the
blue and red columns illustrate the antecedents and consequents respectively as
shown in Figure 18. The columns of the confidence and support of association rules
are scaled and plotted at the farthest end of the x-y plane.
The last group is Graph-based techniques such as Directed Graph. These techniques
Figure 18: An example of visualizing mined association rules including their antecedents, consequents, support, and confidence from Wong et al., 1999 [81].
use nodes to represent the items and edges to represent the associations of items
in the rules. For example, a rule A ⇒ B is represented by a directed graph with A
and B as the nodes. The edge connecting A and B has the arrow pointing to the
consequent (B) of the rule. DBMiner [21] uses a technique called Ball graph which
is based on a directed graph. The nodes in Ball graph are called balls whose size
varies depending on the number of items represented by a ball.
Some prior work has integrated the above techniques into their systems. For in-
stance, CrystalClear [55] is an integrated technique based on a grid that applies a
tree technique to view the number of items and the lists of antecedents and conse-
quents. Another technique, that has not been discussed above, is Interactive Mosaic
plots for visualizing association rules [27] as shown in Figure 19. As its name sug-
gests, this technique applies Mosaicplot visualization to represent the relationships
among items in each association rule from a contingency table. To visualize the
relationships of items in association rules, the technique displays all items in the
Figure 19: An example of visualizing with Mosaic plots from Hofmann et al., 1999 [27].
left hand side of rules by using Mosaicplot and the right hand side of rules by
highlighting the corresponding categories in a barchart. The Mosaicplot technique
represents each cell of a table by using a bin whose size varies depending on the
number of occurrences of items in the cells.
Although there are many existing algorithms for association rule mining, most of
them are automatic mining algorithms. There is still the challenge to incorporate
human knowledge into automatic association rule mining algorithms to retrieve as-
sociation rules of interest. Chapter 4 investigates a new technique for visual mining
of association rules that allows humans to participate in the mining processes. I also
present a hybrid technique for visualizing mined association rules which reduces the
complexity of visualizing a large number of association rules on a single screen.
2.4 Visualization for OLAP
On-line Analytical Processing (OLAP) has been a very active area of research in
recent years. Only OLAP research that employs information visualization has been
discussed in this thesis.
One popular way of viewing OLAP results as a textual presentation of
Figure 20: An example of visualization by Anchored Measures from Eick 2000 [17].
queried results [14, 54, 74] is a technique such as a pivot table. ADVIZOR [17] pro-
vides visualization tools for exploring databases through visual query and analysis.
Three techniques, Single Measure, Multiple Measure, and Anchored Measure, are
parts of this tool. The Single Measure approach represents a measure by using a
3D bar chart on a centered window called the 3D Multiscape. The height of each
bar shows a measure value. The Multiple Measure approach applies a scatterplot to
visualize two measures along x-y axes rather than the 3D Multiscape. Colors can
display the third measure. The Anchored Measures approach combines ParaBox,
bubble plots, parallel coordinates, and box plots to visualize multidimensional data as
shown in Figure 20. Bubble plot axes represent dimensions and box plots measures.
Both bubble plot axes and box plot axes are arranged in the style of parallel coordi-
nates [34]. The system allows drilling down into lower levels of abstraction; however,
the user can drill down along only one dimension at a time.
Polaris is used for visualizing multidimensional databases as well as for viewing
data cubes. It is an interactive visual exploration tool which employs a table-based
visualization technique [72]. A graphical display of the table looks similar to a pivot
table in a textual format. The extension of Polaris [73] provides an additional tool
for interactive exploration of hierarchical structures of datasets. The system allows
users to get overviews of data and drill down into lower levels similar to a pivot table
approach. In contrast, the technique presented in Chapter 5 allows users to explore
independent overviews of data, low levels of detail, and any particular region of
interest anytime during navigation.
Andreas et al. [47] have developed their model for OLAP screens, called the Cube
Presentation Model (CPM), and applied a visualization technique, table lens, to
their model. The CPM model consists of two layers, logical and presentational.
The logical layer deals with data retrieval while the presentational layer is for data
presentation. The model employs cross-joins for retrieving maximum, minimum,
and closest average values. The main goal of the system is to determine the
window of interest for viewing the data in particular areas in large overviews of
the cross-join window. The system does not provide drilling down and rolling up
features for exploring multiple levels along different hierarchies.
Although visualization has been integrated in some OLAP tools, no practical tech-
nique providing visual feedback and interactive visualization for exploring hierarchi-
cal data has been researched. Chapter 5 details an interactive visualization tool for
analytical tasks which relieves the user of the burden of remembering exploration
paths.
2.5 Summary
This chapter has reviewed prior research relevant to this thesis. It has given an
overview of the field including visualization techniques, dynamic query framework,
data mining, visual data mining and visualization for analytical tasks. Several new
visualization techniques related to these topics are now presented in the subsequent
chapters of this thesis.
Chapter 3
A New Technique for Visual
Exploration of Large Datasets
3.1 Introduction
In this chapter, a new technique is presented for visual exploration of large mul-
tidimensional datasets for discovering correlations among attributes. There is an
explosion of datasets in many different areas like business, government and scientific
disciplines. The demand to extract meaningful trends and correlations from these
datasets is also increasing.
Data visualization and visual data exploration play important roles in extracting
trends and correlations in large datasets [15]. It is quite often impossible for a
human expert to understand large multidimensional datasets through manual ex-
amination or by viewing the data tables in text format. Visualization tools are
extremely important for this purpose. A visualization tool can quickly show trends
and correlations in the underlying dataset that are impossible to find through other
means. Although well-known visualization techniques such as parallel coordi-
nates [34], star coordinates [36], and the scatterplot matrix [13] are accepted and com-
monly used, most of them suffer from occlusion when visualizing large
datasets. Moreover, these techniques are not very useful for visualizing correlations
among attributes of large datasets.
The original dynamic query framework was introduced by Shneiderman [61], Ahlberg
et al. [4], Williamson and Shneiderman [79] and Ahlberg and Shneiderman [3].
Many systems including MultiNav [43], EZChooser [80], and Attribute Explorer [75,
66] have been developed and extended, based on this framework. However, these
systems do not scale well for large datasets and their focus is on searching for a
single item or a few particular items, rather than visualizing correlations among
attributes.
VisEx is a new tool for exploratory visualization of large multidimensional datasets.
This tool allows the user to visualize a large dataset dimension by dimension or at-
tribute by attribute. VisEx is based on the dynamic query framework in the sense
that it allows even novice users to experiment with large multi-attribute datasets
and to frame meaningful queries. Users can explore the dataset through what-if type
analysis by imposing restrictions on the values of the different attributes. Once a
range has been restricted for one attribute, the tool displays, in the barsticks of the
other attributes, all the records that satisfy this range. VisEx is a scalable system
that handles both small and very large multi-attribute databases.
The system also provides users with flexible granularity for viewing the records in a
database depending on the size of the database. As the number of records increases,
the granularity at which records are shown is made coarser. Moreover, VisEx can
be used for selecting specific items by restricting the values of the attributes pro-
gressively, just as in Attribute Explorer [66, 75] and EZChooser [80].
To motivate the need for such a system, consider a scenario in which a
user needs to compare the different attributes of a dataset to learn about the corre-
lations between these attributes. Consider a commonwealth database of all primary
school children in Australia (this scenario is equally valid for other countries with
some modifications). Suppose each child has six attributes in the database, age
(AGE), parents' median income (MINCOME), parents' median educational back-
ground (MED), whether the child attends a private, Catholic, Anglican or state
school (TSCHOOL), literacy level of the student: poor, average, satisfactory or
excellent (LITLEVEL), and whether the child comes from a single or two parent
family (NPARENT).
Assume that a state or the commonwealth education department is trying to im-
prove the literacy levels of the primary school children through the framing of new
policies or initiatives. The purpose of a visual analysis in this case is not to choose
specific records, unlike the Attribute Explorer [66, 75] or the EZChooser [80] sys-
tems. Rather the emphasis is on framing hypotheses and testing them through
restrictions of different attributes. For example, a policy maker may have a hy-
pothesis that children in the lower primary age group (6-10) attain higher levels
of literacy if they come from two-parent homes rather than single-parent homes.
The policy maker can test this hypothesis in the following way. She first selects
the AGE attribute for display and a range 6-10 for the AGE attribute. She next
selects the LITLEVEL attribute for display. Only the student records with age in
the range 6-10 are displayed in the LITLEVEL attribute barstick (my term for a
horizontal histogram as described in the next section). In other words, the display
of the LITLEVEL attribute is constrained by the selection of the AGE attribute.
The analyst then restricts the range of the LITLEVEL attribute as 3-4 (satisfactory
or excellent). Next, she chooses the NPARENT attribute and only the records that
satisfy the restrictions on the previous two attributes are displayed in the barstick
for the NPARENT attribute. Since the NPARENT attribute has two quantitative
levels 1 (single-parent) and 2 (two-parents), the policy maker can easily check the
distribution of the highlighted records in these two levels and see whether her hy-
pothesis is correct. Suppose she finds no difference between the two distributions; in
other words, children who attain a higher level of literacy are equally likely to come
from single-parent or two-parent homes. She may now want to test whether children in the upper
primary age group satisfy her hypothesis. She can change the range of the AGE
attribute to 11-14 and only the records that satisfy this restriction will be displayed
in the other barsticks. Hence, she can immediately check her second hypothesis vi-
sually. This is only an example scenario illustrating the power of VisEx in providing
rapid visual feedback to the user for testing many hypotheses quickly without any
detailed knowledge of the underlying database.
This chapter is organized as follows. Section 3.2 introduces the terminologies used
throughout this chapter. In Section 3.3, I discuss the system design including ben-
efits and requirements of the visualization system. The VisEx system architecture
is presented in Section 3.4. The details of the subsystems are then described, fol-
lowed by user interaction in the system along with an example. Analysis scenarios
of the system as well as a user study are given in Sections 3.5 and 3.6, respectively.
Section 3.7 summarizes the contributions of the chapter.
3.2 Terminology
A row in a relational table or a flat file can be referred to as a tuple, item, or record
and a column as a field, dimension, or attribute. However, in this chapter, I refer to
a row as a record or item and a column as an attribute or dimension. The display
in VisEx consists of two separate visual entities called barstick and bar as discussed
below.
Barstick
A barstick is a histogram placed horizontally as shown in Figure 21. Each attribute
of the underlying dataset is represented by a separate barstick. Each barstick is
initially empty, but has the potential to display all the records in the dataset.
VisEx starts displaying the records when the user starts restricting the ranges of
the different attributes. The length of each barstick is restricted to be the same to
optimize the use of screen space.
Figure 21: An example of querying multiple attributes in VisEx. The figure shows three barsticks for three selected attributes. The attributes are selected by the user in this order. (a) The first barstick is displayed by sorting the value of the first queried attribute from the dataset. The red-colored area and the partitions (represented by two black vertical lines) represent the selected range of this attribute. The color of the selected bars is red since all the records in this range are selected and each bar represents the maximum number of records. (b) The second barstick displays the second attribute of the records. The colored bars represent the records whose first attributes are within the selected range in the first barstick. The coloring of the bars shows the density of the records. For example, the first set of bars in the second barstick is colored in yellow and the second set of bars is colored in blue. This means that the first set of bars has a higher density of records. The user selected range in the second barstick is shown by the two vertical black lines. (c) The third barstick displays the records that have their first and second attributes within the selected ranges in the first and second barsticks. Again, the color of the bars shows the density of the records. The other interaction technique is shown by the highlighted bars in gray in the three barsticks. The user selects a group of bars in the second barstick by clicking on them. All the records affected by this selection are highlighted in gray in the other barsticks.
Bar
Each bar spans the width of a barstick. Each bar represents at least one record from
the dataset. However, there is no upper limit on how many records a single bar
can represent. A variable color coding scheme is used for representing the number
of records represented by a bar. The scheme varies from red to blue along the
spectrum. If all the records in the dataset are displayed in a barstick, the color of
each bar is the same.
The details of how to visualize and how to map data records into bars and data
attributes into barsticks are further provided in Section 3.4.
3.3 VisEx system design
VisEx has been designed to discover relationships, correlations, distributions and
trends in large datasets, and overcome the occlusion problem. To overcome the
occlusion problem, the system is designed to reduce the number of graphic primitives
on the screen. I do this by having each bar display a quantitative estimate of its
record count through a color scheme. The granularity of the data is adjusted according to the number
of records to be displayed. In other words, each bar may show a higher number of
records if the number of records to be displayed is large. The color of a bar indicates
the number of records represented by the bar. For example, red indicates a higher
density of records and blue indicates a lower density of records. The other colors
in between indicate different degrees of densities. The system has four benefits,
namely simplicity, scalability, flexibility, and dynamism.
• Simplicity: The visualization is kept clear and understandable while using
only a limited amount of screen space at a time. In
this technique, barsticks and bars (as discussed further in the next section)
are exploited for visualizing large multidimensional datasets.
• Scalability: VisEx supports the display of both small and large num-
bers of records in a limited screen space through the same approach used to
overcome occlusion: a quantitative estimate of records conveyed by the color
scheme of the bars.
• Flexibility: The system provides human experts with selection and
exploration capabilities for conducting what-if experiments on large datasets.
• Dynamism: The general idea of dynamism in VisEx is to generate dynamic
visualization which is capable of reconfiguring the attributes and handling
dynamic analysis of large amounts of multidimensional data.
Any visual analysis tool needs to meet requirements [40, 72] for effective visualiza-
tion of large amounts of data. VisEx provides specific features to handle the display
of large datasets. These features are:
• Data-dense displays: A large number of data records is transformed and
displayed in a single barstick.
• Screen management: Screen space is effectively managed to avoid screen
occlusion, which limits the ability of analysts to interpret data from the visual-
ization.
• Locality: Data records are arranged by ordering them and grouping similar at-
tribute values close to each other. Data locality helps analysts to obtain a clear
view of the distribution of queried subsets among all the data records.
• Filtering: Analysts are able to generate queries to limit the range of data
which they are interested in so that unrelated data are not displayed on the
screen.
3.4 VisEx system architecture and implementation
The VisEx system architecture consists of four main components: connection and trans-
formation, querying, visualizing, and interaction, as shown in Figure 22. The system
connects to the flat files or databases which users provide. Users can set
up the queries or data subsets of interest through the interaction tool. The system
then retrieves queried subsets from the relational database and organizes the data
records in barsticks according to the queries. After arranging queried results,
VisEx displays the subsets of data records that satisfy the constraints in the query.
This visual feedback helps users to understand the relationships and correlations
of the queried subsets so that they can set up a sequence of queries to discover
deeper correlations of attributes in datasets. The interaction allows users to obtain
or browse details of the subsets, including regenerating new attribute and range
selection queries. To enable the system to visualize data from a variety of data
sources, VisEx allows accessing data from both flat files and relational databases.
The connection to database servers is made through an ODBC Driver manager.
VisEx has been implemented in Visual C++ and tested on many datasets coming
from flat files and from relational databases through ODBC. ODBC enables pro-
gramming applications to access a variety of databases depending on the availability
of an ODBC Driver for each DBMS.
3.4.1 Connection and Transformation in VisEx
The connection and transformation component in VisEx supports the communica-
tion with datasets. VisEx accesses flat files (i.e., text files), where each flat file
consists of rows and columns with delimiters, such as tabs or commas, be-
tween the columns. A relational database is another commonly used form of data storage.
To increase capability and applicability of the visualization tool in representing
Figure 22: VisEx System architecture
data from a variety of data sources such as MS SQL server, Oracle DB, and MS
Access, VisEx supports visualizing data from these large data sources via ODBC.
Relational databases in Microsoft Access have been used for the experiments.
VisEx communicates with a database in three major steps, namely, connecting to an
ODBC data source, executing SQL statements, and retrieving data. The execution
of SQL statements allows records from the database to be retrieved, updated, and
created. To communicate with the data source (or DBMS), an application needs to
link to the ODBC Driver Manager implemented in ODBC32.dll on the Microsoft
Windows platform. The ODBC Driver Manager then passes ODBC function calls
from the application to the appropriate ODBC drivers to set up the communication.
ODBC drivers process all ODBC function calls, such as calls for connecting to the
specified data source.
There are seven major steps in generating ODBC function calls for setting up commu-
nication between an application and a DBMS, as follows.
1. Adding data source
2. Allocating handles for application
3. Connecting application to data source
4. Retrieving data source and connection information
5. Executing SQL
6. Retrieving results
7. Disconnecting from data source and freeing all allocated handles
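The connect, execute, retrieve, and disconnect portion of this sequence can be sketched in a few lines. The sketch below is illustrative only and is not the VisEx implementation: it uses Python's built-in sqlite3 module with an in-memory database as a stand-in for an ODBC data source, so no ODBC Driver Manager is involved. The table name houses, its columns, and the helper fetch_attribute are hypothetical.

```python
import sqlite3

def fetch_attribute(connection, table, attribute):
    """Run one SELECT and return the attribute values in ascending order."""
    # Identifiers cannot be bound as SQL parameters, so they are validated here.
    if not (table.isidentifier() and attribute.isidentifier()):
        raise ValueError("unexpected table or attribute name")
    cursor = connection.cursor()                        # statement-level handle
    cursor.execute(f"SELECT {attribute} FROM {table} ORDER BY {attribute}")
    rows = cursor.fetchall()                            # retrieving results
    cursor.close()                                      # freeing the handle
    return [row[0] for row in rows]

# Stand-in data source: an in-memory SQLite database (an assumption, not ODBC).
conn = sqlite3.connect(":memory:")                      # connecting
conn.execute("CREATE TABLE houses (medv REAL, crim REAL)")
conn.executemany("INSERT INTO houses VALUES (?, ?)",
                 [(24.0, 0.006), (50.0, 0.015), (13.5, 8.98)])
medv_values = fetch_attribute(conn, "houses", "medv")   # executing SQL
conn.close()                                            # disconnecting
print(medv_values)
```

Sorting in the query itself means the values arrive already ordered for placement into a barstick, which is one possible design choice rather than a documented property of VisEx.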
3.4.2 Visualizing multiple attributes in VisEx
Before presenting the details of the querying component in the next section, this
subsection introduces how the system visualizes the queried results and provides
visual feedback. VisEx uses barsticks to represent the attributes in a dataset, one
barstick for each attribute. The barsticks are created dynamically, according to
the number of selected attributes. The minimum and maximum scales and the
name of the attributes are shown with the barstick after users select that queried
attribute. All data records of the selected attribute are arranged in ascending order
and partitioned into bars based on the number of data records in the dataset, as
shown in Figure 21.
A bar within a barstick represents a group of records which have closely related
values for the attribute represented by the barstick. Each bar represents the same
number of data records except the last bar which might contain fewer data records.
The number of data records in the last bar is the remaining number of data records
after dividing all data records equally among the other bars. The number of data
records represented by each bar is derived from dividing the total number of data
records by the length of a barstick, i.e., the total number of bars a barstick can
accommodate. In the implementation, each barstick accommodates 500 bars. The
size of bars depends on the number of data records in each dataset. Suppose there are
two datasets. The first dataset contains 500 data records and the other has 250 data
records. The bars in the first dataset, 1 pixel wide, are half the width of the
bars in the second dataset, 2 pixels wide.
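The partitioning rule described above can be illustrated with a minimal sketch. It assumes that the number of records per bar is obtained by ceiling division, and that the bars share the barstick length evenly in whole pixels. The function name partition_into_bars and the parameter barstick_pixels are hypothetical and do not come from the VisEx implementation.

```python
def partition_into_bars(values, max_bars=500, barstick_pixels=500):
    """Sort attribute values and split them into equally sized bars.

    Returns (bars, bar_width_pixels). Every bar holds the same number of
    records except possibly the last one, which holds the remainder.
    """
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        return [], 0
    # Ceiling division: records per bar when n records share max_bars bars.
    per_bar = -(-n // max_bars)
    bars = [ordered[i:i + per_bar] for i in range(0, n, per_bar)]
    # Assumption: bars share the barstick length evenly (whole pixels).
    bar_width = barstick_pixels // len(bars)
    return bars, bar_width

bars_a, width_a = partition_into_bars(range(500))   # 500 records
bars_b, width_b = partition_into_bars(range(250))   # 250 records
bars_c, width_c = partition_into_bars(range(1001))  # last bar holds the remainder
print(len(bars_a), width_a, len(bars_b), width_b, len(bars_c[-1]))
```

With 500 records each bar is one pixel wide, and with 250 records each bar is two pixels wide, which matches the two-dataset example in the text.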
Data attributes in datasets, including databases, have different data characteristics,
which can be categorized as categorical and quantitative. VisEx converts categorical
data to a numerical form and treats them as quantitative data. For example, if
census data records two genders, VisEx represents males by 0 and females by 1.
Analysts can gain insight into the
distribution of values of records for an attribute by observing the density and color
of the bars in the corresponding barstick.
Figure 21 shows how bars are organized into each barstick. The figure also illustrates
three barsticks for three selected attributes. The attributes are selected by the user
in this order.
Suppose there are N records in the dataset and a barstick can accommodate M
bars. Then typically each bar represents N/M records if all the records are displayed.
The bars are sorted according to the range of attribute values in the barstick. For
example, if the attribute is the age of people, the barstick may have a minimum
value of 0 and maximum value of 100. If each barstick can accommodate 100 bars,
each bar will represent the number of people with the same age. For example, the
50-th bar will represent all people of age 50 in the dataset. Also, each bar will be
represented by the highest color from the range of colors, which is red.
However, in most analysis scenarios in VisEx I am not interested in displaying all
the records. Instead, only those records that satisfy the constraints specified by
the user are displayed. To continue with the example, suppose there are 10,000
people with age 50 in the dataset. The user has constrained some other attribute
in the dataset so that only 800 of these 10,000 records in the 50-th bar need to be
displayed. In that case, an appropriate color is chosen for displaying the 50-th bar.
Hence, the color of a bar gives an intuitive meaning to the number of records that
is represented by that bar.
Color Selection
A color scale has been used to distinguish between different ranges of values in
both categorical and quantitative data and to represent the distribution of the
data [20]. The color scale should satisfy these requirements: order, uniformity and
representative distances, and no artificial boundaries [44, 45].
I decided to use colors along the full spectrum (blue to red) for coloring each bar.
There are two reasons behind this decision. First, the main aim of VisEx is to help
an analyst to discover correlations in large multidimensional data sets. Hence, I
am interested in displaying only quantitative estimates and not an exact number
of records. The second reason is that users can conveniently recognize and
distinguish the colors. Each bar in VisEx associates a quantitative estimate of
the number of records with each of these colors. However, it is easy to change the
color scheme used in VisEx.
The following example gives a perspective of the quantitative estimate provided
by VisEx. Assume that a dataset contains one million records and 500 bars per
barstick are used. Hence, each bar represents 2,000 records if all the records in the
dataset are displayed. In this case, red is used for coloring each bar. Now consider
a scenario where only 864 records need to be displayed in a bar when these records
satisfy the constraints imposed by the user. Suppose a color scheme has ten levels
along the full spectrum. The i-th level of the color scheme, 1 ≤ i ≤ 10, is used to
represent a number of records between 200 × (i − 1) and 200 × i. Hence, the 5-th
level of the color scheme is chosen to color a bar representing 864 records, since 864 lies between 800 and 1,000.
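The level computation in this example can be written as a small function. This is a minimal sketch under two assumptions that the text does not state: a count that falls exactly on a boundary (for example 800) is assigned to the lower level, and a bar with no records to display receives level 0. The function name color_level is hypothetical.

```python
def color_level(displayed, records_per_bar, levels=10):
    """Map the number of displayed records in a bar to a color level.

    Level 1 is the low (blue) end of the scale and `levels` is the high
    (red) end. Returns 0 when the bar has nothing to display.
    """
    if displayed <= 0:
        return 0
    # Ceiling of displayed / (records_per_bar / levels), in integer arithmetic.
    level = -(-displayed * levels // records_per_bar)
    return min(level, levels)

print(color_level(864, 2000))   # bar showing 864 of its 2,000 records
print(color_level(2000, 2000))  # bar showing all of its records
```

For the example above, 864 displayed records out of 2,000 yield level 5, and a fully displayed bar yields level 10, the red end of the scale.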
The minimum and maximum boundaries of the selected range are shown by two
black lines. The minimum and maximum range values are written in blue and red
respectively below the corresponding barstick. The minimum and maximum range
could be the same for categorical attributes.
The user interface of VisEx is shown in Figure 23. The left panel allows the user to
load a database for exploration and also choose the attributes one by one. The user
can switch between the three modes of exploration (normal, comparison and fixed)
at any time during the exploration. It is possible to fix an attribute by checking a box
during fixed mode exploration and the user can specify a range by typing the lower
and upper limits for the range. Also, the user can choose different quantitative
values of an attribute for comparison in the comparison mode. The right panel is
used for displaying the barsticks corresponding to the chosen attributes. Both the
panels can be scrolled up and down to choose and display any number of attributes.
3.4.3 Querying in VisEx
The querying subsystem in VisEx arranges subsets of data records from which
analysts select attributes and specify their ranges. The querying process can be
divided into three phases:
• Ordering: VisEx sorts attribute values of selected attributes in datasets
according to the specified queries.
• Grouping: VisEx places similar attribute values close to each other into the
same group.
• Filtering: The specified range selection is used to filter unrelated data records
or to hide irrelevant data records, i.e., the records that do not satisfy the
constraints of the query, from the screen.
A dataset for Boston house prices reported by Harrison and Rubinfeld [23] is used
as an example in visualizing and demonstrating both querying and user interaction
Figure 23: The user interface of VisEx is explained through an example. The left panel allows the user to choose the attributes in any order. The user can select an attribute from a drop-down list of attributes. In this example, three selected attributes: median value of owner occupied homes, per capita crime rate by town, and residential zone proportion, are queried in this order. This example shows VisEx screen application with high value range of median price (30-50), low percentage of per capita crime (.01-1), and higher (20-100) percentage of residential zone selection. The minimum and maximum ranges are shown in blue and red respectively below a barstick. The result shows that the per capita crime is low for areas where median value of owner occupied homes is high (second barstick). Similarly, residential zone proportion is high (indicating wealthy suburbs) when median value is high and crime rate is low (third barstick). Finally, if the residential zone proportion is selected at the higher end in the third barstick, the non retail business proportion i.e., number of industrial sites is low in the fourth barstick.
(as presented in the next section) in the VisEx system. This dataset is available
from the StatLib-Datasets Archive of Carnegie Mellon University [71]. The
dataset contains approximately 15 attributes (e.g., median value of owner-
occupied homes, per capita crime rate by town, proportion of residential land zoned
for lots over 25,000 sq. ft., etc.) and 500 house records. Many of the attributes in this
dataset are categorical. For example, the attribute ‘median value of owner occupied
homes’ uses categories between 5 and 50, where each of these categories actually
represents a range of prices.
VisEx has three modes in which a user can generate queries and explore a multidi-
mensional dataset. I call these three modes normal, fixed and comparison modes.
The user can switch between these three modes depending on the requirements of
the exploration.
Normal mode exploration
The querying process starts when the user selects one of the attributes as the first
attribute. The user also selects a range for this first attribute to be displayed in
the first barstick. I call the first attribute att(1) and its range range(1). Next, the
user chooses the second attribute (att(2)). VisEx displays only those records in the
barstick of att(2) with their att(1) values within range(1). The user now selects a
range of values for the second attribute att(2) from among the records displayed
in the barstick for att(2). This process continues for the subsequent attributes. In
general, the barstick for the N-th attribute att(N) displays the records that have
their att(i) value within range(i), for 1 ≤ i ≤ N − 1.
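The cascading behaviour of normal mode can be illustrated with a minimal Python sketch. This is not the VisEx implementation: the function name, the record layout, and the toy values are hypothetical, and the sketch assumes that the first barstick shows every record and that ranges are inclusive at both ends.

```python
from typing import Dict, List, Sequence, Tuple

Record = Dict[str, float]
Query = Tuple[str, float, float]  # (attribute name, low, high), bounds assumed inclusive


def normal_mode_views(records: Sequence[Record],
                      queries: Sequence[Query]) -> List[List[Record]]:
    """Return the records shown in each barstick, in query order.

    Barstick N receives the records whose att(i) lies within range(i)
    for every earlier query i < N. The first barstick is assumed to
    show all records.
    """
    views: List[List[Record]] = []
    visible = list(records)
    for attribute, low, high in queries:
        views.append(visible)
        # Narrow the visible set for the next barstick using this range.
        visible = [r for r in visible if low <= r[attribute] <= high]
    return views


# Hypothetical toy records; attribute names follow the housing example.
houses = [
    {"MEDV": 35.0, "CRIM": 0.05, "ZN": 80.0},
    {"MEDV": 42.0, "CRIM": 3.20, "ZN": 0.0},
    {"MEDV": 12.0, "CRIM": 8.90, "ZN": 0.0},
    {"MEDV": 31.0, "CRIM": 0.40, "ZN": 25.0},
]
queries = [("MEDV", 30, 50), ("CRIM", 0.01, 1), ("ZN", 20, 100)]
views = normal_mode_views(houses, queries)
print([len(v) for v in views])  # prints [4, 3, 2]
```

Each barstick sees a subset of the records seen by the barstick before it, which is why the order of attribute selection matters in this mode.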
Fixed mode exploration
The fixed mode exploration is suitable when the user has in mind some hard con-
straints on the ranges of some of the attributes. In other words, the user is sure
about the ranges of two or more attributes (called chosen attributes) and she wants
to experiment with the other attributes after imposing the ranges on the chosen
attributes. An attribute can be fixed any time during the exploration. Suppose the
user chooses to fix the i-th, j-th and k-th attributes and chooses range(i), range(j)
and range(k) for these three attributes. Only the records that satisfy all these three
ranges will be displayed in the three barsticks for the three fixed attributes i, j and
k. The user can subsequently choose other attributes that are not fixed (called float-
ing attributes) and the corresponding ranges for these floating attributes. Only the
records that satisfy all the ranges for all the fixed attributes are displayed in the
barstick of a floating attribute. There is no limit on the number of fixed attributes.
Note that the normal mode exploration can be viewed as a fixed mode exploration
when only the very first attribute is fixed.
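A minimal Python sketch of the fixed mode filter follows. It is an illustration under stated assumptions rather than the VisEx code: the function name, the RM (number of rooms) attribute name, and the sample values are hypothetical, and ranges are assumed inclusive.

```python
from typing import Dict, List, Sequence, Tuple

Record = Dict[str, float]
Range = Tuple[float, float]  # (low, high), bounds assumed inclusive


def fixed_mode_records(records: Sequence[Record],
                       fixed: Dict[str, Range]) -> List[Record]:
    """Keep only the records that satisfy every fixed range."""
    return [
        r for r in records
        if all(low <= r[name] <= high for name, (low, high) in fixed.items())
    ]


# Hypothetical toy records; RM stands for the number of rooms.
houses = [
    {"MEDV": 35.0, "CRIM": 0.05, "ZN": 80.0, "RM": 7.1},
    {"MEDV": 42.0, "CRIM": 3.20, "ZN": 0.0, "RM": 6.8},
    {"MEDV": 12.0, "CRIM": 8.90, "ZN": 0.0, "RM": 5.2},
    {"MEDV": 31.0, "CRIM": 0.40, "ZN": 25.0, "RM": 6.9},
]
fixed = {"MEDV": (30, 50), "CRIM": (0.01, 1), "ZN": (20, 100)}
shown = fixed_mode_records(houses, fixed)

# Fixing the same ranges in a different order selects the same records.
reordered = dict(reversed(list(fixed.items())))
same = fixed_mode_records(houses, reordered)

# Values a floating attribute (here RM) would display.
floating_rm = [r["RM"] for r in shown]
print(floating_rm)  # prints [7.1, 6.9]
```

Because the fixed ranges are combined as a single conjunction, the selected records do not depend on the order in which the attributes were fixed.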
Figure 24: An example of fixed mode exploration. The first three attributes in this case are fixed and hence only the records that satisfy all the three ranges are displayed in the first three barsticks. The next two attributes are floating, i.e., the user can experiment with different ranges for these two attributes. This is an analysis of the dataset in [23]. If the median value of homes is high, per capita crime is low and residential zone proportion is high (the first three fixed attributes), the houses tend to be new with a larger number of rooms (the last two attributes).
The fixed mode operation is useful in situations when the user has made up her
mind about the ranges of some of the attributes and wants to experiment with
ranges of other floating attributes. The user can be sure that any records displayed
for any of the floating attributes already satisfy the restrictions of the fixed at-
tributes. Moreover, the selected records due to the fixed attributes do not have
any dependency on the order of selection unlike in the normal mode exploration.
Figures 24 and 25 allow a comparison between fixed mode and normal mode
exploration. All selected attributes in Figure 24 are the same as in Figure 25
except that the first three attributes are fixed in Figure 24. The first three fixed
attributes display only the records satisfying all the three specified ranges.
Figure 25: An example with five queried attributes: median value of owner occupied homes, per capita crime, residential zone proportion, age of house, and number of rooms. The red vertical bars in each barstick represent the selection by the user. The result shows that in areas where the median price is high and per capita crime is low, higher residential zone proportions and new houses with a higher number of rooms are found.
Comparison mode exploration
The comparison mode is useful for comparing two or more categorical attributes in
detail. Recall that all categorical attributes in VisEx are treated as quantitative
attributes by assigning serial numbers. For example, gender is treated as a
quantitative attribute with two values 0 (male) or 1 (female). Similarly, if the dataset
has ‘town’ as an attribute and names of towns as the values for that attribute, each
town is given an integer label to convert the attribute to a quantitative attribute.
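The conversion of a categorical attribute to integer labels can be sketched as follows. The text does not say how serial numbers are allocated, so the sketch assumes they are assigned in order of first appearance; the function name is hypothetical.

```python
from typing import Dict, Hashable, List, Sequence, Tuple


def encode_categories(values: Sequence[Hashable]) -> Tuple[Dict[Hashable, int], List[int]]:
    """Assign serial numbers 0, 1, 2, ... to categories.

    Assumption: labels are handed out in order of first appearance.
    Returns the label table and the encoded column.
    """
    labels: Dict[Hashable, int] = {}
    for value in values:
        if value not in labels:
            labels[value] = len(labels)
    return labels, [labels[value] for value in values]


gender_labels, gender_codes = encode_categories(["male", "female", "female", "male"])
print(gender_labels)  # prints {'male': 0, 'female': 1}
print(gender_codes)   # prints [0, 1, 1, 0]
```

The label table must be kept so that the integer shown in a barstick can be mapped back to the original category name.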
Once the user switches over from normal mode to comparison mode, she can choose
several values of such a converted attribute from a scrolling menu. If the user
chooses i values for comparison, each subsequent barstick for the other selected
attributes is split into i barsticks. As an example, consider Figure 26, where the
second barstick represents ‘Town’, a categorical attribute. Suppose the user wants
to compare other attributes such as ‘Tax Rate’ and ‘Pupil teacher ratio’ for
different towns. The user selects three towns, 28, 75 and 76, for comparison. Each
subsequent barstick for the other attributes is split into three barsticks, one for
each of these towns. Figure 26 shows that, for these three towns, the correlations
of the most expensive houses with residential land and industrial zones are
opposite to the correlations seen in Figure 23 and Figure 29. Towns 75 and 76 tend
to have high pupil teacher ratios whereas town 28 has a lower pupil teacher ratio.
Used along with the exploration and selection techniques, this technique also helps
users to extract hidden correlations and distinctive characteristics, or outliers,
of attribute values.
Figure 26: Display of the relationship of six queried attributes: median value of owner occupied homes, town, residential zone proportion, non retail business proportion, tax rate, and pupil-teacher ratio; with three categorical attribute values: town numbers 28, 75, and 76.
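The splitting of a barstick in comparison mode amounts to grouping records by the chosen categorical values. The sketch below is a hedged illustration only: the function name, the PTRATIO attribute name, and all sample values are hypothetical and are not taken from the dataset.

```python
from typing import Dict, List, Sequence

Record = Dict[str, float]


def split_for_comparison(records: Sequence[Record],
                         attribute: str,
                         chosen_values: Sequence[float]) -> Dict[float, List[Record]]:
    """Group records by the chosen values of one categorical attribute.

    Each group corresponds to one of the i sub-barsticks that a
    subsequent barstick is split into; records with other values
    are left out of the comparison.
    """
    groups: Dict[float, List[Record]] = {value: [] for value in chosen_values}
    for record in records:
        if record[attribute] in groups:
            groups[record[attribute]].append(record)
    return groups


# Hypothetical records: town label and pupil-teacher ratio.
records = [
    {"TOWN": 28, "PTRATIO": 14.7},
    {"TOWN": 75, "PTRATIO": 20.2},
    {"TOWN": 76, "PTRATIO": 20.1},
    {"TOWN": 75, "PTRATIO": 21.0},
    {"TOWN": 3, "PTRATIO": 17.8},
]
groups = split_for_comparison(records, "TOWN", [28, 75, 76])
sizes = {town: len(members) for town, members in groups.items()}
print(sizes)  # prints {28: 1, 75: 2, 76: 1}
```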
3.4.4 User interaction
To effectively handle user interaction, an interactive tool must deal with many
human factors [20]. VisEx supports some principles of interactive design such as
consistency, providing feedback, and ease of use or simplicity without extensive
training. The interaction features of VisEx can be categorized into two main groups,
namely Exploration and Selection techniques. The interaction techniques provide
the possibility of visual feedback when users generate queries and interact with the
system.
Exploration techniques
An attribute list allows users to explore all attributes in the dataset. A barstick
is displayed for each selected attribute and all values of the selected attribute are
sorted and shown in each list box. Users can specify and adjust ranges of attribute
values. After selection of individual attributes, subsets of attribute values falling
into the range of the previously selected attribute are colored.
In Figure 23, there are four selected attributes: median value of owner-occupied
homes (MEDV), per capita crime rate by town (CRIM), proportion of residential
land zoned for lots over 25,000 sq. ft. (ZN), and non retail business proportion
(INDUS). First, the user selects the attribute MEDV from the first attribute list
and specifies the range between 30 and 50 to examine how the most expensive
houses correlate with the subsequently selected attributes. The first two barsticks
show that expensive houses tend to be in areas with low crime rates. The user then
specifies the lowest range of per capita crime rate (0.01-1) and a higher range
(20-100) of residential zone proportion. The correlations among the four selected
attributes can be summarized from the visualization as follows: the more
expensive houses tend to be in areas that have not only low crime rates but also a
higher proportion of residential zones and a low proportion of industrial zones.
The opposite is true for the cheaper houses, which tend to be in areas with high
crime rates, a low proportion of residential zones, and a high proportion of industrial
zones, as shown in Figure 27. In other words, the MEDV and ZN attributes vary in
the opposite direction to the CRIM and INDUS attributes. Users can select further
attributes that might be affected by the first three selected attributes. Figure 25
shows that most of the expensive houses in areas with low crime rates on larger
blocks are quite new and have more rooms.
Figure 27: The selections in this example are opposite to those shown in Figure 23. If median value is selected as low (first barstick), per capita crime is selected as high (second barstick), and residential zone proportion is selected in the medium range (third barstick), the number of industrial sites is high in those localities.
In addition, VisEx allows users to reselect the attributes to be displayed by a bar-
stick to examine their hypotheses dynamically. The system maintains and updates
all remaining attributes and attribute values according to the last change in an
attribute selected by the user.
Changing the queried range of a previously selected attribute affects how the bars
in the following barsticks are colored. For example, if the range of the first selected
attribute is changed, the colored areas of the subsequent barsticks, such as the
second barstick, will change. Likewise, the range queried in the second barstick
affects the colored areas of the third barstick. Figure 28 shows an example of
reselected attributes, in which the new first selected attribute is the proportion
of industrial land, or ‘non retail business proportion’. The result shows that the
crime rate is high in industrial areas. The subsequent reselection of other attributes
reveals that industrial areas also have a high concentration of nitric oxide
pollutant and a higher property-tax rate.
Figure 28: An example with four queried attributes: non retail business proportion, per capita crime, nitric oxides (NOX) concentration (parts per 10 million), and tax rate. The result shows that per capita crime, nitric oxide concentration and tax rate are higher in industrial areas (areas with higher non retail business proportion).
Selection techniques
Selection techniques are designed to support viewing the details and distribution
of data records in selected attributes. Users can view how the records in any
particular area of a selected barstick are distributed in all the other barsticks by
clicking on that area. For example, suppose the user clicks on the last rectangle of
the third barstick in Figure 29. All areas corresponding to the selected area in the
other barsticks are highlighted in gray. The areas with the highest proportion of
residential land tend to have the lowest crime rates, less industrial land and low
pupil to teacher ratios. This highlighting tool helps users to understand the
characteristics and distributions of the selected data records in other attributes.
Selection also supports on-demand details. When users right-click on any colored
area of a barstick, details of the specified range (e.g., the number of queried data
records and the minimum and maximum values of the selected area) are shown in a
pop-up window. For example, a right click on the last group of bars in the third
barstick pops up the details that nine houses fall in this group, with a minimum
residential zone proportion of 90% and a maximum of 100%.
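The on-demand details described above (record count, minimum, and maximum of a selected area) can be computed with a few lines of Python. This is a sketch under assumptions, not the VisEx code: the function name and the sample ZN values are hypothetical, and the selected area is modelled as an inclusive value range.

```python
from typing import Dict, Optional, Sequence

Record = Dict[str, float]


def range_details(records: Sequence[Record], attribute: str,
                  low: float, high: float) -> Dict[str, Optional[float]]:
    """Summarize the records whose attribute value lies in [low, high]."""
    selected = [r[attribute] for r in records if low <= r[attribute] <= high]
    if not selected:
        return {"count": 0, "min": None, "max": None}
    return {"count": len(selected), "min": min(selected), "max": max(selected)}


# Hypothetical residential zone proportions (percent).
zn_records = [{"ZN": 95.0}, {"ZN": 90.0}, {"ZN": 100.0}, {"ZN": 12.5}, {"ZN": 0.0}]
details = range_details(zn_records, "ZN", 90, 100)
print(details)  # prints {'count': 3, 'min': 90.0, 'max': 100.0}
```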
In addition, the system provides an equal-height bar chart for viewing the
distribution of data values in the selected barstick. The equal-height bar chart is
used for scalability and space efficiency. When users double-click the left mouse
button on any selected area in a barstick, the equal-height bar chart of that barstick
pops up to show the distribution and accumulation of attribute values under the
range selection of the previous queried attribute. Figure 30 displays an example of
the equal-height bar chart of the selected attribute, TOWN. There are only two
towns, 26 and 27, in the industrial zones with a low rate of crime. Town number 27
has more houses than Town number 26. In other words, the houses in industrial
zones tend to be in Town number 27 more than in Town number 26.
Figure 29: An example of selection in a barstick. The relationship between five queried attributes is shown. These are median value of owner occupied homes, per capita crime, residential zone proportion, non retail business proportion, and pupil-teacher ratio by town. The user selects the last group of bars in the third barstick (residential zone proportion) and all the affected bars in the other barsticks are highlighted in gray.
3.5 Analysis scenarios
A variety of organizations (e.g., federal government departments and businesses) use
census data to analyze, evaluate and improve their services. For example, the federal
government uses census data to measure economic circumstances by analyzing
average capital incomes together with other related factors. Businesses can use the
data as an investment guide. The transportation department uses census data to plan
highway improvements, develop public transportation services, design programs to
ease traffic problems, or reduce pollution. In my experiments, I analyzed
two scenarios, one using U.S. census data and one using the Current Population
Survey, to demonstrate the data exploration capabilities of VisEx.
Figure 30: Display of the relationship of four queried attributes: non retail business proportion, town, per capita crime, and median value of owner occupied homes, with the equal-height bar chart of town attribute.
3.5.1 Analysis 1: 1990 U.S. Census Data
A part of the 1990 United States census data from the KDD archive of the Univer-
sity of California at Irvine [26] has been used for the analysis in this section. The
census data consists of 72 attributes such as age, gender, income, education, indus-
try, occupation, and social class of workers, and has approximately 300,000 data
records. Many correlations, trends, and relationships may be discovered through
VisEx. I have experimented with some of the relationships and correlations.
Figure 31(a) shows that entertainment, recreation, and professional service businesses
tend to pay high salaries to highly educated people to work in managerial and
professional specialty occupations. In contrast, finance, insurance, real estate, and
personal service businesses tend to pay more for highly educated people to work as
technicians, salesmen, and related support services than other occupations, as shown
in Figure 31(b). Highly educated males earn higher total personal incomes
than females with the same levels of education, as shown in Figure 32.
Figure 31: An example of an analysis scenario with four selected attributes: occupation (dOccup), industry (dIndustry), total personal incomes (dRpincome), and years of schooling (iYearsch). (a) Managerial and professional specialty jobs are selected in the first barstick. The second barstick shows that most of these jobs are in professional services, and entertainment and recreation businesses. The third barstick shows that the salaries for such jobs are usually high. Finally, the fourth barstick shows that people employed in these jobs usually have more years of schooling. (b) Technicians and related support occupations and sales occupation have been selected in the first barstick. The second barstick shows that people in these occupations have fewer years of schooling compared to those in example (a). The third barstick shows that they usually earn less and are employed in finance, insurance, real estate, and personal service businesses.
Figure 32: An example analysis with three selected attributes: years of schooling (iYearsch), gender (iSex), and total personal incomes (dRpincome). 14-17 years of schooling is selected in both examples (a) and (b). (a) The second barstick shows the distributions of males and females with 14-17 years of schooling. The left block of bars (males) is selected. The third barstick shows that highly educated males earn high salaries. (b) Females (the right block of bars) with 14-17 years of schooling are selected in this example. The third barstick shows that highly educated females earn comparatively less than their male counterparts.
Figure 33: An example analysis shows the relationships of five selected attributes: Total personal incomes (dRpincome), Years of schooling (iYearsch), Occupations (dOccup), Class of worker (iClass) and Industry (dIndustry). The first two barsticks show the range of the highest total personal income and degree of education. The third and fourth barsticks show the selected group of people who work in managerial and professional roles for private companies. In the last barstick, these people tend to work in the manufacturing group and in the entertainment, recreation, and professional service groups.
People who work in managerial and professional specialty careers for private profit
companies in the manufacturing group and in the entertainment, recreation, and
professional service groups tend to earn higher total personal income than people
who have occupations in other areas, as shown in Figure 33. Most of those people
have completed bachelor or higher degrees. Figure 34 shows people in the 65
or older age group who receive high incomes and are unemployed. Most of these people
receive social security income as well as retirement, survivor, or disability pension
incomes, while the rest have only one of these sources of income.
3.5.2 Analysis 2: The 1985 Current Population Survey
The population survey from StatLib-Datasets Archive of Carnegie Mellon Univer-
sity [71] has been used. The dataset consists of 534 data items and 11 dimensions
or attributes. In Figure 35, some queries are made on Occupation, Sex, Education,
Race, and Wage attributes. The exploration of Figure 35(a) illustrates that more
males have professional jobs than females. Almost all of those males are white and
are highly educated and well paid. In contrast, in Figure 35(b) more females work
as clerks than males and have average education.
Figure 34: An example analysis shows the relationships of five selected attributes: Total personal incomes (dRpincome), Occupations (dOccup), Age (dAge), Retirement income (dIncome7), and Social security income (dIncome5). The first selected attribute and range represent higher total personal income. The second barstick shows unemployed people earning higher personal income. The highest range of age (at least 65 years old) has been selected as the third attribute. The selections of the last two barsticks show that most of these people receive retirement income.
In Figure 36(a), Education, Experience, Age, and Wage attributes are queried and
the example shows that persons who have less education, a lot of experience and
are older tend to have low wages. In Figure 36(b), young highly educated persons
tend to have little experience and low wages. There are only two persons earning
the highest wages as detected from the outliers of the wage bar. Older people with
much experience have less education while younger people with little experience
have higher education. Younger persons (age around 15-18 years) tend to have
12-15 years of education.
3.6 User study
3.6.1 Experimental methodology
To evaluate the efficiency of the system, I conducted a user study with eighteen
participants, all postgraduate students in the School of Computer Science and
Software Engineering. They were asked to perform assigned tasks and report their
findings.
Figure 35: Five selected attributes: Occupation, Sex, Education, Race, and Wage, are queried with specified ranges in each attribute. (a) The “Professional career” in the Occupation attribute is selected in the first barstick. The second barstick shows the distribution of males and females. The third barstick visualizes the distribution of education of males and the specified range of education is 15-18. “Race” and “Wage” are selected as the fourth and last attributes respectively. (b) In the first barstick, “clerk” is selected as the attribute of interest. The second barstick shows the distribution of males and females who are clerks. The third barstick shows the distribution of females with the specified education range of 15-18. “Race” and “Wage” are selected as the fourth and last attributes respectively.
Figure 36: An example analysis with four selected attributes: “Education”, “Experience”, “Age”, and “Wage”. (a) The first barstick shows 2-10 years of the specified education ranges. The second barstick shows the specified ranges of experience from 35 to 55 years. The third and fourth barsticks display the selected age 50-64 years and wage between $1-6/hour, respectively. (b) The first barstick shows 13-18 years of education. The second barstick shows experience from 0 to 5 years. The third and fourth barsticks display the selected age between 18-25 years and wage between $1-6/hour, respectively.
The experiment was divided into two sessions. The first session was a tutorial
session. At first a brief introduction of visual representations in the system was
provided to the participants. Then the participants were asked to explore a car
dataset [71] and answer five example tasks (as shown in Appendix A) by using
the system so that they could learn how to use the tool including all features for
exploring the dataset. They could also ask any question during this session until
they were ready to continue with the next session. In the second session of the
experiment, participants were asked to complete ten tasks related to the census
dataset [71]. To evaluate all features in the tool, the tasks were set up so that the
participants could use the main features including normal mode, fixed mode, and
comparison mode explorations as well as interactive tools for exploring the dataset.
The performance of the participants was timed and marked as correct or incorrect
in order to evaluate how easy they found the tool. The ten tasks can be categorized
into three main groups including identifying a group of records, finding correlation,
and comparing groups of relevant records.
3.6.2 Results
Time and Correctness
All participants spent less than five minutes on individual tasks and spent more
time completing tasks involving more attributes. Tasks 7, 9, and 10 each consisted
of two questions with four to five attributes, so unsurprisingly these tasks were the
most time consuming, as shown in Figure 37. Task 3 was a comparison task
and Task 1 involved searching for correlation. Both of these tasks involved only two
attributes, so participants spent less time on them. Participants spent less
time on correlation tasks (Task 2 and Task 5) than on comparison tasks (Task 4 and
Task 6). The correctness of the given tasks is shown in Figure 38. All participants
correctly answered Task 3, while Task 5 and Task 7 were the least correctly answered:
about 89% of participants answered these two tasks correctly. I observed that
a few participants did not carefully read these questions and did not completely
answer all questions in the tasks. However, the correctness of all tasks was more
than 85%. I also observed that some participants tried different features of the tool
for answering the tasks.
Figure 37: The mean time for completing each task.
Questionnaire and feedback
After finishing all tasks, the participants were asked about their experience in data
analysis and visualization using VisEx. All but one of the participants had no
experience in data analysis and none of them had experience in using visualization
tools. The questionnaire was categorized into four major groups, namely usability,
visualization, interaction, and information, and the corresponding feedback is
presented in Figure 39.
Figure 38: The correctness of each task.
In usability, 27.78% and 44.44% of participants strongly agreed and agreed,
respectively, that the visualization was easy to understand and 16.66% of participants
rated it fair. The tool was found to be easy to use (5.56% strongly agree, 61.11% agree,
and 27.78% fair); only 5.56% of participants did not agree. More than 88.88% of
participants found that the tool was easy to learn. 66.67% of participants agreed
that the tasks were easy to complete with the tool and 22.22% of participants
provided a fair rating.
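The percentages above are consistent with whole numbers of participants out of eighteen. The short Python check below makes that arithmetic explicit; the mapping from percentages to head counts is an inference from the study size and is not stated in the questionnaire results themselves.

```python
PARTICIPANTS = 18  # study size reported in the experimental methodology


def share(count: int, total: int = PARTICIPANTS) -> float:
    """Percentage of participants, rounded to two decimal places."""
    return round(100 * count / total, 2)


# Inferred head counts behind the reported usability percentages.
inferred = {count: share(count) for count in (1, 3, 4, 5, 8, 11, 12, 16)}
for count, percent in inferred.items():
    print(f"{count:2d} of {PARTICIPANTS}: {percent}%")
```

Under this reading, 5 and 8 participants give 27.78% and 44.44%, and 3 participants give 16.67%, which the text reports truncated as 16.66%.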
In the visualization category, more than 90% of participants could identify the
difference between normal mode and fixed mode exploration. Participants
preferred the normal mode exploration. They reported that they could visualize more
information in the normal mode exploration than in the fixed mode exploration.
However, a few participants said that they liked both of these modes and used
them depending on the goals of the exploration. Participants did not give any
negative feedback on using and understanding barsticks, identifying specific groups
of records, comparing groups of records and the clarity of visual representation
in VisEx. Participants also provided a lot of positive feedback on the ability to
identify correlations among specified attributes. 27.78% and 50% of participants
strongly agreed and agreed, respectively, that they could easily understand the
displayed information.
In the interaction category, all participants provided positive feedback on their
ability to use and change exploration modes. More than 78% of participants
agreed or strongly agreed that it was easy to change the selection of parameters
as well as to correct mistakes. Participants reported that the search for data of
interest was easily directed (11.11% strongly agree, 55.56% agree, and 27.78% fair).
Moreover, most participants provided positive feedback for more than 70% of all
features in this category. In addition, participants also provided further comments.
A few participants commented that they would have more confidence if they could
spend more time using the system. Finally, most participants found the tool quite
useful.
3.7 Summary
Visualization is an important tool for understanding large and complex datasets.
It has been employed in many fields to help users gain insight into their data.
However, most of the visualization techniques encounter the problems of occlusion
and scalability. Most systems also require some prior training for the users. Hence,
it is an interesting challenge to design a visual exploration system that can provide
clear and understandable visualization as well as simple and flexible user interaction.
VisEx provides a new framework for visualizing correlations among attributes in
large multidimensional datasets. The display technique in VisEx avoids occlusion
through the quantitative estimates of the data. It is possible to compare only a few
attributes (usually two) in most previous visualization techniques for multidimen-
sional datasets. However, a user can discover correlations among many attributes
at a time in VisEx through its coupled display system. VisEx also provides analysts
with facilities for selecting ranges within attribute values and all the records affected
by these selections are highlighted in all the other barsticks for the other attributes.
This helps analysts to conduct what-if type experiments in discovering correlations
among attributes. VisEx is completely scalable for small to large datasets since
the aim is to display quantitative estimates rather than the actual records. Hence,
VisEx maintains a similar screen appearance without occlusion of graphic primitives
for datasets of all sizes.
In the next chapter I introduce an integration of a visualization technique similar to
VisEx into the data mining process to enhance effective human intervention in data
mining. A framework for visual data mining is presented for discovering interesting
association rules. Moreover, I propose a new visualization technique for displaying
mined association rules.
Figure 39: The results from questionnaires in different categories: (a) Usability (b) Visualization (c) Interaction (d) Information
Chapter 4
Visualization for Association Rule Mining
4.1 Introduction
Data mining algorithms in general have different purposes, e.g., gaining insight into
data, predicting trends and discovering hidden associations in large datasets. Dif-
ferent data mining methods such as mining association rules, cluster analysis, and
classification have different goals according to the kind of knowledge to be mined.
In this thesis, I focus only on visual mining of and visualization for association rules.
An example of the use of association rules is to help store managers study purchas-
ing behaviors of their customers and promote sale of specific items to their loyal
customers. The size of databases like transaction records in supermarkets, telecom-
munication companies, e-marketing and credit card companies has been growing
rapidly and it is difficult to extract meaningful information from such databases.
Analysts need a tool to transform large amounts of data into interpretable knowl-
edge and information, and to help make decisions, predict trends, and discover
relationships and patterns. Association rule mining is one of the most important
data mining processes. It is a powerful tool that helps analysts to understand and
discover the relationships in their data. Market basket analysis is an example
of mining association rules that helps marketing analysts to analyze customer
characteristics to improve their marketing strategies. To increase the number of
sales, one such marketing strategy could be to place associated items in the
same area of the floor so that customers can easily access the items and place them
in their market baskets. For example, if bread and cheese are frequently purchased
together, placing such items in close proximity may increase sales because customers
who buy bread may also buy cheese when they see cheese on a nearby shelf.
Furthermore, promotion of items frequently purchased together in the store catalogs
may increase the sale of those items.
Mining association rules is a well researched area within data mining [21]. There are
many algorithms for generating frequent itemsets and mining association rules [1,
60, 69]. Such algorithms can mine association rules which have confidence and
support higher than a user-supplied level. However, one of the drawbacks of these
algorithms is that they mine all rules exhaustively and many of these rules are
not interesting in a practical sense. Too many association rules are difficult to
analyze and it is often difficult for an analyst to extract a meaningful (small) set of
association rules. Hence there is a need for human intervention during the mining of
association rules [2, 77] so that an analyst can directly influence the mining process
and extract only a small set of interesting association rules.
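To make the support and confidence thresholds concrete, the following Python sketch computes both measures for a rule A ⇒ B using their standard definitions: support is the fraction of transactions containing every item of A and B, and confidence is that support divided by the support of A. The function names and the toy transactions are hypothetical and are not taken from the cited algorithms.

```python
from typing import FrozenSet, Iterable, Sequence

Transaction = FrozenSet[str]


def support(transactions: Sequence[Transaction], itemset: Iterable[str]) -> float:
    """Fraction of transactions that contain every item in itemset."""
    items = frozenset(itemset)
    return sum(items <= t for t in transactions) / len(transactions)


def confidence(transactions: Sequence[Transaction],
               antecedent: Iterable[str], consequent: Iterable[str]) -> float:
    """support(A and B together) / support(A); 0.0 if A never occurs."""
    a = frozenset(antecedent)
    base = support(transactions, a)
    if base == 0:
        return 0.0
    return support(transactions, a | frozenset(consequent)) / base


# Hypothetical market baskets.
baskets = [
    frozenset({"bread", "cheese"}),
    frozenset({"bread", "cheese", "milk"}),
    frozenset({"bread"}),
    frozenset({"milk"}),
]
rule_support = support(baskets, {"bread", "cheese"})
rule_confidence = confidence(baskets, {"bread"}, {"cheese"})
print(rule_support)               # prints 0.5
print(round(rule_confidence, 3))  # prints 0.667
```

A rule is reported by an exhaustive miner only if both values reach the user-supplied minimum levels, which is why low thresholds on a large database can yield an unmanageable number of rules.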
However, it is quite often impossible for a human expert to understand large multi-
dimensional datasets through manual examination. Visual data mining helps users
to extract interesting patterns hidden in their data and learn more about the data
through visualization. It is also important for the analyst to participate in the min-
ing process in order to identify meaningful association rules from a large database
through her guidance and knowledge. Any such participation should be easy from
an analyst’s point of view. Hence, visual association rule mining seems to be a
natural way of directing the mining process.
The visualization technique presented in VisEx has been modified for helping an
analyst to mine association rules. This modification introduces a new tight coupling
technique, called VisDM, which enables users to apply their domain knowledge to
enhance decision making in ways that automatic processes alone cannot. The
algorithms and all user interfaces are implemented in Visual C++ and tested on
both synthetic and real-world datasets.
This chapter is organized as follows. Section 4.2 introduces the terminology of
association rules. Section 4.3 introduces the new tight coupling technique for
visual mining of association rules. The data structure used in the implementation
of VisDM is discussed in Section 4.4. An example of visual mining of market basket
association rules and an evaluation through a user study are discussed in
Section 4.5. Section 4.6 presents a new technique for visualizing mined association
rules. The chapter is summarized in Section 4.7.
4.2 Terminology
An association rule [2] is formally described as a rule of type A ⇒ B where A is an
item set called antecedent, body, or left-hand side (LHS) and B is an item set called
consequent, head, or right-hand side (RHS). Each item set consists of items from a
transactional database. Items existing in the antecedent are not in the consequent.
In other words, an association rule is of the form
$$A \Rightarrow B$$
where $A, B \subset I$ and $A \cap B = \emptyset$. Here
$I = \{i_1, i_2, \ldots, i_n\}$ is the set of items in the transaction database, where $i_j$, $1 \le j \le n$, is
an item in the database that may appear in a transaction. Two common measures
for evaluating the importance of an association rule are support and confidence. The
support of a rule is the percentage of transactions in which all items in the rule
appear together. The confidence of the rule is the ratio of the number of
transactions containing the items of both the antecedent and the consequent (A and B)
to the number of transactions containing the items of the antecedent (A). In
probabilistic terms, where $P(A \cup B)$ denotes the probability that a transaction
contains every item in both A and B,
$$\mathrm{Support}(A \Rightarrow B) = P(A \cup B)$$
$$\mathrm{Confidence}(A \Rightarrow B) = P(B \mid A)$$
An example of an association rule is {cheese, bread} ⇒ {milk, eggs}. If this rule
has a support of 12%, it means that the four items cheese, bread, milk, eggs appear
together in 12% of all transactions. If this rule has a confidence of 52%, it means
that 52% of all customers who purchased cheese and bread also purchased milk and
eggs in the same transaction.
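The two measures can be illustrated with a minimal Python sketch. The function names and the five toy transactions below are assumptions made purely for illustration; they are not taken from the VisDM implementation, which is written in Visual C++.

```python
from typing import Iterable, List, Set


def support(transactions: List[Set[str]], itemset: Iterable[str]) -> float:
    """Fraction of transactions that contain every item in `itemset`."""
    wanted = set(itemset)
    hits = sum(1 for t in transactions if wanted <= t)
    return hits / len(transactions)


def confidence(transactions: List[Set[str]],
               antecedent: Iterable[str],
               consequent: Iterable[str]) -> float:
    """Support of (antecedent and consequent together) / support of antecedent."""
    antecedent = set(antecedent)
    both = support(transactions, antecedent | set(consequent))
    ante = support(transactions, antecedent)
    return both / ante if ante > 0 else 0.0


# Hypothetical toy database (not data from the thesis).
transactions = [
    {"cheese", "bread", "milk", "eggs"},
    {"cheese", "bread"},
    {"milk", "eggs"},
    {"cheese", "bread", "milk", "eggs"},
    {"bread"},
]

rule_support = support(transactions, {"cheese", "bread", "milk", "eggs"})
rule_confidence = confidence(transactions, {"cheese", "bread"}, {"milk", "eggs"})
print(f"support = {rule_support:.0%}, confidence = {rule_confidence:.0%}")
```

In this toy database the four items appear together in two of five transactions, and cheese and bread appear together in three, so the rule {cheese, bread} ⇒ {milk, eggs} has a support of 40% and a confidence of two thirds.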
The term frequent itemset [21], or large itemset [60], refers to an item set whose
items appear together in the database more often than a user-specified minimum
support. In other words, it is a set of items that are frequently purchased
together, as judged against the specified minimum support.
4.3 The model for interactive association rule mining
Figure 40 shows a diagram of the model. The VisDM system is divided into the
following three stages, each designed to enhance the ability of the users to
interact with the mining process.
• Identifying frequent itemsets
• Mining association rules
• Visualizing the mined association rules
In the first stage of VisDM, the user finds a suitable frequent itemset. In most
data mining algorithms, the selection of a frequent itemset is done automatically:
any item whose occurrence is above the user-specified support is chosen as a
member of the frequent itemset. Though this method is efficient for identifying all
the frequently occurring items, the subsequent association rule mining step quite
often discovers a large number of association rules involving these items. In
contrast, VisDM gives the user complete control over choosing the items that form
the frequent itemset. The details of this stage are described in Section 4.3.1.
In the second stage, the user participates in generating interesting association rules
by specifying antecedents and consequents of each rule from the frequent itemset
chosen in the first stage. The user can experiment with different combinations
of the antecedents and consequents and save a generated association rule if it is
interesting. Section 4.3.2 provides more details on how this stage works.
Finally, in the third stage, the user can visualize all the discovered rules saved
during the second stage. Further details of this stage are presented in Section 4.3.3.
VisDM splits the application window into two areas: left and right panels. The left
panel is a user control panel which allows the user to input parameters. The right
panel is a visualizing panel which displays results in response to the parameters set
in the left panel.
To effectively handle user interaction, an interactive tool must deal with many
human factors such as consistency and feedback [20]. My interactive technique takes
into account some requirements of interactive design such as consistency, providing
feedback, reducing memorization, and ease of use without extensive training. In
addition, an analyst has complete control over deciding on the antecedents and
consequents of each rule and the whole process is intuitively simple for the analyst.
Although a complete visual mining process is slow compared to an automated pro-
cess, it has the advantage of exploring only interesting association rules. As men-
tioned before, an automated process can mine many association rules that are not
meaningful practically. The visualization tool is extremely simple to use and avoids
screen clutter. This makes it an attractive option for both small and large
databases.
Figure 40: A model of the technique for mining association rules.
4.3.1 Identifying Frequent Itemsets
This part of the system helps analysts search for frequent itemsets based on a
user-specified minimum support. An analyst provides the minimum support to
keep only the items that she is interested in. After specifying the minimum support,
all items exceeding the threshold are loaded and sorted in descending order of their
support. The analyst can use the sorted list as a guide in selecting each item
in the frequent itemset. Each selected item is represented by a barstick with the
percentage of its support. After the first selection of an item, the system generates
a list of items that co-exist with the first selected item. All the items in this co-
existing item list have supports greater than the user-specified minimum support.
The co-existing item list is also generated each time a subsequent item is chosen.
The percentage shown for a co-existing item is calculated as the number of
transactions in which the first and second selected items appear together, divided
by the total number of transactions containing the first selected item. At each
step, the barsticks are displayed in a way similar to the VisEx system discussed in
Chapter 3.
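The selection loop described above can be sketched in Python as follows. This is only an illustrative reading of the text: the function names and toy transactions are hypothetical, and the sketch assumes that the minimum support is applied to the percentage computed relative to the already selected items, which the text does not state explicitly.

```python
from collections import Counter
from typing import Iterable, List, Set, Tuple


def frequent_items(transactions: List[Set[str]],
                   min_support: float) -> List[Tuple[str, float]]:
    """Items whose support exceeds min_support, in descending order of support."""
    n = len(transactions)
    counts = Counter(item for t in transactions for item in t)
    ranked = [(item, c / n) for item, c in counts.items() if c / n > min_support]
    return sorted(ranked, key=lambda pair: (-pair[1], pair[0]))


def coexisting_items(transactions: List[Set[str]],
                     selected: Iterable[str],
                     min_support: float) -> List[Tuple[str, float]]:
    """Items that appear together with all selected items.

    The percentage is the number of transactions containing the selected
    items plus the candidate, divided by the number of transactions
    containing the selected items (an assumption of this sketch).
    """
    chosen = set(selected)
    matching = [t for t in transactions if chosen <= t]
    if not matching:
        return []
    counts = Counter(item for t in matching for item in t - chosen)
    ranked = [(item, c / len(matching)) for item, c in counts.items()
              if c / len(matching) > min_support]
    return sorted(ranked, key=lambda pair: (-pair[1], pair[0]))


# Hypothetical toy database (not data from the thesis).
transactions = [
    {"milk", "bread", "cheese"},
    {"milk", "bread"},
    {"milk", "cereal"},
    {"bread", "cheese"},
    {"milk", "bread", "cheese", "cereal"},
]

first_list = frequent_items(transactions, 0.3)
after_milk = coexisting_items(transactions, ["milk"], 0.3)
after_milk_bread = coexisting_items(transactions, ["milk", "bread"], 0.3)
print(first_list)
print(after_milk)
print(after_milk_bread)
```

Each call narrows the candidate list to the items that still co-occur with everything chosen so far, which mirrors how the co-existing item list is regenerated after every selection.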
VisDM helps an analyst to find items which tend to appear together in transactions.
In addition, the system supports user interaction for inspecting the details of each
selected item. When the analyst clicks on a bar, the percentage of each item in
the co-existing item list and its support are displayed, which helps the analyst
compare the selected items and their supports and decide what to select next.
As shown in Figure 41, the display window is divided into two sub-windows. The
left panel comprises the specified minimum support, the lists of items presented
through comboboxes, and the list of co-existing items with their supports in
descending order. For example, the co-existing item list of cereal with milk, bread,
and cheese consists of biscuits (40% support), chocolate (28% support), and juice
(36% support). The right panel shows the selected items, with the number of
purchases from all transactions, as hierarchical barsticks. Milk, bread, cheese, and
cereal are selected in that order as items of interest. Each item in the set is chosen
from a drop-down list of items. The user can change any previously chosen item at
any stage by successively reselecting items from the drop-down lists, and can shrink
the frequent itemset by deleting the last item. Once the user has chosen the
frequent itemset, it can be saved for the later stages of the mining process. Only
seven items are shown in Figure 41; however, it is possible to include any number
of items in the left panel through a scrolling window.
Figure 41: The right drawing space represents each selected item as a barstick with the number of purchases from all transactions. The control tab provides a combobox for each selected item and the list of its co-existing items.
4.3.2 Selecting Interesting Association Rules
In this stage, the selected frequent itemset from the first stage is used to generate
association rules. Again, complete freedom is provided to the user for choosing the
association rules including the items in the antecedent and consequent of each rule.
The items in the antecedent and consequent of an association rule are not limited
only to one-to-one relationships. The system supports many-to-many relationship
rules as well. In Figure 42, the left panel shows the selected frequent itemset of
interest from the first stage, comprising milk, bread, cheese, and cereal. The user
can generate many-to-many relationship rules, e.g., milk and bread as the
antecedent and cheese and cereal as the consequent, or any other combination of the
items in the antecedent and consequent. In the right panel, the first colored bar
illustrates the proportion of transactions containing the antecedent items, milk and
bread. The second colored bar represents all selected items of the association rule;
in other words, it shows the proportion of the consequent items, cheese and cereal,
appearing together with the antecedent of the rule. In the left control panel, the
system shows the support of the antecedent, the support of the selected itemset, and
the confidence of the association rule.
4.3.3 Visualizing Association Rules
This part deals with visualization of the association rules mined in the second
stage. The visualization allows analysts to view and compare the association
rules generated in the first two stages. Among the selected interesting rules, the
visualization bars help analysts to identify the most significant and interesting ones.
Figure 43 represents three association rules. For example, the first rule shows the
relationship between the antecedent, milk and bread, and the consequent, cheese and
cereal. The confidence, the antecedent support, and the itemset support of this
rule are 49, 51, and 25, respectively. For the second rule, the first bar, with support
51, represents the antecedent, milk and bread, and the second bar, with support 40,
represents cheese. The confidence is 78. The antecedent support of the last rule is
30, the itemset support is 25, and the confidence is 83. The last rule has the
highest confidence, while its antecedent support is the lowest and its itemset
support is equal to that of the first rule.
Figure 42: The right drawing space represents two barsticks. The first bar shows the proportion of the antecedent of the association rule. The second bar shows the consequent based on the selected antecedent. The control tab on top of the left hand side is to input the antecedent and consequent of the rule. The bottom of the tab displays the confidence, the antecedent support, and the itemset support.
Figure 43: Illustration for deriving interesting association rules from the selection of the rules in Figure 42. The two bars and the texts represent each rule and its properties.
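The three confidence values quoted for Figure 43 are consistent with the definition in Section 4.2, where confidence is the itemset support divided by the antecedent support. A short Python check, using only the numbers given in the text and a hypothetical helper name, is:

```python
def confidence_percent(antecedent_support: float, itemset_support: float) -> int:
    """Confidence in percent (rounded) = itemset support / antecedent support."""
    return round(100 * itemset_support / antecedent_support)


# (antecedent support, itemset support, confidence stated in the text)
reported = [(51, 25, 49), (51, 40, 78), (30, 25, 83)]

computed = [confidence_percent(a, s) for a, s, _ in reported]
for (a, s, stated), value in zip(reported, computed):
    print(f"antecedent {a}, itemset {s}: computed {value}, stated {stated}")
```

Rounding to the nearest whole percent is an assumption of this sketch; the text does not say how the displayed values are rounded.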
4.4 Data Structure used in VisDM
The VisDM algorithm scans a market basket transaction database twice. The first
scan counts the support of each item in the transaction records. The second
scan generates a bitwise table that stores the item lists of the original transaction
records. Each row of the table is a bit vector that records both the items present
in a transaction and the items absent from it.
In the first stage, for identifying the frequent itemset, an item identification list
including non-existing items of each transaction is converted into a bit-vector rep-
resentation, where 1 represents an existing item and 0 represents a non-existing
item in the record. For example, suppose a market basket transaction database
consists of four items including milk, bread, cheese, and cereal in ascending order
of item identifications and a transaction contains two items: milk and cheese. A
bit-vector of this transaction is 1010. Hence, the associated items can be retrieved
by applying a bitmask operation to each transformed item list. Each bitmask is
generated by transforming all selected items to bits which are set to 1. After se-
lecting each interesting item from a list in the first stage, an associated item list
is generated to support the user’s search for the next interesting item. To reduce
search time of associated items in each transaction, the associated item list contains
only the indexes of transactions with all selected items appearing together. Each
transaction index is linked to the bitwise table so that all associated items in that
transaction can be retrieved. This technique can support a large number of items
in a transaction database. Though the bitwise technique needs some preprocessing
time to convert the transaction records to a bitwise table, it is efficient and effective
for searching the existing and associated items at run time.
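The bit-vector representation and the bitmask lookup can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the Visual C++ implementation: the function names and toy transactions are hypothetical, and the leftmost bit is assumed to correspond to the first item so that the encoding matches the 1010 example above.

```python
from typing import Iterable, List, Optional, Set

# Items in ascending order of item identification, as in the example above.
ITEMS = ["milk", "bread", "cheese", "cereal"]


def to_bits(transaction: Set[str], items: List[str] = ITEMS) -> int:
    """Encode a transaction as an integer; the leftmost bit is the first item."""
    bits = 0
    for item in items:
        bits = (bits << 1) | (1 if item in transaction else 0)
    return bits


def matching_indexes(table: List[int], mask: int,
                     candidates: Optional[Iterable[int]] = None) -> List[int]:
    """Indexes of transactions that contain every item set in `mask`.

    Passing the indexes found for the previous selection as `candidates`
    avoids rescanning transactions that have already been ruled out.
    """
    indexes = range(len(table)) if candidates is None else candidates
    return [i for i in indexes if (table[i] & mask) == mask]


def items_in(bits: int, items: List[str] = ITEMS) -> List[str]:
    """Decode a bit vector back into the list of items it contains."""
    n = len(items)
    return [item for pos, item in enumerate(items) if (bits >> (n - 1 - pos)) & 1]


# Hypothetical toy database (not data from the thesis).
transactions = [
    {"milk", "cheese"},
    {"milk", "bread", "cheese"},
    {"bread", "cereal"},
    {"milk", "bread", "cheese", "cereal"},
]
table = [to_bits(t) for t in transactions]           # the bitwise table

with_milk = matching_indexes(table, to_bits({"milk"}))
with_milk_bread = matching_indexes(table, to_bits({"milk", "bread"}), with_milk)
print(format(table[0], "04b"), with_milk, with_milk_bread)
```

Keeping only the matching transaction indexes between selections means that each further selection tests a shrinking candidate list with one AND operation per transaction.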
4.5 A user study of VisDM
4.5.1 Experimental methodology
To evaluate the usability of the system, I conducted a user study with seven
postgraduate students from the School of Computer Science and Software Engineering,
asking them to perform data mining tasks and report their findings. The
experiment was run on two datasets, and all participants had to complete four main
tasks on each dataset, as shown in Appendix B.
The first dataset is derived from the UCI Machine Learning Repository [26], and its
items are denoted by numbers. The other dataset is from Data Mining II (DMII) [46]
and includes the associated item names for each transaction. Before starting an
experiment, each participant was given a tutorial on the terminology needed for
interpreting association rules and frequent itemsets, as well as instructions for
using VisDM. An example of VisDM in action was also shown to the participants. At the end
of the experiment, the participants completed a brief usability questionnaire partly
derived from Stasko [70] and Marghescu and Rajanen [48].
4.5.2 Results
The participants were asked about their experience in mining and visualization using
VisDM. None of the participants had extensive experience in data analysis and only
three participants had some experience in using visualization tools. The usability,
visualization, interaction, and information data from the study are presented in
Figure 44. 57% of the participants found that the parameters shown in the tool are
understandable and that the tool is easy to use, though 29% of the participants did
not agree that the tool is easy to use. The tool was found easy to learn (29% strongly
agree, 43% agree, and 28% fair). 43% of the participants agreed that the tool was
easy to use for completing the tasks, while 14% of the participants did not agree. For
quality of visualization, all participants provided positive feedback on identifying
the most and least often bought items. More than 55% of the participants found that
they could identify the maximum and minimum percentage of items purchased
together and appreciated the clarity of the visual representation, though about 14%
of the participants did not agree. For quality of interaction (i.e., the ability to
change the selection of items, to explore data, to use parameters, and to direct the
search for data of interest), most participants provided positive feedback. 86% of
the participants agreed or strongly agreed that they were able to correct their
mistakes, though 14% of the participants did not agree. Moreover, most participants
provided positive feedback on the quality of information they could collect about
the underlying dataset. Figure 45 shows that the participants spent more time
completing the second and third tasks on Dataset1 than the corresponding tasks on
Dataset2. This result supports the idea that participants could search for frequent
itemsets and association rules of interest faster after gaining some experience from
the previous tasks. However, the user study is limited to a small set of participants,
none of whom had extensive experience in data analysis.
4.6 Visualization of many association rules
I have discussed the facility of visualizing a small number of association rules in
the VisDM system. This section discusses a system called VisAR which is suitable
for visualizing a large number of association rules. Typically, association rules
generated by mining algorithms are difficult for users to understand. Visualization
allows users to visually analyze and understand the mined association rules.
Zhao and Liu have proposed a visualization technique [82] for association rules.
Their technique uses a line to represent each association rule. The x-axis represents
time data and the y-axis represents the support or confidence value. Although this
technique is designed to help users to understand mined association rules through
visual analysis of time, their visualization uses a technique similar to the parallel
coordinates technique [34]. In practice, this technique causes occlusion and screen
clutter when visualizing a large number of association rules.
Wong et al. use a 3D visualization framework for association rules [81]. The
approach is based on a Matrix-based technique. Although this technique can visualize
many-to-one association rules, the number of association rules generated by
association rule mining algorithms is massive, and it is difficult to display all of
them using this technique. In particular, the technique is prone to occlusion.
Though the authors claim that the height of the columns is scaled, the higher
columns representing antecedent and consequent items of the rules can still
occlude the columns of low support and confidence. In contrast, my technique makes
it possible to view not only many-to-one but also many-to-many association rules.
It allows users to select items existing in association rules so that they can view
only the association rules containing their items of interest.
Figure 44: The results from questionnaires in different categories: (a) Usability, (b) Visualization, (c) Interaction, (d) Information.
Although Table-based, Matrix-based, and Graph-based techniques as well as some
commercial visualization systems are capable of representing mined association
rules, they visualize all mined association rules in a single view. Typically, visual-
izing all association rules at once produces too much information and might also
generate screen clutter. It is difficult for users to interpret and extract interesting
association rules from a single view of all rules.
This section presents a new technique called VisAR for visualizing association rules
derived from data mining algorithms. The aims of the VisAR system are similar to
those of VisEx presented in Chapter 3. I focus on reducing the complexity of
visualizing a large number of association rules in a single screen so that users are
able to effectively understand and interpret information from them. The system is
also designed to eliminate occlusion from the visualization. This new technique
visualizes the association rules containing user-specified items, so users can
explore association rules through their specified items of interest. The input for
visualizing many association rules in the VisAR system is derived from algorithms
in CBA [46]. The next section presents the details of VisAR with an example.
Figure 45: (a) The mean time of completing each task. (b) The correctness of each task in each dataset.
4.6.1 The VisAR system
The system has been designed based on the diagram in Figure 46. I have categorized
all processes in the diagram into four major stages.
• Managing association rules
• Filtering association rules of interest
• Visualizing selected association rules
• Interaction during visualization
The first stage includes two processes: specifying and loading association rules that
have been generated by an automated data mining tool as shown in Figure 46.
The specified association rules are first loaded into memory. The system counts all
provided association rules and the number of distinct items in both antecedents and
consequents as well as manages lists of items in antecedents and consequents. Then
the system sorts the association rules according to the support values of individual
association rules. The support is used as a default for sorting association rules.
The purpose of the second stage is to specify the items of interest in association
rules and filter association rules according to the specified items. The user specifies
the items of interest and the system filters the association rules for which the user-
specified items exist in the antecedents.
The aim of the third stage is to visualize the association rules containing the
selected items from the previous stage. Figure 47 shows the visualization result for
the selected items, namely cd and rice, and the user interface of VisAR. After the
user selects the items of interest, all association rules containing the specified items
are visualized on the right panel. All antecedents and consequents of all qualified
association rules are displayed along the y-axis. The antecedents are placed above
and the consequents below the x-axis, which is displayed as a bold black line.
The selected items of interest are displayed above the unselected items in the
antecedents along the y-axis. In Figure 47, the items in the antecedents of all
association rules are cd, rice, battery, soya sauce, newspaper, and sweets. The items
in the consequents are newspaper, battery, soya sauce, and sweets. The selected
items, cd and rice, are listed above battery, soya sauce, newspaper, and sweets. The
system displays the association rules side by side, ordered by their sorted support
values. Each rule is visualized as a vertical line parallel to the y-axis with
circular dots representing the items in the rule. For example, in Figure 47, the
first vertical line represents an association rule with five circular dots. Four dots
representing cd, rice, battery, and soya sauce are in the antecedent section and
another dot in the consequent section represents the newspaper item. The
association rule is {cd, rice, battery, soya sauce} ⇒ {newspaper}. This is the
rule with the highest support among all rules that include cd and rice in the
antecedent. The confidence of each association rule is mapped to a color ramp so
that the user can identify and group similar association rules according to color.
Figure 46: A diagram of the system for visualizing mined association rules.
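The row-and-column layout described for Figure 47 can be approximated with a small text rendering in Python. This is a rough sketch only: the function names are hypothetical, and apart from the first rule, which is quoted in the text, the rules below are invented so that the item rows match those listed for the figure.

```python
from typing import List, Sequence, Tuple

Rule = Tuple[Sequence[str], Sequence[str]]   # (antecedent items, consequent items)


def row_order(rules: List[Rule],
              selected: Sequence[str]) -> Tuple[List[str], List[str]]:
    """Return (antecedent rows, consequent rows) in top-to-bottom order."""
    antecedent_rows = list(selected)          # selected items are placed on top
    consequent_rows: List[str] = []
    for antecedent, consequent in rules:
        for item in antecedent:
            if item not in antecedent_rows:
                antecedent_rows.append(item)
        for item in consequent:
            if item not in consequent_rows:
                consequent_rows.append(item)
    return antecedent_rows, consequent_rows


def render(rules: List[Rule], selected: Sequence[str]) -> str:
    """Draw one column per rule: 'o' marks an item of the rule, '|' the line."""
    antecedent_rows, consequent_rows = row_order(rules, selected)
    width = max(len(name) for name in antecedent_rows + consequent_rows)
    lines = []
    for item in antecedent_rows:
        marks = " ".join("o" if item in a else "|" for a, _ in rules)
        lines.append(f"{item:<{width}}  {marks}")
    lines.append("=" * (width + 2 + 2 * len(rules) - 1))   # the bold x-axis
    for item in consequent_rows:
        marks = " ".join("o" if item in c else "|" for _, c in rules)
        lines.append(f"{item:<{width}}  {marks}")
    return "\n".join(lines)


# Rules assumed to be already sorted by support; only the first is from the text.
rules: List[Rule] = [
    (("cd", "rice", "battery", "soya sauce"), ("newspaper",)),
    (("cd", "rice", "newspaper"), ("battery",)),
    (("cd", "rice", "sweets"), ("soya sauce",)),
    (("cd", "rice", "battery"), ("sweets",)),
]
selected = ["cd", "rice"]
antecedent_rows, consequent_rows = row_order(rules, selected)
picture = render(rules, selected)
print(picture)
```

Because every rule occupies its own column and every item its own row, no two dots can fall on the same position.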
Ten different colors are used to represent ten equal intervals of either support or
confidence, expressed as a percentage from zero to one hundred. This color range
has been designed to enhance the human ability to group items according to color.
Red represents the maximum value range, 90-100%, while blue represents the minimum
value range, 0-10%. All association rules in Figure 47 are in the same range and
are mapped to the third color range, i.e., 20-30%.
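The mapping from a percentage to one of the ten color ranges can be sketched in Python as below. Only the two endpoint colors, blue and red, come from the text; the eight intermediate color names are placeholders, and treating each range as including its lower bound is an assumption of this sketch.

```python
# Index 0 is the 0-10% range and index 9 is the 90-100% range.
# Only "blue" and "red" are stated in the text; the rest are placeholders.
PALETTE = ["blue", "royalblue", "cyan", "teal", "green",
           "yellowgreen", "yellow", "orange", "orangered", "red"]


def color_bin(percent: float, bins: int = 10) -> int:
    """Return the index of the equal-width range that contains `percent`."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must lie between 0 and 100")
    # 100% would otherwise fall into an eleventh bin, so clamp it to the last.
    return min(int(percent * bins // 100), bins - 1)


def color_for(percent: float) -> str:
    """Look up the (placeholder) color for a support or confidence value."""
    return PALETTE[color_bin(percent)]


print(color_bin(25), color_for(5), color_for(95))
```

A rule with a confidence of 25%, for example, falls into the range with index 2, which is the third color range mentioned above.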
The last stage in VisAR is the interaction stage. This stage allows users to view
details of each association rule and provides flexible adjustments for viewing
specific association rules. The support and confidence values of an association rule
are shown when the user moves the mouse over the vertical line representing the rule.
The user can change both defaults of the system for visualizing association rules.
The first option is to change the view to display only association rules containing
exactly the specified items of interest. Figure 48 and Figure 49 show association
rules in which only cd and rice appear in the antecedents. The default of the
system is to display association rules whose antecedents contain the specified items
together with any other items. The second option is to change the sorting order from
support to confidence. By default, the system sorts according to support.
Figure 47: The left panel displays all antecedent items of association rules with the interactive options (operation and sorting) for visualizing association rules. The right panel visualizes association rules whose antecedent items are selected. cd and rice are the selected items in this figure. This visualization represents a selected OR operation.
Figure 48: Visualization of association rules from the selected items of interest in Figure 47. This visualization represents the selected operation AND which shows only association rules containing exactly the selected items, cd and rice in the antecedent.
Figure 49: This visualization represents the sorting of confidence which shows only association rules containing exactly the selected items, cd and rice. The color of vertical lines represents the support value of the association rules.
Figure 50: Visualization of association rules from the selected items of interest in Figure 47 but sorted according to confidence values.
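The filtering and sorting options of the second and fourth stages can be sketched in Python as follows. The class, the function names, and all support and confidence values are hypothetical. The sketch also assumes one reading of the two viewing options: by default a rule qualifies when its antecedent contains all selected items, possibly with others, and with the exact option its antecedent must consist of the selected items only.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, List


@dataclass(frozen=True)
class Rule:
    antecedent: FrozenSet[str]
    consequent: FrozenSet[str]
    support: float
    confidence: float


def filter_rules(rules: List[Rule], selected: Iterable[str],
                 exact: bool = False) -> List[Rule]:
    """Keep rules whose antecedent contains (or exactly equals) the selection."""
    chosen = frozenset(selected)
    if exact:
        return [r for r in rules if r.antecedent == chosen]
    return [r for r in rules if chosen <= r.antecedent]


def sort_rules(rules: List[Rule], key: str = "support") -> List[Rule]:
    """Sort in descending order of support (the default) or confidence."""
    if key not in ("support", "confidence"):
        raise ValueError("key must be 'support' or 'confidence'")
    return sorted(rules, key=lambda r: getattr(r, key), reverse=True)


# Invented rules and values, for illustration only.
rules = [
    Rule(frozenset({"cd", "rice", "battery", "soya sauce"}),
         frozenset({"newspaper"}), 0.12, 0.28),
    Rule(frozenset({"cd", "rice"}), frozenset({"battery"}), 0.10, 0.29),
    Rule(frozenset({"cd", "rice"}), frozenset({"sweets"}), 0.08, 0.24),
    Rule(frozenset({"cd", "newspaper"}), frozenset({"sweets"}), 0.11, 0.27),
]

default_view = sort_rules(filter_rules(rules, {"cd", "rice"}))
exact_view = sort_rules(filter_rules(rules, {"cd", "rice"}, exact=True),
                        key="confidence")
print([r.support for r in default_view])
print([r.confidence for r in exact_view])
```

Filtering before sorting keeps the displayed set small, so changing the sort key only reorders the rules that are already on screen.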
4.6.2 The advantages of VisAR
The VisAR system can be considered a hybrid of the Matrix-based and Graph-based
techniques. The technique has the following advantages over the Table-based,
ordinary Matrix-based, and Graph-based techniques.
• VisAR allows users to specify items of interest for visualizing association rules
containing such items. This feature allows users to focus on specific
association rules instead of viewing all association rules in a single
view.
• VisAR has no limitation on the number of items in both the antecedent and
the consequent to be displayed. The system can visualize both many-to-one
and many-to-many association rules seamlessly.
• VisAR employs the benefits of both Matrix-based and Graph-based techniques
for placing and linking items in association rules to solve the occlusion prob-
lem. The Matrix-based technique organizes the items like an array in which
items are placed in rows while association rules are displayed by columns.
The employed Graph-based technique links the same groups of items and the
items of the same association rules so that users can easily identify the groups
of items and individual association rules.
• There is no screen clutter or occlusion even when a large number of rules are
displayed on the same screen.
• VisAR visually separates antecedent items and consequent items so that the
users can clearly distinguish between the antecedent items and the consequent
items of the association rules.
• The simplicity of VisAR makes the visualization easier to interpret. The
users can identify groups of association rules which have close
values of support or confidence.
4.7 Summary
Visualization techniques have been widely researched and integrated into many ap-
plications involving data analysis tasks including data mining in order to increase
human abilities to deeply understand data and extract hidden patterns from large
datasets. However, current association rule mining algorithms have some
shortcomings. Most of these algorithms mine a large number of association rules,
and some of these rules are not practically interesting. Moreover, it is difficult for
analysts to understand and interpret a large number of rules. Most of the visualiza-
tion techniques display all mined association rules in a single screen. It is difficult
for an analyst to interpret such large amounts of information. In addition, some
visualization techniques encounter problems of screen clutter and occlusion.
The VisDM system has been introduced for mining association rules. The tight
coupling of VisDM helps users in filtering only interesting association rules. The
interactive visualization technique of VisDM is useful in mining market basket as-
sociation rules so that users can obtain visual feedback and apply their knowledge
in guiding the mining process.
The VisAR technique reduces the number of association rules visualized at one time
so that a large number of rules can be interpreted and understood effectively. The
analysts can also choose to view specific association rules through their choice of
items of interest. In addition, the visualization technique overcomes the problems
of screen clutter and occlusion.
The next chapter introduces an integration of a visual exploration technique sim-
ilar to VisEx into an on-line analytical processing (OLAP) system to enhance the
analysis and decision making capabilities of analysts.
Chapter 5
Interactive Visualization for
On-line Analytical Processing
5.1 Introduction
Modern business processes generate an enormous amount of data that needs to be
analyzed and understood for better business performance. Executives, managers,
and analysts need a tool for making decisions and planning strategies. On-line
analytical processing (OLAP) has become an important tool for interactive analysis
of multidimensional databases such as data warehouses. This tool helps analysts
to explore, analyze, and extract interesting patterns from massive amounts of data
stored in multidimensional databases. A variety of industries have adopted data
warehouses as the preferred mode of data storage in order to manage the explosive
growth of their databases [11].
OLAP tools provide functionalities such as slicing, rolling up, and drilling down
for an end user to analyze and navigate through dynamic multidimensional data
cubes. Though OLAP research has been conducted extensively in the past several
years, most research in OLAP has been focused on modeling tasks with textual
forms of presentation [14, 54, 74] e.g., pivot tables. Some commercial systems [31,
105
106CHAPTER 5. INTERACTIVE VISUALIZATION FOR ON-LINE ANALYTICAL PROCESSING
54] provide combinations of visualization techniques such as bar charts, line
graphs, and histograms, which allow users to view only a single snapshot of each
textual representation. Very little research is available on interactive
visualization for OLAP.
Visualization is a powerful tool supporting visual representation and exploration
of massive datasets. The capability of humans to interpret and capture informa-
tion from graphical formats such as a chart is better than from a list of numbers
or from text files. This chapter introduces a novel interactive visual exploration
technique for analysis of multidimensional data cubes from data warehouses. To
obtain an effective and powerful analysis, the tool incorporates visualization into
OLAP services, which enables analysts to explore overviews of high levels of data
and drill down into levels of detail of each dimension directly. The integration of
both visualization and OLAP not only helps users to extract interesting patterns
but also helps them to interpret and analyze the extracted information from OLAP
faster. My technique allows users to view the visualization of all previously selected
paths of interest so that users do not need to remember which levels and dimensions
they are looking at.
Since hierarchical structures have been deployed in most multidimensional databases,
I feel that it is difficult for users to explore multidimensional data with a tool
that provides only overviews of data. It is important for users to drill down
interactively through the lower levels of detail to refine their views. Furthermore,
interactive textual displays alone, such as the PivotTable, are not effective for
understanding or extracting patterns from multidimensional databases.
Sifer has presented the SGViewer [63] tool, an interface technique for querying in
OLAP. The technique consists of three coordinated views: progressive,
global, and result. A user can drill down through the progressive view
and inspect details of the results through the result view. The global view displays
the trend of a found result. Although SGViewer is conceptually similar,
it differs from the technique developed in this thesis. The SGViewer technique
is only a design exercise and supports only leaf nodes in five dimensions, whereas
the approach in this thesis has no such limitation.
This new technique, called VisOLAP, allows users to decide at any time whether
they want to get overviews of data, drill down into low levels, roll up to high
levels, or view any particular region of interest of the data. The VisOLAP
system provides a navigation facility which relieves users of having to remember
the exploration path of interest. In addition, the user is able to keep track of
the exploration and view the distribution of navigation results across the selected
dimensions with their explored levels and members.
This chapter is organized as follows. Section 5.2 introduces terminology used
throughout this chapter. I then introduce the system architecture and discuss
how the system is implemented in Section 5.3. Section 5.3.1 describes the system
connection component, followed by a discussion on how to visualize the OLAP data
cube in Section 5.3.2. The remaining components of the VisOLAP system are described
next: the interaction tool in Section 5.4 and the query generation tool
in Section 5.5. An analysis of VisOLAP, including the results of a case study,
is given in Section 5.6 and a summary in Section 5.7.
5.2 Terminology
A detailed discussion of OLAP technology is beyond the scope of this thesis; however,
a brief overview of some of the concepts used in this chapter is given below. An
on-line transaction processing (OLTP) system is associated with relational database
systems and serves everyday transactions and operations. In contrast,
an on-line analytical processing (OLAP) system is associated with multidimensional
database systems, and its data are normally stored in data warehouses [21, 76]. The OLAP
system helps analysts or knowledge users such as managers in analysis tasks and
decision making. Data in warehouses are historical data, summarized and
aggregated from a variety of relational databases. In both OLAP tools and data
warehouses, a model of data is formed as a multidimensional data cube.
The multidimensional data model, known as a data cube, consists of three major
components: schema, dimensions, and measures. The schema typically
contains a fact table and dimension tables. Schemas can be categorized into
three types as follows.
• Star schema: The fact table is a large central table surrounded by dimension
tables and containing measures and dimension keys linking to individual di-
mension tables. Each dimension table also contains a set of attributes. This
schema conceptually forms a shape like a star.
• Snowflake schema: This schema is an alternative to the star schema. Its shape
is like a snowflake. Some dimension tables of this schema are in normalized
form. In other words, those dimension tables are split into sub-tables to reduce
redundancy. However, the additional join operations can reduce the efficiency
of this schema when executing a query.
• Fact Constellation schema: This schema is a set of star schemas. It contains
multiple fact tables sharing dimensional tables.
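As a rough illustration of the star schema described above, the following Python sketch builds a tiny in-memory fact table with two dimension tables using the standard sqlite3 module, and aggregates a measure by joining along a dimension key. All table names, column names, and values are invented for this illustration and are not taken from any database used in this thesis.

```python
import sqlite3


def build_star_schema():
    """Create a toy star schema: one fact table and two dimension tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, family TEXT, name TEXT);
        CREATE TABLE dim_time (time_id INTEGER PRIMARY KEY, year INTEGER, quarter TEXT);
        -- The fact table holds the measure plus one key per dimension table.
        CREATE TABLE fact_sales (
            product_id INTEGER REFERENCES dim_product(product_id),
            time_id INTEGER REFERENCES dim_time(time_id),
            unit_sales INTEGER
        );
        INSERT INTO dim_product VALUES (1, 'Drink', 'Cola'), (2, 'Food', 'Bread');
        INSERT INTO dim_time VALUES (1, 1997, 'Q1'), (2, 1997, 'Q2');
        INSERT INTO fact_sales VALUES (1, 1, 10), (1, 2, 15), (2, 1, 30), (2, 2, 5);
    """)
    return conn


def unit_sales_by_family(conn):
    """Aggregate the measure over one attribute of the product dimension."""
    rows = conn.execute("""
        SELECT p.family, SUM(f.unit_sales)
        FROM fact_sales AS f
        JOIN dim_product AS p ON p.product_id = f.product_id
        GROUP BY p.family
        ORDER BY p.family
    """).fetchall()
    return dict(rows)
```

In a snowflake variant of this sketch, the family attribute would move into its own sub-table, so the same aggregation would need one more join.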
Dimensions and measures organize a data cube so that aggregated data can be
viewed from different perspectives. The term dimension is used to
represent categories of data. Dimensions and measures are similar to independent and
dependent variables in statistics. The distinction between dimensions and measures
is as follows.
• Dimensions are organized in a hierarchical fashion and are similar to inde-
pendent variables. Dimensions are distributed along the dimension tables of
the schema. For instance, a product is the dimension and the number of unit
sales of the product is the measure. The dimensions usually have hierarchies
consisting of multiple levels of abstraction from a high level to a low level.
For example, the same product dimension may be composed of product fam-
ily, product department, and product name. A time dimension comprises year,
quarter, and month.
• A measure is similar to a dependent variable and is a numeric value.
Aggregating a measure should produce a new, meaningful number. Typically,
measures are stored in the fact table of the schema. An example of a
measure is the sales amount.
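The relationship between a hierarchical dimension and an aggregated measure can be sketched in a few lines of Python. The level names follow the time dimension mentioned above; the record layout, the function name, and the sample values are assumptions made only for this example.

```python
from collections import defaultdict

# Levels of a hypothetical Time dimension, ordered from high to low.
TIME_LEVELS = ("year", "quarter", "month")


def aggregate(records, level, measure="unit_sales"):
    """Sum a measure over all records that share the same path down to `level`."""
    depth = TIME_LEVELS.index(level) + 1
    totals = defaultdict(int)
    for rec in records:
        # The key is the member path from the top level down to `level`.
        key = tuple(rec[name] for name in TIME_LEVELS[:depth])
        totals[key] += rec[measure]
    return dict(totals)


records = [
    {"year": 1997, "quarter": "Q1", "month": 1, "unit_sales": 4},
    {"year": 1997, "quarter": "Q1", "month": 2, "unit_sales": 6},
    {"year": 1997, "quarter": "Q2", "month": 4, "unit_sales": 5},
]
```

Calling aggregate(records, "quarter") groups the measure by (year, quarter), while aggregate(records, "year") yields a single total per year; the dimension supplies the grouping and the measure supplies the number being summed.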
Microsoft SQL Server for OLAP maps a data schema to ADOMD objects as the
diagram in Figure 51 shows. The diagram is mainly composed of collections and
objects.
To communicate with an OLAP server of Microsoft SQL Server Analysis Ser-
vices [29], there are three main approaches as follows.
• Decision Support Object model (DSO)
• Add-ins Interface and Objects
• PivotTable Service
VisOLAP relies on PivotTable Service, so the implementation supports access to
the OLAP server only through PivotTable Service. PivotTable Service provides
OLE DB functionality for accessing both multidimensional data and data mining
through Multidimensional Expressions (MDX). Just as SQL is a syntax for querying
and manipulating data in relational databases, MDX is a powerful syntax for
querying and manipulating multidimensional data in OLAP data cubes [53].
Figure 51: A diagram of ADOMD object model [29].
5.3 VisOLAP system architecture and implementation
The VisOLAP system architecture consists of four main components: system
connection, visualization, interaction, and query generation, as shown in Figure 52.
The system connects through an OLAP server to a multidimensional database, or data
cube, which the user provides. The user selects dimensions to explore from the
user interface in the left panel, as shown in Figure 54. The system then retrieves
details of members in the particular dimensions from the data cube and organizes
these members in barsticks. This visual feedback gives users the member details
of each selected dimension so that they can interact and explore the correlations
among these selected dimensions. The interaction tool allows users to obtain de-
tails of individual members of the dimensions and browse into deeper levels of each
dimension. More details of each component are described in the following sections.
I have implemented the system for visualizing OLAP data cubes in Visual C++
with both the ActiveX Data Objects (ADO) and ActiveX Data Objects Multidimensional
(ADOMD) interfaces. The ADOMD interface is an extension of the ADO
interface. The system has been developed to access multidimensional databases
through the PivotTable Service in Microsoft SQL Server 2000 Analysis Services.
5.3.1 System connection
To connect VisOLAP with a multidimensional database, the system needs to open
a connection to the multidimensional database and create a catalog to activate
the connection. The system then prepares a system structure and retrieves the
multidimensional schema information to obtain details of a data cube structure.
The system initializes a tree visualization of the schema as well as organizes a data
structure for retrieving properties of dimensions through ADOMD objects.
Figure 52: VisOLAP system architecture
5.3.2 Visualizing OLAP data cubes
I have adapted the idea in Chapter 3 for exploring the hierarchical structure of OLAP
data cubes. A barstick is again used to represent each dimension of the data
cube, but the details of the display are different.
One of the frameworks for visualizing OLAP data cubes in VisOLAP is shown in
Figure 53. The figure represents a framework of a product sales data cube and
its visualization. The data cube has three hierarchical dimensions, Time
(T), Product (P), and Location (L), and consists of eight data cells. Each data cell
shows the number of products sold in a specific location at a specific time. For
example, V1 is the number of P1 sold in location L1 at time T1. Each dimension is
displayed on a barstick. Barsticks are arranged vertically in hierarchical fashion
when users select the highest level of a dimension of interest. A barstick is divided
into small rectangles which represent all existing members of the level in that
dimension. A member that does not have a measure value is hidden from the barstick.
The length of each rectangle is proportional to the member's share of the measure
value in the selected dimension. For example, suppose the selected dimension is
‘time’ and it contains four members in the next level, called ‘quarter’. The total
profit is $250,000 and the profits for the quarters are $70,000, $55,000, $37,500
and $87,500. The proportions of the total profit for the quarters are then 28%, 22%,
15%, and 35%, respectively, and the lengths of the rectangles representing the
quarters are in the same proportions.
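The length calculation in the example above can be written as a short Python sketch. The function name, the bar length of 100 units, and the use of None to mark a member without a measure value are assumptions of this sketch rather than details of the VisOLAP implementation.

```python
def barstick_segments(measures, bar_length=100.0):
    """Return (member, proportion, rectangle_length) for each visible member.

    Members whose measure value is None are treated as having no measure
    value and are hidden (an assumption of this sketch).
    """
    visible = {m: v for m, v in measures.items() if v is not None}
    total = sum(visible.values())
    if total == 0:
        return []  # nothing to draw
    return [(m, v / total, bar_length * v / total) for m, v in visible.items()]


# Quarterly profits from the example in the text.
quarterly_profit = {"Q1": 70_000, "Q2": 55_000, "Q3": 37_500, "Q4": 87_500}
for member, share, length in barstick_segments(quarterly_profit):
    print(f"{member}: {share:.0%} of the barstick, length {length:.1f}")
```

With the quarterly profits from the text, the shares come out as 28%, 22%, 15%, and 35% of the barstick.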
The system interface can be divided into three areas as follows.
• First, the left panel represents the multidimensional data model.
• The upper right panel displays the visual exploration through interaction.
• Finally the lower right panel is for viewing one deeper level.
Figure 54 presents an example of visualization for exploring OLAP data cubes. A
tree structure in the left panel of Figure 54 represents the hierarchical arrangement
Figure 53: The left side of the figure shows a data cube consisting of hierarchical dimensions including Time (T), Product (P), and Location (L), and eight data cells with the numbers of product sales. The right side is the visualization of the selected dimensions with their members.
Figure 54: The left panel shows the tree structure of the OLAP data cube. The upper right panel visualizes the selected dimensions as hierarchical barsticks including Product Family, Product Department, Store Type, Year, and Quarter levels. The lower right panel displays one deeper level of the data cube in advance depending on the position of the mouse. In this case, the user has positioned the mouse on ‘Q3’ and this panel shows the ‘Month’ level of the third quarter.
of the data cube. The upper right panel displays explored hierarchical barsticks
while the lower right panel illustrates one deeper level of the currently selected
member. In this example, ‘Sales’ is selected as the data cube of interest and
‘Unit sales’ is selected as the measure of interest. The first barstick represents
the ‘Product Family’ level of the ‘Product’ dimension. ‘Product Family’ has three
members, ‘Drink’, ‘Foods’, and ‘Non-consumable’, and ‘Foods’
has the highest unit sales, shown by its rectangle being the longest in the
barstick. The second barstick illustrates the ‘Product Department’ level of the
‘Product’ dimension. The ‘Drink’ member of ‘Product Family’ consists of ‘Alcoholic
Beverages’, ‘Beverages’, and ‘Dairy’. The ‘Store Type’ level of the ‘Store Type’
dimension is selected in the third barstick, and ‘Supermarket’ has the highest
proportion of unit sales of the drink product, followed by ‘Deluxe Supermarket’,
‘Gourmet Supermarket’ (G), ‘Mid-Size Grocery’ (M) and ‘Small Grocery’ (S). The fourth
and last barsticks explore the ‘Year’ and ‘Quarter’ levels of the ‘Time’ dimension.
The data for unit sales of the ‘Drink’ product exist only for 1997, and the fourth
quarter has the largest amount of unit sales. When the user places the mouse over
the third quarter in the last barstick, the lower right panel displays the
month level consisting of July, August, and September (7, 8, and 9). This is the
next deeper level in the ‘Time’ dimension.
5.4 Interaction
To efficiently support the analysis and exploration processes of the hierarchical
structure, the technique provides several navigational functions:
Drill down: Drill down is a function to navigate into deeper hierarchical levels of
each dimension in a data cube to obtain more details in a particular member. A
framework of this function is shown in Figure 55. This framework represents the
Drill down interaction into one lower level of the Product dimension and mapping
the numbers of the cells to the visualization. Figure 57, for example, shows the drill
Figure 55: An illustration representing the mapping numbers of the drilled down cells in the data cube to the visualization.
down function on the ‘Location’ dimension from the ‘Countries’ level to the ‘States’
level in the data cube view. My technique provides this function through a double
click of the left mouse button. Users can drill down into deeper levels at any time
in any dimension. This feature allows users to view a dimension in more detail.
Figure 54 shows drilling down in the ‘Product’ dimension from ‘Product Family’ to
‘Product Department’ and in the ‘Time’ dimension from the ‘Year’ level to the
‘Quarter’ level for ‘Unit sales’ in the ‘Deluxe Supermarket’ store type.
Roll up: In contrast to drill down, roll up is a function to navigate for exploring
upper levels of dimensions so that a user can see an overview of explored members.
A framework of this function is shown in Figure 53. This framework represents
rolling up in the Product dimension of the framework in Figure 55. Figure 57
also shows the data cube view of a roll up operation from the ‘States’ level to
the ‘Countries’ level in the ‘Location’ dimension. Users can roll up any particular
dimension by double clicking the right mouse button. Figure 58 illustrates rolling
up of the ‘Product’ dimension from the ‘Product Department’ level to the ‘Product
Family’ level.
Slice: This function allows users to view a particular sub-cube for any selected
Figure 56: An illustration representing the mapping numbers of the sliced cells in the data cube to the visualization.
dimension of the data cube. Figure 56 represents a framework of the Slice function
in which the selection changes from P1 in the framework of Figure 53 to P2.
In Figure 57, a slice operation is shown on the ‘Drink’ member in the ‘Product’
dimension. My tool provides this function through a left mouse click for viewing the
measure values of other members in the same level. Figure 59 shows the change of
exploration from Deluxe Supermarket (in Figure 54) to Supermarket.
All navigational functions can be automatically combined when users interact with
each barstick so that they can view any particular region of interest in the data
cube. For instance, the combination of the drill down and slice functions allows
users to explore unit sales of ‘Alcoholic Beverages’ in ‘Product Department’ for all
quarters in 1997. Moreover, users can view each dimension independently when its
barstick is first created or by clicking the right mouse button on any barstick. The
system also supports details on demand. When users move the mouse over any rectangle
in a barstick, the details of the rectangle, including the name of the member
and its measure value, are shown in a pop-up box, as shown for ‘Q3’ in Figure 54.
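The three navigational functions can be pictured as operations on the path explored within one dimension. The Python sketch below is a simplified model of that idea only; the class name, its methods, and the error handling are hypothetical and do not describe the Visual C++ implementation of VisOLAP.

```python
class DimensionPath:
    """Tracks the explored path within one hierarchical dimension."""

    def __init__(self, name, levels):
        self.name = name
        self.levels = list(levels)  # ordered from the highest to the lowest level
        self.selected = []          # one selected member per drilled-down level

    @property
    def current_level(self):
        """The level currently shown on the lowest barstick of this dimension."""
        return self.levels[len(self.selected)]

    def drill_down(self, member):
        """Select `member` at the current level and move one level deeper."""
        if len(self.selected) >= len(self.levels) - 1:
            raise ValueError("already at the lowest level")
        self.selected.append(member)

    def roll_up(self):
        """Return to the next higher level, dropping the last selection."""
        if not self.selected:
            raise ValueError("already at the highest level")
        self.selected.pop()

    def slice_to(self, member):
        """Switch the most recent selection to another member of the same level."""
        if not self.selected:
            raise ValueError("no member has been selected yet")
        self.selected[-1] = member
```

Keeping the list of selected members is what lets a tool redraw the whole exploration trail, so the user does not have to remember it.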
Figure 57: Examples of OLAP functionalities including drilling down, rolling up, and slicing on multidimensional data.
Figure 58: This figure shows the selected dimensions including Product Family, Store Type, Year, and Quarter. The Product dimension is rolled up from the Product Department level to the Product Family level of Figure 54.
5.5 Visual Exploration and MDX query
I have implemented the binding of MDX queries with the navigational functions of
the interactive tool to enable users who are not OLAP experts to explore OLAP
data cubes and data warehouses without generating sophisticated MDX queries.
The basic syntax of an MDX statement looks similar to a SQL statement. An
example syntax of the MDX statement is:
SELECT < member selection > on axis1, < member selection > on axis2, ..
FROM < cube name >
A calculated member is a member of a dimension which is derived from values of the
other members. It has been used in the interaction tools of the VisOLAP system.
The definition of the calculated member is stored and calculated in response to
a query. The calculated member can be described by MDX statements, namely
WITH MEMBER and CREATE MEMBER. Only the WITH MEMBER statement
is used for setting up interactive queries in the system to aggregate new member
values and measures without increasing the size of a cube.
I describe some examples of the combination of MDX and the interactive tools
Figure 59: This figure shows the change of member selection from Deluxe Supermarket to Supermarket in the Store Type level.
based on Figure 54. Suppose ‘Sales’ is the selected data cube, ‘Unit Sales’ is the
selected measure, and ‘Product Family’ is the selected level of the ‘Product’
dimension for querying. To view all members of the ‘Product Family’ level in
proportion, an MDX query for this process can be written as:
WITH MEMBER Measures.[sum] AS
'sum([Product].[Product Family].members, Measures.[Unit Sales])'
MEMBER Measures.[percent] AS '(([Product].CURRENTMEMBER,
Measures.[Unit Sales])/(Measures.[sum]))',
FORMAT_STRING = 'Percent'
SELECT {[Measures].[percent], [Measures].[Unit Sales]} on columns,
NON EMPTY [Product].[Product Family].members on rows
FROM Sales
When the user drills down on the ‘Product Department’ level through ‘Drink’ in the
‘Product Family’ level, an equivalent MDX query as shown below is automatically
generated to display the ‘Product Department’ members on the second barstick.
WITH MEMBER Measures.[sum] AS
'sum([Product].[All Products].[Drink].children, Measures.[Unit Sales])'
MEMBER Measures.[percent] AS '(([Product].CURRENTMEMBER,
Measures.[Unit Sales])/(Measures.[sum]))',
FORMAT_STRING = 'Percent'
SELECT {[Measures].[percent], [Measures].[Unit Sales]} on columns,
NON EMPTY {DESCENDANTS([All Products].[Drink], [Product Department])}
on rows
FROM Sales
After the user drills down onto ‘Product Department’ level, the user might explore
several new dimensions for viewing the correlations. It is possible for the user to
explore dimensions which are drilled down. An equivalent MDX query representing
the exploration interaction as shown in Figure 54 is:
WITH MEMBER Measures.[sum] AS
'sum([Time].[1997].children, Measures.[Unit Sales])'
MEMBER Measures.[percent] AS '(([Time].CURRENTMEMBER,
Measures.[Unit Sales])/(Measures.[sum]))',
FORMAT_STRING = 'Percent'
SELECT {[Measures].[percent], [Measures].[Unit Sales]} on columns,
NON EMPTY {DESCENDANTS([Time].[1997], [Time].[Quarter])} on rows
FROM Sales
WHERE ([Product].[All Products].[Drink].[Alcoholic Beverages],
[Store Type].[All Store Type].[Deluxe Supermarket])
As shown in Figure 54 and Figure 59, suppose the user changes the queried members
in the barstick representing the ‘Store Type’ level from ‘Deluxe Supermarket’ to
‘Supermarket’. The equivalent MDX statement for this process to display the ‘Year’
members on the fourth barstick of the ‘Time’ dimension is:
WITH MEMBER Measures.[sum] AS
'sum([Time].[Year].members, Measures.[Unit Sales])'
MEMBER Measures.[percent] AS '(([Time].CURRENTMEMBER,
Measures.[Unit Sales])/(Measures.[sum]))',
FORMAT_STRING = 'Percent'
SELECT {[Measures].[percent], [Measures].[Unit Sales]} on columns,
NON EMPTY [Time].[Year].members on rows
FROM Sales
WHERE ([Product].[All Products].[Drink].[Alcoholic Beverages],
[Store Type].[All Store Type].[Supermarket])
When the user rolls up on the ‘Product’ dimension from the ‘Product Department’ level
to ‘Product Family’, the user needs to take a few more interaction steps
if no member of the rolled-up level is selected for the dimensions that
follow. VisOLAP generates all of these MDX queries automatically in response to user
interaction.
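The query pattern shared by the examples in this section can be sketched as a small string-building function. This Python sketch is only illustrative: the function name and its parameters are hypothetical, it uses one set expression for both the sum and the rows, and it is not the query generation code of VisOLAP, which was written in Visual C++.

```python
def proportion_query(cube, measure, row_set, hierarchy, slicer=()):
    """Build an MDX string that returns each member's measure and its share.

    `row_set` is an MDX set expression for the members shown on a barstick,
    `hierarchy` is the dimension whose current member is evaluated, and
    `slicer` lists the members selected on the other barsticks.
    """
    lines = [
        "WITH MEMBER Measures.[sum] AS",
        f"'sum({row_set}, Measures.[{measure}])'",
        f"MEMBER Measures.[percent] AS '(({hierarchy}.CURRENTMEMBER, "
        f"Measures.[{measure}])/(Measures.[sum]))',",
        "FORMAT_STRING = 'Percent'",
        f"SELECT {{[Measures].[percent], [Measures].[{measure}]}} on columns,",
        f"NON EMPTY {{{row_set}}} on rows",
        f"FROM {cube}",
    ]
    if slicer:
        # Members chosen on other barsticks restrict the cube via WHERE.
        lines.append("WHERE (" + ", ".join(slicer) + ")")
    return "\n".join(lines)


query = proportion_query(
    cube="Sales",
    measure="Unit Sales",
    row_set="[Time].[Year].members",
    hierarchy="[Time]",
    slicer=["[Store Type].[All Store Type].[Supermarket]"],
)
```

Under this sketch, a slice changes only the slicer list, while a drill down or roll up changes the row set, which is why each interaction can be mapped to a regenerated query.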
5.6 Analysis
The FoodMart 2000 database [29] is used in the case study. The database consists of
data cubes such as ‘Budget’, ‘HR’, ‘Sales’, and ‘Warehouse’. The ‘Sales’ data cube
comprises twelve dimensions (not counting the hierarchical levels of each dimension)
and seven measures. The ‘Product’, ‘Time’, ‘Store Type’, ‘Promotion Media’, and
‘Promotions’ dimensions and the ‘Unit sales’ and ‘Profit’ measures are used for the
case study. Suppose the store manager would like to increase the sales of the drink
products stocked in the store. Figures 54, 58, and 59 show exploration of the drink
product family. The manager can extend the exploration to the ‘Promotions’ and
‘Promotion Media’ dimensions to see how they affect the sales amounts in each
year. For example, it is easy to find the following correlations from exploration of
the data cube.
Daily paper, radio, and TV tend to be the most effective media for increasing the
amount of sales during the sales days promotion of the supermarkets, while bulk mail
tends to be the most effective way to advertise promotions including ‘You Save Day’,
‘Shelf Emptiers’, and ‘Sales Galore’ for gourmet supermarkets. Daily paper is the
most effective medium to advertise promotions such as ‘Big Time Discounts’
for mid-size groceries, and ‘In-Store Coupon’ is the most effective medium for
sales promotions in small groceries, as shown in Figure 60. In addition, the
number of unit sales varies over time. For instance, the fourth quarter has the
highest number of alcoholic beverage and beverage sales in all stores except the
small groceries, which have the highest number of alcoholic beverage sales in the
third quarter, as shown in Figure 61. Since analysts, managers, and
executives know the market and store situations well, they can explore and analyze
the data in several other efficient ways.
5.7 Summary
OLAP applications have been researched extensively, but most of this research
has investigated only modeling tasks and the presentation of results in textual
formats. The integration of visualization and interaction tools into OLAP enhances
the human capability to analyze and understand multidimensional databases.
A novel interactive visual exploration tool has been introduced for analysis of OLAP
data cubes. The technique provides visual feedback while users explore data cubes
in graphical formats rather than textual table formats. The incorporation of both
visualization and the OLAP service helps users to understand their data deeply, gain
insight, and extract useful information from it faster. Hierarchical barsticks are
presented for exploring and visualizing hierarchical structures of the OLAP data
cubes. Users can view the trail of the exploration through visualization, which
reduces the memory load, and can also view one deeper level in advance before
drilling down into deeper levels. In addition, the technique provides users overviews
Figure 60: This figure shows an example of visualization, in panels (a) to (d), for exploring Promotion media, Store type, and Unit sales.
Figure 61: This figure shows an example of visualization for exploring alcoholic beverage sales of small groceries in Year 1997.
and refined views of interest in the data cubes. Users are allowed to change the
exploration views anytime through the combination of navigational functions and
interactive tools. VisOLAP is expected to be useful for interactive visual exploration
of data cubes.
The next chapter provides a discussion of the implications of the systems presented
in this thesis and emphasizes the contributions and the limitations of individual
chapters. Some future research directions are also discussed.
Chapter 6
Conclusion
6.1 Summary
I have investigated several visualization frameworks based on the dynamic query
mechanism in this thesis. The emphasis in designing all of these frameworks was on
simplicity, flexibility, giving the user all the controls for selection and exploration
and finally, reducing the overload of information and occlusion that is present in
other existing systems.
The setting and contributions of the thesis are presented in Chapter 1. Next, a
detailed overview of a diverse range of visualization techniques is given in
Chapter 2. Since the interest in this thesis is in designing dynamic and interactive
visualization frameworks, the design methodologies are placed within the dynamic
query framework. The dynamic query framework is explained in detail in the second
part of Chapter 2.
Chapter 3 presented a novel technique, VisEx, for visual exploration of multidimen-
sional datasets, in particular for exploring correlations among attributes in large
datasets. Most previous visualization systems can display the correlations among
a small number of attributes, usually two or three, whereas it is possible to explore
correlations among many attributes in VisEx. Moreover, the user does not need
to go through any prior training or have any prior knowledge of the underlying
database for using VisEx. The system provides visual feedback to the user during
interactive visualization sessions and the user can dynamically change the attributes
and their ranges for testing hypotheses. A user study of the system has also been
done for evaluating its simplicity and usefulness.
Chapter 4 presented a new technique called VisDM for integrating visualization into
association rule mining algorithms. Most algorithms for association rule mining
generate a large number of rules, not all of which are interesting. Hence, there is a
need for integrating human expertise in a mining algorithm so that an analyst can
mine interesting association rules. The user has complete freedom in choosing the
antecedents and consequents of rules in the VisDM system, and hence, it is possible
for the user to mine interesting rules instead of all rules that are above a threshold
of support and confidence. The simplicity of the VisDM system was demonstrated
through a user study.
A new framework called VisAR is also presented for visualizing a large number of
association rules in Chapter 4. Most previous systems display a large number of
rules in a single view and it is difficult for a user to concentrate on a subset of
interesting rules. Moreover, the display of a large number of rules usually results
in occlusion and screen clutter. VisAR integrates matrix-based and graph-based
techniques in a single framework to display a large number of association rules.
Moreover, VisAR gives the user complete freedom in choosing and visualizing the
rules that the user is interested in.
In Chapter 5 a novel interactive visualization technique called VisOLAP for OLAP
analysis tasks is presented. The visualization technique in VisOLAP has been
adapted from VisEx to explore and analyze the hierarchical structure of OLAP data
cubes and to relieve users of having to remember the exploration paths
of interest. Moreover, analysts can view exploration tracks and the distribution of
navigation results across the specified dimensions and levels.
All the four systems, VisEx, VisDM, VisAR and VisOLAP, developed in this thesis
are scalable with respect to the size of the datasets. VisEx can handle small as
well as very large datasets. Each bar in a barstick can represent one data record
when the dataset is very small and can represent an arbitrarily large number of
records for large datasets. Since VisEx is designed to give the user a quantitative
estimate of the correlations among attributes, I have not tried to represent each
data record individually. However, the user can get a quantitative estimate of the
number of records first by comparing the different color levels that are used for
coloring the bars and also by clicking on a bar or group of bars, through a dialogue
box. The representation of data through the two primitives barstick and bar has
helped me to incorporate this scalability in VisEx. Similarly, there is no limit on the
size of transactional databases that VisDM can handle. VisAR can handle a large
number of association rules compared to other systems due to its two dimensional
display. While three dimensional displays can give rise to occlusion, this is not a
problem in VisAR. However, occlusion is still a problem if the number of rules is
higher than a few hundred. One of the main features of VisAR is the ability of the
user to visualize selected association rules instead of seeing all the rules at once. I
expect that any user will utilize this facility more often as it allows users to focus on
specific association rules. Perhaps the least scalable of the systems is VisOLAP as it
is difficult to display a large number of dimensions at a time within a limited screen
space. On the other hand, it is not possible to represent a collection of dimensions
by a bar, as in VisEx, as the user may need to see individual dimensions for drilling
down. This aspect of VisOLAP needs to be explored further in the future.
The user studies reported in this thesis are only preliminary. It was not possible
to compare the systems with similar systems, for two main reasons. First,
it is difficult to get implementations of most of the other systems either because
they are commercial systems or because I could not get any response from the
authors. Second, I had only limited time and hence could not organize large scale
user studies. It is very important to conduct more extensive user studies of the
systems reported in this thesis with more participants and also with participants
130 CHAPTER 6. CONCLUSION
from a diverse range of backgrounds.
6.2 Future Work
Although the work presented in this thesis makes several contributions to the area
of information visualization, several research directions remain open. Some of the
possible research directions are mentioned below.
• An interesting research problem is to integrate visualization into other data
mining algorithms such as clustering and classification. For example, an
appropriate visualization for clustering may give analysts important feedback for
understanding the associations between clusters. In addition, visualization could
be integrated into other areas of knowledge discovery.
• Another interesting problem is the incorporation of visualization into other
automatic data mining algorithms through a tight coupling mechanism like
VisDM.
• It is also important to explore the visualization of OLAP data cubes in more
detail, as OLAP technology is becoming an integral part of most business
processes.
Bibliography
[1] R. Agrawal, T. Imielinski and A. Swami. Mining association rules between
sets of items in large databases. In Proceedings of the 1993 ACM SIGMOD
International Conference on Management of data, pages 207–216. ACM Press,
1993.
[2] R. Agrawal and R. Srikant. Fast algorithms for mining association rules in
large databases. In Proceedings of the 20th International Conference on Very
Large Data Bases, pages 487–499. Morgan Kaufmann Publishers Inc., 1994.
[3] C. Ahlberg and B. Shneiderman. Visual information seeking: tight coupling
of dynamic query filters with starfield displays. In Proceedings of the ACM
SIGCHI Conference on Human Factors in Computing Systems, pages 313–317.
ACM Press, 1994.
[4] C. Ahlberg, C. Williamson and B. Shneiderman. Dynamic queries for
information exploration: An implementation and evaluation. In Proceedings of the
ACM SIGCHI Conference on Human Factors in Computing Systems, pages
619–626. ACM Press, 1992.
[5] K. Andrews and H. Heidegger. Information slices: Visualising and exploring
large hierarchies using cascading, semi-circular discs. In IEEE Symposium on
Information Visualization (IEEE InfoVis’98), pages 9–12, October 1998.
[6] M. Ankerst, D. A. Keim and H. P. Kriegel. Circle segments: A technique for
visually exploring large multidimensional data sets. In Visualization ’96,
Hot Topic Session, 1996.
[7] W. Basalaj. Proximity visualisation of abstract data. In Technical Report 509.
University of Cambridge Computer Laboratory, 2001.
[8] J. Bertin. Semiology of Graphics. Madison, Wis.: University of Wisconsin,
1983.
[9] S. K. Card, J. D. MacKinlay and B. Shneiderman. Readings in Information
Visualization: Using Vision to Think. Elsevier Science & Technology Books,
January 1999.
[10] C. Beshers and S. Feiner. AutoVisual: rule-based design of interactive
multivariate visualizations. Computer Graphics and Applications, IEEE, Volume 13,
Number 4, pages 41–49, 1993.
[11] S. Chaudhuri and U. Dayal. An overview of data warehousing and OLAP
technology. SIGMOD Rec., Volume 26, Number 1, pages 65–74, 1997.
[12] H. Chernoff. The use of faces to represent points in k-dimensional space graph-
ically. Journal of the American Statistical Association, Volume 68, pages 361–
368, 1973.
[13] W. S. Cleveland. Visualizing data. Hobart Press Summit, 1993.
[14] Microsoft Corporation. Microsoft Excel: User’s Guide 2, Version 4.0. Redmond,
WA: Microsoft Corporation, 1992.
[15] M. C. Ferreira de Oliveira and H. Levkowitz. From visual data exploration
to visual data mining: a survey. IEEE Transactions on Visualization and
Computer Graphics, Volume 9, pages 378–394, 2003.
[16] T. A. DeFanti, M. D. Brown and B. H. McCormick. Visualization: expand-
ing scientific and engineering research opportunities. Computer, Volume 22,
Number 8, pages 12–16,22–5, August 1989.
[17] S. G. Eick. Visualizing multi-dimensional data. SIGGRAPH Computer Graph-
ics, Volume 34, Number 1, pages 61–67, 2000.
[18] U. Fayyad, G. Piatetsky-Shapiro and P. Smyth. From data mining to knowledge
discovery in databases. AI Magazine, Volume 17, Number 3, pages 37–54, 1996.
[19] S. K. Feiner and C. Beshers. Worlds within worlds: metaphors for exploring n-
dimensional virtual worlds. In Proceedings of the 3rd Annual ACM SIGGRAPH
Symposium on User Interface Software and Technology, pages 76–83. ACM
Press, 1990.
[20] J. D. Foley, A. V. Dam, S. K. Feiner and J. F. Hughes. Computer Graphics:
Principles and Practice Second edition in C. Addison Wesley, 1997.
[21] J. Han and M. Kamber. Data Mining Concepts and Techniques. Morgan
Kaufmann, 2001.
[22] J. Han, J. Pei and Y. Yin. Mining frequent patterns without candidate gener-
ation. In Proceedings of the 2000 ACM SIGMOD International Conference on
Management of data, pages 1–12. ACM Press, 2000.
[23] D. Harrison and D. L. Rubinfeld. Hedonic prices and the demand for clean air.
J. Environ. Economics & Management, Volume 5, pages 81–102, 1978.
[24] C. G. Healey, K. Booth and J. T. Enns. Harnessing preattentive processes for
multivariate data visualization. In Proceedings Graphics Interface ’93, pages
107–117, 1993.
[25] C. G. Healey and J. T. Enns. Large datasets at a glance: Combining textures
and colors in scientific visualization. IEEE Transactions on Visualization and
Computer Graphics, Volume 5, Number 2, pages 145–167, 1999.
[26] S. Hettich, C. L. Blake and C. J. Merz. UCI repository of machine learning
databases, 1998.
[27] H. Hofmann, A. P. J. M. Siebes and A. F. X. Wilhelm. Visualizing associ-
ation rules with interactive mosaic plots. In Proceedings of the sixth ACM
SIGKDD International Conference on Knowledge Discovery and Data mining,
pages 227–235. ACM Press, 2000.
[28] M. A. W. Houtsma and A. N. Swami. Set-oriented mining for association rules
in relational databases. In Proceedings of the Eleventh International Conference
on Data Engineering, pages 25–33. IEEE Computer Society, 1995.
[29] Microsoft http://msdn.microsoft.com.
[30] http://web.cs.wpi.edu/~matt/courses/cs563/talks/perception.html.
[31] http://www.contourcomponents.com/.
[32] SAS Institute Inc. http://www.sas.com/technologies/analytics/datamining/miner/.
[33] SGI http://www.sgi.com/software/mineset.html.
[34] A. Inselberg and B. Dimsdale. Parallel coordinates for visualizing multidimen-
sional geometry. In CG International ’87 on Computer graphics 1987, pages
25–44, New York, NY, USA, 1987. Springer-Verlag New York, Inc.
[35] B. Johnson and B. Shneiderman. Tree-maps: a space-filling approach to the
visualization of hierarchical information structures. In Proceedings of the 2nd
International IEEE Visualization Conference, pages 284–291. IEEE Computer
Society, October 1991.
[36] E. Kandogan. Visualizing multi-dimensional clusters, trends, and outliers using
star coordinates. In Proceedings of the seventh ACM SIGKDD International
Conference on Knowledge Discovery and Data mining, pages 107–116. ACM
Press, 2001.
[37] D. A. Keim. Databases and visualization. In SIGMOD ’96: Proceedings of the
1996 ACM SIGMOD International Conference on Management of data, page
543, New York, NY, USA, 1996. ACM Press.
[38] D. A. Keim. Designing pixel-oriented visualization techniques: theory and
applications. Visualization and Computer Graphics, IEEE Transactions on,
Volume 6, pages 59–78, 2000.
[39] D. A. Keim, M. Hao, U. Dayal, M. Hsu and J. Ladisch. Pixel bar charts: A
new technique for visualizing large multi-attribute data sets without aggre-
gation. In Proceedings of the IEEE Symposium on Information Visualization
2001 (INFOVIS’01), page 113. IEEE Computer Society, 2001.
[40] D. A. Keim, M. C. Hao, and U. Dayal. Hierarchical pixel bar charts. IEEE
Transactions on Visualization and Computer Graphics, Volume 8, pages 255–
269, 2002.
[41] D. A. Keim and H. P. Kriegel. VisDB: database exploration using multidimen-
sional visualization. Computer Graphics and Applications, IEEE, Volume 14,
pages 40–49, 1994.
[42] J. Lamping, R. Rao, and P. Pirolli. A focus + context technique based on
hyperbolic geometry for visualizing large hierarchies. In CHI ’95, ACM Con-
ference on Human Factors in Computing Systems, pages 401–408. ACM Press,
1995.
[43] T. Lanning, K. Wittenburg, M. Heinrichs, C. Fyock and G. Li. Multidimen-
sional information visualization through sliding rods. In Proceedings of Ad-
vanced Visual Interfaces - AVI 2000, pages 173–180. ACM Press, 2000.
[44] H. Levkowitz. Perceptual steps along color scales. International Journal of
Imaging Systems and Technology, Volume 7, pages 97–101, 1996.
[45] H. Levkowitz and G. T. Herman. Color scales for image data. Computer
Graphics and Applications, IEEE, Volume 12, Number 1, pages 72–80, 1992.
[46] B. Liu, W. Hsu and Y. Ma. Integrating classification and association rule
mining. In Proceedings of the Fourth International Conference on Knowledge
Discovery and Data Mining, pages 80–86, 1998.
[47] A. S. Maniatis, P. Vassiliadis, S. Skiadopoulos and Y. Vassiliou. Advanced
visualization for OLAP. In Proceedings of the 6th ACM International Workshop
on Data Warehousing and OLAP, pages 9–16. ACM Press, 2003.
[48] D. Marghescu and M. J. Rajanen. Assessing the use of SOM technique in data
mining. In Proceedings of the 23rd IASTED International Multi-Conference
Databases and Applications, pages 181–186, February 2005.
[49] B. H. McCormick, T.A. DeFanti and M.D. Brown (ed). Visualization in scien-
tific computing. Computer Graphics, Volume 21, Number 6, November 1987.
[50] T. Mihalisin, E. Gawlinski, J. Timlin and J. Schwegler. Visualizing a scalar field
on an n-dimensional lattice. In A. Kaufman (editor), Proceedings of the First
IEEE Conference on Visualization’90, pages 255–262 and 479–480. Practical,
1990.
[51] T. Mihalisin, J. Timlin and J. Schwegler. Visualization and analysis of
multivariate data: a technique for all fields. In G. M. Nielson and L. Rosenblum
(editors), Proceedings of the IEEE Conference on Visualization’91, pages 171–
178 and 421. Practical, 1991.
[52] T. Mihalisin, J. Timlin and J. Schwegler. Visualizing multivariate functions,
data, and distributions. Computer Graphics and Applications, IEEE, Vol-
ume 11, pages 28–35, 1991.
[53] C. Nolan. Manipulate and query OLAP data using ADOMD and multidimensional
expression. In Microsoft Systems Journal. Microsoft, August 1999.
[54] P. O’Donnell and N. Draper. An experimental evaluation of an alternative to
the pivot table for ad hoc access to OLAP data. In Proceedings of the 2004 IFIP
International Conference on Decision Support Systems (DSS2004): Decision
Support in an Uncertain and Complex World, July 2004.
[55] K. H. Ong, K. L. Ong, W. K. Ng and E. P. Lim. CrystalClear: Active
visualization of association rules. In International Workshop on Active Mining
(AM-2002), in conjunction with IEEE International Conference on Data Mining,
December 2002.
[56] G. Piatetsky-Shapiro. Discovery, analysis and presentation of strong rules. In
Knowledge Discovery in Databases, pages 229–248, 1991.
[57] R. M. Pickett and G. G. Grinstein. Iconographic displays for visualizing
multidimensional data. In Proceedings of the 1988 IEEE International Conference
on Systems, Man, and Cybernetics, Volume 1, pages 514–519, 1988.
[58] R. Rao and S. K. Card. The table lens: Merging graphical and symbolic rep-
resentations in an interactive focus+context visualization for tabular informa-
tion. In Proceedings of the ACM Conference on Human Factors in Computing
Systems, CHI. ACM, 1994.
[59] G. G. Robertson, S. K. Card and J. D. Mackinlay. Information visualization
using 3D interactive animation. Commun. ACM, Volume 36, Number 4, pages
57–71, 1993.
[60] A. Savasere, E. Omiecinski and S. B. Navathe. An efficient algorithm for mining
association rules in large databases. In Proceedings of the 21st International
Conference on Very Large Data Bases, pages 432–444. Morgan Kaufmann Pub-
lishers Inc., 1995.
[61] B. Shneiderman. Dynamic queries for visual information seeking. IEEE
Software, Volume 11, Number 6, pages 70–77, 1994.
[62] J. H. Seigel, E. J. Farrell, R. M. Goldwyn and H. P. Friedman. The surgical
implication of physiologic patterns in myocardial infarction shock. Surgery,
Volume 72, pages 126–141, 1972.
[63] M. Sifer. A visual interface technique for exploring OLAP data with coordinated
dimension hierarchies. In Proceedings of the Twelfth International Conference
on Information and Knowledge Management, pages 532–535. ACM Press, 2003.
[64] A. Silberschatz and A. Tuzhilin. What makes patterns interesting in knowledge
discovery systems. IEEE Transactions on Knowledge and Data Engineering,
Volume 8, Number 6, pages 970–974, 1996.
[65] R. Spence. Sensitivity encoding to support information space navigation : a
design guideline. Information Visualization, Volume 1, pages 120–129, 2002.
[66] R. Spence and L. Tweedie. The attribute explorer : information synthesis via
exploration. Interacting with Computers, Volume 11, pages 137–146, 1998.
[67] M. Spenke and C. Beilken. Visual, interactive data mining with InfoZoom - the
financial dataset. In 3rd European Conference on Principles and Practice of
Knowledge Discovery in Databases, pages 15–18, 1999.
[68] M. Spenke, C. Beilken and T. Berlage. FOCUS: The interactive table for
product comparison and selection. In ACM Symposium on User Interface
Software and Technology, pages 41–50, 1996.
[69] R. Srikant and R. Agrawal. Mining generalized association rules. In VLDB ’95:
Proceedings of the 21st International Conference on Very Large Data Bases,
pages 407–419. Morgan Kaufmann Publishers Inc., 1995.
[70] J. Stasko. An evaluation of space-filling information visualizations for
depicting hierarchical structures. International Journal of Human-Computer Studies,
Volume 53, Number 5, pages 663–694, 2000.
[71] StatLib-Datasets Archive, http://lib.stat.cmu.edu/datasets. Carnegie Mellon
University, 2004.
[72] C. Stolte, D. Tang and P. Hanrahan. Polaris: a system for query, analysis, and
visualization of multidimensional relational databases. IEEE Transactions on
Visualization and Computer Graphics, Volume 8, Number 1, pages 52–65, 2002.
[73] C. Stolte, D. Tang and P. Hanrahan. Query, analysis, and visualization of
hierarchically structured data using Polaris. In Proceedings of the Sixth ACM
SIGKDD International Conference on Knowledge Discovery and Data mining,
pages 112–122. ACM Press, 2002.
[74] E. Thomsen. OLAP Solutions Building Multidimensional Information Systems.
Wiley Computer Publishing, 1997.
[75] L. Tweedie, B. Spence, D. Williams and R. Bhogal. The attribute explorer. In
Proceedings of the ACM SIGCHI Conference on Human Factors in Computing
Systems (Conference Companion), pages 435–436. ACM Press, 1994.
[76] R. Vieira. Professional SQL Server 7 Programming. Wrox Press, 1999.
[77] K. Wang, Y. Jiang and L. V. S. Lakshmanan. Mining unexpected rules by push-
ing user dynamics. In KDD ’03: Proceedings of the Ninth ACM SIGKDD Inter-
national Conference on Knowledge Discovery and Data mining, pages 246–255.
ACM Press, 2003.
[78] M. O. Ward. XmdvTool: integrating multiple methods for visualizing
multivariate data. In Proceedings of the Conference on Visualization ’94, pages
326–333. IEEE Computer Society Press, 1994.
[79] C. Williamson and B. Shneiderman. The dynamic homefinder: evaluating
dynamic queries in a real-estate information exploration system. In Proceedings of
the 15th Annual International ACM Conference on Research and Development
of Information Retrieval, pages 338–346. ACM Press, 1992.
[80] K. Wittenburg, T. Lanning, M. Heinrichs and M. Stanton. Parallel bargrams
for consumer-based information exploration and choice. In Proceedings of
the 14th Annual ACM Symposium on User Interface Software and Technol-
ogy (UIST ’01), pages 51–60. ACM Press, 2001.
[81] P. C. Wong, P. Whitney and J. Thomas. Visualizing association rules for
text mining. In INFOVIS ’99: Proceedings of the 1999 IEEE Symposium on
Information Visualization, pages 120–123, Washington, DC, USA, 1999. IEEE
Computer Society.
[82] K. Zhao and B. Liu. Visual analysis of the behavior of discovered rules. In
Workshop Notes in ACM SIGKDD-2001 Workshop on Visual Data Mining,
August 2001.
Appendix A
This appendix contains documents for the user study of VisEx in Chapter 3. The
documents include tutorial and experimental tasks, and a questionnaire.
A.1 Tasks
A.1.1 Tutorial Tasks
1. Is it true that cars with low mpg and acceleration have higher weight
and horsepower?
2. How many cylinders do most of the Japanese cars have? Do they have high or low
displacement? (You can use the equal-height histogram to look at the distribution
of each value by double-clicking the left mouse button on the blue areas of each bar.)
3. Which country produces 3-cylinder cars and which country produces 8-cylinder
cars?
4. Are Japanese 6-cylinder cars generally heavier than European 6-cylinder cars?
5. Did European and Japanese companies only produce 4-cylinder cars in 1982?
A.1.2 Experimental Tasks
For Task 9 and Task 10, please use the fixed attribute property located in Type
Attribute Selection (on the left panel) for exploration.
1. Is it true that highly educated people (with 15-18 years of schooling) tend to
work in professional and managerial occupations?
2. Is it true that highly educated clerks (with 15-18 years of schooling) tend to
have higher wages?
3. Are more males working in clerical jobs compared to females?
4. Which group earns higher wages in managerial jobs, males or females?
5. Do older people (above 60) with a lot of experience tend to have higher edu-
cation (more than 14 years of schooling)?
6. Do males tend to have higher education than females in managerial occupa-
tions?
7. Which occupation and sex does a person have when he/she earns the highest
wage and has less experience?
8. Is there a highly educated female (aged above 60) who earns a high
wage (at least 20 dollars per hour) and works in the Manufacturing sector?
9. Do more highly educated people (with 15-18 years of schooling) earning high wages
(20-45 dollars per hour) tend to live in the South? Does the more highly educated
person who earns the highest wage in the group live in the South?
10. What are the age and years of schooling of the male who
works in the service occupation and earns the highest wage?
A.2 Questionnaire
Part I : Please provide your information
1. Do you have experience in data analysis?
2. Do you have experience in using any visualization tool?
Part II : Please provide ranking of your satisfaction
Scale: 1 = Strongly disagree, 2 = Disagree, 3 = Fair, 4 = Agree, 5 = Strongly agree
Usability
• Easy to complete the tasks 1 2 3 4 5
• Easy to learn tool 1 2 3 4 5
• Easy to use tool 1 2 3 4 5
• Easy to understand visualization 1 2 3 4 5
Quality of visualization
• Clarity of visual representation 1 2 3 4 5
• I was able to understand displayed parameters 1 2 3 4 5
• I was able to identify the correlation among specified attributes 1 2 3 4 5
• I was able to compare the groups of objects in the specified range of attributes 1 2 3 4 5
• I was able to identify the specific groups of data objects 1 2 3 4 5
• I was able to use and understand equal height histogram 1 2 3 4 5
• I was able to identify the difference between Fixed attributes and Normal (Non-fixed) attributes in Type Attribute Selection 1 2 3 4 5
Which one do you prefer for exploring data sets? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Learning (Quality of interaction)
• Easy to direct the search for data of interest (navigation) 1 2 3 4 5
• I was able to specify parameters 1 2 3 4 5
• I was able to correct my mistakes 1 2 3 4 5
• I was able to change the selection of parameters 1 2 3 4 5
• I was able to explore data 1 2 3 4 5
• I was able to use interactive features (through mouse click) 1 2 3 4 5
• I was able to use and change features (including Normal, Comparison, Fixed attributes) 1 2 3 4 5
Quality of information
• Reliable 1 2 3 4 5
• Interesting 1 2 3 4 5
• Clear and understandable 1 2 3 4 5
• Easy to interpret results 1 2 3 4 5
Part III : Please provide comments.
Please provide your comments about the system.
Appendix B
This appendix contains documents for the user study of VisDM in Chapter 4. The
documents include tasks from different data sets and a questionnaire.
B.1 Tasks
B.1.1 Tasks from Dataset1
Suppose you have recently been appointed as a marketing manager for a supermarket. You
have been informed that the sales volume this month has dropped. You would
like to identify which items have low unit sales and which items could sell
better if they were promoted or placed close together, so that
customers buying one of these items may buy the other items as
well.
You have four main tasks to complete. Please set minimum support = 10.
1. Identify (name) the first two maximum and two minimum items sold according
to their sales volume.
2. Identify four items which are purchased together most of the time and
provide an association rule which satisfies the provided support and confidence
thresholds. (Assume that support > 70 and confidence > 70 indicate a
possible association among items purchased together.)
3. Identify three association rules involving item numbers 17, 42, and 11,
including the support and confidence of each rule. Which rule do you think is the
strongest?
4. Do you think it is possible that item numbers 31 and 29 are frequently
purchased together? (Assume that support > 70 and confidence > 70 indicate a
possible association among items purchased together.)
B.1.2 Tasks from Dataset2
You have four main tasks to complete. Please set minimum support = 10.
1. Identify the first two maximum and two minimum items according to their
sales volume.
2. Identify the two items which are purchased together the maximum number of
times and the association rule involving these items that satisfies the
support and confidence thresholds provided. (Assume that support > 50 and
confidence > 70 indicate a possible association among items purchased together.)
3. Identify three association rules involving the items tomato, pacifier, and rice,
including the support and confidence of each rule.
4. Do you think rice and tomato are frequently purchased together? (Assume
that support > 50 and confidence > 60 indicate a possible association among
items purchased together.)
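For readers unfamiliar with the two measures used in the tasks above, the following is a minimal sketch of how the support and confidence of a rule can be computed from a list of transactions. It assumes that both measures are expressed as percentages, which the thresholds in the tasks suggest but do not state. The function names and the toy baskets are hypothetical and are not taken from VisDM or from the study datasets.

```python
from typing import Iterable, List, Set


def support(transactions: List[Set[str]], itemset: Iterable[str]) -> float:
    """Percentage of transactions that contain every item in `itemset`."""
    items = set(itemset)
    if not transactions:
        return 0.0
    hits = sum(1 for basket in transactions if items <= basket)
    return 100.0 * hits / len(transactions)


def confidence(transactions: List[Set[str]],
               antecedent: Iterable[str],
               consequent: Iterable[str]) -> float:
    """Percentage of transactions containing `antecedent` that also
    contain `consequent`, i.e. support(A and C) / support(A)."""
    base = support(transactions, antecedent)
    if base == 0.0:
        return 0.0
    both = support(transactions, set(antecedent) | set(consequent))
    return 100.0 * both / base


# Hypothetical toy baskets, invented purely for illustration.
baskets = [
    {"rice", "tomato"},
    {"rice", "tomato", "pacifier"},
    {"rice"},
    {"tomato", "rice"},
]

s = support(baskets, {"rice", "tomato"})       # 3 of 4 baskets
c = confidence(baskets, {"rice"}, {"tomato"})  # 3 of the 4 rice baskets
print(f"rice -> tomato: support = {s:.1f}, confidence = {c:.1f}")

# Check the rule against thresholds such as those given in Task 4.
print("possible association:", s > 50 and c > 60)
```

Under these assumptions, a rule counts as a possible association only when both values exceed their thresholds, which is the check participants are asked to make visually.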
B.2 Questionnaire
Part I : Please provide your information
1. Do you have experience in data analysis?
2. Do you have experience in using any visualization tool?
Part II : Please provide ranking of your satisfaction
Scale: 1 = Strongly disagree, 2 = Disagree, 3 = Fair, 4 = Agree, 5 = Strongly agree
Usability
• Easy to complete the tasks 1 2 3 4 5
• Easy to learn tool 1 2 3 4 5
• Easy to use tool 1 2 3 4 5
Quality of visualization
• Clarity of visual representation 1 2 3 4 5
• I was able to understand parameters 1 2 3 4 5
• I was able to identify the maximum percentage of items purchased together (co-existing items) 1 2 3 4 5
• I was able to identify the minimum percentage of items purchased together (co-existing items) 1 2 3 4 5
• I was able to find the item that is bought most often 1 2 3 4 5
• I was able to find the item that is bought least often 1 2 3 4 5
Learning (Quality of interaction)
• Easy to direct the search for data of interest (navigation) 1 2 3 4 5
• I was able to use parameters 1 2 3 4 5
• I was able to correct my mistakes 1 2 3 4 5
• I was able to change the selection of parameters 1 2 3 4 5
• I was able to explore data 1 2 3 4 5
Quality of information
• Reliable 1 2 3 4 5
• Interesting 1 2 3 4 5
• Clear and understandable 1 2 3 4 5
• Easy to interpret results 1 2 3 4 5
Part III : Please provide comments.
Please provide your comments about the system.