

A Visualization Framework for Exploring Correlations among Attributes of a Large

Dataset and Its Applications in Data Mining

This thesis is

presented to the

School of Computer Science & Software Engineering

for the degree of

Doctor of Philosophy

of

The University of Western Australia

By

Kesaraporn Techapichetvanich

2005


© Copyright 2005

by

Kesaraporn Techapichetvanich


Abstract

Many databases in scientific and business applications have grown exponentially in size in recent years. Accessing and using databases is no longer a specialized activity, as more and more ordinary users without specialized knowledge try to gain information from them. Both expert and ordinary users face significant challenges in understanding the information stored in databases. In most cases the databases are so large that it is impossible to gain useful information by inspecting data tables, which are the most common form of storing data in relational databases. Visualization has emerged as one of the most important techniques for exploring data stored in large databases. Appropriate visualization techniques can reveal trends, correlations and associations in data that are very difficult to understand from a textual representation of the data. This thesis presents several new frameworks for data visualization and visual data mining.

The first technique, VisEx, supports visual exploration of large multi-attribute datasets, and especially exploration of the correlations among the attributes in such datasets. Most previous visualization techniques can display correlations among only two or three attributes at a time without excessive screen clutter. Although many data exploration tasks require examining correlations among four or more attributes, previous visualization tools support this only indirectly. The technique developed in this thesis allows the user to explore correlations among any number of attributes seamlessly. The technique is also completely scalable in the sense that it can handle small as well as very large datasets.


Many organizations are increasingly using data mining tools to discover important associations in data stored in large data warehouses. Although many algorithms for mining association rules have been researched extensively, they do not involve users in the process, and most of them generate a large number of association rules. It is often difficult for the user to analyze so many rules and identify the small subset that is of importance. In this thesis I present a framework that lets the user mine association rules interactively and visually.

Another challenging task in data mining is to understand the correlations among the mined association rules. It is often difficult to identify a relevant subset of association rules from a large number of mined rules. A further contribution of this thesis is a simple framework in the VisAR system that allows the user to explore a large number of association rules visually.

A variety of businesses have adopted new technologies for storing large amounts of data. Analysis of historical data often offers new insights into business processes that may increase productivity and profit. On-line analytical processing (OLAP) has become a powerful tool for business analysts to explore historical data, and effective visualization techniques are very important for supporting OLAP technology. A new technique for the visual exploration of OLAP data cubes is also presented in this thesis.


Preface

Much of the work presented in this thesis has been published as follows. The first two papers relate to the material in Chapter 3, the third to fifth papers to Chapter 4, and the last paper to Chapter 5.

• K. Techapichetvanich, A. Datta and R. Owens. HDDV: Hierarchical dynamic dimensional visualization. In Proceedings of IASTED International Conference on Databases and Applications, pages 157-162, 2004.

• K. Techapichetvanich and A. Datta. VisEx: A visualization framework for exploring correlations among attributes in large multidimensional datasets. Information Visualization, under review.

• K. Techapichetvanich and A. Datta. Visual mining of market basket association rules. In Proceedings of ICCSA 2004: International Conference on Computational Science and Its Applications, Volume 3046 of Lecture Notes in Computer Science, pages 479-488. Springer, 2004.

• K. Techapichetvanich and A. Datta. VisAR: A new technique for visualizing mined association rules. In Proceedings of the First International Conference on Advanced Data Mining and Applications (ADMA 2005), Volume 3584 of Lecture Notes in Computer Science, pages 88-95. Springer, 2005.

• K. Techapichetvanich and A. Datta. Visual data mining for discovering association rules. In K. E. Voges and N. K. Ll. Pope (editors), Business Applications and Computational Intelligence, Chapter 11. Idea Group Publishing, 2005.

• K. Techapichetvanich and A. Datta. Interactive visualization for OLAP. In Proceedings of ICCSA 2005: International Conference on Computational Science and Its Applications, Volume 3482 of Lecture Notes in Computer Science, pages 206-215. Springer, 2005.

Although this thesis and the published papers largely overlap, the structure and details of the individual systems are described more thoroughly in this thesis. The author of this thesis is responsible for the originality of the presented research and is also the primary author of each of these publications.


Acknowledgements

First and foremost, I would like to thank Associate Professor Amitava Datta, who has been my supervisor for the last two years of my candidature. Over that time he has provided invaluable motivation, inspiration, and guidance, and I am glad to have had him as a supervisor. Many thanks are also extended to Professor Robyn Owens for her help.

During the first period of my candidature, I also benefitted from the tremendous support of Dr. Sato Juniper, and from Margaret Jones, who provided English support through both teaching and proofreading.

General thanks go to all staff at the School of Computer Science & Software Engineering at the University of Western Australia. Specific thanks also go to Dr. Nick Spadaccini, the head of school during the last two years of my candidature. I have also had the good fortune to be surrounded by a great number of postgraduates both in the school and outside.

Last, but not least, I would like to offer my special thanks to my family for their support and encouragement. Thanks also go to the Pocathikorn family for their support and for caring for me as one of the family.


Contents

Abstract
Preface
Acknowledgements

1 Introduction
1.1 Contributions of the thesis
1.1.1 Visual exploration of large multidimensional datasets
1.1.2 Visual data mining and visualization of association rules
1.1.3 Interactive visualization for OLAP
1.2 Structure of the thesis

2 Previous Work
2.1 Information Visualization Techniques
2.1.1 Geometric techniques
2.1.2 Iconographic techniques
2.1.3 Hierarchical techniques
2.1.4 Pixel-based techniques
2.1.5 Table-based techniques
2.2 The dynamic query framework
2.3 Data Mining
2.3.1 Association rules
2.4 Visualization for OLAP
2.5 Summary

3 A New Technique for Visual Exploration of Large Datasets
3.1 Introduction
3.2 Terminology
3.3 VisEx system design
3.4 VisEx system architecture and implementation
3.4.1 Connection and Transformation in VisEx
3.4.2 Visualizing multiple attributes in VisEx
3.4.3 Querying in VisEx
3.4.4 User interaction
3.5 Analysis scenarios
3.5.1 Analysis 1: 1990 U.S. Census Data
3.5.2 Analysis 2: 1985 The Current Population Survey
3.6 User study
3.6.1 Experimental methodology
3.6.2 Results
3.7 Summary

4 Visualization for Association Rule Mining
4.1 Introduction
4.2 Terminology
4.3 The model for interactive association rule mining
4.3.1 Identifying Frequent Itemsets
4.3.2 Selecting Interesting Association Rules
4.3.3 Visualizing Association Rules
4.4 Data Structure used in VisDM
4.5 A user study of VisDM
4.5.1 Experimental methodology
4.5.2 Results
4.6 Visualization of many association rules
4.6.1 The VisAR system
4.6.2 The advantages of VisAR
4.7 Summary

5 Interactive Visualization for On-line Analytical Processing
5.1 Introduction
5.2 Terminology
5.3 VisOLAP system architecture and implementation
5.3.1 System connection
5.3.2 Visualizing OLAP data cubes
5.4 Interaction
5.5 Visual Exploration and MDX query
5.6 Analysis
5.7 Summary

6 Conclusion
6.1 Summary
6.2 Future Work

Bibliography

Appendices
A
A.1 Tasks
A.1.1 Tutorial Tasks
A.1.2 Experimental Tasks
A.2 Questionnaire
B
B.1 Tasks
B.1.1 Tasks from Dataset1
B.1.2 Tasks from Dataset2
B.2 Questionnaire

List of Figures

1 The KDD process overview.
2 Scatterplot matrix visualization.
3 Parallel coordinates visualization.
4 Star coordinates visualization.
5 Chernoff-face visualization.
6 Star glyphs visualization.
7 Stick figure visualization.
8 Worlds within worlds visualization.
9 Hierarchical axis visualization.
10 Hyperbolic browser visualization.
11 Cone trees visualization.
12 An example of tree-maps.
13 An example of information slices.
14 Spiral and axes query dependent visualization.
15 Circle segment visualization.
16 Table lens visualization.
17 An example of candidate and frequent itemsets.
18 An example of visualizing association rules for text mining.
19 An example of visualizing association rules with Mosaic plots.
20 An Anchored Measures approach of ADVIZOR.
21 An example of barstick visualization in VisEx.
22 VisEx system architecture.
23 A screenshot of the user interface with four barsticks queried in VisEx.
24 An example of fixed mode exploration.
25 An example result of five queried attributes.
26 Display of the relationship of six queried attributes by Comparison techniques.
27 An example result from Exploration techniques.
28 An example result of four queried attributes.
29 An example result of Selection techniques in barsticks.
30 Display of the relationship of four queried attributes with equal-height bar chart.
31 An example of an analysis scenario with four selected attributes.
32 An example analysis of three selected attributes with the comparison of the sex attribute.
33 An example analysis shows the relationships of five selected attributes including Total personal incomes, Years of schooling, Occupations, Class of worker, and Industry.
34 An example analysis shows the relationships of five selected attributes: Total personal incomes, Occupations, Age, Retirement Income, and Social Security Income.
35 A comparison of five selected attributes including Occupation, Sex, Education, Race, and Wage.
36 An example analysis with four selected attributes: Education, Experience, Age, and Wage.
37 The mean time for completing each task.
38 The correctness of each task.
39 The results from questionnaires in different categories: (a) Usability, (b) Visualization, (c) Interaction, (d) Information.
40 A model of the technique for mining association rules.
41 A screenshot and user interface of identifying frequent itemsets.
42 A screenshot and user interface of selecting interesting association rules.
43 A screenshot and user interface of visualizing association rules.
44 The results from questionnaires in different categories: (a) Usability, (b) Visualization, (c) Interaction, (d) Information.
45 (a) The mean time of completing each task. (b) The correctness of each task in each dataset.
46 A diagram of the system for visualizing mined association rules.
47 A user interface of visualization for mined association rules.
48 Visualization of association rules with AND operation and support sorting.
49 Visualization of association rules with AND operation and confidence sorting.
50 Visualization of association rules from the selected items of interest in Figure 47 but sorted according to confidence values.
51 A diagram of ADOMD object model [29].
52 VisOLAP system architecture.
53 A framework of visualizing OLAP data cubes in VisOLAP.
54 A user interface of VisOLAP.
55 A framework of the Drill down function in VisOLAP.
56 A framework of the Slice function in VisOLAP.
57 Examples of OLAP functionalities including drilling down, rolling up, and slicing on multidimensional data.
58 Visualization of the exploration in Product Family, Store Type, Year, and Quarter.
59 Visualization of the drill down operation into Product Department on Product Family.
60 Visualization for exploring Promotion media, Store type, and Unit sales.
61 An example of visualization for exploring alcoholic beverage sales of small groceries in Year 1997.

xviii

Page 19: A Visualization Framework for Exploring …...A Visualization Framework for Exploring Correlations among Attributes of a Large Dataset and Its Applications in Data Mining This thesis

A Visualization Framework for ExploringCorrelations among Attributes of a Large

Dataset and Its Applications in DataMining

This thesis is

presented to the

School of Computer Science & Software Engineering

for the degree of

Doctor of Philosophy

of

The University of Western Australia

By

Kesaraporn Techapichetvanich

2005

Page 20: A Visualization Framework for Exploring …...A Visualization Framework for Exploring Correlations among Attributes of a Large Dataset and Its Applications in Data Mining This thesis
Page 21: A Visualization Framework for Exploring …...A Visualization Framework for Exploring Correlations among Attributes of a Large Dataset and Its Applications in Data Mining This thesis

c© Copyright 2005

by

Kesaraporn Techapichetvanich

iii

Page 22: A Visualization Framework for Exploring …...A Visualization Framework for Exploring Correlations among Attributes of a Large Dataset and Its Applications in Data Mining This thesis

iv

Page 23: A Visualization Framework for Exploring …...A Visualization Framework for Exploring Correlations among Attributes of a Large Dataset and Its Applications in Data Mining This thesis

Abstract

Many databases in scientific and business applications have grown exponentially

in size in recent years. Accessing and using databases is no longer a specialized

activity as more and more ordinary users without any specialized knowledge are

trying to gain information from databases. Both expert and ordinary users face

significant challenges in understanding the information stored in databases. The

databases are so large in most cases that it is impossible to gain useful informa-

tion by inspecting data tables, which are the most common form of storing data

in relational databases. Visualization has emerged as one of the most important

techniques for exploring data stored in large databases. Appropriate visualization

techniques can reveal trends, correlations and associations in data that are very dif-

ficult to understand from a textual representation of the data. This thesis presents

several new frameworks for data visualization and visual data mining.

The first technique, VisEx, is useful for visual exploration of large multi-attribute

datasets and especially for exploring the correlations among the attributes in such

datasets. Most previous visualization techniques can display correlations among two

or three attributes at a time without excessive screen clutter. Though many data

exploration tasks require examining correlations among four or more attributes,

this can be done only indirectly using previous visualization tools. However, the

technique developed in this thesis allows the user to explore correlations among any

number of attributes seamlessly. This technique is also completely scalable in the

sense that it can handle small as well as very large datasets.

v

Page 24: A Visualization Framework for Exploring …...A Visualization Framework for Exploring Correlations among Attributes of a Large Dataset and Its Applications in Data Mining This thesis

Many organizations are increasingly using data mining tools to discover important

associations in data stored in large data warehouses. Although many algorithms for

mining association rules have been researched extensively, they do not incorporate

users in the process and most of them generate a large number of association rules.

It is quite often difficult for the user to analyze a large number of rules to identify

a small subset of rules that is of importance to the user. In this thesis I present a

framework for the user to interactively mine association rules visually.

Another challenging task in data mining is to understand the correlations among

the mined association rules. It is often difficult to identify a relevant subset of

association rules from a large number of mined rules. A further contribution of this

thesis is a simple framework in the VisAR system that allows the user to explore a

large number of association rules visually.

A variety of businesses have adopted new technologies for storing large amounts

of data. Analysis of historical data quite often offers new insights into business

processes that may increase productivity and profit. On-line analytical process-

ing (OLAP) has become a powerful tool for business analysts to explore historical

data. Effective visualization techniques are very important for supporting OLAP

technology. A new technique for the visual exploration of OLAP data cubes is also

presented in this thesis.

vi

Page 25: A Visualization Framework for Exploring …...A Visualization Framework for Exploring Correlations among Attributes of a Large Dataset and Its Applications in Data Mining This thesis

Preface

Much of the work presented in this thesis has been published as follows. The first

two papers are related to the material in Chapter 3. The third to fifth papers are

related to Chapter 4 and the last paper is related to the material in Chapter 5.

• K. Techapichetvanich, A. Datta and R. Owens. HDDV: Hierarchical dynamic

dimensional visualization. In Proceedings of IASTED International Confer-

ence on Databases and Applications, pages 157-162, 2004.

• K. Techapichetvanich and A. Datta. VisEx: A visualization framework for

exploring correlations among attributes in large multidimensional datasets,

Information Visualization, under review.

• K. Techapichetvanich and A. Datta. Visual mining of market basket asso-

ciation rules, In Proceedings of ICCSA 2004: International Conference on

Computational Science and Its Applications, Volume 3046 of Lecture Notes in

Computer Science, pages 479-488. Springer, 2004.

• K. Techapichetvanich and A. Datta. VisAR: A new technique for visualizing

mined association rules, In Proceedings of the First International Conference

on Advanced Data Mining and Applications (ADMA 2005), Volume 3584 of

Lecture Notes in Computer Science, pages 88-95. Springer, 2005.

• K. Techapichetvanich and A. Datta. Visual data mining for discovering asso-

ciation rules, In K. E. Voges and N. K. Ll.Pope (editors), Business Application

and Computational Intelligence, Chapter 11. Idea Group Publishing, 2005.

vii

Page 26: A Visualization Framework for Exploring …...A Visualization Framework for Exploring Correlations among Attributes of a Large Dataset and Its Applications in Data Mining This thesis

• K. Techapichetvanich and A. Datta. Interactive visualization for OLAP, In

Proceedings of ICCSA 2005: International Conference on Computational Sci-

ence and Its Applications, Volume 3482 of Lecture Notes in Computer Science,

pages 206-215. Springer, 2005.

Though this thesis and all published papers are mainly similar, the structure and

all details of individual systems have been described in this thesis in more details

and thoroughly. The author of this thesis is responsible for the originality of the

presented research and is also the primary author for each of these publications.

viii

Page 27: A Visualization Framework for Exploring …...A Visualization Framework for Exploring Correlations among Attributes of a Large Dataset and Its Applications in Data Mining This thesis

Acknowledgements

First and foremost, I would like to thank Associate Professor Amitava Datta. He

has been my supervisor for the last two years of my candidature. Over the last two

years, he has provided invaluable motivation, inspiration, and guidance. I am glad

to have you as a supervisor. Many thanks are also extended to Professor Robyn

Owens for her help.

During the first period of my candidature, I have also benefitted from the tremen-

dous support of Dr. Sato Juniper, and Margaret Jones has provided English support

by both teaching and proof reading.

General thanks go to all staff at the School of Computer Science & Software Engi-

neering at the University of Western Australia. Specifically thanks also go to Dr.

Nick Spadaccini, the head of school during the last two years of my candidature. I

have also had the good fortune to be surrounded by a great number of postgraduates

both in the school and outside.

Last, but not least, I would like to offer my special thanks to my family for their

support and encouragement. Thanks also go to the Pocathikorn family for their

support and for caring for me as one of the family.

ix

Page 28: A Visualization Framework for Exploring …...A Visualization Framework for Exploring Correlations among Attributes of a Large Dataset and Its Applications in Data Mining This thesis

x

Page 29: A Visualization Framework for Exploring …...A Visualization Framework for Exploring Correlations among Attributes of a Large Dataset and Its Applications in Data Mining This thesis

Contents

Abstract v

Preface vii

Acknowledgements ix

1 Introduction 1

1.1 Contributions of the thesis . . . . . . . . . . . . . . . . . . . . . . . 5

1.1.1 Visual exploration of large multidimensional datasets . . . . 5

1.1.2 Visual data mining and visualization of association rules . . 7

1.1.3 Interactive visualization for OLAP . . . . . . . . . . . . . . 10

1.2 Structure of the thesis . . . . . . . . . . . . . . . . . . . . . . . . . 10

2 Previous Work 13

2.1 Information Visualization Techniques . . . . . . . . . . . . . . . . . 13

2.1.1 Geometric techniques . . . . . . . . . . . . . . . . . . . . . . 14

2.1.2 Iconographic techniques . . . . . . . . . . . . . . . . . . . . 17

2.1.3 Hierarchical techniques . . . . . . . . . . . . . . . . . . . . . 20

2.1.4 Pixel-based techniques . . . . . . . . . . . . . . . . . . . . . 26

2.1.5 Table-based techniques . . . . . . . . . . . . . . . . . . . . . 28


2.2 The dynamic query framework . . . . . . . . . . . . . . . . . . . . . 30

2.3 Data Mining . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35

2.3.1 Association rules . . . . . . . . . . . . . . . . . . . . . . . . 35

2.4 Visualization for OLAP . . . . . . . . . . . . . . . . . . . . . . . . 40

2.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42

3 A New Technique for Visual Exploration of Large Datasets 43

3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43

3.2 Terminology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46

3.3 VisEx system design . . . . . . . . . . . . . . . . . . . . . . . . . . 48

3.4 VisEx system architecture and implementation . . . . . . . . . . . . 50

3.4.1 Connection and Transformation in VisEx . . . . . . . . . . . 50

3.4.2 Visualizing multiple attributes in VisEx . . . . . . . . . . . 52

3.4.3 Querying in VisEx . . . . . . . . . . . . . . . . . . . . . . . 55

3.4.4 User interaction . . . . . . . . . . . . . . . . . . . . . . . . . 61

3.5 Analysis scenarios . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64

3.5.1 Analysis 1: 1990 U.S. Census Data . . . . . . . . . . . . . . 65

3.5.2 Analysis 2: 1985 The Current Population Survey . . . . . . 68

3.6 User study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69

3.6.1 Experimental methodology . . . . . . . . . . . . . . . . . . . 69

3.6.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72

3.7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75

4 Visualization for Association Rule Mining 79

4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79


4.2 Terminology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81

4.3 The model for interactive association rule mining . . . . . . . . . . 82

4.3.1 Identifying Frequent Itemsets . . . . . . . . . . . . . . . . . 85

4.3.2 Selecting Interesting Association Rules . . . . . . . . . . . . 87

4.3.3 Visualizing Association Rules . . . . . . . . . . . . . . . . . 87

4.4 Data Structure used in VisDM . . . . . . . . . . . . . . . . . . . . . 89

4.5 A user study of VisDM . . . . . . . . . . . . . . . . . . . . . . . . . 90

4.5.1 Experimental methodology . . . . . . . . . . . . . . . . . . . 90

4.5.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91

4.6 Visualization of many association rules . . . . . . . . . . . . . . . . 92

4.6.1 The VisAR system . . . . . . . . . . . . . . . . . . . . . . . 96

4.6.2 The advantages of VisAR . . . . . . . . . . . . . . . . . . . 102

4.7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103

5 Interactive Visualization for On-line Analytical Processing 105

5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105

5.2 Terminology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107

5.3 VisOLAP system architecture and implementation . . . . . . . . . . 111

5.3.1 System connection . . . . . . . . . . . . . . . . . . . . . . . 111

5.3.2 Visualizing OLAP data cubes . . . . . . . . . . . . . . . . . 113

5.4 Interaction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115

5.5 Visual Exploration and MDX query . . . . . . . . . . . . . . . . . . 119

5.6 Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122

5.7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123


6 Conclusion 127

6.1 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127

6.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130

Bibliography 131

Appendices 141

A 141

A.1 Tasks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141

A.1.1 Tutorial Tasks . . . . . . . . . . . . . . . . . . . . . . . . . . 141

A.1.2 Experimental Tasks . . . . . . . . . . . . . . . . . . . . . . . 142

A.2 Questionnaire . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143

B 147

B.1 Tasks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147

B.1.1 Tasks from Dataset1 . . . . . . . . . . . . . . . . . . . . . . 147

B.1.2 Tasks from Dataset2 . . . . . . . . . . . . . . . . . . . . . . 148

B.2 Questionnaire . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148


List of Figures

1 The KDD process overview. . . . . . . . . . . . . . . . . . . . . . . 8

2 Scatterplot matrix visualization. . . . . . . . . . . . . . . . . . . . . 15

3 Parallel coordinates visualization. . . . . . . . . . . . . . . . . . . . 16

4 Star coordinates visualization. . . . . . . . . . . . . . . . . . . . . . 17

5 Chernoff-face visualization. . . . . . . . . . . . . . . . . . . . . . . . 18

6 Star glyphs visualization. . . . . . . . . . . . . . . . . . . . . . . . . 19

7 Stick figure visualization. . . . . . . . . . . . . . . . . . . . . . . . . 20

8 Worlds within worlds visualization. . . . . . . . . . . . . . . . . . . 22

9 Hierarchical axis visualization. . . . . . . . . . . . . . . . . . . . . . 23

10 Hyperbolic browser visualization. . . . . . . . . . . . . . . . . . . . . 24

11 Cone trees visualization. . . . . . . . . . . . . . . . . . . . . . . . . 24

12 An example of tree-maps. . . . . . . . . . . . . . . . . . . . . . . . . 25

13 An example of information slices. . . . . . . . . . . . . . . . . . . . 26

14 Spiral and axes query dependent visualization. . . . . . . . . . . . . 27

15 Circle segment visualization. . . . . . . . . . . . . . . . . . . . . . . 28

16 Table lens visualization. . . . . . . . . . . . . . . . . . . . . . . . . 29

17 An example of candidate and frequent itemsets. . . . . . . . . . . . 37

18 An example of visualizing association rules for text mining. . . . . . 39


19 An example of visualizing association rules with Mosaic plots. . . . 40

20 An Anchored Measures approach of ADVIZOR. . . . . . . . . . . . 41

21 An example of barstick visualization in VisEx. . . . . . . . . . . . . 47

22 VisEx System architecture . . . . . . . . . . . . . . . . . . . . . . . 51

23 A screenshot of the user interface with four barsticks queried in VisEx. 56

24 An example of fixed mode exploration. . . . . . . . . . . . . . . . . 58

25 An example result of five queried attributes. . . . . . . . . . . . . . 59

26 Display of the relationship of six queried attributes by Comparison

techniques. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60

27 An example result from Exploration techniques. . . . . . . . . . . . 62

28 An example result of four queried attributes. . . . . . . . . . . . . . 63

29 An example result of Selection techniques in barsticks. . . . . . . . 64

30 Display of the relationship of four queried attributes with equal-

height bar chart. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65

31 An example of an analysis scenario with four selected attributes. . . 66

32 An example analysis of three selected attributes with the comparison

of the sex attribute. . . . . . . . . . . . . . . . . . . . . . . . . . . . 67

33 An example analysis shows the relationships of five selected attributes

including Total personal incomes, Years of schooling, Occupations,

Class of worker, and Industry. . . . . . . . . . . . . . . . . . . . . . 68

34 An example analysis shows the relationships of five selected attributes:

Total personal incomes, Occupations, Age, Retirement Income, and

Social Security Income. . . . . . . . . . . . . . . . . . . . . . . . . . 69

35 A comparison of five selected attributes including Occupation, Sex,

Education, Race, and Wage. . . . . . . . . . . . . . . . . . . . . . . 70


36 An example analysis with four selected attributes: Education, Expe-

rience, Age, and Wage. . . . . . . . . . . . . . . . . . . . . . . . . . 71

37 The mean time for completing each task. . . . . . . . . . . . . . . . 73

38 The correctness of each task. . . . . . . . . . . . . . . . . . . . . . . 73

39 The results from questionnaires in different categories: (a) Usability

(b) Visualization (c) Interaction (d) Information . . . . . . . . . . . 77

40 A model of the technique for mining association rules. . . . . . . . . 84

41 A screenshot and user interface of identifying frequent itemsets. . . 86

42 A screenshot and user interface of selecting interesting association

rules. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88

43 A screenshot and user interface of visualizing association rules. . . . 89

44 The results from questionnaires in different categories: (a) Usability

(b) Visualization (c) Interaction (d) Information . . . . . . . . . . . 94

45 (a) The mean time of completing each task. (b) The correctness of

each task in each dataset. . . . . . . . . . . . . . . . . . . . . . . . 95

46 A diagram of the system for visualizing mined association rules. . . 97

47 A user interface of visualization for mined association rules. . . . . 99

48 Visualization of association rules with AND operation and support

sorting. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100

49 Visualization of association rules with AND operation and confidence

sorting. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100

50 Visualization of association rules from the selected items of interest

in Figure 47 but sorted according to confidence values. . . . . . . . 101

51 A diagram of ADOMD object model [29]. . . . . . . . . . . . . . . . 110

52 VisOLAP system architecture . . . . . . . . . . . . . . . . . . . . . 112

53 A framework of visualizing OLAP data cubes in VisOlap. . . . . . . 114


54 A user interface of VisOLAP. . . . . . . . . . . . . . . . . . . . . . 114

55 A framework of the Drill down function in VisOlap. . . . . . . . . . 116

56 A framework of the Slice function in VisOlap. . . . . . . . . . . . . 117

57 Examples of OLAP functionalities including drilling down, rolling

up, and slicing on multidimensional data. . . . . . . . . . . . . . . . 118

58 Visualization of the exploration in Product Family, Store Type, Year,

and Quarter. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119

59 Visualization of the drill down operation into Product Department

on Product Family. . . . . . . . . . . . . . . . . . . . . . . . . . . . 120

60 Visualization for exploring Promotion media, Store type, and Unit

sales. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124

61 An example of visualization for exploring alcoholic beverage sales of

small groceries in Year 1997. . . . . . . . . . . . . . . . . . . . . . . 125


Chapter 1

Introduction

In the last few decades, increased computer usage has led to the generation and

collection of huge quantities of complex data in many areas, including engineering,

health, science, and business. Moreover, databases and data warehouses

have become integral parts of business activities and scientific research. A National

Science Foundation (NSF) report [49] defined visualization as a means of

adapting the numerical abilities of computers to suit human perception by transforming

and preprocessing raw data into visual images. To understand huge

quantities of data clearly within a reasonable time frame, an efficient and effective

visualization tool is needed that presents the data in graphical form.

Visualization combines computer graphics, computer science, visual arts, image

processing, and user-interface methodology to enhance insights into data. It can

be used to help scientists, analysts, and researchers to gain knowledge from data,

and to reduce the time taken to interpret information, as well as revealing

relationships and hidden phenomena in large datasets.

Information visualization can be defined as the field of visualization that represents

abstract or non-physical data, such as financial data, that cannot be obviously

mapped onto physical space. On the other hand, data in scientific visualization often has an

inherent geometry and relates to mathematical structures and models. It tends to use


physical data containing spatial mapping from fields such as meteorology, medical

images, and space exploration [16]. For example, in meteorology, visualization of

data representing the density of cloud cover in the atmosphere is based on a

three-dimensional representation of the earth. In medical imaging,

magnetic resonance imaging (MRI) scans or computerized tomography (CT) scans

show anatomical structures or parts of the body that could not be viewed by any

other means. Information visualization usually requires less computational power

than scientific visualization and so can be done on personal computers. Information

visualization deals with the problem of identifying and displaying visibly important

portions of the data and effectively mapping non-spatial information onto visual

forms.

Both scientific and information visualization deal with the encoding and mapping of

data to visual form in geometric space. In scientific visualization, mapping physical

data to geometric space is important. In contrast, geometric space carries no inherent

meaning when abstract data is mapped for information visualization. One common way

of grouping the spatial encodings of abstract data uses the following

four spaces [9]. The four spaces are grouped by mapping techniques

and data characteristics, although the first space is sometimes regarded

as a subset of the second and the third

as a subset of the fourth.

• 1-D, 2-D, and 3-D orthogonal axes or xyz Cartesian coordinates space is used

to encode data.

• Multiple dimensions > 3 are used to encode complex data dimensions onto a

limited screen space.

• Trees are used to display connections between multiple levels to encode rela-

tionships in data.

• Networks or graphs are used to encode relationships in data.


Multidimensional visualization is one of the most important techniques used in in-

formation visualization [49]. The purpose of multidimensional visualization is not

only to understand data but also to understand the underlying hidden relationships

present in the data. A variety of multidimensional visualization techniques have

been developed by researchers in many areas such as statistics, finance, medical

research, and mathematics. Multidimensional visualization uses understandable

graphical forms to represent relationships in multidimensional datasets (multiple

variables or parameters and their relationships) by mapping n-dimensional coor-

dinates (n ≥ 3) onto low dimensional coordinate spaces. There are various ap-

proaches to address the problem depending on the purpose of the visualization (i.e.

what relationships users want to look for). For example, the Worlds within Worlds

technique [19] by Beshers and Feiner visualizes multidimensional data by placing a

coordinate space inside another in a hierarchical manner. Techniques developed by

Healey [24, 25] focus on performing exploratory data analysis rapidly and accurately

by using preprocessing.

An important challenge in developing a visualization tool for exploring multidimensional

data is the need for simplicity, so that the tool is easy to understand and training time is reduced. It

is easy for a human observer to understand visual information presented as a two-dimensional

bar chart where one attribute is displayed against another. It is still

possible to understand three-dimensional bar charts or surface plots. However, plotting

one attribute against one or two other attributes does not scale beyond three

dimensions. Another problem is the inefficient use of screen space, which results

in occlusion. Hence, a variety of techniques have been developed for visualizing

multidimensional data, and these are reviewed in Chapter 2.

Although the well known parallel coordinates [34] and scatterplot matrix [13] can

visualize multidimensional datasets, they generate occlusion when visualizing large

datasets. Since the encoding and mapping of a large amount of multidimensional

data is a difficult design task, the problem of occlusion can easily occur. It is also well

known that we can only understand a limited amount of information visually at


a time [30]. Hence, the design of an appropriate visualization technique should

avoid these problems or at least reduce them as much as possible. Visualization

should be able to convey important information through efficient use of

limited screen space, simplicity of use, and clear, understandable representations.

Simplicity [81] can be interpreted as the use of friendly and intuitive input

structures together with easily interpretable output. In addition, a good visualization

method should improve users’ ability to extract or discover interesting and

hidden relationships from large multidimensional datasets. Although pixel-based

techniques [38, 41] address these problems, users might need a lot of training time

to use these visualization techniques.

Typically, some basic requirements should be considered in designing an informa-

tion visualization system so that the system can effectively convey important and

interesting information to users [9]. These requirements are as follows.

• Perception and Cognitive amplification: Humans have a limited ability

to perceive visual information. Information visualization relies on human

perception and hence, we need to consider the load on human perception in

designing an effective visualization system. An effective visualization system

should take advantage of normal human cognitive abilities.

• Comprehension: The visualization should convey comprehensive informa-

tion to users.

• Visual structure: The visualization should be clear enough to preserve data

and convey information through visual representation.

• Computational Cost: A good visualization should minimize its computa-

tion time, which refers to the issues of algorithm optimization and real-time

response during interactions.


1.1 Contributions of the thesis

In this thesis I investigate new visualization techniques for exploring large datasets

to help users gain insight into their data and to discover relationships, trends,

distributions, and patterns among data attributes. I categorize the visualization

techniques in this thesis into three major topics, namely, visual exploration of large

multidimensional datasets, visual data mining and visualization of association rules,

and interactive visualization for OLAP. All of these visualization techniques are designed

to reduce occlusion when displaying a large number of records by reducing

the number of graphic primitives representing records on the screen. Moreover,

the intention has been to design these techniques to be reliable, flexible, and simple to

understand, and to meet the requirements listed above.

1.1.1 Visual exploration of large multidimensional datasets

Simple tools for data visualization are needed to give users who are not experts

in database technologies access to large databases. Such users require intuitive tools

that they can use without any prior knowledge of database technology or

the nature of the underlying database. Shneiderman and his co-workers recognized

the need for such intuitive tools in a series of papers almost a decade ago [61, 4,

3, 79]. Their main aim was to give the user sufficient freedom in querying a database.

Since most database query languages are relatively time consuming to learn, it is

unrealistic to expect that a user who wants to access a database for a specific need

will first learn a query language. Moreover, knowing a query

language may not be enough to extract meaningful information effectively. Quite often a query

produces either no records or a large number of records. Neither case

helps the user, who usually wants a small number of records that can be effectively

examined further.

Williamson and Shneiderman [79] have presented a very interesting scenario where

a user wants to find meaningful information from a large database without any prior


knowledge of a query language or the details of the underlying database. In their

dynamic home finder system, a user wants to find a suitable home depending on

the location of her workplace and other factors like price, number of bedrooms and

area of the plot. The usual way of choosing a house is to go through the brochures

from a real estate agent. However, in the system developed by Williamson and

Shneiderman [79], the user can progressively narrow down her search through visual

queries. Shneiderman [61], Ahlberg et al. [4], Williamson and Shneiderman [79],

and Ahlberg and Shneiderman [3] have proposed the dynamic query framework for

such visualization tasks.

Most large databases store records that have several attributes. For example, a car

database contains car records that have attributes like make, model, color, engine

capacity and price. A census database stores records that usually have many more

attributes, such as age, gender, years of education, occupation, and

ethnic background. Each record in a U.S. census database has 72 attributes.

One of the major tasks both experts and non-experts face during exploration of a

database is the understanding of the correlations between the different attributes.

In many cases the user can focus the search for a particular item or a small group of

items by restricting the different attributes progressively. This is often the strategy

used by shoppers at internet shopping sites. Wittenburg et al. [80], Lanning et al.

[43] and Tweedie et al. [75] have extensively considered this scenario in the dynamic

query framework. In the EZChooser system, Wittenburg et al. [80] use a visual-

ization system called parallel bargrams for progressively narrowing down the search

by restricting attribute values. The Attribute Explorer [75, 66] and MultiNav [43]

tools are also based on similar strategies. As the ranges for different attributes are

progressively chosen, the number of objects satisfying these restrictions reduces.

The tools show the objects that satisfy all the constraints in a bottom panel of the

screen. The user can select or deselect different attributes and this allows the user

to experiment with different attribute restrictions.
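The progressive narrowing described above can be sketched in a few lines of Python. This is only an illustration of range-based dynamic queries in general, not the implementation of EZChooser, Attribute Explorer, or MultiNav; the records, the attribute names, and the helper functions `filter_records` and `narrow_down` are all hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical car records; attribute names loosely follow the car example in the text.
CARS: List[Dict[str, float]] = [
    {"price": 12000, "engine": 1.6, "year": 1998},
    {"price": 18500, "engine": 2.0, "year": 2001},
    {"price": 25000, "engine": 2.5, "year": 2003},
    {"price": 9000, "engine": 1.3, "year": 1995},
]

Range = Tuple[float, float]


def filter_records(records, ranges: Dict[str, Range]):
    """Return the records whose values lie inside every (low, high) range, inclusive."""
    return [
        r for r in records
        if all(low <= r[attr] <= high for attr, (low, high) in ranges.items())
    ]


def narrow_down(records, steps):
    """Add one restriction at a time and record how many records still match."""
    active: Dict[str, Range] = {}
    counts = []
    for attr, rng in steps:
        active[attr] = rng  # selecting an attribute range adds a constraint
        counts.append(len(filter_records(records, active)))
    return counts


if __name__ == "__main__":
    steps = [("price", (10000, 26000)), ("engine", (1.5, 2.0)), ("year", (2000, 2005))]
    print(narrow_down(CARS, steps))
```

Each added restriction can only keep or shrink the set of matching records, which is the behaviour the paragraph above describes.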

This thesis presents a new visualization system called VisEx as further detailed in


Chapter 3. VisEx is based on the dynamic query framework in the sense that it

allows even novice users to experiment with a large multi-attribute database and

frame meaningful queries. The user interaction in VisEx has some similarities with

the user interaction in EZChooser [80], Attribute Explorer [75] and MultiNav [43].

However, VisEx overcomes some of the key restrictions in these systems. VisEx is

a completely scalable system in the sense that it can handle small as well as very

large multi-attribute databases through a quantitative estimation of records. We

change the granularity at which the user views the records in a database depending

on the size of the database.

The main aim of the VisEx system is slightly different from the aims of the systems

in [80, 75, 43]. The aim of these systems is to allow the user to zoom in to specific

items by restricting the ranges of the different attributes. The user can experiment

with different choices by constraining different ranges for the different attributes

so that she can have a better choice of an item at the end. Hence the focus in

these systems is to allow the user to choose an item or a few items from a large

collection according to the user’s specifications. VisEx can be used in a way similar

to the Attribute Explorer [75] and EZChooser [80] systems; however, VisEx is a

more versatile system for exploring correlations among attributes of large datasets.

In VisEx, our main aim is to give the user the facility to experiment with different

ranges for different attributes and see the effect of these restrictions on the other

attributes. It also provides a bar graph which allows users to view the proportions

of data values in each attribute. A user study is presented to judge the effectiveness

of the VisEx system.
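The idea of restricting one attribute and observing the effect on another can be illustrated with a small Python sketch. It is not the VisEx implementation (which is described in Chapter 3); the records, attribute names, and the functions `proportions` and `restrict` are assumptions made purely for illustration.

```python
from collections import Counter

# Hypothetical census-like records; attribute names and values are made up.
PEOPLE = [
    {"sex": "F", "education": "degree", "income": "high"},
    {"sex": "F", "education": "school", "income": "low"},
    {"sex": "M", "education": "degree", "income": "high"},
    {"sex": "M", "education": "school", "income": "low"},
    {"sex": "M", "education": "school", "income": "high"},
]


def proportions(records, attribute):
    """Fraction of records taking each value of `attribute` (the data behind a bar graph)."""
    counts = Counter(r[attribute] for r in records)
    total = sum(counts.values())
    return {value: n / total for value, n in counts.items()} if total else {}


def restrict(records, attribute, allowed):
    """Keep only the records whose `attribute` value is in `allowed`."""
    return [r for r in records if r[attribute] in allowed]


if __name__ == "__main__":
    print(proportions(PEOPLE, "income"))  # distribution over all records
    print(proportions(restrict(PEOPLE, "education", {"degree"}), "income"))  # after restriction
```

Comparing the two distributions shows how a restriction on education changes the proportions of income values, which is the kind of correlation a user would look for.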

1.1.2 Visual data mining and visualization of association

rules

Data mining is a core process of knowledge discovery from large databases, and

it can be illustrated through the knowledge discovery in databases (KDD) process [18]. KDD is the

process of extracting knowledge by identifying valid, potentially useful, and understandable

patterns from data sources. The KDD process (as shown in Figure 1)

consists of a number of basic steps including data selection, data preprocessing, data

transformation, data mining, pattern discovery, and pattern evaluation. Data

mining tasks have different goals according to the kinds of knowledge to be mined.

An example of a data mining task is the generation of a mailing list of purchasing

customers. The data mining process helps managers to extract the mailing list of

loyal customers.

Figure 1: The KDD process overview.

Even though some research has been done in visual data mining, its definition is

still unclear. Visual data mining can be described as the integration of visualization

into the data mining process. The integration combines the human ability to identify

patterns visually and the ability of a computer to do large-scale numerical


computations rapidly. The main problem with automatic data mining algorithms

is that they often mine a large number of association rules, not all of which are of

equal importance. On the other hand, it is possible for a human expert to participate

in the mining process to extract a small set of interesting association rules.

Visualization is an important tool for a human expert to participate in the mining

process. Visual data mining can be categorized into three groups based on how the

visualization is integrated in the data mining process.

• Pre-applying visualization into data mining for exploring datasets: data is

first visualized to generate initial views before data mining algorithms

are applied.

• Post-applying visualization into data mining for conveying the mining results:

the data mining algorithms extract patterns in data and then the extracted

patterns are visualized.

• Intermediate-application of visualization in the mining process: Users can

apply their domain knowledge to support knowledge extraction through the

mining process. This approach has been stated [81] as a tight coupling of the

human and computer in the mining process.

In the first two techniques, human experts cannot participate and apply their knowl-

edge in data mining. In the third technique, humans can examine information

through visualization and apply their knowledge at each step of the mining process.

This technique helps users to efficiently extract interesting patterns hidden in their

data and learn more through visual interaction. Surprisingly, this technique has

not been used so far in the literature for mining association rules. In this thesis, I

introduce a new tight coupling technique which enables users to apply their domain

knowledge to improve the quality of data mining approaches through visualization.

I also provide a new technique for visualizing a large number of mined association

rules.
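As background for the discussion of association rules, the two standard measures used to rank a rule, support and confidence, can be computed as in the following Python sketch. The transactions and function names are hypothetical, and the sketch shows only the textbook definitions, not the interactive mining technique introduced in this thesis.

```python
# Hypothetical market-basket transactions, each a set of items.
TRANSACTIONS = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """support(antecedent and consequent together) / support(antecedent)."""
    base = support(antecedent, transactions)
    if base == 0:
        return 0.0  # the rule never applies, so report zero confidence
    return support(set(antecedent) | set(consequent), transactions) / base


if __name__ == "__main__":
    print(support({"bread", "milk"}, TRANSACTIONS))
    print(confidence({"bread"}, {"milk"}, TRANSACTIONS))
```

An automatic algorithm typically keeps every rule above fixed support and confidence thresholds, which is one reason the number of mined rules can become large.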


1.1.3 Interactive visualization for OLAP

On-line analytical processing (OLAP) has become an important tool for interactive

analysis of multidimensional databases such as data warehouses. Many businesses

have adopted data warehouses as the preferred mode of data storage in order to

manage the explosive growth of their databases [11]. OLAP helps analysts to ex-

plore, analyze, and extract interesting patterns from massive amounts of data stored

in multidimensional databases and data warehouses. Since most multidimensional

databases contain hierarchical structures, it is difficult for users to explore multi-

dimensional data with a tool providing only overviews of data. It is important for

users to be able to explore their multidimensional databases interactively to refine

their views. Interactive textual displays, such as a PivotTable, are not enough for

understanding or extracting patterns from multidimensional databases.

Chapter 5 presents a new interactive visualization technique called VisOLAP. The

aim of this technique is to assist analysts to improve their performance in explor-

ing, analyzing, and understanding large databases through interactive visualization.

The tool incorporates visualization into an OLAP service, which enables analysts to explore

overviews of high levels of data and drill down into levels of detail in each

dimension directly. The incorporation of both visualization and OLAP not only

helps users to extract interesting patterns but also helps them to interpret and

analyze the extracted information faster.

1.2 Structure of the thesis

The remainder of this thesis is organized as follows. In Chapter 2, I discuss the

current state of research in information visualization. In particular, developments

of some visualization techniques related to the research presented in this thesis are

outlined. The application of visualization in other areas such as visualization for

mining association rules and visualization for other applications including OLAP is

also considered.

In Chapter 3, a new technique for visual exploration of correlations among attributes

in large multidimensional datasets is presented. Although there are some visual-

ization techniques for comparing attributes in multidimensional datasets, most of

these techniques work only for a small number of dimensions or attributes, in most

cases only three. This new technique on the other hand is completely scalable and

can handle any number of attributes. Also, most existing techniques [41, 40] are quite

complex and users need time to understand and use the techniques. The system

presented in this chapter helps analysts and users to extract hidden relationships,

correlations, and trends and it addresses the occlusion problem effectively. In ad-

dition, this technique is highly interactive so that users can understand and gain

insight into datasets. A user study is presented for evaluating the system.

Chapter 4 describes the integration of a visualization technique into association

rule mining algorithms in a new framework called VisDM. This technique is a com-

promise between completely manual mining by users and purely automatic mining

algorithms. Again, a user study is described for this system. In addition, a new

technique called VisAR, for visualizing a large number of mined association rules is

presented. This technique improves visualization of a large number of association

rules generated through data mining algorithms.

The VisOLAP system for interactive visualization of OLAP data cubes is presented

in Chapter 5. I show how users can visually explore large data cubes with many

dimensions without any prior knowledge of OLAP technology.

Chapter 6 concludes the thesis with a discussion of the contributions presented

in this thesis. It highlights the limitations that remain in data visualization, and

points to future research.

Chapter 2

Previous Work

This chapter reviews existing research in information visualization, visual data min-

ing and visualization of OLAP data cubes. An overview of some general techniques

is given before concentrating on the techniques that are closely related to this thesis.

2.1 Information Visualization Techniques

Visualization tools have become important in helping users to discover and interpret

useful information from a large amount of data. A considerable amount of research

has been done on information visualization techniques in the past decades. This

research can be broadly categorized into several groups as discussed below [37].

• Geometric techniques involve geometric transformation and projection of data.

This category includes techniques like scatterplot matrix, parallel coordinates,

and star coordinates. These techniques are discussed in detail in Section 2.1.1.

• Iconographic techniques use features of icons or glyphs to represent data.

Some examples of techniques in this category are Chernoff faces, star glyphs,

and stick-figure icons. These techniques are explained in Section 2.1.2.

• Hierarchical techniques map variables into different recursive or hierarchical

levels. Worlds within worlds, hierarchical axis, hyperbolic browser, cone trees

and tree-maps are examples of techniques in this category. The details of these

techniques are described in Section 2.1.3.

• Pixel-based techniques [41, 38, 40] try to represent individual data records by

pixels, and characteristics of a record are denoted by coloring the correspond-

ing pixel using a color map. VisDB and pixel bar charts are some examples

of techniques in this category. An overview of these techniques is provided in

Section 2.1.4.

• Finally, table-based techniques like table lens, FOCUS and Polaris employ

table features to visualize different characteristics of data. More details of

these techniques are explained in Section 2.1.5.

2.1.1 Geometric techniques

Geometric techniques project and geometrically map datasets, especially multidi-

mensional datasets, onto the display device. One of the earlier approaches in infor-

mation visualization is the scatterplot [13], in which two variables are projected and

mapped onto x-y Cartesian coordinates. The scatterplot matrix is a combination of

several scatterplots, and a different pair of variables is used in each scatterplot. In a

scatterplot matrix (Figure 2), individual variables are arranged along the diagonal

of a matrix and each display panel illustrates relationships or correlations between

variables. The number of variables in a dataset that can be sensibly visualized

simultaneously by scatterplots is limited by the size of the display device, so often

only a subset of the data is visualized at any particular time.

The best-known geometric technique is the parallel coordinates technique proposed

for visualizing multidimensional data by Inselberg and Dimsdale [34]. In this ap-

proach, the dimensions are represented by parallel vertical lines, which are per-

pendicular to and uniformly distributed along a horizontal line (Figure 3). Each

variable, attribute or dimension is assigned to a specific parallel axis. A record

Figure 2: Scatterplot matrix represents six dimensions, mpg, weight, drive ratio, horse power, displacement and cylinders of cars. The figure is taken from Basalaj [7].

is represented by plotting a zigzag line by connecting its attribute values on the

different axes. The relationships among attributes that are represented by nearby

vertical lines are easy to perceive. However, it gets harder to perceive relationships

among attributes that are represented by widely separated vertical lines. Hence,

the initial choice and ordering of the axes has a big effect on the visualization. The

other limitations of this method are the restriction of the horizontal axis and screen

space. If the number of data points becomes large, the plotted result becomes a

solid blob of color and the correlation among attributes represented by distant axes

is hard to understand or explore.

Figure 3: Parallel coordinates from Inselberg and Dimsdale, 1987 [34].
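
The mapping from a record to its zigzag line can be sketched as follows. This is only an illustrative sketch and is not taken from Inselberg and Dimsdale [34]: the function name `polyline`, the min-max normalization of each attribute and the unit-square drawing area are all assumptions made for the example.

```python
def polyline(record, mins, maxs):
    """Return the (x, y) vertices of the zigzag line for one record.

    Axis i is assumed to be the vertical line x = i / (n - 1), and y is the
    record's value on that attribute, min-max normalized to [0, 1].
    """
    n = len(record)
    if n < 2:
        raise ValueError("parallel coordinates need at least two attributes")
    points = []
    for i, (value, lo, hi) in enumerate(zip(record, mins, maxs)):
        x = i / (n - 1)
        # A constant attribute has no spread; place it mid-axis.
        y = 0.5 if hi == lo else (value - lo) / (hi - lo)
        points.append((x, y))
    return points


# Hypothetical records with three attributes each.
records = [(10.0, 200.0, 4.0), (30.0, 100.0, 8.0)]
mins = [min(column) for column in zip(*records)]
maxs = [max(column) for column in zip(*records)]
lines = [polyline(r, mins, maxs) for r in records]
```

Because every record contributes one such line, the number of line segments grows with the number of records, which is consistent with the over-plotting problem described above.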

Star coordinates [36] is another approach to project multidimensional data onto a

two-dimensional plane. As the name suggests, different attributes of the dataset

are represented by a set of radial axes that emanate from the center of a circle.

As shown in Figure 4, nine attributes (namely horse power, mpg, origin, cylinders,

model year, name, acceleration, displacement and weight) are represented by nine

axes. In contrast to the parallel coordinates technique, the star coordinates tech-

nique transforms each data item and displays it as a point. Similar to the parallel

coordinates technique, this approach is based on the projection and geometrical

mapping of datasets. When a large amount of multidimensional data is visualized, the

display can become a solid blob of color which is hard to use for interpreting the

correlation of attributes. Because this technique draws a very large number of graphic

primitives on the screen, one for each data record, the primitives

occlude one another.

Figure 4: Star coordinates from Kandogan, 2001 [36].
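
A minimal sketch of such a projection is given below. It assumes one common reading of the star coordinates mapping, namely that each axis is a unit vector at a uniformly spaced angle and that a data item is placed at the sum of these vectors scaled by its normalized attribute values. The function name and the normalization are assumptions for illustration, not details taken from [36].

```python
import math


def star_coordinates(record, mins, maxs):
    """Project one n-dimensional record to a 2-D point.

    Assumption: axis i points in direction 2*pi*i/n, and each attribute
    value is min-max normalized to [0, 1] before scaling its axis vector.
    """
    n = len(record)
    x = y = 0.0
    for i, (value, lo, hi) in enumerate(zip(record, mins, maxs)):
        t = 0.0 if hi == lo else (value - lo) / (hi - lo)
        angle = 2.0 * math.pi * i / n
        x += t * math.cos(angle)
        y += t * math.sin(angle)
    return x, y


mins, maxs = [0.0] * 4, [1.0] * 4
all_low = star_coordinates((0.0, 0.0, 0.0, 0.0), mins, maxs)
opposed = star_coordinates((1.0, 0.0, 1.0, 0.0), mins, maxs)
```

In this sketch the two records `all_low` and `opposed` land on (almost exactly) the same screen position, because contributions along opposite axes cancel. This illustrates why distinct records can overlap in a point-based projection.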

2.1.2 Iconographic techniques

In iconographic displays, icons or glyphs are used to visualize multidimensional data.

The common implementation of these techniques is through mapping dimensions

of data to graphical attributes such as size, color, shape, and orientation.

One of the first iconographic techniques was developed by Chernoff [12] as shown

in Figure 5. Multidimensional data are represented in the form of a human face.

The design of this technique was based on the ability of humans to recognize and

differentiate human faces and therefore to perceive regions of clustered data and

outliers. Each data variable is assigned to facial features such as eyes, eyebrows,

facial area, nose, and mouth, which have different attributes of shape, location,

orientation, length, and size to represent the values in each dimension. Depending on

the order in which variables are assigned to the facial features, the resulting faces

can look similar or even indistinguishable.

Figure 5: Chernoff-faces from Chernoff, 1973 [12].

A star glyph [62] represents multiple attributes by line segments, like the radii of

a circle, where each line emanates from the center of the glyph and its length

represents the value of each dimension (as shown in Figure 6). The number of

line segments generated depends on the number of dimensions, for example, n-

dimensions require n line segments. A star glyph represents all selected dimensions

(i.e. a row of a data table) of a data point. The ability to display many dimensions

depends on the uniformly mapped angles. However, the glyphs become cluttered

with many dimensions or attributes. Also, it is difficult to display many glyphs at

a time corresponding to many data points due to the limitation of screen space.
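
The geometry of a single star glyph can be sketched as below. The sketch assumes that attribute values have already been normalized to [0, 1] and that the rays are spaced at uniform angles; the function name, the `center` and `radius` parameters are hypothetical.

```python
import math


def star_glyph(values, center=(0.0, 0.0), radius=1.0):
    """Return the end point of each ray of a star glyph.

    values: one normalized value in [0, 1] per dimension.
    Ray i leaves the center at angle 2*pi*i/n and has length radius * value.
    """
    n = len(values)
    cx, cy = center
    rays = []
    for i, t in enumerate(values):
        angle = 2.0 * math.pi * i / n
        rays.append((cx + radius * t * math.cos(angle),
                     cy + radius * t * math.sin(angle)))
    return rays


# One hypothetical four-dimensional data point.
rays = star_glyph([1.0, 0.5, 0.0, 1.0])
```

With n dimensions the angle between neighbouring rays is 2π/n, so the rays crowd together as n grows, which matches the clutter noted above.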

Pickett and Grinstein have developed a technique to map multidimensional data

onto a two-dimensional plane [57] such as a computer screen. The applications

of the technique have been focused on spatially or temporally coherent data such

as multispectral imagery datasets. The original approach of Pickett and Grinstein

Figure 6: Star glyphs represent six dimensions, mpg, weight, drive ratio, horse power, displacement and cylinders of cars from Seigel et al., 1972 [62]. The figure is taken from Basalaj [7].

displays multispectral imagery data by using colors so that each dimension of data

controls the intensity of a primary color: red, green, and blue. The changes in

colors show relationships among datasets.

The method by Pickett and Grinstein [57] uses icons, called stick-figure icons,

to represent data elements. Each stick-figure icon is composed of five connected

line segments. Four are limbs and the other is the body of the icon. The first

four dimensions from the data can be mapped onto the four limbs with each value

controlling the angle of a limb. The last dimension controls the orientation of the

body (as shown in Figure 7). In addition, color, thickness, or length can be encoded

to the limbs and body to represent higher dimensionality.

Figure 7: Examples of stick figure icons from Pickett and Grinstein, 1988 [57].

After data elements have been mapped to icons, the icons are displayed in two-

dimensions. Data that have close values are clustered into the same groups and

have similar icon shapes. When these icons are displayed as groups, they form a

texture pattern in the image. The boundary of each group can be noticed because

each group generates different patterns of textures. Pickett and Grinstein have also

investigated dynamic icon coding, in which users interact with dynamic

icons and dynamic icons interact with each other. Since one glyph visualizes one

data object, most techniques in this category are limited by the small number of

dimensions and the number of data records that can be displayed.

2.1.3 Hierarchical techniques

Hierarchical visualization techniques represent datasets by partitioning space

hierarchically into subspaces. Some techniques in this group are based on recursively

embedding dimensions, which stacks subspaces onto each other. Examples are

hierarchical axes [50, 52, 51] and worlds within worlds [19, 10], which are discussed

later in this section. Each subspace has a relationship with its parent subspace (i.e.,

an inner subspace has a relationship with an outer space). Some techniques used in

information visualization such as cone trees [59], tree-maps [35], and the hyperbolic

browser [42] are structured node links in which child nodes are extended from their

parents. Most of these information visualization techniques are used to visualize

and interact with data sets with large hierarchies.

Worlds within worlds, also called a nested heterogeneous coordinate system [19], is

a three-dimensional hierarchical space technique, in which lower dimensions (inner

worlds) are recursively placed in higher dimensions (outer worlds) as shown in

Figure 8. A height field or vertical axis of inner worlds represents the value of a

function and all remaining variables, and is used to code the constant value of the

outer world (at most three variables at each level). The positions of the outer worlds

are related to the inner world’s origin. Moving or translating an inner world changes the

represented values of the outer world's variables, so the height field of the inner world

is adjusted, but not vice versa. In order to increase flexibility in manipulating the

relationships of multivariate data to be represented, an extension of worlds within

worlds called AutoVisual [10], has been developed. The zooming and selection

tools in this technique allow users to perceive interesting areas of a dataset more

accurately.

Hierarchical axis methods [50, 52, 51] use one-dimensional subspace embedding and

aim at visualizing high dimensionality on two-dimensional graphics space. One of

the techniques used is to plot scalar fields on an n-dimensional lattice by categorizing

the data into two sets of dependent and independent variables. In Figure 9, the

former are mapped on the vertical axis, and the latter are recursively mapped onto a

single horizontal axis. The term “speed” was introduced to this method and colors

were used to distinguish the values of each parameter in the hierarchical horizontal

axis from the others. The first mapped variable is termed the “fastest”, the next the

Figure 8: Example of worlds within worlds by Beshers and Feiner, 1990 [19].

“second fastest”, and so forth. Three classification techniques [51] were described

for individual functions. For common multidimensional analysis, three rules [52]

were applied to determine the vertical plot including minimum/maximum, sum,

and mean or standard deviation methods. A histogram, or a binned matrix, using

a mean function to gain better visual perception, was compared to a traditional

scatterplot matrix [13]. In addition, the zoom and clone tools were developed to gain

a large number of data representations by allowing users to display the subspace of

an interesting area.
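
The recursive mapping of several independent variables onto a single horizontal axis can be illustrated with a mixed-radix index. The sketch below is an assumption about how such a mapping may be computed, not code from [50, 52, 51]; the function name and the example sizes (three years, two subjects, two classes) are hypothetical.

```python
def horizontal_position(indices, sizes):
    """Map one combination of independent-variable indices to an x slot.

    indices[0] belongs to the "fastest" variable, which changes at every
    step along the axis; later variables change progressively more slowly.
    sizes[i] is the number of distinct values of variable i.
    """
    position = 0
    stride = 1
    for index, size in zip(indices, sizes):
        if not 0 <= index < size:
            raise ValueError("index out of range for its variable")
        position += index * stride
        stride *= size
    return position


# Hypothetical sizes: years (fastest), subject (second fastest), class (slowest).
sizes = (3, 2, 2)
slot = horizontal_position((2, 1, 0), sizes)      # year 2, subject 1, class 0
next_class = horizontal_position((0, 0, 1), sizes)
```

Every combination of values receives its own slot, so the axis needs as many slots as the product of the sizes. That product grows quickly with the number of variables, which is one reason zooming into a subspace is useful.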

The hyperbolic browser [42] uses a hierarchical technique with a Focus+Context (fisheye)

approach applied for interaction. The display of the hyperbolic browser is a tree

as shown in Figure 10. The root is initially placed at the center. The Focus+Context

technique allows viewers to focus on the details of small areas or other nodes while

retaining the context of the entire hierarchy. The hyperbolic browser approach draws

the hierarchy uniformly on a hyperbolic plane and then maps this plane onto a circular

disk on the display. While the tree is being laid out on the hyperbolic plane, a recalculation

checks whether the node in focus has changed. A transformation of the space is

used to magnify a region at the center of focus while the rest of the region shrinks.

This allows users to explore or browse selected regions of interest in more detail,

but moving a node in the hyperbolic browser affects the orientation of its children and

Figure 9: Example of hierarchical axis method in which years is the fastest axis, the second fastest axis is subject, and class is the slowest axis from Mihalisin et al., 1991 [51].

the viewing can be disoriented while the children are rotated.
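
The change of focus can be illustrated with a standard transformation of the unit disk. The sketch below assumes that the hyperbolic plane is displayed with the Poincaré disk model and that node positions are stored as complex numbers inside the unit disk; under that assumption, a Möbius transformation moves a chosen point to the center. The function name is hypothetical and this is not necessarily the exact formulation used in [42].

```python
def refocus(z, c):
    """Transform point z so that point c moves to the center of the disk.

    z and c are complex numbers strictly inside the unit disk. The map
    z -> (z - c) / (1 - conj(c) * z) sends the unit disk onto itself.
    """
    z, c = complex(z), complex(c)
    if abs(z) >= 1 or abs(c) >= 1:
        raise ValueError("points must lie strictly inside the unit disk")
    return (z - c) / (1 - c.conjugate() * z)


focus = 0.9 + 0j                      # a node near the rim of the disk
neighbour = 0.91 + 0j                 # a node very close to it on screen
new_focus = refocus(focus, focus)     # lands at the center
new_neighbour = refocus(neighbour, focus)
old_root = refocus(0j, focus)         # the former center is pushed outward
```

In this example the two nodes are 0.01 apart near the rim but noticeably further apart after refocusing, while the former center moves toward the rim. This is the magnify-at-the-focus, shrink-elsewhere behaviour described above.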

Other hierarchical techniques are cone trees and tree-maps. Cone trees [59] are an

animated three-dimensional visualization technique, used in place of the two-dimensional

techniques commonly applied to hierarchical data structures. Sub-trees or child nodes of cone

trees (shown in Figure 11) are evenly expanded in a circle at a lower level around

the apex of the cone. In the first implementation, each level of cone trees was

implemented with the same height. Diameters of cone trees at each level were

reduced so that they fitted into the display space, called a room. For the visual

and interactive aspects, some nodes are labeled as transparent to avoid occlusion,

and viewers rotate cone trees to explore the data, and can adjust the cone radius

and height, name levels of the cone trees, and shift perspective angles between a

parent and child node of the cone trees. The problem with cone trees is that the

user experience deteriorates with the density of the data. Also, occlusion becomes

a problem with dense datasets.

Tree-maps (a space-filling approach) [35] illustrated in Figure 12 is a visualization

technique used to map the entire hierarchical information (which is both structural

Figure 10: Hyperbolic browser from Lamping, 1995 [42].

Figure 11: Cone trees from Card and Mackinlay, 1997 [59].

Figure 12: Tree-maps from Shneiderman, 1998 [35].

and content information) to a rectangular partitioned display space. In this way,

large hierarchical information structures are presented on a two-dimensional dis-

play. Each partitioned rectangle is assigned a weight based on the size of the node.

This weight determines the area of the associated rectangle. In the original imple-

mentation of this visualization tool, hard disk drives with large directory structures

were used as datasets. Tree-maps allowed viewers to visualize the entire hierarchy

simultaneously and to set display properties such as colors and borders to enhance

the image.
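
The relation between node weight and rectangle area can be sketched with a simple "slice-and-dice" layout, in which the split direction alternates at each level of the hierarchy. The node representation `(name, size, children)`, the function names and the tiny example hierarchy are assumptions made for this illustration.

```python
def weight(node):
    """Weight of a node: its own size for a leaf, else the sum of its children."""
    name, size, children = node
    return size if not children else sum(weight(child) for child in children)


def treemap(node, x, y, w, h, depth=0, out=None):
    """Return a list of (name, x, y, width, height), one entry per leaf.

    Children share the parent's rectangle in proportion to their weights.
    Even depths split along x, odd depths split along y.
    Assumes every internal node has a positive total weight.
    """
    if out is None:
        out = []
    name, size, children = node
    if not children:
        out.append((name, x, y, w, h))
        return out
    total = sum(weight(child) for child in children)
    offset = 0.0
    for child in children:
        fraction = weight(child) / total
        if depth % 2 == 0:
            treemap(child, x + offset * w, y, w * fraction, h, depth + 1, out)
        else:
            treemap(child, x, y + offset * h, w, h * fraction, depth + 1, out)
        offset += fraction
    return out


# Hypothetical directory: one file of size 2 and a folder holding two files of size 1.
tree = ("root", 0, [("a", 2, []), ("b", 0, [("c", 1, []), ("d", 1, [])])])
rects = treemap(tree, 0.0, 0.0, 4.0, 2.0)
```

In the resulting layout, leaf `a` occupies half of the display and leaves `c` and `d` a quarter each, in proportion to their sizes.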

The information slices technique [5] is another visualization approach. It uses semi-circular

discs to represent large hierarchies in two-dimensional space by dividing

the disc into multiple levels as shown in Figure 13. Deeper hierarchies can be

viewed by expanding a series of semi-discs from each section of each level.

Most hierarchical techniques are more suitable for representing dense data than

Figure 13: Information slices from Andrews, 1998 [5].

some of the visualization techniques previously mentioned. Viewers can see and

compare the closer groups of datasets rather than distributed datasets on the same

screen. Hierarchical techniques are not straightforward, as they require appropriate

mapping of data in order to interpret data efficiently.

2.1.4 Pixel-based techniques

Pixel-based techniques aim to display as many data items as possible. Each data

record is mapped onto a pixel, and each pixel is colored from a fixed range of colors

according to where the record's value falls within the range of the attribute.

In VisDB [41], each data record is mapped to individual pixels on a screen after

sorting and arranging the relevant data according to a query. The colors are chosen

by considering relevance factors. The VisDB system uses visualization approaches

to provide feedback on query results. In VisDB, there are two main techniques:

query independent and query dependent display. The query independent technique

employs line ordering or column ordering, using space-filling curves and recursive

pattern approaches to order data items based on an attribute. The query dependent

approach, including Spiral and Axes techniques, arranges the closest results from

queried data items, mapping them to colors in a color ramp onto the center of the

display. The query dependent approaches display only the region of data items

within a certain distance to the reference point as shown in Figure 14. The other

variables are represented in different windows and the distances are distinguished

by different colors in each dimension. The different parts of the database can be

visualized by changing the reference point.
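
The spiral arrangement can be sketched as follows. The sketch assumes a square spiral on the pixel grid and a precomputed distance of each data item to the query; the function names are hypothetical, and the mapping of distances to colors is omitted. It is an illustration of the idea rather than the VisDB implementation [41].

```python
def spiral_positions(count):
    """Return `count` grid offsets (dx, dy) from the center, spiralling outward."""
    x = y = 0
    dx, dy = 1, 0
    step, walked, turns = 1, 0, 0
    positions = []
    for _ in range(count):
        positions.append((x, y))
        x, y = x + dx, y + dy
        walked += 1
        if walked == step:
            walked = 0
            dx, dy = -dy, dx          # turn by 90 degrees
            turns += 1
            if turns % 2 == 0:        # every second turn the run gets longer
                step += 1
    return positions


def spiral_layout(distances):
    """Assign each item a pixel so that the closest matches sit at the center.

    distances[i] is the distance of item i to the query (smaller is closer).
    Returns a dict: item index -> (dx, dy) offset from the display center.
    """
    order = sorted(range(len(distances)), key=lambda i: distances[i])
    return dict(zip(order, spiral_positions(len(distances))))


layout = spiral_layout([5.0, 0.0, 2.0])   # item 1 matches the query exactly
```

Changing the reference point of the query changes the distances, and therefore which items occupy the central pixels.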

Figure 14: Pixel-based visualization of query dependent techniques (Spiral and Axes) from left to right, from Keim, 1996 [41].

The circle segments technique [6] is a pixel-based technique that maps data

attributes onto circle segments. Each attribute is sorted independently and arranged

line-by-line from the center to the border of the circle segment. Figure 15 shows

the pixel arrangement into circle segments of four attributes.

Pixel bar charts [39] is a technique that applies pixel-based and x-y plotting

to traditional bar charts. The bars are used to represent categorical data, while

x-y plotting and color coding inside the bars are used to represent numerical data.

Although the techniques use pixels to represent data objects for efficient space

Figure 15: The representation of the circle segments arrangement of data items onto pixels from Ankerst et al., 1996 [6].

usage, users might need considerable training to use and understand the outcome

of the visualization. In addition, sometimes users might be overwhelmed by the

mixing of colors.

2.1.5 Table-based techniques

The techniques in this group employ table characteristics such as rows and columns

to visualize datasets. Some table-based techniques integrate interaction techniques

such as Focus+Context to make the table interactive, and apply graphical

representations to display data attributes.

Table lens [58], shown in Figure 16, is a visualization technique that uses the Focus+Context

(fisheye) technique to display multidimensional data in a tabular style. It

displays a dataset within a table as horizontal bar charts with a Focus+Context

view, rather than as text. The system can be used for

visualization of large datasets represented by compressed tables. Users can also

zoom into specific areas of the table to see the distribution of specific attributes

visually.

InfoZoom [67], developed from FOCUS [68], represents attributes along rows and data

records along columns of the table. Similar to the table lens technique, InfoZoom

allows the users to gain a flexible overview of an object-attribute table through the

Figure 16: Table lens technique from Inxight, 1994 [58].

fisheye technique. The goal of this technique is to present and compare products on

the Internet. The user can progressively explore specific areas of the table through

formulation of interactive queries. However, this technique is predominantly textual

with only limited visual feedback to the user for comparing different attributes.

Neither table lens nor InfoZoom supports instant viewing of individual attributes

within specific ranges across other attributes and the entire dataset.

Polaris [72] is a table-based visualization technique which allows users to explore

multidimensional databases. A table in Polaris comprises rows, columns, and lay-

ers. The system treats nominal and ordinal data as independent variables called

dimensions, and all quantitative data as dependent variables called measures. Rows

and columns of Polaris represent the data attributes which may contain nested di-

mensions. To generate a graphical display of the table, the system uses table algebra

to specify table configuration and types of graphical display such as a bar chart or

line chart. It maps a set of records retrieved by database queries to each pane of

the table through the graphical representation. The graphical encoding employs

retinal properties [8] such as size, shape, and colors as graphical display of markers

on the pane.

An example of a technique that is not included in the five groups discussed above

is XmdvTool [78]. XmdvTool is a brushing technique, in which data points can be

selected to display interesting areas of data. This tool integrates several other

multidimensional visualization methods for projecting data onto a two-dimensional

screen. XmdvTool supports scatterplots, star glyphs, parallel coordinates, and dimen-

sional stacking approaches for displaying multivariate data. N-dimensional brushing

in XmdvTool allows users to change, highlight, select, or delete a subset of graphically

displayed objects using suitable input devices. In addition, n-dimensional brushes

have characteristics like shape, size, boundary, position, motion and display, which

allow the user to gain the perception of relationships in the n-space of selected data

points. Linking, which is an associated method of brushing, enables multiple views

to be displayed simultaneously for the same data.

2.2 The dynamic query framework

About a decade ago, Shneiderman [61] argued strongly for an intuitive and visual

mechanism for accessing and experimenting with databases. He argued that there

are two main difficulties in using a database query language for retrieving records

from a database. First, many users do not know such a query language and second,

in many situations the user does not have sufficient information about the underly-

ing database. Quite often database queries result in either no records matching the

query or too many records matching the query. The result is not helpful for the user

in either case. On the other hand, a visual query mechanism can provide the user

with useful information about the underlying database, so that the user can frame

queries visually and also see the results of these queries visually. Shneiderman calls

such visual interfaces direct manipulation interfaces.

Williamson and Shneiderman [79] mention four criteria to judge the quality of a

direct manipulation interface.

• Continuous visual representation of objects and actions of interest,

• Physical actions or labeled button presses instead of complex query syntax,

• Rapid, incremental, reversible operations whose results are immediately visi-

ble, and

• Layered or spiral approaches to learning that permit usage with minimal

knowledge.

Williamson and Shneiderman [79] illustrate these four criteria in their dynamic

home finder system. In a typical scenario, a user wants to purchase a suitable

home within an affordable price range, with a required number of rooms and in

a convenient locality. The user does not have any knowledge of the underlying

database of available homes, and she can experiment with her requirements by

relaxing them if necessary. The system allows the user to narrow down the search

progressively by gradually restricting the attributes for her search. The primary

focus of the work by Schneiderman and his co-workers [61, 4, 3, 79] is to provide

the user complete flexibility in changing the search criteria and rapid feedback when

the attribute ranges are changed.
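The idea of restricting and relaxing attribute ranges can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the `Home` record, the `dynamic_query` function and the sample data are hypothetical and are not taken from the home finder system of [79].

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Home:
    # Hypothetical attributes chosen for illustration only.
    price: int          # asking price in dollars
    bedrooms: int
    distance_km: float  # distance to a place of interest


def dynamic_query(homes, limits):
    """Return the homes whose attributes all lie inside their (low, high) limits."""
    return [
        h for h in homes
        if all(low <= getattr(h, attr) <= high for attr, (low, high) in limits.items())
    ]


# Illustrative data, not taken from any real database.
homes = [
    Home(180_000, 2, 12.0),
    Home(240_000, 3, 8.5),
    Home(260_000, 4, 15.0),
    Home(310_000, 4, 4.0),
]

# Start with strict requirements, then relax one limit at a time.
limits = {"price": (0, 250_000), "bedrooms": (4, 10)}
strict = dynamic_query(homes, limits)    # no home satisfies both limits
limits["price"] = (0, 270_000)           # relax the price limit slightly
relaxed = dynamic_query(homes, limits)   # one home now satisfies both limits
```

In an interactive system each slider movement would trigger such a re-evaluation, so the user sees immediately whether a query returns nothing, too much, or a workable set.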

Spence and Tweedie [66] argue that the traditional approach of information retrieval

through a database query language works only for a small fraction of real-world

problems. In most situations the user needs to have a clear idea about the structure

of the underlying database for framing meaningful queries. Moreover, database

queries retrieve only the records that exactly match the query. Hence, the user does

not get any idea about the records that might be just outside the query range, but of

interest to the user. Spence and Tweedie [66] put forward the idea of information

synthesis rather than information retrieval. In their opinion, problem or query

formulation is as important an activity as the retrieval of records. They emphasize

the need for the user to learn about the structure of the underlying database through

a visual query mechanism. The underlying philosophy of their Attribute Explorer

system can be described in the following sentence [66].

Given a collection of objects, each described by the values associated with a set
of attributes, find the most acceptable such object or, perhaps, a small number of
candidate objects suited to more detailed consideration.

The Attribute Explorer system [66, 75] allows user interaction satisfying the require-

ments of the dynamic query framework [79]. The user can set the upper and lower

limits for each attribute. Each attribute is displayed as a histogram in a separate

window. The x-axis for the histogram is the selected range and the y-axis is the

number of items satisfying a particular attribute value. Since the main aim of

Attribute Explorer is to help the user to narrow down the search for a particular

item or items, it is important to show the items individually. Hence, each object is

displayed separately in each histogram as a small rectangular box. A bar of the his-

togram is a stack of such boxes. When the user specifies the range for an attribute,

all the objects satisfying this range are marked with a specific color according to a

color coding scheme.

Another strong feature of the system is attribute interaction. When an object

satisfies the currently selected range for one of the attributes, the system gives

it the same color in every attribute window. This helps the user to judge the position

of the object in different attribute windows and the interrelation between attributes.

For example, if the object is a house, and the user chooses a price range between

$200,000 and $300,000, all the houses satisfying this constraint are coded with the

same color in the other attribute windows. Suppose another attribute is ‘number of

bedrooms’. The user now can see clearly the distribution of number of bedrooms in

the houses within this price range. In the information synthesis scenario of Spence

and Tweedie [66], the user may want to revise the initial choice of the price range

if the number of bedrooms is not adequate for her needs. The system also colors

objects that fail one or more attribute limits specified by the user. The purpose is

to inform the user that a change of one or more attributes may bring these objects

back within the limits of all the attributes chosen by the user. The main aim of

the system is to guide the user to a small number of objects that satisfy the user

requirements expressed as attribute ranges.
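The color coding described above can be sketched as follows. This is a minimal illustration, assuming that an object's color depends only on how many attribute limits it fails; the color names, the function names and the sample houses are hypothetical and do not reproduce the actual scheme of Attribute Explorer.

```python
from collections import defaultdict

# Hypothetical colour scheme: index = number of attribute limits an object fails.
COLOURS = ["black", "dark grey", "light grey"]


def failed_limits(obj, limits):
    """Count how many attribute limits the object violates."""
    return sum(not (low <= obj[attr] <= high) for attr, (low, high) in limits.items())


def histogram(objects, attr, limits):
    """Map each value of `attr` to a stack of colour-coded boxes, one box per object."""
    stacks = defaultdict(list)
    for obj in objects:
        misses = min(failed_limits(obj, limits), len(COLOURS) - 1)
        stacks[obj[attr]].append(COLOURS[misses])
    return dict(stacks)


# Illustrative data only.
houses = [
    {"price": 210_000, "bedrooms": 2},
    {"price": 250_000, "bedrooms": 3},
    {"price": 290_000, "bedrooms": 3},
    {"price": 340_000, "bedrooms": 4},
]
limits = {"price": (200_000, 300_000), "bedrooms": (3, 5)}
bedroom_hist = histogram(houses, "bedrooms", limits)
```

Objects that miss exactly one limit receive a distinct color, which is what tells the user that relaxing a single range would bring them back into the selection.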

The EZChooser system designed by Wittenburg et al. [80] follows a strategy similar
to the Attribute Explorer system. This approach has been further illustrated in
the paper by Lanning et al. [43]. The main focus is to use the dynamic query
mechanism to give the user complete freedom in choosing the attribute ranges.
The user can choose the range for one attribute and see the effect of that choice
on other attributes. However, the visualization framework in the EZChooser system
is quite different from that of the Attribute Explorer. Instead of histograms,
Wittenburg et al. use parallel bargrams to show the attribute ranges and the user
selections. A parallel bargram is a horizontal histogram which shows all the objects
in the database according to the increasing values of a specific attribute.

Wittenburg et al. [80] illustrate the use of the EZChooser system through a vehicle

choosing interface for prospective buyers. Categorical attributes like car make or

model are converted into ordinal attributes by assigning an ordering to the nominal

fields. The display for EZChooser has two frames. The upper frame contains the

bargrams for all the attributes in the underlying database. The lower frame displays

the objects that satisfy the constraints specified by the user in the bargrams. The

bargrams for different attributes are displayed in parallel, one below the other,

according to increasing attribute values. For example, in the bargram for the price

attribute, cars with lower prices appear on the left and cars with higher prices on

the right. As the user chooses attribute ranges

progressively, all the cars that satisfy these ranges are highlighted through coloring

in the different bargrams. As an example, if the user chooses a price range of

$20,000 to $22,000, the cars that satisfy this price range will be highlighted in the

bargrams for all other attributes. Moreover, all the cars satisfying the constraints

will be displayed in the lower frame through icons or pictures of specific cars.
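The linked highlighting across bargrams can be illustrated with a short Python sketch. It assumes that a bargram is simply the list of items ordered by one attribute; the function names and the car data are hypothetical and are not part of EZChooser.

```python
def bargram(cars, attr):
    """Order all cars by increasing value of one attribute (one row of the display)."""
    return sorted(cars, key=lambda car: car[attr])


def highlight(row, selected_ids):
    """Mark each position in a bargram row whose car is in the current selection."""
    return [car["id"] in selected_ids for car in row]


# Illustrative data only.
cars = [
    {"id": "A", "price": 21_500, "mpg": 30},
    {"id": "B", "price": 19_000, "mpg": 35},
    {"id": "C", "price": 20_500, "mpg": 27},
    {"id": "D", "price": 24_000, "mpg": 22},
]

# The user restricts the price to the range $20,000 to $22,000 ...
selected = {c["id"] for c in cars if 20_000 <= c["price"] <= 22_000}

# ... and the same cars are marked in the bargram of another attribute.
mpg_row = bargram(cars, "mpg")
mpg_ids = [c["id"] for c in mpg_row]
mpg_marks = highlight(mpg_row, selected)
```

Because every row contains all items, a selection made on one attribute shows directly where the selected items fall on every other attribute.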

Both the Attribute Explorer and the EZChooser systems emphasize the importance
of displaying individual items in the database. This is important because the user
needs to see how many objects are selected by the restriction of a specific attribute.
The objects are displayed as rectangles in the Attribute Explorer system and as
icons in the EZChooser system. However, this requirement limits how many objects
can be displayed in a histogram in Attribute Explorer or in a bargram in EZChooser.
Spence and Tweedie [66] do not address the issue of scalability, as all the datasets
they work with are small. Wittenburg et al. [80] do mention the scalability issue.
They aggregate values into bins in each bargram, so that each bin shows a collection
of items satisfying a range of values. As the user progressively narrows down the
ranges for successive attributes, the bins show fewer and fewer items and eventually
individual items. Recall that the lower frame in EZChooser shows the individual
items that satisfy the user-selected ranges on all attributes. Showing these items
also has a scalability problem, since the smallest representation of an item is a
single pixel. Hence, the number of items that can be shown is bounded by the number
of pixels available on the screen. However, items are given more and more space
as the user narrows down the search, and eventually small icons are shown when
the selected items form a small enough set. Wittenburg et al. [80] conclude that
the EZChooser system allows users to explore datasets interactively for item sets
of up to 1000 items when each item has about 10-20 attributes.

Spence [65] has emphasized the importance of sensitivity encoding to support
navigation in information space. According to Spence, the exploration of an
information space consists of four interrelated activities: interpretation, decision,
browsing and modeling. The user interprets the data in order to decide on the
direction of movement in the information space, and creates an internal model of the
underlying data through browsing. Spence [65] defines sensitivity as a specific
translation in information space together with the action required to achieve it.
Spence discusses in detail how systems like Attribute Explorer and EZChooser fit
into this framework of sensitivity encoding.

Although systems in the dynamic query framework help the user to narrow down the
search for a particular item or items, no practical system both addresses the size
scalability of datasets and allows the user to search for correlations, over
different ranges, among the attributes of large datasets. In Chapter 3, I base my
work on the dynamic query framework and present a novel visualization framework
called VisEx for exploring correlations among attributes in large multidimensional
datasets. The VisEx system also follows the paradigm of sensitivity encoding,
through a particular choice of encoding for comparing attributes. The other
visualization frameworks presented in this thesis, VisDM and VisAR in Chapter 4 and
VisOlap in Chapter 5, are also based on the dynamic query framework.

The next section gives a brief overview of data mining as a knowledge discovery
process and reviews the data mining tasks relevant to this thesis.

2.3 Data Mining

Data mining is a process for extracting knowledge or useful information from a huge

amount of data. It can be considered as a knowledge discovery process [18]. As

briefly mentioned in the previous chapter, data mining tasks have different targets:

gaining insight into the data, predicting trends in the data, or both. The data

mining results can be primarily categorized as one of the following [21]: association

rules, classification, regression, prediction, data processing and clustering. Data

mining methods are not reviewed in detail in this section, except for association

rule mining which is relevant to this thesis.

2.3.1 Association rules

Association rule mining is a data mining method that focuses on exploring
relationships among items in datasets such as transactional databases. Association
rules describe discovered patterns, or conditions under which items in the data
records frequently occur together. For example, an association rule might show which
products are frequently purchased together, or that the purchase of a particular item
implies (with some probability) the purchase of other items. Association rules of
this type are also called market basket association rules. Store managers or
marketing officers can analyze market basket association rules to learn the
purchasing behavior of their customers, to promote product sales, or to improve
their marketing plans.

The earliest well-known algorithms for generating association rules are AIS [1],
SETM [28], Apriori, and AprioriTid [2]. The Apriori algorithm constructs frequent
itemsets by generating candidate k-itemsets (Ck) and then determining the support
of each candidate itemset. Candidate generation consists of a joining step and a
pruning step. The first pass over the transactional database counts the occurrences
of individual items. Each subsequent pass checks the support of the candidate
itemsets generated from the frequent itemsets of the previous pass. The joining step
combines two frequent (k-1)-itemsets that share k-2 items, and the pruning step
discards any candidate that has an infrequent (k-1)-subset. The candidate k-itemsets
whose counts meet the minimum support are the frequent k-itemsets. The algorithm
stops when no new frequent itemsets are found. Figure 17 illustrates an example of
generating candidate itemsets and finding frequent itemsets.
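The level-wise search can be sketched compactly in Python. This is an illustrative sketch rather than the implementation of [2]: the join is written as the union of two frequent (k-1)-itemsets whose union has exactly k items, which is equivalent to requiring that they share k-2 items, and the function name, the use of absolute support counts and the sample transactions are assumptions made for the example.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {frozenset(itemset): support count} for all frequent itemsets."""
    transactions = [frozenset(t) for t in transactions]

    def count(candidates):
        # Count each candidate over all transactions and keep the frequent ones.
        counts = {c: sum(c <= t for t in transactions) for c in candidates}
        return {c: n for c, n in counts.items() if n >= min_support}

    # First pass: count individual items.
    frequent = count({frozenset([i]) for t in transactions for i in t})
    result = dict(frequent)
    k = 2
    while frequent:
        prev = set(frequent)
        # Join: union two frequent (k-1)-itemsets that share k-2 items.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune: drop candidates that have an infrequent (k-1)-subset.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in prev for s in combinations(c, k - 1))
        }
        frequent = count(candidates)
        result.update(frequent)
        k += 1
    return result


# Illustrative transactions; min_support is an absolute count here.
baskets = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
frequent_itemsets = apriori(baskets, min_support=2)
```

For these four transactions the search ends after the third pass, with {2, 3, 5} as the only frequent 3-itemset; the candidates {1, 2, 3} and {1, 3, 5} are removed in the pruning step without being counted.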

The FP-tree (Frequent Pattern tree) algorithm [22] searches for frequent itemsets
without generating candidate itemsets. Similar to Apriori, the FP-tree algorithm
obtains the frequent 1-itemsets by scanning the database. The frequent items in each
transaction are sorted in descending order of their frequency of occurrence. The
algorithm then scans the database again to construct the FP-tree. To generate
frequent itemsets from the FP-tree, the algorithm proceeds in three major steps:
constructing conditional pattern bases (for each frequent item, the prefix paths in
the FP-tree that lead to the nodes of that item) by traversing the FP-tree in the
order of the frequency table, constructing an FP-tree (called the conditional
FP-tree) from each conditional pattern base, and then recursively mining the
conditional FP-trees.
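The two database scans and the first mining step can be sketched as follows. The sketch is deliberately partial: it builds the tree and extracts conditional pattern bases, but it does not construct conditional FP-trees or mine them recursively. The class and function names, the tie-breaking rule for equally frequent items and the sample transactions are assumptions made for illustration, not details of [22].

```python
from collections import Counter, defaultdict


class Node:
    def __init__(self, item, parent):
        self.item, self.parent, self.count, self.children = item, parent, 0, {}


def build_fp_tree(transactions, min_support):
    """Two scans: count the items, then insert each sorted transaction into the tree."""
    freq = Counter(i for t in transactions for i in set(t))
    freq = {i: n for i, n in freq.items() if n >= min_support}
    root, header = Node(None, None), defaultdict(list)
    for t in transactions:
        # Keep frequent items only, most frequent first (ties broken by item name).
        path = sorted((i for i in set(t) if i in freq), key=lambda i: (-freq[i], i))
        node = root
        for item in path:
            if item not in node.children:
                node.children[item] = Node(item, node)
                header[item].append(node.children[item])
            node = node.children[item]
            node.count += 1
    return root, header


def conditional_pattern_base(item, header):
    """Prefix paths leading to the nodes of `item`, each with that node's count."""
    base = []
    for node in header[item]:
        path, parent = [], node.parent
        while parent.item is not None:
            path.append(parent.item)
            parent = parent.parent
        if path:
            base.append((path[::-1], node.count))
    return base


# Illustrative transactions only.
root, header = build_fp_tree([["a", "b"], ["b", "c"], ["a", "b", "c"], ["a", "b"]], 2)
base_c = conditional_pattern_base("c", header)
base_a = conditional_pattern_base("a", header)
```

Because transactions that share a prefix share a path in the tree, the four transactions above are stored in only four nodes, and the pattern base of an item is read off by walking from its nodes towards the root.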

Not all discovered association rules that satisfy the user-defined minimum support
and minimum confidence are interesting. The interestingness of association rules has
been studied in [64, 56]. Since the final judgment of the interestingness of
association rules depends on users, visualization of association rules has been
researched in recent years.

Figure 17: An example of candidate itemsets and frequent itemsets generated from the Apriori algorithm [2].

Visualization techniques have been integrated into data mining to help users
understand datasets and discover associations and patterns in their data. Various
methodologies have been developed to visualize association rules generated by data
mining algorithms. Prior research can be categorized into three main groups:
Table-based, Matrix-based, and Graph-based.

First, Table-based techniques are the most common and traditional approaches to

visualize association rules in the form of a table. In general, the columns of a rule

table represent the items, the number of antecedents and consequents, the support,

and the confidence of association rules. Each row represents an association rule.

Some examples of Table-based techniques are included in SAS Enterprise Miner [32]

and DBMiner [21].
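A rule table of this kind can be produced with a few lines of Python. The sketch below uses the standard definitions, in which the support of a rule is the fraction of transactions containing both sides and the confidence is that count divided by the number of transactions containing the antecedent. The function name, the column names and the sample baskets are hypothetical and do not correspond to the output format of any of the tools cited above.

```python
def rule_table(transactions, rules):
    """One row per rule: items, number of items on each side, support, confidence."""
    n = len(transactions)
    rows = []
    for antecedent, consequent in rules:
        both = sum((antecedent | consequent) <= t for t in transactions)
        ante = sum(antecedent <= t for t in transactions)
        rows.append({
            "antecedent": sorted(antecedent),
            "consequent": sorted(consequent),
            "n_antecedent": len(antecedent),
            "n_consequent": len(consequent),
            "support": both / n,
            "confidence": both / ante if ante else 0.0,
        })
    return rows


# Illustrative market basket data.
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]
rules = [({"bread"}, {"milk"}), ({"butter"}, {"bread"})]
table = rule_table(baskets, rules)
```

Sorting or filtering such rows by support or confidence is straightforward, but the table grows by one row per rule, which is why it becomes hard to read for large rule sets.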

Second, Matrix-based techniques such as the 2-D matrix of MineSet [33], the 3-D
matrix [81], and the grid place the antecedents and consequents along the coordinate
axes of a square grid. In a 3-D matrix, the height and color of columns represent
properties of the association rules such as support and confidence. Similar to the
2-D matrix, grid techniques, which rely on a frame display, represent antecedents
and consequents by a square matrix, and the color and brightness of a cell represent
the confidence and support of an association rule. For example, MineSet [33] uses
a 2-D matrix technique to visualize a large number of association rules. Wong et
al. [81] use a 3-D matrix in which both the antecedents and the consequents are
laid out on the x-y plane, but the tiles of the matrix represent rule-to-item
rather than item-to-item relationships. In this technique, blue and red columns
show the antecedents and consequents respectively, as shown in Figure 18. The
columns for the confidence and support of the association rules are scaled and
plotted at the far end of the x-y plane.
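The rule-to-item layout can be illustrated with a small text matrix. In this hypothetical sketch each row is an item and each column is a rule; the letters A and C stand in for the blue (antecedent) and red (consequent) columns, and the support and confidence columns are omitted. The function name and the sample rules are assumptions made for illustration.

```python
def rule_item_matrix(rules):
    """Rows are items, columns are rules; A = antecedent, C = consequent, . = absent."""
    items = sorted({i for ante, cons in rules for i in ante | cons})
    matrix = []
    for item in items:
        row = ""
        for ante, cons in rules:
            row += "A" if item in ante else "C" if item in cons else "."
        matrix.append(row)
    return items, matrix


# Illustrative rules: bread => milk, {bread, butter} => milk, milk => butter.
rules = [
    ({"bread"}, {"milk"}),
    ({"bread", "butter"}, {"milk"}),
    ({"milk"}, {"butter"}),
]
items, matrix = rule_item_matrix(rules)
```

In this layout a rule with several antecedent items still occupies a single column, whereas an item-to-item matrix would need one axis entry per distinct antecedent itemset.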

Figure 18: An example of visualizing mined association rules including their antecedents, consequents, support, and confidence from Wong et al., 1999 [81].

The last group consists of Graph-based techniques such as the directed graph. These
techniques use nodes to represent the items and edges to represent the associations
of items in the rules. For example, a rule A ⇒ B is represented by a directed graph
with A and B as the nodes. The edge connecting A and B has an arrow pointing to the
consequent (B) of the rule. DBMiner [21] uses a technique called Ball graph, which
is based on a directed graph. The nodes in a Ball graph are called balls, and the
size of a ball varies with the number of items it represents.

Some prior work has integrated the above techniques. For instance, CrystalClear [55]
is an integrated technique based on a grid that applies a tree technique to view
the number of items and the lists of antecedents and consequents. Another technique,
not covered by the three groups above, is Interactive Mosaic plots for visualizing
association rules [27], shown in Figure 19. As its name suggests, this technique
applies Mosaicplot visualization to represent the relationships among items in each
association rule from a contingency table. To visualize the relationships of items
in association rules, the technique displays all items on the left-hand side of the
rules using a Mosaicplot, and the right-hand side of the rules by highlighting the
corresponding categories in a barchart. The Mosaicplot technique represents each
cell of a table by a bin whose size varies with the number of occurrences of items
in the cell.

Figure 19: An example of visualizing with Mosaic plots from Hofmann et al., 1999 [27].

Although there are many existing algorithms for association rule mining, most of
them are fully automatic. Incorporating human knowledge into automatic association
rule mining to retrieve the association rules of interest remains a challenge.
Chapter 4 investigates a new technique for visual mining of association rules that
allows humans to participate in the mining process. I also present a hybrid
technique for visualizing mined association rules, which reduces the complexity of
visualizing a large number of association rules on a single screen.

2.4 Visualization for OLAP

On-line Analytical Processing (OLAP) has been a very active area of research in

recent years. Only OLAP research that employs information visualization is

discussed in this thesis.

One popular way of viewing OLAP results is a textual presentation of the queried
results [14, 54, 74], such as a pivot table. ADVIZOR [17] provides visualization
tools for exploring databases through visual query and analysis. Three techniques,
Single Measure, Multiple Measure, and Anchored Measures, are parts of this tool.
The Single Measure approach represents a measure by a 3D bar chart in a centered
window called the 3D Multiscape. The height of each bar shows a measure value. The
Multiple Measure approach uses a scatterplot rather than the 3D Multiscape to
visualize two measures along the x and y axes. Color can display a third measure.
The Anchored Measures approach combines ParaBox, bubble plots, parallel coordinates,
and box plots to visualize multidimensional data, as shown in Figure 20. Bubble plot
axes represent dimensions and box plot axes represent measures. Both kinds of axes
are arranged in the style of parallel coordinates [34]. The system allows drilling
down into lower levels of abstraction; however, the user can drill down along only
one dimension at a time.

Figure 20: An example of visualization by Anchored Measures from Eick 2000 [17].

Polaris is used for visualizing multidimensional databases as well as for viewing
data cubes. It is an interactive visual exploration tool that employs a table-based
visualization technique [72]. The graphical display of the table looks similar to a
pivot table in a textual format. An extension of Polaris [73] provides an additional
tool for interactive exploration of hierarchical structures of datasets. The system
allows users to get overviews of data and drill down into lower levels, similar to
a pivot table approach. In contrast, the technique presented in Chapter 5 allows
users to explore independent overviews of data, low levels of detail, and any
particular region of interest at any time during navigation.
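The overview and drill-down operations mentioned above amount to aggregating a measure at different levels of a dimension hierarchy. The following Python sketch illustrates this under simple assumptions: the fact records, the year and quarter hierarchy and the `aggregate` function are hypothetical and are not taken from Polaris or any other tool discussed here.

```python
from collections import defaultdict


def aggregate(facts, dims, measure):
    """Sum a measure over the facts, grouped by the given dimension levels."""
    totals = defaultdict(float)
    for fact in facts:
        totals[tuple(fact[d] for d in dims)] += fact[measure]
    return dict(totals)


# Illustrative fact table with a two-level time hierarchy (year, quarter).
sales = [
    {"year": 2004, "quarter": "Q1", "region": "West", "amount": 120.0},
    {"year": 2004, "quarter": "Q2", "region": "West", "amount": 80.0},
    {"year": 2004, "quarter": "Q1", "region": "East", "amount": 50.0},
    {"year": 2005, "quarter": "Q1", "region": "West", "amount": 70.0},
]

overview = aggregate(sales, ["year"], "amount")            # rolled-up view
detail = aggregate(sales, ["year", "quarter"], "amount")   # drill down along time
```

A pivot table presents such aggregates as text; a visual tool has to show the overview and the detail together while keeping track of which levels the user has opened.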

Maniatis et al. [47] have developed a model for OLAP screens, called the Cube

Presentation Model (CPM), and applied a visualization technique, table lens, to

their model. The CPM model consists of two layers, logical and presentational.

The logical layer deals with data retrieval while the presentational layer is for data

presentation. The model employs cross-joins for retrieving maximum, minimum,

and closest average values. The main goal of the system is to determine the

window of interest for viewing the data in particular areas in large overviews of

the cross-join window. The system does not provide drilling down and rolling up

features for exploring multiple levels along different hierarchies.

Although visualization has been integrated into some OLAP tools, no practical
technique has been developed that provides visual feedback and interactive
visualization for exploring hierarchical data. Chapter 5 details an interactive
visualization tool for analytical tasks that reduces the user's burden of
remembering exploration paths.

2.5 Summary

This chapter has reviewed prior research relevant to this thesis. It has given an

overview of the field including visualization techniques, the dynamic query framework,

data mining, visual data mining and visualization for analytical tasks. Several new

visualization techniques related to these topics are now presented in the subsequent

chapters of this thesis.


Chapter 3

A New Technique for Visual Exploration of Large Datasets

3.1 Introduction

In this chapter, a new technique is presented for visual exploration of large mul-

tidimensional datasets for discovering correlations among attributes. There is an

explosion of datasets in many different areas like business, government and scientific

disciplines. The demand to extract meaningful trends and correlations from these

datasets is also increasing.

Data visualization and visual data exploration play important roles in extracting

trends and correlations in large datasets [15]. It is quite often impossible for a

human expert to understand large multidimensional datasets through manual ex-

amination or by viewing the data tables in text format. Visualization tools are

extremely important for this purpose. A visualization tool can quickly show trends

and correlations in the underlying dataset that are impossible to find through other

means. Although well-known visualization techniques such as parallel coordinates [34],
star coordinates [36], and the scatterplot matrix [13] are accepted and commonly
used, most of them suffer from occlusion when visualizing large datasets. Moreover,
these techniques are not very useful for visualizing correlations among attributes
of large datasets.

The original dynamic query framework was introduced by Shneiderman [61], Ahlberg

et al. [4], Williamson and Shneiderman [79] and Ahlberg and Shneiderman [3].

Many systems including MultiNav [43], EZChooser [80], and Attribute Explorer [75,

66] have been developed and extended, based on this framework. However, these

systems do not scale well for large datasets and their focus is on searching for a

single item or a few particular items, rather than visualizing correlations among

attributes.

VisEx is a new tool for exploratory visualization of large multidimensional datasets.
This tool allows the user to visualize a large dataset dimension by dimension, or
attribute by attribute. VisEx is based on the dynamic query framework in the sense
that it allows even novice users to experiment with large multi-attribute datasets
and to frame meaningful queries. Users can explore the dataset through what-if
analysis by imposing restrictions on the values of the different attributes. Once a
range has been restricted for one attribute, the tool displays, for each of the
other attributes, all the records that satisfy this range. VisEx is a scalable
system that handles both small and very large multi-attribute databases. The system
also gives users flexible granularity for viewing the records in a database,
depending on the size of the database: as the number of records increases, the
granularity at which records are shown becomes coarser. Moreover, VisEx can be used
for selecting specific items by restricting the values of the attributes
progressively, just as in Attribute Explorer [66, 75] and EZChooser [80].

To motivate the need for such a system, consider an example scenario in which a
user needs to compare the different attributes of a dataset to learn about the
correlations between them. Consider a commonwealth database of all primary school
children in Australia (this scenario is equally valid for other countries with
some modifications). Suppose each child has six attributes in the database: age
(AGE), parents' median income (MINCOME), parents' median educational background
(MED), whether the child attends a private, Catholic, Anglican or state school
(TSCHOOL), the literacy level of the student: poor, average, satisfactory or
excellent (LITLEVEL), and whether the child comes from a single-parent or two-parent
family (NPARENT).

Assume that a state or the commonwealth education department is trying to improve
the literacy levels of primary school children through new policies or initiatives.
The purpose of a visual analysis in this case is not to choose specific records,
unlike in the Attribute Explorer [66, 75] or EZChooser [80] systems. Rather, the
emphasis is on framing hypotheses and testing them by restricting different
attributes. For example, a policy maker may hypothesize that children in the lower
primary age group (6-10) attain higher levels of literacy if they come from
two-parent homes rather than single-parent homes. The policy maker can test this
hypothesis in the following way. She first selects the AGE attribute for display
and a range 6-10 for the AGE attribute. She next selects the LITLEVEL attribute for
display. Only the student records with age in the range 6-10 are displayed in the
LITLEVEL attribute barstick (my term for a horizontal histogram, as described in the
next section). In other words, the display of the LITLEVEL attribute is constrained
by the selection on the AGE attribute. She then restricts the range of the LITLEVEL
attribute to 3-4 (satisfactory or excellent). Next, she chooses the NPARENT
attribute, and only the records that satisfy the restrictions on the previous two
attributes are displayed in the barstick for the NPARENT attribute. Since the
NPARENT attribute has two quantitative levels, 1 (single-parent) and 2 (two-parent),
the policy maker can easily check the distribution of the highlighted records over
these two levels and see whether her hypothesis is correct. Suppose she finds no
difference between the two levels; in other words, children who attain a higher
level of literacy come equally from single-parent and two-parent homes. She may now
want to test whether children in the upper primary age group satisfy her hypothesis.
She can change the range of the AGE attribute to 11-14, and only the records that
satisfy this restriction will be displayed in the other barsticks. Hence, she can
immediately check her second hypothesis visually. This is only an example scenario,
illustrating the power of VisEx in providing rapid visual feedback so that the user
can test many hypotheses quickly without any detailed knowledge of the underlying
database.
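The sequence of restrictions in this scenario can be written down as a small Python sketch. It only mimics the filtering logic on a handful of invented records and says nothing about how VisEx stores or draws the data; the `restrict` and `literate_by_family` functions, the numeric coding of LITLEVEL (1 for poor up to 4 for excellent) and the sample records are assumptions made for illustration.

```python
from collections import Counter


def restrict(records, attr, low, high):
    """Keep only the records whose value of `attr` lies in [low, high]."""
    return [r for r in records if low <= r[attr] <= high]


# Hypothetical records; LITLEVEL 1-4 = poor to excellent, NPARENT 1 or 2.
children = [
    {"AGE": 7, "LITLEVEL": 4, "NPARENT": 2},
    {"AGE": 8, "LITLEVEL": 3, "NPARENT": 1},
    {"AGE": 9, "LITLEVEL": 2, "NPARENT": 1},
    {"AGE": 10, "LITLEVEL": 3, "NPARENT": 2},
    {"AGE": 12, "LITLEVEL": 4, "NPARENT": 1},
    {"AGE": 13, "LITLEVEL": 1, "NPARENT": 2},
]


def literate_by_family(records, age_low, age_high):
    """Restrict AGE, then LITLEVEL to 3-4, and count the survivors per NPARENT level."""
    in_age_group = restrict(records, "AGE", age_low, age_high)
    literate = restrict(in_age_group, "LITLEVEL", 3, 4)
    return Counter(r["NPARENT"] for r in literate)


lower = literate_by_family(children, 6, 10)    # lower primary age group
upper = literate_by_family(children, 11, 14)   # upper primary age group
```

Changing only the AGE range and re-reading the NPARENT counts corresponds to the second hypothesis test in the scenario: the later restrictions stay in place and are simply re-applied to the new selection.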

This chapter is organized as follows. Section 3.2 introduces the terminology used
throughout this chapter. In Section 3.3, I discuss the system design, including the
benefits and requirements of the visualization system. The VisEx system architecture
is presented in Section 3.4. The details of the subsystems are then described,
followed by user interaction in the system along with an example. Analysis scenarios
of the system and a user study are given in Sections 3.5 and 3.6, respectively.
Section 3.7 summarizes the contributions of the chapter.

3.2 Terminology

A row in a relational table or a flat file can be referred to as a tuple, item, or record

and a column as a field, dimension, or attribute. However, in this chapter, I refer to

a row as a record or item and a column as an attribute or dimension. The display

in VisEx consists of two separate visual entities called barstick and bar as discussed

below.

Barstick

A barstick is a histogram placed horizontally as shown in Figure 21. Each attribute

of the underlying dataset is represented by a separate barstick. Each barstick is

initially empty, but has the potential to display all the records in the dataset.

VisEx starts displaying the records when the user starts restricting the ranges of

the different attributes. The length of each barstick is restricted to be the same to

optimize the use of screen space.


Figure 21: An example of querying multiple attributes in VisEx. The figure shows three barsticks for three selected attributes. The attributes are selected by the user in this order. (a) The first barstick is displayed by sorting the value of the first queried attribute from the dataset. The red-colored area and the partitions (represented by two black vertical lines) represent the selected range of this attribute. The color of the selected bars is red since all the records in this range are selected and each bar represents the maximum number of records. (b) The second barstick displays the second attribute of the records. The colored bars represent the records whose first attributes are within the selected range in the first barstick. The coloring of the bars shows the density of the records. For example, the first set of bars in the second barstick is colored in yellow and the second set of bars is colored in blue. This means that the first set of bars has a higher density of records. The user-selected range in the second barstick is shown by the two vertical black lines. (c) The third barstick displays the records that have their first and second attributes within the selected ranges in the first and second barsticks. Again, the color of the bars shows the density of the records. The other interaction technique is shown by the highlighted bars in gray in the three barsticks. The user selects a group of bars in the second barstick by clicking on them. All the records affected by this selection are highlighted in gray in the other barsticks.


Bar

Each bar spans the width of a barstick. Each bar represents at least one record from

the dataset. However, there is no upper limit on how many records a single bar

can represent. A variable color coding scheme is used for representing the number

of records represented by a bar. The scheme varies from red to blue along the

spectrum. If all the records in the dataset are displayed in a barstick, the color of

each bar is the same.

The details of how to visualize and how to map data records into bars and data

attributes into barsticks are further provided in Section 3.4.

3.3 VisEx system design

VisEx has been designed to discover relationships, correlations, distributions and

trends in large datasets, and overcome the occlusion problem. To overcome the

occlusion problem, the system is designed to reduce the number of graphic primitives

on the screen. I do this by displaying, for each bar, a quantitative estimate of its records through a color scheme. The granularity of the data is adjusted according to the number

of records to be displayed. In other words, each bar may show a higher number of

records if the number of records to be displayed is large. The color of a bar indicates

the number of records represented by the bar. For example, red indicates a higher

density of records and blue indicates a lower density of records. The other colors

in between indicate different degrees of densities. The system has four benefits,

namely simplicity, scalability, flexibility, and dynamism.

• Simplicity: The visualization is designed to be clear and understandable within a limited amount of screen space. In this technique, barsticks and bars (as discussed further in the next section) are exploited for visualizing large multidimensional datasets.


• Scalability: VisEx also supports the display of small as well as large numbers of records in a limited screen space through the same approach used to overcome occlusion: a quantitative estimate of the records conveyed by the color scheme of the bars.

• Flexibility: The system provides human experts capabilities of selection and

exploration to conduct what-if experiments on large datasets.

• Dynamism: The general idea of dynamism in VisEx is to generate dynamic

visualization which is capable of reconfiguring the attributes and handling

dynamic analysis of large amounts of multidimensional data.

Any visual analysis tool needs to meet requirements [40, 72] for effective visualiza-

tion of large amounts of data. VisEx provides specific features to handle the display

of large datasets. These features are:

• Data-dense displays: Large numbers of data records are transformed and displayed in a single barstick.

• Screen management: Screen space is managed to avoid screen occlusion, which limits the ability of analysts to interpret data from the visualization.

• Locality: Data records are arranged by ordering and grouping similar attribute values close to each other. Data locality helps analysts to obtain a clear

view of the distribution of queried subsets among all the data records.

• Filtering: Analysts are able to generate queries to limit the range of data

which they are interested in so that unrelated data are not displayed on the

screen.


3.4 VisEx system architecture and implementation

The VisEx system architecture consists of four main components: connection and transformation, querying, visualizing, and interaction, as shown in Figure 22. The system connects to the flat files or databases that users provide. Users can set up the queries or data subsets of interest through the interaction tool. The system then retrieves the queried subsets from the relational database and organizes the data records in barsticks according to the queries. After arranging the queried results, VisEx displays the subsets of data records that satisfy the constraints in the query.

This visual feedback helps users to understand the relationships and correlations

of the queried subsets so that they can set up a sequence of queries to discover

deeper correlations of attributes in datasets. The interaction allows users to obtain

or browse details of the subsets, including regenerating new attribute and range

selection queries. To enable the system to visualize data from a variety of data

sources, VisEx allows accessing data from both flat files and relational databases.

The connection to database servers is made through an ODBC Driver manager.

VisEx has been implemented in Visual C++ and tested on many datasets coming

from flat files and from relational databases through ODBC. ODBC enables pro-

gramming applications to access a variety of databases depending on the availability

of an ODBC Driver for each DBMS.

3.4.1 Connection and Transformation in VisEx

The connection and transformation component in VisEx supports communication with datasets. VisEx accesses flat files (i.e., text files), where each flat file consists of rows and columns with delimiters, such as a tab or comma, between the columns. A relational database is another commonly used form of data storage. To increase the capability and applicability of the visualization tool in representing


Figure 22: VisEx System architecture

data from a variety of data sources such as MS SQL server, Oracle DB, and MS

Access, VisEx supports visualizing data from these large data sources via ODBC.

Relational databases in Microsoft Access have been used for the experiments. VisEx communicates with a database in three major steps, namely, connecting to an

ODBC data source, executing SQL statements, and retrieving data. The execution

of SQL statements allows records from the database to be retrieved, updated, and

created. To communicate with the data source (or DBMS), an application needs to

link to the ODBC Driver Manager implemented in ODBC32.dll on the Microsoft

Windows platform. The ODBC Driver Manager then passes ODBC function calls

from the application to the appropriate ODBC drivers to set up the communication.

ODBC drivers process all ODBC function calls, such as calls for connecting to the


specified data source.
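The flat-file input described at the start of this subsection can be sketched as follows. This is only an illustrative Python sketch, not the Visual C++ code of VisEx: the function name `read_flat_file`, the sample data, and the assumption that the first row holds the attribute names are all hypothetical.

```python
import csv
import io


def read_flat_file(text, delimiter="\t"):
    """Split delimited text into attribute names and a list of records.

    Assumption for this sketch: the first row holds the attribute names.
    """
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    rows = [row for row in reader if row]  # skip blank lines
    header, body = rows[0], rows[1:]
    records = [dict(zip(header, row)) for row in body]
    return header, records


# Hypothetical two-attribute file with comma delimiters.
sample = "age,income\n34,52000\n51,61000\n"
names, records = read_flat_file(sample, delimiter=",")
```

Each record becomes a mapping from attribute name to value, which matches the row-as-record and column-as-attribute terminology of Section 3.2.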

There are seven major steps in generating ODBC function calls for setting up communication between an application and a DBMS, as follows.

1. Adding data source

2. Allocating handles for application

3. Connecting application to data source

4. Retrieving data source and connection information

5. Executing SQL

6. Retrieving results

7. Disconnecting from data source and freeing all allocated handles
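The connect, execute, retrieve, and disconnect pattern in the steps above can be illustrated with a small sketch. Note that this is a stand-in and not ODBC: it uses Python's built-in `sqlite3` module with an in-memory database, so steps 1 and 4 (registering a data source and querying its connection information) have no counterpart here. The table name `houses` and its values are invented for illustration.

```python
import sqlite3

# Steps 2-3: obtain a connection (sqlite3 stands in for an ODBC data source).
conn = sqlite3.connect(":memory:")
try:
    cur = conn.cursor()
    # Step 5: execute SQL statements (create, populate, and query a toy table).
    cur.execute("CREATE TABLE houses (medv REAL, crim REAL)")
    cur.executemany(
        "INSERT INTO houses VALUES (?, ?)",
        [(34.0, 0.05), (12.5, 8.2), (45.1, 0.02)],
    )
    cur.execute(
        "SELECT medv, crim FROM houses WHERE medv BETWEEN ? AND ? ORDER BY medv",
        (30, 50),
    )
    # Step 6: retrieve the results of the range query.
    rows = cur.fetchall()
finally:
    # Step 7: disconnect and release the connection.
    conn.close()
```

The range query with `ORDER BY` mirrors the kind of sorted, range-restricted subset that VisEx needs for a barstick, although the SQL that VisEx actually issues is not given in the text.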

3.4.2 Visualizing multiple attributes in VisEx

Before presenting the details of the querying component in the next section, this subsection introduces how the system visualizes the queried results and provides visual feedback. VisEx uses barsticks to represent the attributes in a dataset, one

barstick for each attribute. The barsticks are created dynamically, according to

the number of selected attributes. The minimum and maximum scales and the

name of the attributes are shown with the barstick after users select that queried

attribute. All data records of the selected attribute are arranged in ascending order

and partitioned into bars based on the number of data records in the dataset, as

shown in Figure 21.

A bar within a barstick represents a group of records which have closely related

values for the attribute represented by the barstick. Each bar represents the same

number of data records except the last bar which might contain fewer data records.

The number of data records in the last bar is the remaining number of data records


after dividing all data records equally among the other bars. The number of data

records represented by each bar is derived from dividing the total number of data

records by the length of a barstick, i.e., the total number of bars a barstick can

accommodate. In the implementation, each barstick accommodates 500 bars. The size of the bars depends on the number of data records in each dataset. Suppose there are two datasets: the first contains 500 data records and the second contains 250. The bars for the first dataset are 1 pixel wide, half the width of the bars for the second dataset, which are 2 pixels wide.
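The partitioning just described can be sketched in a few lines. This is a minimal illustration under stated assumptions rather than the VisEx implementation: the function name `partition_into_bars`, the rounding up of the records-per-bar quotient, and the 500-pixel barstick length are assumptions made for the example.

```python
import math


def partition_into_bars(values, max_bars=500, barstick_pixels=500):
    """Sort the values and split them into equally sized bars.

    Returns the list of bars and the pixel width of one bar.
    Assumptions: records per bar is rounded up, and a barstick is
    barstick_pixels wide with room for at most max_bars bars.
    """
    ordered = sorted(values)
    if not ordered:
        return [], 0
    per_bar = math.ceil(len(ordered) / max_bars)  # at least one record per bar
    bars = [ordered[i:i + per_bar] for i in range(0, len(ordered), per_bar)]
    bar_width = barstick_pixels // len(bars)
    return bars, bar_width


bars_500, width_500 = partition_into_bars(range(500))
bars_250, width_250 = partition_into_bars(range(250))
bars_1201, width_1201 = partition_into_bars(range(1201))
```

With 500 records the bars are 1 pixel wide and with 250 records they are 2 pixels wide, as in the example above. With 1,201 records each bar holds three records except the last, which holds the single remaining record.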

Data attributes in datasets, including databases, have different data characteristics,

which can be categorized as categorical and quantitative. VisEx converts categorical

data to a numerical form and treats them as quantitative data. For example, there

are two genders of people in census data and in VisEx the genders are represented

by 0 and 1 as males and females respectively. Analysts can gain insight into the

distribution of values of records for an attribute by observing the density and color

of the bars in the corresponding barstick.
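The conversion of categorical values to numbers can be sketched as follows. The text does not state how VisEx assigns the codes, so numbering the categories in order of first appearance is an assumption, as is the function name `encode_categories`.

```python
def encode_categories(values):
    """Map each distinct category to an integer code.

    Assumption: codes are assigned in order of first appearance.
    """
    codes = {}
    encoded = []
    for value in values:
        if value not in codes:
            codes[value] = len(codes)
        encoded.append(codes[value])
    return encoded, codes


encoded, codes = encode_categories(["male", "female", "female", "male"])
```

Keeping the `codes` mapping allows the numeric labels shown on a barstick to be translated back to the original category names.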

Figure 21 shows how bars are organized into each barstick. The figure also illustrates

three barsticks for three selected attributes. The attributes are selected by the user

in this order.

Suppose there are N records in the dataset and a barstick can accommodate M bars. Then typically each bar represents N/M records if all the records are displayed.

The bars are sorted according to the range of attribute values in the barstick. For

example, if the attribute is the age of people, the barstick may have a minimum

value of 0 and maximum value of 100. If each barstick can accommodate 100 bars,

each bar will represent the number of people with the same age. For example, the

50-th bar will represent all people of age 50 in the dataset. Also, each bar will be

represented by the highest color from the range of colors, which is red.

However, in most analysis scenarios in VisEx I am not interested in displaying all

the records. Instead, only those records that satisfy the constraints specified by

the user are displayed. To continue with the example, suppose there are 10,000 people with age 50 in the dataset. The user has constrained some other attribute in the dataset so that only 800 of these 10,000 records in the 50-th bar need to be

displayed. In that case, an appropriate color is chosen for displaying the 50-th bar.

Hence, the color of a bar gives an intuitive meaning to the number of records that

is represented by that bar.

Color Selection

A color scale has been used to distinguish between different ranges of values in

both categorical and quantitative data and to represent the distribution of the

data [20]. The color scale should satisfy these requirements: order, uniformity and

representative distances, and no artificial boundaries [44, 45].

I decided to use colors along the full spectrum (blue to red) for coloring each bar.

There are two reasons behind this decision. First, the main aim of VisEx is to help

an analyst to discover correlations in large multidimensional data sets. Hence, I

am interested in displaying only quantitative estimates and not an exact number of records. The second reason is that users can easily recognize and distinguish the colors. VisEx associates a quantitative estimate of the number of records with each of these colors. However, the color scheme used in VisEx is easy to change.

The following example gives a perspective of the quantitative estimate provided

by VisEx. Assume that a dataset contains one million records and 500 bars per

barstick are used. Hence, each bar represents 2,000 records if all the records in the dataset are displayed. In this case, red is used for coloring each bar. Now consider a scenario where only 864 records need to be displayed in a bar because only these records satisfy the constraints imposed by the user. Suppose a color scheme has ten levels along the full spectrum. The i-th level of the color scheme, 1 ≤ i ≤ 10, is used to represent a number of records between 200 × (i − 1) and 200 × i. Hence, the 5-th level of the color scheme is chosen to color a bar representing 864 records, since 864 lies between 800 and 1,000.
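The level computation in this example can be written as a short sketch. The function name `color_level` is hypothetical, and the handling of counts that fall exactly on a boundary (assigned here to the lower level) is an assumption, since the text does not say which level such counts belong to.

```python
def color_level(shown, bar_capacity, levels=10):
    """Return the color level (1..levels) for a bar showing `shown` records.

    bar_capacity is the number of records the bar holds when every record
    is displayed. Level `levels` is the red end of the scale. A result of
    0 means the bar is empty and nothing is drawn.
    """
    if shown <= 0:
        return 0
    # Integer ceiling of shown * levels / bar_capacity.
    level = (shown * levels + bar_capacity - 1) // bar_capacity
    return min(levels, level)


level_864 = color_level(864, 2000)
level_full = color_level(2000, 2000)
```

For the one-million-record dataset above, a bar showing 864 of its 2,000 records receives level 5, and a full bar receives level 10, which corresponds to red.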


The minimum and maximum boundaries of the selected range are shown by two

black lines. The minimum and maximum range values are written in blue and red

respectively below the corresponding barstick. The minimum and maximum range

could be the same for categorical attributes.

The user interface of VisEx is shown in Figure 23. The left panel allows the user to

load a database for exploration and also choose the attributes one by one. The user

can switch between the three modes of exploration (normal, comparison, and fixed) at any time during the exploration. It is possible to fix an attribute by checking a box

during fixed mode exploration and the user can specify a range by typing the lower

and upper limits for the range. Also, the user can choose different quantitative

values of an attribute for comparison in the comparison mode. The right panel is

used for displaying the barsticks corresponding to the chosen attributes. Both the

panels can be scrolled up and down to choose and display any number of attributes.

3.4.3 Querying in VisEx

The querying subsystem in VisEx arranges subsets of data records according to the attributes that analysts select and the ranges they specify. The querying process can be divided into three phases:

• Ordering: VisEx sorts attribute values of selected attributes in datasets

according to the specified queries.

• Grouping: VisEx places similar attribute values close to each other into the

same group.

• Filtering: The specified range selection is used to filter unrelated data records

or to hide irrelevant data records, i.e., the records that do not satisfy the

constraints of the query, from the screen.
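The three phases above can be sketched as a single function. This is an illustrative sketch only: the name `order_group_filter` is hypothetical, grouping is simplified to exact equality of attribute values, and the record values are invented.

```python
from itertools import groupby


def order_group_filter(records, attribute, keep):
    """Order records by an attribute, group equal values, then filter.

    `keep` is a predicate expressing the constraints of the query.
    Groups left empty after filtering are hidden.
    """
    # Ordering: sort by the selected attribute.
    ordered = sorted(records, key=lambda r: r[attribute])
    # Grouping: place records with the same attribute value together.
    groups = [
        (value, list(members))
        for value, members in groupby(ordered, key=lambda r: r[attribute])
    ]
    # Filtering: keep only the records that satisfy the query.
    visible = []
    for value, members in groups:
        shown = [r for r in members if keep(r)]
        if shown:
            visible.append((value, shown))
    return visible


records = [
    {"crim": 0.5, "medv": 35},
    {"crim": 0.1, "medv": 40},
    {"crim": 0.5, "medv": 10},
    {"crim": 9.0, "medv": 8},
]
result = order_group_filter(records, "crim", lambda r: 30 <= r["medv"] <= 50)
```

Here the crime-rate groups 0.1 and 0.5 stay visible because each contains a record with a median value between 30 and 50, while the group at 9.0 is hidden.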

A dataset for Boston house prices reported by Harrison and Rubinfeld [23] is used

as an example in visualizing and demonstrating both querying and user interaction


Figure 23: The user interface of VisEx is explained through an example. The left panel allows the user to choose the attributes in any order. The user can select an attribute from a drop-down list of attributes. In this example, three selected attributes: median value of owner occupied homes, per capita crime rate by town, and residential zone proportion, are queried in this order. This example shows the VisEx screen with a high value range of median price (30-50), a low percentage of per capita crime (.01-1), and a higher (20-100) percentage of residential zone selection. The minimum and maximum ranges are shown in blue and red respectively below a barstick. The result shows that the per capita crime is low for areas where median value of owner occupied homes is high (second barstick). Similarly, residential zone proportion is high (indicating wealthy suburbs) when median value is high and crime rate is low (third barstick). Finally, if the residential zone proportion is selected at the higher end in the third barstick, the non retail business proportion, i.e., the number of industrial sites, is low in the fourth barstick.


(as presented in the next section) in the VisEx system. This dataset is available

from the StatLib-Datasets Archive of Carnegie Mellon University [71]. The dataset contains approximately 15 attributes (e.g., median value of owner-occupied homes, per capita crime rate by town, proportion of residential land zoned for lots over 25,000 sq. ft., etc.) and 500 house records. Many of the attributes in this

dataset are categorical. For example, the attribute ‘median value of owner occupied

homes’ uses categories between 5 and 50, where each of these categories actually

represents a range of prices.

VisEx has three modes in which a user can generate queries and explore a multidi-

mensional dataset. I call these three modes normal, fixed and comparison modes.

The user can switch between these three modes depending on the requirements of

the exploration.

Normal mode exploration

The querying process starts when the user selects one of the attributes as the first

attribute. The user also selects a range for this first attribute to be displayed in

the first barstick. I call the first attribute att(1) and its range range(1). Next, the

user chooses the second attribute (att(2)). VisEx displays only those records in the

barstick of att(2) with their att(1) values within range(1). The user now selects a

range of values for the second attribute att(2) from among the records displayed

in the barstick for att(2). This process continues for the subsequent attributes. In

general, the barstick for the N -th attribute att(N) displays the records that have

their att(i) value within range(i), for 1 ≤ i ≤ N − 1.
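The rule for normal mode can be sketched as follows. The function name `normal_mode_view` and the three sample records are invented for illustration. The attribute names and ranges follow the Boston housing example used in this chapter.

```python
def normal_mode_view(records, selections, n):
    """Records displayed in the barstick of the n-th attribute (1-based).

    `selections` is an ordered list of (attribute, (low, high)) pairs,
    i.e. att(1), range(1), att(2), range(2), and so on. Only the ranges
    of the first n - 1 attributes restrict the n-th barstick.
    """
    earlier = selections[: n - 1]
    return [
        r for r in records
        if all(low <= r[att] <= high for att, (low, high) in earlier)
    ]


records = [
    {"MEDV": 35, "CRIM": 0.5, "ZN": 80},
    {"MEDV": 45, "CRIM": 3.0, "ZN": 10},
    {"MEDV": 12, "CRIM": 0.2, "ZN": 90},
]
selections = [("MEDV", (30, 50)), ("CRIM", (0.01, 1)), ("ZN", (20, 100))]
first = normal_mode_view(records, selections, 1)
second = normal_mode_view(records, selections, 2)
third = normal_mode_view(records, selections, 3)
```

The first barstick is unrestricted, the second shows the two records with MEDV between 30 and 50, and the third shows the single record that also has CRIM between 0.01 and 1. Because each barstick depends only on earlier selections, the result depends on the order in which the attributes are chosen.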

Fixed mode exploration

The fixed mode exploration is suitable when the user has in mind some hard con-

straints on the ranges of some of the attributes. In other words, the user is sure

about the ranges of two or more attributes (called chosen attributes) and she wants


to experiment with the other attributes after imposing the ranges on the chosen

attributes. An attribute can be fixed any time during the exploration. Suppose the

user chooses to fix the i-th, j-th and k-th attributes and chooses range(i), range(j)

and range(k) for these three attributes. Only the records that satisfy all these three

ranges will be displayed in the three barsticks for the three fixed attributes i, j and

k. The user can subsequently choose other attributes that are not fixed (called float-

ing attributes) and the corresponding ranges for these floating attributes. Only the

records that satisfy all the ranges for all the fixed attributes are displayed in the

barstick of a floating attribute. There is no limit on the number of fixed attributes.

Note that the normal mode exploration can be viewed as a fixed mode exploration

when only the very first attribute is fixed.

Figure 24: An example of fixed mode exploration. The first three attributes in this case are fixed and hence only the records that satisfy all the three ranges are displayed in the first three barsticks. The next two attributes are floating, i.e., the user can experiment with different ranges for these two attributes. This is an analysis of the dataset in [23]. If the median value of homes is high, per capita crime is low and residential zone proportion is high (the first three fixed attributes), the houses tend to be new with a larger number of rooms (the last two attributes).

The fixed mode operation is useful in situations when the user has made up her

mind about the ranges of some of the attributes and wants to experiment with

ranges of other floating attributes. The user can be sure that any records displayed

for any of the floating attributes already satisfy the restrictions of the fixed at-

tributes. Moreover, the selected records due to the fixed attributes do not have

any dependency on the order of selection unlike in the normal mode exploration.

Figure 25: An example with five queried attributes: median value of owner occupied homes, per capita crime, residential zone proportion, age of house, and number of rooms. The red vertical bars in each barstick represent the selection by the user. The result shows that in areas where the median price is high and per capita crime is low, higher residential zone proportions and new houses with a higher number of rooms are found.

To compare fixed mode and normal mode exploration, refer to Figures 24 and 25. The selected attributes in Figure 24 are the same as in Figure 25, except that the first three attributes are fixed in Figure 24. The barsticks of the first three fixed attributes display only the records satisfying all three specified ranges.
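Fixed mode can be sketched in the same spirit. The names `satisfies` and `fixed_mode_view` and the sample records are hypothetical. The sketch assumes, as stated above, that every barstick (fixed or floating) shows exactly the records that satisfy all fixed ranges.

```python
def satisfies(record, ranges):
    """True if the record lies inside every (low, high) range given."""
    return all(low <= record[att] <= high for att, (low, high) in ranges.items())


def fixed_mode_view(records, fixed_ranges):
    """Records shown in every barstick, fixed or floating, in fixed mode."""
    return [r for r in records if satisfies(r, fixed_ranges)]


records = [
    {"MEDV": 35, "CRIM": 0.5, "ZN": 80, "AGE": 12},
    {"MEDV": 45, "CRIM": 3.0, "ZN": 10, "AGE": 60},
    {"MEDV": 42, "CRIM": 0.2, "ZN": 5, "AGE": 30},
]
fixed = {"MEDV": (30, 50), "CRIM": (0.01, 1), "ZN": (20, 100)}
# The same constraints listed in a different order of selection.
reordered = {"ZN": (20, 100), "MEDV": (30, 50), "CRIM": (0.01, 1)}
shown = fixed_mode_view(records, fixed)
```

Because the fixed ranges are combined as a conjunction, the displayed records do not depend on the order in which the attributes were fixed, in contrast to normal mode.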

Comparison mode exploration

The comparison mode is useful for comparing two or more categorical attributes in

detail. Recall that all categorical attributes in VisEx are treated as quantitative

attributes by assigning serial numbers. For example, gender is treated as a quan-

titative attribute with two values 0 (male) or 1 (female). Similarly, if the dataset

has ‘town’ as an attribute and names of towns as the values for that attribute, each

town is given an integer label to convert the attribute to a quantitative attribute.

Once the user switches over to the comparison mode from normal mode opera-

tion, she can choose different values for a quantitative attribute from a scrolling

menu. Each subsequent barstick for the other user selected attributes is split into

i barsticks if the user chooses i quantitative values for comparison. To illustrate this mode, an example comparison scenario is presented. Consider Figure 26 and the second barstick, representing ‘Town’, which is a categorical attribute. If the user wants to compare other attributes like ‘Tax Rate’ and ‘Pupil teacher ratio’ for different towns, it can be done in the following way. The user

selects three towns 28, 75 and 76 for comparison. Each subsequent barstick for the

other attributes is split into three barsticks for these three towns. Figure 26 shows that, for these three towns, the correlations of the most expensive houses with residential land and industrial zones are opposite to the correlations seen in Figure 23 and Figure 29. Town numbers 75 and 76 tend to have high pupil teacher ratios

whereas town number 28 has a lower pupil teacher ratio. The use of this technique

along with exploration and selection techniques also helps users to extract hidden

correlations and different characteristics, or outliers, of attribute values.
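The splitting of a barstick in comparison mode can be sketched as follows. The function name `split_for_comparison` is hypothetical and the pupil-teacher ratios are invented values, not figures taken from the dataset.

```python
def split_for_comparison(records, attribute, chosen_values):
    """Build one record subset per chosen value of the comparison attribute.

    Each subset would feed its own barstick for every subsequent attribute.
    """
    return {
        value: [r for r in records if r[attribute] == value]
        for value in chosen_values
    }


# Invented sample records: town labels with made-up pupil-teacher ratios.
records = [
    {"TOWN": 28, "PTRATIO": 14.7},
    {"TOWN": 75, "PTRATIO": 20.2},
    {"TOWN": 76, "PTRATIO": 21.0},
    {"TOWN": 3, "PTRATIO": 17.8},
    {"TOWN": 75, "PTRATIO": 20.1},
]
split = split_for_comparison(records, "TOWN", [28, 75, 76])
```

Choosing three towns yields three subsets, so each later attribute is drawn as three barsticks, and records from towns that were not chosen are left out.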

Figure 26: Display of the relationship of six queried attributes: median value of owner occupied homes, town, residential zone proportion, non retail business proportion, tax rate, and pupil-teacher ratio; with three categorical attribute values: town numbers 28, 75, and 76.


3.4.4 User interaction

To effectively handle user interaction, an interactive tool must deal with many

human factors [20]. VisEx supports some principles of interactive design such as

consistency, providing feedback, and ease of use or simplicity without extensive

training. The interaction features of VisEx can be categorized into two main groups,

namely Exploration and Selection techniques. The interaction techniques provide

the possibility of visual feedback when users generate queries and interact with the

system.

Exploration techniques

An attribute list allows users to explore all attributes in the dataset. A barstick

is displayed for each selected attribute and all values of the selected attribute are

sorted and shown in each list box. Users can specify and adjust ranges of attribute

values. After selection of individual attributes, subsets of attribute values falling into the ranges of the previously selected attributes are colored.

In Figure 23, there are four selected attributes: median value of owner-occupied homes (MEDV), per capita crime rate by town (CRIM), proportion of residential land zoned for lots over 25,000 sq. ft. (ZN), and non retail business proportion (INDUS). First, the user selects the attribute MEDV from the first attribute list and specifies the range between 30 and 50 to examine how the most expensive houses correlate with the subsequently selected attributes. The first two barsticks show that expensive houses tend to be in areas with low crime rates. The user then specifies the lowest range of per capita crime rate (0.01-1) and a higher range (20-100) of residential zone proportion. Hence, the correlations among the four selected attributes can be summarized from the visualization: the more expensive houses tend to be not only in areas with low crime rates but also in areas with a higher proportion of residential zones and a low proportion of industrial zones.

The opposite is true for the cheaper houses which tend to be in areas with high


crime rates, a low proportion of residential zones, and a high proportion of industrial zones, as shown in Figure 27. In other words, both the MEDV and ZN attributes are inversely related to the CRIM and INDUS attributes. Users can select further

attributes that might be affected by the first three selected attributes. Figure 25

shows that most of the expensive houses in areas with low crime rates on larger

blocks are quite new and have more rooms.

Figure 27: The selections in this example are opposite to those shown in Figure 23. If median value is selected as low (first barstick), per capita crime is selected as high (second barstick), and residential zone proportion is selected in the medium range (third barstick), the number of industrial sites is high in those localities.

In addition, VisEx allows users to reselect the attributes to be displayed by a bar-

stick to examine their hypotheses dynamically. The system maintains and updates

all remaining attributes and attribute values according to the last change in an

attribute selected by the user.

Changing the queried ranges in a previously selected attribute affects how the bars

in the following barsticks are colored. For example, if the range of the first selected

attribute is changed, the selected colored areas of the subsequent barsticks, such

as the second barstick, will change. Corresponding to the first queried range, the

second queried range of the second barstick will affect the selected colored areas

of the third barstick. For reselected attributes, as shown in Figure 28, the new first selected attribute is the proportion of industrial land, or ‘non retail business proportion’. The result shows that the crime rate is high in industrial areas. The

subsequent reselection of other attributes reveals that industrial areas also have a

high concentration of nitric oxide pollutant and higher property-tax rate.


Figure 28: An example with four queried attributes: non retail business proportion, per capita crime, nitric oxides (NOX) concentration (parts per 10 million), and tax rate. The result shows that per capita crime, nitric oxide concentration and tax rate are higher in industrial areas (areas with higher non retail business proportion).

Selection techniques

Selection techniques are designed to support viewing the details and distribution

of data records in selected attributes. Users can view the distribution of records in

any particular area of the selected barstick in all other barsticks by clicking on the

areas of that barstick. For example, suppose the user clicks on the last rectangle of the third barstick in Figure 29. All areas corresponding to the selected area in the other barsticks are highlighted in gray. The areas with the highest proportion of residential land tend to have the lowest crime rates, less industrial land, and low pupil to teacher ratios. This highlighting tool

helps users to understand characteristics and distributions of selected data records

in other attributes. Selection supports on-demand details. When users click a right

mouse button on any colored areas of each barstick, details of specified range (e.g.,

numbers of queried data records and minimum and maximum values of that selected

area) are shown in a pop-up window. For example, a right click on the last group

of bars in the third barstick pops up the details that nine houses have minimal and

maximal proportion of residential zone from 90 to 100%.
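The highlighting and on-demand details can be illustrated with a small sketch. It assumes each attribute is binned by a list of edges; the helper names (`bin_index`, `highlight`, `details`), the bin edges, and the records are hypothetical and are not taken from VisEx.

```python
def bin_index(value, edges):
    """Index of the bin [edges[i], edges[i + 1]) that holds value.
    The last bin is closed, so it also holds the upper edge."""
    last = len(edges) - 2
    for i in range(last + 1):
        if edges[i] <= value < edges[i + 1]:
            return i
    if value == edges[-1]:
        return last
    raise ValueError(f"{value} lies outside {edges}")

def highlight(records, clicked_attr, clicked_bin, edges):
    """Return the records in the clicked bin and, for every other
    attribute, how many of those records fall into each of its bins."""
    chosen = [
        r for r in records
        if bin_index(r[clicked_attr], edges[clicked_attr]) == clicked_bin
    ]
    counts = {}
    for attr, attr_edges in edges.items():
        if attr == clicked_attr:
            continue
        per_bin = [0] * (len(attr_edges) - 1)
        for r in chosen:
            per_bin[bin_index(r[attr], attr_edges)] += 1
        counts[attr] = per_bin
    return chosen, counts

def details(chosen, attr):
    """Pop-up style summary: record count and value range."""
    values = [r[attr] for r in chosen]
    return {"count": len(values), "min": min(values), "max": max(values)}

# Hypothetical data: residential zone proportion and crime rate.
records = [
    {"zn": 95.0, "crime": 0.01},
    {"zn": 90.0, "crime": 0.02},
    {"zn": 0.0, "crime": 8.00},
    {"zn": 100.0, "crime": 0.03},
]
edges = {"zn": [0, 50, 90, 100], "crime": [0, 1, 10]}

chosen, highlighted = highlight(records, "zn", 2, edges)  # click the last zn bin
popup = details(chosen, "zn")
```

Here `highlighted` plays the role of the gray areas in the other barsticks, and `popup` corresponds to the count, minimum, and maximum shown on a right click.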

In addition, the system provides an equal-height bar chart for viewing the distribution
of data values in the selected barstick. The equal-height bar chart was chosen
for scalability and space efficiency. When users double-click the left mouse button
on any selected area of a barstick, the equal-height bar chart of that barstick pops
up to show the distribution and accumulation of attribute values under the range
selection of the previously queried attribute. Figure 30 displays an example of the
equal-height bar chart of the selected attribute, TOWN. There are only two towns,
26 and 27, in the industrial zones with a low rate of crime. Town number 27 has
more houses than Town number 26. In other words, the houses in industrial zones
tend to be in Town number 27 more than in Town number 26.

Figure 29: An example of selection in a barstick. The relationship between five queried attributes is shown. These are median value of owner occupied homes, per capita crime, residential zone proportion, non retail business proportion, and pupil-teacher ratio by town. The user selects the last group of bars in the third barstick (residential zone proportion) and all the affected bars in the other barsticks are highlighted in gray.

3.5 Analysis scenarios

A variety of organizations (e.g., federal government departments and businesses) use
census data to analyze, evaluate and improve their services. For example, the federal
government uses the census data to measure economic circumstances by an analysis
of average capital incomes with other related factors. Businesses can use the data as
an investment guide. The transportation department uses the census data to plan
highway improvements, develop public transportation services, design programs to
ease traffic problems, or reduce pollution. In my experiments, I have analyzed
two scenarios, one using U.S. census data and one using the Current Population
Survey, to demonstrate the data exploration capabilities of VisEx.

Figure 30: Display of the relationship of four queried attributes: non retail business proportion, town, per capita crime, and median value of owner occupied homes, with the equal-height bar chart of town attribute.

3.5.1 Analysis 1: 1990 U.S. Census Data

A part of the 1990 United States census data from the KDD archive of the University
of California at Irvine [26] has been used for the analysis in this section. The
census data consists of 72 attributes such as age, gender, income, education, industry,
occupation, and social class of workers, and has approximately 300,000 data
records. Many correlations, trends, and relationships may be discovered through
VisEx; I have experimented with some of them. Figure 31(a) shows that
entertainment, recreation, and professional service businesses tend to pay high
salaries to highly educated people to work in managerial and professional specialty
occupations. In contrast, as shown in Figure 31(b), finance, insurance, real estate,
and personal service businesses tend to pay more for highly educated people to work
as technicians, salesmen, and in related support services than in other occupations.
Highly educated males earn higher total personal incomes than females with the
same level of education, as shown in Figure 32.

Figure 31: An example of an analysis scenario with four selected attributes: occupation (dOccup), industry (dIndustry), total personal incomes (dRpincome), and years of schooling (iYearsch). (a) Managerial and professional specialty jobs are selected in the first barstick. The second barstick shows that most of these jobs are in professional services, and entertainment and recreation businesses. The third barstick shows that the salaries for such jobs are usually high. Finally, the fourth barstick shows that people employed in these jobs have higher years of schooling usually. (b) Technicians and related support occupations and sales occupation have been selected in the first barstick. The second barstick shows that people in these occupations have lower years of schooling compared to those in example (a). The third barstick shows that they usually earn less and are employed in finance, insurance, real estate, and personal service businesses.

Figure 32: An example analysis with three selected attributes: years of schooling (iYearsch), gender (iSex), and total personal incomes (dRpincome). 14-17 years of schooling is selected in both examples (a) and (b). (a) The second barstick shows the distributions of males and females with 14-17 years of schooling. The left block of bars (males) is selected. The third barstick shows that highly educated males earn high salary. (b) Females (the right block of bars) with 14-17 years of schooling are selected in this example. The third barstick shows that highly educated females earn comparatively less salary compared to their male counterparts.

Figure 33: An example analysis shows the relationships of five selected attributes: Total personal incomes (dRpincome), Years of schooling (iYearsch), Occupations (dOccup), Class of worker (iClass) and Industry (dIndustry). The first two barsticks show the range of the highest total personal income and degree of education. The third and fourth barsticks show the selected group of people who work in managerial and professional roles for private companies. In the last barstick, these people tend to work in the manufacturing group and in the entertainment, recreation, and professional service groups.

People who work in managerial and professional specialty careers for private for-profit
companies in the manufacturing group and in the entertainment, recreation, and
professional service groups tend to earn higher total personal income than people
who have occupations in other areas, as shown in Figure 33. Most of those people
have completed bachelor's or higher degrees. Figure 34 shows that some people in
the 65 or older age group receive high incomes although they are unemployed. Most
of these people receive social security income as well as retirement, survivor, or
disability pension incomes, while the rest have only one of these sources of income.

3.5.2 Analysis 2: The 1985 Current Population Survey

The population survey from the StatLib Datasets Archive of Carnegie Mellon University
[71] has been used. The dataset consists of 534 data items and 11 dimensions
or attributes. In Figure 35, queries are made on the Occupation, Sex, Education,
Race, and Wage attributes. The exploration in Figure 35(a) illustrates that more
males than females have professional jobs. Almost all of those males are white,
highly educated, and well paid. In contrast, Figure 35(b) shows that more females
than males work as clerks, and that they have average education.

Figure 34: An example analysis shows the relationships of five selected attributes: Total personal incomes (dRpincome), Occupations (dOccup), Age (dAge), Retirement income (dIncome7), and Social security income (dIncome5). The first selected attribute and range represent higher total personal income. The second barstick shows unemployed people earning higher personal income. The highest range of age (at least 65 years old) has been selected as the third attribute. The selections of the last two barsticks show that most of these people receive retirement income.

In Figure 36(a), Education, Experience, Age, and Wage attributes are queried and

the example shows that persons who have less education, a lot of experience and

are older tend to have low wages. In Figure 36(b), young highly educated persons

tend to have little experience and low wages. There are only two persons earning

the highest wages as detected from the outliers of the wage bar. Older people with

much experience have less education while younger people with little experience

have higher education. Younger persons (age around 15-18 years) tend to have

12-15 years of education.

3.6 User study

3.6.1 Experimental methodology

To evaluate the efficiency of the system, I conducted a user study with eighteen
participants, all postgraduate students in the School of Computer Science and
Software Engineering. They were asked to perform a set of tasks and report their
findings.

The experiment was divided into two sessions. The first session was a tutorial
session. First, a brief introduction to the visual representations in the system was
given to the participants. The participants were then asked to explore a car
dataset [71] and complete five example tasks (shown in Appendix A) using the
system, so that they could learn how to use the tool and all of its features for
exploring the dataset. They could ask questions at any time during this session until
they were ready to continue with the next session. In the second session,
participants were asked to complete ten tasks related to the census
dataset [71]. To evaluate all features of the tool, the tasks were set up so that the
participants could use the main features, including normal mode, fixed mode, and
comparison mode explorations, as well as the interactive tools for exploring the dataset.
Each participant's performance was timed and each answer was marked as correct or
incorrect in order to evaluate how easy the tool was to use. The ten tasks fall
into three main groups: identifying a group of records, finding correlations,
and comparing groups of relevant records.

Figure 35: Five selected attributes: Occupation, Sex, Education, Race, and Wage, are queried with specified ranges in each attribute. (a) The “Professional career” in the Occupation attribute is selected in the first barstick. The second barstick shows the distribution of males and females. The third barstick visualizes the distribution of education of males and the specified range of education is 15-18. “Race” and “Wage” are selected as the fourth and last attributes respectively. (b) In the first barstick, “clerk” is selected as the attribute of interest. The second barstick shows the distribution of males and females who are clerks. The third barstick shows the distribution of females with the specified education range of 15-18. “Race” and “Wage” are selected as the fourth and last attributes respectively.

Figure 36: An example analysis with four selected attributes: “Education”, “Experience”, “Age”, and “Wage”. (a) The first barstick shows 2-10 years of the specified education ranges. The second barstick shows the specified ranges of experience from 35 to 55 years. The third and fourth barsticks display the selected age 50-64 years and wage between $1-6/hour, respectively. (b) The first barstick shows 13-18 years of education. The second barstick shows experience from 0 to 5 years. The third and fourth barsticks display the selected age between 18-25 years and wage between $1-6/hour, respectively.

3.6.2 Results

Time and Correctness

All participants spent less than five minutes on each individual task, and spent more
time on tasks involving more attributes. Tasks 7, 9, and 10 each consisted
of two questions with four to five attributes, so unsurprisingly these were the
most time-consuming tasks, as shown in Figure 37. Task 3 was a comparison task
and Task 1 involved searching for a correlation. Both of these tasks involved only two
attributes, so participants spent less time on them. Participants spent less
time on correlation tasks (Task 2 and Task 5) than on comparison tasks (Task 4 and
Task 6). The correctness of the given tasks is shown in Figure 38. All participants
answered Task 3 correctly, while Task 5 and Task 7 were the least correctly answered:
about 89% of participants answered these two tasks correctly. I observed that
a few participants did not read these questions carefully and did not answer all
questions in the tasks completely. Nevertheless, the correctness of every task was more
than 85%. I also observed that some participants tried different features of the tool
to answer the tasks.

Figure 37: The mean time for completing each task.

Questionnaire and feedback

After finishing all tasks, the participants were asked about their experience in data
analysis and visualization using VisEx. All but one of the participants had no
experience in data analysis and none of them had experience in using visualization
tools. The questionnaire was organized into four major groups: usability,
visualization, interaction, and information. The corresponding feedback is
presented in Figure 39.

Figure 38: The correctness of each task.

In the usability category, 27.78% of participants strongly agreed and 44.44% agreed
that the visualization was easy to understand, while 16.66% rated it fair. The
tool was found to be easy to use (5.56% strongly agreed, 61.11% agreed, and 27.78%
rated it fair), and only 5.56% of participants disagreed. More than 88.88% of
participants found the tool easy to learn. 66.67% of participants agreed
that the tasks were easy to complete with the tool and 22.22% gave a fair rating.
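Because the study had eighteen participants, each reported percentage corresponds to a whole number of people. As a small worked example, the sketch below converts the ease-of-use percentages quoted above into head counts; the dictionary keys are illustrative labels, not wording from the questionnaire.

```python
PARTICIPANTS = 18  # size of the user study

# Reported ease-of-use ratings, in percent of participants.
reported = {
    "strongly agree": 5.56,
    "agree": 61.11,
    "fair": 27.78,
    "disagree": 5.56,
}

# Round to the nearest whole participant.
counts = {
    rating: round(percent * PARTICIPANTS / 100)
    for rating, percent in reported.items()
}
total = sum(counts.values())
```

Under this reading, one participant strongly agreed, eleven agreed, five rated the tool fair, and one disagreed, which accounts for all eighteen participants.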

In the visualization category, more than 90% of participants could identify the
difference between normal mode and fixed mode exploration. Participants preferred
the normal mode exploration, reporting that they could visualize more
information in it than in the fixed mode exploration.
However, a few participants said that they liked both modes and used
them depending on the goals of the exploration. Participants gave no
negative feedback on using and understanding barsticks, identifying specific groups
of records, comparing groups of records, or the clarity of the visual representation
in VisEx. Participants also gave a lot of positive feedback on the ability to
identify correlations among specified attributes. 27.78% of participants strongly
agreed and 50% agreed that they could easily understand the displayed information.

In the interaction category, all participants gave positive feedback on their
ability to use and change exploration modes. More than 78% of participants
agreed or strongly agreed that it was easy to change the selection of parameters
as well as to correct mistakes. Participants reported that the search for data of
interest was easily directed (11.11% strongly agreed, 55.56% agreed, and 27.78%
rated it fair). Moreover, most participants gave positive feedback on more than 70%
of the features in this category. Participants also provided further comments.
A few commented that they would have more confidence if they could
spend more time using the system. Finally, most participants found the tool quite
useful.

3.7 Summary

Visualization is an important tool for understanding large and complex datasets.

It has been employed in many fields to help users gain insight into their data.

However, most of the visualization techniques encounter the problems of occlusion

and scalability. Most systems also require some prior training for the users. Hence,

it is an interesting challenge to design a visual exploration system that can provide

clear and understandable visualization as well as simple and flexible user interaction.

VisEx provides a new framework for visualizing correlations among attributes in
large multidimensional datasets. The display technique in VisEx avoids occlusion
by displaying quantitative estimates of the data. Most previous visualization
techniques for multidimensional datasets can compare only a few attributes
(usually two) at a time. In VisEx, however, a user can discover correlations among
many attributes at a time through its coupled display system. VisEx also provides
analysts with facilities for selecting ranges within attribute values, and all the
records affected by these selections are highlighted in the barsticks of the other
attributes. This helps analysts conduct what-if experiments when discovering
correlations among attributes. VisEx scales from small to large datasets because
the aim is to display quantitative estimates rather than the actual records. Hence,
VisEx maintains a similar screen appearance, without occlusion of graphic primitives,
for datasets of all sizes.

In the next chapter I introduce an integration of a visualization technique similar to

VisEx into the data mining process to enhance effective human intervention in data

mining. A framework for visual data mining is presented for discovering interesting

association rules. Moreover, I propose a new visualization technique for displaying

mined association rules.

Figure 39: The results from questionnaires in different categories: (a) Usability (b) Visualization (c) Interaction (d) Information

Chapter 4

Visualization for Association Rule Mining

4.1 Introduction

Data mining algorithms in general have different purposes, e.g., gaining insight into
data, predicting trends and discovering hidden associations in large datasets.
Different data mining methods such as mining association rules, cluster analysis, and
classification have different goals according to the kind of knowledge to be mined.
In this thesis, I focus only on visual mining of and visualization for association rules.

An example of the use of association rules is to help store managers study the
purchasing behavior of their customers and promote the sale of specific items to their
loyal customers. Databases such as transaction records in supermarkets,
telecommunication companies, e-marketing and credit card companies have been growing
rapidly, and it is difficult to extract meaningful information from them.
Analysts need a tool to transform large amounts of data into interpretable knowledge
and information, and to help make decisions, predict trends, and discover
relationships and patterns. Association rule mining is one of the most important
data mining processes. It is a powerful tool that helps analysts to understand and
discover the relationships in their data. Market basket analysis is an example
of mining association rules that helps marketing analysts analyze customer
characteristics to improve their marketing strategies. To increase sales, one such
strategy could be to place associated items in the same area of the floor so that
customers can easily find them and place them in their market baskets. For example,
if bread and cheese are frequently purchased together, placing them in close
proximity may increase sales because customers who buy bread may also buy cheese
when they see it on a nearby shelf. Furthermore, promoting items that are frequently
purchased together in the store catalogs may increase the sale of those items.

Mining association rules is a well researched area within data mining [21]. There are

many algorithms for generating frequent itemsets and mining association rules [1,

60, 69]. Such algorithms can mine association rules which have confidence and

support higher than a user-supplied level. However, one of the drawbacks of these

algorithms is that they mine all rules exhaustively and many of these rules are

not interesting in a practical sense. Too many association rules are difficult to

analyze and it is often difficult for an analyst to extract a meaningful (small) set of

association rules. Hence there is a need for human intervention during the mining of

association rules [2, 77] so that an analyst can directly influence the mining process

and extract only a small set of interesting association rules.

However, it is quite often impossible for a human expert to understand large
multidimensional datasets through manual examination. Visual data mining helps users
to extract interesting patterns hidden in their data and learn more about the data
through visualization. It is also important for the analyst to participate in the
mining process in order to identify meaningful association rules from a large database
through her guidance and knowledge. Any such participation should be easy from
an analyst’s point of view. Hence, visual association rule mining seems to be a
natural way of directing the mining process.

The visualization technique presented in VisEx has been modified to help an
analyst mine association rules. This modification introduces a new tight coupling
technique, called VisDM, which enables users to apply their domain knowledge to
enhance decision making in ways that automatic processes alone cannot. The
algorithms and all user interfaces are implemented in Visual C++ and
tested on both synthetic and real world datasets.

This chapter is organized as follows. Section 4.2 introduces the terminology of
association rules. The new tight coupling technique for visual mining
of association rules is introduced in Section 4.3. The data structure used in the
implementation of VisDM is discussed in Section 4.4. An example of visual mining of
market basket association rules, as well as an evaluation through a user study, is
discussed in Section 4.5. Section 4.6 presents a new technique for visualizing mined
association rules. The conclusion of the chapter is presented in Section 4.7.

4.2 Terminology

An association rule [2] is formally described as a rule of type A ⇒ B where A is an
item set called the antecedent, body, or left-hand side (LHS) and B is an item set
called the consequent, head, or right-hand side (RHS). Each item set consists of items
from a transactional database. Items in the antecedent do not appear in the consequent.
In other words, an association rule is of the form

A ⇒ B

where A, B ⊂ I and A ∩ B = ∅.

I = {i_1, i_2, ..., i_n} is the set of items in the transaction database, where i_j,
1 ≤ j ≤ n, is an item that may appear in a transaction. Two common measures
for evaluating the importance of an association rule are support and confidence. The
support of a rule is the percentage of transactions in which all items in
the rule appear together. The confidence of the rule is the ratio of the frequency of
the items in both the antecedent and the consequent (frequency of A and B) to the
frequency of the items in the antecedent appearing together. In terms of probability,
support and confidence are

Support(A ⇒ B) = P(A ∪ B)

Confidence(A ⇒ B) = P(B | A)

where P(A ∪ B) is the probability that a transaction contains every item in A ∪ B.

An example of an association rule is {cheese, bread} ⇒ {milk, eggs}. If this rule

has a support of 12%, it means that the four items cheese, bread, milk, eggs appear

together in 12% of all transactions. If this rule has a confidence of 52%, it means

that 52% of all customers who purchased cheese and bread also purchased milk and

eggs in the same transaction.
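The two measures can be computed directly from a list of transactions. The following is a minimal sketch under the assumption that each transaction is a Python set of item names; the helper names and the five sample transactions are hypothetical and are not the data behind the 12% and 52% figures above.

```python
def count(transactions, items):
    """Number of transactions that contain every item in items."""
    items = set(items)
    return sum(1 for t in transactions if items <= t)

def support(transactions, antecedent, consequent):
    """Fraction of all transactions containing A and B together."""
    both = set(antecedent) | set(consequent)
    return count(transactions, both) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Fraction of the transactions containing A that also contain B."""
    both = set(antecedent) | set(consequent)
    return count(transactions, both) / count(transactions, antecedent)

# Hypothetical market baskets, for illustration only.
transactions = [
    {"cheese", "bread", "milk", "eggs"},
    {"cheese", "bread", "milk"},
    {"cheese", "bread"},
    {"milk", "eggs"},
    {"bread"},
]

rule_support = support(transactions, {"cheese", "bread"}, {"milk", "eggs"})
rule_confidence = confidence(transactions, {"cheese", "bread"}, {"milk", "eggs"})
```

In this toy data one of five baskets contains all four items, and three baskets contain cheese and bread, so the rule has a support of 1/5 and a confidence of 1/3. Note that `confidence` divides by the antecedent count, so it assumes the antecedent occurs at least once.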

The term frequent itemset [21], or large itemset [60], refers to an item set whose
number of co-appearances in the database is greater than a user-specified support.
In other words, a frequent itemset is a set of items that are frequently purchased
together, as judged by the specified minimum support.
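To make the definition concrete, the sketch below enumerates frequent itemsets by brute force. It is deliberately naive and is not one of the cited mining algorithms; the function name, the `max_size` cap, and the sample baskets are assumptions made for illustration. Following the wording above, an itemset qualifies only if its support is strictly greater than the minimum.

```python
from itertools import combinations

def frequent_itemsets(transactions, min_support, max_size=3):
    """Map each itemset (up to max_size items) whose support is strictly
    greater than min_support to that support, checking every candidate."""
    items = sorted(set().union(*transactions))
    total = len(transactions)
    result = {}
    for size in range(1, max_size + 1):
        for candidate in combinations(items, size):
            hits = sum(1 for t in transactions if set(candidate) <= t)
            if hits / total > min_support:
                result[frozenset(candidate)] = hits / total
    return result

# Hypothetical market baskets, for illustration only.
transactions = [
    {"bread", "cheese"},
    {"bread", "cheese", "milk"},
    {"bread", "milk"},
    {"milk"},
]

frequent = frequent_itemsets(transactions, min_support=0.4)
```

With a minimum support of 0.4, the three single items, {bread, cheese}, and {bread, milk} are frequent, while {cheese, milk} and the full three-item set (each in only one of four baskets) are not.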

4.3 The model for interactive association rule mining

Figure 40 shows a diagram of the model. The VisDM system can be divided into
the following three stages, each designed to enhance the ability of the
users to interact in the mining process.

• Identifying frequent itemsets

• Mining association rules

• Visualizing the mined association rules

In the first stage of VisDM, the user finds a suitable frequent itemset. In most
data mining algorithms, the selection of a frequent itemset is done automatically:
any item whose occurrence is above the user-specified support is chosen as a
member of the frequent itemset. Though this method is efficient for identifying all
the frequently occurring items, the subsequent association rule mining step quite
often discovers a large number of association rules involving these items. In
contrast, the technique presented here gives the user complete control over choosing
the items that form the frequent itemset. The details of this stage are described in
Section 4.3.1.

In the second stage, the user participates in generating interesting association rules

by specifying antecedents and consequents of each rule from the frequent itemset

chosen in the first stage. The user can experiment with different combinations

of the antecedents and consequents and save a generated association rule if it is

interesting. Section 4.3.2 provides more details on how this stage works.

Finally, in the third stage, the user can visualize all the discovered rules saved
during the second stage. Further details of this stage are presented in Section 4.3.3.
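The second stage can be pictured as trying different antecedent and consequent splits of a chosen frequent itemset and keeping the rules that look interesting. The sketch below lists every split with its support and confidence. It is an illustration only: the names are hypothetical, VisDM lets the analyst pick splits interactively rather than enumerating them, and the confidence threshold merely stands in for the analyst's judgment.

```python
from itertools import combinations

def count(transactions, items):
    """Number of transactions that contain every item in items."""
    return sum(1 for t in transactions if items <= t)

def candidate_rules(transactions, itemset):
    """List (antecedent, consequent, support, confidence) for every way of
    splitting itemset into a non-empty antecedent and consequent."""
    itemset = frozenset(itemset)
    total = len(transactions)
    both = count(transactions, itemset)
    rules = []
    for size in range(1, len(itemset)):
        for chosen in combinations(sorted(itemset), size):
            antecedent = frozenset(chosen)
            consequent = itemset - antecedent
            seen = count(transactions, antecedent)
            if seen == 0:
                continue  # confidence is undefined for an unseen antecedent
            rules.append((antecedent, consequent, both / total, both / seen))
    return rules

# Hypothetical market baskets, for illustration only.
transactions = [
    {"bread", "cheese", "milk"},
    {"bread", "cheese"},
    {"bread"},
    {"milk"},
]

rules = candidate_rules(transactions, {"bread", "cheese"})
# Stand-in for the analyst saving the rules she finds interesting.
saved = [rule for rule in rules if rule[3] >= 0.9]
```

For this data, {cheese} ⇒ {bread} has a confidence of 1.0 while {bread} ⇒ {cheese} has a confidence of 2/3, so only the first rule is saved for the visualization stage.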

VisDM splits the application window into two areas: left and right panels. The left

panel is a user control panel which allows the user to input parameters. The right

panel is a visualizing panel which displays results in response to the parameters set

in the left panel.

To effectively handle user interaction, an interactive tool must deal with many

human factors such as consistency and feedback [20]. My interactive technique takes

into account some requirements of interactive design such as consistency, providing

feedback, reducing memorization, and ease of use without extensive training. In

addition, an analyst has complete control over deciding on the antecedents and

consequents of each rule and the whole process is intuitively simple for the analyst.

Although a complete visual mining process is slow compared to an automated process,
it has the advantage of exploring only interesting association rules. As mentioned
before, an automated process can mine many association rules that are not
meaningful in practice. The visualization tool is extremely simple to use and avoids
screen clutter. This makes it an attractive option for both small and large
databases.

Figure 40: A model of the technique for mining association rules.

4.3.1 Identifying Frequent Itemsets

This part of the system helps analysts search for frequent itemsets based on a user-specified minimum support. An analyst can provide the minimum support to filter only items that she is interested in. After the minimum support is specified, all items exceeding the threshold are loaded and sorted in descending order of their support. The analyst can use the sorted list as a guide in selecting each item in the frequent itemset. Each selected item is represented by a barstick with the percentage of its support. After the first item is selected, the system generates a list of items that co-exist with it. All the items in this co-existing item list have supports greater than the user-specified minimum support. The co-existing item list is regenerated each time a subsequent item is chosen. The percentage of support is calculated as the ratio of the number of transactions in which the first and second selected items appear together to the total number of transactions containing the first selected item. At each step, the barsticks are displayed in a way similar to the VisEx system discussed in Chapter 3.
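The selection loop described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the VisDM implementation: the function names, the toy transactions, and the choice to apply the minimum support to the conditional percentage are all assumptions made for the example.

```python
from collections import Counter

def item_supports(transactions, min_support):
    """Return (item, support %) pairs above min_support, highest support first."""
    n = len(transactions)
    counts = Counter(item for t in transactions for item in set(t))
    frequent = [(i, 100.0 * c / n) for i, c in counts.items()
                if 100.0 * c / n > min_support]
    return sorted(frequent, key=lambda pair: (-pair[1], pair[0]))

def coexisting_items(transactions, selected, min_support):
    """List items that co-occur with every item in `selected`.

    Assumption: the percentage is taken relative to the transactions that
    contain all selected items, and the threshold is applied to it.
    """
    selected = set(selected)
    matching = [set(t) for t in transactions if selected <= set(t)]
    if not matching:
        return []
    counts = Counter(item for t in matching for item in t - selected)
    result = [(i, 100.0 * c / len(matching)) for i, c in counts.items()]
    result = [(i, p) for i, p in result if p > min_support]
    return sorted(result, key=lambda pair: (-pair[1], pair[0]))

# Hypothetical toy data, invented for this sketch.
transactions = [
    ["milk", "bread", "cheese"],
    ["milk", "bread"],
    ["milk", "cereal"],
    ["bread", "cheese"],
    ["milk", "bread", "cereal"],
]

first_list = item_supports(transactions, 30)
after_milk = coexisting_items(transactions, ["milk"], 30)
```

Each further selection simply calls `coexisting_items` again with the longer list of selected items, which mirrors how the co-existing item list is regenerated after every choice.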

VisDM helps an analyst to find items which tend to appear together in the transactions. In addition, the system supports user interaction to find the details of each selected item. When the analyst clicks on a bar, the percentage of each item in the co-existing item list and its support are displayed to help the analyst make decisions and compare the selected items of interest and their supports.

As shown in Figure 41, the display window is divided into two sub-windows. The left panel comprises the specified minimum support, lists of the items through comboboxes, and the list of co-existing selected items with their supports in descending order. For example, the co-existing item list of cereal with milk, bread, and cheese consists of biscuits (40% support), chocolate (28% support), and juice (36% support). The right panel shows the selected items with the number of purchases from all transactions as hierarchical barsticks. Milk, bread, cheese, and cereal are selected in that order as items of interest. Each item in the set is chosen from a drop-down list of items. The user can change any previously chosen item at any stage by reselecting an item from the corresponding drop-down list, and can shrink the frequent itemset by deleting the last item. Once the user has chosen the frequent itemset, it can be saved for the later stages of the mining process. Only seven items are shown in Figure 41; however, it is possible to include any number of items in the left panel through a scrolling window.

Figure 41: The right drawing space represents each selected item as a barstick with the number of purchases from all transactions. The control tab represents a combobox for each selected item and the list of its co-existing items.


4.3.2 Selecting Interesting Association Rules

In this stage, the selected frequent itemset from the first stage is used to generate association rules. Again, the user has complete freedom in choosing the association rules, including the items in the antecedent and consequent of each rule. The items in the antecedent and consequent of an association rule are not limited to one-to-one relationships; the system supports many-to-many relationship rules as well. In Figure 42, the left panel shows the selected frequent itemset of interest, comprising milk, bread, cheese, and cereal, from the first stage. The user is allowed to generate many-to-many relationship rules, e.g., milk and bread as antecedent and cheese and cereal as consequent, or any other combination of the items in the antecedent and consequent. In the right panel, the first colored bar illustrates the proportion of the items selected for the antecedent, milk and bread. The second colored bar represents all selected items of the association rule; in other words, it shows the proportion of the consequent items, cheese and cereal, appearing together with the antecedent of the rule. In the left control panel, the system shows the support of the antecedent, the support of the selected itemset, and the confidence of the association rule.
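To give a sense of how many rules the user can choose from, the following sketch enumerates every pair of disjoint, non-empty antecedent and consequent sets that can be drawn from a frequent itemset. In VisDM the user picks rules by hand; the enumeration and the function name `candidate_rules` are assumptions used only to illustrate the size of that choice.

```python
from itertools import combinations

def candidate_rules(itemset):
    """Yield (antecedent, consequent) pairs of disjoint, non-empty subsets."""
    items = sorted(itemset)
    for a_size in range(1, len(items)):
        for antecedent in combinations(items, a_size):
            # Items not used in the antecedent are available for the consequent.
            rest = [i for i in items if i not in antecedent]
            for c_size in range(1, len(rest) + 1):
                for consequent in combinations(rest, c_size):
                    yield antecedent, consequent

rules = list(candidate_rules({"milk", "bread", "cheese", "cereal"}))
```

For the four-item itemset used in this section, the list includes one-to-one rules such as (milk) ⇒ (bread) as well as the many-to-many rule with milk and bread as antecedent and cheese and cereal as consequent.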

4.3.3 Visualizing Association Rules

This part deals with visualization of the association rules mined in the second stage. The visualization allows analysts to view and compare the association rules generated in the first two stages. Among the selected interesting rules, the visualization bars allow analysts to identify the most significant and interesting rules. Figure 43 represents three association rules. For example, the first rule shows the relationship of the antecedent, milk and bread, and the consequent, cheese and cereal. The confidence, the antecedent support, and the itemset support of this rule are 49, 51, and 25, respectively. For the second rule, the first bar, with support 51, represents the antecedent, milk and bread, and the second bar, with support 40, represents cheese. The confidence is 78. The antecedent support of the last rule is 30, the itemset support is 25, and the confidence is 83. The last rule has the highest confidence, while its antecedent support is the lowest and its itemset support is equal to that of the first rule.

Figure 42: The right drawing space represents two barsticks. The first bar shows the proportion of the antecedent of the association rule. The second bar shows the consequent based on the selected antecedent. The control tab at the top of the left-hand side is used to input the antecedent and consequent of the rule. The bottom of the tab displays the confidence, the antecedent support, and the itemset support.

Figure 43: Illustration for deriving interesting association rules from the selection of the rules in Figure 42. The two bars and the texts represent each rule and its properties.
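The three confidence values quoted above follow from the standard definition of confidence as the itemset support divided by the antecedent support. The short check below recomputes them from the supports given in the text; the function name is hypothetical and rounding to the nearest integer is an assumption about how the values were displayed.

```python
def confidence(itemset_support, antecedent_support):
    """Confidence of A => C as a percentage: support(A and C) / support(A)."""
    if antecedent_support <= 0:
        raise ValueError("antecedent support must be positive")
    return 100.0 * itemset_support / antecedent_support

# (antecedent support, itemset support) for the three rules in the text.
rule_supports = [(51, 25), (51, 40), (30, 25)]
rounded = [round(confidence(s, a)) for a, s in rule_supports]
```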

4.4 Data Structure used in VisDM

The VisDM algorithm scans a market basket transaction database twice. The first scan counts the support of each item in the transaction records. The second scan generates a bitwise table that stores the item lists of the original transaction records. A bitwise representation is used to record both existing and non-existing items.

In the first stage, for identifying the frequent itemset, an item identification list

including non-existing items of each transaction is converted into a bit-vector rep-

resentation, where 1 represents an existing item and 0 represents a non-existing

item in the record. For example, suppose a market basket transaction database

consists of four items including milk, bread, cheese, and cereal in ascending order

of item identifications and a transaction contains two items: milk and cheese. A

bit-vector of this transaction is 1010. Hence, the associated items can be retrieved

by applying a bitmask operation to each transformed item list. Each bitmask is

generated by transforming all selected items to bits which are set to 1. After se-

lecting each interesting item from a list in the first stage, an associated item list

is generated to support the user’s search for the next interesting item. To reduce

search time of associated items in each transaction, the associated item list contains

only the indexes of transactions with all selected items appearing together. Each

transaction index is linked to the bitwise table so that all associated items in that

transaction can be retrieved. This technique can support a large number of items

in a transaction database. Though the bitwise technique needs some preprocessing

time to convert the transaction records to a bitwise table, it is efficient and effective

for searching the existing and associated items at run time.
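A minimal sketch of the bit-vector idea is given below, using Python integers as bit vectors. It is not the VisDM data structure itself: the function names, the three toy transactions, and the convention that the leftmost bit corresponds to the first item are assumptions chosen so that the example reproduces the bit vector 1010 from the text.

```python
ITEMS = ["milk", "bread", "cheese", "cereal"]  # ascending item identifiers

def to_bits(item_set):
    """Encode a set of items as an int; the leftmost bit is the first item."""
    bits = 0
    for item in ITEMS:
        bits = (bits << 1) | (1 if item in item_set else 0)
    return bits

def matching_indexes(table, mask, candidates=None):
    """Indexes of transactions that contain every item set to 1 in `mask`."""
    if candidates is None:
        candidates = range(len(table))
    return [i for i in candidates if (table[i] & mask) == mask]

def associated_items(table, indexes, mask):
    """Items, other than the selected ones, found in the given transactions."""
    seen = 0
    for i in indexes:
        seen |= table[i]
    seen &= ~mask  # drop the items that are already selected
    return [item for pos, item in enumerate(ITEMS)
            if seen & (1 << (len(ITEMS) - 1 - pos))]

# Hypothetical transactions; the first one is the example from the text.
transactions = [{"milk", "cheese"}, {"milk", "bread", "cheese"}, {"bread", "cereal"}]
table = [to_bits(t) for t in transactions]       # the bitwise table

milk_mask = to_bits({"milk"})
milk_rows = matching_indexes(table, milk_mask)   # transaction index list
next_items = associated_items(table, milk_rows, milk_mask)
```

Passing the previous index list as `candidates` when a further item is selected restricts the search to transactions that already match, which corresponds to the reduced search described above.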

4.5 A user study of VisDM

4.5.1 Experimental methodology

To evaluate the usability of the system, I conducted a user study with seven postgraduate students from the School of Computer Science and Software Engineering by asking them to perform data mining tasks and report their findings. The experiment was run on two datasets and all participants had to complete four main tasks on each dataset, as shown in Appendix B.

The first dataset is derived from the UCI Machine Learning Repository [26] and its items are denoted by numbers. The other dataset is from Data Mining II (DMII) [46], with the associated item names for each transaction. Before starting an experiment, each participant was given a tutorial on the terminology for interpreting association rules and frequent itemsets, as well as instructions for using VisDM. An example of VisDM in action was also shown to the participants. At the end of the experiment, the participants completed a brief usability questionnaire partly derived from Stasko [70] and Marghescu and Rajanen [48].

4.5.2 Results

The participants were asked about their experience in mining and visualization using VisDM. None of the participants had extensive experience in data analysis and only three participants had some experience in using visualization tools. The usability, visualization, interaction, and information data from the study are presented in Figure 44. 57% of the participants found that the parameters shown in the tool are understandable and the tool is easy to use, though 29% of the participants did not agree that the tool is easy to use. The tool was found easy to learn (29% strongly agree, 43% agree, and 28% fair). 43% of the participants agreed that the tool was easy to use for completing the tasks, while 14% of the participants did not agree. For quality of visualization, all participants provided positive feedback on identifying the most and least often bought items. More than 55% of the participants found that they could identify the maximum and minimum percentage of items purchased together and appreciated the clarity of the visual representation, though about 14% of the participants did not agree. For quality of interaction (i.e., the ability to change the selection of items, to explore data, to use parameters, and to direct the search for data of interest), most participants provided positive feedback. 86% of the participants agreed or strongly agreed that they were able to correct their mistakes, though 14% of the participants did not agree. Moreover, most participants provided positive feedback on the quality of information they could collect about the underlying dataset. Figure 45 shows that the participants spent more time completing the second and third tasks on Dataset1 than the corresponding tasks on Dataset2. This result supports the idea that participants were able to search for frequent itemsets and association rules of interest faster after gaining some experience from the previous tasks. However, the user study is limited to a small set of participants, none of whom had experience in data analysis.

4.6 Visualization of many association rules

I have discussed the facility of visualizing a small number of association rules in

the VisDM system. This section discusses a system called VisAR which is suitable

for visualizing a large number of association rules. Typically, association rules

generated by mining algorithms are difficult for users to understand. Visualization

allows users to visually analyze and understand the mined association rules.

Zhao and Liu have proposed a visualization technique [82] for association rules.

Their technique uses a line to represent each association rule. The x-axis represents

time data and the y-axis represents the support or confidence value. Although this

technique is designed to help users to understand mined association rules through

visual analysis of time, their visualization uses a technique similar to the parallel

coordinates technique [34]. In practice, this technique causes occlusion and screen

clutter when visualizing a large number of association rules.

Wong et al. use a 3D visualization framework for association rules [81]. The approach is based on a Matrix-based technique. Although this technique can visualize many-to-one association rules, the number of association rules generated by association rule mining algorithms is massive, and it is difficult to display all generated association rules using this technique. In particular, this technique is prone to occlusion. Though the authors claim that the height of the columns is scaled, the higher columns representing antecedent and consequent items of the rules can still occlude the columns of low support and confidence. In contrast, my technique makes it possible to view not only many-to-one but also many-to-many association rules. The technique allows users to select items existing in association rules so that the users can view only the association rules containing their items of interest.

Figure 44: The results from questionnaires in different categories: (a) Usability (b) Visualization (c) Interaction (d) Information

Although Table-based, Matrix-based, and Graph-based techniques as well as some

commercial visualization systems are capable of representing mined association

rules, they visualize all mined association rules in a single view. Typically, visual-

izing all association rules at once produces too much information and might also

generate screen clutter. It is difficult for users to interpret and extract interesting

association rules from a single view of all rules.

This chapter presents a new technique called VisAR for visualizing association rules derived from data mining algorithms. The aims of the VisAR system are similar to those of VisEx presented in Chapter 3. I focus on reducing the complexity of visualizing a large number of association rules in a single screen so that users are able to effectively understand and interpret information from a large number of association rules. The system is also designed to eliminate occlusion from visualization. This new technique visualizes the association rules containing user-specified items. Users can explore association rules through their specified items of interest. The input for visualizing many association rules in the VisAR system is derived from algorithms in CBA [46]. The next section presents the details of VisAR with an example.

Figure 45: (a) The mean time of completing each task. (b) The correctness of each task in each dataset.

4.6.1 The VisAR system

The system has been designed based on the diagram in Figure 46. I have categorized

all processes in the diagram into four major stages.

• Managing association rules

• Filtering association rules of interest

• Visualizing selected association rules

• Interaction during visualization

The first stage includes two processes: specifying and loading association rules that

have been generated by an automated data mining tool as shown in Figure 46.

The specified association rules are first loaded into memory. The system counts all

provided association rules and the number of distinct items in both antecedents and

consequents as well as manages lists of items in antecedents and consequents. Then

the system sorts the association rules according to the support values of individual

association rules. The support is used as a default for sorting association rules.

The purpose of the second stage is to specify the items of interest in association

rules and filter association rules according to the specified items. The user specifies

the items of interest and the system filters the association rules for which the user-

specified items exist in the antecedents.

Figure 46: A diagram of the system for visualizing mined association rules.

The aim of the third stage is to visualize the association rules containing the selected items from the previous stage. Figure 47 shows the visualization result of the selected items, namely cd and rice, and the user interface of VisAR. After the user selects the items of interest, all association rules containing the specified items are visualized on the right panel. All antecedents and consequents of all qualified association rules are displayed along the y-axis. The antecedents are placed above and the consequents below the x-axis, which is displayed as a bold black line. The selected items of interest are displayed above the other, unselected items in the antecedents along the y-axis. In Figure 47, the items in the antecedents of all association rules are cd, rice, battery, soya sauce, newspaper, and sweets. The items in the consequents of the association rules are newspaper, battery, soya sauce, and sweets. The selected items are cd and rice. These two items are listed above battery, soya sauce, newspaper, and sweets. The system displays all association rules parallel to the y-axis, ordered by their sorted support values. Each rule is visualized by a vertical line parallel to the y-axis with circular dots representing the items in the association rule. For example, in Figure 47, the first vertical line represents an association rule with five circular dots. Four dots representing cd, rice, battery, and soya sauce are in the antecedent section and another dot in the consequent section represents the newspaper item. The association rule is {cd, rice, battery, soya sauce} ⇒ {newspaper}. This is the rule with the highest support among all rules that include cd and rice in the antecedent. The confidence of each association rule is mapped to a color ramp so that the user can identify and group similar association rules according to color.

Ten different colors are used for representing ten equal ranges of either support or confidence, expressed as percentages from zero to one hundred. This color range has been designed to enhance the human ability of grouping items according to color. Red represents the maximum value range, 90-100%, while blue represents the minimum value range, 0-10%. All association rules in Figure 47 are in the same range and are mapped to the third color range, i.e., 20-30%.
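The mapping from a percentage to one of the ten color ranges can be sketched as a simple binning function. The text names only the colors of the lowest range (blue) and the highest range (red), so the sketch returns a bin index rather than a color; the function names and the treatment of boundary values (lower bound inclusive, 100% placed in the top bin) are assumptions.

```python
def color_bin(percent):
    """Map a support or confidence percentage to one of ten bins (0 to 9)."""
    if not 0 <= percent <= 100:
        raise ValueError("percentage must lie between 0 and 100")
    # Bin 0 is the blue range (0-10%), bin 9 is the red range (90-100%).
    return min(int(percent // 10), 9)

def bin_label(index):
    """Human-readable label of a bin, e.g. '20-30%' for the third range."""
    return f"{index * 10}-{(index + 1) * 10}%"

third_range = bin_label(color_bin(25))
```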

The last stage in VisAR is the interaction stage. This stage allows users to view details of each association rule and provides flexible adjustments to view specific association rules. The support and confidence values of an association rule are shown when the user moves the mouse over the vertical line representing the rule. The user can change both defaults of the system for visualizing association rules. The first option is to change the view to display only association rules containing exactly the specified items of interest. Figure 48 and Figure 49 show association rules in which only cd and rice appear in the antecedents of the association rules. By default, the system displays association rules whose antecedents contain the specified items together with any other items. The second option is to change the sorting order from support to confidence. By default, the system sorts according to support.

Figure 47: The left panel displays all antecedent items of association rules with the interactive options (operation and sorting) for visualizing association rules. The right panel visualizes association rules whose antecedent items are selected. cd and rice are the selected items in this figure. This visualization represents a selected OR operation.

Figure 48: Visualization of association rules from the selected items of interest in Figure 47. This visualization represents the selected operation AND, which shows only association rules containing exactly the selected items, cd and rice, in the antecedent.

Figure 49: This visualization represents the sorting by confidence and shows only association rules containing exactly the selected items, cd and rice. The color of the vertical lines represents the support value of the association rules.

Figure 50: Visualization of association rules from the selected items of interest in Figure 47 but sorted according to confidence values.
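The filtering and sorting options of VisAR can be illustrated with a small sketch. It is not the VisAR code: the `Rule` class, the function names, and the `exact` flag are hypothetical, the descending sort order is an assumption based on the first rule in Figure 47 having the highest support, and the support and confidence values are invented for the example. Only the first rule's items are taken from the text.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Rule:
    antecedent: frozenset
    consequent: frozenset
    support: float      # percentage
    confidence: float   # percentage

def filter_rules(rules, selected, exact=False):
    """Keep rules whose antecedent contains the selected items.

    With exact=True only rules whose antecedent is exactly the selected
    items are kept; otherwise other antecedent items are allowed as well.
    """
    selected = frozenset(selected)
    if exact:
        return [r for r in rules if r.antecedent == selected]
    return [r for r in rules if selected <= r.antecedent]

def sort_rules(rules, key="support"):
    """Sort rules by support (the default) or confidence, highest first."""
    if key not in ("support", "confidence"):
        raise ValueError("key must be 'support' or 'confidence'")
    return sorted(rules, key=lambda r: getattr(r, key), reverse=True)

# Invented support and confidence values, for illustration only.
rules = [
    Rule(frozenset({"cd", "rice", "battery", "soya sauce"}),
         frozenset({"newspaper"}), 28.0, 22.0),
    Rule(frozenset({"cd", "rice"}), frozenset({"battery"}), 25.0, 27.0),
    Rule(frozenset({"rice", "sweets"}), frozenset({"soya sauce"}), 21.0, 29.0),
]

default_view = sort_rules(filter_rules(rules, {"cd", "rice"}))
exact_view = filter_rules(rules, {"cd", "rice"}, exact=True)
by_confidence = sort_rules(filter_rules(rules, {"cd", "rice"}), key="confidence")
```

Keeping the filter and the sort as separate functions reflects the two independent options offered to the user: either can be changed without affecting the other.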

4.6.2 The advantages of VisAR

The VisAR system can be considered as a hybrid of the Matrix-based and Graph-based techniques. The technique has the following advantages over the Table-based, ordinary Matrix-based, and Graph-based techniques.

• VisAR allows users to specify items of interest for visualizing association rules containing such items. This feature allows users to focus on specific association rules instead of viewing all association rules in a single view.

• VisAR has no limitation on the number of items in both the antecedent and

the consequent to be displayed. The system can visualize both many-to-one

and many-to-many association rules seamlessly.

• VisAR employs the benefits of both Matrix-based and Graph-based techniques

for placing and linking items in association rules to solve the occlusion prob-

lem. The Matrix-based technique organizes the items like an array in which

items are placed in rows while association rules are displayed by columns.

The employed Graph-based technique links the same groups of items and the

items of the same association rules so that users can easily identify the groups

of items and individual association rules.

• There is no screen clutter or occlusion even when a large number of rules are

displayed on the same screen.

• VisAR visually separates antecedent items and consequent items so that the

users can clearly distinguish between the antecedent items and the consequent

items of the association rules.

• The simplicity of VisAR makes the visualization easier for users to interpret. The users can identify groups of association rules which have close values of support or confidence.


4.7 Summary

Visualization techniques have been widely researched and integrated into many applications involving data analysis tasks, including data mining, in order to increase human abilities to deeply understand data and extract hidden patterns from large datasets. However, current association rule mining algorithms have some shortcomings. Most of these algorithms mine a large number of association rules, and some of these rules are not practically interesting. Moreover, it is difficult for analysts to understand and interpret a large number of rules. Most of the visualization techniques display all mined association rules in a single screen, and it is difficult for an analyst to interpret such large amounts of information. In addition, some visualization techniques encounter problems of screen clutter and occlusion.

The VisDM system has been introduced for mining association rules. The tight

coupling of VisDM helps users in filtering only interesting association rules. The

interactive visualization technique of VisDM is useful in mining market basket as-

sociation rules so that users can obtain visual feedback and apply their knowledge

in guiding the mining process.

The VisAR technique reduces the number of visualized association rules for effec-

tively interpreting and understanding a large number of rules. The analysts can also

choose to view specific association rules through their choice of items of interest. In

addition, the visualization technique has overcome the problems of screen clutter

and occlusion.

The next chapter introduces an integration of a visual exploration technique sim-

ilar to VisEx into an on-line analytical processing (OLAP) system to enhance the

analysis and decision making capabilities of analysts.


Chapter 5

Interactive Visualization for On-line Analytical Processing

5.1 Introduction

Modern business processes generate an enormous amount of data that needs to be

analyzed and understood for better business performance. Executives, managers,

and analysts need a tool for making decisions and planning strategies. On-line

analytical processing (OLAP) has become an important tool for interactive analysis

of multidimensional databases such as data warehouses. This tool helps analysts

to explore, analyze, and extract interesting patterns from massive amounts of data

stored in multidimensional databases. A variety of industries have adopted data

warehouses as the preferred mode of data storage in order to manage the explosive

growth of their databases [11].

OLAP tools provide functionalities such as slicing, rolling up, and drilling down for an end user to analyze and navigate through dynamic multidimensional data cubes. Though OLAP research has been conducted extensively in the past several years, most research in OLAP has focused on modeling tasks with textual forms of presentation [14, 54, 74], e.g., pivot tables. Some commercial systems [31, 54] provide combinations of visualization techniques such as bar charts, line graphs, and histograms which allow users to view a single snapshot of each textual representation. There is very little available research in interactive visualization for OLAP.

Visualization is a powerful tool supporting visual representation and exploration

of massive datasets. The capability of humans to interpret and capture informa-

tion from graphical formats such as a chart is better than from a list of numbers

or from text files. This chapter introduces a novel interactive visual exploration

technique for analysis of multidimensional data cubes from data warehouses. To

obtain an effective and powerful analysis, the tool incorporates visualization into

OLAP services, which enables analysts to explore overviews of high levels of data

and drill down into levels of detail of each dimension directly. The integration of

both visualization and OLAP not only helps users to extract interesting patterns

but also helps them to interpret and analyze the extracted information from OLAP

faster. My technique allows users to view the visualization of all previously selected paths of interest so that users do not need to remember which levels and dimensions they are looking at.

Since hierarchical structures have been deployed in most multidimensional databases,

I feel that it is difficult for users to explore multidimensional data with a tool pro-

viding only overviews of data. It is important for users to interactively drill down

through the low levels of details to refine their views. Furthermore, interactive textual displays alone, such as the PivotTable, are not effective for understanding or extracting patterns from multidimensional databases.

Sifer has presented SGViewer [63], an interface technique for querying in

OLAP. The technique consists of three coordinated views: progressive, global,

and result. A user can drill down through the progressive view

and inspect the details of the results through the result view. The global view displays

the trend of a found result. Although SGViewer is conceptually similar,

it differs from the technique developed in this thesis. The SGViewer technique

is only a design exercise and supports only leaf nodes in five dimensions, whereas

the approach in this thesis has no such limitation.

This new technique, called VisOLAP, lets users decide at any time whether

to get overviews of the data, to drill down into lower levels, to roll up to higher

levels, or to view any particular region of interest. The VisOLAP

system provides a navigation facility that relieves users of having to remember

the exploration path of interest. In addition, users can keep track of

the exploration and view the distribution of navigation results across the selected

dimensions with their explored levels and members.

This chapter is organized as follows. Section 5.2 introduces the terminology used

throughout this chapter. Section 5.3 introduces the system architecture and discusses

how the system is implemented. Section 5.3.1 describes the system

connection, followed by a discussion of how to visualize the OLAP data cube in

Section 5.3.2. The remaining components of the VisOLAP system are described next:

the interaction tool in Section 5.4 and the query generation tool

in Section 5.5. An analysis of VisOLAP, including experimental

results, is given in Section 5.6, and Section 5.7 concludes the chapter.

5.2 Terminology

A detailed discussion of OLAP technology is beyond the scope of this thesis;

however, a brief overview of some of the concepts used in this chapter is given below. An

on-line transaction processing (OLTP) system is associated with relational database

systems and serves everyday transactions and operations. In contrast,

an on-line analytical processing (OLAP) system is associated with multidimensional

database systems, and its data are normally stored in data warehouses [21, 76]. An OLAP

system helps analysts and knowledge workers, such as managers, with analysis tasks and

decision making. Data in warehouses are historical data that are summarized and

aggregated from a variety of relational databases. In both OLAP tools and data

warehouses, the data are modeled as a multidimensional data cube.

The multidimensional data model, known as a data cube, consists of three major

components: schema, dimensions, and measures. The schema typically

contains a fact table and dimension tables. Schemas can be categorized into

three types as follows.

• Star schema: The fact table is a large central table surrounded by dimension

tables. It contains the measures and the dimension keys that link to the individual

dimension tables. Each dimension table also contains a set of attributes. This

schema conceptually forms a shape like a star.

• Snowflake schema: This schema is an alternative to the star schema, and its shape

resembles a snowflake. Some dimension tables of this schema are in normalized

form; in other words, those dimension tables are split into sub-tables to reduce

redundancy. However, the additional join operations can reduce the efficiency

of this schema when executing a query.

• Fact Constellation schema: This schema is a set of star schemas. It contains

multiple fact tables sharing dimension tables.
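
The star schema described above can be illustrated with a small sketch. The Python fragment below is only an illustration under assumed data: the table names (`product_dim`, `time_dim`, `fact_sales`), the keys, and the numbers are hypothetical and do not come from the FoodMart database or from VisOLAP.

```python
# Hypothetical dimension tables, keyed by surrogate dimension keys.
product_dim = {
    1: {"family": "Drink", "department": "Beverages"},
    2: {"family": "Drink", "department": "Dairy"},
    3: {"family": "Non-consumable", "department": "Household"},
}
time_dim = {
    10: {"year": 1997, "quarter": "Q1"},
    11: {"year": 1997, "quarter": "Q2"},
}

# Hypothetical fact table: dimension keys plus one measure (unit sales).
fact_sales = [
    {"product_key": 1, "time_key": 10, "unit_sales": 5},
    {"product_key": 2, "time_key": 11, "unit_sales": 7},
    {"product_key": 3, "time_key": 10, "unit_sales": 3},
]

def unit_sales_by(attribute, dim_table, key_name):
    """Join the fact table to one dimension table and sum the measure."""
    totals = {}
    for row in fact_sales:
        member = dim_table[row[key_name]][attribute]
        totals[member] = totals.get(member, 0) + row["unit_sales"]
    return totals

by_family = unit_sales_by("family", product_dim, "product_key")
by_quarter = unit_sales_by("quarter", time_dim, "time_key")
```

The fact table stores only keys and measures; the descriptive attributes live in the dimension tables, which is why every aggregation goes through a key lookup.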

Dimensions and measures organize a data cube so that aggregated data can be viewed

from different perspectives. The term dimension is used to

represent categories of data. Dimensions and measures are similar to independent and

dependent variables in statistics. The distinction between dimensions and measures

is as follows.

• Dimensions are organized hierarchically and are similar to independent

variables. Dimensions are distributed across the dimension tables of

the schema. For instance, a product is a dimension and the number of unit

sales of the product is a measure. Dimensions usually have hierarchies

consisting of multiple levels of abstraction from a high level to a low level.

For example, the product dimension may be composed of product family,

product department, and product name. A time dimension comprises year,

quarter, and month.

• A measure is similar to a dependent variable and is a numeric value.

Aggregating a measure should produce a new, meaningful number. Typically,

measures are stored in the fact table of the schema. An example of a

measure is the sales amount.
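
To make the roles of dimensions and measures concrete, the following minimal sketch aggregates a measure at a chosen level of a hierarchical time dimension. The rows, the `LEVELS` tuple, and the `aggregate` function are hypothetical names introduced for illustration only.

```python
from collections import defaultdict

# Levels of a hypothetical Time dimension, ordered from high to low.
LEVELS = ("year", "quarter", "month")

# Hypothetical fact rows: a member path (year, quarter, month) in the Time
# dimension and a measure value (sales amount).
rows = [
    ((1997, "Q1", 1), 100.0),
    ((1997, "Q1", 2), 150.0),
    ((1997, "Q2", 4), 200.0),
    ((1998, "Q1", 1), 50.0),
]

def aggregate(rows, level):
    """Sum the measure for every member at the given hierarchy level."""
    depth = LEVELS.index(level) + 1
    totals = defaultdict(float)
    for path, value in rows:
        totals[path[:depth]] += value  # truncate the path to the chosen level
    return dict(totals)

by_year = aggregate(rows, "year")
by_quarter = aggregate(rows, "quarter")
```

Summing the sales amount at a higher level yields another sales amount, which is the sense in which the aggregation of a measure remains meaningful.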

Microsoft SQL Server for OLAP maps a data schema to ADOMD objects, as the

diagram in Figure 51 shows. The diagram is mainly composed of collections and

objects.

There are three main approaches to communicating with an OLAP server in Microsoft

SQL Server Analysis Services [29]:

• Decision Support Object model (DSO)

• Add-ins Interface and Objects

• PivotTable Services

VisOLAP relies on PivotTable Services, so the implementation accesses the OLAP

server only through PivotTable Services. PivotTable Services provides

OLE DB functionality for accessing both multidimensional data and data mining

through Multidimensional Expressions (MDX). Just as SQL is a syntax for querying

and manipulating data in relational databases, MDX is a syntax for

querying and manipulating multidimensional data in OLAP data cubes [53].

Figure 51: A diagram of ADOMD object model [29].

5.3 VisOLAP system architecture and implementation

The VisOLAP system architecture consists of four main components: system

connection, visualization, interaction, and query generation, as shown in Figure 52. The

system connects, through an OLAP server, to a multidimensional database or data cube

that the user provides. The user selects dimensions to explore from the

user interface in the left panel, as shown in Figure 54. The system then retrieves

details of the members in those dimensions from the data cube and organizes

these members in barsticks. This visual feedback gives users the member details

of each selected dimension so that they can interact with and explore the correlations

among the selected dimensions. The interaction tool allows users to obtain

details of individual members of the dimensions and browse into deeper levels of each

dimension. Each component is described in more detail in the following sections.

I have implemented the system for visualizing OLAP data cubes in Visual C++

with both the ActiveX Data Objects (ADO) and ActiveX Data Objects Multidimensional

(ADOMD) interfaces. The ADOMD interface is an extension of the ADO

interface. The system has been developed to access multidimensional databases

through PivotTable Services in Microsoft SQL Server 2000 Analysis Services.

5.3.1 System connection

To connect VisOLAP with a multidimensional database, the system opens

a connection to the database and creates a catalog to activate

the connection. The system then prepares its internal structures and retrieves the

multidimensional schema information to obtain details of the data cube structure.

Finally, the system initializes a tree visualization of the schema and organizes a data

structure for retrieving properties of dimensions through ADOMD objects.

Figure 52: VisOLAP system architecture

5.3.2 Visualizing OLAP data cubes

I have adapted the idea from Chapter 3 to explore the hierarchical structure of OLAP

data cubes. A barstick again represents each dimension of the data

cube, but the details of the display are different.

One of the frameworks for visualizing OLAP data cubes in VisOLAP is shown in

Figure 53. The figure represents a framework of a product sales data cube and

its visualization. The data cube has three hierarchical dimensions, Time

(T), Product (P), and Location (L), and consists of eight data cells. Each data cell

shows the number of products sold in a specific location at a specific time. For example,

V1 is the number of P1 sold in location L1 at time T1. Each dimension is displayed on a

barstick. Barsticks are arranged vertically in hierarchical fashion when users select

the highest level of a dimension of interest. A barstick is divided into small rectangles

that represent all existing members of the level in that dimension. A member that

has no measure value is hidden from the barstick. The length of each

rectangle is proportional to the share of the measure value held by the member in

the selected dimension. For example, suppose the selected dimension is ‘time’, which contains

four members in the next level, ‘quarter’. The total profit is $250,000 and

the profits for the quarters are $70,000, $55,000, $37,500, and $87,500. The

shares of the total profit for the quarters are 28%, 22%, 15%, and 35%, respectively,

and the rectangles representing the quarters have lengths in these proportions.
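
The proportion calculation in the example above can be sketched as follows. This is a minimal sketch, not the VisOLAP implementation: the function name `rectangle_lengths`, the barstick length of 100 units, and the assignment of the four profit figures to Q1 through Q4 in order are assumptions.

```python
def rectangle_lengths(measures, barstick_length):
    """Split a barstick into rectangles proportional to each measure value.

    Members with no measure value (None or 0) are hidden, as described in
    the text.
    """
    visible = {member: value for member, value in measures.items() if value}
    total = sum(visible.values())
    if total == 0:
        return {}
    return {
        member: barstick_length * value / total
        for member, value in visible.items()
    }

# Quarterly profits from the example; their order is an assumption.
profits = {"Q1": 70_000, "Q2": 55_000, "Q3": 37_500, "Q4": 87_500}
lengths = rectangle_lengths(profits, barstick_length=100)
```

With a barstick of 100 units, each length equals the percentage share of the quarter, so the four rectangles measure 28, 22, 15, and 35 units.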

The system interface is divided into three areas:

• The left panel represents the multidimensional data model.

• The upper right panel displays the visual exploration through interaction.

• The lower right panel previews one deeper level.

Figure 54 presents an example of visualization for exploring OLAP data cubes. A

tree structure in the left panel of Figure 54 represents the hierarchical arrangement

Figure 53: The left side of the figure shows a data cube consisting of hierarchical dimensions including Time (T), Product (P), and Location (L), and eight data cells with the numbers of product sales. The right side is a visualization of the selected dimensions with their members.

Figure 54: The left panel shows the tree structure of the OLAP data cube. The upper right panel visualizes the selected dimensions as hierarchical barsticks including Product Family, Product Department, Store Type, Year, and Quarter levels. The lower right panel displays one deeper level of the data cube in advance depending on the position of the mouse. In this case, the user has positioned the mouse on ‘Q3’ and this panel shows the ‘Month’ level of the third quarter.

of the data cube. The upper right panel displays explored hierarchical barsticks

while the lower right panel illustrates one deeper level of the currently selected

member. In this example, ‘Sales’ is the selected data cube and ‘Unit sales’ is the

selected measure. The first barstick represents the ‘Product Family’ level of the

‘Product’ dimension. ‘Product Family’ has three members, ‘Drink’, ‘Foods’, and

‘Non-consumable’; ‘Foods’ has the highest unit sales, as shown by its rectangle being

the longest in the barstick. The second barstick illustrates the ‘Product Department’

level of the ‘Product’ dimension: the ‘Drink’ member of ‘Product Family’ consists of

‘Alcoholic Beverages’, ‘Beverages’, and ‘Dairy’. The ‘Store Type’ level of the

‘Store Type’ dimension is selected in the third barstick, where ‘Supermarket’ has the

highest proportion of unit sales of the drink product, followed by ‘Deluxe Supermarket’,

‘Gourmet Supermarket’ (G), ‘Mid-Size Grocery’ (M), and ‘Small Grocery’ (S). The fourth

and last barsticks explore the ‘Year’ and ‘Quarter’ levels of the ‘Time’ dimension.

Unit sales data for the ‘Drink’ product exist only for 1997, and the fourth

quarter has the largest number of unit sales. When the user places the mouse over the

third quarter of the last barstick, the lower right panel displays the month level,

consisting of July, August, and September (7, 8, and 9). This is the next deeper

level in the ‘Time’ dimension.

5.4 Interaction

To efficiently support the analysis and exploration processes of the hierarchical

structure, the technique provides several navigational functions:

Drill down: Drill down navigates into deeper hierarchical levels of

a dimension in a data cube to obtain more detail about a particular member. A

framework of this function is shown in Figure 55. This framework represents

drilling down one level in the Product dimension and mapping

the numbers in the cells to the visualization. Figure 57, for example, shows the drill

Figure 55: An illustration representing the mapping of the numbers of the drilled down cells in the data cube to the visualization.

down function on the ‘Location’ dimension from the ‘Countries’ level to the ‘States’ level in

the data cube view. My technique provides this function through a double click of the left

mouse button. Users can drill down into deeper levels at any time in any dimension. This

feature allows users to view a dimension in more detail. Figure 54 shows drilling

down in the ‘Product’ dimension from ‘Product Family’ to ‘Product Department’

and in the ‘Time’ dimension from the ‘Year’ level to the ‘Quarter’ level for ‘Unit

sales’ in the ‘Deluxe Supermarket’ store type.

Roll up: In contrast to drill down, roll up navigates to the

upper levels of dimensions so that a user can see an overview of the explored members.

A framework of this function is shown in Figure 53. This framework represents

rolling up in the Product dimension of the framework in Figure 55. Figure 57

also shows the data cube view of a roll up operation from the ‘States’ level to

the ‘Countries’ level in the ‘Location’ dimension. Users can roll up any particular

dimension by double clicking the right mouse button. Figure 58 illustrates rolling

up the ‘Product’ dimension from the ‘Product Department’ level to the ‘Product

Family’ level.
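
Drill down and roll up can be viewed as moving an index along the ordered levels of a dimension. The sketch below illustrates that idea only; the class `DimensionView` and its methods are hypothetical and are not part of VisOLAP or ADOMD.

```python
class DimensionView:
    """Tracks which hierarchy level of one dimension is currently shown."""

    def __init__(self, name, levels):
        self.name = name
        self.levels = list(levels)  # ordered from high level to low level
        self.depth = 0              # index of the level currently displayed

    @property
    def level(self):
        return self.levels[self.depth]

    def drill_down(self):
        """Move one level deeper if a deeper level exists."""
        if self.depth < len(self.levels) - 1:
            self.depth += 1
        return self.level

    def roll_up(self):
        """Move one level higher if a higher level exists."""
        if self.depth > 0:
            self.depth -= 1
        return self.level


# Level names follow the Location example in the text.
location = DimensionView("Location", ["Countries", "States"])
```

In this sketch both operations stop at the ends of the hierarchy, so a repeated drill down at the lowest level simply leaves the view unchanged.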

Slice: This function allows users to view a particular sub-cube for any selected

Figure 56: An illustration representing the mapping of the numbers of the sliced cells in the data cube to the visualization.

dimension of the data cube. Figure 56 represents a framework of the slice function

in which the selection changes from P1, in the framework of Figure 53, to P2.

In Figure 57, a slice operation is shown on the ‘Drink’ member in the ‘Product’

dimension. My tool provides this function through a click of the left mouse button, which

shows the measure values for another member in the same level. Figure 59 shows the change of

exploration from Deluxe Supermarket (in Figure 54) to Supermarket.
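
The slice operation can be sketched on a cube shaped like the one in Figure 53, with two members per dimension and eight cells. The cell values below are made up (1 to 8), and `slice_cube` is a hypothetical helper, so this is an illustration of the operation rather than of the VisOLAP code.

```python
import itertools

DIMS = ("time", "product", "location")
MEMBERS = {
    "time": ["T1", "T2"],
    "product": ["P1", "P2"],
    "location": ["L1", "L2"],
}

# Hypothetical 2 x 2 x 2 cube: each (time, product, location) cell holds a
# made-up number of products sold (1 to 8).
cells = itertools.product(*(MEMBERS[d] for d in DIMS))
cube = {cell: units for units, cell in enumerate(cells, start=1)}

def slice_cube(cube, dimension, member):
    """Keep only the cells whose coordinate on `dimension` equals `member`."""
    axis = DIMS.index(dimension)
    return {cell: units for cell, units in cube.items() if cell[axis] == member}

# Changing the selection from P1 to P2 yields a different sub-cube.
p1 = slice_cube(cube, "product", "P1")
p2 = slice_cube(cube, "product", "P2")
```

Each slice keeps four of the eight cells; the remaining dimensions (time and location) are left free for further exploration.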

All navigational functions can be combined automatically as users interact with

each barstick so that they can view any particular region of interest in the data

cube. For instance, the combination of the drill down and slice functions allows users to

explore unit sales of ‘Alcoholic Beverages’ in ‘Product Department’ for all quarters

in 1997. Moreover, users can view each dimension independently when its barstick

is first created or by clicking the right mouse button on any barstick. The system

also supports details on demand. When users move the mouse over any rectangle in a

barstick, the details of the rectangle, including the name of the member

and its measure value, are shown in a pop-up box, as for ‘Q3’ in Figure 54.

Figure 57: Examples of OLAP functionalities including drilling down, rolling up, and slicing on multidimensional data.

Figure 58: This figure shows the selected dimensions including Product Family, Store Type, Year, and Quarter. The Product dimension is rolled up from the Product Department level to the Product Family level of Figure 54.

5.5 Visual Exploration and MDX query

I have bound MDX queries to the navigational functions of

the interactive tool so that users who are not OLAP experts can explore OLAP

data cubes and data warehouses without writing sophisticated MDX queries.

The basic syntax of an MDX statement looks similar to that of a SQL statement. The

general form of an MDX statement is:

    SELECT <member selection> ON axis1, <member selection> ON axis2, ...
    FROM <cube name>

A calculated member is a member of a dimension whose value is derived from the values of

other members. Calculated members are used in the interaction tools of the VisOLAP system.

The definition of a calculated member is stored, and its value is calculated in response to

a query. A calculated member can be defined with one of two MDX statements,

WITH MEMBER or CREATE MEMBER. The system uses only the WITH MEMBER statement

to set up interactive queries, which aggregate new member

values and measures without increasing the size of the cube.

I describe some examples of the combination of MDX and the interactive tools

Figure 59: This figure shows the change of member selection from Deluxe Supermarket to Supermarket in the Store Type level.

based on Figure 54. Suppose ‘Sales’ is the selected data cube, ‘Unit Sales’ is the selected

measure, and ‘Product Family’ is the selected level of the ‘Product’ dimension for

querying. To view all members of the ‘Product Family’ level in proportion, an MDX

query for this process can be written as:

    WITH MEMBER Measures.[sum] AS
        'sum([Product].[Product Family].members, Measures.[Unit Sales])'
    MEMBER Measures.[percent] AS
        '(([Product].CURRENTMEMBER, Measures.[Unit Sales]) / (Measures.[sum]))',
        FORMAT_STRING = 'Percent'
    SELECT {[Measures].[percent], [Measures].[Unit Sales]} ON COLUMNS,
        NON EMPTY [Product].[Product Family].members ON ROWS
    FROM Sales
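
A query of the form shown above could be assembled from the current selection by simple string templating. The thesis does not show how VisOLAP builds its query strings, so the helper `proportion_query` below, including its name and parameters, is a hypothetical sketch that merely reproduces the structure of the statement.

```python
def proportion_query(cube, dimension, level, measure):
    """Build MDX listing each member of a level with its share of a measure."""
    members = f"[{dimension}].[{level}].members"
    value = f"Measures.[{measure}]"
    return "\n".join([
        # Calculated member holding the total over all members of the level.
        f"WITH MEMBER Measures.[sum] AS 'sum({members}, {value})'",
        # Calculated member holding each member's share of that total.
        f"MEMBER Measures.[percent] AS "
        f"'(([{dimension}].CURRENTMEMBER, {value}) / (Measures.[sum]))', "
        f"FORMAT_STRING = 'Percent'",
        f"SELECT {{[Measures].[percent], [Measures].[{measure}]}} ON COLUMNS,",
        f"NON EMPTY {members} ON ROWS",
        f"FROM {cube}",
    ])

query = proportion_query("Sales", "Product", "Product Family", "Unit Sales")
```

Because both calculated members are declared with WITH MEMBER, they exist only for the duration of the query and do not change the cube.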

When the user drills down to the ‘Product Department’ level through ‘Drink’ in the

‘Product Family’ level, an equivalent MDX query, shown below, is automatically

generated to display the ‘Product Department’ members on the second barstick.

    WITH MEMBER Measures.[sum] AS
        'sum([Product].[All Products].[Drink].children, Measures.[Unit Sales])'
    MEMBER Measures.[percent] AS
        '(([Product].CURRENTMEMBER, Measures.[Unit Sales]) / (Measures.[sum]))',
        FORMAT_STRING = 'Percent'
    SELECT {[Measures].[percent], [Measures].[Unit Sales]} ON COLUMNS,
        NON EMPTY {DESCENDANTS([Product].[All Products].[Drink],
                               [Product].[Product Department])} ON ROWS
    FROM Sales
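
The two set expressions that change when the user drills down, the set summed for the total and the set placed on the rows axis, can be derived from the path of the selected member. The helpers `unique_name` and `drill_down_sets` are hypothetical names used only for this sketch; `DESCENDANTS` and `.children` are standard MDX.

```python
def unique_name(dimension, path):
    """Join a member path such as ['All Products', 'Drink'] into an MDX name."""
    return f"[{dimension}]" + "".join(f".[{part}]" for part in path)

def drill_down_sets(dimension, path, target_level):
    """Return (set summed for the total, set shown on the rows axis)."""
    member = unique_name(dimension, path)
    sum_set = f"{member}.children"
    row_set = f"{{DESCENDANTS({member}, [{dimension}].[{target_level}])}}"
    return sum_set, row_set

sum_set, row_set = drill_down_sets(
    "Product", ["All Products", "Drink"], "Product Department"
)
```

For a drill down of exactly one level, the children of the member and its descendants at the next level are the same set, which is why the percentages on the new barstick sum to 100%.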

After drilling down to the ‘Product Department’ level, the user might explore

several new dimensions to view the correlations. The user can also

explore dimensions that have themselves been drilled down. An equivalent MDX query

representing the exploration shown in Figure 54 is:

    WITH MEMBER Measures.[sum] AS
        'sum([Time].[1997].children, Measures.[Unit Sales])'
    MEMBER Measures.[percent] AS
        '(([Time].CURRENTMEMBER, Measures.[Unit Sales]) / (Measures.[sum]))',
        FORMAT_STRING = 'Percent'
    SELECT {[Measures].[percent], [Measures].[Unit Sales]} ON COLUMNS,
        NON EMPTY {DESCENDANTS([Time].[1997], [Time].[Quarter])} ON ROWS
    FROM Sales
    WHERE ([Product].[All Products].[Drink].[Alcoholic Beverages],
           [Store Type].[All Store Type].[Deluxe Supermarket])

As shown in Figure 54 and Figure 59, suppose the user changes the queried member

in the barstick representing the ‘Store Type’ level from ‘Deluxe Supermarket’ to

‘Supermarket’. The equivalent MDX statement that displays the ‘Year’

members on the fourth barstick of the ‘Time’ dimension is:

    WITH MEMBER Measures.[sum] AS
        'sum([Time].[Year].members, Measures.[Unit Sales])'
    MEMBER Measures.[percent] AS
        '(([Time].CURRENTMEMBER, Measures.[Unit Sales]) / (Measures.[sum]))',
        FORMAT_STRING = 'Percent'
    SELECT {[Measures].[percent], [Measures].[Unit Sales]} ON COLUMNS,
        NON EMPTY [Time].[Year].members ON ROWS
    FROM Sales
    WHERE ([Product].[All Products].[Drink].[Alcoholic Beverages],
           [Store Type].[All Store Type].[Supermarket])
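
In the two statements above, only the WHERE clause changes when the user selects a different member on an earlier barstick. A minimal sketch of how such a slicer could be derived from the selections follows; `slicer_clause` and the dictionary layout are assumptions made for illustration, not the VisOLAP data structures.

```python
def slicer_clause(selections):
    """Build an MDX WHERE clause from members selected on other barsticks.

    `selections` maps a dimension name to the path of its selected member.
    """
    if not selections:
        return ""
    names = [
        f"[{dimension}]" + "".join(f".[{part}]" for part in path)
        for dimension, path in selections.items()
    ]
    return "WHERE (" + ", ".join(names) + ")"

selections = {
    "Product": ["All Products", "Drink", "Alcoholic Beverages"],
    "Store Type": ["All Store Type", "Deluxe Supermarket"],
}
before = slicer_clause(selections)

# A slice on the Store Type barstick replaces only that dimension's member.
selections["Store Type"] = ["All Store Type", "Supermarket"]
after = slicer_clause(selections)
```

Keeping one selected path per dimension means a slice touches a single dictionary entry, while the rest of the exploration trail is preserved.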

When the user rolls up the ‘Product’ dimension from the ‘Product Department’ level

to ‘Product Family’, a few more interaction steps are needed if

no member of the rolled up level is selected in the following

selected dimension. All of these MDX queries are generated automatically, depending on user

interaction.

5.6 Analysis

The FoodMart 2000 database [29] is used in the case study. The database consists of

data cubes such as ‘Budget’, ‘HR’, ‘Sales’, and ‘Warehouse’. The ‘Sales’ data cube

comprises twelve dimensions, not counting the hierarchical levels of each dimension,

and seven measures. The ‘Product’, ‘Time’, ‘Store Type’, ‘Promotion Media’, and

‘Promotions’ dimensions and the ‘Unit sales’ and ‘Profit’ measures are used for the

case study. Suppose a store manager would like to increase the sales of the drink

products stocked in the store. Figures 54, 58, and 59 show exploration of the drink

product family. The manager can extend the exploration to the ‘Promotions’ and

‘Promotion Media’ dimensions to see how they affect the sales amounts in each

year. For example, it is easy to find the following correlations by exploring

the data cube.

Daily paper, radio, and TV tend to be the most effective media for increasing the

amount of sales in the sales days promotion of the supermarkets, while bulk mail tends

to be the most effective way to advertise promotions including ‘You Save Day’,

‘Shelf Emptiers’, and ‘Sales Galore’ for gourmet supermarkets. Daily paper is the

most effective medium for advertising promotions such as ‘Big Time Discounts’

for mid-size groceries, and ‘In-Store Coupon’ is the most effective way to increase

sales for small groceries, as shown in Figure 60. In addition, the

number of unit sales varies over time. For instance, the fourth quarter has the

highest number of alcoholic beverage and beverage sales in all stores except the

small groceries, which have the highest number of alcoholic beverage sales in the

third quarter, as shown in Figure 61. However, because analysts, managers, and

executives know the market and store situations better, they can explore and analyze

the data in several efficient ways.

5.7 Summary

OLAP applications have been researched extensively, but most of this work

addresses only modeling tasks and presents results in textual formats.

The integration of visualization and interaction tools into OLAP enhances

the human capability to analyze and understand multidimensional databases.

A novel interactive visual exploration tool has been introduced for the analysis of OLAP

data cubes. The technique provides visual feedback while users explore data cubes

in graphical formats rather than textual table formats. The incorporation of both

visualization and the OLAP service helps users to understand their data deeply, gain insight,

and extract useful information faster. Hierarchical barsticks are

presented for exploring and visualizing the hierarchical structures of OLAP data

cubes. Users can view the trail of the exploration through the visualization, which

reduces the memory load, and can also preview one deeper level before

drilling down. In addition, the technique provides users with overviews

Figure 60: This figure shows an example of visualization for exploring Promotion media, Store type, and Unit sales (panels (a) to (d)).

Figure 61: This figure shows an example of visualization for exploring alcoholic beverage sales of small groceries in Year 1997.

and refined views of interest in the data cubes. Users can change the

exploration views at any time through the combination of navigational functions and

interactive tools. VisOLAP is expected to be useful for interactive visual exploration

of data cubes.

The next chapter provides a discussion of the implications of the systems presented

in this thesis and emphasizes the contributions and the limitations of individual

chapters. Some future research directions are also discussed.

Chapter 6

Conclusion

6.1 Summary

I have investigated several visualization frameworks based on the dynamic query

mechanism in this thesis. The emphasis in designing all of these frameworks was on

simplicity, flexibility, giving the user all the controls for selection and exploration

and finally, reducing the overload of information and occlusion that is present in

other existing systems.

The setting and contributions of the thesis are presented in Chapter 1. Chapter 2

then gives a detailed overview of a diverse range of visualization techniques.

Since the interest in this thesis is in designing dynamic and interactive

visualization frameworks, the design methodologies are placed within the dynamic

query framework, which is explained in detail in the second

part of Chapter 2.

Chapter 3 presented a novel technique, VisEx, for visual exploration of multidimen-

sional datasets, in particular for exploring correlations among attributes in large

datasets. Most previous visualization systems can display the correlations among

a small number of attributes, usually two or three, whereas it is possible to explore

correlations among many attributes in VisEx. Moreover, the user does not need

to go through any prior training or have any prior knowledge of the underlying

database for using VisEx. The system provides visual feedback to the user during

interactive visualization sessions and the user can dynamically change the attributes

and their ranges for testing hypotheses. A user study of the system has also been

conducted to evaluate its simplicity and usefulness.

Chapter 4 presented a new technique called VisDM for integrating visualization into

association rule mining algorithms. Most algorithms for association rule mining

generate a large number of rules, not all of which are interesting. Hence, there is a

need to integrate human expertise into the mining algorithm so that an analyst can

mine interesting association rules. The user has complete freedom in choosing the

antecedents and consequents of rules in the VisDM system, and hence it is possible

for the user to mine interesting rules instead of all rules that are above a threshold

of support and confidence. The simplicity of the VisDM system was demonstrated

through a user study.
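
As a reminder of the two measures mentioned above, the sketch below computes the support and confidence of a single rule whose antecedent and consequent are chosen by the user. It uses the standard definitions only; the transactions are made up, `support_confidence` is a hypothetical name, and the code does not reproduce the VisDM implementation of Chapter 4.

```python
def support_confidence(transactions, antecedent, consequent):
    """Return (support, confidence) of the rule antecedent -> consequent."""
    antecedent, consequent = set(antecedent), set(consequent)
    both = antecedent | consequent
    n_antecedent = sum(1 for t in transactions if antecedent <= set(t))
    n_both = sum(1 for t in transactions if both <= set(t))
    support = n_both / len(transactions)
    confidence = n_both / n_antecedent if n_antecedent else 0.0
    return support, confidence

# Hypothetical market-basket transactions.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]
support, confidence = support_confidence(transactions, {"bread"}, {"milk"})
```

Evaluating only the rule the analyst asks about avoids enumerating every rule above the thresholds, which is the contrast drawn in the paragraph above.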

A new framework called VisAR is also presented for visualizing a large number of

association rules in Chapter 4. Most previous systems display a large number of

rules in a single view and it is difficult for a user to concentrate on a subset of

interesting rules. Moreover, the display of a large number of rules usually results

in occlusion and screen clutter. VisAR integrates matrix-based and graph-based

techniques in a single framework to display a large number of association rules.

Moreover, VisAR gives the user complete freedom in choosing and visualizing the

rules that the user is interested in.

In Chapter 5, a novel interactive visualization technique called VisOLAP for OLAP

analysis tasks is presented. The visualization technique in VisOLAP has been

adapted from VisEx to explore and analyze the hierarchical structure of OLAP data

cubes and to relieve users of having to remember the exploration paths

of interest. Moreover, analysts can view exploration tracks and the distribution of

navigation results across the specified dimensions and levels.

All four systems developed in this thesis, VisEx, VisDM, VisAR and VisOLAP, are scalable with respect to the size of the datasets. VisEx can handle small as well as very large datasets. Each bar in a barstick can represent a single data record when the dataset is very small, and an arbitrarily large number of records when the dataset is large. Since VisEx is designed to give the user a quantitative estimate of the correlations among attributes, I have not tried to represent each data record individually. However, the user can obtain a quantitative estimate of the number of records, first by comparing the color levels used for coloring the bars and also, through a dialogue box, by clicking on a bar or group of bars. Representing data through the two primitives, barstick and bar, made it possible to incorporate this scalability in VisEx. Similarly, there is no limit on the size of the transactional databases that VisDM can handle. VisAR can handle a large number of association rules compared to other systems because of its two-dimensional display. While three-dimensional displays can give rise to occlusion, this is generally not a problem in VisAR. However, occlusion can still occur if the number of rules exceeds a few hundred. One of the main features of VisAR is that the user can visualize selected association rules instead of seeing all the rules at once. I expect users to rely on this facility most of the time, as it allows them to focus on specific association rules. Perhaps the least scalable of the systems is VisOLAP, as it is difficult to display a large number of dimensions at once within a limited screen space. At the same time, it is not possible to represent a collection of dimensions by a single bar, as in VisEx, because the user may need to see individual dimensions for drilling down. This aspect of VisOLAP needs to be explored further in the future.
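
The idea that one bar may stand for a single record or for arbitrarily many records, with the record count conveyed by a color level, can be sketched as follows. This is only an illustrative sketch and not the VisEx implementation: the equal-width binning, the linear mapping from counts to color levels, and the names `bin_records` and `color_levels` are all assumptions made for this example.

```python
from typing import List, Sequence


def bin_records(values: Sequence[float], num_bars: int) -> List[int]:
    """Count how many records fall into each of `num_bars` equal-width bars.

    Equal-width binning is an assumption of this sketch.
    """
    if num_bars <= 0:
        raise ValueError("num_bars must be positive")
    counts = [0] * num_bars
    if not values:
        return counts
    lowest, highest = min(values), max(values)
    width = (highest - lowest) / num_bars
    for value in values:
        if width == 0:
            index = 0  # all records share one value, so they share one bar
        else:
            # Clamp so that the maximum value lands in the last bar.
            index = min(int((value - lowest) / width), num_bars - 1)
        counts[index] += 1
    return counts


def color_levels(counts: Sequence[int], num_levels: int) -> List[int]:
    """Map each bar's record count to a discrete color level.

    Level 0 is reserved for empty bars; the fullest bar gets level
    `num_levels - 1`. The linear mapping is an assumption of this sketch.
    """
    if num_levels < 2:
        raise ValueError("num_levels must be at least 2")
    peak = max(counts, default=0)
    if peak == 0:
        return [0] * len(counts)
    # Integer ceiling of count * (num_levels - 1) / peak.
    return [(c * (num_levels - 1) + peak - 1) // peak for c in counts]


if __name__ == "__main__":
    # Hypothetical attribute values; the number of bars stays fixed
    # regardless of how many records are supplied.
    ages = [23, 25, 31, 31, 32, 47, 58, 58, 59, 60]
    per_bar = bin_records(ages, num_bars=4)
    print(per_bar, color_levels(per_bar, num_levels=5))
```

Because the number of bars is fixed by the display rather than by the data, the drawing cost in such a scheme does not grow with the number of records; only the counting pass does.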

The user studies reported in this thesis are only preliminary. It was not possible to compare the systems with similar systems, for two main reasons. First, it is difficult to obtain implementations of most of the other systems, either because they are commercial systems or because I did not receive any response from the authors. Second, I had only limited time and hence could not organize large-scale user studies. It is very important to conduct more extensive user studies of the systems reported in this thesis, with more participants and with participants from a diverse range of backgrounds.

6.2 Future Work

Although this thesis makes several contributions to the area of information visualization, a number of research directions remain open. Some of them are outlined below.

• An interesting research problem is to integrate visualization into other data mining algorithms such as clustering and classification. For example, an appropriate visualization for clustering may give analysts important feedback for understanding the association between clusters. In addition, visualization could be integrated into other areas of knowledge discovery.

• Another interesting problem is the incorporation of visualization into other automatic data mining algorithms through a tight coupling mechanism like that of VisDM.

• It is also important to explore the visualization of OLAP data cubes in more detail, as OLAP technology is becoming an integral part of most business processes.

Appendix A

This appendix contains the documents used in the user study of VisEx in Chapter 3: the tutorial and experimental tasks, and a questionnaire.

A.1 Tasks

A.1.1 Tutorial Tasks

1. Is it true that cars with low mpg and acceleration have higher weight and horsepower?

2. How many cylinders do most of the Japanese cars have? Do they have high or low displacement? (You can use the equal height histogram to look at the distribution of each value by double-clicking the left mouse button on the blue areas of each bar.)

3. Which country produces 3-cylinder cars and which country produces 8-cylinder cars?

4. Are Japanese 6-cylinder cars generally heavier than European 6-cylinder cars?

5. Did European and Japanese companies only produce 4-cylinder cars in 1982?

A.1.2 Experimental Tasks

For Tasks 9 and 10, please use the fixed attribute property located in Type Attribute Selection (on the left panel) for exploration.

1. Is it true that highly educated people (with 15-18 years of schooling) tend to work in professional and managerial occupations?

2. Is it true that highly educated clerks (with 15-18 years of schooling) tend to have higher wages?

3. Are more males working in clerical jobs compared to females?

4. Which group earns higher wages in managerial jobs, males or females?

5. Do older people (above 60) with a lot of experience tend to have higher education (more than 14 years of schooling)?

6. Do males tend to have higher education than females in managerial occupations?

7. What are the occupation and sex of the person who earns the highest wage and has less experience?

8. Is there a highest educated female (above 60 years old) who earns a high wage (at least 20 dollars per hour) and works in the Manufacturing sector?

9. Do higher educated people (with 15-18 years of schooling) earning high wages (20-45 dollars per hour) tend to live in the South? Does the higher educated person who earns the highest wage in the group live in the South?

10. How old is the male person who works in the service occupation and earns the highest wage, and how many years of schooling does he have?

A.2 Questionnaire

Part I : Please provide your information

1. Do you have experience in data analysis?

2. Do you have experience in using any visualization tool?

Part II : Please provide ranking of your satisfaction

Strongly disagree Disagree Fair Agree Strongly agree

Usability

• Easy to complete the tasks 1 2 3 4 5
• Easy to learn tool 1 2 3 4 5
• Easy to use tool 1 2 3 4 5
• Easy to understand visualization 1 2 3 4 5

Quality of visualization

• Clarity of visual representation 1 2 3 4 5
• I was able to understand displayed parameters 1 2 3 4 5
• I was able to identify the correlation among specified attributes 1 2 3 4 5
• I was able to compare the groups of objects in the specified range of attributes 1 2 3 4 5
• I was able to identify the specific groups of data objects 1 2 3 4 5
• I was able to use and understand equal height histogram 1 2 3 4 5
• I was able to identify the difference between Fixed attributes and Normal (Non fixed) attributes in Type Attribute Selection 1 2 3 4 5

Which one do you prefer to explore data sets? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

Learning (Quality of interaction)

• Easy to direct the search for data of interest (navigation) 1 2 3 4 5
• I was able to specify parameters 1 2 3 4 5
• I was able to correct my mistakes 1 2 3 4 5
• I was able to change the selection of parameters 1 2 3 4 5
• I was able to explore data 1 2 3 4 5
• I was able to use interactive features (through mouse click) 1 2 3 4 5
• I was able to use and change features (including Normal, Comparison, Fixed attributes) 1 2 3 4 5

Quality of information

• Reliable 1 2 3 4 5
• Interesting 1 2 3 4 5
• Clear and understandable 1 2 3 4 5
• Easy to interpret results 1 2 3 4 5

Part III : Please provide comments.

Please provide your comments about the system.

Appendix B

This appendix contains the documents used in the user study of VisDM in Chapter 4: tasks from two different data sets and a questionnaire.

B.1 Tasks

B.1.1 Tasks from Dataset1

Suppose you have recently been appointed as a marketing manager for a supermarket. You have been informed that the sales volume has dropped this month. You would like to identify which items have low unit sales, and which items' sales could be increased by running a promotion or by placing the items close together, so that customers buying one of these items may buy the other items as well.

You have four main tasks to complete. Please set minimum support = 10.

1. Identify (name) the two items with the highest and the two items with the lowest sales volume.

2. Identify four items which are purchased together most of the time and provide an association rule which satisfies the provided support and confidence thresholds. (Assume that support > 70 and confidence > 70 indicate a possible association between items purchased together.)

3. Identify three association rules involving item numbers 17, 42, and 11, including the support and confidence of each rule. Which rule do you think is the strongest?

4. Do you think it is possible that item numbers 31 and 29 are frequently purchased together? (Assume that support > 70 and confidence > 70 indicate a possible association between items purchased together.)

B.1.2 Tasks from Dataset2

You have four main tasks to complete. Please set minimum support = 10.

1. Identify the two items with the highest and the two items with the lowest sales volume.

2. Identify the two items which are purchased together the maximum number of times, and the association rule involving these items that satisfies the provided support and confidence thresholds. (Assume that support > 50 and confidence > 70 indicate a possible association between items purchased together.)

3. Identify three association rules involving the items tomato, pacifier, and rice, including the support and confidence of each rule.

4. Do you think rice and tomato are frequently purchased together? (Assume that support > 50 and confidence > 60 indicate a possible association between items purchased together.)

B.2 Questionnaire

Part I : Please provide your information

1. Do you have experience in data analysis?

2. Do you have experience in using any visualization tool?

Part II : Please provide ranking of your satisfaction

Strongly disagree Disagree Fair Agree Strongly agree

Usability

• Easy to complete the tasks 1 2 3 4 5
• Easy to learn tool 1 2 3 4 5
• Easy to use tool 1 2 3 4 5

Quality of visualization

• Clarity of visual representation 1 2 3 4 5
• I was able to understand parameters 1 2 3 4 5
• I was able to identify the maximum percentage of items purchased together (co-existing items) 1 2 3 4 5
• I was able to identify the minimum percentage of items purchased together (co-existing items) 1 2 3 4 5
• I was able to find the item that is bought most often 1 2 3 4 5
• I was able to find the item that is bought least often 1 2 3 4 5

Learning (Quality of interaction)

• Easy to direct the search for data of interest (navigation) 1 2 3 4 5
• I was able to use parameters 1 2 3 4 5
• I was able to correct my mistakes 1 2 3 4 5
• I was able to change the selection of parameters 1 2 3 4 5
• I was able to explore data 1 2 3 4 5

Quality of information

• Reliable 1 2 3 4 5
• Interesting 1 2 3 4 5
• Clear and understandable 1 2 3 4 5
• Easy to interpret results 1 2 3 4 5

Part III : Please provide comments.

Please provide your comments about the system.