blending aggregation and selection: adapting parallel coordinates

R E F E R E E D P A P E R

Blending Aggregation and Selection Adapting ParallelCoordinates for the Visualization of Large Datasets

Gennady Andrienko and Natalia Andrienko

Fraunhofer Institute AIS Schloss Birlinghoven 53754 Sankt Augustin Germany

Website httpwwwaisfraunhoferdeand

Email gennadyandrienkoaisfraunhoferde

Many of the traditional data visualization techniques which proved to be supportive for exploratory analysis of datasets of

moderate sizes fail to fulfil their function when applied to large datasets There are two approaches to coping with large

amounts of data data selection when only a portion of data is displayed and data aggregation ie grouping data items

and considering the groups instead of the original data None of these approaches alone suits the needs of exploratory data

analysis which requires consideration of data on all levels overall (considering a dataset as a whole) intermediate

(viewing and comparing collective characteristics of arbitrary data subsets or classes) and elementary (accessing

individual data items) Therefore it is necessary to combine these approaches ie build a tool showing the whole set and

arbitrarily defined subsets (object classes) in an aggregated way and superimposing this with a representation of

arbitrarily selected individual data items

We have achieved such a combination of approaches by modifying the technique of parallel coordinate plot These

modifications are described and analysed in the paper

1 INTRODUCTION

Large datasets pose a serious challenge to researchers invisualization Techniques that proved to be effective insupporting the exploration of moderate amounts of datarapidly decline in their efficacy and eventually completelyfail with increasing the number of data items to be analysedAttempts to visualize large datasets in the same ways assmall ones typically encounter at least one of the followingobstacles

1 Overplotting visual elements representing differentdata items fit into the same position in the displayand hence some of them are occluded This problemoccurs to such visualization techniques as scatterplotscatterplot matrix and parallel coordinate plot To fightoverplotting transparent output is used in someimplementations (see for example [T02]) While thisapproach may be useful for detecting clear-cut group-ings of objects with close characteristics it does notwork so well when the distribution of characteristics ismore dispersed Besides the user cannot properlyestimate the number of data items fitting in this orthat position within the plot area

2 Decline in legibility in order to represent all data items inone display the corresponding visual elements are madeso small that they become hardly visible and distinguish-able This problem often occurs on map displays

3 Impossibility to represent the full dataset visual elementsrepresenting all data items cannot be fitted into onedisplay due to size limitations and hence only a part ofthe data can be seen This problem arises for example inthe display techniques representing data in a tabularform including table lens and permutation matrix

Two major approaches to solving these problems exist dataselection (zooming focusing filtering) and data aggrega-tion Data selection means that a display does not representa dataset as a whole but only a portion of it which isselected in a certain way The display is supplied withinteractive controls for changing the current selectionwhich results in showing another portion of the data Thusselection on a map is done through zooming and viewportshifting while rows and columns in a table display areselected for viewing using scrollbars A data portion mayalso be selected by means of querying only data items withcertain properties which are specified in a query are shownThis technique in particular helps to reduce overplotting

Data aggregation reduces the amount of data underanalysis by grouping individual items into subsets oftencalled lsquoaggregatesrsquo and computing some collective char-acteristics of the aggregates The aggregates and theircharacteristics (jointly called lsquoaggregated datarsquo) are thenexplored instead of the original data

We argue that none of these approaches alone fullysatisfies the needs of exploratory data analysis and that it is

The Cartographic Journal Vol 42 No 1 pp 49ndash60 June 2005 The British Cartographic Society 2005

DOI 101179000870405X57284

necessary to combine them In order to substantiate ouropinion we need to refer to the theory of lsquoreading levelsrsquoformulated by Jacques Bertin [B83] Let us briefly recitethis theory

The function of graphical representation of data is toprovide answers to various questions concerning the dataThe questions can be distinguished according to theirlsquolevels of readingrsquo The level of reading indicates whether aquestion concerns a single data element a group ofelements taken as a whole or all elements of a datasetAccordingly there are three levels of reading elementaryintermediate and overall

Bertin claims that a visualization is effective if it permitsimmediate extraction of the necessary information iefinding the answer to the observerrsquos question at a singleglance with no need to move the eyes or attention and toinvolve the memory Bertin uses the term lsquoimagersquo to refer tolsquothe meaningful visual form perceptible in the minimuminstant of visionrsquo An optimal visualization contains a singleimage providing the answer to the observerrsquos questionHowever exploratory data analysis (in Bertinrsquos termslsquoinformation processingrsquo) typically does not deal with asingle question for which a data representation could beoptimized lsquoWith this function (ie supporting informationprocessing) the graphic is an experimental instrumentleading to the construction of collections of comparableimages with which the researcher lsquoplaysrsquo We class and orderthese images in different ways grouping similar onesconstructing ordered images to discover the syntheticschema which is at once the simplest and most meaningfulrsquoThat is the ultimate goal of information processing is todiscover a simple and meaningful synthetic schema standingbehind the data ie to understand the phenomenoncharacterized by the data in its whole This evidentlyrequires the overall level of reading However on the wayto this understanding an explorer needs to classify andorder detect similarities and differences with the inter-mediate and elementary reading levels being necessarilyinvolved in these activities

Hence in order to be suitable for exploratory dataanalysis a visualization needs to support all reading levelsFor this purpose it is required that the visualization on theone hand is comprehensive on the other hand containsthe smallest possible number of memorizable imagesComprehensiveness means avoiding any prior reductionof the information (eg by classification) using thelsquocomplete information which alone provides all the givensfor pertinent correlations and choices [hellip] But also itmatters that all types of comparisons and classings arepossible and easy The most useful questions will obviouslyinvolve the overall level of reading where their answer willbe found in a limited number of comparable imagesrsquo [B83p 164]

Let us evaluate from this perspective the two approachesto the visualization of large data volumes that is selectionand aggregation

Selection significantly reduces the information that canbe perceived at once since only a portion of data is visible atany moment However interactive controls allow the userto alter the current selection so that any individual dataitem can be eventually accessed Therefore it is possible to

find answers to questions requiring the elementary level ofreading ie questions about individual items The user canalso explore groups of simultaneously visible data itemswhich means that the intermediate reading level is alsopartly supported We say lsquopartlyrsquo because it is difficult tocompare a currently visible group to another group or tothe whole dataset The overall reading level is notsupported because there is no way to see all data itemssimultaneously

Aggregation groups the original multitude of data itemsinto a significantly smaller number of aggregates which canbe visualized all together in a single view thus enabling theoverall reading level Characteristics of the aggregates can beeasily explored and compared hence the intermediatereading level is also supported However the grouping ofdata items into aggregates is done prior to the visualizationand each aggregate is typically represented by one graphicalelement such as a bar in a histogram or a sector of a piechartAs a result the user cannot consider arbitrary subsets of dataitems that do not coincide with the aggregates displayedSince dealing with rigid once and forever defined classes orgroups is incompatible with the philosophy of exploratorydata analysis contemporary data aggregation tools arecharacterized by high user interactivity which allows ananalyst to choose and dynamically change the level ofaggregation (ie how large the aggregates are) the methodof aggregation (ie how individual items are grouped intoaggregates) and the functions for deriving characteristics ofthe aggregates from those of their members (ie sumsranges or various statistics) A good example of such a tool isTreemap developed by the research team of BenShneiderman [Sh92]

While substitution of original data by aggregatedfacilitates the processes of simplification and abstractionand hence gaining an overall understanding of a phe-nomenon this is achieved at the cost of substantialinformation loss specifically discarding individual dataitems In Bertinrsquos terms an aggregated data display is notcomprehensive and the elementary reading level iscompletely disabled

Hence neither data selection nor data aggregationprovides a fully satisfactory solution to the problem ofvisualizing large datasets A suitable compromise could beachieved by means of combining the approaches Thus adisplay could represent summary information concerningthe entire dataset and its subsets (aggregation) and at thesame time individual characteristics of selected data items(selection) This allows the user in particular to comparethese individual characteristics with those of the whole setand the subsets The display must be supplied withinteractive facilities allowing the user both to re-aggregatethe data in different ways and to change the currentselection of individual data items

For a practical realization of this idea two ways arepossible

1 Take some display technique representing individualdata items as a basis and modify it so that it could alsoshow aggregated information

2 Take some aggregated representation technique andextend it with a possibility to display individualcharacteristics

50 The Cartographic Journal

In this paper we investigate the possibilities for extend-ing some popular data visualization techniques (scatterplotand matrix of scatterplots table lens frequency histogrambox plot and parallel coordinate plot) to combinedrepresentation of aggregated and individual informationThen we describe in more detail our own extensions of theparallel coordinate plot technique

We do not introduce any restrictions or assumptionsconcerning how aggregates are defined except that eachdata item must belong to at most one aggregate Thismeans that there should be a general mechanism forreflecting arbitrary aggregations or classifications anddynamic reaction to changes of the aggregates (classes)for example like the mechanism for class propagation in thesystem CommonGIS [AA03]

As an example Figure 1 shows two different classifica-tions of the counties of USA On the upper map thecounties are divided into three classes according to thepopulation density in 1999 The variation of colour is usedto distinguish the classes on the map light corresponds tolow density and dark to high density The breaks betweenthe intervals of low medium and high population densityhave been selected so that the resulting classes consist ofapproximately equal numbers of counties

The lower map represents the results of grouping thecounties into three clusters on the basis of six attributesrepresenting proportions of different age groups in thepopulation below five years old 5ndash17 18ndash29 30ndash49 50ndash64 65 or more years The clustering was done using themethod SimpleKMeans as it is implemented in the datamining system Weka [WF99]

The set of counties with their demographic character-istics and these two example classifications will be usedthroughout the paper for testing various display techniquesThe set is rather large it consists of 3140 objects While thisis sufficient for our study we take into account that muchlarger sets exist and hence any technique should beevaluated in terms of the possibility to scale it to larger sets

The model analysis task is to compare the classes ofcountries with respect to the age structure of the populationFor the first classification this would allow one to seewhether the age structure is somehow related to populationdensity For the second classification this could help ananalyst to understand the results of clustering ie what thegroups defined by the data mining algorithm mean

Let us now evaluate the most popular techniques for datavisualization in terms of their suitability for large datasetsand the possibility to modify them so that questions of allthree reading levels could be answered Our particular focusis the possibility of comparison of arbitrary classes of objects(data items) an activity involving the intermediate readinglevel although we also pay attention to the other two levelsWe assume that classes are defined independently of thedisplays used to represent them and only informationabout the classification results (specifically what objectseach class consists of and what colour is assigned to it) istransferred to each display

2 CLASS COMPARISON WITH DIFFERENT DATA

DISPLAYS

We include the following types of data displays in ourevaluation

N Scatterplot (suitable for two attributes) and scatterplotmatrix (suitable for more than two attributes)

N Table lens [RC94]

N Parallel coordinate plot ([I85] [I98])

N Frequency histogram (suitable for a single attribute) anda group of coordinated histograms representing differentattributes ([TS98])

N Box-and-whiskers plot or shorter box plot [T77]

A common feature for the first three display types is thatthey represent individual characteristics of objects whereasthe remaining two techniques provide only general(aggregate) information about the distribution of attributevalues throughout the set of objects

There may be two basic approaches to representingarbitrary data subsets (classes) on different types of graphicaldisplays either represent all classes within a common displayarea or clone a display as many times as there are classes andshow each class on a separate display copy Both approacheshave their pluses and minuses Display multiplicationeliminates overlapping of information pertinent to differentclasses and is therefore more beneficial for an individualconsideration of each class One can also easily comparegeneral patterns of distribution of characteristics in theclasses However comparison of attribute values is morecomplicated than in the case when all classes are representedon one and the same display Another disadvantage of displaymultiplication is that it uses much more screen space Thisproblem becomes especially serious whenmany attributes areinvolved in analysis Such display techniques as scatterplotmatrix coordinated histograms and box plots alreadyinclude multiple plots or charts and further multiplicationmay cause significant difficulties for analysis Thus if thereare N attributes and M classes the user will have to analyseand compare N6M displays

Figure 1 Two example classifications of the counties of the USA

Blending Aggregation and Selection 51

From now on we shall mostly focus on the approachwith showing classes on the same display Typically displayelements representing different classes are differentlycoloured which allows one to distinguish between theclasses This approach is applicable to all the techniquesunder consideration except for the box plot

Figures 2 to 6 demonstrate how object classes can bereflected on the different types of displays by the example ofthe classifications of counties of the USA shown inFigure 1

As we have mentioned a box plot display can representclasses only by means of display multiplication (Figure 2)The visualization supports analysis on the overall (entire setof objects) and intermediate (object classes) levels butcannot represent individual characteristics of selected

objects Other disadvantages are too coarse aggregationand the necessity of using multiple plots in order torepresent several attributes

Figure 3 demonstrates representation of classes on ahistogram One can see the overall pattern of valuedistribution in the entire set of objects and compare it tothe patterns for the classes There are several bottlenecks inthis visualization

N The shape of a histogram depends on the chosengranularity

N The technique supports well the comparison of one class(aligned to the baseline at the bottom) to the whole setMaking comparisons between classes is difficult

N When working with large sets of objects it is necessaryto zoom the histogram for seeing details (for exampleon the right and left ends of ranges of normally-distributed attributes) However zooming destroys theoverall view

N Access to individual data instances is practicallyimpossible

In Figure 4 one can see how classes can be represented ona scatterplot The problem with this display is overplottingmany points (probably of different colour) may overlapTherefore it is impossible to guarantee that the patternperceived from such a display (if any) is correct

Representation of classes on a table lens display can bedone by means of colouring table rows according to the

Figure 2 Representation of class characteristics by box plots Thegroup of box plots on the left corresponds to the classification ofthe counties of the USA according to the population density Onthe right the results of clustering are reflected All box plots showthe distribution of values of the attribute lsquoProportion of age group50-64 yearsrsquo The topmost box plots correspond to the entire setof counties and the plots below represent the classes In additionmean values and standard deviations for the respective object setsare shown below each box plot

Figure 4 Scatter-plots of the attributes lsquoProportion of age group less than 5 years oldrsquo and lsquoProportion of age group 65 years and morersquoThe points are coloured according to the classes by population density (left) and to the clusters resulting from the cluster analysis (right)

Figure 3 Histogram representation of the attribute lsquoProportion of age group 50-64 yearsrsquo built with the granularity of 50 bins The bars aredivided into coloured segments according to the number of objects belonging to each of the classes On the left the classification accordingto population density is represented on the right mdash the results of clustering


classes the respective objects belong to Rows correspond-ing to the same class can be grouped together (Figure 5)Within a group the rows can be sorted according to valuesof some attribute However as we have already mentionedthe technique of table lens is inappropriate for large objectsets

In a parallel coordinate plot (Figure 6) objects arerepresented by polygonal lines connecting points (lsquocoordi-natesrsquo) on parallel axes which represent attributes Thelines can be painted according to class colours In simplecases this visualization may be helpful for forming a generalview of distribution of characteristics within an objectsubset (class) However like with the scatterplot a greatproblem of this display is overplotting which in most casesprevents perceiving properties of classes This is illustratedin Figure 6 Even if some pixels are painted in a colour of aclass this does not guarantee that there are no linescorresponding to objects from other classes passingthrough the same point but not visible due to the highdensity of lines

As it follows from our analysis none of the visualizationtechniques properly supports the exploratory analysis oflarge datasets on all reading levels overall (the wholedataset) intermediate (arbitrary subsets or classes) andelementary (individual data items) Hence there is a need indesigning a new technique for example by amending oneof the existing techniques As we have discussed apromising approach is combining data aggregation withdata selection ie aggregated representation of the entireset and subsets (classes) with a superimposed display ofindividual characteristics of user-selected objects We haveindicated two practical approaches to achieve this either tomodify some display technique representing individual data

items so that it could also show aggregated information orto extend some aggregated representation technique with apossibility to display individual characteristics The formerapproach seems more promising than the latter one Atleast some modifications of scatterplot and parallelcoordinate plot techniques towards representing aggre-gated data already exist (while we could not find a properway to extend the table lens technique) At the same time itseems quite difficult to find a way to extend histogram andbox plot displays for including individual data

The modifications of scatterplot are known as binnedscatterplot or two-variable histogram (see for example[C91] or [W99]) They exploit the idea of binning mdashdividing the plot area into small cells (bins) and counting thenumber of data instances fitting into each bin The counts arethen represented by painting the cells into different coloursor shades of grey There is a possibility to represent selectedindividual instances by symbols (eg small hollow circles)superimposed on such a representation although we did notfind in the literature any mentioning of this being actuallydone A disadvantage of binning is that it excludes thepossibility to represent several classes within a single displayit is difficult to show in a legible way howmany items of eachclass fit into each of the bins Hence only display multi-plication may be used for this purpose However in a case ofmore than two attributes multiple scatterplots are neededand their multiplication will tremendously complicate theoverall visualization

The parallel coordinate plot (PCP) technique seems to bea more suitable candidate for an extension towards therepresentation of aggregated information Various attemptsto include aggregated information in a parallel coordinateplot have been made by a number of researchers In nextsection we briefly consider their suggestions

Figure 5 Table lens representation with grouping of rows byclasses (results of cluster analysis) Only a part of data is visible dueto the limitations of the screen size

Figure 6 A parallel coordinate plot with the lines colouredaccording to the classes defined on the basis of population density


3 ENHANCING PARALLEL COORDINATES

A Inselberg introduced the parallel coordinate plot in the1980s as a geometrical abstraction for presenting selectedprojections of a multidimensional attribute space (seehttpwwwcstauacilaiisrealmdash lsquoHome of ParallelCoordinatesrsquo mdash for the history of PCP) At about thesame time D Schilling introduced a similar to PCP lsquovaluepathrsquo technique for multi-criteria optimization problems[SRC83] In early papers various mathematical andalgorithmic properties of PCP were studied ([I85][ICR87]) Later A Inselberg and E Wegman in parallelconsidered PCP as a tool for exploratory data analysis([I90] [I98] [W90] [MW91])

It is necessary to stress that the presence of N axes for Nattributes does not mean that the plot fully reflects the N-dimensional space without loosing important informationActually the plot displays values of N attributes in (N-1)pairwise projections

The technique of PCP attracted attention of manyresearchers who suggested various extensions and mod-ifications A complete overview of the related work isbeyond the scope of this paper We shall only consider thesuggestions that introduce display of aggregated data intoPCP and thereby make it more appropriate for largedatasets This includes drawing coloured bands along theaxes to represent the attribute ranges for the whole data setand for classes ([S00]) putting box plots on the axes([T02]) and even drawing histograms along axes ([OL96][PT03]) Such enhancements can be called lsquoparallel boxplotsrsquo and lsquoparallel histogramsrsquo [PT03] These additions arevery useful for the understanding of properties of subsetsand their comparison However such additional graphicsinside a PCP overlap with lines which complicates theperception Besides each graph represents only a singleattribute ignoring its relationships to other attributes

Another way of introducing summary information isshowing average or median lines for the whole set and forclasses ([S00]) This is done by connecting positions onneighbouring axes corresponding to the mean or medianvalues of the respective attributes However since the meanor median values are counted for each attribute indepen-dently one needs to be cautious and avoid interpreting theresulting lines as the most typical profiles (ie combinationsof characteristics) for the sets of objects

In [MW91] it is proposed to draw a lsquoline density plotrsquoinstead of individual lines For this purpose the areabetween axes is divided into zones (pixels) and the numberof lines passing through each zone is counted The soobtained counts are normalized and shown by painting thezones into different colours or colour grades Thistechnique which is analogous to lsquobinningrsquo on a scatterplotsupports quite well the overall analysis level and is suitablefor large and very large datasets Although this was notsuggested in [MW91] it can be easily imagined how tosuperimpose drawing of selected individual lines upon thedensity-based background painting of the plot areaHowever this technique does not allow us to representobject classes on the same display Hence the intermediateanalysis level can only be supported by means of displaymultiplication

Drawing bands or envelopes around selected subsets oflines provides yet another variety of aggregation ExtendingPCP by showing envelopes of subsets of lines was proposedalready as early as in [I85] and [ICR87] Later AInselbergproposed an iterative procedure of visual data mining [I98]that consists of sequential narrowing of envelopes by meansof selecting subintervals of attribute values In thisprocedure subsets are defined by means of interactionwith the PCP display It is not relevant to investigation ofarbitrary classes

A combination of hierarchical clustering with PCP isproposed in [FWR99] The authors represent clusters ofobjects by variable-width opacity bands rather thanindividual lines Only cluster centres are shown on top ofthe bands by lines Several clusters can be shown andanalysed simultaneously The outlines of the bandscorrespond to Inselbergrsquos envelopes and the opacitygradually decreased from cluster centres to the boundarieswithout taking into account line densities

A similar approach was proposed in [HC00] wherePCP was enhanced by rule induction methods Discoveredrules are visualized using semi-transparent overlappingbands

This method was further developed in [BH03] where afuzzy membership function for object subsets resultingfrom hierarchical clustering was represented by varyingopacity The authors introduced special drawing hints to beused instead of alpha-blending mdash a rather slow techniqueof graphical output typically used for transparent drawingand painting on the computer screen

Building envelopes around lines representing selectedsubsets of objects is a convenient tool for comparingcharacteristics of a subset (specifically ranges of attributevalues) to those of the whole data set However theapplicability of this method for comparison of subsets islimited because overlapping of their envelopes complicatesthe analysis (see Figure 7)

As a conclusion from this overview we find the followingPCP modifications to be supportive for exploration ofproperties of object classes

1 Replacing individual lines on a PCP by class envelopes2 Putting additional graphics on axes to represent the

distribution of attribute values in the whole data set andin subsets

These ideas were used in our own implementation ofPCP-based technique for class investigation described inthe next section

4 OUR APPROACH

In our implementation it is possible to replace drawing ofindividual lines by either class envelopes or special graphicson the axes representing summary information about thedistribution of attribute values in the whole object set andin different classes Lines for user-selected objects can beoverlaid on such an aggregated representation

Let us now consider both aggregation modes in moredetail starting with class envelopes Actually envelopes intheir lsquopurersquo form are not especially informative they only


show ranges of attribute values regardless of the distribu-tion of the values within the ranges Hence just a singleoutlier can significantly increase the width of a bandrepresenting some class Therefore in most real-worldcases envelopes of different classes greatly overlap and donot help to reveal substantial differences in class character-istics It would be good to build lsquosmart envelopesrsquo ignoringoutliers Alternatively an envelope could be drawn so as toprovide more information concerning value distribution ofeach attribute

We have chosen the second option we divide each valuerange into subintervals containing approximately equalnumber of lines and connect corresponding breaks (quan-tiles) on adjacent axes As a result an envelope is dividedinto stripes For better visibility stripes can be paintedslightly differently as is shown in Figure 8 The user canchoose the desired number of subintervals Thus inFigure 8 division into 10 subintervals is represented

It may be seen from Figure 8 that a divided envelopegives much better understanding of characteristics of a classthat just its outline The main impression from the outlineis that characteristics within the class greatly vary Howeverthe partition of the envelope shows us that 80 of values ofeach attribute lie within quite a narrow interval It can beseen for example that the class is characterized by mostlylow values of the third attribute (this is true at least for 90of class members) and mostly medium values of the secondattribute (again at least 90 of class members have mediumvalues of this attribute) Unfortunately it is hard to sayanything concerning how the first 90 are related to thesecond 90 In general these subsets are different anddiscarding the leftmost and the rightmost stripes will notgive us a band containing 80 of all lines This is aconsequence of the independent partition of the range ofeach attribute which must be borne in mind to avoidmisinterpretation of the display

Despite of this problem the stripe display can still beuseful for investigation of class properties and comparisonof classes In our implementation the user may switch onand off the representation of any class This helps the userto focus on a particular class and to make pairwisecomparisons between classes which is easier than analysingthree or more classes simultaneously Class envelopes canbe painted in a transparent or opaque mode Besides theuser can choose which stripes must be painted (filled) andwhich only shown by bounding lines Thus the display inFigure 9 represents two classes out of three class 2 (lightercolour) and class 3 (darker colour) which is also shown inFigure 8 The class envelopes are divided into 10 stripesThe user has chosen the second and the ninth stripes to bepainted in the transparent mode For the remaining stripesonly outlines are shown

The display in Figure 9 shows us that class 2 and class 3differ most significantly by values of the second attributeproportion of the age group from five to 17 years Quitesubstantial distinction exists also with respect to the propor-tion of children under five years old There is no difference inproportions of people of the age 65 years or more

In order to compare the distribution of characteristicswithin a class to that in the entire set of objects it is possibleto display analogous stripes for the whole set An alternativeway is statistics-based scaling of the axes which is describedin our earlier paper [AA01] The technique is illustratedin Figure 10 The same information as in Figure 9 isrepresented but the axes are distorted and aligned so thatthe centre of each axis corresponds to the median value ofthe respective attribute and the middle positions betweenthe centre and the ends correspond to the quartiles

From Figure 10 it can be noticed that class 3 (shown inthe darker colour) substantially deviates from the whole set

Figure 8 The distribution of object characteristics within a class isshown by dividing the value range of each attribute into 10 equal-frequency subintervals Hence each subinterval corresponds to 10of objects of the class

Figure 7 Transparent colour bands or envelopes represent theranges of object characteristics for the classes of counties accordingto the population density (left) and the clusters according to theage structure (right)


with respect to the proportions of the age groups under fiveyears and from five to 17 years Class 2 has more or lesslsquostandardrsquo proportions of these age groups as well as thegroup 50ndash64 years the bulk of values lie between the first

and the last quartiles of the entire dataset Concerning theage groups 18ndash29 years and 30ndash49 years the proportions inclass 3 are about lsquostandardrsquo while class 2 tends to havehigher proportions than in class 3 and in the entire set ingeneral Both class 2 and class 3 have smaller proportions ofthe age group 65 and more years in comparison to thedistribution for the whole set of counties Class 3 is alsocharacterized by smaller proportions of the age group 50ndash64 years

The modification of the lsquoenvelopingrsquo technique bypartitioning envelopes into stripes can be viewed at thesame time as a generalization of putting box plots on plotaxes Thus when the user chooses to divide the envelopesinto four stripes the division represents the medians andquartiles of the attributes analogously to box plots Finerpartitions provide more detailed information about valuedistributions

Drawing graphics on the axes is an alternative methodprovided in our implementation for displaying informationconcerning value distributions Instead of box plots whichshow only medians and quartiles we use graphs that may becalled lsquoellipse plotsrsquo Ellipse plots can represent arbitrarypartitions analogously to lsquostripedrsquo envelopes

Figure 11 illustrates how ellipse plots are builtAnalogously to partitioning envelopes the value range ofeach attribute (pertinent to a class or to the entire dataset)is divided into a desired number of equal-frequencysubintervals Then instead of connecting correspondingquantiles on adjacent axes as we do when stripingenvelopes we draw ellipses around the subintervals iethe horizontal diameters of the ellipses are proportional tothe lengths of the subintervals With this technique we canalso use the vertical diameters of the ellipses to convey

Figure 10 Comparison of value distributions in two classes tothat in the entire dataset by means of statistics-based scaling of theaxes

Figure 11 The same class as in Figure 8 (class 3) is representedby ellipse plots For building the ellipses the value ranges of theattributes are partitioned into 10 equal-frequency intervals In thebackground ellipse plots for the entire set of counties are shown

Figure 9 Comparison of distributions of attribute values in twoclasses Class envelopes are partitioned into 10 stripes of whichonly two (2nd and 9th) are painted according to userrsquos choice


additional information We make them proportional to thesizes of the subsets represented by the ellipses

In Figure 11 the smaller ellipses show the distribution ofattribute values in the same class as is shown in Figure 8 bya lsquostripedrsquo envelope (specifically class 3 resulting from theclustering) Like in Figure 8 division into 10 subintervals isapplied ie each ellipse represents 10 of the classmembers Hence all information available in Figure 8 canalso be seen in Figure 11 Additionally the bigger ellipsesin the background represent the distribution of attributevalues in the entire dataset so that class 3 can beconveniently compared with the whole set of countiesAnalogously to the smaller ellipses each big ellipse standsfor 10 of all counties Thus we can see on the upper axisthat 10 of the whole set of counties that have the highestproportions of the age group below five years contain 40members of class 3 and top 20 of the whole set containalmost 70 of the class members A similar observation canbe drawn from the second axis corresponding to the agegroup from five to 17 years top 20 of the whole setcontain about 80 of the class

Besides comparing the distributions we can also estimatethe size of class 3 in relation to the size of the whole set theproportions between the heights of the ellipses suggest thatthe class contains approximately one-fifth of all counties

Ellipse plots can also be used for comparing two or moreclasses Thus in Figure 12 class 2 class 3 and the entire setare represented simultaneously For easier comprehensionthe attribute ranges are divided this time into fivesubintervals instead of 10 As can be seen from the figurethe user can choose which of the ellipses will be filled andwhich remain hollow In this example filling is used for thecentral ellipse in each ellipse plot (ie middle 20 of objects

of the respective sets) The filling can be opaque as in thefigure or transparent

With Figure 12 we can make similar comparison of class2 and class 3 as with Figure 9 Additionally we can relateour observations concerning these two classes to thedistributions of attribute values in the entire dataset Wecan also estimate the relative sizes of the classes class 2 isnearly two times bigger than class 3 and constitutes about40 of the entire set of counties

Like with envelopes we can also apply statistics-basedscaling of the axes After applying this transformation thedisplay from Figure 12 looks as is shown in Figure 13 Nowthe area around the median line can be viewed as a sort oflsquostandardrsquo range of characteristics and we can investigate thedeviations of the classes from this standard range Theinformation perceived from this picture is similar to what isconveyed by the lsquostripedrsquo envelopes in Figure 10

In comparison to a lsquostripedrsquo envelope a collection ofellipse plots representing the same information does notproduce an integral image This is a weakness and anadvantage at the same time a weakness because a singleimage is easier perceived (and hence easier compared withan analogous image of another set of objects) and anadvantage since the misleading interpretation of stripes asline containers (ie that each line is fully contained in asingle stripe) is precluded

As compared to traditional box-and-whiskers plotsellipse plots are more general they can represent not onlymedians and quartiles but any number of quantiles (in ourimplementation from 2 to 10) We use ellipses rather thanboxes because two adjacent ellipses touch just in a singlepoint and hence can be easier distinguished visually

Display of aggregated information can be combined withdrawing lines for selected individual objects Objectselection can be done through any display In particular

Figure 12 Value distributions in two classes (class 2 and class 3)and the entire dataset are represented by ellipse plots on the basisof partitioning into 5 equal-frequency intervals Each ellipse standsfor 20 of objects of the respective set The vertical diameters ofthe ellipses are proportional to the sizes of the sets

Figure 13 Statistics-based scaling of axes has been applied to thedisplay in Figure 12


one can select objects by clicking or dragging on the parallelcoordinate plot The typical reaction is selection of theobjects the lines of which come near the mouse position (inthe case of clicking) or cross the rectangular area specifiedby mouse dragging When some object classes arepropagated to a PCP display the lines of selected objectsare coloured according to the classes the objects belong to(see Figure 14) otherwise the lines are shown in black

A special type of interaction is possible for a PCP displaywith ellipse plots Clicking on an ellipse selects the objectsfrom the corresponding subset Thus Figure 14 shows theresult of clicking on the rightmost ellipse on the top axisThis ellipse represents the top 10 of the whole set ofcounties with respect to the proportions of children underfive years old The click on the ellipse has selected thisgroup of objects It can be seen that the group consistsmostly from members of class 3 (dark lines) with a smallerfraction of objects from class 2 (light lines) and only a fewmembers of class 1 Unfortunately the lines belonging toclass 1 (red) cannot be distinguished from the lines of class3 (blue) in a greyscale reproduction

Clicking on another ellipse modifies the selection ThusFigure 15 shows how the plot from Figure 14 changes afterclicking on the rightmost ellipse on the axis second fromtop which corresponds to the proportions of the age groupfrom five to 17 years old The result is selection of thecounties the lines of which belong to both ellipses clickedThese are the counties with proportions of the age groupsless than five years old and five to 17 years old lying amongthe top 10 over the country This subset of countiesincludes only one member of class 2 the remainingcounties belong to class 3

The general discipline for the interaction with ellipseplots is following clicking on two or more ellipses on thesame axis results in adding new selections to the previouslymade ones (logical lsquoORrsquo operation) clicking on an ellipseon another axis results in intersecting the previous selectionwith the new one (logical lsquoANDrsquo)

5 DISCUSSION AND CONCLUSION

Our work on PCP modification has been incited by theobservation that this technique as well as other traditionalmethods for data visualization and display coordinationcannot properly support analysis of large datasets Ourspecial interest is analysis of object classes in particularresults of applying clustering algorithms of data mining (ingeneral outputs of data mining procedures are often quitedifficult to interpret therefore a proper visualizationsupport is required) A traditional approach for represent-ing object classes on data displays is so-called multi-coloured brushing ie painting display elements (dots ona scatterplot lines on a parallel coordinate plot etc) in thecolours of corresponding classes (see for example[HT98]) However due to overplotting brushing oftenfails to convey correct information concerning the classesHence for large datasets brushing should be substitutedby other methods of representing class-relevant informa-tion We believe that the most appropriate approach is toprovide such information in an aggregated form

Our modifications of the basic parallel coordinate plottechnique increase its appropriateness for large datasets atthe cost of replacing representation of individual instancesby the display of aggregated information concerning a

Figure 14 Clicking on an ellipse results in the correspondingobjects being selected and their lines appearing on the plot Hereclicking on the rightmost ellipse on the top axis has selected thetop 10 of counties with respect to the proportions of childrenunder 5 years old

Figure 15 Clicking on the rightmost ellipse on the second axishas modified the selection from Figure 14 Now selected are thelines of the counties with proportions of the age groups less than 5years old and from 5 to 17 years old being among the top 10over the country


dataset as a whole and its subsets (object classes) Thisallows one to apply parallel coordinate plots for dataanalysis on the overall and intermediate levels Theelementary level of analysis is supported by the possibilityto show individual characteristics of selected objects on topof the aggregate representation

We have suggested two alternative methods for represent-ing aggregated information lsquostripedrsquo envelopes and ellipseplots Both are based on partitioning the value ranges of theattribute into equal frequency intervals The enveloperepresentation conveys information concerning value distri-bution in an object set through a single image whereas ellipseplots do not promote such an integral perception Whileellipse plots are perceptually more complex than lsquostripedrsquoenvelopes the latter may induce the misleading interpretationof lines being fully enclosed in the stripes without intersectingtheir boundaries Ellipse plots can be better combined withdrawing individual lines than envelopes

A limitation of the suggested approach is that it issuitable only for a relatively small number of classes It isvery difficult to compare class characteristics when manyclasses are simultaneously shown in the display Althoughthe user may select the classes to view and perform pairwisecomparisons this procedure may become impractical withdozens or hundreds of classes

The aggregate representations can gain from a properscaling of PCP Thus statistics-based scaling decreases theinfluence of outliers and therefore can make the visualiza-tion more effective Additionally it can convey overall levelinformation concerning the distribution of characteristics inthe entire dataset At this cost the envelope or ellipse plotscorresponding to the whole dataset may be omitted fromthe display thus making it simpler and more legible

An important feature of the proposed approach is itsscalability Overplotting is reduced because mostly aggre-gated characteristics are shown and optionally just a subsetof lines The computational complexity is MNlog(N)where N is a number of instances and M is a number oftheir attributes Therefore the applicability is limited onlyby the amount of RAM on userrsquos computer

This restriction can be removed by implementing thevisualization in a client-server mode and using a powerfuldatabase system on the server side Thus the database cancompute and provide aggregated characteristics of data(particularly value ranges quantiles and other necessarystatistics) This is sufficient for the visualization andinteractive manipulation Instance data can be transferredonly on demand for a selected area of interest

However a weakness of both lsquostripedrsquo envelopes andellipse plots is that they convey summary information foreach attribute independently of others One can comparevalues distributions of different attributes in classes andentire dataset but cannot investigate relationships betweenthe attributes and cannot explore the distribution of valuecombinations Thus neither envelopes nor ellipse plots givean idea concerning the lsquotypical profilesrsquo of class members

This weakness can be partly compensated by thepossibility to represent individual characteristics for selectedobject subsets Appropriate selections can be made bymeans of interacting with the parallel coordinate plot Forexample the user may set a kind of lsquotrajectory maskrsquo by

clicking on ellipses on different axes and see how manyobjects have their characteristics fitting in this mask wherethese objects are on a map or other displays and whatclasses they belong to In principle this is equivalent toapplying a dynamic query tool [AWS92] but in the case ofellipse plots information about value distributions can beconveniently used However in this interactive way theuser cannot explore all possible masks (characteristicprofiles) but only a few Hence additional tools are neededto properly support profile analysis

We plan to implement a computational tool that wouldcount all existing combinations of characteristics for a user-specified partitioning of value ranges of attributes This maybe a partitioning by quantiles as in ellipse plots and lsquostripedrsquoenvelopes or a partitioning into a desired number of equalintervals While the computation itself is rather simple thecombination analysis tool requires a convenient user interfacefor selecting and partitioning attributes and an effectivevisualization of the results obtained Another idea is tocombine user-defined partitioning with an automated dis-covery of association rules by means of data mining followedby interactive visualization and analysis of the results

Similar ideas can be applied and further developed for theanalysis of time-series data In this case we can assume thatvalues for consecutive time moments are auto-correlatedThis assumption may allow us to develop methods for moresophisticated visual analysis

ACKNOWLEDGEMENT

This study was partly supported by the EuropeanCommission in project GIMMI (IST-2001-34245)

REFERENCES

Andrienko G and Andrienko N (2001) lsquoConstructing ParallelCoordinates Plot for Problem Solvingrsquo in Proceedings SmartGraphics pp 9ndash14 ACM Press New York [AA01]

Andrienko N and Andrienko G (2003) lsquoInformed Spatial Decisionsthrough Coordinated Viewsrsquo Information Visualization 2 270ndash85 [AA03]

Ahlberg C Williamson C and Shneiderman B (1992) lsquoDynamicqueries for information exploration an implementation andevaluationrsquo in Proceedings ACM CHIrsquo92 pp 619ndash26 ACMPress New York [AWS92]

Bertin J (1983) Semiology of Graphics Diagrams NetworksMaps The University of Wisconsin Press Madison [B83]

Berthold M R and Hall L O (2003) lsquoVisualizing Fuzzy Points inParallel Coordinatesrsquo IEEE Transactions on Fuzzy Systems 11369ndash74 [BH03]

Carr D B (1991) lsquoLooking at Large Data Sets Using Binned DataPlotsrsquo in Computing and Graphics in Statistics ed by Buja ATukey P A pp 7ndash39 Springer-Verlag New York [C91]

Fua Y-H Ward M O and Rundensteiner E A (1999)lsquoHierarchical parallel coordinates for exploration of large datasetsrsquoin Proceedings IEEE Visualization pp 43ndash50 San FranciscoCA October 1999 IEEE Computer Society Press Washington[FWR99]

Han J and Cercone N (2000) lsquoRuleViz a model for visualizingknowledge discovery processrsquo in Proceedings KDD 2000pp 244ndash53 ACM Press Boston MA [HC00]

Hoffmann H and Theus M (1998) lsquoSelection sequences in ManetrsquoComputational Statistics 13 77ndash87 [HT98]

Inselberg A (1985) lsquoThe plane with parallel coordinatesrsquo The VisualComputer 1 69ndash91 [I85]


Inselberg A (1990) lsquoParallel Coordinates A Tool for VisualizingMulti-Dimensional Geometryrsquo in Proceedings IEEEVisualization pp 361ndash78 IEEE Computer Society PressWashington [I90]

Inselberg A (1998) lsquoVisual Data Mining with Parallel CoordinatesrsquoComputational Statistics 13 47ndash63 [I98]

Inselberg A Chomut T and Reif M (1987) lsquoConvexity Algorithmsin Parallel Coordinatesrsquo Journal of the ACM 34 765ndash801[ICR87]

Miller J J and Wegman E J (1991) lsquoConstruction of line densitiesfor parallel coordinate plotsrsquo in Computing and Graphics inStatistics ed by Buja A and Tukey P A pp 107ndash23 Springer-Verlag [MW91]

Ong H-L and Lee H-Y (1996) lsquoSoftware report WinViz mdash a visualdata analysis toolrsquo Computers amp Graphics 20 83ndash84 [OL96]

Pratt K B and Tschapek G (2003) lsquoVisualizing concept driftrsquoin Proceedings 9th ACM SIGKDD Conference WashingtonDC August 2003 pp 735ndash40 ACM Press New York NY[PT03]

Rao R and Card S (1994) lsquoThe Table Lens Merging graphical andsymbolic representations in an interactive FocuszContextvisualization for tabular datarsquo in Proceedings of ACM CHIrsquo94Conference on Human Factors in Computing Systems BostonMA April 1994 pp 318ndash22 ACM Press New York [RC94]

Schilling D A Revelle C and Cohon J (1983) lsquoAn approach to thedisplay and analysis of multiobjective problemsrsquo Socio-EconomicPlanning Sciences 17 57ndash63 [SRC83]

Siirtola H (2000) lsquoDirect Manipulation of Parallel Coordinatesrsquo inProceedings Information Visualization 2000 London UKpp 373ndash78 IEEE Computer Society PressWashington [S00]

Shneiderman B (1992) lsquoTree Visualization With Treemaps A 2-DSpace-Filling Approachrsquo ACM Transactions on Graphics 1192ndash99 [Sh92]

Tukey J W (1977) Exploratory Data Analysis Addison-WesleyReading [T77]

Theus M (2002) lsquoInteractive data visualization using MondrianrsquoJournal of Statistical Software 7 [T02]

Tweedie L and Spence R (1998) lsquoThe prosection matrix a tool tosupport the interactive exploration of statistical models and datarsquoComputational Statistics 13 65ndash76 [TS98]

Wegman E J (1990) lsquoHyperdimensional data analysis using parallelcoordinatesrsquo Journal of the American Statistical Association 85664ndash75 [W90]

Wilkinson L (1999) The Grammar of Graphics Springer-VerlagNew York [W99]

Witten I H and Frank E (1999) Data Mining Practical MachineLearning Tools and Techniques with Java ImplementationsMorgan Kaufmann [WF99]


necessary to combine them In order to substantiate ouropinion we need to refer to the theory of lsquoreading levelsrsquoformulated by Jacques Bertin [B83] Let us briefly recitethis theory

The function of graphical representation of data is toprovide answers to various questions concerning the dataThe questions can be distinguished according to theirlsquolevels of readingrsquo The level of reading indicates whether aquestion concerns a single data element a group ofelements taken as a whole or all elements of a datasetAccordingly there are three levels of reading elementaryintermediate and overall

Bertin claims that a visualization is effective if it permitsimmediate extraction of the necessary information iefinding the answer to the observerrsquos question at a singleglance with no need to move the eyes or attention and toinvolve the memory Bertin uses the term lsquoimagersquo to refer tolsquothe meaningful visual form perceptible in the minimuminstant of visionrsquo An optimal visualization contains a singleimage providing the answer to the observerrsquos questionHowever exploratory data analysis (in Bertinrsquos termslsquoinformation processingrsquo) typically does not deal with asingle question for which a data representation could beoptimized lsquoWith this function (ie supporting informationprocessing) the graphic is an experimental instrumentleading to the construction of collections of comparableimages with which the researcher lsquoplaysrsquo We class and orderthese images in different ways grouping similar onesconstructing ordered images to discover the syntheticschema which is at once the simplest and most meaningfulrsquoThat is the ultimate goal of information processing is todiscover a simple and meaningful synthetic schema standingbehind the data ie to understand the phenomenoncharacterized by the data in its whole This evidentlyrequires the overall level of reading However on the wayto this understanding an explorer needs to classify andorder detect similarities and differences with the inter-mediate and elementary reading levels being necessarilyinvolved in these activities

Hence in order to be suitable for exploratory dataanalysis a visualization needs to support all reading levelsFor this purpose it is required that the visualization on theone hand is comprehensive on the other hand containsthe smallest possible number of memorizable imagesComprehensiveness means avoiding any prior reductionof the information (eg by classification) using thelsquocomplete information which alone provides all the givensfor pertinent correlations and choices [hellip] But also itmatters that all types of comparisons and classings arepossible and easy The most useful questions will obviouslyinvolve the overall level of reading where their answer willbe found in a limited number of comparable imagesrsquo [B83p 164]

Let us evaluate from this perspective the two approachesto the visualization of large data volumes that is selectionand aggregation

Selection significantly reduces the information that canbe perceived at once since only a portion of data is visible atany moment However interactive controls allow the userto alter the current selection so that any individual dataitem can be eventually accessed Therefore it is possible to

find answers to questions requiring the elementary level ofreading ie questions about individual items The user canalso explore groups of simultaneously visible data itemswhich means that the intermediate reading level is alsopartly supported We say lsquopartlyrsquo because it is difficult tocompare a currently visible group to another group or tothe whole dataset The overall reading level is notsupported because there is no way to see all data itemssimultaneously

Aggregation groups the original multitude of data itemsinto a significantly smaller number of aggregates which canbe visualized all together in a single view thus enabling theoverall reading level Characteristics of the aggregates can beeasily explored and compared hence the intermediatereading level is also supported However the grouping ofdata items into aggregates is done prior to the visualizationand each aggregate is typically represented by one graphicalelement such as a bar in a histogram or a sector of a piechartAs a result the user cannot consider arbitrary subsets of dataitems that do not coincide with the aggregates displayedSince dealing with rigid once and forever defined classes orgroups is incompatible with the philosophy of exploratorydata analysis contemporary data aggregation tools arecharacterized by high user interactivity which allows ananalyst to choose and dynamically change the level ofaggregation (ie how large the aggregates are) the methodof aggregation (ie how individual items are grouped intoaggregates) and the functions for deriving characteristics ofthe aggregates from those of their members (ie sumsranges or various statistics) A good example of such a tool isTreemap developed by the research team of BenShneiderman [Sh92]

While substitution of original data by aggregatedfacilitates the processes of simplification and abstractionand hence gaining an overall understanding of a phe-nomenon this is achieved at the cost of substantialinformation loss specifically discarding individual dataitems In Bertinrsquos terms an aggregated data display is notcomprehensive and the elementary reading level iscompletely disabled

Hence neither data selection nor data aggregationprovides a fully satisfactory solution to the problem ofvisualizing large datasets A suitable compromise could beachieved by means of combining the approaches Thus adisplay could represent summary information concerningthe entire dataset and its subsets (aggregation) and at thesame time individual characteristics of selected data items(selection) This allows the user in particular to comparethese individual characteristics with those of the whole setand the subsets The display must be supplied withinteractive facilities allowing the user both to re-aggregatethe data in different ways and to change the currentselection of individual data items

For a practical realization of this idea two ways arepossible

1 Take some display technique representing individualdata items as a basis and modify it so that it could alsoshow aggregated information

2 Take some aggregated representation technique andextend it with a possibility to display individualcharacteristics










DISPLAYS



N Table lens [RC94]















































4 OUR APPROACH

























































ACKNOWLEDGEMENT


REFERENCES





































DISPLAYS



N Table lens [RC94]















































4 OUR APPROACH

























































ACKNOWLEDGEMENT


REFERENCES




































































4 OUR APPROACH

























































ACKNOWLEDGEMENT


REFERENCES


















































































ACKNOWLEDGEMENT


REFERENCES





























blending aggregation and selection: adapting parallel coordinates

Documents