Quality based guidance for exploratory dimensionality reduction

Sara Johansson Fernstad1,2, Jane Shaw2 and Jimmy Johansson1

Abstract

High dimensional data sets containing hundreds of variables are difficult to explore, since traditional visualization methods often are unable to represent such data effectively. This is commonly addressed by employing dimensionality reduction prior to visualization. Numerous dimensionality reduction methods are available. However, few reduction approaches take the importance of several structures into account and few provide an overview of structures existing in the full high dimensional data set. For exploratory analysis, as well as for many other tasks, several structures may be of interest. Exploration of the full high dimensional data set without reduction may also be desirable. This paper presents flexible methods for exploratory analysis and interactive dimensionality reduction. Automated methods are employed to analyse the variables, using a range of quality metrics, providing one or more measures of ‘interestingness’ for individual variables. Through ranking, a single value of interestingness is obtained, based on several quality metrics, that is usable as a threshold for the most interesting variables. An interactive environment is presented where the user is provided with many possibilities to explore and gain understanding of the high dimensional data set. Guided by this, the analyst can explore the high dimensional data set and interactively select a subset of the potentially most interesting variables, employing various methods for dimensionality reduction. The system is demonstrated through a use-case analysing data from a DNA sequence-based study of bacterial populations.

Keywords

High dimensional data, dimensionality reduction, quality metrics, visual exploration, interactive visual analysis.

1 C-Research, Linköping University, Sweden
2 Unilever Discover Port Sunlight, United Kingdom

Corresponding author: Sara Johansson Fernstad, Unilever Discover Port Sunlight, Quarry Road East, Wirral CH63 3JW, UK
Email: [email protected]
Tel: +44 151 641 1187

1 Introduction

Data sets including hundreds of variables are increasingly common in various application areas, such as bioinformatics, simulations and social sciences. Analysis and exploration of this data is facilitated by the use of interactive systems and visual representations, but the number of variables effectively displayed using standard visual representations, such as parallel coordinates1,2 and scatter plot matrices,3 is limited. A common way of addressing this problem is to reduce the number of variables prior to visualization.

Numerous dimensionality reduction methods are available, most of which preserve one or a few specific structures when reducing the data set. For many tasks, and especially for exploratory analysis, several structures may be of interest and, hence, extraction of a subset of variables that are interesting due to their influence on more than one structure is desirable. This is addressed in Johansson and Johansson4 where an overall measure of variable interestingness is utilized, taking the form of a weighted sum of several quality metrics. In addition to this, most dimensionality reduction systems fail to provide overview and insight into the structures of the full high dimensional data set. For a small number of structures and a moderate number of variables, an overview can be rather easily presented, for instance using a heat map as suggested by Guo.5 However, as the number of interesting structures and variables increases this becomes a more complex task.

Johansson Fernstad et al.6 address this when presenting a system for exploratory analysis of high dimensional microbial data. Their system was designed in collaboration with microbiologists and based on the concept of combining quality metrics. By utilizing a ranking algorithm7 they provided an overall measure of interestingness based on a set of quality metrics relevant for the microbiology domain. The system incorporated interactive and visual features into the analysis process, which provided user influence, guidance and aid in gaining insight into the characteristics of the microbial data set. Although originally designed for the microbiology domain, many of the features presented in that paper may be just as useful in a broader context, if utilizing a set of generic quality metrics and visual representations. That is the focus of this paper, which presents a generic and highly customizable system for exploratory dimensionality reduction. The system is based upon the ideas of previous papers,4,6 but includes an extended set of visual and exploratory features for analysis of high dimensional data sets. The concept of interestingness is in this paper defined by a set of metrics relating to general statistical properties of data, which can easily be extended with pre-calculated metrics defined by the context of the problem domain. The user is also provided with functionality for interactively modifying the influence of individual metrics on the overall ranking.

Many dimensionality reduction systems automatically retain only the most interesting variables according to some metric, or automatically replace groups of variables fulfilling some criteria. However, a high dimensional data set may quite possibly include variables of certain interest, due to structures or task context, that are not identifiable by an algorithm. The foundation of the techniques presented in this paper is to provide extensive user control combined with quality metric guidance. This enables an interactive and exploratory dimensionality reduction, which is guided by variable ranks and quality metrics but not restricted to selecting only variables fulfilling certain criteria.

In an interactive system where dimensionality reduction is fully controlled by the user, it is important for the analyst to understand relationships within the high dimensional data set, to be able to make informed decisions. The system in this paper includes an interactive environment where variable ranks and quality metrics are used to provide an easily interpreted overview of the structures within the data set. The overview concurrently acts as a visual guidance and control feature for dimensionality reduction.

Figure 1: The primary window of the interactive environment displaying a bacterial population data set including 184 variables. The rank and quality metric profiles of the variables are displayed in the Ranking and Quality view (top left view) and in the glyph view (top right view). The high dimensional data set is displayed in the bottom view.

Figure 1 displays the primary window of the interactive environment.

In the top half of the window, variable structures and relationships are displayed using parallel coordinates and a two-dimensional glyph plot. In the parallel coordinates, axes represent ranks and quality metrics, while polylines represent the variables of the high dimensional data set. In the glyph plot, the quality metric profile of each variable is represented by a glyph, laid out using Principal Components Analysis (PCA).8 The quality metrics are used as input vectors to the PCA, which facilitates the understanding of variable similarities and dissimilarities in terms of quality metric profiles.
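As an illustration of the glyph layout idea, the following minimal sketch projects quality metric profiles onto their first two principal components. It is not the system's implementation: the function name `pca_layout`, the min-max scaling of each metric and the example values are assumptions made only for this example.

```python
import numpy as np

def pca_layout(quality, n_components=2):
    """Place each variable (row) in 2-D from its quality metric profile (columns)."""
    q = np.asarray(quality, dtype=float)
    # Assumption: scale every metric to [0, 1] so that no metric dominates by its units.
    span = q.max(axis=0) - q.min(axis=0)
    span[span == 0] = 1.0
    q = (q - q.min(axis=0)) / span
    centered = q - q.mean(axis=0)
    # The rows of vt are the principal axes of the centered profile matrix.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T

# Hypothetical profiles: 4 variables described by 3 quality metrics.
profiles = np.array([
    [0.9, 0.1, 0.3],
    [0.9, 0.1, 0.3],
    [0.1, 0.8, 0.5],
    [0.2, 0.7, 0.9],
])
positions = pca_layout(profiles)
```

Variables with identical profiles (the first two rows here) receive the same position, so proximity in the layout can be read as similarity of quality metric profiles.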

To summarize, the main contribution of this paper is a generic system for exploratory dimensionality reduction that includes:

• Several methods for interactive and quality guided dimensionality reduction, including:

– variable filtering based on quality metrics and ranking;

– creation of representative variables replacing groups of similar variables;

– manual and quality guided selection of individual variables.

• Customizable variable ranking using a combination of quality metrics, which provides an overall measure of variable interestingness based on several structures.

• An interactive environment that provides:

– overview of the structures within the high dimensional data set, through this enabling analysis of variable relationships in terms of quality metric profiles;

– possibility of identifying variables with specific, desirable or unusual characteristics;

– visual exploration of selected variable subsets using common visualization methods.

The remainder of this paper is organized as follows. The next section presents the background and related work; this is followed by a section that provides a brief overview of the system. In the section Algorithmic analysis the quality metrics and ranking are described in detail, and the section Interactive exploration and dimensionality reduction presents the interactive environment and dimensionality reduction features. The Use-case section demonstrates the usefulness of the presented techniques and the final section concludes and discusses the presented work.

2 Background and related work

Visualization of high dimensional data sets, sometimes including hundreds of variables, is a major challenge in information visualization. It has, in the context of scalability, been identified as one of the top ten challenges within information visualization.9 This section presents previous research related to high dimensional data visualization and presents the contribution of this paper in the context of previous work.

2.1 Visualization

Common visualization methods, such as parallel coordinates,1,2 scatter plot matrices3 and table lenses10 are often useful for analysis of multivariate data sets of moderate sizes. However, as described by Eick and Carr,11 human visual capabilities as well as hardware and interactivity highly affect the efficiency of visual representations. The efficiency of most visual representations is rapidly reduced as the number of variables increases. However, some visual representations designed specifically for visualization of data sets with a large number of variables are available. Examples are the pixel based methods presented by Keim,12 where each pixel in the display represents a data value and where different pixel layouts are used to facilitate identification of different structures. Although able to display more variables than many traditional visualization methods, the pixel based methods are also limited by display size and resolution. They are furthermore perceptually constrained to viewing overall trends, as the limited size of a pixel makes it almost impossible to perceive details. A visual representation related to the glyph view presented in this paper is the Value and Relation (VaR) display, presented by Yang et al.13 The VaR display is a two dimensional plot where each variable is represented by a glyph, designed as a pixel plot and representing the values of the data items. The glyph layout within the plot is based on relationships between variables. Related to this is also the work by Turkay et al.14 where high dimensional data sets are represented in both item space and dimension space. In the dimension space, various statistical properties of variables are represented. It is, thus, in spirit similar to the quality metric profile representations used in both this and the preceding paper by Johansson Fernstad et al.6

In addition to visual representations, some systems for exploration of high dimensional data are also available. For instance Barlowe et al.15 describe a system where partial derivatives are visually represented to the user, aiming to detect correlation among variables. To reduce clutter they utilize a step-wise visual exploration. Another example is the ClusterSculptor system, presented by Nam et al.,16 which combines interactive features and clustering algorithms for clustering of data sets including hundreds of variables. Both these systems focus on a specific task or structure and their utility is hence more task dependent than the utility of the system presented here.

2.2 Dimensionality reduction

A common approach to visualization of high dimensional data is to employ dimensionality reduction prior to visualization. Through this, a lower dimensional projection is extracted which represents the most important structures within the data, and which may be more straightforward and efficient to analyse. By using automated methods such as Multi-dimensional Scaling (MDS),8 Principal Components Analysis (PCA)8 and Self-organizing Maps,17 data sets including hundreds of variables can be efficiently projected onto a low dimensional space where the new variables often are linear combinations of the original variables. Other examples are the projection pursuit algorithm18 which aims at identifying linear projections based on a measure of usefulness, and the family of linear transformations presented by Koren and Carmel,19 where several different properties of the data are made use of, such as similarities and cluster structures. Their dimensionality reduction approach was later utilized by Engel et al.,20 who combined it with a hierarchical clustering method and a visual representation based on star coordinates,21 to create a structural decomposition tree for high dimensional data. A slightly different approach to dimensionality reduction is the grand tour,22 where the analyst moves through a sequence of two-dimensional projections of a multivariate data set. Although the automated methods are efficient, the user often has very little influence over their result and thus the user’s knowledge is not incorporated. An attempt to overcome this was presented by Williams and Munzner,23 where the user through an interactive system can guide and influence the MDS process. A disadvantage of many of the aforementioned methods is that the relationship between the original and the new set of variables is not always straightforward to the user. A more intuitive relationship can for instance be obtained by selecting a representative subset of the original variables, using some quality metric, or by separating the original variables into groups, using a representative variable for each group. In Ivosev et al.24 the Principal Components Variable Grouping was presented, which is an extension of PCA where the principal components are used to group the original variables.

Various interactive systems for dimensionality reduction are available, focusing on different structures within the data. In Jeong et al.25 a system aiming to assist the user in understanding the result of PCA is presented, using an interactive multiple coordinated views approach. Guo5 presents a system where variable pairs are ranked through a ‘goodness of clustering’ metric, enabling identification of potentially interesting subspaces using an entropy matrix to represent the variable pair ranks. A similar approach is available in the rank-by-feature framework, presented by Seo and Shneiderman,26 where variable pairs are ranked based on a selected ranking criterion. Artero et al.27 present an approach performing dimensionality reduction based on a similarity metric. In Sips et al.28 different projections of high dimensional data sets are ranked using class consistency as a quality metric. Similar approaches are presented by Tatu et al.29 and Albuquerque et al.,30 using various metrics to rank projections of different visual representations. Ferdosi and Roerdink31 presented methods for variable ordering in parallel coordinates and scatter plot matrices, focusing on multivariate structures and aiming to find relevant subspaces for clustering. Although these interactive systems use different metrics and focus on different structures, they all consider only one metric at a time. The Visual Hierarchical Dimension Reduction (VHDR) system32 is a system that creates hierarchical structures based on the similarities between pairs of variables. The hierarchy can be interactively modified and through it subsets of interesting variables can be selected. In 2003, Yang et al.33 presented the Dimension Ordering, Spacing and Filtering Approach (DOSFA), which is derived from VHDR. DOSFA performs dimensionality reduction based on a combination of similarity and a second importance value, such as variance. The reduction is carried out by removing all but one of highly similar variables and by removing variables with low importance, hence partially introducing the idea of combining several quality metrics for dimensionality reduction.

Considering the problem of how a subset of variables might be best selected for analysis, with selection guided by a set of metrics of interest, Johansson and Johansson4 demonstrate how an overall measure of variable interestingness can be used to indicate to the user which variables should be displayed, and why they might be of interest. The measure takes the form of a weighted sum of metrics, with the user defining the weights pre-analysis. The sum provides a threshold for selecting a subset of interesting variables. The user decides on the number of variables to retain while investigating the trade-off between number of variables and loss of quality metrics in an interactive display. The most interesting variables, based on the weighted sum, are automatically retained. Another system, focusing on information loss, is presented by Schreck et al.34 They use a projection precision measure for comparison of original and reduced data and visually incorporate it into the visualization. In terms of incorporating user guidance into the dimensionality reduction process, the DimStiller system, presented by Ingram et al.,35 is related to the system presented in this paper. In DimStiller, data analysis and dimensionality reduction are carried out as a chain of step-wise data transformations. The transformations are controlled by the user, who is guided at a global level by workflows that help to find useful chains of transformations, as well as at a local level through visual feedback, facilitating parameter tuning and identification of the most informative settings for a single transformation. Johansson Fernstad et al.6 continued developing the concept of user guidance when presenting a system for exploration of high dimensional microbial data. That system used visual representations as guidance in an interactive dimensionality reduction process, and combined quality metrics into a single measure of interestingness using a non-dominated ranking approach.7 Non-dominated ranking is more commonly used in multiple objective optimization algorithms.7 Related work is found in the area of decision-making in optimization problems where, rather than metrics of interest, several objectives are considered and handled simultaneously. Examples using non-dominated ranking algorithms include Srinivas and Deb36 drawing on Goldberg’s approach, and Fonseca and Fleming,37 who demonstrate an extensive framework for interactive decision-making around several objectives and preference exploration in this context. Subsequently, refinements to ranking approaches for considering multiple objectives have been developed and applied to many problem domains.
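To make the idea of non-dominated ranking concrete, the sketch below assigns rank 1 to variables that no other variable dominates, removes them, assigns rank 2 to the next non-dominated set, and so on. This front-peeling scheme is one common variant and is an assumption here; the exact algorithm of the cited ranking method may differ. The function names and the example values are hypothetical, and higher metric values are assumed to be more interesting.

```python
import numpy as np

def dominates(a, b):
    """True if profile a is at least as good as b in every metric and better in one."""
    return bool(np.all(a >= b) and np.any(a > b))

def non_dominated_ranks(quality):
    """Rank variables (rows) by repeatedly peeling off the non-dominated front."""
    q = np.asarray(quality, dtype=float)
    ranks = np.zeros(len(q), dtype=int)
    remaining = set(range(len(q)))
    rank = 1
    while remaining:
        # Variables that no other remaining variable dominates form the current front.
        front = [i for i in remaining
                 if not any(dominates(q[j], q[i]) for j in remaining if j != i)]
        for i in front:
            ranks[i] = rank
        remaining -= set(front)
        rank += 1
    return ranks

# Hypothetical example: 4 variables, 2 quality metrics.
example = np.array([[0.9, 0.2], [0.3, 0.8], [0.5, 0.1], [0.2, 0.1]])
ranks = non_dominated_ranks(example)
```

In this example the first two variables are each best in one metric and therefore share rank 1, while the last variable is dominated by all others and receives rank 3. No weights are needed, which is the property that makes the approach attractive when users find weights hard to estimate.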

2.3 Summary and contributions

As described above, various dimensionality reduction methods and systems are available. Although useful for many different tasks, these methods may not always provide a suitable environment for exploratory dimensionality reduction and analysis of high dimensional data sets. The majority of the systems focus on only one or a few structures and few provide any overview of the structures within the whole data set. Moreover, many are based on automated methods and do not provide the analyst full control over the reduction.

Figure 2: The workflow of the system includes an algorithmic analysis where variables are analysed and ranked using a range of quality metrics, and an interactive environment which provides a visual overview of the structures in the data set and enables interactive selection of subsets of variables.

Using a weighted sum of quality metrics enables extraction of variable subsets which are important for preserving more than one structure. It has the powerful advantage of providing a single value as a threshold for the most interesting variables. Feedback from industrial data analysts has indicated that users may find this simplicity for decision-making attractive. However, by solely utilizing a single value of interestingness, information is lost regarding values of the original metrics, and the user is unable to explore explicitly the contribution of each metric to the single measure. The user may also be interested in exploring the effect of individual metrics upon the overall view. Additionally, users may find it difficult to estimate an appropriate set of weights, given they must consider the relative importance of a set of statistical concepts. This is addressed both here and in the previous paper6 by extending user interaction with both individual and summarized metrics. The previous system was tailored to specific tasks required by microbiologists, and incorporated biologically-motivated quality metrics along with visual features specific for the microbiology domain. The system presented in this paper is a generic system based on the previous ideas but designed to be useful within any domain. It is extended with metrics relating to general statistical properties and with additional exploratory and guiding features, facilitating flexible high dimensional data analysis and interactive dimensionality reduction.
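The weighted sum measure discussed in this section can be sketched as follows. This is an illustrative example only: the min-max normalization of each metric, the normalization of the weights and the names `weighted_interestingness` and `select_top` are assumptions, not details taken from the cited work.

```python
import numpy as np

def weighted_interestingness(quality, weights):
    """Single interestingness value per variable (row) from several metrics (columns)."""
    q = np.asarray(quality, dtype=float)
    w = np.asarray(weights, dtype=float)
    # Assumption: bring all metrics to [0, 1] so that the weights are comparable.
    span = q.max(axis=0) - q.min(axis=0)
    span[span == 0] = 1.0
    normalized = (q - q.min(axis=0)) / span
    return normalized @ (w / w.sum())

def select_top(scores, keep):
    """Indices of the `keep` highest scoring variables, best first."""
    return np.argsort(-scores, kind="stable")[:keep]

# Hypothetical example: 3 variables, 2 metrics, first metric weighted 3:1.
quality = np.array([[10.0, 0.1], [2.0, 0.9], [6.0, 0.5]])
scores = weighted_interestingness(quality, weights=[3, 1])
top = select_top(scores, keep=2)
```

The single score makes thresholding trivial, but it also hides which metric a variable owes its position to. The second variable above is the best in the second metric yet is discarded when only two variables are kept, which illustrates the loss of information described in the text.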

3 System overview

The workflow of the proposed system can be separated into two main parts, as displayed in Figure 2. The initial part includes various algorithmic analyses, which are described in detail in the next section. The algorithmic analysis has the high dimensional data set as input, along with a set of selected quality metrics. Within this part the data set is analysed, based on the set of metrics, and for each variable in the data set a value is extracted for each quality metric. This is followed by computation of variable ranks and PCA, both based on the extracted quality values.

The second part of the system includes an interactive environment for visual analysis and exploratory dimensionality reduction, as described in detail in the section Interactive exploration and dimensionality reduction. Here the ranks and quality metric profiles of the variables, as well as the original data set, are displayed to the user using linked visual representations. Within the interactive environment the user is provided with various possibilities to explore and gain insights from the structures within the high dimensional data set. Guided by this, the user is provided with several methods for interactively selecting a subset of potentially interesting variables. In parallel with the reduction, the selected subset of variables is displayed using a common visual representation, providing basic functionality for an initial visual analysis of the reduced data set. The reduced data set as well as quality values and variable ranks can be exported and analysed using other appropriate methods and systems.

The quality metric analysis is the only computationally demanding part of the workflow. All interactions and computations performed subsequently are based on the results of the analysis. Due to that, features such as variable ranking and dimensionality reduction are instantly performed once the quality metric analysis has been carried out. The time complexity of the quality metric analysis depends entirely on the quality metrics used. Table 1 provides examples of computation times in milliseconds for the quality metrics described in the Quality metric analysis section. Further comments on computation times are provided along with the description of the corresponding metric. The performance is measured for four data sets of varying sizes. These exemplify the performance of the metrics both for sparse high dimensional data sets, where the number of dimensions is several times higher than the number of items, as well as for data sets where the number of items by far exceeds the number of dimensions. The data sets used are: a synthetic data set, including 100 variables and 1320 items; a bacterial population data set described in the Use-case section, including 184 variables and 50 items;38 and two data sets from the UCI Machine Learning Repository.39 The UCI data sets are the eight hour peak Ozone Level Detection data set, including 73 variables and 2534 items, and the Communities and Crime data set including 127 variables and 1994 items. The tests are run on an HP EliteBook 8540p laptop with an Intel i5 2.53GHz CPU, 4GB RAM and an Nvidia NVS 5100M graphics card.

Metric             Synthetic      Bacterial     Ozone         Crime
                   (100 x 1320)   (184 x 50)    (73 x 2534)   (124 x 1994)
Q_Pearson          264            39            239           527
Q_Spearman         508            300           500           1170
Q_variance         20             2             29            41
Q_skewness         21             2             31            41
Q_linear           4669           548           4521          9955
Q_cluster (fast)   1089           69            750           748
Q_cluster          1104           5011          7020          7020
Q_outlierLR        9              16            43            195
Q_outlierED        3100           160           3849          6395

Table 1: Computation times in milliseconds of quality metrics for the four data sets. Number of variables and items are shown within parentheses.

4 Algorithmic analysis

All dimensionality reduction and exploration within the system are based on an algorithmic analysis of the full high dimensional data set, which is performed prior to visual analysis. The initial step of the algorithmic analysis is the quality metric analysis which is described in detail in the next section. This is followed by a variable ranking, described in the Variable ranking section, and application of PCA to the variables and quality metrics. The visualization of PCA in this context is described in more detail in the section Glyph view.

4.1 Quality metric analysis

The goal of the quality metric analysis is to identify structures in the high dimensional data set, and to assign individual quality values to each of the variables. The quality values represent the strength of the variables’ involvement in a specific structure. For each variable one quality value is computed for each of the quality metrics. The output of the quality metric analysis is, hence, a set of vectors where each vector represents the quality values of one variable. In this paper a set of eleven quality metrics is used to demonstrate the presented techniques, three of which are identical to the quality metrics presented in Johansson and Johansson.4 The metrics have been selected through a review of the quality metrics used and suggested by previous research in the area, literature on statistical data exploration, and industrial data analysts. In addition to the eleven metrics included in the system, pre-calculated metrics can be loaded into the system. This provides flexibility in terms of letting the user focus the analysis on structures relevant for a specific domain. It also enables the utilization of computational powers and strengths of existing software packages. It is worth emphasizing that the ranking, variable selection and exploration features presented in this paper are generic and do not rely on the specific quality metrics described in this section. They are primarily used as a proof of concept of how several quality metrics can be used concurrently to reduce and analyse a high dimensional data set. Any other quality metric from which it is possible to extract an individual quality value for each variable could be used just as well. It should also be noted that although the metrics described in the paper are all designed for numerical data, the basic concepts of the approach could be used for other types of data as well, provided that the quality metrics used are properly designed for the corresponding data type.

When computing the quality values of a variable there is sometimes a risk that involvement in a large number of insignificant structures might add up to what appears to be involvement in significant structures. To avoid this, various thresholds are used to define whether a structure is significant enough to be included. Throughout the remainder of this section the following notation will be used: a data set $X$ includes $M$ variables and $N$ items; $\vec{x}_i$ is an item, where $i = 1, \ldots, N$; $\vec{x}_j$ and $\vec{x}_k$ are variables, where $j, k = 1, \ldots, M$; and $x_{i,j}$ is the data value for item $\vec{x}_i$ in variable $\vec{x}_j$.

4.1.1 Clusters

A density based approach is taken to analysis of cluster structures, defining a cluster as a region with higher density than its surrounding regions. To identify multi-dimensional clusters the MAFIA cluster algorithm,40 which has evolved from the CLIQUE cluster algorithm,41 is used. The algorithm initially identifies one dimensional dense units (clusters). It then iteratively extracts higher dimensional clusters by combining lower dimensional clusters, retaining only clusters with density above a given threshold. This is an approach similar to Apriori reasoning.42 Additional details on the clustering algorithm and computation of cluster quality values can be found in Johansson and Johansson.4 The cluster metric is designed such that high cluster quality values are assigned to variables that are included in subspaces with high quality clusters. A cluster is considered to be of high quality if it has high density, high coverage and exists in a subspace which includes a large number of variables.

As seen in Table 1, the cluster quality metric is one of the three most computationally demanding metrics, along with the linearity metric and the outlier metric based on Euclidean distance. The computation time of the clustering algorithm is highly dependent on the number of sub-clusters and on their dimensionality.40,41 Hence, complex cluster structures may result in long computation times even if the total number of variables is not very high. To speed up the cluster analysis the maximum dimensionality of a cluster is limited by a threshold that is defined by the user prior to analysis. Furthermore, a default option for fast cluster analysis is provided, where only cluster structures within individual variables are analysed. For data sets with a large number of one-dimensional cluster structures, indicating a long computation time, a warning is presented to the user, suggesting that the faster one-dimensional analysis should be used instead. For comparison, the computation times for both the fast cluster analysis of individual variables and a multi-dimensional cluster analysis are presented in Table 1.

4.1.2 Correlation

Two quality metrics are available for analysis of correlation: the Pearson correlation coefficient,43 $r$, and the Spearman rank correlation coefficient,44 $\rho$. For both metrics the correlation coefficients of all variable pairs are calculated initially. Individual correlation quality values are then calculated for all variables, where the Pearson quality value for variable $\vec{x}_j$ is defined as

$$Q_{Pearson}(\vec{x}_j) = \sum_{k=1}^{M} |r(\vec{x}_j, \vec{x}_k)| \quad \text{for } k \neq j \text{ and } |r(\vec{x}_j, \vec{x}_k)| \geq \varepsilon,$$

where $\varepsilon$ typically is in the range 0.05 to 0.8. The Spearman correlation quality value, $Q_{Spearman}(\vec{x}_j)$, is computed in a corresponding way using $|\rho(\vec{x}_j, \vec{x}_k)|$. Using this method, high quality values are assigned to variables which are strongly correlated with a large number of variables. If using a high $\varepsilon$ threshold only the strongest correlations are taken into consideration. In addition to correlation analysis where positive and negative correlations are considered equally important, as above, quality values focusing either on positive or negative correlations are available as well. These are extracted by calculating

$$Q_{Pearson}(\vec{x}_j) = \sum_{k}^{M} r(\vec{x}_j, \vec{x}_k)$$

with $r(\vec{x}_j, \vec{x}_k) \geq \varepsilon$ and $r(\vec{x}_j, \vec{x}_k) \leq -\varepsilon$ focusing on positive and negative correlation respectively, and similarly for $Q_{Spearman}$. The computation time for the correlation metrics, as presented in Table 1, includes the total computation time for absolute, positive and negative correlation metrics, since they are all based on the same analysis of pairwise correlations.
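The correlation quality values can be sketched with NumPy as below. The function name `correlation_quality`, the default threshold and the example data are assumptions made for illustration. The negative variant sums the raw coefficients as the formula is written, so its values are at most zero; summing magnitudes instead would be an equally plausible reading.

```python
import numpy as np

def correlation_quality(data, eps=0.3, mode="absolute"):
    """Pearson correlation quality per variable; data has items in rows, variables in columns.

    Note: constant variables give undefined (NaN) correlations and are not handled here.
    """
    r = np.corrcoef(np.asarray(data, dtype=float), rowvar=False)
    np.fill_diagonal(r, 0.0)  # exclude the pair k == j
    if mode == "absolute":
        contrib = np.where(np.abs(r) >= eps, np.abs(r), 0.0)
    elif mode == "positive":
        contrib = np.where(r >= eps, r, 0.0)
    elif mode == "negative":
        contrib = np.where(r <= -eps, r, 0.0)
    else:
        raise ValueError("mode must be 'absolute', 'positive' or 'negative'")
    return contrib.sum(axis=1)

# Hypothetical data: y = 2x and z = -x are perfectly correlated with x, w is uncorrelated.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
data = np.column_stack([x, 2 * x, -x, [2.0, 1.0, 3.0, 1.0, 2.0]])
q_abs = correlation_quality(data, mode="absolute")
q_pos = correlation_quality(data, mode="positive")
q_neg = correlation_quality(data, mode="negative")
```

All three variants reuse the same pairwise correlation matrix, which matches the remark that their computation time is reported together. Applying the same function to rank-transformed columns would give a Spearman based value, since the Spearman coefficient is the Pearson coefficient of the ranks.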

4.1.3 Distribution

Data distribution is analysed through two quality metrics: variance and skewness. The quality values, $Q_{variance}(\vec{x}_j)$ and $Q_{skewness}(\vec{x}_j)$, of a variable, $\vec{x}_j$, are extracted by analysing the distribution within $\vec{x}_j$, using standard methods as described in Wackerly et al.45 and Kendall et al.46
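As a rough illustration, the two distribution measures can be computed from the central moments of a variable. The exact estimators used in the system are not stated here, so the population variance and the moment-based skewness coefficient below are assumptions.

```python
def variance(values):
    """Population variance (second central moment); the estimator is an assumption."""
    n = len(values)
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / n

def skewness(values):
    """Moment-based skewness m3 / m2**1.5; the estimator is an assumption."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    if m2 == 0:
        return 0.0  # a constant variable has no skew
    return m3 / m2 ** 1.5
```

A variable in which most items are zero and a few are large, as is common for bacterial counts, yields a clearly positive skewness under this definition.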

4.1.4 Linearity

The linear relationships within the data set are analysed based on linearity between pairs of variables. For each variable pair a line of best fit is identified, using linear regression methods as described in Draper and Smith.47 To determine whether the relationship between two variables is linear, a runs test45,47 is used. The runs test examines whether the deviations of the residuals are random, indicating a linear relationship, or systematic, indicating a possibly non-linear relationship. A test statistic, $z$, is computed to test the null hypothesis that the residuals are randomly distributed. The rejection region for the null hypothesis in a two-tailed test is $|z| \geq 1.96$, using a 95% confidence level.45 Hence, $|z| \geq 1.96$ indicates a non-linear relationship for the variable pair and a small $|z|$ value indicates a high probability of a linear relationship. Within the presented system, high quality values are assigned to variables that are part of many variable pairs where the probability of linearity is high. The linearity quality value for variable $\vec{x}_j$ is defined as

$$Q_{linear}(\vec{x}_j) = \sum_{k=1,\, k \neq j}^{M} (\zeta - |z_{j,k}|) \quad \text{for } |z_{j,k}| \leq \zeta,$$

where $\zeta$ typically has a value of 1.96, based on the confidence level. As previously mentioned, and as seen in Table 1, linearity is one of the three most computationally demanding metrics.
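The linearity metric can be sketched as follows, assuming the standard Wald-Wolfowitz runs test applied to the signs of the residuals, ordered by the explanatory variable. The function names are hypothetical, and computing the statistic once per variable pair (with the first variable as explanatory variable) is an assumption about how $z_{j,k}$ is obtained.

```python
import math

def fit_line(x, y):
    """Least-squares slope and intercept of y on x (x must not be constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx

def runs_z(signs):
    """Runs-test statistic for a sequence of booleans (residual signs)."""
    n_pos = sum(signs)
    n_neg = len(signs) - n_pos
    n = n_pos + n_neg
    if n_pos == 0 or n_neg == 0:
        return 0.0  # assumption: degenerate case treated as no evidence
    runs = 1 + sum(a != b for a, b in zip(signs, signs[1:]))
    mean = 2 * n_pos * n_neg / n + 1
    var = 2 * n_pos * n_neg * (2 * n_pos * n_neg - n) / (n * n * (n - 1))
    if var <= 0:
        return 0.0
    return (runs - mean) / math.sqrt(var)

def pair_z(x, y):
    """z statistic for the residuals of y regressed on x, ordered by x."""
    slope, intercept = fit_line(x, y)
    order = sorted(range(len(x)), key=lambda i: x[i])
    signs = [y[i] - (slope * x[i] + intercept) > 0 for i in order]
    return runs_z(signs)

def linearity_quality(variables, zeta=1.96):
    """Sum (zeta - |z|) over all pairs of a variable with |z| <= zeta."""
    m = len(variables)
    quality = [0.0] * m
    for j in range(m):
        for k in range(j + 1, m):
            z = abs(pair_z(variables[j], variables[k]))
            if z <= zeta:
                quality[j] += zeta - z
                quality[k] += zeta - z
    return quality
```

Note that both too few runs (a systematic curve) and too many runs (strictly alternating residuals) give a large $|z|$, so both are treated as departures from randomness in a two-tailed test.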

4.1.5 Outliers

Two methods are provided for detection of outliers. One is based on the outlier definition used within linear regression analysis,47 where an outlier is defined as a residual whose absolute value lies a number of standard deviations from the residual mean. To identify two-dimensional outliers, the residuals of all variable pairs are analysed during the linear regression analysis described in the previous section. An item, $\vec{x}_i$, is defined as an outlier if its distance, $\delta_i$, from the residual mean is greater than $\tau$ standard deviations. Higher order outliers are defined as items which are outliers for a set of variable pairs. High significance is assigned to outliers of high dimensionality and with large $\delta_i$. An outlier value is computed for each higher order outlier, defined as

$$o_i = \sum_{l=1}^{K} \delta_i(l),$$

where $K$ is the number of variable pairs for which the item is a two-dimensional outlier.
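The residual-based outlier value can be sketched as below. The function names and the default $\tau = 2$ are illustrative assumptions, as is the choice to express $\delta_i$ in units of the population standard deviation of the residuals. Pairs whose residuals are all (numerically) zero are skipped, which is a further assumption.

```python
import math

def residuals(x, y):
    """Residuals of y around its least-squares line on x (x must not be constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sxx
    return [b - (my + slope * (a - mx)) for a, b in zip(x, y)]

def outlier_values(variables, tau=2.0):
    """Return o_i per item: the sum of delta_i over pairs where the item is an outlier."""
    n_items = len(variables[0])
    m = len(variables)
    o = [0.0] * n_items
    for j in range(m):
        for k in range(j + 1, m):
            res = residuals(variables[j], variables[k])
            mean = sum(res) / n_items
            std = math.sqrt(sum((e - mean) ** 2 for e in res) / n_items)
            if std < 1e-12:
                continue  # perfectly linear pair: no residual spread to judge by
            for i, e in enumerate(res):
                delta = abs(e - mean) / std
                if delta > tau:
                    o[i] += delta  # more pairs and larger distances raise o_i
    return o
```

An item that deviates in several variable pairs accumulates a larger value, which reflects the higher significance assigned to outliers of high dimensionality.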

The second method is a density and grid based approach using the Euclidean distance, described in detail in Johansson and Johansson.4 Using this method an item, $\vec{x}_i$, is defined as an outlier if the number of neighbour items, $\psi_i$, within a given radius, $\varphi$, around $\vec{x}_i$ does not exceed a threshold, $\varsigma$. Similar to the outliers based on residual distances, an outlier value, $o_i$, is computed for each outlier, with high significance being assigned to outliers of high dimensionality and with few neighbouring items. The quality values of the outlier metrics, $Q_{outlierLR}(\vec{x}_j)$ and $Q_{outlierED}(\vec{x}_j)$, for variable $\vec{x}_j$ are computed by summing the corresponding $o_i$ for all outliers belonging to $\vec{x}_j$, where $o_i \geq \varpi$ and $\varpi$ typically has a value of 1.
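The neighbour-count criterion of the second method can be illustrated with a brute-force sketch. This is not the grid based algorithm of Johansson and Johansson, which avoids comparing all item pairs; it only shows the outlier definition itself, and the function and parameter names are hypothetical.

```python
import math

def density_outliers(items, radius, max_neighbours):
    """Flag items whose neighbour count within `radius` does not exceed `max_neighbours`.

    items: list of equal-length coordinate tuples.
    """
    flags = []
    for i, a in enumerate(items):
        # psi: number of other items within the given Euclidean radius
        psi = sum(
            1 for j, b in enumerate(items)
            if j != i and math.dist(a, b) <= radius
        )
        flags.append(psi <= max_neighbours)
    return flags
```

The brute-force comparison is quadratic in the number of items, which is consistent with this metric being among the most computationally demanding ones.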

In terms of computation times, as displayed in Table 1, the outlier analysis based on Euclidean distance is one of the three most computationally demanding metrics. For outlier analysis based on linearity, on the other hand, some of the required computations are performed during the linearity analysis. Hence, the presented computation time of $Q_{outlierLR}$ corresponds only to the outlier detection part, and does not include identification of the line of best fit.

4.2 Variable ranking

The aim of utilizing variable ranking is to extract a single value of interestingness for each variable, based on the values obtained during quality metric analysis. Through this the importance of several structures is taken into consideration, with the benefit of not requiring any prior knowledge from the analyst regarding the structures existing in the data set or regarding which structures are of most interest.

The ranking algorithm used here was proposed by Goldberg;7 however, others are available and may be suitable for similar use. The rank provides a related filter which ensures that all variables with the same rank achieve a level of equivalence in their metric profile, by use of the non-domination principle. Vectors containing the quality values of variables are used as input vectors for the ranking algorithm. To obtain a variable ranking that is based on a subset of the available metrics, different combinations of quality metric vectors can be used. The ranking algorithm is formally defined as:

$$(\vec{q}_j <_p \vec{q}_k) \iff (\forall m)(q_{m,j} \leq q_{m,k}) \wedge (\exists m)(q_{m,j} < q_{m,k}) \qquad (1)$$

where $\vec{q}_j$ and $\vec{q}_k$ are vectors containing the quality values of variables $\vec{x}_j$ and $\vec{x}_k$.7 It is said that $\vec{q}_j$ is partially less than $\vec{q}_k$, written $(\vec{q}_j <_p \vec{q}_k)$, if the condition stated in eq. (1) is fulfilled, where $q_{m,j}$ is the quality value of variable $\vec{x}_j$ for metric $m$, and where $m = 1, ..., Q$ and $Q$ is the total number of quality metrics to rank by. Thus, $\vec{q}_j$ is defined as partially less than $\vec{q}_k$ if $\vec{q}_j$ is less than or equal to $\vec{q}_k$ for all metrics, and if there exists at least one metric for which $\vec{q}_j$ is less than $\vec{q}_k$. If $\vec{q}_j$ is partially less than $\vec{q}_k$ it is said that $\vec{q}_j$ dominates $\vec{q}_k$. A variable that is not dominated by any other is said to be non-dominated.

For illustration, six variables ($\vec{x}_a$ to $\vec{x}_f$) are assigned ranks based on two metrics ($m_1$, $m_2$), as displayed in Figure 3. The analyst is interested in variables with high values for both metrics, and hence wants a high rank value to be assigned to those. The metric values for each variable $\vec{x}_a$ to $\vec{x}_f$ are calculated and compared with each other. Variable $\vec{x}_f$ has the lowest values for both $m_1$ and $m_2$, and receives a rank of 1. Among the remaining unranked points, $\vec{x}_e$ has the lowest value of $m_2$, and $\vec{x}_d$ the lowest value of $m_1$, but neither has smaller values in both metrics than the other. $\vec{x}_e$ and $\vec{x}_d$ are thus non-dominated by each other and both receive rank 2. Similarly, points $\vec{x}_b$ and $\vec{x}_c$ receive rank 3. Point $\vec{x}_a$ has higher values in both metrics than all other points and receives a rank of 4. All variables now have a rank, and the process is complete.

Figure 3: An illustration of the non-dominated ranking process, based on Goldberg,7 for six variables ($\vec{x}_a$ to $\vec{x}_f$) and two quality metrics ($m_1$ and $m_2$).
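The ranking procedure can be written down as a short sketch. The coordinates below are hypothetical values chosen only to be consistent with the ranks described for the six variables; they are not read from Figure 3, and the function names are likewise assumptions.

```python
def partially_less(qj, qk):
    """True if qj <= qk in every metric and qj < qk in at least one (eq. 1)."""
    return (all(a <= b for a, b in zip(qj, qk))
            and any(a < b for a, b in zip(qj, qk)))

def non_dominated_rank(quality):
    """Assign rank 1 to the non-dominated variables, remove them, and repeat.

    quality: dict mapping variable name to a tuple of quality values.
    """
    remaining = dict(quality)
    ranks = {}
    rank = 1
    while remaining:
        front = [
            name for name, q in remaining.items()
            if not any(partially_less(other, q)
                       for other_name, other in remaining.items()
                       if other_name != name)
        ]
        for name in front:
            ranks[name] = rank
            del remaining[name]
        rank += 1
    return ranks

# Hypothetical (m1, m2) values for the six variables of the illustration.
example = {
    "xa": (7, 7), "xb": (3, 6), "xc": (6, 3),
    "xd": (2, 4), "xe": (4, 2), "xf": (1, 1),
}
example_ranks = non_dominated_rank(example)
```

With these values `example_ranks` assigns rank 1 to `xf`, rank 2 to `xd` and `xe`, rank 3 to `xb` and `xc`, and rank 4 to `xa`, so that variables with high values in both metrics receive the highest rank.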

Due to the system structure, the variable ranking is based on the quality values, which were extracted during quality metric analysis. The time complexity of the variable ranking is thus dependent on the number of quality metrics used and the number of variables in the data set. Using eleven quality metrics, the computation time for variable ranking in the four data sets described earlier was only a few milliseconds.

5 Interactive exploration and dimensionality reduction

Within the interactive environment of the system various functionalities are provided for selecting subsets of potentially interesting variables. The selection is guided by visual representations of the algorithmic analysis results. The visual representations provide possibilities both to explore the data and to gain insight into the overall characteristics of the high dimensional data set. They also facilitate identification of variables with specific, desirable or unusual characteristics. Prior to quality metric analysis the user is presented with the metric selection window, as displayed in Figure 4. In the left half of the window is an interface for selection of quality metrics to use. The left list box displays metrics selected by the user and the right list box displays metrics that are available but not yet selected. The interface also provides possibilities to save and load a metric setup, and to load pre-calculated metrics. The right part of the metric selection window displays a two dimensional PCA-plot of the data set. This plot is included to provide a fast initial overview of the data, which may act as guidance for selection of quality metrics and for focus of subsequent analysis.

The interactive environment includes several linked windows and views, described in detail in the following sections. In the data view in the bottom of the primary window, displayed in Figure 1, the high dimensional data set is displayed using a visual representation selected by the user. The current implementation of the system includes three common visual representations for the data view: parallel coordinates, scatter plot matrix and table lens. Due to their ability to display multivariate patterns, such as clusters, parallel coordinates are used as the default representation. The visual representation in the data view is interactively linked to the dimensionality reduction, which is described in the sections Variable merging and Variable selection, and only displays the currently selected subset of variables when the dimensionality is reduced. The interactive linking is designed such that the visual representation is reduced or expanded by removing or adding variables, in an accordion-like way, when variables are removed from or added to the selected subset. This provides an initial visual analysis of the selected variable subset, enabling fast confirmation as to whether the current subset is of interest for further analysis, or if another subset should be selected instead. It also provides visual feedback for deciding on the appropriate number of variables to display, since the analyst will be able to perceive when patterns become visible while reducing the data.

Figure 4: The metric selection window for selection of quality metrics to use. A PCA-plot to the right provides an initial overview of structures which guides the user in selection of metrics. Here two clusters are clearly visible.

The variables in the data view can be interactively re-ordered through drag and drop features. However, since the focus of this work is primarily on interactive and quality guided dimensionality reduction, the current implementation of the system does not include any automated variable ordering algorithms. Nevertheless, recognizing the influence of variable order on our ability to perceive patterns in visual representations, the inclusion of automated variable ordering is considered for future implementations. If available, classification information can be loaded into the system along with the data set. The classification may then be used for colouring in the data view and as additional information displayed as mouse-over tool-tips. In addition to the visual representations of the primary view, a two dimensional PCA-plot of the full high dimensional data set is available in a separate window. This window is linked to the data view in the primary window in terms of colouring, selection and filtering. This plot uses the same PCA computation as the metric selection window.

5.1 Ranking and Quality view

The Ranking and Quality (RaQ) view is displayed in the top left part of the primary window (Figure 1). The aim of this view is twofold: firstly, it is meant to provide an overview of structures within the full high dimensional data set, in terms of the quality metric profiles of the variables; and secondly, it is meant to act as a control panel for interactive dimensionality reduction. Quality metric profiles may be thought of as multivariate patterns of variables, and interactive dimensionality reduction can be thought of as filtering of variables. Based on this, parallel coordinates were selected as the visual representation to use in the RaQ view, since it enables both analysis of multivariate patterns and interactive subset selection through filtering along axes. In the parallel coordinates of the RaQ view, the variables of the high dimensional data set are represented by polylines, whereas the axes of the plot represent variable rank and quality metrics. Hence, a polyline in the RaQ view corresponds to an axis in the data view. To emphasize the relationship between the views, the polyline colour in the RaQ view matches the axis colour in the data view.

Through the RaQ view the analyst can obtain an understanding of the overall structures within the data set. Through highlighting of individual polylines it also provides possibilities of examining details of particular variable profiles. Furthermore, variables differing from the overall structures, or variables with specific desirable or undesirable properties, can easily be identified. By using the filter sliders of the axes it is possible to explore further the relationships between structures in the high dimensional data set. While examining the structures in the RaQ view, some of the quality metrics may be identified as a poor basis for extracting a subset of the most interesting variables in the current data set. By deselecting such quality metrics, using the check box below the corresponding axis in the RaQ view, new variable ranks are instantly re-computed, excluding the deselected quality metrics from the ranking procedure. Similarly, to increase the flexibility of the ranking, individual quality metrics can be inverted using the button below the corresponding axis in the RaQ view. For an inverted metric, a high rank will be assigned to variables with low quality values.
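The effect of deselecting or inverting a metric before re-ranking can be sketched as follows. Implementing inversion by negating the quality values is an assumption, as are the function and parameter names; the ranking itself follows the non-dominated principle of eq. (1).

```python
def partially_less(qj, qk):
    """True if qj <= qk in every metric and qj < qk in at least one."""
    return (all(a <= b for a, b in zip(qj, qk))
            and any(a < b for a, b in zip(qj, qk)))

def rank_variables(quality, active, inverted=()):
    """Re-rank using only the `active` metric indices, negating `inverted` ones.

    quality: dict mapping variable name to a sequence of quality values.
    """
    vectors = {
        name: tuple(-q[m] if m in inverted else q[m] for m in active)
        for name, q in quality.items()
    }
    ranks, rank = {}, 1
    while vectors:
        front = [
            name for name, q in vectors.items()
            if not any(partially_less(other, q)
                       for other_name, other in vectors.items()
                       if other_name != name)
        ]
        for name in front:
            ranks[name] = rank
            del vectors[name]
        rank += 1
    return ranks
```

For two variables with quality values (1, 5) and (2, 3), ranking on both metrics places them in the same rank, whereas inverting or deselecting the second metric separates them into two ranks.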

5.2 Glyph view

To the right of the RaQ view in the primary window is the glyph view, which is inspired by the VaR display.13 In this view each variable in the high dimensional data set is represented by a glyph. The glyphs are laid out in a two dimensional PCA-plot where the quality metrics are the input vectors of PCA (equivalent to the second through seventh axis in the RaQ view in Figure 1). The glyphs are made up of a number of squares, each corresponding to one quality metric, where the opacity of the square represents the quality value. To emphasize the connection between the glyph view and RaQ view, and to facilitate the interpretation of glyphs, the base colours of the glyph squares are the same as the axis colour of the corresponding quality metric in the RaQ view. Equivalently, the colour of the glyph borders corresponds to the colour of polylines and axes in the RaQ and data views. Glyphs that are not part of the currently selected subset of variables are displayed without border.

The glyph view complements the RaQ view through its focus on representing relationships between variables. Due to the use of PCA for glyph layout, groups of variables with similar quality metric profiles can be identified. An example is the small cluster of glyphs positioned in the top right part of the glyph view in Figure 1. The spatial proximity of glyphs also helps in identifying variables that are not selected but have similar profiles to a selected variable, indicating that they may be of interest. It furthermore enables identification of variables that are outliers in terms of quality metric profiles, as they are separated from the majority of glyphs in the plot. This too may indicate properties worth investigating further. To simplify the identification of individual variables, mouse-over tool-tips showing the corresponding variable name are displayed while hovering over glyphs.

A major issue when using glyphs in scatter plot displays is the problem of glyph overlap, meaning that glyphs are so closely positioned that they cover each other. The glyph view of this system includes a zooming functionality for focusing on a subset of glyphs, which to some extent overcomes the overlap issue. Future options for overcoming this may be to provide possibilities of replacing glyphs with smaller points, or to spread a group of selected glyphs using a re-positioning algorithm. The former would reduce the overlap and provide better detail regarding which points are actually laid out closest together, but it would not provide any additional details regarding the quality metric profiles of the variables. The second approach, on the other hand, would preserve details on quality metric profiles, while to some extent distorting the representation of glyph relationships in the display. Both alternatives may be useful as selectable features for the user. The primary purpose of the glyph view is, however, to provide an overview of relationships between variables, for instance in terms of clusters and outliers, rather than providing detailed information regarding the quality metric profiles of individual variables.

Figure 5: The variable merging window, including a list of suggestions of variable groups to merge to the left and two visual representations to the right displaying the selected variable group together with its representative variable. The left view displays a group of three strongly correlated variables. The right view displays a group of six variables where the first and sixth (represented by black diagonal cells) are deselected and will not be included if the variables are merged.

5.3 Variable merging

Within a high dimensional data set there may be groups of variables which are very similar and, hence, may be more interesting as a group rather than as individual variables. To remove redundant information and provide more space for other variables, such groups may preferably be replaced by a representative variable. This is addressed in the presented system through the variable merging window (Figure 5). In this window groups of highly similar variables are automatically extracted, based on the previously computed Pearson correlation. A correlation threshold for extracting these groups is interactively set by the user through a slider. The variable groups, as displayed in the left part of the window, are suggestions to the user of variables possibly meaningful to merge, and can be selected for further examination. A selected group of variables is displayed in the right part of the window, along with a representative variable for the group. Here a combination of parallel coordinates and a scatter plot matrix is used for examination of correlation patterns. In the scatter plot matrix, blue coloured cells are used as an aid for distinguishing strong correlations from weaker correlations. The representative variable, positioned as the rightmost variable, represents the average of the variable group.

Through the visual representations, the user gets an understanding of relationships within the group, which guides the decision whether to merge or not. Any variable within a group can be excluded from the merging, resulting in an instant re-computation of the representative variable. The right part of Figure 5 displays an example of this, where the first and sixth variables are excluded. Facilitated by the combination of automated extraction of strongly correlated variables, and guidance through visual representations, an analyst is able to quickly make informed decisions whether or not groups of variables are better represented by a single variable. While groups of strongly correlated variables may be hard to extract manually from a high dimensional data set, a fully automated method would not have been able to distinguish variables that should not be merged. Reasons for not merging variables may, for instance, be a difference in meaning or that the correlation is the result of a strongly skewed distribution. When a group of variables is merged, the variables are replaced by the representative variable within all views of the primary window. Quality values and ranks are computed for the new variable, employing the algorithms described in the Algorithmic analysis section and by using the average values of any pre-calculated metrics. To make the representative variable distinguishable from the original variables it is represented by red colour in all views.
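A minimal sketch of the merging suggestions is given below. How the system forms groups from the pairwise correlations is not specified in the text; here groups are assumed to be the connected components of the graph linking variable pairs whose Pearson coefficient reaches the threshold. The representative variable is the item-wise average of the group members, as described above. All names are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0

def correlated_groups(variables, threshold=0.9):
    """Suggest groups to merge (assumed: connected components above threshold)."""
    names = list(variables)
    parent = {name: name for name in names}

    def find(name):
        while parent[name] != name:
            name = parent[name]
        return name

    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if pearson(variables[a], variables[b]) >= threshold:
                parent[find(b)] = find(a)
    groups = {}
    for name in names:
        groups.setdefault(find(name), []).append(name)
    return [group for group in groups.values() if len(group) > 1]

def representative(variables, group, excluded=()):
    """Item-wise average of the group members that are not excluded."""
    members = [variables[name] for name in group if name not in excluded]
    return [sum(values) / len(values) for values in zip(*members)]
```

Excluding a variable simply changes the list of members that is averaged, which is why the representative variable can be re-computed instantly when the user deselects a group member.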

5.4 Variable selection

The primary dimensionality reduction within the system is controlled through the RaQ view. The filter sliders of the axes are used to select a subset of variables, enabling an interactive dimensionality reduction which is guided by quality metrics and ranking. The variable selection is instantly performed and reflected in the bottom view, where only the selected subset of variables is displayed. This provides instant visual feedback regarding the number of variables that can be effectively displayed in the currently used visual representation. It also provides a clear indication of when interesting structures become visible.

Many dimensionality reduction systems automatically retain a subset of the most interesting variables, not making full use of the knowledge of the user. Selection through filtering provides user control and guidance in identifying potentially interesting variable subsets. However, a data set may still include variables of interest which may not have been assigned high ranks. To enable a fully user controlled reduction, this system allows for manual selection of variables. Any manually selected variable will be unaffected by filtering and retained in the displayed variable subset until deselected. The manual selection includes selection through picking of polylines in the RaQ view, selection of axis headers in the data view and selection of glyphs in the glyph view, as well as selection from a list including variable names and rank. Polylines, glyph borders and axes representing manually selected variables are highlighted in blue. Likewise, a variable assigned a high rank may be less interesting than its rank indicates, and can hence be manually removed by the user in a similar way.

Through the ranking, an overall measure of interestingness based on several quality metrics is provided. Hence, when filtering along the rank axis (leftmost axis in the RaQ view in Figure 1) a subset of the most interesting variables based on several quality metrics is selected, which aids the user in quickly selecting a potentially interesting subset. However, an issue arising is that the number of unique ranks tends to decrease as the number of quality metrics increases, concentrating many variables into few ranks and sometimes making the rank filtering too broad a tool for reduction. Due to this a subdivision of ranks may be desirable. This is provided through a possibility of spreading the polylines within a rank according to one of the quality metrics. Furthermore, variables with similar quality metric profiles may sometimes be separated into different ranks. This is addressed through the possibilities of filtering on individual quality metrics as well as through the possibility of identifying and selecting variables with similar profiles within the glyph view.
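The interplay between rank filtering, manual selection and subdivision of ranks can be sketched as below. The function names and the data layout are hypothetical, and letting manual removal take precedence over manual selection is an assumption.

```python
def select_variables(ranks, rank_range, pinned=(), removed=()):
    """Variables inside the rank filter, plus pinned ones, minus removed ones."""
    low, high = rank_range
    selected = {name for name, rank in ranks.items() if low <= rank <= high}
    selected |= set(pinned)    # manually selected variables ignore the filter
    selected -= set(removed)   # manually removed variables are always dropped
    return selected

def order_within_rank(selected, ranks, metric):
    """Order by rank, subdividing each rank by one quality metric (highest first)."""
    return sorted(selected, key=lambda name: (ranks[name], metric[name]),
                  reverse=True)
```

Sorting on the pair of rank and metric value keeps the rank as the primary criterion while spreading the variables within a rank, which is one way of narrowing a rank that contains many variables.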

6 Use-case

This section will demonstrate some of the presented techniques and describe how they may be used to explore a high dimensional data set for identification of interesting structures and forming of hypotheses. The use-case demonstrates how an analyst may use methods in the presented system to analyse a bacterial population data set, but is not intended to provide an exhaustive account of how a full analysis might be completed.

The data set analysed is from a 16S ribosomal DNA sequence-based study of bacterial populations. Such studies can generate high dimensional data sets which require both exploratory and confirmatory analysis. In this study, data represent levels of 184 OTUs (operational taxonomic units), indicative of different bacterial species in the human mouth. It comprises 95% of the cumulative total population found in fifty samples, taken from ten healthy panellists in five sites of the mouth. The goal of such a study is to define bacterial ecology of the mouth and thereby develop innovative products for oral health.38 The analyst would like to explore the profile of the samples across bacterial populations in the context of the full data set, and by this identify possible differences between sample sites in terms of bacterial counts and explore which bacteria may be more or less commonly occurring within different sites or subjects. When analysing this data an analyst would expect to apply a range of statistical and multivariate techniques, including familiar dimensionality reduction methods such as PCA or MDS,8 as well as data visualization approaches. The techniques presented in this paper can accompany such analysis by enabling visual exploration and guidance. In the context of this use-case, data will be referred to using the following terminology: samples correspond to data items, and variables are referred to as OTUs. Thus, OTUs are represented by polylines and glyphs in the RaQ and glyph views. In the data view, in the bottom of the primary window, OTUs are represented as axes and samples as polylines. The OTU value of a sample relates to the bacterial count of the corresponding OTU.

6.1 Initial overview

The data set is loaded into the system. Through the initial PCA-plot, as displayed in Figure 4, an initial overview of structures is presented to the analyst. It can be seen that the majority of data samples belong to one of two clearly separated clusters. The analyst wants to examine this pattern further, to identify which structures and combinations of OTUs may drive the separation, and uses the identified pattern as a basis for her continued analysis and quality metric selection. Firstly, the three Pearson correlation metrics are selected, since correlations may indicate co-occurrence of combinations of OTUs within samples, which is an important aspect of understanding bacterial populations. Secondly, the cluster metric is selected together with the variance and skewness metrics, since these, through their analysis of sample distribution and grouping within OTUs, may help in examining sample group separation.

When the automated quality metric analysis and ranking are performed, the high dimensional data set, as well as quality metrics and OTU ranks, are displayed in the primary view of the system, as shown in Figure 1. It is apparent that the lower view, displaying the full high dimensional data set, is too crowded for exploration to be conducted without some reduction. However, in the RaQ view, where the quality metric profiles of the variables are represented, the analyst notices some interesting patterns. For instance, it can be seen that only a small number of OTUs include strong cluster structures and only two are strongly skewed, as can be seen from the few polylines representing OTUs with high values for the skewness and cluster quality metrics (first and third axis from the right in the upper left view in Figure 1). Moreover, the RaQ view reveals a negative correlation between the variance and skewness metrics (the two rightmost axes).

Figure 6: Highlighting quality metric profiles of three OTU clusters selected in the glyph view. OTUs are represented by polylines in the left view and by glyphs in the right view.

In the glyph view, some additional relationships between OTUs are identified. Some smaller glyph clusters, separated from the majority of glyphs, are identified through their spatial proximity and similarity in colour. These represent highly similar OTUs that differ from the majority of OTUs, in terms of quality metric profiles, and may, hence, include patterns that are important for understanding the bacterial population. By selecting the clusters in the glyph view the OTUs are highlighted in blue in all other views and their individual quality metric profiles may be examined in the RaQ view, as displayed in Figure 6. The top and middle views both include OTUs strongly involved in cluster structures, but differing in terms of involvement in correlation patterns. The bottom view, on the other hand, includes OTUs with weak cluster structures but strongly involved in the negative and overall correlation of the data set. As demonstrated, groups of OTUs involved in similar structures can quickly be identified in the glyph view and may, together with the structures found in the RaQ view, act as guidance for the analyst in subsequent exploration.

6.2 Filtering using rank

As all selected metrics represent structures that may be involved in driving differences between samples and separating them into groups, the analyst has, initially, no strong preference for one metric over another. At this point the rank axis (leftmost axis in the RaQ view in Figure 1) provides a useful tool as it offers an overall measure of interestingness taking all quality metrics into consideration. As such, it enables a straightforward method for selecting a possibly interesting OTU subset, to give a clearer view of the sample profiles in the bottom view. Initially, five ranks are available and, as visible in Figure 1, a relatively large number of variables are assigned to each rank. Thus, the rank axis alone may be too broad a tool for reduction. This may be approached by subdividing the ranks based on one of the metrics. However, in this particular case the analyst prefers to utilize the option of modifying the quality metrics' influence on rank, which often provides a different ranking. Due to the negative correlation between variance and skewness, as found in the RaQ view, these metrics may counteract each other in terms of ranking. The analyst speculates that variance may possibly be of more interest than skewness since the initial PCA-plot, where the axes represent the directions explaining the largest and next largest amounts of variation in the data, clearly displayed a separation between samples. She further examines the skewness patterns in the data set by filtering along the skewness axis in the RaQ view, selecting different subsets of OTUs to display in the bottom view based on their skewness, as shown in Figure 7. Through this she identifies that a majority of the samples, represented by polylines in the bottom view, have very low counts for OTUs with high skewness, as displayed in the top window in Figure 7. This indicates that the selected subset of OTUs represents bacteria only existing in a small number of samples. However, the bacterial counts of the samples in OTUs with low skewness, as displayed in the bottom window in Figure 7, seem to be higher. Hence, OTUs with low skewness may be more interesting to examine and the analyst therefore inverts the skewness metric using the button below the axis. This instantly results in re-computation of rank, now assigning high ranks to OTUs with low skewness. This relationship would not have been found as easily using an automated method to assign variable interestingness. Due to the visual representations of algorithmic analysis results, relationships among quality metrics that may influence the ranking can be identified, examined and dealt with as considered most appropriate by the analyst.

Moreover, the analyst is concerned that correlation may possibly be given too much significance in the ranking, since three correlation metrics are used. Hence she decides to exclude the positive and negative correlation metrics from the ranking algorithm using the check boxes below the axes. The ranking is again re-computed within a few milliseconds, providing a new measure of OTU interestingness. As a result of the modification and re-computation, the number of unique ranks increases. The analyst selects a subset of OTUs, using the filter sliders of the leftmost axis in the RaQ view, to retain the nineteen OTUs that are in the four highest ranks, as displayed in Figure 8. As visible from the 'X-like' structures between OTU pairs in the lower view, some of the OTUs appear to be strongly negatively correlated.

6.3 Examination of correlation patterns

For a bacterial population, correlations between OTUs may reveal patterns of symbiosis in terms of groups of OTUs that commonly co-habit, represented by a positive correlation, as well as OTUs that rarely co-habit, represented by negative correlation. The analyst has several options for exploring correlation between OTUs and chooses to start by examining groups of positively correlated OTUs using the variable merging window. In this particular case the analyst is not interested in merging the OTUs, as the individual OTUs and their involvement in driving processes are the main concern of the exploration. Nonetheless, the variable merging window extracts groups of OTUs that are likely to co-habit and is hence an efficient tool for a first examination of symbiosis patterns. In other analysis situations it would enable fast extraction of potentially redundant variables. Using a correlation threshold of 0.9, two groups are extracted (as displayed in Figure 9). The first group includes OTUs A12 and A39 (left view) and the second includes OTUs A132 and A180 (right view). The analyst notes that for the second group, most samples appear to have a count of zero, whereas the samples are more evenly distributed for the first group. This may indicate that the correlation of OTUs A132 and A180 is mainly due to a small number of atypical samples, rather than due to co-habitance of OTUs.

Figure 7: Filtering to select a subset of OTUs using the skewness metric (rightmost axis in the RaQ view). OTUs with high skewness (selected in the top window) appear to have low counts for most samples, as visible in the data view, whilst the bacterial counts within OTUs with low skewness (selected in the lower window) are generally higher.

Figure 8: A subset of the highest ranked OTUs, represented by polylines in the top parallel coordinates and by axes in the bottom parallel coordinates, after inverting the skewness metric and excluding positive and negative correlation from rank computation.
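The extraction of positively correlated groups with a threshold of 0.9, and the caveat about zero-dominated OTUs, can be sketched as follows. This is a minimal illustration and not the system's implementation: the grouping rule (connected components over pairs whose Pearson correlation exceeds the threshold) is an assumption, and the names `pearson`, `correlated_groups` and all counts are invented.

```python
import math
from typing import Dict, List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def correlated_groups(data: Dict[str, Sequence[float]], threshold: float = 0.9) -> List[List[str]]:
    """Group variables linked by pairwise correlations above the threshold."""
    names = list(data)
    parent = {n: n for n in names}

    def find(n: str) -> str:
        while parent[n] != n:
            parent[n] = parent[parent[n]]  # path compression
            n = parent[n]
        return n

    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if pearson(data[a], data[b]) > threshold:
                parent[find(a)] = find(b)

    groups: Dict[str, List[str]] = {}
    for n in names:
        groups.setdefault(find(n), []).append(n)
    return [sorted(g) for g in groups.values() if len(g) > 1]


# Invented counts: v1/v2 vary together across all samples, whereas
# v3/v4 are almost always zero and share a single atypical sample.
counts = {
    "v1": [5, 9, 14, 20, 26, 31],
    "v2": [11, 19, 27, 41, 50, 63],
    "v3": [0, 0, 1, 0, 0, 50],
    "v4": [0, 1, 0, 0, 0, 40],
    "v5": [30, 2, 17, 8, 25, 3],
}

groups = correlated_groups(counts, threshold=0.9)

# Effect of the one atypical sample on the zero-dominated pair.
r_all = pearson(counts["v3"], counts["v4"])
r_without_last = pearson(counts["v3"][:-1], counts["v4"][:-1])
```

Both pairs exceed the threshold, yet for `v3` and `v4` the high coefficient rests on the last sample alone: dropping it turns the correlation negative. A check of this kind is one way to test the analyst's suspicion that a correlation is driven by a few atypical samples.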

In the context of this use-case, not only positive correlations are of interest but also negative correlations, as they may indicate OTUs that rarely co-habit. The occurrence of negatively correlated OTUs is already indicated through the previously identified ‘X-like’ structures. The analyst decides to explore these patterns further using a scatter plot matrix, as displayed in Figure 10, as it gives a clear quantitative measure and direction of the associations in the data. In the scatter plot matrix, the Pearson correlation coefficients of OTU pairs are represented by coloured cells in the top left part of the matrix, red representing negative correlation and blue representing positive correlation. In Figure 10 two groups of OTUs are visible, with strong positive correlation within the groups and strong negative correlation between the groups. The smaller group includes four OTUs (A6, A20, A26 and A32), as marked with black dots in the corresponding diagonal cells in Figure 10. The analyst checks the IDs of the OTUs through tool-tips and makes a note to examine them further using various statistical methods. As the groups seem to be part of multivariate correlation patterns, the analyst switches the lower plot back to parallel coordinates to examine this further.
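The colour coding of correlation cells described above can be sketched as a mapping from the Pearson coefficient to a colour. The text only states that red denotes negative and blue positive correlation. The diverging white-to-colour scale used here, the function name `cell_colour` and the example values are assumptions made for illustration.

```python
import math
from typing import Dict, Sequence, Tuple


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def cell_colour(r: float) -> Tuple[int, int, int]:
    """Map r in [-1, 1] to RGB: white at 0, blue for r > 0, red for r < 0."""
    strength = min(abs(r), 1.0)
    fade = round(255 * (1 - strength))
    return (fade, fade, 255) if r >= 0 else (255, fade, fade)


# Invented values for four hypothetical variables.
variables = {
    "x": [1, 2, 3, 4],
    "y": [2, 4, 6, 8],
    "z": [8, 6, 4, 2],
    "w": [3, 1, 4, 2],
}

# One coloured cell per variable pair, as in one triangle of the matrix.
names = list(variables)
colours: Dict[Tuple[str, str], Tuple[int, int, int]] = {
    (a, b): cell_colour(pearson(variables[a], variables[b]))
    for i, a in enumerate(names)
    for b in names[i + 1:]
}
```

Here the pair `("x", "y")` yields a fully blue cell, `("x", "z")` a fully red cell and `("x", "w")` a white cell. Two blocks of mutually blue cells separated by red cells would correspond to the two groups visible in Figure 10.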

6.4 Examination of sample groups

In the parallel coordinates, displayed in the bottom view in Figure 8, the analyst notices that there appears to be some separation between two groups of samples across the axes. This is investigated further by manually selecting, in the lower parallel coordinates, the polylines representing the group of samples linking high values of A32 to low values of A35, as this group is clearly concentrated and easy to select. The selected items are highlighted in black in the lower view, as displayed in the top part of Figure 11. Once all are selected, it is visible that two different profiles emerge across several axes. For example, the highlighted sample group is generally high in the previously identified group of A6, A20, A26 and A32, whereas the remaining samples are not.

Figure 9: The variable merging window displaying two groups of OTUs with correlation above 0.9. The left window displays a group including OTUs A12 and A39 and the right window displays a group including OTUs A132 and A180.

Figure 10: The nineteen highest ranked OTUs displayed using a scatter plot matrix where positive and negative correlations are represented by blue and red cells respectively.

Figure 11: Through visible correlations a sample group is selected and highlighted in black to explore consistent sample profiles across OTUs. While re-introducing OTUs some profile differences remain visible.

The analyst is interested in exploring this difference further, and starts to lower the filter on rank to re-introduce more OTUs. It is visible in the middle and bottom parts of Figure 11 that some profile differences remain, even as many additional OTUs are re-introduced. The labels of the selected items are checked through a mouse-over tool-tip. It is found that the highlighted set of rows only includes two of the five sampled sites of the mouth, sites A and C. Domain specialists confirmed that these two sites are different in nature to the other three sampled. The parallel coordinates in Figure 11 indicate that this site difference may also support different population profiles. Analysis conducted independently of this process, using statistical and multivariate techniques, supports hypotheses that differences may exist between the populations in these sites [38]. Through the various features presented in this paper, the analyst then continues to select different OTU subsets for examining relationships between the groups of samples and for establishing groups of OTUs that may be involved in separating the sample groups and hence driving differences between the two sample sites.
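The selection step described above (brushing samples that are high on one OTU and low on another, then comparing profiles and checking site labels) can be approximated outside the interactive environment. This is a sketch under stated assumptions: the thresholds, the names `brush` and `mean_profile`, the OTU names and all counts are hypothetical, and in the system the selection is made by hand in the parallel coordinates rather than by fixed cut-offs.

```python
from statistics import mean
from typing import Dict, List

# Invented samples: a site label plus counts for three hypothetical OTUs.
samples: List[Dict] = [
    {"site": "A", "counts": {"otu_x": 40, "otu_y": 2, "otu_z": 35}},
    {"site": "C", "counts": {"otu_x": 36, "otu_y": 4, "otu_z": 29}},
    {"site": "A", "counts": {"otu_x": 44, "otu_y": 1, "otu_z": 38}},
    {"site": "B", "counts": {"otu_x": 3, "otu_y": 30, "otu_z": 5}},
    {"site": "D", "counts": {"otu_x": 6, "otu_y": 27, "otu_z": 4}},
    {"site": "E", "counts": {"otu_x": 2, "otu_y": 33, "otu_z": 7}},
]


def brush(items: List[Dict], high_on: str, low_on: str,
          high_min: float, low_max: float) -> List[Dict]:
    """Select samples that are high on one variable and low on another."""
    return [
        s for s in items
        if s["counts"][high_on] >= high_min and s["counts"][low_on] <= low_max
    ]


def mean_profile(group: List[Dict]) -> Dict[str, float]:
    """Mean count per OTU over a group of samples."""
    otus = group[0]["counts"].keys()
    return {o: mean(s["counts"][o] for s in group) for o in otus}


selected = brush(samples, high_on="otu_x", low_on="otu_y", high_min=30, low_max=10)
rest = [s for s in samples if s not in selected]

selected_sites = {s["site"] for s in selected}
selected_profile = mean_profile(selected)
rest_profile = mean_profile(rest)
```

In this invented data the brushed samples all come from sites A and C, and their mean count for `otu_z` is clearly higher than for the remaining samples. This mirrors the kind of profile difference the analyst observes visually, which would then be followed up with confirmatory analysis.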

6.5 Summary of use-case

The presented techniques have given the analyst methods for interactively exploring the sample profiles across groups of OTUs, which is normally not as easily and quickly done. The analyst’s workflow for analysing this data would commonly include statistical tests, multivariate approaches such as PCA, and network visualization methods such as the graph software Cytoscape [48]. Furthermore, the QIIME pipeline [49], which is used to process the sample data, provides a range of unlinked visual outputs such as pie-charts, heatmaps, trees and networks. These methods provide a wide range of analysis, but do not enable interactive exploration and examination of various OTU subsets, where the analysis route itself is driven by insights gained during analysis, to the same extent as is possible using the system presented in this paper. As an initial analysis, the exploration described in this use-case has acted as a means for identifying interesting structures and suggesting the focus of subsequent analysis. Moreover, it provides an illustration of differences in sampled sites for further study, communication and discussion with the analyst’s team. In the full course of the exploration, ideas around other possible structures in the data have been noted for further investigation. In general, the exploration could continue in a similar fashion, combining different quality metrics to obtain new OTU ranks, and selecting and deselecting OTUs that appear interesting or uninteresting. Subsets of potentially interesting OTUs can be extracted and explored using other visualization methods as well, and interesting relationships and patterns within the high dimensional data set can be identified for further exploration, experimentation and discussion with the analyst’s team.

7 Discussion and Conclusions

This paper presents a generic system enabling exploration of high dimensional data sets and interactive dimensionality reduction. The reduction is guided by visual representations combined with quality metrics and a quality-based variable ranking. A main advantage of the techniques presented in the system lies in their ability to provide an easily interpreted overview of the structures and relationships between variables in the high dimensional data set. This overview facilitates the understanding and identification of possibly interesting structures, and provides guidance when exploring the data. Furthermore, the combined visual and algorithmic techniques provide various features for interactive and exploratory dimensionality reduction, which is controlled by the user and guided by variable ranks and visual overview.

The techniques have been demonstrated in a use-case, as an example of the potential and usefulness of this kind of system for guided exploration of high dimensional data sets. The use-case also provided examples of some of the analysis routes that might be used in practice. It has been illustrated how exploratory work might generate hypotheses around bacterial population differences that can be supported by further analysis using confirmatory analysis techniques [38]. The presented techniques, and analysis routes similar to the ones described, can also be useful within other domains dealing with high dimensional data sets. In the use-case example, the benefit of the system lies in showing the differences in bacterial populations visually within various data subsets, supporting interactive exploration by analysts and assisting identification of potentially interesting structures, which can guide subsequent analysis. It has also been demonstrated how the analyst, facilitated by the combination of visual and algorithmic guidance, is able to make informed decisions within an efficient dimensionality reduction process. Furthermore, the analysis carried out enabled the analyst to explore multivariate patterns in the data, which is normally not as easily done, and through it gain insights. It also provided the analyst’s team with new ways of thinking about the data and, through this, generating ideas about other possible structures and metrics of interest.

A potential issue with the presented system is related to its flexibility and complexity, in relation to its ease of use. For straightforward analysis tasks, an intuitive interface with a limited set of analysis options and functionality may often be most useful. However, as tasks grow more complex, systems are required to provide more complex analysis options, often with the trade-off of less intuitive interfaces. The interface of the presented system is not fully intuitive, in the sense that some introduction is required for a new user to understand how to interpret it and make use of its functionality. There is also a potential risk of confusion regarding which aspects of the data set are displayed in which view, especially when parallel coordinates are used in the bottom view, since this is also the representation used in the RaQ view. This has, however, not been an issue with the end users that have been introduced to the techniques so far, who include a small group of microbiologists and informaticians. A longitudinal user study with domain experts will be the subject of future work and will provide further information on the utility and usability of the techniques. Another area of future work is the application of the techniques to different domains, to further establish their generality and usefulness.

The presented techniques are in many ways customizable and, hence, suitable for a range of domains. Firstly, the overall approach of using quality metrics to extract the potentially most interesting variables, and of combining quality metric analysis and interactive visualization to provide flexible and user-controlled dimensionality reduction, could be applied to almost any type of data, provided the quality metrics used are appropriate for the corresponding data. Secondly, the system itself, as currently implemented, is flexible in allowing pre-calculated metrics to be included along with the metrics available in the system. Through the use of pre-calculated metrics, analysis can be adapted to focus on domain specific issues. Thirdly, the analyst is able to design and select relevant metrics and modify their influence on the overall measure of interestingness, and is able to make informed decisions based on identified data structures. More options for selection of and interaction with metrics may be desirable and are the subject of future work.

Acknowledgement

This work was supported by Unilever Discover Port Sunlight; the Visualization Programme coordinated by the Swedish Knowledge Foundation; and the Swedish Research Council in the Linnaeus Centre CADICS.

References

[1] Inselberg A. The plane with parallel coordinates. The Visual Computer 1985; 1(4): 69–91.

[2] Wegman EJ. Hyperdimensional data analysis using parallel coordinates. Journal of the American Statistical Association 1990; 85(411): 664–675.

[3] Becker RA and Cleveland WS. Brushing scatterplots. Technometrics 1987; 29(2): 127–142.

[4] Johansson S and Johansson J. Interactive dimensionality reduction through user-defined combinations of quality metrics. IEEE Transactions on Visualization and Computer Graphics 2009; 15(6): 993–1000.

[5] Guo D. Coordinating computational and visual approaches for interactive feature selection and multivariate clustering. Information Visualization 2003; 2(4): 232–246.

[6] Johansson Fernstad S, Johansson J, Adams S, Shaw J and Taylor D. Visual exploration of microbial populations. In: Gehlenborg N, Machiraju R, Moller T, editors. Proceedings of the 1st IEEE Symposium on Biological Data Visualization (BioVis); 2011 October 23-24; Providence RI, USA; 2011.

[7] Goldberg DE. Genetic algorithms in search, optimization and machine learning. 1st ed. Boston: Addison-Wesley Longman Publishing Co, 1989.

[8] Cox T. Introduction to multivariate analysis. 1st ed. Hodder Arnold Publication, 2005.

[9] Chen C. Top 10 unsolved information visualization problems. IEEE Computer Graphics and Applications 2005; 25(4): 12–16.

[10] Rao R and Card SK. The table lens: merging graphical and symbolic representations in an interactive focus + context visualization for tabular information. In: Adelson B, Dumais S, Olson J, editors. Proceedings of the SIGCHI conference on Human factors in computing systems: celebrating interdependence; 1994 April 24-28; Boston MA, USA; 1994; 318–322.

[11] Eick SG and Karr AF. Visual scalability. Journal of Computational and Graphical Statistics 2002; 11(1): 22–43.

[12] Keim DA. Designing pixel-oriented visualization techniques: theory and applications. IEEE Transactions on Visualization and Computer Graphics 2000; 6(1): 59–78.

[13] Yang J, Patro A, Huang S, Mehta N, Ward MO and Rundensteiner EA. Value and relation display for interactive exploration of high dimensional datasets. In: Ward M, Munzner T, editors. Proceedings of the 10th IEEE Symposium on Information Visualization; 2004 October 10-12; Austin, TX, USA; 2004; 73–80.

[14] Turkay C, Filzmoser P and Hauser H. Brushing Dimensions – A Dual Visual Analysis Model for High-Dimensional Data. IEEE Transactions on Visualization and Computer Graphics 2011; 17(12): 2591–2599.

[15] Barlowe S, Zhang T, Liu Y, Yang J and Jacobs D. Multivariate visual explanation for high dimensional datasets. In: Ebert D, Ertl T, editors. Proceedings of IEEE Symposium on Visual Analytics Science and Technology; 2008 October 21-23; Columbus, OH, USA; 2008; 147–154.

[16] Nam EJ, Han Y, Mueller K, Zelenyuk A and Imre D. ClusterSculptor: a visual analytics tool for high-dimensional data. In: Ribarsky W, Dill J, editors. Proceedings of IEEE Symposium on Visual Analytics Science and Technology; 2007 October 30 - November 1; Sacramento, CA, USA; 2007; 75–82.

[17] Kohonen T. The self-organizing map. Neurocomputing 1998; 21(1–3): 1–6.

[18] Friedman JH and Tukey JW. A projection pursuit algorithm for exploratory data analysis. IEEE Transactions on Computers 1974; 23(9): 881–890.

[19] Koren Y and Carmel L. Robust linear dimensionality reduction. IEEE Transactions on Visualization and Computer Graphics 2004; 10(4): 459–470.

[20] Engel D, Rosenbaum R, Hamann B and Hagen H. Structural decomposition trees. Computer Graphics Forum 2011; 30(3): 921–930.

[21] Kandogan E. Visualizing multi-dimensional clusters, trends, and outliers using star coordinates. In: Lee D, Schkolnick M, Provost M, Srikant R, editors. Proceedings of the 7th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining; 2001 August 26-29; San Francisco, CA, USA; 2001; 107–116.

[22] Asimov D. The grand tour: a tool for viewing multidimensional data. SIAM Journal on Scientific and Statistical Computing 1985; 6(1): 128–143.

[23] Williams M and Munzner T. Steerable, progressive multidimensional scaling. In: Ward M, Munzner T, editors. Proceedings of the 10th IEEE Symposium on Information Visualization; 2004 October 10-12; Austin, TX, USA; 2004; 57–64.

[24] Ivosev G, Burton L and Bonner R. Dimensionality reduction and visualization in Principal Component Analysis. Analytical Chemistry 2008; 80(13): 4933–4944.

[25] Jeong DH, Ziemkiewicz C, Fisher B, Ribarsky W and Chang R. iPCA: an interactive system for PCA-based visual analytics. Computer Graphics Forum 2009; 28(3): 767–774.

[26] Seo J and Shneiderman B. A Rank-by-Feature framework for unsupervised multidimensional data exploration using low dimensional projections. In: Ward M, Munzner T, editors. Proceedings of the 10th IEEE Symposium on Information Visualization; 2004 October 10-12; Austin, TX, USA; 2004; 65–72.

[27] Artero AO, de Olivera MCF and Levkowitz H. Enhanced high dimensional data visualization through dimension reduction and attribute arrangement. In: Banissi E, Burkhard RA, Ursyn A, et al, editors. Proceedings of the 10th international conference on Information Visualization; 2006 July 5-7; London, UK; 2006; 707–712.

[28] Sips M, Neubert B, Lewis JP and Hanrahan P. Selecting good views of high-dimensional data using class consistency. Computer Graphics Forum (Proc. EuroVis 2009) 2009; 28(3): 831–838.

[29] Tatu A, Albuquerque G, Eisemann M, Bak P, Theisel H, Magnor M and Keim D. Automated analytical methods to support visual exploration of high-dimensional data. IEEE Transactions on Visualization and Computer Graphics 2011; 17(5): 584–597.

[30] Albuquerque G, Eisemann M, Lehmann DJ, Theisel H and Magnor M. Improving the visual analysis of high-dimensional datasets using quality measures. In: MacEachren A, Miksch S, editors. Proceedings of IEEE Symposium on Visual Analytics Science and Technology; 2010 October 25-26; Salt Lake City, UT, USA; 2010; 19–26.

[31] Ferdosi BJ and Roerdink JB. Visualizing high-dimensional structures by dimension ordering and filtering using subspace analysis. Computer Graphics Forum 2011; 30(3): 1121–1130.

[32] Yang J, Ward MO and Huang S. Visual hierarchical dimension reduction for exploration of high dimensional datasets. In: Bonneau GP, Hahmann S, Hansen CD, editors. Proceedings of Eurographics/IEEE TCVG Symposium on Visualization; 2003 May 26-28; Grenoble, France; 2003; 19–28.

[33] Yang J, Peng W, Ward MO and Rundensteiner EA. Interactive hierarchical dimension ordering, spacing and filtering for exploration of high dimensional datasets. In: Munzner T, North S, editors. Proceedings of IEEE Symposium on Information Visualization; 2003 October 19-21; Seattle, WA, USA; 2003; 105–112.

[34] Schreck T, von Landesberger T and Bremm S. Techniques for precision-based visual analysis of projected data. In: Park J, Hao MC, Wong PC, Chen C, editors. Proceedings of IS&T/SPIE Conference on Visualization and Data Analysis; 2010 January 18-21; San Jose, CA, USA; 2010.

[35] Ingram S, Munzner T, Irvine V, Tory M, Bergner S and Moller T. DimStiller: workflows for dimensional analysis and reduction. In: MacEachren A, Miksch S, editors. Proceedings of IEEE Symposium on Visual Analytics Science and Technology; 2010 October 25-26; Salt Lake City, UT, USA; 2010; 3–10.

[36] Srinivas N and Deb K. Multiobjective optimization using nondominated sorting in genetic algorithms. Evolutionary Computation 1994; 2(3): 221–248.

[37] Fonseca CM and Fleming PJ. Genetic algorithms for multiobjective optimization: formulation, discussion and generalisation. In: Forrest S, editor. Proceedings of the 5th International Conference on Genetic Algorithms; 1993 July 17-21; Urbana-Champaign, IL, USA; 1993; 416–423.

[38] Adams SE, Lloyd AM, Brading MG, Cox TF, Taylor D and Quince C. Measurement of bacterial diversity using 454-sequencing and Oral Microarray (HOMIM). Presented at International Association for Dental Research General Session; 2010 July; Barcelona, Spain; 2010.

[39] Asuncion A and Newman DJ. UCI Machine Learning Repository. http://www.ics.uci.edu/~mlearn/MLRepository.html (2007, accessed February 2010).

[40] Nagesh H, Goil S and Choudhary A. Adaptive grids for clustering massive data sets. In: Grossman R, Kumar V, editors. Proceedings of First SIAM International Conference on Data Mining; 2001 April 5-7; Chicago, IL, USA; 2001.

[41] Agrawal R, Gehrke J, Gunopulos D and Raghavan P. Automatic subspace clustering of high dimensional data for data mining applications. In: Tiwary A, Franklin M, editors. Proceedings of ACM SIGMOD International Conference on Management of Data; 1998 June 1-4; Seattle, WA, USA; 1998; 94–105.

[42] Agrawal R and Srikant R. Fast algorithms for mining association rules. In: Bocca JB, Jarke M, Zaniolo C, editors. Proceedings of the 20th International Conference on Very Large Data Bases; 1994 September 12-15; Santiago de Chile, Chile; 1994; 487–499.

[43] Rodgers JL and Nicewander WA. Thirteen ways to look at the correlation coefficient. The American Statistician 1988; 42(1): 59–66.

[44] Myers JL and Well AD. Research design and statistical analysis. 3rd ed. New York: HarperCollins Publishers Inc, 1991.

[45] Wackerly DD, Mendenhall W and Scheaffer RL. Mathematical statistics with applications. 7th ed. Southbank: Thomson Learning, Inc, 2008.

[46] Kendall M, Stuart A and Ord JK. Kendall’s advanced theory of statistics, vol. 1, distribution theory. 5th ed. London: Charles Griffin & Company Limited, 1987.

[47] Draper NR and Smith H. Applied regression analysis. 2nd ed. New York: John Wiley & Sons, Inc, 1981.

[48] Shannon P, Markiel A, Ozier O, Baliga S, Wand JT, Ramage D, Amin N, Schwikowski B and Ideker T. Cytoscape: A software environment for integrated models of biomolecular interaction networks. Genome Research 2003; 13(11): 2498–2504.

[49] Caporaso JG, Kuczynski J, Stombaugh J, Bittinger K, Bushman FD, Costello EK, Fierer N, Peña AG, Goodrich JK, Gordon JI, Huttley GA, Kelley ST, Knights D, Koenig JE, Ley RE, Lozupone CA, McDonald D, Muegge BD, Pirrung M, Reeder J, Sevinsky JR, Turnbaugh PJ, Walters WA, Widmann J, Yatsunenko T, Zaneveld J and Knight R. QIIME allows analysis of high-throughput community sequencing data. Nature Methods 2010; 7(5): 335–336.
