Clustrophile: A Tool for Visual Clustering Analysis

Çağatay Demiralp
IBM Research

[email protected]

ABSTRACT
While clustering is one of the most popular methods for data mining, analysts lack adequate tools for quick, iterative clustering analysis, which is essential for hypothesis generation and data reasoning. We introduce Clustrophile, an interactive tool for iteratively computing discrete and continuous data clusters, rapidly exploring different choices of clustering parameters, and reasoning about clustering instances in relation to data dimensions. Clustrophile combines three basic visualizations – a table of raw datasets, a scatter plot of planar projections, and a matrix diagram (heatmap) of discrete clusterings – through interaction and intermediate visual encoding. Clustrophile also contributes two spatial interaction techniques, forward projection and backward projection, and a visualization method, prolines, for reasoning about two-dimensional projections obtained through dimensionality reductions.

Keywords
Clustering, projection, dimensionality reduction, visual analysis, experiment, Tukey, out-of-sample extension, forward projection, backward projection, prolines, sampling, scalable visualization, interactive analytics.

1. INTRODUCTION
Clustering is a basic method in data mining. By automatically dividing data into subsets based on similarity, clustering algorithms provide a simple yet powerful means to explore structures and variations in data. What makes clustering attractive is its unsupervised (automated) nature, which reduces analysis time. Nonetheless, analysts need to make several decisions in a clustering analysis that determine what constitutes a cluster, including which clustering algorithm and similarity measure to use, which samples and features (dimensions) to include, and what granularity (e.g., number of clusters) to seek. Therefore, quickly exploring the effects of alternative decisions is important both in reasoning about the data and in making these choices.

Although standard tools such as R or Matlab are extensive andcomputationally powerful, they are not designed to support such

interactive, iterative analysis. It is often cumbersome, if not impossible, to run what-if scenarios with these tools. In response, we introduce Clustrophile, an interactive visual analysis tool that helps analysts perform iterative clustering analysis. Clustrophile couples three basic visualizations, a dynamic table listing of raw datasets, a scatter plot of planar projections, and a matrix diagram (heatmap) of discrete clusterings, using interaction and intermediate visual encoding. We consider dimensionality reduction as a form of continuous clustering that complements the discrete nature of standard clustering techniques. We also contribute two spatial interaction techniques, forward projection and backward projection, and a visualization method, prolines, for reasoning about two-dimensional projections computed using dimensionality reductions.

2. RELATED WORK
Clustrophile builds on earlier work on interactive systems supporting visual clustering analysis. The projection interaction and visualization techniques in Clustrophile are related to prior efforts on improving the user experience of scatter-plot visualizations of dimensionality reductions.

2.1 Visualizing Clusterings
Prior research applies visualization to improve user understanding of clustering results across domains. Using coordinated visualizations with drill-down/up capabilities is a typical approach in earlier interactive tools. The Hierarchical Clustering Explorer [37] is an early and comprehensive example of interactive visualization tools for exploring clusterings. It supports the exploration of hierarchical clusterings of gene expression datasets through dendrograms (hierarchical clustering trees) stacked up with heatmap visualizations.

Earlier work also proposes tools that make it possible to incorporate user feedback into clustering formation. Matchmaker [28] builds on techniques from [37] with the ability to modify clusterings by grouping data dimensions. ClusterSculptor [30] and Cluster Sculptor [8], two different tools, enable users to supervise clustering processes in various clustering methods. Schreck et al. [35] propose using user feedback to bootstrap the similarity evaluation in data space (trajectories, in this case) before applying the clustering algorithm.

Prior work has also introduced techniques for comparing clustering results of different datasets or different algorithms [10, 29, 34, 37]. DICON [10] encodes statistical properties of clustering instances as icons and embeds them in the plane based on similarity using multidimensional scaling. Pilhofer et al. [34] propose a method for reordering categorical variables to align with each other and thus augment the visual comparison of clusterings. The recent tool XCluSim [29] supports comparison of several clustering results of gene expression datasets using an approach similar to that of the Hierarchical Clustering Explorer.

Figure 1: Clustrophile is an interactive visual analysis tool for computing data clusters and iteratively exploring and reasoning about clustering instances in relation to data subsets and dimensions through what-if scenarios. To this end, Clustrophile combines three basic visualizations, a) a table of raw datasets, b) a scatter plot of planar projections, and c) a matrix diagram (heatmap) of discrete clusterings, using interaction and intermediate visual encoding. Clustrophile enables users to interactively d) change the number of clusters, quickly explore several e) projection and f) clustering algorithms and parameters, run g) statistical analysis, including hypothesis testing, and dynamically filter h) the observations and i) features to which visual analysis is applied.

Clustrophile is similar to earlier work in coordinating basic and auxiliary visualizations to explore clusterings. Clustrophile focuses on supporting iterative, interactive exploration of data with the ability to explore multiple choices of algorithmic parameters along with hypothesis testing through visualizations and interactions as well as formal statistical methods. Finally, Clustrophile is domain-agnostic and is intended to be a general tool for data scientists.

2.2 Making Sense with and of Dimensionality Reductions
Dimensionality reduction is a common method for analyzing and visualizing high-dimensional datasets across domains. Researchers in statistics and psychology pioneered the use of techniques that project multivariate data onto low-dimensional manifolds for visual analysis (e.g., [2, 16, 15, 25, 38, 40]). PRIM-9 (Picturing, Rotation, Isolation, and Masking — in up to 9 dimensions) [15] is an early visualization system supporting exploratory data analysis through projections. PRIM-9 enables the user to interactively rotate the multivariate data while continuously viewing a two-dimensional projection of the data. Motivated by user behavior in the PRIM-9 system, Friedman and Tukey [16] first propose a measure, the projection index, for quantifying the “usefulness” of a given projection plane (or line) and then an optimization method, projection pursuit, to find the most useful projection direction (i.e., the one with the highest projection index value). The proposed index considers projections that result in large spread with high local density to be useful (e.g., highly separated clusters). In an axiomatic approach that complements projection pursuit, Asimov introduces the grand tour, a method for viewing multidimensional data via orthogonal projections onto a sequence of two-dimensional planes [2]. Asimov considers a set of criteria such as density, continuity, and uniformity to select a sequence of projection planes from all possible projection planes and provides specific methods to devise such sequences. Note that the space of all possible two-dimensional planes through the origin is a Grassmannian manifold; Asimov's grand tours can be seen as geodesic curves with desired properties in this manifold.

Despite their wide use (and overuse), interpreting and reasoning about dimensionality reductions can often be difficult. Earlier work focuses on better conveying projection (reduction) errors, integrating user feedback into the projection process, and evaluating the effectiveness of various dimensionality reductions.


Figure 2: (Left) ANOVA test on calories between two selected clusters in a lifestyle dataset. (Right) Correlation coefficients for all pairs of features in a development-indicators dataset [31] for OECD member states. Correlation values are sorted based on their absolute value. The sign of correlation is encoded by color.

Low-dimensional projections are generally lossy representations of the data relations. Therefore, it is useful to convey both overall and per-point dimensionality-reduction errors to users when desired. Earlier research proposes techniques for visualizing projection errors using Voronoi diagrams [3, 26] and “correcting” them within a neighborhood of the probed point [11, 39]. Stahnke et al. [39] suggest a set of interactive methods for interpreting the meaning and quality of projections visualized as scatter plots. These methods make it possible to see approximation errors, reason about the positioning of elements, compare elements to each other, and visualize the extrapolated density of individual dimensions in the projection space.

In certain cases, expert users have prior knowledge of how the projections should look. To enable user input to guide dimensionality reduction, earlier research has proposed several techniques [9, 13, 17, 20, 21, 44]. Enabling users to adjust the projection positions or the weights of data dimensions and distances is a common approach for incorporating user feedback into projection computations. For example, X/GGvis [9] supports changing the weights of dissimilarities input to the MDS stress function along with the coordinates (configuration) of the embedded points to guide the projection process. Similarly, iPCA [20] enables users to interactively modify the weights of data dimensions in computing projections. Endert et al. [14] apply similar ideas to an additional set of dimensionality-reduction methods while incorporating user feedback through spatial interactions. The spatial interactions we introduce here, forward projection and backward projection, are developed for dynamically reasoning about dimensionality-reduction methods and the underlying data, not for incorporating user feedback.

Prior research also evaluates dimensionality-reduction techniques [7, 27] as well as visualization methods for representing dimensionally-reduced data [36]. Sedlmair et al. find that two-dimensional scatter plots outperform scatter-plot matrices and three-dimensional scatter plots in the task of separating clusters [36]. Lewis et al. [27] report that experts are consistent in evaluating the quality of dimensionality reductions obtained by different methods, but novices are highly inconsistent in such evaluations. A later study finds, however, that experts with limited experience in dimensionality reduction also lack a clear understanding of dimensionality-reduction results [7].

Forward projection, backward projection, and prolines are new techniques and complement earlier work on improving interactive reasoning with dimensionality reductions, particularly by facilitating dynamically asking and answering hypothetical questions about both the underlying data and the dimensionality reduction.

3. THE DESIGN OF CLUSTROPHILE
We developed Clustrophile for data scientists, using their regular feedback at each stage of the development process. We discuss below the design of Clustrophile, stressing the rationale behind our choices, its basic visualizations, and its interactions.

3.1 Design Criteria
In our collaboration with data scientists, we identified four high-level criteria to consider in designing Clustrophile.

Show Variation Within Clusters Clustering is useful for grouping data points based on similarity, enabling users to discover salient structures in data while reducing the cognitive load. However, differences among data points within clusters are lost. Clustrophile has coordinated views—Table, Projection, and Clustering—that facilitate exploration of differences among data points at different levels of granularity. The Projection view holds a scatter-plot visualization of the data reduced to two dimensions through dimensionality reduction, thus providing a continuous spatial view of similarities among high-dimensional data points.

Allow Quick Iteration over Parameters In clustering analysis, users typically need to make several decisions, including which clustering method and distance (dissimilarity) measure to use, how many clusters to create, which features and data subsets to consider, and the like. After an initial clustering, users would like to be able to iterate on and refine these decisions. Clustrophile enables users to interactively update and apply clustering and projection algorithms and parameters at any point in their analysis.

Facilitate Reasoning about Clustering Instances Users often would like to know what features (dimensions) of the data points are important in determining a given clustering instance, or how different choices of features or distance measures might affect the clustering. Clustrophile allows users to add and remove features interactively and to change the distance measures used in clustering and projections.

Promote Multiscale Exploration The ability to interactively drill down into data is crucial for exploration and for effective use of visual encoding variables, particularly in two-dimensional space. Clustrophile supports dynamic filtering of data across the views. In addition, Clustrophile makes it possible to apply clustering and projection methods to filtered subsets of data, providing a semantic zoom-in and zoom-out capability.

Figure 3: Forward projection enables users to a) select any data point x in the projection, b) interactively change the feature (dimension) values of the point, and c) observe how that changes the current projected location y of the point. For PCA, the positional change vector ∆y can be derived directly by projecting the data change vector ∆x onto the first two principal components, e0 and e1.


3.2 Views
Clustrophile has five coordinated views: Table, Projection, Clustering, Statistics, and Playground.

Table The Table view (Figure 1a) contains a dynamic table visualization of the data. Tables in Clustrophile can be searched, filtered, sorted, and exported as needed (Figure 1h,i). Upon loading, data first appears as a table listing in this view, giving users a direct and familiar way to access the records. Clustrophile supports input files in the Comma Separated Values (CSV) format. Clustrophile also enables exporting the current table in CSV, Portable Document Format (PDF), or Excel file formats. Alternatively, users can simply copy the current table to the clipboard to paste it into other applications.

Clustering The Clustering view (Figure 1c) contains a heatmap (matrix diagram) visualization of the current clustering. The columns of the heatmap correspond to the clusters and are ordered from left to right based on size (i.e., the first column represents the largest cluster in the current clustering). The rows of the heatmap represent the features, and the color of each cell encodes the normalized average feature value for the corresponding cluster. Clustrophile supports dynamic computation of clusterings using the k-means and agglomerative clustering algorithms with several choices of similarity measures and, in the case of agglomerative clustering, linkage options (Figure 1f). These choices can be changed easily and the clustering recomputed using the model panel above the Clustering view. Similarly, users can dynamically change the number of clusters using a sliding bar (Figure 1d).
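The following is a minimal sketch of how the Clustering view's underlying computation can be expressed with scikit-learn, the library the analytics server builds on (Section 3.8); the function name, parameter defaults, and the pandas DataFrame df are illustrative assumptions, not Clustrophile's actual API.

    # Sketch of the Clustering view's backend: cluster the rows of a numeric DataFrame
    # and build the heatmap matrix (one column per cluster, largest first; one row per
    # feature; cell = normalized average feature value within the cluster).
    import numpy as np
    import pandas as pd
    from sklearn.cluster import KMeans, AgglomerativeClustering

    def cluster_heatmap(df: pd.DataFrame, k: int = 4, method: str = "kmeans"):
        X = df.to_numpy(dtype=float)
        if method == "kmeans":
            labels = KMeans(n_clusters=k, n_init=10).fit_predict(X)
        else:
            labels = AgglomerativeClustering(n_clusters=k, linkage="ward").fit_predict(X)

        # Normalize each feature to [0, 1] so heatmap cells are comparable across rows.
        mins, maxs = X.min(axis=0), X.max(axis=0)
        Xn = (X - mins) / np.where(maxs > mins, maxs - mins, 1.0)

        order = np.argsort([-np.sum(labels == c) for c in range(k)])   # largest cluster first
        heat = np.column_stack([Xn[labels == c].mean(axis=0) for c in order])
        heatmap = pd.DataFrame(heat, index=df.columns,
                               columns=[f"cluster {i}" for i in range(k)])
        return labels, heatmap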

Projection Clustering algorithms divide data into discrete groups based on similarity, but in doing so they suppress different degrees of variation within and between groups. Clustrophile provides two-dimensional projections obtained using dimensionality reduction to complement the discrete clusterings. The Projection view (Figure 1b) contains a scatter-plot visualization of the current data reduced to two dimensions using one of six dimensionality-reduction methods: Principal Component Analysis (PCA), Classical Multidimensional Scaling (CMDS), non-metric Multidimensional Scaling (MDS), Isomap, Locally Linear Embedding (LLE), and t-distributed Stochastic Neighbor Embedding (t-SNE) [42]. As with clustering, users can select among several similarity measures with which to run the projection algorithms (Figure 1e). Each circle in the scatter plot represents a data point, and its color encodes the point's membership in the currently active clustering.
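As an illustration of the Projection view's backend, all six listed methods are available in scikit-learn. The sketch below assumes default parameters and uses metric MDS as a stand-in for classical MDS, so it approximates rather than reproduces Clustrophile's exact settings.

    # Sketch of computing the Projection view's 2D embeddings with scikit-learn.
    from sklearn.decomposition import PCA
    from sklearn.manifold import MDS, Isomap, LocallyLinearEmbedding, TSNE

    def project_2d(X, method: str = "PCA"):
        reducers = {
            "PCA": PCA(n_components=2),
            "CMDS": MDS(n_components=2, metric=True),    # metric MDS as a stand-in for classical MDS
            "MDS": MDS(n_components=2, metric=False),    # non-metric MDS
            "Isomap": Isomap(n_components=2),
            "LLE": LocallyLinearEmbedding(n_components=2),
            "t-SNE": TSNE(n_components=2),
        }
        return reducers[method].fit_transform(X)         # (n_samples, 2) coordinates for the scatter plot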

Figure 4: Forward projection in action. Forward projection enables the user to explore if and how much StudentSkills explains the difference between Portugal and Korea in a projection of OECD member countries based on a set of development indices. The user a) dynamically changes the value of the StudentSkills dimension for Portugal and b) observes the dynamically updated projection. In this case, the user discovers that StudentSkills is the most important feature explaining the difference between Portugal and Korea.

Statistics This view displays the results of the most recent statistical computation. Currently, Clustrophile provides standard point statistics along with hypothesis-testing functionality using ANOVA and pairwise correlation computations between features (Figure 2).
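A rough sketch of these two computations with SciPy and pandas is shown below; the function names, the DataFrame df, and the cluster-label array are illustrative assumptions rather than Clustrophile's actual interface.

    # Sketch of the Statistics view's computations: ANOVA between two selected clusters
    # (cf. Figure 2, left) and sorted pairwise feature correlations (cf. Figure 2, right).
    import numpy as np
    import pandas as pd
    from scipy.stats import f_oneway

    def anova_between_clusters(df, labels, feature, cluster_a, cluster_b):
        a = df.loc[labels == cluster_a, feature]
        b = df.loc[labels == cluster_b, feature]
        return f_oneway(a, b)                 # (F statistic, p-value)

    def pairwise_correlations(df):
        corr = df.corr()
        mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)   # each pair once, no diagonal
        pairs = corr.where(mask).stack()                       # Series indexed by (feature_i, feature_j)
        # Sort by absolute value but keep the sign, so the UI can color-encode it.
        return pairs.reindex(pairs.abs().sort_values(ascending=False).index)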

Playground Clustrophile enables the exploration of two-dimensional projections of the data through forward and backward projections. In the Playground view, users can create a copy of an existing data point and interactively modify its feature values to see how its projected position changes. Conversely, users can change the projected position and see what feature values would satisfy this change.

3.3 Interactions
Brushing and Linking We use brushing and linking to select data across the views of Clustrophile and to coordinate them. This is the main mechanism that lets users observe the effects of one operation across the views.

Dynamic Filtering In addition to brushing, Clustrophile provides two basic mechanisms for dynamically filtering data (Figure 1h). First, its search functionality lets users filter the data using arbitrary keyword search on feature names and values. Second, users can filter the table using expressions in a mini-language. For example, typing age > 40 & weight < 180 dynamically selects data points across views where the fields age and weight satisfy the entered constraint.
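Clustrophile parses such expressions with a PEG.js grammar (Section 3.8); as a rough analogue only, the same kind of constraint can be evaluated on a pandas DataFrame, as sketched below with illustrative column names.

    # Illustrative analogue of the filter mini-language; pandas' query syntax uses
    # "and" where the mini-language example above uses "&".
    import pandas as pd

    df = pd.DataFrame({"age": [35, 52, 44], "weight": [150, 200, 170]})
    filtered = df.query("age > 40 and weight < 180")   # rows satisfying both constraints
    print(filtered)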

Adding and Removing Features Understanding the relevance of data dimensions or features to the analysis is an important yet challenging goal in data analysis. Clustrophile enables users to add and remove features (dimensions) and explore the resulting changes in clustering and projection results (Figure 1i).

3.4 Interacting with Dimensionality Reductions
Dimensionality reduction is the process of reducing the number of dimensions in a high-dimensional dataset in a way that maximally preserves inter-datapoint relations of some form, as measured in the original high-dimensional space.

Figure 5: Prolines visualize paths of forward projections. For a given feature xi of a data point x, we construct a proline by connecting the forward projections of the points regularly sampled from a range of x values, where all features are fixed but xi changes from xi − kσi to xi + kσi. σi is the standard deviation of the ith feature in the dataset, and k and c are constants controlling respectively the extent of the range and the step size with which we iterate over it.

As with clustering, most dimensionality-reduction techniques are unsupervised and learn salient structures explaining the data. Unlike clustering, however, dimensionality-reduction methods discover continuous representations of these structures.

Despite its ubiquitous use, dimensionality reduction can be difficult to interpret, particularly in relation to the original data dimensions. “What do the axes mean?” is probably users' most frequent question when looking at scatter plots in which points (nodes) correspond to dimensionally-reduced data. Clustrophile integrates forward projection, backward projection, and prolines to facilitate direct, dynamic examination of dimensionality reductions represented as scatter plots.

There are many dimensionality-reduction methods [42], and developing effective and scalable dimensionality-reduction algorithms is an active research area. Here we focus on principal component analysis (PCA), one of the most frequently used dimensionality-reduction techniques; note that the discussion applies as well to other linear dimensionality-reduction methods. PCA computes (learns) a linear orthogonal transformation (a high-dimensional rotation) of the empirically centered data into a new coordinate frame in which the axes represent maximal variability. The orthogonal axes of the new coordinate frame are called principal components. To reduce the number of dimensions to two, for example, we project the centered data matrix, whose rows correspond to data samples and columns to features (dimensions), onto the first two principal components, e0 and e1. Details of PCA, along with its many formulations and interpretations, can be found in standard textbooks on machine learning or data mining (e.g., [5, 18]).
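As a concrete reference for the discussion that follows, here is a minimal NumPy sketch of this two-dimensional PCA projection; the helper name pca_2d is ours.

    # Center the data and project it onto the first two principal components e0, e1
    # (the top right singular vectors of the centered data matrix).
    import numpy as np

    def pca_2d(X):
        Xc = X - X.mean(axis=0)                         # empirically center the data
        _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
        E = Vt[:2].T                                    # columns e0, e1 (d x 2)
        return Xc @ E, E                                # 2D coordinates Y = Xc E, plus E for reuse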

3.5 Forward Projection
Forward projection enables users to interactively change the feature (dimension) values of a data point x and observe how these hypothesized changes in the data modify its current projected location y (Figures 3, 4). This is useful because understanding the importance and sensitivity of features (dimensions) is a key goal in exploratory data analysis.

We compute forward projections using out-of-sample extension (or extrapolation) [42]. Out-of-sample extension is the process of projecting a new data point into an existing projection (e.g., a learned manifold model) using only the properties of the projection. It is conceptually equivalent to testing a trained machine-learning model with data that was not part of training.

Figure 6: Prolines for Portugal in a PCA projection of OECD member countries based on their values for a set of development indices. Prolines will be straight lines for linear dimensionality-reduction methods. In addition, the length of each path corresponds to the speed (sensitivity, variability) along the corresponding dimension. For example, StudentSkills is the most sensitive feature determining the projection in this case. Note that forward projection animates the speed of the change along prolines, giving the user an additional cue about the importance of the dimension in the projection.

In the case of PCA, we obtain the two-dimensional position change vector ∆y by projecting the data change vector ∆x onto the principal components: ∆y = ∆x E, where E = [e0 e1].
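A hedged sketch of this computation, reusing the matrix E (columns e0 and e1) from the PCA sketch above; the helper name is illustrative.

    # Out-of-sample forward projection for PCA: y' = y + ∆y with ∆y = ∆x E.
    import numpy as np

    def forward_project(y, x_old, x_new, E):
        dx = np.asarray(x_new, dtype=float) - np.asarray(x_old, dtype=float)  # hypothesized feature change
        dy = dx @ E                                                            # ∆y = ∆x E (E is d x 2)
        return y + dy                                                          # updated 2D position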

3.6 Prolines: Visualizing Forward Projections
It is desirable to see in advance what forward-projection paths look like for each feature. Users can then start inspecting the dimensions that look interesting or important.

Prolines visualize forward-projection paths based on a range of possible values for each feature and data point (Figures 5, 6). Let xi be the value of the ith feature for the data point x. We first compute the standard deviation σi of the feature in the dataset and devise a range I = [xi − kσi, xi + kσi]. We then iterate over the range with a step size of cσi, compute the forward projections as discussed above, and connect them as a path. The constants k and c control respectively the extent of the range and the step size with which we iterate over it.
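The proline construction above can be sketched as follows, again reusing E and the forward-projection relation ∆y = ∆x E; the default values chosen for k and c are illustrative assumptions.

    # Compute the proline of feature i for data point x, whose current projected position is y.
    import numpy as np

    def proline(x, y, i, E, sigma_i, k=2.0, c=0.25):
        steps = np.arange(-k * sigma_i, k * sigma_i + 1e-9, c * sigma_i)  # offsets over [-kσ_i, +kσ_i]
        dX = np.zeros((len(steps), len(x)))
        dX[:, i] = steps                       # vary only the ith feature
        return y + dX @ E                      # (num_steps, 2) polyline of forward projections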

Prolines will be straight lines for linear dimensionality-reduction methods (Figure 6), and therefore computing forward projections only for the extremum values of the range I is sufficient. Also note that in the case of PCA projections, prolines reduce to plotting the contributions of the feature to the principal components (its loadings) as a line vector.

3.7 Backward Projection
Backward projection as an interaction technique is a natural complement of forward projection. Consider the following scenario: a user looks at a projection and, seeing a cluster of points and a single point projected far from this group, asks what changes in the feature values of the outlier would bring it near the cluster. The user could play with different dimensions using forward projection to move the outlier's current projection near the cluster. It would be more natural, however, to move the point directly and observe the change (Figures 7, 8, 9).

Figure 7: Through backward projection, users can a) select a node in the projection that corresponds to a data point x, b) directly move the node in any direction, and c) dynamically observe what data changes ∆x would satisfy the hypothesized change ∆y in the projected position. In PCA projections, ∆x can be obtained by solving for it in the linear equation ∆x [e0 e1] = ∆y.

Figure 8: Unconstrained backward projection. A user, curious about the projection difference between Turkey and Greece, a) moves the proxy node for Turkey (gray square with dashed border) towards Greece. The feature values for Turkey are automatically updated to satisfy the new projected position as the node is moved. The user b) observes that, as Turkey gets closer to Greece, WorkingLongHours decreases (encoded with red) while EducationAttainment, StudentSkills, YearsInEducation, LifeExpectancy, SelfReportedHealth, and LifeSatisfaction increase (green). TimeDevotedToLeisure (not seen) stays constant (gray).

The formulation of backward projection is the same as that of forward projection: ∆y = ∆x E. In this case, however, ∆x is unknown and we need to solve the equation for it.

As formulated, the problem is underdetermined and, in general, there can be infinitely many data points (feature values) that project to the same planar position. Therefore, our implementation in Clustrophile supports both unconstrained and constrained backward projections. Users can introduce equality as well as inequality constraints (Figure 10).

In the case of unconstrained backward projection, we find ∆x by solving a regularized least-squares optimization problem:

    minimize over ∆x    ‖∆x‖²
    subject to          ∆x E = ∆y

Note that this is equivalent to setting ∆x = ∆y Eᵀ. In general, for linear projections we obtain the unconstrained backward projection directly.
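A minimal sketch of the unconstrained case, assuming E has orthonormal columns as it does for PCA:

    # Unconstrained backward projection: the minimum-norm ∆x with ∆x E = ∆y is ∆y Eᵀ.
    import numpy as np

    def backward_project(x, y, y_new, E):
        dy = np.asarray(y_new, dtype=float) - np.asarray(y, dtype=float)  # dragged change in 2D position
        dx = dy @ E.T                                                      # minimum-norm feature change
        return x + dx                                                      # feature values at the new position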

Figure 9: Constrained backward projection. A user explores the projection difference between Portugal and Korea, first fixing (i.e., setting equality constraints on) all dimensions but EducationAttainment, StudentSkills, YearsInEducation, and LifeSatisfaction, and then a) moving the proxy node for Portugal nearer to Korea. The user b) observes that LifeSatisfaction decreases while EducationAttainment, StudentSkills, and YearsInEducation increase.

Figure 10: Clustrophile interface for entering inequality constraints for backward projection. Users can enter bounded as well as left- and right-bounded interval constraints. The histogram shows the distribution of the feature (bmi, body mass index, in this case) for which the constraints are entered. The user can adjust the constraints interactively using the histogram brush or the slider.

As for constrained backward projection, we find ∆x by solving the following quadratic optimization problem:

    minimize over ∆x    ‖∆x E − ∆y‖²
    subject to          C∆x = d,  lb ≤ ∆x ≤ ub

where C is the design matrix of equality constraints, d is the constant vector of equalities, and lb and ub are the vectors of lower and upper bound constraints.
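Section 3.8 notes that the solver module uses CVXOPT for quadratic programming; the sketch below shows one way this QP could be set up with CVXOPT's qp solver. The reformulation details (stacking the bounds into G and h, adding a tiny ridge so the quadratic term is strictly positive definite) are our assumptions rather than Clustrophile's exact implementation.

    # Constrained backward projection as a QP: minimize ||dx E - dy||^2
    # subject to C dx = d and lb <= dx <= ub, in standard (1/2) z'Pz + q'z form.
    import numpy as np
    from cvxopt import matrix, solvers

    def constrained_backward_projection(E, dy, C, d, lb, ub, ridge=1e-8):
        n = E.shape[0]
        dy = np.asarray(dy, dtype=float).reshape(2, 1)
        P = matrix(2.0 * (E @ E.T) + ridge * np.eye(n))           # quadratic term, kept positive definite
        q = matrix(-2.0 * (E @ dy))                                # linear term
        G = matrix(np.vstack([np.eye(n), -np.eye(n)]))             # bounds lb <= dx <= ub as G dx <= h
        h = matrix(np.concatenate([np.asarray(ub, dtype=float),
                                   -np.asarray(lb, dtype=float)]).reshape(-1, 1))
        A = matrix(np.asarray(C, dtype=float))                     # equality constraints C dx = d
        b = matrix(np.asarray(d, dtype=float).reshape(-1, 1))
        solvers.options["show_progress"] = False
        sol = solvers.qp(P, q, G, h, A, b)
        return np.array(sol["x"]).ravel()                          # the feature change dx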

3.8 System Details
Clustrophile is a web application based on a client-server model (Figure 11). We implemented Clustrophile's web interface in JavaScript with the help of the D3 [6] and AngularJS [1] libraries. We generated the parser for the mini-language used to filter data with PEG.js [33]. Most of the analytical computations are performed on Clustrophile's Python-based analytics server, which has four modules: clustering, projection, statistics, and solver.


Figure 11: Clustrophile architecture. The web client (HTML, JavaScript, AngularJS, D3) exchanges JSON with a Python analytics server (NumPy, SciPy, scikit-learn) that exposes its clustering, projection, statistics, and solver modules through a REST API.

These modules are mainly wrappers, making heavy use of the SciPy [22], NumPy [43], and scikit-learn [32] Python libraries. The solver module uses CVXOPT [12] for quadratic programming.
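As a purely hypothetical illustration of the client-server split, a single analytics endpoint might look like the following; the paper specifies only that the server is Python-based and speaks JSON over REST, so the framework (Flask) and route name here are assumptions.

    # Hypothetical sketch of one analytics-server endpoint.
    from flask import Flask, request, jsonify
    import numpy as np
    from sklearn.decomposition import PCA

    app = Flask(__name__)

    @app.route("/projection", methods=["POST"])
    def projection():
        payload = request.get_json()               # e.g., {"data": [[...], ...], "method": "PCA"}
        X = np.asarray(payload["data"], dtype=float)
        Y = PCA(n_components=2).fit_transform(X)   # other methods would be dispatched here
        return jsonify({"coordinates": Y.tolist()})

    if __name__ == "__main__":
        app.run(port=5000)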

4. USER FEEDBACK
Clustrophile is a research prototype under development and has been used over several months by data scientists and researchers in the healthcare domain. While we have not conducted a formal study, we briefly discuss the informal feedback we have gathered.

Users cared most about the time-saving aspects of Clustrophile. They were pleased with the ability to explore different clustering and projection algorithms and parameters without going back to their scripts. Similarly, among Clustrophile's most favored functionalities were the ability to add and remove features and to iteratively recompute clusterings and projections on filtered data while staying in the context of the data-analysis session. We found that our users were more familiar with clustering than with projection; indeed, for some, the relation between the Clustering and Projection views was not always clear.

The most important request from our users was scalability. As soon as they started using Clustrophile in commercial projects, they realized that they wanted to be able to analyze large datasets without losing Clustrophile's current interactive and iterative user experience (more on this in the following section).

5. SAMPLING FOR SCALE
The power of visual analysis tools such as Clustrophile comes from both facilitating iterative, interactive analysis and leveraging visual perception. Exploring large datasets at interactive rates, which typically involves coordination of multiple visualizations through brushing and linking and dynamic filtering, is, however, a challenging problem. One source of the challenge is the cost of interactive computation and rendering. Another is the perceptual and cognitive cost (e.g., clutter) users incur when dealing with large numbers of visual elements.

There are two basic approaches to this problem: precomputation and sampling [19]. Precomputation involves processing data into a form (typically tiles or cubes) to interactively answer queries (e.g., zooming, panning, brushing) that are known in advance. This approach has been the prevalent method both in the visualization community and in the database community, from which most of the current techniques originate. However, precomputation is not always feasible or, indeed, desirable. Scalable visualization tools based on precomputation are typically applied to the visualization of low-dimensional, spatial (e.g., map) datasets, as precomputation is infeasible when the data is high-dimensional, which quickly expands the combinatorial space of possible cubes or tiles. And, in general, precomputation is inflexible, as it restricts the ability to run arbitrary queries.

Sampling, considering only a selected subset of the data at a time for analysis, is an attractive alternative to precomputation for scaling interactive visual analytics tools. Sampling has generality and the advantage of easing computational and perceptual/cognitive problems at once. In principle, there is no reason that sampling-based visual analysis should not be a viable and practical option. In the end, the field of statistics builds on the premise that one can infer properties of a population (read: the complete data) from its samples. There are, however, two major challenges that, we believe, also limit wider adoption of sampling in general [19].

The first is a concern about potential biases introduced by sampling. This concern seems, however, to be at least partly unfounded, since neither the aggregation bias of precomputation nor the sampling bias of the complete data appears to cause as much concern. In recent work, Kim et al. improve the effectiveness (and trustworthiness) of sampling-based visualizations by guaranteeing the preservation of relations (e.g., ranking) within the complete data [24]. The second challenge is the lack of understanding of how users interact with sampling in visual analytics tools or how sampling affects the user experience and comprehension. Can we develop models of user behavior regarding sampling? How can we improve the user experience with sampling through visualization and interaction? How can users control the sampling process without being experts in statistics?

Addressing these challenges would accelerate the adoption of sampling and improve the utilization of the unique opportunity that sampling provides in enabling visual analysis on large datasets without losing the power of the iterative, interactive visual-analysis workflow that tools like Clustrophile facilitate.

6. REFLECTIONS ON PROJECTIONS
Using out-of-sample extrapolation, forward projection avoids re-running dimensionality-reduction algorithms. From the visualization point of view, this is not just a computational convenience but also has perceptual and cognitive advantages, such as preserving the constancy of scatter-plot representations. For example, re-running (re-training) a dimensionality-reduction algorithm after adding a new sample can significantly alter a two-dimensional scatter plot of the dimensionally-reduced data, even though all the original inter-datapoint similarities stay unchanged. Many dimensionality-reduction algorithms are based on eigenvector computations. Even different runs on the same dataset can result in different—typically, flipped—planar coordinates (if v is an eigenvector of a matrix, so is −v).

What about interacting with nonlinear dimensionality reductions? There are out-of-sample extrapolation methods for many nonlinear dimensionality-reduction techniques that make the extension of forward projection with prolines possible [4]. As for backward projection, its computation will be direct in certain cases (e.g., when an autoencoder is used). In general, however, some form of constrained optimization specific to the dimensionality-reduction algorithm will be needed. Nonetheless, it is highly desirable to develop general methods that apply across dimensionality-reduction methods.

7. VISUAL ANALYSIS IS LIKE DOING EXPERIMENTS
Data analysis is an iterative process in which analysts essentially run mental experiments on data, asking questions and (re)forming and testing hypotheses. Tukey and Wilk [41] were among the first to observe the similarities between data analysis and doing experiments. They list eleven similarities, for example, “Interaction,


feedback, trial and error are all essential; convenience dramatically helpful.” Albeit often implicitly, the visualization literature makes a strong case for designing visual analysis tools to support a quick, iterative analysis flow that is conducive to hypothesis generation and testing (e.g., [23]).

We integrate our spatial interaction techniques for exploring and reasoning with dimensionality reductions into Clustrophile, which uses familiar data-mining and visualization methods to facilitate iterative, interactive clustering analysis. Injecting new techniques into familiar workflows is an effective way of assessing their usefulness and adoption. Tukey and Wilk make an important observation on the adoption of new techniques as part of their analogy: “There can be great gains from adding sophistication and ingenuity . . . to our kit of tools, just as long as simpler and more obvious approaches are not neglected.”

It is standard practice to design visualization tools by considering criteria determined to support user tasks. While this approach is necessary for creating useful tools, our experience in developing Clustrophile suggests that the design process can benefit from the regulating clarity of general, higher-level conceptual models. To explore and reason about data, analysts generally have the basic data-mining and visualization techniques at their disposal. They often, however, lack interactive tools integrating these techniques to facilitate quick, iterative what-if analysis. Extending Tukey and Wilk's analogy between data analysis and running experiments to visual analysis, namely that visual analysis is like doing experiments, provides a useful conceptual model for a large segment of visual analytics applications. Clustrophile, along with forward projection, backward projection, and prolines, contributes to the kit of tools needed to facilitate performing visual analysis in a way similar to running experiments.

8. REFERENCES
[1] AngularJS. http://angularjs.org/.
[2] D. Asimov. The grand tour: A tool for viewing multidimensional data. SIAM J. Sci. Stat. Comput., 6(1):128–143, Jan. 1985.
[3] M. Aupetit. Visualizing distortions and recovering topology in continuous projection techniques. Neurocomputing, 70(7-9):1304–1330, Mar. 2007.
[4] Y. Bengio, J.-F. Paiement, P. Vincent, O. Delalleau, N. Le Roux, and M. Ouimet. Out-of-sample extensions for LLE, Isomap, MDS, Eigenmaps, and spectral clustering. Advances in Neural Information Processing Systems, 16:177–184, 2004.
[5] C. M. Bishop. Pattern Recognition and Machine Learning. Springer-Verlag, 2006.
[6] M. Bostock, V. Ogievetsky, and J. Heer. D3: Data-driven documents. IEEE Trans. Visualization & Comp. Graphics, 17(12):2301–2309, 2011.
[7] M. Brehmer, M. Sedlmair, S. Ingram, and T. Munzner. Visualizing dimensionally-reduced data: Interviews with analysts and a characterization of task sequences. In BELIV '14. ACM, 2014.
[8] P. Bruneau, P. Pinheiro, B. Broeksema, and B. Otjacques. Cluster Sculptor, an interactive visual clustering system. Neurocomputing, 150:627–644, 2015.
[9] A. Buja, D. F. Swayne, M. L. Littman, N. Dean, H. Hofmann, and L. Chen. Data visualization with multidimensional scaling. Journal of Computational and Graphical Statistics, 17(2):444–472, 2008.
[10] N. Cao, D. Gotz, J. Sun, and H. Qu. DICON: Interactive visual analysis of multidimensional clusters. IEEE Transactions on Visualization and Computer Graphics, 17(12):2581–2590, Dec. 2011.
[11] J. Chuang, D. Ramage, C. Manning, and J. Heer. Interpretation and trust. In Proceedings of the 2012 ACM Annual Conference on Human Factors in Computing Systems (CHI '12). ACM, 2012.
[12] CVXOPT. http://cvxopt.org/.
[13] A. Endert, P. Fiaux, and C. North. Semantic interaction for visual text analytics. In Proceedings of the 2012 ACM Annual Conference on Human Factors in Computing Systems (CHI '12). ACM, 2012.
[14] A. Endert, C. Han, D. Maiti, L. House, S. Leman, and C. North. Observation-level interaction with statistical models for visual analytics. In IEEE Conference on Visual Analytics Science and Technology (VAST), pages 121–130, Oct. 2011.
[15] M. A. Fisherkeller, J. H. Friedman, and J. W. Tukey. PRIM-9: An interactive multidimensional data display and analysis system. In Proc. Fourth International Congress for Stereology, 1974.
[16] J. H. Friedman and J. W. Tukey. A projection pursuit algorithm for exploratory data analysis. IEEE Transactions on Computers, C-23(9):881–890, Sept. 1974.
[17] M. Gleicher. Explainers: Expert explorations with crafted projections. IEEE Trans. Visual. Comput. Graphics, 19(12):2042–2051, Dec. 2013.
[18] T. Hastie, R. Tibshirani, J. Friedman, and J. Franklin. The elements of statistical learning: Data mining, inference and prediction. The Mathematical Intelligencer, 27(2):83–85, 2005.
[19] J. M. Hellerstein. Interactive analytics. In P. Bailis, J. M. Hellerstein, and M. Stonebraker, editors, Readings in Database Systems. MIT Press, 5th edition, 2015.
[20] D. H. Jeong, C. Ziemkiewicz, B. Fisher, W. Ribarsky, and R. Chang. iPCA: An interactive system for PCA-based visual analytics. Computer Graphics Forum, 28(3):767–774, 2009.
[21] S. Johansson and J. Johansson. Interactive dimensionality reduction through user-defined combinations of quality metrics. IEEE Trans. Visual. Comput. Graphics, 15(6):993–1000, Nov. 2009.
[22] E. Jones, T. Oliphant, P. Peterson, et al. SciPy: Open source scientific tools for Python, 2001. http://www.scipy.org/.
[23] S. Kandel, A. Paepcke, J. Hellerstein, and J. Heer. Enterprise data analysis and visualization: An interview study. In Proc. IEEE VAST 2012, 2012.
[24] A. Kim, E. Blais, A. Parameswaran, P. Indyk, S. Madden, and R. Rubinfeld. Rapid sampling for visualizations with ordering guarantees. Proc. VLDB Endow., 8(5):521–532, Jan. 2015.
[25] J. Kruskal. Multidimensional scaling by optimizing goodness of fit to a nonmetric hypothesis. Psychometrika, 29(1):1–27, 1964.
[26] S. Lespinats and M. Aupetit. CheckViz: Sanity check and topological clues for linear and non-linear mappings. Computer Graphics Forum, 30(1):113–125, Dec. 2010.
[27] J. M. Lewis, L. van der Maaten, and V. de Sa. A behavioral investigation of dimensionality reduction. In Proc. 34th Conf. of the Cognitive Science Society (CogSci), pages 671–676, 2012.


[28] A. Lex, M. Streit, C. Partl, K. Kashofer, and D. Schmalstieg. Comparative analysis of multidimensional, quantitative data. IEEE Trans. Visual. Comput. Graphics, 16(6):1027–1035, Nov. 2010.
[29] S. L'Yi, B. Ko, D. Shin, Y.-J. Cho, J. Lee, B. Kim, and J. Seo. XCluSim: A visual analytics tool for interactively comparing multiple clustering results of bioinformatics data. BMC Bioinformatics, 16(11):1–15, 2015.
[30] E. J. Nam, Y. Han, K. Mueller, A. Zelenyuk, and D. Imre. ClusterSculptor: A visual analytics tool for high-dimensional data. In Proc. IEEE VAST 2007, 2007.
[31] OECD Better Life Index. http://www.oecdbetterlifeindex.org/. Accessed: May 27, 2016.
[32] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.
[33] PEG.js. http://pegjs.org/.
[34] A. Pilhofer, A. Gribov, and A. Unwin. Comparing clusterings using Bertin's idea. IEEE Trans. Visual. Comput. Graphics, 18(12):2506–2515, Dec. 2012.
[35] T. Schreck, J. Bernard, T. von Landesberger, and J. Kohlhammer. Visual cluster analysis of trajectory data with interactive Kohonen maps. Information Visualization, 8(1):14–29, 2009.
[36] M. Sedlmair, T. Munzner, and M. Tory. Empirical guidance on scatterplot and dimension reduction technique choices. IEEE Trans. Visual. Comput. Graphics, 19(12):2634–2643, Dec. 2013.
[37] J. Seo and B. Shneiderman. Interactively exploring hierarchical clustering results [gene identification]. Computer, 35(7):80–86, Jul. 2002.
[38] R. Shepard. The analysis of proximities: Multidimensional scaling with an unknown distance function. I. Psychometrika, 27(2):125–140, 1962.
[39] J. Stahnke, M. Dörk, B. Müller, and A. Thom. Probing projections: Interaction techniques for interpreting arrangements and errors of dimensionality reductions. IEEE Trans. Vis. Comput. Graph., 22(1):629–638, 2016.
[40] W. Torgerson. Multidimensional scaling: I. Theory and method. Psychometrika, 17(4):401–419, 1952.
[41] J. W. Tukey and M. Wilk. Data analysis and statistics: Techniques and approaches. In Proceedings of the Symposium on Information Processing in Sight and Sensory Systems, pages 7–27. ACM, 1965.
[42] L. van der Maaten, E. Postma, and H. van den Herik. Dimensionality reduction: A comparative review. Technical Report TiCC TR 2009-005, 2009.
[43] S. van der Walt, S. C. Colbert, and G. Varoquaux. The NumPy array: A structure for efficient numerical computation. Computing in Science & Engineering, 13(2):22–30, 2011.
[44] M. Williams and T. Munzner. Steerable, progressive multidimensional scaling. In IEEE Symposium on Information Visualization. IEEE, 2004.