Source: cs229.stanford.edu/proj2011/Achlioptas_finding_structure_in_cytof_data.pdf

Finding Structure in CyTOF Data
Or, how to visualize low dimensional embedded manifolds.

Panagiotis Achlioptas∗
[email protected]

General Terms
Algorithms, Experimentation, Measurement

Keywords
Manifold Learning, Dimensionality Reduction, Flow Cytometry, CyTOF

1. INTRODUCTION
High-dimensional data are notorious for the difficulties they pose to most known statistical procedures that aim to manipulate the data in a meaningful and enlightening way. The phenomenon known as the 'curse of dimensionality' has a severe negative impact on our current methods for various reasons: the size of the considered space grows exponentially with the number of dimensions, meaning, for instance, that solving a density estimation problem in such a space requires exponentially many samples (if no further assumptions are made). Single-cell mass cytometry (CyTOF) is a recently introduced technique [1] that is anticipated to revolutionize the field of hematology and enhance our general understanding of the cells of living organisms. A main obstacle towards these goals will be the analysis of the produced high-dimensional data. How are we going to deal with data that 'live' in an ambient space of, e.g., 100 dimensions? Without any further assumptions, our task will certainly be difficult. Fortunately, as with many other 'real-life' datasets, we expect that CyTOF data will be governed by only a few degrees of freedom. Such a scenario arises when, for example, the data lie on a (non-)linear manifold of low dimensionality embedded in a higher-dimensional space. Such data are said to have a 'small' intrinsic dimensionality, and various techniques have been developed to discover the essential 'forces' that shape the data (leading to various dimensionality reduction schemes).

In this article, we take the first step by showing that the data produced by CyTOF are indeed intrinsically low-dimensional, and we apply various state-of-the-art techniques to them in order to obtain a concrete visual representation. The remainder of this article is organized as follows. In Section 2 we give a brief description of the kind of data produced by CyTOF, along with some notational conventions. In Section 3 we introduce three popular techniques for estimating the intrinsic dimension of a data set sampled from a manifold. In Sections 5, 6 and 7 we describe the algorithms and strategies we used for discovering the non-linear, low-dimensional embedding of the CyTOF data. Instead of describing the experimental component of this article in a separate section, we report our findings alongside each technique used. We draw our conclusions in Section 8.

∗Computer Science Department, Stanford University.

2. CYTOF DATA
The CyTOF data at hand consist of measurements of 31 characteristics of human cells, ranging from the diameter of a cell to the abundances of its various proteins. Though the 31 dimensions used to describe the state of each cell are 'few' compared with the dimensionality arising in other instances of the problem (e.g. a 64×64 pixel photograph lives in a 4096-dimensional space), obtaining, for example, a visual representation of the 31-dimensional space is still far from easy. Moreover, we anticipate that in the near future CyTOF will be able to measure up to a hundred features per cell, giving further importance to this first exploratory analysis. On top of this, instead of a single (static) 'snapshot' of the characteristics of the analyzed cells, CyTOF generated different 'snapshots' over 8 different time periods. The data are organized in matrices X_t (n × D), where n is the number of analyzed cells (up to many thousands), D = 31 is the ambient (high) dimension, and t ∈ {1, . . . , 8} ranges over the different time periods. Finally, we mention that for many evaluation benchmarks the data behaved very similarly across the different time points; in these cases, to save space, we present the experimental findings only for a single point in time t = c. For the rest of this article, x_i ∈ R^D will denote the i-th analyzed cell and y_i ∈ R^d its counterpart in the low d-dimensional embedding.


3. ESTIMATING INTRINSIC DIMENSION

Thresholding Principal Components
Given a set S_n = {X_1, . . . , X_n}, X_i ∈ X, i = 1, . . . , n, of data points sampled independently from a manifold M, probably the most obvious way to estimate the intrinsic dimension, D_intr, is to look at the eigenstructure of the covariance matrix C of S_n. In this approach, D_pca is defined as the number of eigenvalues of C that are larger than a given threshold. This technique has two basic disadvantages. First, it requires a threshold parameter that determines which eigenvalues to discard. In addition, if the manifold is highly nonlinear, D_pca will characterize the global (intrinsic) dimension of the data rather than the local dimension of the manifold. D_pca will always overestimate D_intr; the difference depends on the degree of nonlinearity of the manifold.
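To make the thresholding estimator concrete, the following is a minimal numpy sketch; the 5% variance threshold, the function names, and the synthetic example are our own illustrative choices, not taken from the paper.

```python
import numpy as np

def pca_dimension(X, var_threshold=0.05):
    """D_pca: number of covariance eigenvalues exceeding a fraction of the
    total variance. The 5% threshold is an illustrative choice."""
    Xc = X - X.mean(axis=0)                     # center the data
    eigvals = np.linalg.eigvalsh(np.cov(Xc.T))  # covariance eigenvalues
    return int(np.sum(eigvals > var_threshold * eigvals.sum()))

# Synthetic check: a noisy 2-D plane embedded in 10 ambient dimensions.
rng = np.random.default_rng(0)
Z = rng.normal(size=(500, 2))                   # 2 latent degrees of freedom
X = Z @ rng.normal(size=(2, 10)) + 0.01 * rng.normal(size=(500, 10))
print(pca_dimension(X))                         # recovers the planar dimension, 2
```

The estimate is exact here because the noise eigenvalues are orders of magnitude below the threshold; on real data the choice of threshold is precisely the weakness discussed above.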

Luckily, our data suggest a very clear threshold. As seen in Figure 1, the three largest eigenvalues capture more than 70% of the total variance, while the 'elbow' between the third and fifth eigenvalue marks a sharp decrease in captured variance. Thus, we can be confident that D_intr is, with high probability, less than 5, and it seems that even a 'meaningful' 2-dimensional embedding of the data exists. To strengthen the case for the existence of a low-dimensional embedding, we estimate D_intr with two more statistics.

Correlation dimension
The second approach to intrinsic dimension estimation that we used is based on geometric properties of the data and requires neither an explicit assumption on the underlying data model nor any input parameters. It is based on the correlation dimension, from the family of fractal dimensions. The intuition behind its definition is the observation that in a D-dimensional set, the number of pairs of points closer to each other than r is proportional to r^D.

Definition 1. Given a finite set S_n = {x_1, . . . , x_n} of a metric space X, let

C_n(r) = 2 / (n(n−1)) · Σ_{i=1}^{n} Σ_{j=i+1}^{n} I(||x_i − x_j|| < r)

where I is the indicator function. For a countable set S = {X_1, X_2, . . .} ⊂ X, the correlation integral is defined as C(r) = lim_{n→∞} C_n(r). If the limit exists, the correlation dimension of S is defined as

D_corr = lim_{r→0} log C(r) / log r.

Since, for a finite sample, the zero limit cannot be achieved, we instead used the scale-dependent correlation dimension of a finite set, defined as

Definition 2.

D_corr(r_1, r_2) = (log C(r_2) − log C(r_1)) / (log r_2 − log r_1).

In our experiments, instead of measuring D_corr at some constant small values r_1, r_2, we set them, for each datapoint, to the median and the maximum (respectively) of the distances to its k nearest neighbors. Thus, for each datapoint we computed a separate D_corr, and our final estimate was their average. Our results were very robust under different values of k.
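A minimal numpy sketch of Definitions 1 and 2 follows. For simplicity it uses two fixed radii rather than the per-point median/maximum k-NN distances described above, and the circle example is our own; both are illustrative assumptions.

```python
import numpy as np

def correlation_sum(X, r):
    """C_n(r) from Definition 1: fraction of point pairs closer than r."""
    n = len(X)
    d = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
    i, j = np.triu_indices(n, k=1)              # each pair counted once
    return 2.0 * np.sum(d[i, j] < r) / (n * (n - 1))

def corr_dimension(X, r1, r2):
    """Scale-dependent correlation dimension from Definition 2."""
    num = np.log(correlation_sum(X, r2)) - np.log(correlation_sum(X, r1))
    return num / (np.log(r2) - np.log(r1))

# Points on a unit circle in R^3: a 1-dimensional manifold.
rng = np.random.default_rng(0)
theta = rng.uniform(0, 2 * np.pi, 400)
X = np.c_[np.cos(theta), np.sin(theta), np.zeros(400)]
print(corr_dimension(X, 0.1, 0.3))              # close to the intrinsic dimension, 1
```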

Unfortunately, it is known that D_corr ≤ D_intr and that D_corr approximates D_intr well only if the data distribution on the manifold is nearly uniform (our approach of using different r_i's is not enough to circumvent this). We deal with the known non-uniformity of the CyTOF data¹ by using our last estimator, that of packing numbers [4]. We avoid giving a detailed analysis of this recently introduced technique. Intuitively, it efficiently estimates the minimum number of open balls whose union covers the space S. This technique was shown not to depend on the data distribution on the manifold.
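To make the packing idea tangible, here is a greedy sketch in the spirit of [4]. It is a simplification of the actual estimator, which additionally averages the packing numbers over random orderings of the points; the circle example and all names are ours.

```python
import numpy as np

def packing_number(X, r):
    """Greedy r-packing: keep a point only if it lies at distance >= r
    from every center kept so far."""
    centers = []
    for x in X:
        if all(np.linalg.norm(x - c) >= r for c in centers):
            centers.append(x)
    return len(centers)

def packing_dimension(X, r1, r2):
    """Scale-dependent packing dimension:
    D = -(log M(r2) - log M(r1)) / (log r2 - log r1), with r1 < r2."""
    num = np.log(packing_number(X, r2)) - np.log(packing_number(X, r1))
    return -num / (np.log(r2) - np.log(r1))

# Densely sampled unit circle: packing numbers scale like 1/r for a curve.
rng = np.random.default_rng(1)
theta = np.sort(rng.uniform(0, 2 * np.pi, 500))
X = np.c_[np.cos(theta), np.sin(theta)]
print(packing_dimension(X, 0.05, 0.2))          # roughly 1 for a curve
```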

Intrinsic dimension of CyTOF data
As seen in Figure 2, the CyTOF data (X_t) at all the different time points seem to lie on a low-dimensional manifold. The principal components and the correlation dimension bound D_intr between ∼1.5 and 5. (For estimating D_corr we used the distances of the k = 10 nearest neighbors of each datapoint.) Supported by these results, we turn to our next goal, which is to find a meaningful visual representation of the data.

4. THE GLOBALLY LINEAR CASE
If we knew that our data lived on, or very close to, a linear subspace (a d-dimensional hyperplane embedded in the ambient space), then PCA would be the tool to use. It would 'perfectly' discover the embedded hyperplane, since this would be the hyperplane spanned by the d eigenvectors of the empirical covariance matrix with non-zero eigenvalues (all remaining eigenvalues being zero or very close to zero).
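In this linear case the embedding is just a projection onto the leading eigenvectors. A minimal sketch (function names and the rank-2 example are our own):

```python
import numpy as np

def pca_embed(X, d=2):
    """Project centered data onto the top-d eigenvectors of the
    empirical covariance matrix."""
    Xc = X - X.mean(axis=0)
    eigvals, eigvecs = np.linalg.eigh(np.cov(Xc.T))
    top = eigvecs[:, np.argsort(eigvals)[::-1][:d]]  # d leading eigenvectors
    return Xc @ top

# Exactly planar data in R^8: projecting onto the top 2 directions is lossless.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2)) @ rng.normal(size=(2, 8))  # rank-2 data
Y = pca_embed(X, d=2)
```

Because the (centered) data here have rank 2, the projection preserves all of the variance; for data on a curved manifold it would not.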

But what if the data live in a more complex 'micro-world', like a non-linear manifold? We describe and use 3 different techniques to deal with such a scenario: LLE, Isomap and t-SNE.

5. LOCALLY LINEAR EMBEDDING
Locally Linear Embedding [2], or LLE, is based on the idea that over a small 'patch' of a smooth manifold, the surface is approximately flat. It therefore exploits the expected local linearity of the manifold to find a linear, weight-based representation of each point from its neighbors, characterizing in this way the local relative positioning of each neighborhood in the high-dimensional space. Using this local parameterization, one can then look for a new set of points in a lower dimension that preserves, as closely as possible, the same relative positioning information.

The first step of LLE is to solve for the weights that best characterize the points' relationships in R^D:

¹Apart from the experimental results supporting it, we had prior knowledge of the existence of small, distinct cell sub-populations.


W = argmin_W Σ_{i=1}^{n} || x_i − Σ_{j∈N(i)} W_ij x_j ||²   (1)

s.t. Σ_j W_ij = 1 for all i   (2)

where N(i) denotes the neighboring points of x_i (the neighborhood of each point can be computed in various ways, e.g. from the k nearest neighbors or from a local sphere of radius ε around x_i). The size of the neighborhood is a compromise: it must be large enough to allow a good reconstruction of the points (it must contain at least d_intr points), but small enough not to violate the local linearity assumption. With the normalization requirement added, the resulting weight-based representation is invariant to any local rotation, scaling, or translation of x_i and its neighbors.

The second step is to find the embedding Y which best pre-serves the previously found local configurations:

Y = argmin_Y Σ_{i=1}^{n} || y_i − Σ_{j∈N(i)} W_ij y_j ||²   (3)

s.t. Y·1 = 0,  Y Yᵀ = I_n   (4)

Under the above constraints the problem is well posed, and a unique globally optimal solution can be found analytically: the objective is a quadratic form whose minimizers are given by the bottom eigenvectors of the sparse matrix (I − W)ᵀ(I − W).
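The two steps can be sketched in numpy as follows. This is a bare-bones illustration (not the implementation used for the experiments); the regularization of the local Gram matrix is a standard trick needed when k exceeds the ambient dimension, and all names are ours.

```python
import numpy as np

def lle(X, k=10, d=2, reg=1e-3):
    """Bare-bones LLE: solve (1)-(2) for W, then obtain the embedding of
    (3)-(4) from the bottom eigenvectors of (I - W)^T (I - W)."""
    n = len(X)
    D2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(D2, np.inf)
    nbrs = np.argsort(D2, axis=1)[:, :k]          # k nearest neighbors

    W = np.zeros((n, n))
    for i in range(n):
        Z = X[nbrs[i]] - X[i]                     # neighbors centered on x_i
        G = Z @ Z.T                               # local Gram matrix
        G += reg * np.trace(G) * np.eye(k)        # regularize if singular
        w = np.linalg.solve(G, np.ones(k))
        W[i, nbrs[i]] = w / w.sum()               # enforce constraint (2)

    M = (np.eye(n) - W).T @ (np.eye(n) - W)
    vals, vecs = np.linalg.eigh(M)
    return vecs[:, 1:d + 1]                       # skip the constant eigenvector

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 5))
Y = lle(X, k=10, d=2)                             # shape (60, 2)
```

The constant eigenvector (eigenvalue ≈ 0) is discarded because the rows of W sum to one, so it carries no embedding information.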

5.1 LLE in practice
We tried LLE with a varying number k of nearest neighbors for each data point, measuring distances under the Euclidean norm (a meta-parameter of LLE). The algorithm performed extremely poorly when its input was the entire analyzed cell population of any given time period (see Figure 4b). The reason behind this failure is illustrated in Figure 3, where we have plotted the percentage of cells captured by the two largest connected components of the (disconnected) neighborhood graph under different values of k. In other words, even for k > 31 (the ambient dimension) the raw high-dimensional data do not form a connected neighborhood graph.

In Figure 4a, we used a value of k = 16 and the algorithm considered ∼98% of the cell population: the cells forming the largest connected component. The remaining 2% of the cells were ignored entirely, treated as very noisy observations. A justification for this stems from the fact that this 2% was scattered across roughly 200 distinct unconnected components, resulting in many very small (2-3 cells each) groups of unrelated data.
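The connected-component bookkeeping described above can be reproduced in a few lines. This is a sketch; the two-segment toy data are our own illustration of how a neighborhood graph fragments.

```python
import numpy as np
from collections import deque

def knn_component_sizes(X, k):
    """Sizes (descending) of the connected components of the
    symmetrized k-nearest-neighbor graph."""
    n = len(X)
    D2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(D2, np.inf)
    adj = [set() for _ in range(n)]
    for i in range(n):
        for j in np.argsort(D2[i])[:k]:
            adj[i].add(int(j))
            adj[int(j)].add(i)                    # symmetrize edges
    seen, sizes = set(), []
    for s in range(n):
        if s in seen:
            continue
        seen.add(s)
        queue, size = deque([s]), 0
        while queue:                              # breadth-first search
            u = queue.popleft()
            size += 1
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    queue.append(v)
        sizes.append(size)
    return sorted(sizes, reverse=True)

# Two well-separated 1-D chains: the graph splits into two components.
X = np.r_[np.arange(30.0), np.arange(100.0, 130.0)].reshape(-1, 1)
print(knn_component_sizes(X, k=2))                # [30, 30]
```

The fraction of cells in the largest component, as plotted in Figure 3, is simply `sizes[0] / n`.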

6. ISOMAP
Isometric feature mapping, or Isomap [5], may be viewed as an extension of Multidimensional Scaling (MDS), a classical method for embedding dissimilarity information into Euclidean space. Isomap consists of two main steps:

1. Estimate the geodesic distances (shortest curved distances) among the data points in R^D.

2. Use MDS to embed the data points in a low-dimensional space that best preserves the estimated geodesic distances.

For the first part of Isomap, we appeal again to the local linearity of the manifold. If our data are sufficiently dense, then there is some local neighborhood in which the geodesic distance is well approximated by the naive Euclidean distance. Taking these local distances as trusted, longer distances may be approximated by finding the length of the shortest path along trusted edges.
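A sketch of this geodesic estimation step, using Floyd-Warshall shortest paths over trusted k-NN edges; this O(n³) variant is for illustration only (practical Isomap implementations typically use Dijkstra's algorithm), and the half-circle example is ours.

```python
import numpy as np

def geodesic_distances(X, k):
    """Isomap step 1: trust Euclidean distances only on k-NN edges,
    then approximate geodesics by all-pairs shortest paths."""
    n = len(X)
    D = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
    G = np.full((n, n), np.inf)
    np.fill_diagonal(G, 0.0)
    for i in range(n):
        for j in np.argsort(D[i])[1:k + 1]:       # trusted local edges
            G[i, j] = G[j, i] = D[i, j]
    for m in range(n):                            # relax through every midpoint
        G = np.minimum(G, G[:, m:m + 1] + G[m:m + 1, :])
    return G

# Half circle: the geodesic between the endpoints follows the arc (~pi),
# much longer than their straight-line distance (2).
t = np.linspace(0, np.pi, 50)
X = np.c_[np.cos(t), np.sin(t)]
G = geodesic_distances(X, k=3)
print(G[0, -1])                                   # close to pi
```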

6.1 Isomap in practice
For the approximation of the geodesic distances to be accurate (among distant points), the prerequisite of dense data is essential. A simple way to achieve this would be to use the neighborhood graph resulting from some large value of k. On the other hand, such an attempt would very likely suffer from what is known in the literature as 'short-circuit distances', which can lead to drastically different (and incorrect) low-dimensional embeddings. In our empirical results, k = 19 proved to be the sweet spot, striking a good balance in the trade-off between the fraction of the variance in geodesic distance estimates not accounted for in the Euclidean (low-dimensional) embedding, and the fraction of points not included in the largest connected component (see Figure 5).

7. T-SNE
T-distributed stochastic neighbor embedding [3] works in a different vein. It was developed specifically to address the problem of visualizing high-dimensional data, i.e. it works for embeddings with d ≤ 3 and is not appropriate for arbitrary dimensionality reduction. Its basic idea is to describe the similarities between data points (e.g. their Euclidean distances) as conditional probabilities and then to minimize the (Kullback-Leibler) divergence between these probabilities as captured in the high- and low-dimensional spaces respectively. In outline, t-SNE works as follows:

1. For every x_i in the high-dimensional space, let p_{j|i} be proportional to the density of ||x_i − x_j|| under a Gaussian centered at x_i with variance σ².

2. For every y_i in the low-dimensional space, let q_{j|i} be proportional to the density of ||y_i − y_j|| under a Student's t-distribution with 1 degree of freedom.

3. Use a gradient-descent-like method to minimize

KL(P || Q) = Σ_i Σ_j p_ij log (p_ij / q_ij).
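The quantities in steps 1-3 can be written down directly. The sketch below uses a single global σ for all points, whereas the actual algorithm [3] tunes a per-point σ_i via a perplexity parameter; the gradient loop is omitted, and all names are ours.

```python
import numpy as np

def p_joint(X, sigma=1.0):
    """Symmetrized Gaussian affinities p_ij in the high-dimensional space
    (single global sigma; t-SNE proper tunes sigma_i per point)."""
    D2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    P = np.exp(-D2 / (2 * sigma ** 2))
    np.fill_diagonal(P, 0.0)
    P = P / P.sum(axis=1, keepdims=True)          # conditional p_{j|i}
    return (P + P.T) / (2 * len(X))               # joint distribution, sums to 1

def q_joint(Y):
    """Student-t (1 d.o.f.) affinities q_ij in the embedding."""
    D2 = ((Y[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    Q = 1.0 / (1.0 + D2)                          # heavy-tailed kernel
    np.fill_diagonal(Q, 0.0)
    return Q / Q.sum()

def kl(P, Q, eps=1e-12):
    """Objective of step 3: KL(P || Q)."""
    return float(np.sum(P * np.log((P + eps) / (Q + eps))))

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 10))
Y = rng.normal(size=(40, 2))                      # random, unoptimized embedding
cost = kl(p_joint(X), q_joint(Y))                 # the quantity gradient descent shrinks
```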

The success of this method, as also revealed on the CyTOF data, is that it effectively copes with the crowding problem [3]. In simple words, this problem stems from the fact that the area available in two (or three) dimensions to accommodate moderately distant points (in the high-dimensional space) is not nearly large enough compared with the area available to accommodate nearby points. Thus, if we want to model the small distances accurately in the low-dimensional space, most of the points at moderate distance from point i will have to be placed much too far away from it. t-SNE handles this effectively through the heavy-tailed Student's t-distribution used to model distances in the low-dimensional space. Compared with the probabilistic setting of SNE or LLE, where every pairwise distance can be regarded as having been generated under a Gaussian distribution (with mean equal to one of the two points), the t-distribution makes it much more likely for 2 points to be embedded far away from each other. Thus, as also seen in our experiments (see Figures 9, 10), t-SNE produces less compact and hence 'cleaner' results. For all our experiments with t-SNE we ran gradient descent for 1000 iterations with 10 different re-initializations. In most cases we converged to a local minimum fairly quickly.

8. CONCLUSIONS
The most important result to emerge from this project is that we now have more experimental evidence that CyTOF data are likely governed by very few degrees of freedom. Also, regarding the methods used, we can now make some statements about their performance on this kind of data. By visual inspection, t-SNE outperformed all the other methods. To support this evidence, we measured the following type of error introduced by each method in its low-dimensional embedding: for every x_i, let p_i(m) be the value the i-th cell has in the m-th dimension of the high-dimensional space. Figure 12 plots, for each method, the sum of the squared distances between every embedded cell and its 5 nearest (embedded) neighbors, where the distances are measured in the original-space dimension p_i(m) with the maximum variance.

(The two dimensions with maximum variance in the original space were those measuring the CD3 and CD45 protein abundances of the cells. These two protein abundances were used to color all of our plots, coloring every cell with an intensity proportional to its original CD3 or CD45 abundance.)

Figure 8 shows the 2-dimensional embedding produced by PCA. Interestingly enough, it manages to separate the data (colors) to a large extent. Some results from the use of Isomap (which was found to perform slightly better than LLE) are shown in Figures 6 and 7. Finally, in Figure 11 one can see the kind of 2-dimensional embeddings that a k-means-like algorithm could produce for these data. Clearly, the manifold learners used outperform such an approach.

9. ACKNOWLEDGEMENTS
The author would like to thank Prof. Daphne Koller, who served as his rotation advisor at Stanford and gave him the opportunity to 'play around' with the CyTOF data. Also, Manfred Claassen, who was an invaluable source of inspiration and help. The materials written by Lawrence Cayton and Alexander Ihler, available online, helped a lot in demystifying non-linear manifold learners. Finally, the publicly available code of Laurens van der Maaten was a great help.

10. REFERENCES
[1] Sean C. Bendall et al. Single-Cell Mass Cytometry of Differential Immune and Drug Responses Across a Human Hematopoietic Continuum. Science, 2011.
[2] Sam Roweis and Lawrence Saul. Nonlinear dimensionality reduction by locally linear embedding. Science, 290(5500):2323-2326, 22 December 2000.
[3] Laurens van der Maaten and Geoffrey Hinton. Visualizing Data using t-SNE. Journal of Machine Learning Research, 2008.
[4] Balázs Kégl. Intrinsic Dimension Estimation Using Packing Numbers. NIPS, 2002.
[5] J. B. Tenenbaum, V. de Silva and J. C. Langford. A Global Geometric Framework for Nonlinear Dimensionality Reduction. Science, 290(5500):2319-2323, 22 December 2000.

Figure 1: Percentage of variance captured by an increasing number of eigenvalues, sorted in decreasing order (t = 5).

Figure 2: Three estimators of the Intrinsic Dimension for allthe 8 different time periods.


Figure 3: Percentage of cells in the largest and second-largest connected components of the neighborhood graph (t = 5).

(a) LLE Only CC (b) LLE All Cells

Figure 4: LLE 2D embedding (k = 16). In 4a only the cells in the largest connected component were considered by LLE. In contrast, in 4b LLE was applied to the whole cell population. The difference between the embeddings is quite dramatic. Both experiments are at t = 5 and use CD45 cell abundances for coloring.

Figure 5: Trade-off between the fraction of the variance ingeodesic distance estimates not accounted for in the Eu-clidean (2-d) embedding and the fraction of points not in-cluded in the largest connected component.

Figure 6: Isomap 2D embedding (t = 1). CD3 cell abun-dances used for coloring.

Figure 7: Isomap 2D embedding (t = 5). CD3 cell abun-dances used for coloring.

(a) t = 1 (b) t = 8

Figure 8: A 2D embedding constructed by projecting the cells onto the 2 largest eigenvectors. CD3 cell abundances used for coloring.


Figure 9: T-SNE 2D embedding (t = 1). CD3 cell abun-dances used for coloring.

Figure 10: T-SNE 2D embedding (t = 5). CD3 cell abun-dances used for coloring.

(a) Cluster 1 (b) Cluster 2

(c) Cluster 3 (d) Cluster 4

Figure 11: A 2D embedding as produced by the K-medians algorithm with k = 4. For each cluster we plot the colors resulting from CD45 cell abundances. The relative within-cluster (grid) positions of the cells are random.

Figure 12: Comparison of methods as captured by the average difference between each embedded data point and its 5 closest neighbors under their L1 distance in feature CD3.