are the data hollow?

6
Are the data hollow? Leonid Pekelis, Susan Holmes Statistics Department, Sequoia Hall, Stanford University, CA 94305 December 14, 2009 Abstract This paper presents statistical approaches to testing whether data are spherical in the topological sense, ie, they lie on a closed manifold with a hollow interior. We characterize the sphere by it’s Betti numbers and use the computational topology program PLEX [12] based on simplicial complexes to calculate the data’s Betti numbers. We use the parametric bootstrap approach to test the power of the test for several alternatives including the full sphere or ellipsoid filling points in three dimensions. 1 Introduction High dimensional data sometimes lie on a a lower dimensional manifold, this has been extensively explored by the manifold learning community for examples see [17], [1] or [13]. However in some applications it is important to know whether the manifold is closed as in the simple case of cell cycle data where gene expressions in very high dimensions can be projected onto a closed curve, see [7] for the details of the cell cycle application. It can be important to see if data has a hollow structure, even in data that are originally known to come from a sphere, such as geological measurements. However the nature of the data themselves often obscure the fact that the data are hollow, since spherical data will project automatically on the circle in two dimensions. Dynamic graphics such as those proposed by ggobi[15] can provide a way of ‘seeing’ that the data sit on a sphere, but we are interested in a statistical test for such situations. In other types of data, such as astronomical data some researchers [19] have used statistical methods to detect topological structure, we propose an approach also based on topology aimed at detecting structure with dense surfaces surrounding hollow volumes that can be characterized by a specific pattern of the topological characteristics called Betti numbers[10]. 2 Methodology For the remainder of this article, let X be a random variable in IR d with density function f : M→ [0, 1], and M a manifold, possibly of dimension less than d. The general problem in topological statistics is identifying a good approximation to the space M from X ,a n × d matrix encoding n possibly noisy draws from X. In this paper, we focus the hypothesis test H 0 : M does not have the topology of the sphere S d ,d 1 (1) In order to reformulate this null hypothesis into a well defined problem, we use the methods of persistent homology from the field of computational topology. The next section summarizes the theory behind persistent homology and the process of approximating topological properties of M from X . This leads to a restatement of (1) in topological terms. For a more thorough introduction to persistent homology, [2] is an excellent place to start. 2.1 Calculating the Homology of Data The homology of a space M is a set of specific topological characteristics which describe M. Formally, it is a series of functors that takes M to algebraic images. Two homeomorphic spaces necessarily have the same homology, but the converse is not always true. In principle, other functors exist, such as the fundamental group, but for data analysis homology is exclusively examined due to computational tractability. A more general approach falls under algebraic topology. For a thorough treatment see [10]. Calculating homology from X is accomplished in two principle steps. First, build a simplicial complex, denoted by S(X ,r), where r is a 1-dimensional parameter often relating to the level sets of a distance function on X . And second, calculate the homology groups, H p (S(X ,r)), where p runs over ZZ + and denotes dimension. Intuitively, for a well chosen value of r, S(X ,r) is a polyhedral approximation to M. Thus, we can hope for isomorphisms between their homology groups, H p (M) H p (S(X ,r)). The structure of S(X ,r) is a collection of vertices, edges, triangles, and their higher dimension equivalents, collec- tively named simplices, together with information on how various simplices fit together. Imagine a multidimensional 1

Upload: others

Post on 08-Feb-2022

1 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Are the data hollow?

Are the data hollow?

Leonid Pekelis, Susan HolmesStatistics Department, Sequoia Hall, Stanford University, CA 94305

December 14, 2009

Abstract

This paper presents statistical approaches to testing whether data are spherical in the topological sense, ie,they lie on a closed manifold with a hollow interior. We characterize the sphere by it’s Betti numbers and use thecomputational topology program PLEX [12] based on simplicial complexes to calculate the data’s Betti numbers.

We use the parametric bootstrap approach to test the power of the test for several alternatives including thefull sphere or ellipsoid filling points in three dimensions.

1 IntroductionHigh dimensional data sometimes lie on a a lower dimensional manifold, this has been extensively explored by

the manifold learning community for examples see [17], [1] or [13].However in some applications it is important to know whether the manifold is closed as in the simple case of cell

cycle data where gene expressions in very high dimensions can be projected onto a closed curve, see [7] for the detailsof the cell cycle application.

It can be important to see if data has a hollow structure, even in data that are originally known to come from asphere, such as geological measurements. However the nature of the data themselves often obscure the fact that thedata are hollow, since spherical data will project automatically on the circle in two dimensions. Dynamic graphicssuch as those proposed by ggobi[15] can provide a way of ‘seeing’ that the data sit on a sphere, but we are interestedin a statistical test for such situations.

In other types of data, such as astronomical data some researchers [19] have used statistical methods to detecttopological structure, we propose an approach also based on topology aimed at detecting structure with dense surfacessurrounding hollow volumes that can be characterized by a specific pattern of the topological characteristics calledBetti numbers[10].

2 MethodologyFor the remainder of this article, let X be a random variable in IRd with density function f : M → [0, 1], and

M a manifold, possibly of dimension less than d. The general problem in topological statistics is identifying a goodapproximation to the space M from X , a n× d matrix encoding n possibly noisy draws from X. In this paper, wefocus the hypothesis test

H0 :M does not have the topology of the sphere Sd, d ≥ 1 (1)

In order to reformulate this null hypothesis into a well defined problem, we use the methods of persistent homologyfrom the field of computational topology. The next section summarizes the theory behind persistent homology andthe process of approximating topological properties of M from X . This leads to a restatement of (1) in topologicalterms.

For a more thorough introduction to persistent homology, [2] is an excellent place to start.

2.1 Calculating the Homology of DataThe homology of a space M is a set of specific topological characteristics which describe M. Formally, it is a

series of functors that takesM to algebraic images. Two homeomorphic spaces necessarily have the same homology,but the converse is not always true. In principle, other functors exist, such as the fundamental group, but for dataanalysis homology is exclusively examined due to computational tractability. A more general approach falls underalgebraic topology. For a thorough treatment see [10].

Calculating homology from X is accomplished in two principle steps. First, build a simplicial complex, denotedby S(X , r), where r is a 1-dimensional parameter often relating to the level sets of a distance function on X . Andsecond, calculate the homology groups, Hp(S(X , r)), where p runs over ZZ+ and denotes dimension. Intuitively, fora well chosen value of r, S(X , r) is a polyhedral approximation toM. Thus, we can hope for isomorphisms betweentheir homology groups, Hp(M) ' Hp(S(X , r)).

The structure of S(X , r) is a collection of vertices, edges, triangles, and their higher dimension equivalents, collec-tively named simplices, together with information on how various simplices fit together. Imagine a multidimensional

1

Page 2: Are the data hollow?

jigsaw puzzle. There are many different approaches to constructing S(X , r) in the literature, including Cech, Vitoris-Rips, α-shape, and witness. In this paper we focus on witness complexes, and define them as follows. Discussion ofthe others may be found in [3].

• Let L ⊂ X be a set of vertices chosen at random.• For each x ∈ X , define m(x) to be the distance from x to the closest landmark, l ∈ L.• The edge, [ab], is then in S(X , t) if there exists a “witness” x ∈ X such that max{‖a−x‖, ‖b−x‖} ≤ t+m(x).• All higher dimensional simplices are in S(X , t) if and only if all of their edges are.

The biggest advantage of using the witness approach is that the number of simplices in S(X , t) is bounded bya function of |L|, the number of landmarks. Witness complexes are also computationally simple. These advantagesquickly become necessary in order to finish computation in a reasonable time frame. The trade off is witness complexescurrently lack theoretical tractability, so most results regarding them must be given heuristically. For the sake ofnotation, we now drop the parameters X and r, letting S .= S(X , r) when there is no ambiguity.

Define the boundary map, δk : Ck → Ck−1, which takes k-dimensional simplices to their (k−1)-dimensional faces.For example, the boundary of a triangle is its 3 edges. Next, define a k-cycle as a group of k-dimensional simplicieswith 0 boundary, mod2. The 3 edges of the triangle collectively have 0 boundary since each of the vertices arecounted twice, and is therefore a cycle. Then Hk(S) .= Zk/Bk+1, the k-dimensional cycle cosets created by dividingout those cycles which are boundaries of (k + 1)-dimensional groups of simplices. For instance, the three edges of atriangle would be collapsed in H1. Thus, Hk represents each “true” k-dimensional cycle, or hole, as an abelian group,and its rank, βk

.= rk(Hk), counts the number of k-dimensional holes. These ranks are the Betti-k numbers, with β0

counting the number of connected components. The Betti-k numbers ofM are its homology signature and are a wayto classify topological spaces. A single point and a solid sphere both have the same homology signature, (1, 0, . . . ),and are homologous; Sd has homology signature (1, 0, . . . , 1, 0, . . . ), where the second 1 is in the d-th position, andare not homologous for distinct d.

At this point, it may be tempting to reformulate our null hypothesis as (βi(S))∞i=0 6= (βj(Sd))∞j=0, where each sideof the inequality is an infinite dimensional vector. The issue is that it is a priori impossible to solve for an optimalr value, r∗ ∈ {r : Hp(S) ' Hp(M), p ≥ 0}, so we cannot be sure that the Betti signature for any given r equalsthat ofM. For this reason, persistent homology calculates the Betti signatures over a range of distance parameters,[r0, r1].

The definition of persistent homology is technical, but relies on the simple key fact that increasing distances,r < r′, create nested complexes, S(r) ⊆ S(r′). This allows one to track the evolution of cycles over r, and moreimportantly, gives homomorphisms between homology groups. Now any coset has a birth time (distance), t0, whenit first appears in Hk, and a death time, t1, where it is no longer non-bounding for t1 + ε, for all ε > 0. Only thesingle connected component coset can live forever as eventually all points in X are connected, and all k-holes closed.The life time of a coset is regarded as a measure of its intrinsicness, with longer living cosets more likely to correctlycharacterize M, and short lived ones regarded as homological noise. It is this last notion that topological statisticsattempts to rigorously quantify. A homological approximation to our original hypothesis is as follows.

Let G be a standard p-dimensional Gaussian with mean 0, so that in expectation G has the trivial homology of asingle point. Then an interval [t, t′] exhibiting non-trivial homology on a sample from G can be considered as arisingexogenous to any underlying homology. Each such interval will have a distribution induced by G, which we refer toas a null distribution. More formally, we define

TX ,M.= max{[t0, t1] ∈ [0, tE ] : (βk(S(X , t)))∞k=0 = (βk(M))∞k=0 for all t ∈ [t0, t1]]}.

as the maximum interval of distances where the correct homology ofM is expressed, and tE as the maximum distanceconsidered. We renormalize t→ t

tEto compare interval sets with different endpoints. The resulting null hypothesis

analogue to (1) is, for d ≥ 1H0 : |TX ,Sd | ∈ {|TG0,Sd | : G0 ∼ Nd(0, Id)}, (2)

with corresponding test statistic, |TX ,Sd |.It should be noted that the above null hypothesis is not in exact correspondence with (1) as in principle M may

have a non-trivial homology not isomorphic to Sd, with the distribution of |TX ,Sd | being far from that of |TG0,Sd |.This event can be interpreted by inclusion in the type 1 error of the original test. For results suggesting (2) is aconservative approximation to (1) see section 3.

The goal of the rest of this article will be to examine the power of the test based on the new definition (2) throughsimulation. We consider two categories of alternative hypotheses.

In the first category, we assume X to have uniform distribution on Sd convoluted with varying amounts ofperpendicular Gaussian noise.

H1a : X ∼ U(Sd) ∗N1(0, σ2) , σ ≥ 0. (3)

2

Page 3: Are the data hollow?

In the second category of alternatives, H2a , X is considered to be sampled from the surface of a manifold homeo-

morphic to Sd, but not congruent. In particular, we consider the shapes resulting when two ends of a circle or sphereare pushed towards each other. For an example, see figure 1(c).

The next section describes procedures to generate data according to the alternatives.

2.2 Generating Data from a ManifoldSampling from a general manifold can be done quite simply especially in the case of surfaces of revolution, see

[8] provides a review of the subject with statistical applications. Under H1a , we are interested in testing the noise

tolerance for correctly calculating the homology of a d-sphere. We only demonstrate the procedure for d = 2, notingthat the cases of d = 1, or 3 are exactly similar, only less informative. Points are generated on the 2-sphere accordingto the standard parametrization

x = sin(θ) cos(φ) ; y = sin(θ) sin(φ) ; z = cos(θ) , θ in [0, π] , φ in [0, 2π]. (4)

The parametrization imposes a particular density on θ and φ given by its Hausdorff area measure. Section 3.2.5of Federer (1996) gives a formula for explicitly calculating this density for any manifold M, which we reproducebelow for the case of M = S2.∫

A

f (g(y)) J2g(y)λ2(dy) =∫IR3

f(x)N(g|A, x)H3(dx), [9] (5)

where y = (θ, φ), x = (x, y, z), g : [0, π] × [0, 2π] → S2 is the parametrization (4), f : S2 → [0, 1] any specifieddensity on S2, J2 the 2-dimensional Jacobian, N(f |A, y) = #{x ∈ A : f(x) = y} is the multiplicity function, andH3(dx) the 3-dimensional Hausdorff measure. Intuitively, (5) accounts for stretching in the parametrized space dueto the curvature of S2. Drawing from S2 with density f now amounts to sampling from the density g ◦ Jmf(y), afternormalization, in the parameter space. Letting f be the uniform density, it is easy to show that fS2(y) .= g◦Jmf(y) =14π sin(θ). Figure 1 (a) displays a heat map of fS2(y) on S2.

Finally, normal noise is added to each point by specifying a distribution on the radius, r ∼ N(1, σ2), and applyingthe transformation h(y) = r ∗ g(y). The joint distribution of r, θ and φ is given by

fX(r, θ, φ) = fS2(θ, φ)f(r) =1√

32π3σe−

(r−1)2

2σ . (6)

It is easy to show that the expected squared distance of X from Sd is completely determined by the noise variance,σ. Also, the Gaussian noise is independent from the angular distribution of X. Figure 1 (b) shows a sample X fromX superimposed on a mesh of S2 with the distance σ2 marked.

X

Y

Z

(a)

X

Y

Z

●●

●●

●●

● ●

●●

● ●

●●

●●

●●

● ●

●●

●●

●●

●●

●●

● ●

●●

●●

●●

●●

●●

● ●

●●

●●

●● ●

● ●

●●

●●

●●

●●

● ●

● ●

●●

●●

●●

●●

●●

●●

●●

(b)

−1.0 −0.5 0.0 0.5 1.0

−1.

0−

0.5

0.0

0.5

1.0

x1

x2

(c)

●●

●●

●●

● ●

●●

●●

●●

●●

●●

●●

●●

●●●

●●●

●●

●●

●●

●●

●●

●●

●●

●●

●●●●

●●●

●●

●●

●●●

●●●●

●●●

●●

●●

●●

●●

●●

●●

●●

●●●

●●

●●●

●●

●●

●●

●●●

●●●●

●●

●●

●●

●●●●●●●●

●●●

●●

●●

●●

●●

●●

●●

●●

●●●

●●

●●

●●

●●●●

●●●

●●

●●

●●●●

●●●●●●

●●

●●

●●

●●

●●

●●

●●

● ●

●●

●●

●●

●●

●●

●●●●

●●

−1.0 −0.5 0.0 0.5 1.0

−0.

6−

0.4

−0.

20.

00.

20.

40.

6

x1

x2

(d)

Figure 1: Figure (a) is a heat map of uniform area measure on S2. Figure (b) shows a sample from X (σ = 0.1)superimposed. Figure (c) shows manifolds parametrized by fp for α in {0, .25, .5, 1, 3, 5} in increasing order ofindentation. Figure (d) gives a sample from (7), with α = 5.

The second set of alternatives, H2a , continuously transform S1 to resemble a 1-sphere less and less, while keeping

homology intact. Let

f+p (θ; r, α) .=

{r cos(θ) , r sin(θ)−

(e−α(r cos(θ))2 − e−α

)}; θ ∈ [0, π] ; d, r ∈ IR+,

f−p = f+p (θ−π; r, α), and fp(θ; r, α) = I{θ∈[0,π]}f

+p (θ; r, α)−I{θ∈[π,2π]}f

−p (θ; r, α). Increasing the parameter α mimics

the action of placing 2 fingers on opposite sides of a balloon and attempting to touch them together. Manifolds forrepresentatives α’s are pictured in figure 1(c). The corresponding (non-normalized) area measure for f+

p is

fY (θ; r, α) = r[√

1− 2 cos(θ)A(θ) +A(θ)2], A(θ) = α sin(2θ)e−α cos2(θ), (7)

3

Page 4: Are the data hollow?

and symmetric for f−p . Although we only consider the two dimensional case, higher dimension versions of fp canbe calculated through rotations about a bisecting axis.

Although the area measure (7) is clearly integrable, no simple analytical solution exists and we resort to MarkovChain Monte Carlo (MCMC) methods to sample data. Figure 1(d) shows a sample from distribution function x,parametrized by fp, and α = 5. The sample appears uniformly distributed despite heavy curvature in the manifold.

For numerical analysis, we consider the cases of d ∈ {1, 2, 3} and σ ∈ [0, 1] for H1a , corresponding to a noise

standard deviation up to the radius of Sd, and d = 2, α ∈ {1, 2, . . . , 5} for H2a .

3 ResultsOur first result suggests that using the distribution of T under the null hypothesis of (2) can be regarded as a

conservative estimate of the more general test (1). We generate comparison data from bi-variate Gaussian mixtures,centered at the origin. Each mixture consists of two Gaussians with equal absolute variance, rotated either 45degrees clockwise or counterclockwise. This results in an “X” shaped distribution, examples of which are given infigure 2(a)-(c), with 95% equi-confidence ellipses drawn for each Gaussian. A 2-sample Anderson-Darling test wasthen performed to compare the resulting empirical distributions of T when 200 samples of 1000 points each aredrawn from the mixed distributions above, and the null G0. The resultant p-values are 0.279, 0.016 and 0 suggestingthat while there is not enough evidence to prove TG0 and T under distribution (a) have dissimilar distributions, thedistribution of T under either case (b) or (c) is significantly different from that of TG0 .

From here, we calculate confidence intervals for the “effective α” values of the test (2) in the cases (a), (b),and (c). We define the effective α value, α, by the type I error rate under an alternate null hypothesis, using theoriginal critical value. The value α answers the following question: assuming the correct null distribution to usewas actually (a), (b), or (c), what significance level are we testing by using a critical value from G0? A confidenceinterval for α can be readily attained by estimating the G0 critical value from its empirical distribution, TαG0

, andthen performing 2000 bootstrap resamples of the statistic pB(α) = P (T ≥ TαG0

). Since it is reasonable to assumepB(α) may not be variance stabilized, we use a bias-corrected and accelerated (BCa) method to calculate bootstrapconfidence intervals. Resulting 90% intervals for pB(0.05) are [0.025, 0.075], [0.015, 0.055], and [0.000, 0.020]. Hence,among these examples, greater dissimilarity of the test statistic T from TG0 , in distribution, corresponds to a lowereffective alpha.

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●● ●

●●

●●

● ●●

●●

●●

●●

●●

● ●

●●

●●

●●

●●

● ●

●●

●●

●●

●●

●●

●●

●●

●●

●●● ●

●●

●●

●●

●● ●

●●

●●

●●

●●

●●

●●

●●

●● ●

● ●●

●●

● ●

●●●

●●

● ●

●●

●●

●●

●●●

● ●

●●

●●

● ●

●●

●●

●●

●●

●●

●●

● ●

●●

●●

●●

●●

●●

●●

● ●

● ●

−3 −2 −1 0 1 2 3

−3

−2

−1

01

23

x1

x2

(a)

●●

●●

● ●

●●

●●

● ●

●●

●●

●●●

● ●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

● ●●

● ●

●●

●●

● ●

● ●

●●

●●

● ●

●●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

● ●

●● ●

●●

●●

●●

● ●

●●

●●

●●

●●

●●

●●

●●

●●

● ●

● ●

● ●

● ●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

−3 −2 −1 0 1 2 3

−3

−2

−1

01

23

x1

x2

(b)

●●●

●●

●●

● ●

●●

● ●

●●

●●●

● ●

●●

●●

●●

●●

●●

●●

● ●

●●

●●

●●

● ●

●●

●●

●●

●●

●●

●●

●●

●●

● ●

●●

●●

●●

● ●

●●

● ●

● ●●

●●

●●

●●

● ●

● ●

●●

●●

●●

● ●

●●

●●

●●

●●

●●

●●

●●

●●

●●

● ●

●●●

●●

●●

●●●

●●

●●

●●

●●

●●

●●

● ●

●●

● ●

●●

● ●

●●

● ●

●●

●●

●●

●●

●●

−3 −2 −1 0 1 2 3

−3

−2

−1

01

23

x1

x2

(c)

●●

●●

●●

●●

●●

● ●

●●●

●●

●●

● ●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

● ●

●●

●●

●●

●●

● ●

●●

●●

● ● ●

● ●

●●

●●

● ●

●●

● ●

●●

●●

●●

●●●

●●

●●●

●●

●●

● ●

●●

● ●

● ●

●●

●●

●●

●●

●●

●●

●●

●●

● ●

●●

●●

● ●

●●

● ●

●●

●●

●●

●●

●●

●●

● ●

●● ●

●●

●●

●●

●●

●●

●●

●●

●●

●●●

● ●

●●

●●

●●

−3 −2 −1 0 1 2 3

−3

−2

−1

01

23

x1

x2

(d)

Figure 2: Sample taken from Gaussian mixture distribution: Z = X1 +X1, where X1 and X2 are bivariate Gaussiansof equal absolute variance, rotated 45 degrees. A 95% equi-confidence ellipse is drawn for X1 and X2. Figure (d)shows a sample from G0 for comparison.

The next set of tests examine the noise tolerance of TX with X distributed according to (6), by sampling from H1a .

The dimensions of Sd considered are d in {1, 2, 3} with respective sets of noise parameters σ1 = {0.05, 0.10, . . . , 0.95},σ2 = {0.05, 0.10, . . . , 0.70}, and σ3 = {0.05, 0.10, . . . , 0.25}. Decreasing set sizes for σi were taken due to computa-tional considerations. As before we take 200 samples of 1000 points for each dimension and noise level. To estimatechanges in power directly, we calculate bootstrap confidence intervals for the power of the empirical critical value,P (TX ≥ TαG0

|X ∼ U(Sd) ∗N1(0, σ2)), similar to the procedure above. Descriptive results are given in figure 3(a)-(c).The overall message seems to be that recovering a sphere topology is highly unlikely when sampling noise exceeds30% of the radius of the manifold. We now turn to examine further why this break down occurs.

One test in this direction is to examine which levels of σ are critical in determining the distribution of TX(σ). Thatis, what is the set of σk such that Tσk−1 has a different distribution from Tσk? Note we have simplified notation bydefining Tσ

.= TX(σ). Figures 3(d)-(f) show quantile-quantile plots comparing the distributions of Tσk to a standardnormal. The results indicate a partition into an active period when noise is below 30− 35% of the radius of Sd, anda tame period for greater noise. The active period is characterized by a monotone shift from negative skew centeredat a proportion close to 1, to a symmetric distribution, to a positive skew centered about 0. The tame period ischaracterized by distributions all similar to that of TG0 , represented by the bold, red line in figures 3(d)-(f). Thereis also some evidence of a dimensionality effect. In the case of S3, embedded in 4 dimensions, the noise breakdown

4

Page 5: Are the data hollow?

● ● ● ●

●●

●●

● ●●

●●

20 40 60 80

0.2

0.4

0.6

0.8

1.0

SD Noise

Est

imat

ed P

ower

− 90% CI

(a)

● ● ●

●●

●●

●●

● ● ● ●

10 20 30 40 50 60 70

0.2

0.4

0.6

0.8

1.0

SD Noise

Est

imat

ed P

ower

− 90% CI

(b)

●●

5 10 15 20 25

0.2

0.4

0.6

0.8

1.0

SD Noise

Est

imat

ed P

ower

− 90% CI

(c)

−3 −2 −1 0 1 2 3

0.0

0.2

0.4

0.6

0.8

1.0

Theoretical Quantiles

Em

piric

al Q

uant

iles

of T

● ● ● ●● ● ● ●

●●●●

●●●●●

●●●●●●●●●●●●

●●●

●●●●●●●●

●●●●●●●●●●

●●●●●●●●●

●●●●●●●●

●●●●●●●●

●●●●●●●●●●●●●

●●●●●

●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●

●●●●●●●●●●●●●●●●●

●●●●●●●

●●●●

●●●●●●●●●●

●●●●●●

●●●●●●●●●●●●●●●●●●●●● ●

● ● ● ● ● ●

(d)

−3 −2 −1 0 1 2 3

0.0

0.2

0.4

0.6

0.8

1.0

Theoretical Quantiles

Em

piric

al Q

uant

iles

of T

● ●

●● ●

● ●● ● ●●●●●●●●●●●●●

●●●●●●●●●●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●● ● ● ● ● ● ● ● ●

(e)

−2 −1 0 1 2

0.0

0.2

0.4

0.6

0.8

1.0

Theoretical Quantiles

Em

piric

al Q

uant

iles

of T

● ●● ●

● ● ● ● ● ● ● ●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●●●●●

● ●

●● ● ● ● ● ●

●●

(f)

● ● ●

● ●

1 2 3 4 5

0.0

0.2

0.4

0.6

0.8

1.0

SD Noise

Est

imat

ed P

ower

− 90% CI

(g)

1000 1500 2000 2500 3000 3500 4000

0.0

0.2

0.4

0.6

0.8

1.0

sample size (n)

Est

imat

ed P

ower

− 90% CI

(h)

Figure 3: Top Row: Bootstrap 90% confidence intervals for the empirical power of representative distributions fromH1a . Middle Row: Quantile-quantile plots of Tσ, plotted against a standard normal distribution. In all cases, the

distributions may be viewed top to bottom in increasing levels of σ. Black lines with symbols correspond to thechaotic period, while lighter lines with no symbols demonstrate the tame period. The null distribution (bold / red)is included for comparison. For both top rows, the left column corresponds to tests performed on S1, the middle toS3, and the right to S4. Bottom Row: Bootstrap 90% confidence intervals for the empirical power of representativedistributions from H1

a . Figure (a) fixes n = 1000, while (b) fixes α = 3, letting n vary.

and partitioning σ seem to be closer to .25, although we do not have data on higher noise levels to verify this claim.Two sample Anderson-Darling tests comparing distributions for consecutive σ’s support these results.

A final series of tests inspects the power of the test (2) against a particular class of diffeomorphisms of Sd, namelythose in H2

a , and inspects the consistency of the estimator TX in this class.Confidence intervals for empirical power are computed through similar bootstrap BCa intervals. As seen in figure

3(g), power is lost as the coefficient α increases, corresponding to indentation of the circle at both sides. This isexpected as indentation would cause bisecting edges to form at smaller distances in the simplicial complex. In otherwords, complexes not homeomorphic to S1 are present for a larger fraction of distance parameters observed.

A second conclusion is that Tx is a consistent estimator under H2a . For figure 3(h), we performed the same

power calculations as above, fixing β = .30, and letting the sample size n vary. As n increases, empirical power also

5

Page 6: Are the data hollow?

increases, practically to the same level as no indentation.Although no theoretical results into the consistency of Tx under H2

a are given, an intuitive argument is as follows.Since the manifold in question has finite curvature everywhere, there must be some threshold distance ε such thatany edge connecting two points with length less than ε does not bisect the manifold in a way that alters homology.With a larger amount of points, the average edge length decreases, while the distance to bisect the circle’s center,the primary way to alter homology in our class of manifolds, stays constant. Hence, increasing n approaches thethreshold distance in expectation, causing the correct homology of S1 to appear for a larger fraction of distanceparameters.

4 ConclusionIn this paper, we have demonstrated the use of persistent homology to reliably extract topological information

from data. A basic hypothesis test on homological output was able to discover hollowness in data, which can begeneralized to any number of dimensions. The test has the benefits of conservative rejection regions, robustness tosampling noise, and consistency against diffeomorphisms for the examples considered.

Of course, this is a first step in studying the application of persistence to manifold learning. Both fields areactive in their research, and possibilities for their synthesis continue to grow. One particular area of interest forfuture research is zig-zag persistence [4], which calculates homology over partitions of the space data sits in, as wellas intervals of distance parameters. This allows one to discover hollowness in data resembling Swiss cheese, that is,with multiple holes.

Also the homology of a sphere is but one topological signature. It will be interesting to examine if other manifoldshave their own statistical interpretation.

AcknowledgmentsThis work was funded by a grant from the DMS-NSF VIGRE program to Stanford Statistics departments. We

are thankful to Gunnar Carlsson, Persi Diaconis, Dmitriy Morozov, and Jennifer Kloke for enlightening discussions.

References[1] M. Belkin and P. Niyogi. Laplacian eigenmaps for dimensionality reduction and data representation. Neural

Computation, 15(6), 2003.[2] Gunnar Carlsson. Topology and data. Bull. Amer. Math. Soc. (N.S.), 46(2):255–308, 2009.[3] Gunnar Carlsson and Vin de Silva. Topological estimation using witness complexes. In M. Alexa and

S. Rusinkiewicz, editors, Eurographics Symposium on Point-Based Graphics (2004), page na, 2004.[4] Gunnar Carlsson and Vin de Silva. Zigzag persistence, 2008.[5] Gunnar Carlsson and Tigran Ishkhanov. A topological analysis of the space of natural images. preprint, 2008.[6] Gunnar Carlsson, Tigran Ishkhanov, Vin Silva, and Afra Zomorodian. On the local behavior of spaces of natural

images. Int. J. Comput. Vision, 76(1):1–12, 2008.[7] Omar De la Cruz. Geometric Approaches in the analysis of Genetic Data. PhD thesis, University of Chicago,

2008.[8] P. Diaconis, S. Holmes, and M. Shahshahani. Sampling from a manifold. 2009.[9] Herbert Federer. Geometric measure theory. Springer-Verlag Berlin Heidelburg, 1996.

[10] Allen Hatcher. Algebraic topology. Cambridge University Press, Cambridge, 2002.[11] Ann B. Lee, Kim S. Pedersen, and David Mumford. The nonlinear statistics of high-contrast patches in natural

images. Int. J. Comput. Vision, 54(1-3):83–103, 2003.[12] Perry P. and de Silva V. Plex: Simplicial complexes in matlab. available at

http://comptop.stanford.edu/programs/, 2006.[13] Lawrence K. Saul and Sam T. Roweis. Think globally, fit locally: unsupervised learning of low dimensional

manifolds. J. Mach. Learn. Res., 4(1):119–155, March 2003.[14] Gurjeet Singh, Facundo Memoli, Tigran Ishkhanov, Guillermo Sapiro, Gunnar Carlsson, and Dario L. Ringach.

Topological analysis of population activity in visual cortex. J. Vis., 8(8):1–18, 6 2008.[15] Deborah F. Swayne, Duncan Temple Lang, Andreas Buja, and Dianne Cook. Ggobi: evolving from xgobi into

an extensible framework for interactive data visualization. Comput. Stat. Data Anal., 43(4):423–444, 2003.[16] J.B. Tenenbaum, V. de Silva, and J.C. Langford. A global geometric framework for nonlinear dimensionality

reduction. Science, 290(2319), 2000.[17] Joshua B. Tenenbaum, Vin de Silva, and John C. Langford. A Global Geometric Framework for Nonlinear

Dimensionality Reduction. Science, 290(5500):2319–2323, 2000.[18] van der Schaaf A. van Hateren J.H. Independent component filters of natural images compares with simple cells

in primary visual cortex. Proc.R.Soc.Lond B, 265:359–356, 1998.[19] K.J. Worsley. The geometry of random images. Chance, 9:27–40, 1996.[20] Afra Zomorodian and Gunnar Carlsson. Computing persistent homology. Discrete Comput. Geom., 33(2):249–

274, 2005.

6