Voronoi Methods for Assessing Statistical Dependence · 2017-12-01

Voronoi Methods for Assessing Statistical Dependence

Jonas Mueller [email protected]

32 Vassar St., Cambridge, MA 02139

David Reshef [email protected]

32 Vassar St., Cambridge, MA 02139

Abstract

In this work, we leverage the intuitive geometry of the Voronoi diagram to develop a new class of methods for quantifying and testing for statistical dependence between real-valued random variables. The contributions of this work are threefold: First, we introduce the utility of probability integral transforms of the datapoints to produce Voronoi diagrams which exhibit desirable properties provided by the theory of copula distributions. Second, we propose a novel nonparametric statistical test (VORTI) to determine whether two variables are independent based on the areas of the cells in the Voronoi diagram. Based on similar ideas, we finally introduce a new method (SVORMI) for estimating mutual information which relies on a novel regularization technique we found no mention of in the literature on density estimation via Voronoi tessellation. Through extensive simulation experiments, we demonstrate that our proposed estimator is competitive with the state-of-the-art k-nearest neighbor mutual information estimator of Kraskov et al. (2004).

Introduction

Detecting various forms of deviations from statistical independence between random variables is a cornerstone of statistics. A wide range of approaches to this problem have been explored, ranging from correlation coefficients based on data values/ranks/distances [Pearson (1896); Hazewinkel (2001); Kendall (1938); Szekely and Rizzo (2009)], to Fourier or RKHS representations [Stein and Weiss (1971); Gretton et al. (2005, 2007)], and information-theoretic tools [Cover and Thomas (2006)]. Each class of methods has its advantages and disadvantages and relies on different assumptions about the relationships present in the data.

An extremely popular measure of dependence in machine learning is the mutual information (see Definition 1). Intuitively, mutual information (MI) measures the degree to which knowing the value of one variable gives information about another. It has several properties that make it a desirable basis for a measure of dependence: it is non-negative and symmetric, it is invariant under order-preserving transformations of the random variables X and Y, and, most importantly, it equals zero precisely when X and Y are statistically independent. These properties, which are not shared by simpler measures of correlation such as Pearson correlation, have led to its widespread use in diverse applications [Elemento et al. (2007)].


The challenge in applying mutual information to quantify relationships between continuous variables lies in the fact that it is defined as a functional of their unknown probability density function (p.d.f.), which in practice must be estimated from finite data samples from this distribution. This task has proven difficult, leading to the development of a wide range of mutual information estimators based on different techniques. The simplest estimators rely on discretizing the data via gridding to estimate the joint distribution within each grid cell [Steuer et al. (2002)]. More sophisticated methods utilize kernel density estimation (KDE) to estimate a smooth joint distribution [Moon et al. (1995)]. Compared with gridding-based approaches, KDE mutual information estimators exhibit a better mean square error rate of convergence of the estimate to the underlying density, an insensitivity to the choice of origin, and the ability to specify more flexible window shapes than the rectangular window used in grid-based approaches [Steuer et al. (2002); Moon et al. (1995)].
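To make the gridding baseline concrete, here is a minimal sketch of such a discretization-based estimator (our own illustration: the function name, bin count, and NumPy implementation are assumptions, not details from Steuer et al.):

```python
import numpy as np

def grid_mi(x, y, bins=10):
    """Naive gridding estimator of MI (in nats): discretize the sample into
    a bins-by-bins grid and plug the empirical cell frequencies into the
    discrete mutual information formula."""
    pxy, _, _ = np.histogram2d(x, y, bins=bins)
    pxy /= pxy.sum()                          # joint cell frequencies
    px = pxy.sum(axis=1, keepdims=True)       # marginal of x (column vector)
    py = pxy.sum(axis=0, keepdims=True)       # marginal of y (row vector)
    mask = pxy > 0                            # convention: 0 * log(0) = 0
    return float(np.sum(pxy[mask] * np.log(pxy[mask] / (px @ py)[mask])))

rng = np.random.default_rng(0)
z = rng.multivariate_normal([0, 0], [[1.0, 0.9], [0.9, 1.0]], size=5000)
mi_hat = grid_mi(z[:, 0], z[:, 1])
# reference: the true MI of this bivariate Gaussian is -0.5*log(1 - 0.9**2) nats
```

With 10 bins per axis this plug-in estimate inherits the discretization issues discussed above (sensitivity to bin placement and a rectangular window); the Gaussian example only illustrates that the estimator returns a sensible non-negative value.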

Exploring the geometric properties of randomly sampled data points from some bivariate distribution, we find that the Voronoi diagram exhibits different characteristics in the case where the two random variables are statistically dependent than if the variables were independent. By exploiting these properties, we are able to develop new methods for assessing dependence and measuring mutual information. Our approaches are both intuitive and highly successful in simulation studies.

Notation

In this paper, all logarithms are assumed to be natural (base e), so MI is estimated in nats rather than bits (as is common practice in the continuous setting). Also, all geometry in this work is assumed to be Euclidean. To keep our exposition clean and intuitive, we limit our discussion to univariate real-valued random variables X, Y ∈ R (unless explicitly stated otherwise), although all of our methods may be applied to higher-dimensional variables in practice. Furthermore, the primary goal of many data analyses is to identify and measure pairwise dependencies between single features which are commonly measured on a continuous scale in R, so our restriction to the 1-D setting is widely applicable.

Let F_XY denote the joint distribution of (X, Y), let F_X, F_Y denote the respective marginal distributions (as cumulative distribution functions), and let f_XY, f_X, f_Y denote the corresponding probability density functions. Given n i.i.d. samples {(x_i, y_i)}_{i=1}^n from F_XY, we wish to measure the statistical dependence between X and Y by estimating I(X, Y), the continuous mutual information of X and Y.

Definition 1 The mutual information of X and Y is given by

I(X, Y) := \int_{\mathbb{R}^2} f_{XY}(x, y) \log\left(\frac{f_{XY}(x, y)}{f_X(x) f_Y(y)}\right) d(x, y)

Copulas

In our methods, we first apply the probability integral transforms U = F_X(X), V = F_Y(Y) so that the random vector (U, V) has uniformly distributed marginals. By Sklar's Theorem, any distribution F_XY can be expressed in terms of univariate marginal distributions and a copula in which the dependence between X and Y is fully encoded [Sklar (1959)].

Definition 2 The copula of (X, Y) is defined as the joint c.d.f. of (U, V):

C(u, v) := \Pr[U \le u, V \le v] = \Pr[X \le F_X^{-1}(u), Y \le F_Y^{-1}(v)] \quad (1)

Letting c(u, v) denote the p.d.f. corresponding to the copula c.d.f., a simple calculation demonstrates how the original joint density may be expressed in terms of the copula density of the transformed variables:

f(x, y) = c (FX(x), FY (y)) fX(x)fY (y) (2)

For independent random variables X, Y we have C(u, v) = u · v, while if X and Y are perfectly correlated, then C(u, v) = min{u, v}. This property of the copula is highly desirable for measuring dependence: no matter what form the joint distribution of X, Y takes (assuming it is sufficiently "nice", e.g. smooth and supported on a set of nonzero measure), if X and Y are independent, then after the probability integral transform, the resulting joint distribution of U, V is always identical to the product distribution of standard Uniform[0, 1] variables.

Since the true distributions are unknown, empirical c.d.f.s F_X, F_Y are used in practice to transform each sample (x_i, y_i) → (u_i = F_X(x_i), v_i = F_Y(y_i)), where F_X(x) = \frac{1}{n} \sum_{i=1}^n 1[x_i \le x]. As the sample size n grows, these one-dimensional empirical c.d.f.s rapidly converge pointwise to their population counterparts at the standard √n rate given by the central limit theorem (the Glivenko-Cantelli theorem shows that the convergence is in fact uniform), and a number of weak convergence results have also been established for the empirical copula process.
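These empirical transforms are simply normalized ranks, so the empirical copula transform can be sketched in a few lines (the helper name is our own, not from the paper):

```python
import numpy as np

def empirical_pit(z):
    """Empirical probability integral transform: u_i = F(z_i), i.e. the
    normalized rank (1/n) * #{j : z_j <= z_i}, assuming no ties."""
    z = np.asarray(z, dtype=float)
    # argsort of argsort yields 0-based ranks; +1 gives ranks 1..n
    return (np.argsort(np.argsort(z)) + 1) / len(z)

rng = np.random.default_rng(0)
x = rng.normal(5.0, 10.0, size=1000)   # an arbitrary continuous marginal
u = empirical_pit(x)
# regardless of the marginal, the transformed sample is exactly {1/n, ..., 1}
assert np.allclose(np.sort(u), np.arange(1, 1001) / 1000)
```

The final assertion makes the key point of this section explicit: whatever the marginal distribution, the transformed values occupy an evenly spaced grid in (0, 1].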

Voronoi diagrams for data analysis

After we have transformed our data via the probability integral transform, all points lie in the 2-D unit square B = [0, 1]^2 ⊂ R^2 (or the unit cube if dimension d > 2). The Voronoi diagram/tessellation constructed from points x_1, . . . , x_n ∈ B, which we denote Vor(x_1, . . . , x_n), is the unique partitioning of B into cells C_1, . . . , C_n such that each polygon C_i = {x ∈ B : ||x − x_i||_2 ≤ ||x − x_j||_2 ∀ j ≠ i} is the region of B which is nearer to x_i than to any of the other n − 1 points (see Figure 1 for an example). We include the boundary of B as faces of the cells which lie at the border.

Via the sweep-line algorithm of Fortune (1986), the Voronoi diagram of bivariate real-valued data points can be efficiently constructed in O(n log n) time and O(n) space. For higher-dimensional data, one may use the Bowyer-Watson incremental insertion algorithm [Bowyer (1981); Watson (1981)] or the method of Dwyer (1991), which is proven to run in linear expected time for random points (for fixed d). These space-partitioning structures have a long history in computational geometry, but they have drawn only limited interest from statisticians, predominantly in the context of spatial data analysis [Barr (2008); Chui (2003)]. A major advantage of Voronoi methods in data analysis is their adaptivity to the spatial structure of the sampled points. Given n points from an unknown distribution,



Figure 1: Voronoi-based density estimates in the case of: (a) independent uniform random variables; (b) highly correlated Gaussian variables. The points are plotted after application of the probability integral transform, and the shade of each Voronoi cell corresponds to the VDE estimate of the underlying p.d.f. (darker = higher density).

the standard Voronoi density estimator (VDE) produces the following step-function approximation to the probability density function (where 1(·) denotes the indicator function):

\hat{f}(p) = \frac{1}{n} \sum_{i=1}^n \frac{1[p \in C_i]}{\mathrm{Vol}(C_i)} \quad \forall\, p \in \mathbb{R}^d \quad (3)

The VDE may be derived from the fact that if we assume the underlying density is truly constant in each cell, i.e. f(p) = c_i ∀ p ∈ C_i, then we must have:

\Pr[p \in C_i] = \int_{C_i} f(p) \, dp = \int_{C_i} c_i \, dp \;\Longrightarrow\; c_i = \frac{\Pr[p \in C_i]}{\mathrm{Vol}(C_i)}

where we have observed exactly one of our samples in each Voronoi cell, leading to the empirical estimate \Pr[p \in C_i] = 1/n. Intuitively, we expect more data from high-density regions to appear in our sample, and therefore the Voronoi cells in these regions will tend to have smaller volumes (in our 2-D setting, we use the terms volume and area interchangeably). Meijering (1953) showed that the expected Voronoi cell area of points sampled from a homogeneous Poisson process of rate λ is λ^{-1}, a result which Barr (2008) have extended, demonstrating approximate unbiasedness of the Voronoi cell areas as estimators for the intensity of an inhomogeneous Poisson process. Although Browne (2007) find that the bias of the VDE is remarkably low, they state that its variance is too large (as each Voronoi cell contains only a single point) to produce useful density estimates in practice. A number of variance-reducing modifications have been suggested to produce better estimators based on these geometric objects by imposing additional smoothness [Browne (2007, 2012); Liu (2007); Barr (2008)]. Because we are interested in measuring statistical dependence via the mutual information, which involves averaging a function of the joint density over the sample space, we hypothesize that the effect of the highly variable Voronoi-based density estimates is mitigated by the concentration properties of the average (although the density estimates in neighboring cells are certainly not independent, the dependence between the estimates diminishes rapidly for cells which are farther apart in the Voronoi diagram, so we can still expect the mean estimate over all cells to exhibit some concentration).
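The VDE of equation (3) can be computed with SciPy's Qhull-based Voronoi routine. Bounding the border cells is the only subtlety; one standard device (our choice for this sketch, not prescribed by the paper) is to reflect the points across all four sides of the unit square, which makes each original cell a bounded polygon equal to its clipped version:

```python
import numpy as np
from scipy.spatial import Voronoi

def voronoi_cell_areas(pts):
    """Areas of the Voronoi cells of points in [0,1]^2, clipped to the square
    by reflecting the points across all four sides before tessellating."""
    n = len(pts)
    reflections = [pts * [-1, 1],                                # across x = 0
                   pts * [1, -1],                                # across y = 0
                   np.column_stack([2 - pts[:, 0], pts[:, 1]]),  # across x = 1
                   np.column_stack([pts[:, 0], 2 - pts[:, 1]])]  # across y = 1
    vor = Voronoi(np.vstack([pts] + reflections))
    areas = np.empty(n)
    for i in range(n):   # the first n input points are the originals
        v = vor.vertices[vor.regions[vor.point_region[i]]]
        # shoelace formula (SciPy orders 2-D region vertices around the cell)
        areas[i] = 0.5 * abs(np.dot(v[:, 0], np.roll(v[:, 1], 1))
                             - np.dot(v[:, 1], np.roll(v[:, 0], 1)))
    return areas

rng = np.random.default_rng(0)
pts = rng.uniform(size=(200, 2))
areas = voronoi_cell_areas(pts)
vde = 1.0 / (len(pts) * areas)          # VDE value on each cell, as in (3)
assert np.isclose(areas.sum(), 1.0)     # the clipped cells tile the square
```

The closing assertion is a useful sanity check: since every point of the square is nearer to an original point than to any reflection, the clipped cells partition [0, 1]^2 exactly.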

Testing for statistical dependence

We have so far established two fundamental facts:

1. The cell areas of the Voronoi tessellation for a random set of points offer an unbiased approximation of the local underlying density from which the points are drawn.

2. After the probability integral transform, any independent X, Y follow the independent-bivariate uniform distribution over the unit square (with equal probability mass at each point in this region).

Integrating these two ideas, we expect the areas of the cells in the Voronoi diagram over {(u_i, v_i)}_{i=1}^n to exhibit more uniformity if X and Y are independent than otherwise. To empirically investigate this assumption, we generate the following two datasets of independent variables with very different marginal distributions and three datasets of different types of dependent variables:

• 1000 i.i.d. points where X_i ∼ Uniform[0, 1], Y_i ∼ Uniform[0, 1]

• 1000 i.i.d. points where X_i ∼ N(5, 10), Y_i ∼ N(2, 3)

• 1000 i.i.d. points where (X_i, Y_i) follow a bivariate Gaussian with correlation = 0.9

• 1000 i.i.d. points where (X_i, Y_i) follow a bivariate Gaussian with correlation = 0.2

• 1000 i.i.d. points where X_i ∼ N(0, 2) and Y_i | x_i ∼ sin(x_i) + ε where ε ∼ Uniform[−1/2, 1/2]

(see Figure 6 for an example dataset)

For each of these datasets, we probability-integral-transform the points, generate the corresponding Voronoi diagram, and compute the areas of each cell in the tessellation. This procedure is repeated 10 times, and we plot the resulting distribution of Voronoi cell areas in Figure 2.

Based on the distribution of the cell areas, which is clearly altered under dependence between X and Y, we propose the Voronoi test for independence (VORTI) for testing the following hypotheses:

H0 : X and Y are independent

HA : X and Y are not independent (i.e. P(Y | X) ≠ P(Y))

VORTI (Voronoi Test for Independence)



Figure 2: Violin plots of the distribution of cell areas in the Voronoi diagrams corresponding to 1000 randomly sampled points from different distributions. Each plot depicts a kernel density estimate in the vertical direction fitted to the Voronoi cell areas from 10 datasets of size 1000 sampled from each distribution. The median Voronoi cell area is represented by the white dot, which is in turn incorporated into a boxplot of the cell areas (shown in black).

1. Compute empirical c.d.f.s F_X, F_Y (where F_X(x) = \frac{1}{n} \sum_{i=1}^n 1[x_i \le x])

2. For i ∈ {1, . . . , n}: transform the data via u_i = F_X(x_i), v_i = F_Y(y_i)

3. Construct the Voronoi diagram Vor((u_1, v_1), . . . , (u_n, v_n))

4. Compute the empirical distribution Areas, the list of the areas of each cell in the Voronoi diagram

5. Look up the appropriate approximate null distribution F_0 for this sample size n from a pre-computed table

6. Perform a one-sample Kolmogorov-Smirnov test of whether Areas is a sample from F_0, and return the p-value of this test as the p-value for this independence test
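Under the stated large-n assumptions, the steps above can be sketched end to end as follows. For illustration we simulate the null distribution on the fly rather than consulting a pre-computed table, approximate the one-sample KS test against the simulated F_0 by SciPy's two-sample KS test, and clip the Voronoi cells to the square via reflections; all helper names and these implementation choices are our own:

```python
import numpy as np
from scipy.spatial import Voronoi
from scipy.stats import ks_2samp

def rank_transform(z):
    # empirical probability integral transform (normalized ranks)
    return (np.argsort(np.argsort(z)) + 1) / len(z)

def cell_areas(pts):
    # Voronoi cell areas clipped to [0,1]^2 via four-fold reflection
    n = len(pts)
    aug = np.vstack([pts, pts * [-1, 1], pts * [1, -1],
                     np.column_stack([2 - pts[:, 0], pts[:, 1]]),
                     np.column_stack([pts[:, 0], 2 - pts[:, 1]])])
    vor = Voronoi(aug)
    areas = np.empty(n)
    for i in range(n):
        v = vor.vertices[vor.regions[vor.point_region[i]]]
        areas[i] = 0.5 * abs(np.dot(v[:, 0], np.roll(v[:, 1], 1))
                             - np.dot(v[:, 1], np.roll(v[:, 0], 1)))
    return areas

def vorti(x, y, B=20, rng=None):
    """Approximate p-value for H0: X and Y are independent."""
    if rng is None:
        rng = np.random.default_rng()
    n = len(x)
    obs = cell_areas(np.column_stack([rank_transform(x), rank_transform(y)]))
    # simulate the null distribution F0 of cell areas under independence
    null = np.concatenate([
        cell_areas(np.column_stack([rank_transform(rng.uniform(size=n)),
                                    rank_transform(rng.uniform(size=n))]))
        for _ in range(B)])
    return ks_2samp(obs, null).pvalue

rng = np.random.default_rng(0)
x = rng.normal(size=500)
p_dep = vorti(x, np.sin(x) + rng.uniform(-0.5, 0.5, size=500), rng=rng)
p_ind = vorti(rng.normal(5, 10, size=500), rng.normal(2, 3, size=500), rng=rng)
```

On the sine-plus-noise data the cell-area distribution deviates strongly from the simulated null, so p_dep should be small, while p_ind behaves like a draw from an (approximately) uniform null p-value distribution.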

VORTI involves a pre-computation stage which is performed only once, to obtain simulated null distributions of the Voronoi-cell areas under independence for large sample sizes n. In this pre-computation, we generate B simulated datasets of size n in which both variables are i.i.d. samples from the Uniform[0, 1] distribution. For each dataset, we construct the Voronoi diagram (after the empirical copula transform) and calculate the areas of each Voronoi cell. Since the joint distribution of the probability-integral-transformed U, V follows the independent-uniform distribution for any independent X, Y (regardless of their distribution), the distribution of these pre-computed areas converges (as B → ∞) to F_0, the true distribution of Areas under the null hypothesis H0 (we set B = 1000 in our experiments). Note that n must be very large; we cannot obtain an accurate null distribution of our test statistic this way in the small sample-size regime, because there is no guarantee that the empirical copula process which describes our finite sampled data is close to the independent-uniform joint distribution. For small sample sizes, the underlying joint distribution of X and Y thus cannot be ignored (even under independence), as it will heavily influence the distribution of the Voronoi-cell Areas statistic computed from the probability-integral-transformed data.

Nevertheless, assuming n is sufficiently large, we have: if X and Y are independent, then Areas is approximately a sample from this null distribution, so we can test whether the latter is plausible to determine whether the former holds. Furthermore, if X and Y are not independent, then the corresponding U, V will not be uniformly distributed in the unit square, so there will be regions of the unit square where the joint density of (U, V) is > 1, in which Voronoi cell areas are expected to be smaller than under a uniform density, as well as regions of the square where the density is less than 1, in which Voronoi cell areas will be larger on average. Thus, in the case of dependent X, Y, the distribution from which Areas is sampled will necessarily be distinct from the null distribution. By adopting Voronoi cell areas as a test statistic, we manage to convert a 2-D goodness-of-fit test of the empirical 2-D copula distribution against the independent-uniform joint distribution (which is equivalent to our test of H0 vs HA) into a 1-D goodness-of-fit test of the empirical distribution of Areas against F_0. This reduction is likely even more useful if we are testing higher-dimensional X and Y for independence, as the performance of goodness-of-fit tests suffers heavily with increasing dimensionality.

The final step of VORTI employs the Kolmogorov-Smirnov (KS) test, a popular nonparametric method for determining whether one-dimensional observed data is sampled from some continuous reference distribution F_0 [Bickel and Doksum (2003)]. This test requires minimal assumptions regarding the true data-generating distribution and compares both the location and the shape of the empirical distribution with F_0. As a result of its extreme flexibility, the KS test does not exhibit much statistical power, generally requiring fairly large n to reject the null hypothesis. This is not a problem for our method, as we in any case require large n to establish an accurate approximation to F_0 in the first place. While we have not yet explored the statistical power of VORTI in practice, we believe the proposal of a novel nonparametric test to detect arbitrary forms of dependence from real-valued data is a significant contribution in itself, considering the dearth of methods currently available to tackle this basic problem.
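SciPy exposes this test as scipy.stats.kstest; a minimal sketch of the one-sample usage (the Uniform[0, 1] reference here is only an illustration, not the tabulated F_0 of VORTI):

```python
import numpy as np
from scipy.stats import kstest

rng = np.random.default_rng(0)
sample = rng.uniform(size=500)
# one-sample KS test of the sample against the Uniform[0,1] reference c.d.f.
res = kstest(sample, "uniform")
# res.statistic is the sup-distance between the empirical and reference
# c.d.f.s; res.pvalue is the probability of a gap at least this large under H0
```

Since the sample truly comes from the reference distribution, the statistic should be small and the p-value should not systematically concentrate near zero.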

Estimating mutual information

The simplest adaptation of the Voronoi diagram to estimating mutual information is to simply replace the rectangular grid (of user-specified bin size) in the discretization-based estimators with the data-adaptive tessellation produced by the Voronoi method and the corresponding density estimates of the VDE. We improve upon this approach by leveraging properties of the copula.

Theorem 1 The mutual information between X and Y is equal to the negative continuous entropy of the corresponding copula distribution:

I(X, Y) = \int_0^1 \int_0^1 c(u, v) \log[c(u, v)] \, du \, dv

Proof

I(X, Y) = \int_{\mathbb{R}^2} f_{XY}(x, y) \log\left(\frac{f_{XY}(x, y)}{f_X(x) f_Y(y)}\right) d(x, y)

= \int_{\mathbb{R}^2} c(F_X(x), F_Y(y)) f_X(x) f_Y(y) \log[c(F_X(x), F_Y(y))] \, d(x, y) \quad (applying (2))

= \int_0^1 \int_0^1 c(u, v) \log[c(u, v)] \, du \, dv \quad (by the change of variables u = F_X(x), v = F_Y(y))

Combining the result of Theorem 1 and the definition of the VDE (3), we obtain a Voronoi-based estimator of mutual information (VORMI) by computing the following statistic from the cells of the Voronoi diagram constructed over the probability integral transformed data points:

\mathrm{VORMI} = \sum_{i=1}^n \frac{1}{n \cdot \mathrm{Area}(C_i)} \log\left[\frac{1}{n \cdot \mathrm{Area}(C_i)}\right] \mathrm{Area}(C_i) = \frac{1}{n} \sum_{i=1}^n \log\left[\frac{1}{n \cdot \mathrm{Area}(C_i)}\right] \quad (4)
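A direct implementation of (4) needs only the rank transform and the clipped cell areas; the reflection-based clipping and helper names below are our own scaffolding, not details from the paper:

```python
import numpy as np
from scipy.spatial import Voronoi

def rank_transform(z):
    # empirical probability integral transform (normalized ranks)
    return (np.argsort(np.argsort(z)) + 1) / len(z)

def cell_areas(pts):
    # Voronoi cell areas clipped to [0,1]^2 via four-fold reflection
    n = len(pts)
    aug = np.vstack([pts, pts * [-1, 1], pts * [1, -1],
                     np.column_stack([2 - pts[:, 0], pts[:, 1]]),
                     np.column_stack([pts[:, 0], 2 - pts[:, 1]])])
    vor = Voronoi(aug)
    areas = np.empty(n)
    for i in range(n):
        v = vor.vertices[vor.regions[vor.point_region[i]]]
        areas[i] = 0.5 * abs(np.dot(v[:, 0], np.roll(v[:, 1], 1))
                             - np.dot(v[:, 1], np.roll(v[:, 0], 1)))
    return areas

def vormi(x, y):
    """Equation (4): mean of log(1 / (n * Area(C_i))) over all cells."""
    pts = np.column_stack([rank_transform(x), rank_transform(y)])
    return float(np.mean(np.log(1.0 / (len(x) * cell_areas(pts)))))

rng = np.random.default_rng(0)
z = rng.multivariate_normal([0, 0], [[1.0, 0.9], [0.9, 1.0]], size=1000)
mi_hat = vormi(z[:, 0], z[:, 1])
# since the clipped areas sum to 1, Jensen's inequality gives mi_hat >= 0;
# the true MI here is -0.5*log(1 - 0.81) nats
assert mi_hat >= 0.0
```

Note the built-in non-negativity: because the areas sum to one, (4) is an average of −log(n·Area(C_i)) whose mean argument is 1, so Jensen's inequality keeps the estimate at or above zero.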

Interested in estimating multivariate entropies, Miller (2003) have previously considered this same estimator, and like Browne (2007, 2012) and Barr (2008), they present a method to reduce the variance of the resulting estimator. We now describe each of the previously proposed variance reduction strategies in turn and discuss why each one is inadequate for accurate mutual information estimation.

To remedy the high variance of the VORMI estimator, Miller (2003) suggest agglomeratively merging neighboring Voronoi cells into "hyper-regions" (starting with m random seed cells) and performing the estimation as if these hyper-regions were the original Voronoi cells, but they themselves acknowledge that this Voronoi-cell binning clearly biases the resulting entropy estimates upwards (since the probability mass is assumed to be uniformly distributed within each hyper-region). We find that even without the merging of cells into hyper-regions, VORMI tends to significantly underestimate mutual information in the case where X and Y are highly dependent (see Figure 3) due to its assumption of uniformly distributed probability mass within each cell, and merging cells into even larger supposedly uniformly-distributed regions will only exacerbate this problem. Furthermore, such a merging approach negates a major advantage the Voronoi tessellation possesses over kernel/gridding-based methods, namely the fact that it probes the data at the finest resolution possible. Because Miller (2003) do not provide any empirical results demonstrating that use of the hyper-region areas leads to improved entropy estimates, we opt to forgo this approach.

To improve upon the VDE, Browne (2012) propose subsampling the data (only retaining m out of the n points, where m ≪ n) and averaging the Voronoi regions over numerous random subsamples to obtain a smoother final estimate of the density. Although this is a reasonable approach for density estimation, it produces significantly downwardly biased MI estimates. Because each subsample consists of only a small fraction of the data, the Voronoi diagram over the subsample is far less fine-grained than the Voronoi diagram over the full data, which, as in the case of the hyper-region approach, leads to larger regions which are assigned uniform probability. In the case of dependent random variables (in which case the underlying copula is not uniform over the unit square), this again leads to the appearance of greater copula entropy than is actually present in the underlying distribution, and thus to underestimates of mutual information from each subsample. Finally, averaging over downwardly biased subsample estimates cannot resolve this issue, so the resulting overall MI estimator is also heavily biased. We implemented a subsampling-based variant of VORMI and confirmed empirically that this is indeed the case (especially in the small sample-size regime of 30-100 which is typical for datasets in the sciences).

Another improvement to reduce the variance of the VDE, proposed by both Browne (2007) and Barr (2008), is the regularization of the Voronoi diagram over the observed data so that it more closely resembles a centroidal Voronoi tessellation (CVT). Given a set of points and their Voronoi diagram, the CVT is produced by iteratively updating the location of each point to be the centroid of its Voronoi cell and recomputing the Voronoi diagram until we obtain a "regular" Voronoi tessellation in which each point lies exactly at the centroid of its Voronoi cell. However, the theoretical justification of this approach is unclear, the method is unfaithful to the observed data (points are moved away from their original locations), and the resulting distribution of Voronoi cell areas after regularization is biased toward uniformity.

From Figure 3, we find that VORMI in fact does not even exhibit the problem of significant variance (it is no more variable than the k-NN estimator), suggesting that a sole emphasis on variance reduction will only produce limited improvements in MI estimates. While the individual probability density estimate at a specific location is highly variable (it depends entirely on the area of the Voronoi cell encompassing that location, which may fluctuate greatly depending on the geometric arrangement of the sampled data points), VORMI sums functions of these cell areas over all n Voronoi cells and therefore exhibits much greater stability, presumably as a result of the concentration of measure phenomenon (despite the fact that estimates from nearby cells are not necessarily independent).

Clearly evident from Figure 3, the primary factor impairing the performance of VORMI is its high bias rather than a large variance: from independent (or low-dependence) data, VORMI heavily overestimates MI on average due to the fact that numerous small Voronoi cells appear with high probability, even if the data are truly uniformly distributed throughout the unit square (see Figure 1 for an example). In the opposite situation where X and Y are highly dependent, VORMI tends to significantly underestimate MI because the assumption of uniform density within each Voronoi cell is a significant overestimate in the cells which touch the border of the [0, 1]^2 bounding square, assuming X, Y are each supported on the entire real line (because then their densities must fall off near the boundaries of the space in which all probability integral transformed data points are constrained to lie). Furthermore, the number of these boundary-adjacent cells is typically much greater when the variables are highly dependent (see Figure 1). This issue is most apparent in the extreme case where X and Y are comonotonic or countermonotonic (i.e. Y = g(X) for some monotonic function g): then all empirical probability integral transformed data points will lie on a line L. Thus, every cell of the Voronoi diagram over these points will touch the unit square boundary, and the areas of these cells do not accurately reflect the fact that the underlying joint distribution is only supported on L.

Smoothed Voronoi diagram for mutual information estimation

To amend the previously highlighted faults of VORMI, we propose a bias-correction technique which drastically improves MI estimates. The resulting estimator, which we refer to as SVORMI, is very simple and balances the following two separate objectives while remaining faithful to the original Voronoi diagram, and hence the data:

1. Smoothness in the distribution corresponding to Voronoi cell pseudo-areas

2. Mitigation of the effects of the boundary-bordering cell areas, which may be very poor approximations of the local density.

Maintaining a set B which contains the indices of Voronoi cells that intersect the unit square boundary, SVORMI also maintains a pseudo-area A_i for each Voronoi cell C_i, which is a smoothed approximation to the cell's true area. Initializing each A_i with the area of the corresponding Voronoi cell in the tessellation constructed over our probability-integral-transformed data points, SVORMI simultaneously applies a border-sensitive local-averaging update to all A_i in the t-th iteration.

More specifically, in each iteration of SVORMI, the pseudo-area of a non-boundary-bordering cell C_i is locally averaged with the pseudo-areas of its non-boundary-bordering neighboring Voronoi cells to produce the new pseudo-area A_i, while the pseudo-area of a boundary-bordering cell is locally averaged with the pseudo-areas of all neighboring cells, regardless of whether they are boundary-bordering or not. In a single iteration, these updates are applied simultaneously to all cells using the pseudo-area values from the previous iteration. Finally, the mutual information is estimated using the VORMI formula (4), but with the pseudo-areas used in place of the original Voronoi cell areas.

SVORMI iterative update. Let N(C_i) denote the indices of the neighboring Voronoi cells which share a face with C_i, and let B denote the indices of the Voronoi cells which intersect the unit square boundary. Then:

\[
A_i^{(t+1)} \leftarrow \frac{1}{|N(C_i)\setminus B| + 1}\Big(A_i^{(t)} + \sum_{j \in N(C_i)\setminus B} A_j^{(t)}\Big) \qquad \text{if } i \notin B \text{ and } |N(C_i)\setminus B| > 0,
\]
\[
A_i^{(t+1)} \leftarrow \frac{1}{|N(C_i)| + 1}\Big(A_i^{(t)} + \sum_{j \in N(C_i)} A_j^{(t)}\Big) \qquad \text{if } i \in B,
\]
\[
A_i^{(t+1)} \leftarrow A_i^{(t)} \qquad \text{otherwise.}
\]

The rationale behind our algorithm is that the non-boundary cell update will correct for the presence of a few Voronoi cells with minuscule area in the case of independent random variables by averaging their areas with the (with high probability, larger-area) neighboring cells. At the same time, the areas of boundary cells, which are more likely to lead to inaccurate density/MI estimates, are not allowed to affect the pseudo-areas of non-boundary cells, while at the same time being smoothed by the areas of the other Voronoi cells.
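The border-sensitive update above can be sketched in a few lines of code. The function below is our illustration on a hand-built adjacency structure (the name and toy inputs are ours), not the authors' implementation:

```python
import numpy as np

def svormi_update(areas, neighbors, boundary, T=1):
    """Apply T rounds of the border-sensitive local-averaging update to
    Voronoi cell pseudo-areas.

    areas     : initial cell areas A_i (one per Voronoi cell)
    neighbors : neighbors[i] = indices of cells sharing a face with C_i
    boundary  : set B of indices of cells touching the unit-square boundary
    """
    A = np.asarray(areas, dtype=float).copy()
    for _ in range(T):
        new = A.copy()  # updates use the previous iteration's values
        for i, nbrs in enumerate(neighbors):
            if i not in boundary:
                nb = [j for j in nbrs if j not in boundary]
                if nb:  # average only with non-boundary neighbours
                    new[i] = (A[i] + A[nb].sum()) / (len(nb) + 1)
                # otherwise: pseudo-area left unchanged
            else:       # boundary cells average with all neighbours
                new[i] = (A[i] + A[list(nbrs)].sum()) / (len(nbrs) + 1)
        A = new
    return A
```

For example, on a chain of five cells whose two endpoint cells touch the boundary, one update averages each boundary cell with its single neighbour, while interior cells ignore the boundary cells entirely.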

We find SVORMI to be fairly insensitive to the choice of T (the total number of border-sensitive neighborhood averaging updates), which, in addition to controlling the bias due to the border Voronoi cells and cells of minuscule size, also controls the overall smoothness of the resulting estimate. The resulting pseudo-areas produced after T SVORMI updates may also be viewed as solutions to the following modified graph-Laplacian optimization problem for a discrete sequence of increasing values of the regularization parameter λ:

\[
\min_{A_1,\dots,A_n} \; \sum_{i=1}^n \big(A_i - \mathrm{Area}(C_i)\big)^2 \;+\; \lambda \Bigg[ \sum_{\substack{(i,j)\in E \\ i,j\notin B}} (A_i - A_j)^2 \;+\; \sum_{\substack{(i,j)\in E \\ i,j\in B}} (A_i - A_j)^2 \;+\; \sum_{\substack{(i,j)\in E \\ i\in B,\ j\notin B}} (A_i - A_j)^2 \Bigg]
\]
\[
\text{subject to} \quad \sum_{i=1}^n A_i = 1 \quad \text{and} \quad A_i > 0 \;\; \forall i \tag{5}
\]

where E is the set of O(n log n) edges of the triangulation graph corresponding to the Delaunay triangulation of our data points (two data points are connected by an edge in this triangulation iff their Voronoi cells share a face). The Laplacian-optimization perspective of SVORMI makes explicit the objective of overall smoothness of the VDE-estimated density while simultaneously remaining faithful to the observed data. Solving this quadratic programming problem under different values of λ allows us to control the regularization of the SVORMI method at finer scales than permitted by the discrete nature of T.
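If the positivity constraints are dropped (they rarely bind when the raw areas are positive) and all edge groups share the single weight λ, the remaining equality-constrained quadratic program has a closed-form solution through the graph Laplacian. The sketch below is our illustration of that simplification, not the authors' code:

```python
import numpy as np

def smoothed_areas(areas, edges, lam):
    """Minimize  sum_i (A_i - areas_i)^2 + lam * sum_{(i,j) in edges} (A_i - A_j)^2
    subject to  sum_i A_i = 1,  dropping the positivity constraints.

    The optimum solves (I + lam * L) A = areas + mu * 1, where L is the graph
    Laplacian of the edge set and mu is a Lagrange multiplier fixed by the
    sum constraint."""
    areas = np.asarray(areas, dtype=float)
    n = len(areas)
    L = np.zeros((n, n))
    for i, j in edges:  # assemble the graph Laplacian
        L[i, i] += 1; L[j, j] += 1
        L[i, j] -= 1; L[j, i] -= 1
    M = np.eye(n) + lam * L
    x1 = np.linalg.solve(M, areas)        # unconstrained solution
    x2 = np.linalg.solve(M, np.ones(n))   # response to the multiplier
    mu = (1.0 - x1.sum()) / x2.sum()      # enforce sum(A) = 1
    return x1 + mu * x2
```

As λ grows, the solution on a connected graph shrinks toward the uniform vector, mirroring the effect of many averaging iterations T.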

A principled way to select the parameter λ or T is leave-one-out likelihood maximization: each individual data point (u_i, v_i) is in turn left out of the dataset, and the Voronoi diagram construction proceeds over the other n − 1 points. After we have computed the corresponding n − 1 pseudo-areas, we evaluate the likelihood of the held-out (u_i, v_i) as the probability density estimate produced for the Voronoi cell containing (u_i, v_i) by the VDE with Voronoi cell areas replaced by their smoothed pseudo-counterparts. Subsequently, we select the parameter setting which produces the best average held-out likelihood across each of the n points being omitted as the held-out point. We finally note that this leave-one-out method can be efficiently implemented, if we have already computed the full Voronoi diagram, by deleting a single point and updating the appropriate Voronoi cells in the data structure, rather than constructing the Voronoi diagram over the remaining n − 1 points from scratch for each of the n iterations of leave-one-out likelihood evaluation.
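As a self-contained illustration of the selection procedure (ours, not the authors' implementation), the sketch below runs leave-one-out likelihood maximization in one dimension, where the Voronoi cells on [0, 1] are simply the intervals between midpoints of consecutive sorted points, and a plain neighbour-averaging update stands in for the border-sensitive one:

```python
import numpy as np

def cells_1d(pts):
    """1-D Voronoi cells on [0, 1]: each sorted point owns the interval
    between the midpoints to its neighbouring points (clipped to [0, 1])."""
    s = np.sort(pts)
    mids = (s[:-1] + s[1:]) / 2
    edges = np.concatenate(([0.0], mids, [1.0]))
    return s, np.diff(edges)

def smooth_lengths(lengths, T):
    """T rounds of plain neighbour averaging of the cell lengths (a simplified
    1-D stand-in for the border-sensitive SVORMI update)."""
    L = np.asarray(lengths, dtype=float).copy()
    for _ in range(T):
        padded = np.concatenate(([L[0]], L, [L[-1]]))
        L = (padded[:-2] + padded[1:-1] + padded[2:]) / 3
    return L

def loo_score(pts, T):
    """Mean leave-one-out log-likelihood of the smoothed density estimate."""
    n, total = len(pts), 0.0
    for i in range(n):
        rest = np.delete(pts, i)
        centers, lengths = cells_1d(rest)
        dens = 1.0 / ((n - 1) * smooth_lengths(lengths, T))  # density per cell
        dens /= (dens * lengths).sum()        # renormalize so it integrates to 1
        j = np.argmin(np.abs(centers - pts[i]))  # cell containing held-out point
        total += np.log(dens[j])
    return total / n

# select the number of smoothing rounds with the best held-out likelihood
rng = np.random.default_rng(0)
data = rng.uniform(size=60)
best_T = max(range(4), key=lambda T: loo_score(data, T))
```

The same loop applies unchanged if `loo_score` is parameterized by λ instead of T, with the smoothing step replaced by the Laplacian-regularized solve of problem (5).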

Results

One of the leading mutual information estimators is the k-nearest neighbor method (k-NN) introduced by Kraskov et al. (2004). This popular state-of-the-art method uses similar principles to the KDE-based approach, but rather than Gaussian kernels, k-nearest neighbor distances are used to obtain entropy estimates. The k-NN method exhibits a number of advantages over KDE mutual information estimators: it is more efficient, it is adaptive (resolution is higher where data are more numerous), and it tends to have less bias [Kraskov et al. (2004)].
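For concreteness, here is a compact brute-force sketch of the first estimator of Kraskov et al. (2004) (often called KSG); it uses O(n²) time and memory and is intended only to make the estimator's mechanics explicit, not as an efficient implementation:

```python
import numpy as np
from scipy.special import digamma

def ksg_mi(x, y, k=3):
    """Brute-force KSG (variant 1) mutual information estimate in nats:
    I = psi(k) + psi(n) - < psi(n_x + 1) + psi(n_y + 1) >."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    n = len(x)
    dx = np.abs(x[:, None] - x[None, :])
    dy = np.abs(y[:, None] - y[None, :])
    d = np.maximum(dx, dy)                    # max-norm distances in joint space
    np.fill_diagonal(d, np.inf)
    eps = np.sort(d, axis=1)[:, k - 1]        # distance to the k-th neighbour
    nx = (dx < eps[:, None]).sum(axis=1) - 1  # marginal counts (exclude self)
    ny = (dy < eps[:, None]).sum(axis=1) - 1
    return digamma(k) + digamma(n) - np.mean(digamma(nx + 1) + digamma(ny + 1))
```

On independent samples the estimate concentrates near zero, while for strongly dependent pairs it approaches the (large) true MI.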

In our experiments, we find that this k-NN method significantly outperforms both the discretized and KDE-based mutual information estimates in applications to continuous data without additional parameter optimization. We thus omit comparisons of the methods presented in this paper with discretized / KDE-based mutual information estimates, focusing solely on how our methods fare against Kraskov's presumably superior method.

SVORMI and the k-NN estimator are similar in their reliance on nearest neighbors, although the nearest neighbors considered in SVORMI are the set of points in the space which are closest to each data point, while the nearest neighbors in the k-NN method are simply the k other samples which lie closer to a given data point than any other of the samples. Both methods are data efficient (considering structures down to the smallest scale possible for the given data), adaptive, and have minimal bias, although for extremely highly dependent variables, the MI estimates of both SVORMI and the k-NN estimator tend to remain biased downwards. However, this downward bias is of little concern since such great degrees of dependence are virtually nonexistent in real-world datasets, and furthermore, the continuous form of MI itself is an increasingly unstable measure as Y approaches a function of X (it tends toward infinity; see Fact 1).

The number of local averaging updates used in SVORMI (T) is conceptually similar to the number of neighbors (k) in the k-NN MI estimation algorithm in that it regulates the resolution at which we treat the data (with T = 0, i.e. no neighborhood pseudo-area averaging, corresponding to the case in which the data are treated at the finest possible scale). In our experiments, we perform no parameter tuning, setting both T and k equal to log₂(n) to obtain fair comparisons (the theory behind k-NN methods suggests O(log n) to be the optimal choice of k as the sample size grows, but we have no theoretical justification for setting T to this value, and it may be that SVORMI performs even better under different asymptotic parameter settings).

Figure 3: Mutual information estimates from VORMI, SVORMI, and the k-NN method of Kraskov et al. (2004) applied to bivariate Gaussian data (of the indicated sample sizes n) and different levels of correlation. For each sample size and each correlation value, we sampled 100 datasets of size n and computed estimates for each dataset. The colored dots indicate the average estimate over these 100 repetitions and the vertical bars denote the standard deviation, while the black dots illustrate the true mutual information of the underlying distribution.

13

Page 14: Voronoi Methods for Assessing Statistical Dependence · 2017-12-01 · Voronoi Methods for Assessing Statistical Dependence Poisson process. AlthoughBrowne(2007) nd that the bias

Mueller Reshef


Figure 4: Mean-squared errors of the MI estimates obtained via the nearest-neighbor method of Kraskov et al. (2004) and our Voronoi-diagram-based approaches for bivariate Gaussian data of different sample sizes/correlations. Due to time constraints, the MSE at each (sample size, correlation value) setting was estimated from only 20 datasets drawn from the corresponding Gaussian. We have run SVORMI under two choices for the number of local-averaging updates T (1 and log₂(n)) and the k-NN method with k = 3 (the default in the parmigene R package, which we relied on in our experiments) and k = log₂(n).

Figures 3 and 4 clearly demonstrate that for samples of various sizes from a bivariate Gaussian distribution, the performance of SVORMI is competitive with that of the k-NN estimator for all but the highest correlation values, which are of little interest in practical settings. Verifying that our method matches the current leading estimator on Gaussian data is important, but the primary advantage of mutual information as a measure of dependence stems from its ability to detect dependence in arbitrary irregular distributions beyond Gaussians. Therefore, we compare SVORMI against the k-NN estimator over datasets sampled from the 5 different functional forms detailed in Table 1, where we add increasing amounts of noise in the y coordinates (measured by the coefficient of determination, R²) to determine how these methods fare in different signal-to-noise regimes. For each functional form, at a given sample size and noise level, we generate 50 datasets to obtain stable measurements of the estimators' performances (see Figure 7 for an example of one dataset used in our evaluations). Figures 5, 9, and 10 demonstrate that for certain distributions / sample sizes / noise levels, SVORMI may be a better MI estimator than the k-NN approach (under the squared-loss criterion, which is the standard decision-theoretic measure by which continuous-valued estimators are judged).

Function Name | Definition
Linear+Periodic, High Freq | y = (1/10) sin(10.6(2x − 1)) + (11/10)(2x − 1), x ∈ [0, 1]
Exponential [2^x] | y = 2^x, x ∈ [0, 10]
Parabola | y = 4x², x ∈ [−1/2, 1/2]
Sine, High Freq | y = sin(16πx), x ∈ [0, 1]
Varying Freq [Medium] Cosine | y = sin(5πx(1 + x)), x ∈ [0, 1]

Table 1: Definitions of the functions used to analyze the various mutual information estimators in this work.
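The noisy functional datasets described above can be generated by scaling the noise variance to hit a target coefficient of determination. The sketch below is our construction (the function name and defaults are illustrative, not taken from the paper) and works for any functional form f:

```python
import numpy as np

def sample_with_r2(f, n, r2, x_low=0.0, x_high=1.0, seed=None):
    """Sample (x, y) with y = f(x) + Gaussian noise, where the noise standard
    deviation is chosen so that Var(f(X)) / Var(Y) -- the population
    coefficient of determination -- equals r2.  Requires 0 < r2 <= 1."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(x_low, x_high, n)
    signal = f(x)
    # Var(signal) / (Var(signal) + sigma^2) = r2  =>  sigma^2 = Var(signal)(1 - r2)/r2
    sigma = np.sqrt(signal.var() * (1.0 - r2) / r2)
    y = signal + rng.normal(scale=sigma, size=n)
    return x, y
```

Setting r2 = 1 recovers the noiseless functional relationship, while smaller values of r2 progressively drown the signal in noise.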

Future Work

The most immediate extension of this work is an investigation of how our methods cope with the curse of dimensionality. The volumes of Voronoi cells will inevitably produce worse estimates of the underlying distribution in high-dimensional settings, and it is important to establish to what extent the statistical power of VORTI and the convergence rate of SVORMI to the true mutual information are weakened for large d. Furthermore, computation of Voronoi diagrams and volumes of cells also becomes expensive in the high-dimensional setting, possibly requiring efficient approximation algorithms to maintain the feasibility of applying our methods to large datasets. It may be preferable in higher dimensions to adapt our methods for the dual of the Voronoi diagram, the Delaunay triangulation. Many applied domains would also benefit from extensions of these methods to metrics beyond the Euclidean distance. In general, we believe there are many more promising results to be found by bringing tools from computational geometry to problems in statistics and data analysis.



Figure 5: Mean-squared errors of the MI estimates obtained via the nearest-neighbor method of Kraskov et al. (2004) and our Voronoi-diagram-based approaches on data sampled from the Varying Freq Cosine functional form described in Table 1, with an amount of noise added to Y specified by the coefficient of determination (R² value). For ease of comparison, we have inversely scaled each MSE by the square of the true underlying mutual information (where if R² = 0, i.e. MI = 0, we scale by 0.01²). We have run SVORMI under two choices for the number of local-averaging updates T (1 and log₂(n)) and the k-NN method with both k = 3 (the default in the parmigene R package) and k = log₂(n).


References

Christopher Barr. Applications of Voronoi tessellations in point pattern analysis. PhD thesis, University of California, Los Angeles, 2008.

Peter Bickel and Kjell Doksum. Mathematical Statistics. Pearson Prentice Hall, 2nd edition, 2003.

Adrian Bowyer. Computing Dirichlet tessellations. The Computer Journal, 24(2):162–166, 1981.

Matthew Browne. A geometric approach to non-parametric density estimation. Pattern Recognition, 40:134–140, 2007.

Matthew Browne. Regularized tessellation density estimation with bootstrap aggregation and complexity penalization. Pattern Recognition, 45:1531–1539, 2012.

S. N. Chiu. Spatial point pattern analysis by using Voronoi diagrams and Delaunay tessellations: a comparative study. Biometrical Journal, 45(3):367–376, 2003.

T. Cover and J. Thomas. Elements of Information Theory. John Wiley & Sons, New York, 2006.

R. A. Dwyer. Higher-dimensional Voronoi diagrams in linear expected time. Discrete & Computational Geometry, 6(1):343–367, 1991.

Herbert Edelsbrunner, David Kirkpatrick, and Raimund Seidel. On the shape of a set of points in the plane. IEEE Transactions on Information Theory, 29(4):551–559, 1983.

O. Elemento, N. Slonim, and S. Tavazoie. A universal framework for regulatory element discovery across all genomes and data types. Molecular Cell, 28(2):337–350, 2007.

Steven Fortune. A sweepline algorithm for Voronoi diagrams. In Proceedings of the Second Annual Symposium on Computational Geometry, pages 313–322, 1986.

Arthur Gretton, Olivier Bousquet, Alex Smola, and Bernhard Schölkopf. Measuring statistical dependence with Hilbert–Schmidt norms. In Algorithmic Learning Theory, pages 63–77. Springer, 2005.

Arthur Gretton, Kenji Fukumizu, Choon Hui Teo, Le Song, Bernhard Schölkopf, and Alex J. Smola. A kernel statistical test of independence. In NIPS, volume 20, pages 585–592, 2007.

M. Hazewinkel. Encyclopaedia of Mathematics: Supplement. Springer, 2001. ISBN 9781402001987.

M. G. Kendall. A new measure of rank correlation. Biometrika, 30(1/2):81–93, 1938. ISSN 0006-3444.

A. Kraskov, H. Stögbauer, and P. Grassberger. Estimating mutual information. Physical Review E, 69, 2004.

Yan Liu. Probability Density Estimation on Rotation and Motion Groups. PhD thesis, Johns Hopkins University, 2007.

J. L. Meijering. Interface area, edge length, and number of vertices in crystal aggregation with random nucleation. Philips Research Reports, 8:270–290, 1953.

E. G. Miller. A new class of entropy estimators for multi-dimensional densities. In International Conference on Acoustics, Speech, and Signal Processing, 3:297–300, 2003.

Y. Moon, B. Rajagopalan, and U. Lall. Estimation of mutual information using kernel density estimators. Physical Review E, 52(3):2318–2321, 1995.

K. Pearson. Mathematical contributions to the theory of evolution. III. Regression, heredity, and panmixia. Philosophical Transactions of the Royal Society of London. Series A, 187:253–318, 1896. ISSN 0264-3952.

A. Sklar. Fonctions de répartition à n dimensions et leurs marges. Publications de l'Institut de Statistique de l'Université de Paris, 8:229–231, 1959.

E. M. Stein and G. L. Weiss. Introduction to Fourier Analysis on Euclidean Spaces. Princeton University Press, 1971. ISBN 069108078X.

R. Steuer, J. Kurths, C. Daub, J. Weise, and J. Selbig. The mutual information: detecting and evaluating dependencies between variables. Bioinformatics, 18:231–240, 2002.

G. J. Szekely and M. L. Rizzo. Brownian distance covariance. The Annals of Applied Statistics, 3(4):1236–1265, 2009.

David F. Watson. Computing the n-dimensional Delaunay tessellation with application to Voronoi polytopes. The Computer Journal, 24(2):167–172, 1981.


Supplementary Figures and Facts

Figure 6: An i.i.d. sample of 1000 points where X ∼ N(0, 2) and Y_i | x_i ∼ sin(x_i) + Uniform[−1/2, 1/2].

Figure 7: Two different samples drawn from the Linear + Periodic functional form described in Table 1, with no noise introduced in the dataset depicted on the right and a small amount of noise (coefficient of determination = 0.9) added in the dataset on the left.




Figure 9: Mean-squared errors of the MI estimates obtained via the nearest-neighbor method of Kraskov et al. (2004) and our Voronoi-diagram-based approaches on data sampled from the Sine functional form described in Table 1, with an amount of noise added to Y specified by the coefficient of determination (R² value). For ease of comparison, we have inversely scaled each MSE by the square of the true underlying mutual information (where for R² = 0, i.e. MI = 0, we scale by 0.01²). We have run SVORMI under two choices for the number of local-averaging updates T (1 and log₂(n)) and the k-NN method with k = 3 (the default in the parmigene R package) and k = log₂(n).



Figure 10: Mean-squared errors of the MI estimates obtained via the nearest-neighbor method of Kraskov et al. (2004) and our Voronoi-diagram-based approaches on data sampled from the Weak Exponential functional form (see Table 1), with an amount of noise added to Y specified by the coefficient of determination (R² value). We have run SVORMI under two choices for the number of local-averaging updates T (1 and log₂(n)) and the k-NN method with k = 3 (the default in the parmigene R package) and k = log₂(n).



Figure 11: Mean-squared errors of the MI estimates obtained via the nearest-neighbor method of Kraskov et al. (2004) and our Voronoi-diagram-based approaches on data sampled from the Parabola functional form described in Table 1, with an amount of noise added to Y specified by the coefficient of determination (R² value). We have run SVORMI under two choices for the number of local-averaging updates T (1 and log₂(n)) and the k-NN method with k = 3 (the default in the parmigene R package) and k = log₂(n).



Figure 12: Mean-squared errors of the MI estimates obtained via the nearest-neighbor method of Kraskov et al. (2004) and our Voronoi-diagram-based approaches on data sampled from the Linear + Periodic functional form described in Table 1, with an amount of noise added to Y specified by the coefficient of determination (R² value). We have run SVORMI under two choices for the number of local-averaging updates T (1 and log₂(n)) and the k-NN method with k = 3 (the default in the parmigene R package) and k = log₂(n).


Fact 1 If Y = g(X) for a monotone function g, then I(X, Y) = +∞ for "nice" joint distributions (which are smooth and supported on a set of positive measure).

Proof

\[
I(X, Y) = \mathbb{E}_{x \sim F_X}\!\left[ D_{\mathrm{KL}}\!\left(F_{Y|X} \,\middle\|\, F_Y\right) \right] \qquad (D_{\mathrm{KL}} \text{ denotes the Kullback--Leibler divergence})
\]
\[
= \mathbb{E}_{x \sim F_X}\!\left[ \int_{-\infty}^{\infty} \log\!\left( \frac{\delta_x\!\left(g^{-1}(y)\right)}{f_X\!\left(g^{-1}(y)\right)} \right) f_X\!\left(g^{-1}(y)\right) \left| \frac{d}{dy}\, g^{-1}(y) \right| dy \right] \qquad (\delta_x(\cdot) \text{ is the Dirac delta function})
\]

and for each value of x,

\[
\int_{-\infty}^{\infty} \log\!\left( \frac{\delta_x\!\left(g^{-1}(y)\right)}{f_X\!\left(g^{-1}(y)\right)} \right) f_X\!\left(g^{-1}(y)\right) \left| \frac{d}{dy}\, g^{-1}(y) \right| dy = +\infty,
\]

since the conditional distribution F_{Y|X=x} places all of its mass at the single point y = g(x), while f_X(g^{-1}(y)) > 0 on a set of positive measure; hence F_{Y|X=x} is not absolutely continuous with respect to the continuous marginal of Y, and the divergence is infinite.

Note that the continuous formulation differs significantly from the discrete setting, in which the mutual information of X with itself is simply its entropy.
