
LSMI-Sinkhorn: Semi-supervised Squared-Loss Mutual Information Estimation with Optimal Transport

Yanbin Liu¹*, Makoto Yamada²,³*, Yao-Hung Hubert Tsai⁴, Tam Le³, Ruslan Salakhutdinov⁴, Yi Yang¹

¹University of Technology Sydney, ²Kyoto University, ³RIKEN AIP, ⁴Carnegie Mellon University

*Equal contribution

September 6, 2019

Abstract

Estimating mutual information is an important machine learning and statistics problem. To estimate the mutual information from data, a common practice is to prepare a set of paired samples $\{(x_i, y_i)\}_{i=1}^{n} \overset{\text{i.i.d.}}{\sim} p(x, y)$. However, in some cases, it is difficult to obtain a large number of data pairs. To address this problem, we propose squared-loss mutual information (SMI) estimation using a small number of paired samples and the available unpaired ones. We first represent SMI through the density-ratio function, where the expectation is approximated by the samples from the marginals and their assignment parameters. The objective is formulated as an optimal transport problem combined with quadratic programming. Then, we introduce the least-squares mutual information with Sinkhorn (LSMI-Sinkhorn) algorithm for efficient optimization. Through experiments, we first demonstrate that the proposed method can estimate the SMI without a large number of paired samples. We also evaluate and show the effectiveness of the proposed LSMI-Sinkhorn on various machine learning problems such as image matching and photo album summarization.

1 Introduction

Mutual information (MI) represents the statistical independence between two random variables [4], and it is widely used in various machine learning applications, including feature selection [24, 25], dimensionality reduction [23], and causal inference [26]. More recently, deep neural network (DNN) models have started using MI as a regularizer for obtaining better representations from data, as in infoVAE [30] and deep infoMax [11]. Another application is improving generative adversarial networks (GANs) [8]. For instance, Mutual Information Neural Estimation (MINE) [1] was proposed to maximize or minimize the MI in deep networks and alleviate the mode-dropping issue in GANs. In all these examples, MI estimation is the core component.

Among the various MI estimation approaches, the probability density-ratio function is considered one of the most important components:
$$r(x, y) = \frac{p(x, y)}{p(x)\,p(y)}.$$
A straightforward way to estimate this ratio is to estimate the probability densities (i.e., $p(x, y)$, $p(x)$, and $p(y)$) and then compute their ratio. However, directly estimating the probability densities is difficult, which makes this two-step approach inefficient. To address the issue, Suzuki et al. [25] proposed to directly estimate the density ratio while avoiding density estimation [24, 25].



Nonetheless, the abovementioned methods require a large number of paired data when estimating the MI.

In practical settings, we can often obtain only a small number of paired samples. For example, it requires a massive amount of human labor to obtain one-to-one correspondences from one language to another, which prevents us from easily measuring the MI across languages. Hence, a research question arises:

Can we perform mutual information estimation using unpaired samples and a small number of data pairs?

To answer this question, we propose a semi-supervised MI estimation algorithm, particularly designed for the squared-loss mutual information (SMI) (a.k.a. the $\chi^2$-divergence between $p(x, y)$ and $p(x)p(y)$) [24]. We first formulate SMI estimation as an optimal transport problem with density-ratio estimation. Then, we propose the least-squares mutual information with Sinkhorn (LSMI-Sinkhorn) algorithm to solve the problem. The algorithm has a computational complexity of $O(n_x n_y)$ and is hence computationally efficient. We present the connection between the proposed algorithm and the Gromov-Wasserstein distance [15], which is a popular distance for measuring the discrepancy between different domains. Through experiments, we first demonstrate that the proposed method can estimate the SMI without a large number of paired samples. Finally, we show the effectiveness of the proposed method on image matching and photo album summarization.

We summarize the contributions of this paper as follows:

• We proposed a semi-supervised mutual information estimation approach that does not require a large number of paired samples.

• We formulated the MI estimation as a combination of density-ratio fitting and optimal transport.

• We proposed the LSMI-Sinkhorn algorithm, which can be computed efficiently and whose loss is guaranteed to decrease monotonically.

• We established a connection between the proposed method and the Gromov-Wasserstein distance.

2 Problem Formulation

In this section, we formulate the problem of squared-loss mutual information (SMI) estimation using a small number of paired samples.

Let $\mathcal{X} \subset \mathbb{R}^{d_x}$ be the domain of vector $x$ and $\mathcal{Y} \subset \mathbb{R}^{d_y}$ be the domain of vector $y$. Suppose we are given $n$ independent and identically distributed (i.i.d.) paired samples
$$\{(x_i, y_i)\}_{i=1}^{n},$$
where the number of paired samples $n$ is considered to be small. In addition to the paired samples, we suppose we have $n_x$ and $n_y$ i.i.d. samples from the marginal distributions:
$$\{x_i\}_{i=n+1}^{n+n_x} \overset{\text{i.i.d.}}{\sim} p(x) \quad \text{and} \quad \{y_j\}_{j=n+1}^{n+n_y} \overset{\text{i.i.d.}}{\sim} p(y),$$
where the numbers of unpaired samples $n_x$ and $n_y$ are much larger than the number of paired samples $n$ (e.g., $n = 10$ and $n_x = n_y = 1000$).

We also write the unpaired samples as $x'_{i-n} = x_i$ for $i = n+1, n+2, \ldots, n+n_x$ and $y'_{j-n} = y_j$ for $j = n+1, n+2, \ldots, n+n_y$, respectively. It should be noted that the input dimensions $d_x$ and $d_y$, as well as the numbers of samples $n_x$ and $n_y$, can differ.


This paper aims to estimate the SMI (a.k.a. the $\chi^2$-divergence between $p(x, y)$ and $p(x)p(y)$) [24] from $\{(x_i, y_i)\}_{i=1}^{n}$ by also leveraging the unpaired samples $\{x_i\}_{i=n+1}^{n+n_x}$ and $\{y_j\}_{j=n+1}^{n+n_y}$. The SMI between two random variables $X$ and $Y$ is defined as
$$\mathrm{SMI}(X, Y) = \frac{1}{2} \iint \big(r(x, y) - 1\big)^2\, p(x)p(y)\,\mathrm{d}x\,\mathrm{d}y, \qquad (1)$$
where
$$r(x, y) = \frac{p(x, y)}{p(x)p(y)} \qquad (2)$$
is the density-ratio function. SMI is 0 if and only if $X$ and $Y$ are independent (i.e., $p(x, y) = p(x)p(y)$) and takes a positive value otherwise.

If an estimate of the density-ratio function is available, we can approximate the SMI as
$$\widehat{\mathrm{SMI}}(X, Y) = \frac{1}{2(n+n_x)(n+n_y)} \sum_{i=1}^{n+n_x} \sum_{j=1}^{n+n_y} \big(r_{\alpha}(x_i, y_j) - 1\big)^2,$$
where $r_{\alpha}(x, y)$ is an estimate of the true density-ratio function parameterized by $\alpha$. However, since we do not have enough paired samples to estimate the ratio function, estimating the SMI from the limited number of paired samples is very challenging. The key idea of this paper is to align the unpaired samples using the paired samples and use them to improve the SMI estimation accuracy.
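As a concrete illustration, this plug-in estimate is simply the average of $(r_{\alpha}(x_i, y_j) - 1)^2$ over all sample combinations. A minimal NumPy sketch, assuming the matrix `R` of ratio values $r_{\alpha}(x_i, y_j)$ has already been computed:

```python
import numpy as np

def smi_plugin(R):
    """Plug-in SMI estimate.

    R[i, j] holds r_alpha(x_i, y_j) for all combinations of the (n + n_x)
    x-samples and (n + n_y) y-samples, so averaging approximates the double
    integral against p(x)p(y).
    """
    return 0.5 * np.mean((R - 1.0) ** 2)
```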

3 Proposed Method

In this section, we propose an SMI estimation algorithm that uses a limited number of paired samples and a large number of unpaired samples.

3.1 Least-Squares Mutual Information with Sinkhorn (LSMI-Sinkhorn)

Model: We employ the following density-ratio model:
$$r_{\alpha}(x, y) = \sum_{\ell=1}^{b} \alpha_{\ell} K(x_{\ell}, x) L(y_{\ell}, y) = \alpha^{\top} \varphi(x, y), \qquad (3)$$
where $\alpha \in \mathbb{R}^{b}$, $K(x, x')$ and $L(y, y')$ are kernel functions, $k(x) = (K(x_1, x), \ldots, K(x_b, x))^{\top} \in \mathbb{R}^{b}$, $l(y) = (L(y_1, y), \ldots, L(y_b, y))^{\top} \in \mathbb{R}^{b}$, and $\varphi(x, y) = k(x) \circ l(y)$. Here, $\{x_{\ell}\}_{\ell=1}^{b}$ and $\{y_{\ell}\}_{\ell=1}^{b}$ are sets of basis vectors sampled from $\{x_i\}_{i=1}^{n+n_x}$ and $\{y_j\}_{j=1}^{n+n_y}$, respectively.

We optimize $\alpha$ by minimizing the squared difference between the true density-ratio function and the ratio model:

$$\frac{1}{2} \iint \left( \frac{p(x, y)}{p(x)p(y)} - r_{\alpha}(x, y) \right)^2 p(x)p(y)\,\mathrm{d}x\,\mathrm{d}y = \mathrm{Const.} + \frac{1}{2} \iint r_{\alpha}(x, y)^2\, p(x)p(y)\,\mathrm{d}x\,\mathrm{d}y - \iint r_{\alpha}(x, y)\, p(x, y)\,\mathrm{d}x\,\mathrm{d}y. \qquad (4)$$

The second term of Eq. (4) can be approximated using a large number of unpaired samples. However, approximating the third term requires paired samples from the joint distribution. Because only a limited number of paired samples is available in our setting, the approximation of the third term can be poor.
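To make the density-ratio model of Eq. (3) concrete, the following minimal NumPy sketch evaluates the feature map $\varphi(x, y) = k(x) \circ l(y)$ and $r_{\alpha}$. The Gaussian kernels follow Section 5.1, while the function and variable names (and the basis points passed in) are illustrative assumptions:

```python
import numpy as np

def gauss_gram(A, B, sigma):
    """Gaussian kernel matrix with entries exp(-||a_i - b_j||^2 / (2 sigma^2))."""
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dists / (2.0 * sigma ** 2))

def features(X, Y, X_basis, Y_basis, sigma_x, sigma_y):
    """Rows of phi(x, y) = k(x) ∘ l(y) for row-wise pairs of X and Y; shape (m, b)."""
    Kx = gauss_gram(X, X_basis, sigma_x)   # K(x_l, x) for each sample/basis pair
    Ly = gauss_gram(Y, Y_basis, sigma_y)   # L(y_l, y)
    return Kx * Ly                         # elementwise (Hadamard) product

def ratio_model(alpha, X, Y, X_basis, Y_basis, sigma_x, sigma_y):
    """r_alpha(x, y) = alpha^T phi(x, y), evaluated row-wise."""
    return features(X, Y, X_basis, Y_basis, sigma_x, sigma_y) @ alpha
```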

3

Page 4: LSMI-Sinkhorn: Semi-supervised Squared-Loss Mutual ...problems such as image matching and photo album summarization. 1 Introduction Mutual information (MI) represents the statistical

Algorithm 1 LSMI-Sinkhorn algorithm.

Initialize $\Pi^{(0)}$ and $\Pi^{(1)}$ such that $\|\Pi^{(1)} - \Pi^{(0)}\|_F > \eta$ ($\eta$ is the stopping parameter), $\alpha^{(0)}$, the regularization parameters $\varepsilon$ and $\lambda$, the maximum number of iterations $T$, and the iteration index $t = 1$.
while $t \leq T$ and $\|\Pi^{(t)} - \Pi^{(t-1)}\|_F > \eta$ do
    $\alpha^{(t+1)} = \operatorname{argmin}_{\alpha} J(\Pi^{(t)}, \alpha)$.
    $\Pi^{(t+1)} = \operatorname{argmin}_{\Pi} J(\Pi, \alpha^{(t+1)})$.
    $t = t + 1$.
end while
return $\Pi^{(t-1)}$ and $\alpha^{(t-1)}$.

To deal with this issue, we propose to use the unpaired samples to approximate the expectation in the third term. Specifically, we first introduce $\pi_{ij} \geq 0$ with $\sum_{i=1}^{n_x} \sum_{j=1}^{n_y} \pi_{ij} = 1$ and represent the third term as
$$\iint r_{\alpha}(x, y)\, p(x, y)\,\mathrm{d}x\,\mathrm{d}y \approx \frac{\beta}{n} \sum_{i=1}^{n} r_{\alpha}(x_i, y_i) + (1 - \beta) \sum_{i=1}^{n_x} \sum_{j=1}^{n_y} \pi_{ij}\, r_{\alpha}(x'_i, y'_j),$$
where $0 \leq \beta \leq 1$ is a tuning parameter that balances the paired and unpaired terms. Note that if we set $\pi_{ij} = \delta(x'_i, y'_j)/n'$, where $\delta(x'_i, y'_j)$ is one if $x'_i$ and $y'_j$ are paired and zero otherwise, and $n'$ is the total number of pairs, then we recover the original empirical estimate.

Then, for the density-ratio model (Eq. (3)), the loss function (Eq. (4)) can be approximated as

$$J(\alpha, \Pi) = \frac{1}{2} \alpha^{\top} H \alpha - \alpha^{\top} h_{\Pi, \beta},$$
where
$$H = \frac{1}{(n + n_x)(n + n_y)} \sum_{i=1}^{n+n_x} \sum_{j=1}^{n+n_y} \varphi(x_i, y_j)\, \varphi(x_i, y_j)^{\top},$$
$$h_{\Pi, \beta} = \frac{\beta}{n} \sum_{i=1}^{n} \varphi(x_i, y_i) + (1 - \beta) \sum_{i=1}^{n_x} \sum_{j=1}^{n_y} \pi_{ij}\, \varphi(x'_i, y'_j).$$
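Given the kernel matrices between the basis points and the samples, $H$ and $h_{\Pi, \beta}$ reduce to a few matrix products. The sketch below is an illustration under that assumption; it uses the fact that, since $\varphi = k \circ l$, the double sum in $H$ separates into an elementwise product of two Gram matrices:

```python
import numpy as np

def build_H(K_all, L_all):
    """H = (1/((n+nx)(n+ny))) * sum_{i,j} phi(x_i, y_j) phi(x_i, y_j)^T.

    K_all: (b, n+nx) values K(x_l, x_i); L_all: (b, n+ny) values L(y_l, y_j).
    Because phi = k ∘ l, the sums over i and j factorize.
    """
    Nx, Ny = K_all.shape[1], L_all.shape[1]
    return (K_all @ K_all.T / Nx) * (L_all @ L_all.T / Ny)

def build_h(K_pair, L_pair, K_unp, L_unp, Pi, beta):
    """h_{Pi,beta} = (beta/n) sum_i phi(x_i, y_i)
                     + (1-beta) sum_{i,j} pi_ij phi(x'_i, y'_j).

    K_pair, L_pair: (b, n) kernels on the paired samples;
    K_unp: (b, nx) and L_unp: (b, ny) kernels on the unpaired samples; Pi: (nx, ny).
    """
    h_paired = beta * (K_pair * L_pair).mean(axis=1)
    h_unpaired = (1.0 - beta) * ((K_unp @ Pi) * L_unp).sum(axis=1)
    return h_paired + h_unpaired

def objective_J(alpha, H, h):
    """J(alpha, Pi) = 0.5 * alpha^T H alpha - alpha^T h (regularizers omitted)."""
    return 0.5 * alpha @ (H @ alpha) - alpha @ h
```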

Since we want to estimate the density-ratio function by minimizing Eq. (4), the optimization problem is given as
$$\min_{\Pi, \alpha}\; J(\Pi, \alpha) = \frac{1}{2} \alpha^{\top} H \alpha - \alpha^{\top} h_{\Pi, \beta} + \varepsilon H(\Pi) + \frac{\lambda}{2} \|\alpha\|_2^2$$
$$\text{s.t.} \quad \Pi \mathbf{1}_{n_y} = n_x^{-1} \mathbf{1}_{n_x} \quad \text{and} \quad \Pi^{\top} \mathbf{1}_{n_x} = n_y^{-1} \mathbf{1}_{n_y}, \qquad (5)$$
where $H(\Pi) = \sum_{i=1}^{n_x} \sum_{j=1}^{n_y} \pi_{ij} (\log \pi_{ij} - 1)$ is the negative entropic regularizer, which ensures that $\Pi$ is non-negative, $\varepsilon > 0$ is its regularization parameter, and $\frac{\lambda}{2}\|\alpha\|_2^2$ is an $\ell_2$-regularizer on $\alpha$ with regularization parameter $\lambda \geq 0$.

The objective function $J(\Pi, \alpha)$ is not jointly convex. However, if we fix one of the two variables, it becomes convex in the other. Thus, we employ an alternating optimization approach (see Algorithm 1).

Optimizing Π using the Sinkhorn algorithm: With $\alpha$ fixed, we have the relationship
$$\sum_{i=1}^{n_x} \sum_{j=1}^{n_y} \pi_{ij}\, \alpha^{\top} \varphi(x'_i, y'_j) = \sum_{i=1}^{n_x} \sum_{j=1}^{n_y} \pi_{ij}\, [C_{\alpha}]_{ij},$$
where $C_{\alpha} = K^{\top} \mathrm{diag}(\alpha) L \in \mathbb{R}^{n_x \times n_y}$, $K = (k(x'_1), k(x'_2), \ldots, k(x'_{n_x})) \in \mathbb{R}^{b \times n_x}$, and $L = (l(y'_1), l(y'_2), \ldots, l(y'_{n_y})) \in \mathbb{R}^{b \times n_y}$. This representation can be considered an optimal transport problem if we maximize it with respect to $\Pi$ [5]. It should be noted that the rank of $C_{\alpha}$ is at most $b \ll \min(n_x, n_y)$, where $b$ is a constant (e.g., $b = 100$), and the computational complexity of the cost matrix $C_{\alpha}$ is $O(n_x n_y)$.

Thus, the optimization problem for $\Pi$ can be written as
$$\min_{\Pi}\; -\sum_{i=1}^{n_x} \sum_{j=1}^{n_y} \pi_{ij} (1 - \beta) [C_{\alpha}]_{ij} + \varepsilon H(\Pi)$$
$$\text{s.t.} \quad \Pi \mathbf{1}_{n_y} = n_x^{-1} \mathbf{1}_{n_x} \quad \text{and} \quad \Pi^{\top} \mathbf{1}_{n_x} = n_y^{-1} \mathbf{1}_{n_y},$$
which can be solved efficiently using the Sinkhorn algorithm [5, 21]. In this paper, we use the log-stabilized Sinkhorn algorithm [20]. Note that this optimization problem is convex when $\alpha$ is fixed.
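As an illustration, a minimal (non-stabilized) sketch of this Π-step is given below: it forms $C_{\alpha} = K^{\top} \mathrm{diag}(\alpha) L$ and runs plain Sinkhorn scaling iterations. The paper itself uses the log-stabilized variant [20], so this is an assumption-level simplification in which the exponential may overflow for small $\varepsilon$:

```python
import numpy as np

def update_pi(alpha, K_unp, L_unp, beta, eps, n_iter=1000, tol=1e-9):
    """Pi-step: maximize (1 - beta) <Pi, C_alpha> under entropic regularization eps.

    K_unp: (b, nx) and L_unp: (b, ny) are kernel matrices on the unpaired samples.
    The optimal plan has the form Pi = diag(u) G diag(v), where the Gibbs kernel
    is G = exp((1 - beta) * C_alpha / eps) and u, v enforce the marginals.
    """
    nx, ny = K_unp.shape[1], L_unp.shape[1]
    C = K_unp.T @ (alpha[:, None] * L_unp)      # C_alpha = K^T diag(alpha) L, (nx, ny)
    G = np.exp((1.0 - beta) * C / eps)          # stabilize in the log-domain in practice
    a = np.full(nx, 1.0 / nx)                   # row marginal n_x^{-1} 1
    b = np.full(ny, 1.0 / ny)                   # column marginal n_y^{-1} 1
    u, v = np.ones(nx), np.ones(ny)
    for _ in range(n_iter):
        u_prev = u
        u = a / (G @ v)
        v = b / (G.T @ u)
        if np.max(np.abs(u - u_prev)) < tol:
            break
    return (u[:, None] * G) * v[None, :]        # Pi = diag(u) G diag(v)
```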

Optimizing α: Next, we update $\alpha$ for a given $\Pi$. The optimization problem for $\alpha$ is equivalent to
$$\min_{\alpha}\; \frac{1}{2} \alpha^{\top} H \alpha - \alpha^{\top} h_{\Pi, \beta} + \frac{\lambda}{2} \|\alpha\|_2^2. \qquad (6)$$
Since this is a convex quadratic program, the solution is given analytically as
$$\alpha = (H + \lambda I_b)^{-1} h_{\Pi, \beta}, \qquad (7)$$
where $I_b \in \mathbb{R}^{b \times b}$ is the identity matrix. Note that $H$ depends on neither $\Pi$ nor $\alpha$ and can therefore be precomputed.
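The α-step is a single regularized linear solve; a minimal sketch (using a linear system rather than an explicit inverse):

```python
import numpy as np

def update_alpha(H, h, lam):
    """alpha-step: closed-form solution alpha = (H + lam * I_b)^{-1} h_{Pi,beta}."""
    return np.linalg.solve(H + lam * np.eye(H.shape[0]), h)
```

Alternating this solve with the Sinkhorn update sketched above, with the stopping rule of Algorithm 1, yields the full LSMI-Sinkhorn iteration.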

Convergence Analysis: To optimize $J(\Pi, \alpha)$, we simply need to alternately solve the two convex optimization problems. Thus, the following property holds.

Proposition 1 Algorithm 1 monotonically decreases the objective function $J(\Pi, \alpha)$ in each iteration.

Proof. See the supplementary material.

Model Selection: The LSMI-Sinkhorn algorithm includes several tuning parameters (i.e., $\lambda$ and $\beta$), and determining them is critical to obtaining a good estimate of the SMI. Accordingly, we use cross-validation with a hold-out set to select the model parameters.

First, the paired samples $\{(x_i, y_i)\}_{i=1}^{n}$ are divided into two subsets $\mathcal{D}_{\mathrm{tr}}$ and $\mathcal{D}_{\mathrm{te}}$. Then, we train the density-ratio model $r_{\alpha}(x, y)$ using $\mathcal{D}_{\mathrm{tr}}$ and the unpaired samples $\{x_i\}_{i=n+1}^{n+n_x}$ and $\{y_j\}_{j=n+1}^{n+n_y}$. The hold-out error is calculated by approximating Eq. (4) using the hold-out samples $\mathcal{D}_{\mathrm{te}}$ as
$$J_{\mathrm{te}} = \frac{1}{2|\mathcal{D}_{\mathrm{te}}|^2} \sum_{x, y \in \mathcal{D}_{\mathrm{te}}} r_{\alpha}(x, y)^2 - \frac{1}{|\mathcal{D}_{\mathrm{te}}|} \sum_{(x, y) \in \mathcal{D}_{\mathrm{te}}} r_{\alpha}(x, y),$$
where $|\mathcal{D}|$ denotes the number of samples in the set $\mathcal{D}$, $\sum_{x, y \in \mathcal{D}_{\mathrm{te}}}$ denotes the summation over all combinations of $x$ and $y$ in $\mathcal{D}_{\mathrm{te}}$, and $\sum_{(x, y) \in \mathcal{D}_{\mathrm{te}}}$ denotes the summation over all pairs $(x, y)$ in $\mathcal{D}_{\mathrm{te}}$. We select the parameters with the smallest $J_{\mathrm{te}}$.
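Once the fitted ratio model is available, the hold-out score is straightforward to compute. A minimal sketch, where the callable `ratio` evaluating $r_{\alpha}$ row-wise is an assumed helper:

```python
import numpy as np

def holdout_error(ratio, X_te, Y_te):
    """Hold-out approximation J_te of Eq. (4) on the test pairs D_te.

    The first term averages r_alpha^2 over all combinations of x and y in D_te;
    the second term averages r_alpha over the actual pairs only.
    """
    m = len(X_te)
    X_comb = np.repeat(X_te, m, axis=0)   # every x paired with every y
    Y_comb = np.tile(Y_te, (m, 1))
    term_marginal = 0.5 * np.mean(ratio(X_comb, Y_comb) ** 2)
    term_joint = np.mean(ratio(X_te, Y_te))
    return term_marginal - term_joint
```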

Relation to the Gromov-Wasserstein distance: For $\lambda = 0$, $\varepsilon = 0$, and $\beta = 0$, substituting the optimal $\alpha$ into the loss function (Eq. (5)) gives
$$J(\Pi, \alpha) = -\frac{1}{2} \sum_{i=1}^{n_x} \sum_{j=1}^{n_y} \sum_{i'=1}^{n_x} \sum_{j'=1}^{n_y} \pi_{ij} \pi_{i'j'}\, s(x'_i, y'_j, x'_{i'}, y'_{j'}),$$
where $s(x_i, y_j, x_{i'}, y_{j'}) = \varphi(x_i, y_j)^{\top} H^{-1} \varphi(x_{i'}, y_{j'})$.

5

Page 6: LSMI-Sinkhorn: Semi-supervised Squared-Loss Mutual ...problems such as image matching and photo album summarization. 1 Introduction Mutual information (MI) represents the statistical

Thus, the optimization problem for $\Pi$ can be written as
$$\max_{\Pi}\; \frac{1}{2} \sum_{i=1}^{n_x} \sum_{j=1}^{n_y} \sum_{i'=1}^{n_x} \sum_{j'=1}^{n_y} \pi_{ij} \pi_{i'j'}\, s(x'_i, y'_j, x'_{i'}, y'_{j'})$$
$$\text{s.t.} \quad \Pi \mathbf{1}_{n_y} = n_x^{-1} \mathbf{1}_{n_x}, \quad \Pi^{\top} \mathbf{1}_{n_x} = n_y^{-1} \mathbf{1}_{n_y}, \quad \pi_{ij} \geq 0. \qquad (8)$$
This can be considered a relaxed variant of the quadratic assignment problem (QAP) with similarity $s(x_i, y_j, x_{i'}, y_{j'})$. Since the Gromov-Wasserstein distance is also related to a QAP [15], the proposed method is related to the Gromov-Wasserstein distance.

Relation to Least-Squares Object Matching: Here, we show that the LSOM algorithm [27, 28] can be considered a special case of the proposed framework.

If $\Pi$ is a permutation matrix and $n' = n_x = n_y$, then
$$\Pi \in \{0, 1\}^{n' \times n'}, \quad \Pi \mathbf{1}_{n'} = \mathbf{1}_{n'}, \quad \text{and} \quad \Pi^{\top} \mathbf{1}_{n'} = \mathbf{1}_{n'},$$
so that $\Pi^{\top}\Pi = \Pi\Pi^{\top} = I_{n'}$. Note that we only assume $\mathbf{1}_{n_x}^{\top} \Pi \mathbf{1}_{n_y} = 1$ for the Sinkhorn formulation.

Then, the estimation of the SMI using the permutation matrix can be written as
$$\mathrm{SMI}(X, Y) = \frac{\beta}{2n} \sum_{i=1}^{n} r_{\alpha}(x_i, y_i) + \frac{1 - \beta}{2n'} \sum_{i=1}^{n'} r_{\alpha}(x'_i, y'_{\pi(i)}) - \frac{1}{2},$$
where $\pi(i)$ is the permutation function. The optimization problem is written as
$$\min_{\Pi, \alpha}\; \frac{1}{2} \alpha^{\top} H \alpha - \alpha^{\top} h_{\Pi, \beta} + \frac{\lambda}{2} \|\alpha\|_2^2$$
$$\text{s.t.} \quad \Pi \mathbf{1}_{n'} = \mathbf{1}_{n'}, \quad \Pi^{\top} \mathbf{1}_{n'} = \mathbf{1}_{n'}, \quad \Pi \in \{0, 1\}^{n' \times n'}. \qquad (9)$$

To solve this problem, we can use the Hungarian algorithm [14] instead of the Sinkhorn algorithm [5] for optimizing $\Pi$. It is noteworthy that in the original LSOM algorithm, the permutation matrix is introduced to permute the Gram matrix (i.e., $\Pi L \Pi^{\top}$), and $\Pi$ is also included in the computation of $H$. In our formulation, in contrast, the permutation matrix depends only on $h_{\Pi, \beta}$. This is a small difference in formulation; however, owing to it, we can prove the monotonic decrease of the loss function of the proposed algorithm.

Since LSOM finds an alignment, it is more suited to finding exact matches among samples. In contrast, the proposed Sinkhorn formulation is more suited to cases where there are no exact matches. Moreover, the LSOM formulation can only handle the same number of samples (i.e., $n_x = n_y$). Regarding computational complexity, the Hungarian algorithm requires $O(n'^3)$, while the Sinkhorn algorithm requires $O(n'^2)$.

Computational Complexity: The cost of estimating $\Pi$ consists of computing the cost matrix $C_{\alpha}$ and running the Sinkhorn iterations. Computing $C_{\alpha}$ costs $O(n_x n_y)$, and so does the Sinkhorn algorithm; therefore, the Sinkhorn iteration costs $O(n_x n_y)$. For the $\alpha$ update, computing $H$ costs $O((n + n_x)^2 + (n + n_y)^2)$ and computing $h_{\Pi, \beta}$ costs $O(n_x n_y)$; solving for $\alpha$ costs $O(b^3)$, which is small since $b$ is a constant. Therefore, the initialization requires $O((n + n_x)^2 + (n + n_y)^2)$ and each iteration requires $O(n_x n_y)$. In particular, for small $n$ and $n_x = n_y$, the computational complexity is $O(n_x^2)$.

In contrast, the complexity of computing the Gromov-Wasserstein objective is $O(n_x^4)$ in general and $O(n_x^3)$ for some specific losses (e.g., the $L_2$ loss or the Kullback-Leibler loss) [18]. Moreover, Gromov-Wasserstein is in general NP-hard for arbitrary inputs [17, 18].

6

Page 7: LSMI-Sinkhorn: Semi-supervised Squared-Loss Mutual ...problems such as image matching and photo album summarization. 1 Introduction Mutual information (MI) represents the statistical

4 Related Work

The proposed algorithm is related to MI estimation. Moreover, our LSMI-Sinkhorn algorithm is highly related to the Gromov-Wasserstein distance and to kernelized sorting.

Mutual information estimation: To estimate the MI, the simplest approach is to estimate the probability densities $p(x, y)$ from the paired samples $\{(x_i, y_i)\}_{i=1}^{n}$, $p(x)$ from $\{x_i\}_{i=1}^{n}$, and $p(y)$ from $\{y_i\}_{i=1}^{n}$, respectively. However, because probability density estimation is itself a difficult problem, this naive approach does not tend to work well. To handle this, density-ratio based approaches are promising [24, 25]. More recently, a deep learning based mutual information estimation algorithm has been proposed [1]. However, these approaches still require a large number of paired samples to estimate the models and are thus inefficient when only a limited number of paired samples is available.

Most recently, the Wasserstein Dependency Measure (WDM), which measures the discrepancy between the joint probability $p(x, y)$ and the product of its marginals $p(x)$ and $p(y)$, has been proposed and used for representation learning [16]. Since WDM can be used as an independence measure, it is highly related to LSMI-Sinkhorn. However, that work focuses on finding a good representation by maximizing WDM (i.e., maximizing the mutual information), whereas we focus on estimating the true SMI.

Gromov-Wasserstein and Kernelized Sorting: Given two sets of vectors in different spaces, the Gromov-Wasserstein distance [15] can be used to find the optimal alignment between them. This method uses the pairwise distances between samples within each set to build a distance matrix and then finds a match by minimizing the difference between the pairwise distance matrices:
$$\min_{\Pi}\; \sum_{i=1}^{n_x} \sum_{j=1}^{n_y} \sum_{i'=1}^{n_x} \sum_{j'=1}^{n_y} \pi_{ij} \pi_{i'j'} \big( D(x_i, x_{i'}) - D(y_j, y_{j'}) \big)^2$$
$$\text{s.t.} \quad \Pi \mathbf{1}_{n_y} = a, \quad \Pi^{\top} \mathbf{1}_{n_x} = b, \quad \pi_{ij} \geq 0,$$
where $a \in \Sigma_{n_x}$, $b \in \Sigma_{n_y}$, and $\Sigma_n = \{p \in \mathbb{R}_{+}^{n} : \sum_i p_i = 1\}$.

Therefore, the alignment could be estimated first, followed by estimating the SMI from the aligned samples. However, the Gromov-Wasserstein distance requires solving a quadratic assignment problem (QAP), which is in general NP-hard for arbitrary inputs [17, 18]. In this work, we estimate the SMI by simultaneously solving the alignment and fitting the density ratio, efficiently leveraging the Sinkhorn algorithm and the properties of the squared loss. Moreover, we show that our approach can be considered an instance of the Gromov-Wasserstein framework under a proper choice of the cost function. Recently, a semi-supervised Gromov-Wasserstein-based optimal transport method has been proposed and applied to heterogeneous domain adaptation problems [29]. Their approach can handle tasks similar to those considered in this paper; however, it cannot be used to measure independence.

Kernelized sorting [6, 12, 19] is highly related to the Gromov-Wasserstein distance. Specifically, kernelized sorting determines a set of paired samples by maximizing the Hilbert-Schmidt independence criterion (HSIC) between samples [9]. However, kernelized sorting can only handle the same number of samples (i.e., $\{x'_i\}_{i=1}^{n'}$ and $\{y'_j\}_{j=1}^{n'}$).

5 Experiments

In this section, we evaluate the proposed algorithm on synthetic data and benchmark datasets.


Figure 1: Convergence curve of loss and SMI value.

5.1 Setup

For all methods, we use Gaussian kernels:
$$K(x, x') = \exp\left( -\frac{\|x - x'\|_2^2}{2\sigma_x^2} \right), \qquad L(y, y') = \exp\left( -\frac{\|y - y'\|_2^2}{2\sigma_y^2} \right),$$
where the kernel widths $\sigma_x$ and $\sigma_y$ are set using the median heuristic [22]:
$$\sigma_x = 2^{-1/2}\, \mathrm{median}\big(\{\|x_i - x_j\|_2\}_{i,j=1}^{n_x}\big), \qquad \sigma_y = 2^{-1/2}\, \mathrm{median}\big(\{\|y_i - y_j\|_2\}_{i,j=1}^{n_y}\big).$$
We set the number of basis functions to $b = 200$, $\varepsilon = 0.3$, the maximum number of iterations to $T = 20$, and the stopping parameter to $\eta = 10^{-9}$. The parameters $\beta$ and $\lambda$ are chosen by cross-validation.
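A minimal sketch of this kernel-width selection (the median here is taken over distinct sample pairs via SciPy's pairwise distances; the function names are illustrative):

```python
import numpy as np
from scipy.spatial.distance import pdist

def median_width(Z):
    """sigma = 2^{-1/2} * median of pairwise Euclidean distances within Z."""
    return np.median(pdist(Z)) / np.sqrt(2.0)

def gauss_kernel(x, x_prime, sigma):
    """K(x, x') = exp(-||x - x'||_2^2 / (2 sigma^2))."""
    return np.exp(-np.sum((x - x_prime) ** 2) / (2.0 * sigma ** 2))
```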

5.2 Convergence and Runtime

We first demonstrate the convergence of the loss function and the estimated SMI value. Here, we generate synthetic data from $y = 0.5x + \mathcal{N}(0, 0.01)$ and randomly choose $n = 50$ paired samples and $n_x = n_y = 500$ unpaired samples. The convergence curves are shown in Figure 1. Both the loss and the SMI value converge quickly (within 5 iterations), which is consistent with Proposition 1.
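For reference, data of this form might be generated as follows; the distribution of $x$ is not specified in the text, so a standard normal is assumed here, and $\mathcal{N}(0, 0.01)$ denotes a variance of 0.01 (standard deviation 0.1):

```python
import numpy as np

rng = np.random.default_rng(0)
n_paired, n_unpaired = 50, 500
x = rng.standard_normal((n_paired + n_unpaired, 1))   # assumed x distribution
y = 0.5 * x + rng.normal(scale=0.1, size=x.shape)     # N(0, 0.01) noise, std = 0.1
```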

Then, we compare the runtime of the proposed LSMI-Sinkhorn and Gromov-Wasserstein for CPU and GPU implementations. The data are sampled from two 2-D random measures, with $n_x = n_y \in \{100, 200, \ldots, 9000, 10000\}$ unpaired samples and $n = 100$ paired samples (the latter used only by LSMI-Sinkhorn). For Gromov-Wasserstein, we use the CPU implementation from the Python Optimal Transport toolbox [7] and the PyTorch GPU implementation from [2]. We use the squared loss function and set the entropic regularization $\varepsilon$ to 0.005 following the original code. For LSMI-Sinkhorn, we implement the CPU and GPU versions using NumPy and PyTorch, respectively. For a fair comparison, we use the log-stabilized Sinkhorn algorithm and the same early-stopping criterion and maximum number of iterations as in Gromov-Wasserstein. As shown in Figure 2, compared to Gromov-Wasserstein, LSMI-Sinkhorn is more than one order of magnitude faster in the CPU version and several times faster in the GPU version. This is consistent with our computational complexity analysis. Moreover, the GPU version of our algorithm takes only 3.47 s for 10,000 unpaired samples, indicating that it is suitable for large-scale applications.


Figure 2: Runtime Comparison of LSMI-Sinkhorn and Gromov-Wasserstein.

5.3 SMI Estimation

For SMI estimation, we set up four baselines:

• LSMI (full): 10,000 paired samples are used for cross-validation and SMI estimation. This is considered the ground-truth value.

• LSMI: Only the $n$ (usually small) paired samples are used for cross-validation and SMI estimation.

• LSMI (opt): The $n$ paired samples are used for SMI estimation, but with the optimal parameters obtained from LSMI (full). This can be seen as an upper bound for SMI estimation with a limited number of paired samples, because the optimal parameters are usually unavailable.

• Gromov-SMI: The Gromov-Wasserstein distance is applied to the unpaired samples to find a potential matching ($\min(n_x, n_y)$ matched pairs). Then, the matched pairs and the existing $n$ paired samples are combined to perform cross-validation and SMI estimation.

Synthetic Data: In this experiment, we generated four types of paired samples: random normal, $y = 0.5x + \mathcal{N}(0, 0.01)$ (Linear), $y = \sin(x)$ (Nonlinear), and $y = \mathrm{PCA}(x)$. We varied the number of paired samples $n \in \{10, 20, \ldots, 100\}$ while fixing $n_x = 1000$ and $n_y = 500$ for Gromov-SMI and the proposed LSMI-Sinkhorn. The model parameters $\lambda$ and $\beta$ are selected by cross-validation on the paired examples, with $\lambda \in \{0.1, 0.01, 0.001, 0.0001\}$ and $\beta \in \{0.2, 0.4, 0.6, 0.8, 1.0\}$. The results are shown in Figure 3. In the random case, the data are nearly independent and our algorithm yields a small SMI value. In the other cases, LSMI-Sinkhorn yields a better estimate of the SMI value, which lies near the ground truth as $n$ increases. In contrast, Gromov-SMI produces small estimates, which may be due to incorrect potential matchings when $n_x \neq n_y$.

UCI Datasets: We selected four benchmark datasets from the UCI machine learning repository. For each dataset, we split the features into two sets to form paired samples. To ensure high dependence between the two subsets of features, we used the same splitting strategy as [19], based on the correlation matrix. The experimental setting is the same as for the synthetic data, except that $n_x = n_y = 500$. The SMI estimation results are shown in Figure 4.


Figure 3: SMI estimation on synthetic data.

Similarly, LSMI-Sinkhorn obtains better estimates on all four datasets. In most cases, Gromov-SMI tends to overestimate the value by a large margin, while the other baselines underestimate it.

5.4 Deep Image Matching

Next, we consider an image matching task with deep convolutional features. We use two commonly used image classification benchmarks: CIFAR10 [13] and STL10 [3]. We extracted 64-dimensional features from the last layer of a ResNet20 [10] pretrained on the training set of CIFAR10. The features are divided into two 32-dimensional parts denoted by $\{x_i\}_{i=1}^{N}$ and $\{y_i\}_{i=1}^{N}$. We shuffle the samples of $y$ and attempt to match $x$ and $y$ given limited paired samples ($n \in \{10, 20, \ldots, 100\}$) and unpaired samples ($n_x = n_y = 500$). The other settings are the same as in the above experiments.

To evaluate the matching performance, we used the top-1 accuracy, the top-2 accuracy (the correct match is among the two highest scores), and the class accuracy (matched samples belong to the same class). As shown in Figure 5, LSMI-Sinkhorn obtains high accuracy with only a few tens of supervised pairs. Additionally, the high class-matching performance implies that our algorithm can be applied to further applications such as semi-supervised image classification.


Figure 4: SMI estimation on UCI datasets.

We then investigate the impact of the Sinkhorn regularization parameter $\varepsilon$. With $n$ fixed to 50, we show the matching accuracy on CIFAR10 and STL10 for varying $\varepsilon$ in Figure 6. The matching accuracy gradually drops as $\varepsilon$ increases. This is due to an intrinsic property of the entropic regularization: with a larger $\varepsilon$, the assignment matrix $\Pi$ becomes smoother, and thus the matching accuracy drops.

5.5 Photo Album Summarization

Finally, we apply the proposed LSMI-Sinkhorn to the photo album summarization problem, where images are matched to a predefined structure according to the Cartesian coordinate system.

Color Feature: We first used 320 images collected from Flickr [19] and extracted the raw RGB pixels as the color feature. Figures 7a and 7b depict the semi-supervised summarization onto the triangle and 16 × 20 grids, with the corners of the grids fixed to green, orange, black (triangle), and blue (rectangle) images. Similarly, we show the summarization results on an "AAAI 2020" grid with the center of each character fixed. These layouts exhibit a good color topology with respect to the fixed color images.

Semantic Feature: We then used CIFAR10 with ResNet20 features to illustrate semantic album summarization. Figure 8 shows the layout of 1000 images onto the same triangle, 16 × 20, and "AAAI 2020" grids.


Figure 5: Deep image matching.

Figure 6: Matching accuracy with various ε.

For Figures 8a and 8b, we fixed the corners of the grid to automobile, airplane, horse (triangle), and dog (rectangle) images. For Figure 8c, we fixed the corresponding character centers. It can be seen that similar objects are aligned together by their semantic meaning, rather than by color, with respect to the fixed images.

In comparison with previous summarization algorithms, LSMI-Sinkhorn has two advantages. First, its semi-supervised nature enables interactive album summarization, which kernelized sorting [6, 19] and object matching [28] cannot provide. Second, we obtain a solution for general rectangular matching ($n_x \neq n_y$), e.g., 320 images onto a triangle grid or 1000 images onto a 16 × 20 grid, whereas most previous methods [19, 28] rely on the Hungarian algorithm [14] and can only perform square matching, which is less flexible than the proposed method.

6 Conclusion

In this paper, we proposed the least-squares mutual information with Sinkhorn (LSMI-Sinkhorn) algorithm to estimate the SMI from a limited number of paired samples. To the best of our knowledge, this is the first semi-supervised SMI estimation algorithm. Through experiments on synthetic and real-world examples, we showed that the proposed algorithm can successfully estimate the SMI with a small number of paired samples. Moreover, we demonstrated that the proposed algorithm can be used for image matching and photo album summarization.


(a) Triangle (b) 16× 20 (c) “AAAI 2020”

Figure 7: Photo album summarization on the Flickr dataset. For (a) and (b), we fixed the corners with green, orange, black (triangle), and blue (rectangle) images. For (c), we fixed the center of each character with different images.

(a) Triangle (b) 16× 20 (c) “AAAI 2020”

Figure 8: Photo album summarization on the CIFAR10 dataset. For (a) and (b), we fixed the corners with automobile, airplane, horse (triangle), and dog (rectangle) images. For (c), we fixed the center of each character with different images.

References

[1] Belghazi, M. I.; Baratin, A.; Rajeshwar, S.; Ozair, S.; Bengio, Y.; Hjelm, D.; and Courville, A. 2018. Mutual information neural estimation. In ICML.

[2] Bunne, C.; Alvarez-Melis, D.; Krause, A.; and Jegelka, S. 2019. Learning generative models across incomparable spaces. In ICML.

[3] Coates, A.; Ng, A.; and Lee, H. 2011. An analysis of single-layer networks in unsupervised feature learning. In AISTATS.

[4] Cover, T. M., and Thomas, J. A. 2006. Elements of Information Theory. Hoboken, NJ, USA: John Wiley & Sons, Inc., 2nd edition.

[5] Cuturi, M. 2013. Sinkhorn distances: Lightspeed computation of optimal transport. In NIPS.

[6] Djuric, N.; Grbovic, M.; and Vucetic, S. 2012. Convex kernelized sorting. In AAAI.

[7] Flamary, R., and Courty, N. 2017. POT: Python Optimal Transport library.

[8] Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; and Bengio, Y. 2014. Generative adversarial nets. In NIPS.


[9] Gretton, A.; Bousquet, O.; Smola, A.; and Scholkopf, B. 2005. Measuring statistical dependence with Hilbert-Schmidt norms. In ALT.

[10] He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016. Deep residual learning for image recognition. In CVPR.

[11] Hjelm, R. D.; Fedorov, A.; Lavoie-Marchildon, S.; Grewal, K.; Bachman, P.; Trischler, A.; and Bengio, Y. 2019. Learning deep representations by mutual information estimation and maximization. In ICLR.

[12] Jebara, T. 2004. Kernelized sorting, permutation, and alignment for minimum volume PCA. In COLT.

[13] Krizhevsky, A., et al. 2009. Learning multiple layers of features from tiny images. Technical report, Citeseer.

[14] Kuhn, H. 1955. The Hungarian method for the assignment problem. Naval Research Logistics Quarterly 2(1-2):83–97.

[15] Memoli, F. 2011. Gromov-Wasserstein distances and the metric approach to object matching. Foundations of Computational Mathematics 11(4):417–487.

[16] Ozair, S.; Lynch, C.; Bengio, Y.; Oord, A. v. d.; Levine, S.; and Sermanet, P. 2019. Wasserstein dependency measure for representation learning. In NeurIPS.

[17] Peyre, G., and Cuturi, M. 2019. Computational optimal transport. Foundations and Trends in Machine Learning 11(5-6):355–607.

[18] Peyre, G.; Cuturi, M.; and Solomon, J. 2016. Gromov-Wasserstein averaging of kernel and distance matrices. In ICML.

[19] Quadrianto, N.; Smola, A.; Song, L.; and Tuytelaars, T. 2010. Kernelized sorting. IEEE Transactions on Pattern Analysis and Machine Intelligence 32:1809–1821.

[20] Schmitzer, B. 2019. Stabilized sparse scaling algorithms for entropy regularized transport problems. SIAM Journal on Scientific Computing 41(3):A1443–A1481.

[21] Sinkhorn, R. 1974. Diagonal equivalence to matrices with prescribed row and column sums. Proceedings of the American Mathematical Society 45(2):195–198.

[22] Sriperumbudur, B. K.; Fukumizu, K.; Gretton, A.; Lanckriet, G. R.; and Scholkopf, B. 2009. Kernel choice and classifiability for RKHS embeddings of probability distributions. In NIPS.

[23] Suzuki, T., and Sugiyama, M. 2010. Sufficient dimension reduction via squared-loss mutual information estimation. In AISTATS.

[24] Suzuki, T.; Sugiyama, M.; Kanamori, T.; and Sese, J. 2009. Mutual information estimation reveals global associations between stimuli and biological processes. BMC Bioinformatics 10(S52).

[25] Suzuki, T.; Sugiyama, M.; and Tanaka, T. 2009. Mutual information approximation via maximum likelihood estimation of density ratio. In ISIT.

[26] Yamada, M., and Sugiyama, M. 2010. Dependence minimizing regression with model selection for non-linear causal inference under non-Gaussian noise. In AAAI.

[27] Yamada, M., and Sugiyama, M. 2011. Cross-domain object matching with model selection. In AISTATS.


[28] Yamada, M.; Sigal, L.; Raptis, M.; Toyoda, M.; Chang, Y.; and Sugiyama, M. 2015. Cross-domain matching with squared-loss mutual information. IEEE Transactions on Pattern Analysis and Machine Intelligence 37(9):1764–1776.

[29] Yan, Y.; Li, W.; Wu, H.; Min, H.; Tan, M.; and Wu, Q. 2018. Semi-supervised optimal transport for heterogeneous domain adaptation. In IJCAI.

[30] Zhao, S.; Song, J.; and Ermon, S. 2017. InfoVAE: Information maximizing variational autoencoders. arXiv preprint arXiv:1706.02262.
