
Journal of Machine Learning Research 14 (2013) 1229-1250 Submitted 4/08; Revised 3/10; Published 5/13

Manifold Regularization and Semi-supervised Learning: Some Theoretical Analyses

Partha Niyogi∗

Departments of Computer Science, Statistics
University of Chicago
1100 E. 58th Street, Ryerson 167
Hyde Park, Chicago, IL 60637, USA

Editor: Zoubin Ghahramani

Abstract

Manifold regularization (Belkin et al., 2006) is a geometrically motivated framework for machine learning within which several semi-supervised algorithms have been constructed. Here we try to provide some theoretical understanding of this approach. Our main result is to expose the natural structure of a class of problems on which manifold regularization methods are helpful. We show that for such problems, no supervised learner can learn effectively. On the other hand, a manifold based learner (that knows the manifold or "learns" it from unlabeled examples) can learn with relatively few labeled examples. Our analysis follows a minimax style with an emphasis on finite sample results (in terms of n: the number of labeled examples). These results allow us to properly interpret manifold regularization and related spectral and geometric algorithms in terms of their potential use in semi-supervised learning.

Keywords: semi-supervised learning, manifold regularization, graph Laplacian, minimax rates

1. Introduction

∗. This article had been accepted subject to minor revisions by JMLR at the time the author sadly passed away. JMLR thanks Mikhail Belkin and Richard Maclin for their help in preparing the final version.

© 2013 Partha Niyogi.

The last decade has seen a flurry of activity within machine learning on two topics that are the subject of this paper: manifold methods and semi-supervised learning. While manifold methods are generally applicable to a variety of problems, the framework of manifold regularization (Belkin et al., 2006) is especially suitable for semi-supervised applications.

Manifold regularization provides a framework within which many graph based algorithms for semi-supervised learning have been derived (see Zhu, 2008, for a survey). There are many things that are poorly understood about this framework. First, manifold regularization is not a single algorithm but rather a collection of algorithms. So what exactly is "manifold regularization"? Second, while many semi-supervised algorithms have been derived from this perspective and many have enjoyed empirical success, there are few theoretical analyses that characterize the class of problems on which manifold regularization approaches are likely to work. In particular, there is some confusion on a seemingly fundamental point. Even when the data might have a manifold structure, it is not clear whether learning the manifold is necessary for good performance. For example, recent results (Bickel and Li, 2007; Lafferty and Wasserman, 2007) suggest that when data lives on a low dimensional manifold, it may be possible to obtain good rates of learning using classical methods suitably


adapted without knowing very much about the manifold in question beyond its dimension. This has led some people (e.g., Lafferty and Wasserman, 2007) to suggest that manifold regularization does not provide any particular advantage.

What is particularly missing in the prior research so far is a crisp theoretical statement which shows the benefits of manifold regularization techniques quite clearly. This paper provides such a theoretical analysis, and explicates the nature of manifold regularization in the context of semi-supervised learning. Our main theorems (Theorems 2 and 4) show that there can be classes of learning problems on which (i) a learner that knows the manifold (alternatively learns it from large (infinite) unlabeled data via manifold regularization) obtains a fast rate of convergence (upper bound) while (ii) without knowledge of the manifold (via oracle access or manifold learning), no learning scheme exists that is guaranteed to converge to the target function (lower bound). This provides for the first time a clear separation between a manifold method and alternatives for a suitably chosen class of problems (problems that have intrinsic manifold structure). To illustrate this conceptual point, we have defined a simple class of problems where the support of the data is simply a one dimensional manifold (the circle) embedded in an ambient Euclidean space. Our result is the first of this kind. However, it is worth emphasizing that this conceptual point may also obtain in far more general manifold settings. The discussion of Section 2.3 and the theorems of Section 3.2 provide pointers to these more general results that may cover cases of greater practical relevance.

The plan of the paper: Against this backdrop, the rest of the paper is structured as follows. In Section 1.1, we develop the basic minimax framework of analysis that allows us to compare the rates of learning for manifold based semi-supervised learners and fully supervised learners. Following this, in Section 2, we demonstrate a separation between the two kinds of learners by proving an upper bound on the manifold based learner and a lower bound on any alternative learner. In Section 3, we take a broader look at manifold learning and regularization in order to expose some subtle issues around these subjects that have not been carefully considered by the machine learning community. This section also includes generalizations of our main theorems of Section 2. In Section 4, we consider the general structure that learning problems must have for semi-supervised approaches to be viable. We show how both the classical results of Castelli and Cover (1996, one of the earliest known examples of the power of semi-supervised learning) and the recent results of manifold regularization relate to this general structure. Finally, in Section 5 we reiterate our main conclusions.

1.1 A Minimax Framework for Analysis

A learning problem is specified by a probability distribution p on X × Y according to which labelled examples z_i = (x_i, y_i) are drawn and presented to a learning algorithm (estimation procedure). We are interested in an understanding of the case in which X = R^D, Y ⊂ R, but p_X (the marginal distribution of p on X) is supported on some submanifold M ⊂ X. In particular, we are interested in understanding how knowledge of this submanifold may potentially help a learning algorithm. To this end, we will consider two kinds of learning algorithms:

1. Algorithms that have no knowledge of the submanifold M but learn from (x_i, y_i) pairs in a purely supervised way.

2. Algorithms that have perfect knowledge of the submanifold. This knowledge may be acquired by a manifold learning procedure through unlabeled examples x_i's and having access to an essentially infinite number of them. Such a learner may be viewed as a semi-supervised learner.

Our main result is to elucidate the structure of a class of problems on which there is a difference in the performance of algorithms of Type 1 and 2.

Let P be a collection of probability distributions p and thus denote a class of learning problems. For simplicity and ease of comparison with other classical results, we place some regularity conditions on P. Every p ∈ P is such that its marginal p_X has support on a k-dimensional manifold M ⊂ X. Different p's may have different supports. For simplicity, we will consider the case where p_X is uniform on M: this corresponds to a situation in which the marginal is the most regular.

Given such a P we can naturally define the class P_M to be
\[ \mathcal{P}_{\mathcal{M}} = \{\, p \in \mathcal{P} \mid p_X \text{ is uniform on } \mathcal{M} \,\}. \]
Clearly, we have P = ∪_M P_M.

Consider p ∈ P_M. This denotes a learning problem, and the regression function m_p is defined as
\[ m_p(x) = E[y \mid x] \quad \text{when } x \in \mathcal{M}. \]

Note that m_p(x) is not defined outside of M. We will be interested in cases when m_p belongs to some restricted family of functions H_M (for example, a Sobolev space). Thus assuming a family H_M is equivalent to assuming a restriction on the class of conditional probability distributions p(y|x) where p ∈ P. For simplicity, we will assume the noiseless case where p(y|x) is either 0 or 1 for every x and every y, that is, there is no noise in the Y space.

Since X \ M has measure zero (with respect to p_X), we can define m_p(x) to be anything we want when x ∈ X \ M. We define m_p(x) = 0 when x ∉ M.

For a learning problem p, the learner is presented with a collection of labeled examples z_i = (x_i, y_i), i = 1, ..., n, where each z_i is drawn i.i.d. according to p. A learning algorithm A maps the collection of data z = (z_1, ..., z_n) into a function A(z). Now we can define the following minimax rate (for the class P) as
\[ R(n, \mathcal{P}) = \inf_{A} \sup_{p \in \mathcal{P}} E_{\mathbf{z}} \| A(\mathbf{z}) - m_p \|_{L^2(p_X)}. \]

This is the best possible rate achieved by any learner that has no knowledge of the manifold M. We will contrast it with a learner that has oracle access endowing it with knowledge of the manifold. To begin, note that since P = ∪_M P_M, we see that
\[ R(n, \mathcal{P}) = \inf_{A} \sup_{\mathcal{M}} \sup_{p \in \mathcal{P}_{\mathcal{M}}} E_{\mathbf{z}} \| A(\mathbf{z}) - m_p \|_{L^2(p_X)}. \]

Now a manifold based learner A′ is given a collection of labeled examples z = (z_1, ..., z_n) just like the supervised learner. However, in addition, it also has knowledge of M (the support of the unlabeled data). It might acquire this knowledge through manifold learning or through oracle access (the limit of infinite amounts of unlabeled data). Thus A′ maps (z, M) into a function denoted by A′(z, M). The minimax rate for such a manifold based learner for the class P_M is given by
\[ \inf_{A'} \sup_{p \in \mathcal{P}_{\mathcal{M}}} E_{\mathbf{z}} \| A'(\mathbf{z}, \mathcal{M}) - m_p \|_{L^2(p_X)}. \]


Taking the supremum over all possible manifolds (just as in the supervised case), we have
\[ Q(n, \mathcal{P}) = \sup_{\mathcal{M}} \inf_{A'} \sup_{p \in \mathcal{P}_{\mathcal{M}}} E_{\mathbf{z}} \| A'(\mathbf{z}, \mathcal{M}) - m_p \|_{L^2(p_X)}. \]

1.2 The Manifold Assumption for Semi-supervised Learning

So the question at hand is: for what class of problems P, with the structure described above, might one expect a gap between R(n, P) and Q(n, P)? This is the class of problems for which knowing the manifold confers an advantage to the learner.

There are two main assumptions behind the manifold based approach to semi-supervised learning. First, one assumes that the support of the probability distribution is on some low dimensional manifold. The motivation behind this assumption comes from the intuition that although natural data in its surface form lives in a high dimensional space (speech, image, text, etc.), they are often generated by systems with much fewer underlying degrees of freedom and therefore have lower intrinsic dimensionality. This assumption and its corresponding motivation have been articulated many times in papers on manifold methods (see Roweis and Saul, 2000, for example). Second, one assumes that the underlying target function one is trying to learn (for prediction) is smooth with respect to this underlying manifold. A smoothness assumption lies at the heart of many machine learning methods including especially splines (Wahba, 1990), regularization networks (Evgeniou et al., 2000), and kernel based methods (using regularization in reproducing kernel Hilbert spaces; Scholkopf and Smola, 2002). However, smoothness in these approaches is typically measured in the ambient Euclidean space. In manifold regularization, a geometric smoothness penalty is instead imposed.

Thus, for a manifold M, let φ_1, φ_2, . . . be the eigenfunctions of the manifold Laplacian (ordered by frequency). Then m_p(x) may be expressed in this basis as m_p = Σ_i α_i φ_i or
\[ m_p = \mathrm{sign}\Big( \sum_i \alpha_i \phi_i \Big), \]
where the α_i's have a sharp decay to zero.

Against this backdrop, one might now consider manifold regularization to get some better understanding of when and why it might be expected to provide good semi-supervised learning. First off, it is worthwhile to clarify what is meant by manifold regularization. The term "manifold regularization" was introduced by Belkin et al. (2006) to describe a class of algorithms in which geometrically motivated regularization penalties were used. One unifying framework adopts a setting of Tikhonov regularization over a Reproducing Kernel Hilbert Space of functions to yield algorithms that arise as special cases of the following:
\[ \hat{f} = \arg\min_{f \in H_K} \frac{1}{n} \sum_{i=1}^{n} V(f(x_i), y_i) + \gamma_A \| f \|_K^2 + \gamma_I \| f \|_I^2. \tag{1} \]

Here K : X × X → R is a p.d. kernel that defines a suitable RKHS (H_K) of functions that are ambiently defined. The ambient RKHS norm ||f||_K and an "intrinsic norm" ||f||_I are traded off against each other. Intuitively, the intrinsic norm ||f||_I penalizes functions by considering only f_M, the restriction of f to M, and essentially considering various smoothness functionals. Since the eigenfunctions of the Laplacian provide a basis for L^2 functions intrinsically defined on M, one might express f_M = Σ_i α_i φ_i in this basis and consider constraints on the coefficients.

Remarks:

1. Various choices of $\|f\|_I^2$ include: (i) iterated Laplacian, given by $\int_{\mathcal M} f(\Delta^i f) = \sum_j \alpha_j^2 \lambda_j^i$; (ii) heat kernel, given by $\sum_j e^{t\lambda_j} \alpha_j^2$; and (iii) band limiting, given by $\|f\|_I^2 = \sum_i \mu_i \alpha_i^2$ where $\mu_i = \infty$ for all $i > p$.

2. The loss function V can vary from squared loss to hinge loss to the 0–1 loss for classification, giving rise to different kinds of algorithmic procedures.

3. While Equation 1 is regularization in the Tikhonov form, one could consider other kinds of model selection principles that are in the spirit of manifold regularization. For example, the method of Belkin and Niyogi (2003) is a version of the method of sieves that may be interpreted as manifold regularization with bandlimited functions where one allows the bandwidth to grow as more and more data becomes available.

4. The formalism provides a class of algorithms A′ that have access to labeled examples z and the manifold M, from which all the terms in the optimization of Equation 1 can be computed. Thus A′(z, M) = $\hat f$.

5. Finally, it is worth noting that in practice, when the manifold is unknown, the quantity $\|f\|_I^2 = \int_{\mathcal M} f(\Delta^i f)$ is approximated by collecting unlabeled points x_i ∈ M, making a suitable nearest neighbor graph with the vertices identified with the unlabeled points, and regularizing the function using the graph Laplacian. The graph is viewed as a proxy for the manifold, and in this sense many graph based approaches to semi-supervised learning (see Zhu, 2008, for a review) may be accommodated within the purview of manifold regularization.

The point of these remarks is that manifold regularization combines the perspective of kernel based methods with the perspective of manifold and graph based methods. It admits a variety of different algorithms that incorporate a geometrically motivated complexity penalty. We will later demonstrate (in Section 3) one such canonical algorithm for the class of learning problems considered in Section 2 of this paper.
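To make the optimization of Equation 1 concrete, here is a minimal sketch of one member of this family, a Laplacian-regularized least squares learner: V is taken to be the squared loss, a Gaussian kernel defines H_K, and a heat-kernel graph Laplacian built on labeled plus unlabeled points stands in for $\|f\|_I^2$ (as in Remark 5). The hyperparameter values and the particular normalization of the intrinsic term are illustrative assumptions, not prescriptions from the paper.

```python
import numpy as np

def rbf_kernel(A, B, sigma=0.5):
    # Gaussian kernel matrix between the rows of A and the rows of B.
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def graph_laplacian(X, sigma=0.5):
    # Heat-kernel weights on all points; f^T L f with L = D - W is a discrete
    # smoothness functional on the graph, used here as the intrinsic penalty.
    W = rbf_kernel(X, X, sigma)
    np.fill_diagonal(W, 0.0)
    return np.diag(W.sum(axis=1)) - W

def laprls_fit(X_lab, y_lab, X_unlab, gamma_A=1e-2, gamma_I=1e-1, sigma=0.5):
    # Squared-loss instance of Equation 1 over the span of kernel functions
    # centered at the labeled and unlabeled points (representer-style expansion).
    X = np.vstack([X_lab, X_unlab])
    l, m = len(X_lab), len(X)
    K = rbf_kernel(X, X, sigma)
    L = graph_laplacian(X, sigma)
    J = np.zeros((m, m)); J[:l, :l] = np.eye(l)          # selects the labeled points
    y = np.concatenate([y_lab, np.zeros(m - l)])
    # One common closed form for the expansion coefficients; the 1/m**2 scaling
    # of the intrinsic term is an assumed convention.
    a = np.linalg.solve(J @ K + gamma_A * l * np.eye(m)
                        + (gamma_I * l / m ** 2) * (L @ K), y)
    return X, a, sigma

def laprls_predict(model, X_new):
    X, a, sigma = model
    return rbf_kernel(X_new, X, sigma) @ a               # real-valued; threshold for labels
```

Setting gamma_I = 0 recovers ordinary kernel ridge regression, which is one way to see exactly what the geometric term adds on top of a purely ambient method.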

2. A Prototypical Example: Embeddings of the Circle into Euclidean Space

In this section, we will construct a class of learning problems P that have manifold structure P = ∪_M P_M and demonstrate a separation between R(n, P) and Q(n, P). For simplicity, we will show a specific construction where every M considered is a different embedding of the circle into Euclidean space. In particular, we will see that R(n) = Ω(1) while lim_{n→∞} Q(n) = 0 at a fast rate. Thus the learner with knowledge of the manifold learns easily, while the learner with no such knowledge cannot learn at all.

Let φ : S^1 → X be an isometric embedding of the circle into a Euclidean space. Now consider the family of such isometric embeddings and let this be the family of one-dimensional submanifolds that we will deal with. Thus each M ⊂ X is of the form M = φ(S^1) for some φ.

Let H_{S^1} be the set of functions defined on the circle that take the value +1 on half the circle and −1 on the other half. Thus, in local coordinates (θ denoting the coordinate of a point in S^1), we can write the class H_{S^1} as
\[ H_{S^1} = \{\, h_\alpha : S^1 \to \mathbb{R} \mid h_\alpha(\theta) = \mathrm{sign}(\sin(\theta + \alpha));\ \alpha \in [0, 2\pi) \,\}. \]

Now for each M = φ(S^1) we can define the class H_M as
\[ H_{\mathcal M} = \{\, h : \mathcal M \to \mathbb{R} \mid h(x) = h_\alpha(\phi^{-1}(x)) \text{ for some } h_\alpha \in H_{S^1} \,\}. \tag{2} \]

This defines a class of regression functions (also classification functions) for our setting. Correspondingly, in our noiseless setting, we can now define P_M as follows. For each h ∈ H_M, we can define the probability distribution p^{(h)} on X × Y by letting the marginal p^{(h)}_X be uniform on M and the conditional p^{(h)}(y|x) be a probability mass function concentrated on two points y = +1 and y = −1 such that
\[ p^{(h)}(y = +1 \mid x) = 1 \iff h(x) = +1. \]

Thus
\[ \mathcal{P}_{\mathcal M} = \{\, p^{(h)} \mid h \in H_{\mathcal M} \,\}. \]

In our setting, we can therefore interpret the learning problem as an instantiation either of regression or of classification, based on our interest.

Now that P_M is defined, the set P = ∪_M P_M follows naturally. A picture of the situation is shown in Figure 1.
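As a concrete illustration of one problem in this class, the following sketch samples labeled examples from a single p^{(h)}: the circle is embedded rigidly into a randomly chosen two-dimensional subspace of R^D (one particularly simple isometric embedding; the dimension D, the angle α, and the sample size are all illustrative choices), the marginal is uniform on the embedded circle, and labels come from h_α(θ) = sign(sin(θ + α)) with no noise.

```python
import numpy as np

def embed_circle(theta, D=5, seed=0):
    # A simple isometric embedding of S^1 into R^D: the unit circle placed rigidly
    # inside a random two-dimensional subspace (one member of the family of embeddings).
    rng = np.random.default_rng(seed)
    Q, _ = np.linalg.qr(rng.standard_normal((D, 2)))   # orthonormal 2-frame in R^D
    circle = np.stack([np.cos(theta), np.sin(theta)], axis=1)
    return circle @ Q.T

def sample_problem(n, alpha=0.7, D=5, seed=0):
    # Draw n labeled examples from p^(h): theta uniform on S^1 (so p_X is uniform on M),
    # and y = h_alpha(theta) = sign(sin(theta + alpha)) deterministically (noiseless case).
    rng = np.random.default_rng(seed + 1)
    theta = rng.uniform(0.0, 2.0 * np.pi, size=n)
    X = embed_circle(theta, D, seed)
    y = np.sign(np.sin(theta + alpha))
    return X, y

X_lab, y_lab = sample_problem(n=10)   # the learner only ever sees these ambient coordinates
```

The ambient coordinates X by themselves carry no hint of which of the many possible embeddings produced them, which is exactly the ambiguity the lower bound of Section 2.2 exploits.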

Remark 1 Recall that many machine learning methods (notably splines and kernel methods) construct classifiers from spaces of smooth functions. The Sobolev spaces (see Adams and Fournier, 2003) are spaces of functions whose derivatives up to a certain order are square integrable. These spaces arise in theoretical analysis of such machine learning methods, and it is often the case that predictors are chosen from such spaces or regression functions are assumed to be in such spaces, depending on the context of the work. For example, Lafferty and Wasserman (2007) make precisely such an assumption. In our setting, note that H_{S^1}, and correspondingly H_M as defined above, is not itself a Sobolev space. However, it is obtained by thresholding functions in a Sobolev space. In particular, we can write
\[ H_{S^1} = \{\, \mathrm{sign}(h) \mid h = \alpha \phi + \beta \psi \,\} \]
where φ(θ) = sin(θ) and ψ(θ) = cos(θ) are eigenfunctions of the Laplacian Δ_{S^1} on the circle. These are the eigenfunctions corresponding to λ = 1 and define the corresponding two dimensional eigenspace. More generally, one could consider a family of functions obtained by thresholding functions in a Sobolev space of any chosen order, and clearly H_{S^1} is contained in any such family. Finally, it is worth noting that the arguments presented below do not depend on thresholding and would work with functions that are bandlimited or in a Sobolev space just as well.


Figure 1: Shown are two embeddings M_1 and M_2 of the circle in Euclidean space (the plane in this case). The two functions, one from M_1 → R and the other from M_2 → R, are denoted by the labelings +1, −1 corresponding to the half circles shown.

2.1 Upper Bound on Q(n, P)

Let us begin by noting that if the manifold M is known, the learner knows the class P_M. The learner merely needs to approximate one of the target functions in H_M. It is clear that the space H_M is a family of 0–1 valued functions whose VC-dimension is 2. Therefore, an algorithm that does empirical risk minimization over the class H_M will yield an upper bound of $O\big(\sqrt{\log(n)/n}\big)$ by the usual arguments. Therefore the following theorem is obtained.

Theorem 2 Following the notation of Section 1, let H_M be the family of functions defined by Equation 2 and P be the corresponding family of learning problems. Then the learner with knowledge of the manifold converges at a fast rate given by
\[ Q(n, \mathcal P) \le 2 \sqrt{\frac{3 \log(n)}{n}} \]
and this rate is optimal. Thus every problem in this class P can be learned efficiently.
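For readers who want the "usual arguments" spelled out, here is one standard route (a sketch; constants are illustrative and may differ from those in the theorem). Since H_M has VC dimension 2, the empirical risk converges to the true risk uniformly over H_M, so that with probability at least 1 − δ,
\[ \sup_{h \in H_{\mathcal M}} \big| \widehat{R}_n(h) - R(h) \big| \le \sqrt{\frac{2\log n + \log(1/\delta)}{n}}, \]
where R(h) = P(h(x) ≠ y) and $\widehat{R}_n$ is the empirical risk (this is the same uniform bound invoked later in the proof of Theorem 7). In the noiseless case the target m_p lies in H_M, so the empirical risk minimizer $\hat h$ has $\widehat{R}_n(\hat h) = 0$ and therefore R($\hat h$) is at most the deviation term. Since $\hat h$ and m_p are ±1 valued, $\|\hat h - m_p\|_{L^2(p_X)} = 2\sqrt{R(\hat h)}$; taking δ = 1/n and accounting for the small-probability failure event then gives a bound of order $\sqrt{\log(n)/n}$, matching the rate of Theorem 2 up to constants.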

Remark 3 If the class H_M is a parametric family of the form $\sum_{i=1}^{p} \alpha_i \phi_i$ where the φ_i are eigenfunctions of the Laplacian, one obtains the same parametric rate. Similarly, if the class H_M is a ball in a Sobolev space of appropriate order, suitable rates on the family may be obtained by the usual arguments.

2.2 Lower Bound on R(n, P)

We now prove the following.

Theorem 4 Let P = ∪_M P_M where each M = φ(S^1) is an isometric embedding of the circle into X as shown. For each p ∈ P, the marginal p_X is uniform on some M and the conditional p(y|x) is given by the construction in the previous section. Then
\[ R(n, \mathcal P) = \inf_A \sup_{p \in \mathcal P} E_{\mathbf z} \| A(\mathbf z) - m_p \|_{L^2(p_X)} = \Omega(1). \]
Thus, it is not the case that every problem in the class P can be learned efficiently. In other words, for every n, there exists a problem in P that requires more than n examples.

We provide the proof below. A specific role in the proof is played by a construction (Construction 1 later in the proof) that is used to show the existence of a family of geometrically structured learning problems (probability measures) that will end up becoming unlearnable as a family.

Proof Given n, choose a number d = 2n. Following Construction 1, there exists a set (denoted by P_d ⊂ P) of 2^d probability distributions that may be defined. Our proof uses the probabilistic method. We show that there exists a universal constant K (independent of n) such that
\[ \forall A, \quad \frac{1}{2^d} \sum_{p \in \mathcal P_d} E_{\mathbf z} \| A(\mathbf z) - m_p \|_{L^2(p_X)} \ge K, \]
from which we conclude that
\[ \forall A, \quad \sup_{p \in \mathcal P_d} E_{\mathbf z} \| A(\mathbf z) - m_p \|_{L^2(p_X)} \ge K. \]
Since P_d ⊂ P, the result follows.

To begin, consider a p ∈ P. Let z = (z_1, ..., z_n) be a set of i.i.d. examples drawn according to p. Note that this is equivalent to drawing x = (x_1, ..., x_n) i.i.d. according to p_X and, for each x_i, drawing y_i according to p(y|x_i). Since the conditional p(y|x) is concentrated on one point, the y_i's are deterministically assigned. Accordingly, we can denote this dependence by writing z = z_p(x).

Now consider E_z ||A(z) − m_p||_{L^2(p_X)}. This is equal to
\[ \int_{Z^n} dP(\mathbf z)\, \| A(\mathbf z) - m_p \|_{L^2(p_X)} = \int_{X^n} dp_X^n(\mathbf x)\, \| A(z_p(\mathbf x)) - m_p \|_{L^2(p_X)}. \]
(To clarify notation, we observe that dp_X^n is the singular measure on X^n with support on M^n, which is the natural product measure corresponding to the distribution of n data points x_1, ..., x_n drawn i.i.d. with each x_i distributed according to p_X.) The above in turn is lower bounded by
\[ \ge \sum_{l=0}^{n} \int_{\mathbf x \in S_l} dp_X^n(\mathbf x)\, \| A(z_p(\mathbf x)) - m_p \|_{L^2(p_X)} \]
where S_l = {x ∈ X^n | exactly l segments contain data and links do not}. More formally,
\[ S_l = \{\, \mathbf x \in X^n \mid \mathbf x \cap C_i \ne \emptyset \text{ for exactly } l \text{ segments } C_i \text{ and } \mathbf x \cap B = \emptyset \,\}. \]


Now we concentrate on lower bounding $\int_{\mathbf x \in S_l} dp_X^n(\mathbf x)\, \| A(z_p(\mathbf x)) - m_p \|_{L^2(p_X)}$. Using the fact that p_X is uniform, we have that dp_X^n(x) = c d(x) (where c is a normalizing constant and d(x) is the Lebesgue measure or volume form on the associated product space) and therefore
\[ \int_{\mathbf x \in S_l} dp_X^n(\mathbf x)\, \| A(z_p(\mathbf x)) - m_p \|_{L^2(p_X)} = \int_{\mathbf x \in S_l} c\, d(\mathbf x)\, \| A(z_p(\mathbf x)) - m_p \|_{L^2(p_X)}. \]
Thus, we have
\[ E_{\mathbf z} \| A(\mathbf z) - m_p \|_{L^2(p_X)} \ge \sum_{l=0}^{n} \int_{\mathbf x \in S_l} c\, d(\mathbf x)\, \| A(z_p(\mathbf x)) - m_p \|_{L^2(p_X)}. \tag{3} \]

Now we see that
\[ \frac{1}{2^d} \sum_{p \in \mathcal P_d} E_{\mathbf z} \| A(z_p(\mathbf x)) - m_p \|_{L^2(p_X)} \ \ge\ \frac{1}{2^d} \sum_{p \in \mathcal P_d} \sum_{l=0}^{n} c \int_{\mathbf x \in S_l} d(\mathbf x)\, \| A - m_p \| \ \ge\ \sum_{l=0}^{n} c \int_{\mathbf x \in S_l} \Big( \frac{1}{2^d} \sum_{p} \| A - m_p \| \Big) d(\mathbf x). \]

By Lemma 5, we see that for each x ∈ S_l, we have
\[ \frac{1}{2^d} \sum_{p} \| A - m_p \| \ge (1 - \alpha - \beta)\, \frac{d - n}{4d}, \]
from which we conclude that
\[ \frac{1}{2^d} \sum_{p} E_{\mathbf z} \| A(\mathbf z) - m_p \|_{L^2(p_X)} \ge (1 - \alpha - \beta)\, \frac{d - n}{4d} \sum_{l=0}^{n} \int_{\mathbf x \in S_l} c\, d(\mathbf x). \]
Now we note that
\[ \sum_{l=0}^{n} \int_{\mathbf x \in S_l} c\, d(\mathbf x) = \mathrm{Prob}(\mathbf x \cap B = \emptyset) \ge (1 - \beta)^n. \]
Therefore,
\[ \sup_{p} E_{\mathbf z} \| A(\mathbf z) - m_p \|_{L^2(p_X)} \ge (1 - \alpha - \beta)\, \frac{d - n}{4d}\, (1 - \beta)^n \ge (1 - \alpha - \beta)\, \frac{1}{8}\, (1 - \beta)^n. \tag{4} \]
Since α and β (and for that matter, d) are in our control, we can choose them to make the right-hand side of Inequality 4 greater than some constant. This proves our theorem.

We now construct a family of intersecting manifolds such that, given two points on any manifold in this family, it is difficult to judge (without knowing the manifold) whether these points are near or far in geodesic distance. The class of learning problems consists of probability distributions p such that p_X is supported on some manifold in this class. This construction plays a central role in the proof of the lower bound.

Construction 1. Consider a set of 2^d manifolds where each manifold has the structure shown in Figure 2. Each manifold has three disjoint subsets: A (loops), B (links), and C (chain) such that
\[ \mathcal M = A \cup B \cup C. \]


Figure 2: Figure accompanying Construction 1. (The figure shows the loops A, the links, and the chain segments C_1, ..., C_5.)

The chain C consists of d segments denoted by C_1, C_2, ..., C_d such that C = ∪_i C_i. The links connect the loops to the segments as shown in Figure 2 so that one obtains a closed curve corresponding to an embedding of the circle into R^D. For each choice S ⊂ {1, ..., d} one constructs a manifold (we can denote this by M_S) such that the links connect C_i (for i ∈ S) to the "upper half" of the loop and they connect C_j (for j ∈ {1, ..., d} \ S) to the "bottom half" of the loop as indicated in the figure. Thus there are 2^d manifolds altogether, where each M_S differs from the others in the link structure but the loops and chain are common to all, that is,
\[ A \cup C \subset \cap_{S}\, \mathcal M_S. \]

For manifold M_S, let
\[ \frac{l(A)}{l(\mathcal M_S)} = \int_A p^{(S)}_X(x)\, dx = \alpha_S, \]
where p^{(S)}_X is the probability density function on the manifold M_S. Similarly,
\[ \frac{l(B)}{l(\mathcal M_S)} = \int_B p^{(S)}_X(x)\, dx = \beta_S \]
and
\[ \frac{l(C)}{l(\mathcal M_S)} = \int_C p^{(S)}_X(x)\, dx = \gamma_S. \]
It is easy to check that one can construct these manifolds so that
\[ \beta_S \le \beta; \qquad \gamma_S \ge \gamma. \]

Thus for each manifold M_S, we have the associated class of probability distributions P_{M_S}. These are used in the construction of the lower bound. Now for each such manifold M_S, we pick one probability distribution p^{(S)} ∈ P_{M_S} such that for every k ∈ S, we have
\[ p^{(S)}(y = +1 \mid x) = 1 \quad \text{for all } x \in C_k, \]
and for every k ∈ {1, ..., d} \ S, we have
\[ p^{(S)}(y = -1 \mid x) = 1 \quad \text{for all } x \in C_k. \]

Furthermore, the p^{(S)} are all chosen so that the associated conditionals p^{(S)}(y = +1|x) agree on the loops, that is, for any S, S′ ⊂ {1, ..., d},
\[ p^{(S)}(y = +1 \mid x) = p^{(S')}(y = +1 \mid x) \quad \text{for all } x \in A. \]
This defines 2^d different probability distributions that satisfy, for each p: (i) the support of the marginal p_X includes A ∪ C, (ii) the supports of p_X for different p have different link structures, and (iii) the conditionals p(y|x) disagree on the chain. We now prove a technical lemma establishing an inequality that holds when the data only lives on the segments and not on the links that constitute the embedded circle of Construction 1. This inequality is used in the proof of Theorem 4.

Lemma 5 Let x ∈ S_l be a collection of n points such that no point belongs to the links and exactly l segments contain at least one point. Then
\[ \frac{1}{2^d} \sum_{p \in \mathcal P_d} \| A(z_p(\mathbf x)) - m_p \|_{L^2(p_X)} \ge (1 - \alpha - \beta)\, \frac{d - n}{4d}. \]

Proof Since x ∈ S_l, there are d − l segments of the chain C such that no data is seen from them. We let A(z_p(x)) be the function hypothesized by the learner on receiving the data set z_p(x). We begin by noting that the family P_d may be naturally divided into 2^l subsets in the following way. Following the notation of Construction 1, recall that every element of P_d may be identified with a set S ⊂ {1, ..., d}. We denote this element by p^{(S)}. Now let L denote the set of indices of the segments C_i that contain data, that is,
\[ L = \{\, i \mid C_i \cap \mathbf x \ne \emptyset \,\}. \]
Then for every subset D ⊂ L, we have
\[ \mathcal P_D = \{\, p^{(S)} \in \mathcal P_d \mid S \cap L = D \,\}. \]
Thus all the elements of P_D agree in their labelling of the segments containing data but disagree in their labelling of segments not containing data. Clearly there are 2^l possible choices for D, and each such choice leads to a family containing 2^{d−l} probability distributions. Let us denote these 2^l families by P_1 through P_{2^l}.

Consider P_i. By construction, for all probability distributions p, q ∈ P_i, we have that z_p(x) = z_q(x). Let us denote this by z_i(x), that is, z_i(x) = z_p(x) for all p ∈ P_i. Now f = A(z_i(x)) is the function hypothesized by the learner on receiving the data set z_i(x).

For any p ∈ P and any segment C_k, we say that p "disagrees" with f on C_k if |f(x) − m_p(x)| ≥ 1 on a majority of C_k, that is,
\[ \int_{A} p_X(x) \ge \int_{C_k \setminus A} p_X(x), \]
where A = {x ∈ C_k : |f(x) − m_p(x)| ≥ 1}. Therefore, if f and p disagree on C_k, we have
\[ \int_{C_k} \big( f(x) - m_p(x) \big)^2\, p_X(x) \ge \frac{1}{2} \int_{C_k} p_X(x) \ge \frac{1}{2d}\,(1 - \alpha - \beta). \]


It is easy to check that for every choice of j unseen segments, there exists a p ∈ P_i such that p disagrees with f on each of the chosen segments. Therefore, for such a p, we have
\[ \| A(z_p(\mathbf x)) - m_p \|^2_{L^2(p_X)} \ge \frac{1}{2}\, \frac{j}{d}\,(1 - \alpha - \beta). \]
Counting all the 2^{d−l} elements of P_i based on the combinatorics of unseen segments, we see (using the fact that $\| A(z_p(\mathbf x)) - m_p \| \ge \sqrt{\tfrac{1}{2}\tfrac{j}{d}(1-\alpha-\beta)} \ge \tfrac{1}{2}\tfrac{j}{d}(1-\alpha-\beta)$) that
\[ \sum_{p \in \mathcal P_i} \| A(z_p(\mathbf x)) - m_p \| \ \ge\ \sum_{j=0}^{d-l} \binom{d-l}{j}\, \frac{1}{2}\, \frac{j}{d}\,(1-\alpha-\beta) \ =\ 2^{d-l}\,(1-\alpha-\beta)\, \frac{d-l}{4d}. \]

Therefore, since l ≤ n, we have
\[ \sum_{i=1}^{2^l} \sum_{p \in \mathcal P_i} \| A(z_p(\mathbf x)) - m_p \| \ge 2^d\,(1 - \alpha - \beta)\, \frac{d - n}{4d}. \]
Dividing both sides by 2^d gives the claimed inequality, completing the proof.

2.3 Discussion

Thus we see that knowledge of the manifold can have profound consequences for learning. The proof of the lower bound reflects the intuition that has always been at the root of manifold based methods for semi-supervised learning. Following Figure 2, if one knows the manifold, one sees that C_1 and C_4 are "close" while C_1 and C_3 are "far." But this is only one step of the argument. We must further have the prior knowledge that the target function varies smoothly along the manifold, so that "closeness on the manifold" translates to similarity in function values (or label probabilities). However, this closeness is not obvious from the ambient distances alone. This makes the task of the learner who does not know the manifold difficult: in fact impossible, in the sense described in Theorem 4.

Some further remarks are in order. These provide an idea of the ways in which our main theorems can be extended. Thus we may appreciate the more general circumstances under which we might see a separation between manifold methods and alternative methods.

1. While we provide a detailed construction for the case of different embeddings of the circle into R^N, it is clear that the argument is general and similar constructions can be made for many different classes of k-manifolds. Thus, if M is taken to be a k-dimensional submanifold of R^N, then one could let M be a family of k-dimensional submanifolds of R^N and let P be the naturally associated family of probability distributions that define a collection of learning problems. Our proof of Theorem 4 can be naturally adapted to such a setting.

2. Our example explicitly considers a class H_M that consists of a one-parameter family of functions. It is important to reiterate that many different choices of H_M would provide the same result. For one, thresholding is not necessary, and if the class H_M was simply defined as bandlimited functions, that is, consisting of functions of the form $\sum_{i=1}^{p} \alpha_i \phi_i$ (where the φ_i are the eigenfunctions of the Laplacian of M), the result of Theorem 4 holds as well. Similarly, Sobolev spaces (constructed from functions $f = \sum_i \alpha_i \phi_i$ with $\sum_i \alpha_i^2 \lambda_i^s < \infty$) also work, with and without thresholding.


3. We have considered the simplest case where there is no noise in the Y-direction, that is, the conditional p(y|x) is concentrated at one point m_p(x) for each x. Considering a more general setting with noise does not change the import of our results. The upper bound of Theorem 2 makes use of the fact that m_p belongs to a restricted (uniformly Glivenko-Cantelli) family H_M. With a 0–1 loss function defined as $V(h, z) = 1_{[y \ne h(x)]}$, the rate may be as good as $O^*(1/n)$ in the noise-free case but drops to $O^*(1/\sqrt{n})$ in the noisy case. The lower bound of Theorem 4 for the noiseless case also holds for the noisy case by immediate implication. Both upper and lower bounds are valid also for arbitrary marginal distributions p_X (not just uniform) that have support on some manifold M.

4. Finally, one can consider a variety of loss functions other than the L^2 loss function considered here. The natural 0–1 valued loss function (which for the special case of binary valued functions coincides with the L^2 loss) can be interpreted as the probability of error of the classifier in the classification setting.

3. Manifold Learning and Manifold Regularization

3.1 Knowing the Manifold and Learning It

In the discussion so far, we have implicitly assumed that an oracle can provide perfect information about the manifold in whatever form we choose. We see that access to such an oracle can provide great power in learning from labeled examples for classes of problems that have a suitable structure.

Yet the whole issue of knowing the manifold is considerably more subtle than appears at first blush and in fact has never been carefully considered by the machine learning community. For example, consider the following oracles that all provide knowledge of the manifold but in different forms.

1. One could know M as a set through some kind of set-membership oracle. For example, a membership oracle that makes sense is of the following sort: given a point x and a number r > 0, the oracle tells us whether x is in a tubular neighborhood of radius r around the manifold.

2. One could know a system of coordinate charts on the manifold. For example, maps of the form ψ_i : U_i → R^D where U_i ⊂ R^k is an open set.

3. One could know in some explicit form the harmonic functions on the manifold, the Laplacian Δ_M, and the heat kernel H_t(p, q) on the manifold.

4. One could know the manifold up to some geometric or topological invariants. For example, one might know just the dimension of the manifold. Alternatively, one might know the homology, the homeomorphism or diffeomorphism type, etc. of the manifold.

5. One could have metric information on the manifold. One might know the metric tensor at points on the manifold, one might know the geodesic distances between points on the manifold, or one might know the heat kernel from which various derived distances (such as diffusion distance) are obtained.

Depending upon the kind of oracle access we have, the task of the learner might vary from simple to impossible. For example, in the problem described in Section 2 of this paper, the natural algorithm that realizes the upper bound of Theorem 2 performs empirical risk minimization over the class H_M. To do this it needs, of course, to be able to represent H_M in a computationally efficient manner. In order to do this, it needs to know the eigenfunctions (in the specific example, only the first two, but in general some arbitrary number depending on the choice of H_M) of the Laplacian on M. This is immediately accessible from Oracle 3. It can be computed from Oracles 1, 2, and 5, but this computation is intractable in general. From Oracle 4, it cannot be computed at all.

The next question one needs to address is: in the absence of an oracle, but given random samples of example points on the manifold, can one learn the manifold? In particular, can one learn it in a form that is suitable for further processing? In the context of this paper, the answer is yes.

Let us recall the following fundamental fact from Belkin and Niyogi (2005) that has some significance for the problem in this paper.

Let M be a compact, Riemannian submanifold (without boundary) of R^N and let Δ_M be the Laplace operator (on functions) on this manifold. Let x = {x_1, ..., x_m} be a collection of m points sampled in i.i.d. fashion according to the uniform probability distribution on M. Then one may define the point cloud Laplace operator L^t_m as follows:
\[ L^t_m f(x) = \frac{1}{t}\, \frac{1}{(4\pi t)^{d/2}}\, \frac{1}{m} \sum_{i=1}^{m} \big( f(x) - f(x_i) \big)\, e^{-\frac{\|x - x_i\|^2}{4t}}. \]
The point cloud Laplacian is a random operator that is the natural extension of the graph Laplacian operator to the whole space. For any thrice differentiable function f : M → R, we have

Theorem 6
\[ \lim_{t \to 0,\ m \to \infty} L^t_m f(x) = \Delta_{\mathcal M} f(x). \]

Some remarks are in order:

1. Given the sample x = {x_1, ..., x_m} ⊂ M as above, consider the graph with vertices (let V be the vertex set) identified with the points in x and adjacency matrix $W_{ij} = \frac{1}{mt}\, \frac{1}{(4\pi t)^{d/2}}\, e^{-\frac{\|x_i - x_j\|^2}{4t}}$. Given f : M → R, the restriction f_V : x → R is a function defined on the vertices of this graph. Correspondingly, the graph Laplacian L = (D − W) acts on f_V, and it is easy to check that
\[ (L^t_m f)\big|_{x_i} = (L f_V)\big|_{x_i}. \]
In other words, the point cloud Laplacian and the graph Laplacian agree on the data (a small numerical check of this identity is sketched after these remarks). However, the point cloud Laplacian is defined everywhere, while the graph Laplacian is only defined on the data.

2. The quantity t (similar to a bandwidth) needs to go to zero at a suitable rate ($t m^{d+2} \to \infty$), so there exists a sequence t_m such that the point cloud Laplacian converges to the manifold Laplacian as m → ∞.

3. It is possible to show (see Belkin and Niyogi, 2005; Coifman and Lafon, 2006; Gine and Koltchinskii, 2006; Hein et al., 2005) that this basic convergence is true for arbitrary probability distributions (not just the uniform distribution as stated in the above theorem), in which case the point cloud Laplacian converges to an operator of the Laplace type that may be related to the weighted Laplacian (Grigoryan, 2006).


4. While the above convergence is pointwise, it also holds uniformly over classes of functions with suitable conditions on their derivatives (Belkin and Niyogi, 2008; Gine and Koltchinskii, 2006).

5. Finally, and most crucially (see Belkin and Niyogi, 2006), if $\lambda^{(i)}_m$ and $\phi^{(i)}_m$ are the ith (in increasing order) eigenvalue and corresponding eigenfunction, respectively, of the operator $L^{t_m}_m$, then with probability 1, as m goes to infinity,
\[ \lim_{m \to \infty} \big| \lambda_i - \lambda^{(i)}_m \big| = 0 \]
and
\[ \lim_{m \to \infty} \big\| \phi^{(i)}_m - \phi_i \big\|_{L^2(\mathcal M)} = 0. \]
In other words, the eigenvalues and eigenfunctions of the point cloud Laplacian converge to those of the manifold Laplacian as the number of data points m goes to infinity.
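The identity of Remark 1 is easy to check numerically. The sketch below uses illustrative choices (points sampled from the unit circle in R^2, the test function f(θ) = sin θ, and arbitrary small values of m and t): it builds W and L = D − W exactly as in Remark 1 and verifies that the point cloud Laplacian evaluated at the data points coincides with the graph Laplacian applied to the restricted function.

```python
import numpy as np

# A minimal numerical check of Remark 1 (a sketch; the circle embedding, the test
# function f, and the values of m, t, d below are illustrative, not from the paper).
rng = np.random.default_rng(0)
m, t, d = 500, 0.05, 1
theta = rng.uniform(0.0, 2.0 * np.pi, m)
x = np.stack([np.cos(theta), np.sin(theta)], axis=1)   # m points on S^1 embedded in R^2
f_V = np.sin(theta)                                     # restriction of f to the sample

# Adjacency W_ij = (1/(m t)) (4 pi t)^{-d/2} exp(-||x_i - x_j||^2 / (4 t)), as in Remark 1.
sqdist = ((x[:, None, :] - x[None, :, :]) ** 2).sum(-1)
W = np.exp(-sqdist / (4.0 * t)) / (m * t * (4.0 * np.pi * t) ** (d / 2))
L = np.diag(W.sum(1)) - W                               # graph Laplacian L = D - W

# Point cloud Laplacian L^t_m f evaluated at the data points x_i.
Ltm_f = (np.exp(-sqdist / (4.0 * t)) * (f_V[:, None] - f_V[None, :])).sum(1) \
        / (m * t * (4.0 * np.pi * t) ** (d / 2))

# The two agree on the data, up to floating point error.
assert np.allclose(Ltm_f, L @ f_V)
```

The same computation with x replaced by an off-sample point is what makes $L^t_m$ an everywhere-defined extension of the graph Laplacian.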

These results enable us to present a semi-supervised algorithm that learns the manifold from unlabeled data and uses this knowledge to realize the upper bound of Theorem 2.

3.2 A Manifold Regularization Algorithm For Semi-supervised Learning

Let z = (z_1, ..., z_n) be a set of n i.i.d. labeled examples drawn according to p and let x = (x_1, ..., x_m) be a set of m i.i.d. unlabeled examples drawn according to p_X. Then a semi-supervised learner's estimate may be denoted by A(z, x). Let us consider the following kind of manifold regularization based semi-supervised learner.

1. Construct the point cloud Laplacian operator $L^{t_m}_m$ from the unlabeled data x.

2. Solve for the eigenfunctions of $L^{t_m}_m$ and take the first two (orthogonal to the constant function). Let these be φ_m and ψ_m respectively.

3. Perform empirical risk minimization with the empirical eigenfunctions by minimizing
\[ \hat f_m = \arg\min_{f = \alpha \phi_m + \beta \psi_m} \frac{1}{n} \sum_{i=1}^{n} V(f(x_i), y_i) \]
subject to $\alpha^2 + \beta^2 = 1$. Here $V(f(x), y) = \frac{1}{4}\, |y - \mathrm{sign}(f(x))|^2$ is the 0–1 loss. This is equivalent to Ivanov regularization with an intrinsic norm that forces candidate hypothesis functions to be bandlimited. (A concrete sketch of steps 1–3 is given below.)
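The sketch below implements this learner with the graph Laplacian built on the unlabeled points standing in for $L^{t_m}_m$. The heat-kernel width t, the dense graph construction, the Nystrom-style extension used to evaluate the empirical eigenvectors at new points, and the grid over the angle parametrizing α^2 + β^2 = 1 are all illustrative assumptions rather than choices prescribed by the paper.

```python
import numpy as np

def heat_weights(X, t):
    # Dense heat-kernel adjacency on the unlabeled points (proxy for the point cloud Laplacian).
    sqdist = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    W = np.exp(-sqdist / (4.0 * t))
    np.fill_diagonal(W, 0.0)
    return W

def manifold_ssl_fit(X_lab, y_lab, X_unlab, t=0.05, n_angles=360):
    # Steps 1-2: graph Laplacian of the unlabeled data; the first two eigenvectors
    # orthogonal to the (approximately constant) bottom eigenvector play the role
    # of the empirical eigenfunctions phi_m and psi_m.
    W = heat_weights(X_unlab, t)
    L = np.diag(W.sum(1)) - W
    _, vecs = np.linalg.eigh(L)
    phi_m, psi_m = vecs[:, 1], vecs[:, 2]

    def extend(v, X_new):
        # Kernel-weighted averaging to evaluate an empirical eigenvector off the
        # unlabeled sample (a Nystrom-style extension; an assumption, not from the paper).
        sqd = ((X_new[:, None, :] - X_unlab[None, :, :]) ** 2).sum(-1)
        K = np.exp(-sqd / (4.0 * t))
        return (K @ v) / K.sum(1)

    phi_lab, psi_lab = extend(phi_m, X_lab), extend(psi_m, X_lab)

    # Step 3: ERM over f = alpha*phi_m + beta*psi_m with alpha^2 + beta^2 = 1,
    # i.e., a one-dimensional search over an angle.
    best = None
    for ang in np.linspace(0.0, 2.0 * np.pi, n_angles, endpoint=False):
        a, b = np.cos(ang), np.sin(ang)
        err = np.mean(np.sign(a * phi_lab + b * psi_lab) != y_lab)
        if best is None or err < best[0]:
            best = (err, a, b)
    _, a, b = best
    return lambda X_new: np.sign(a * extend(phi_m, X_new) + b * extend(psi_m, X_new))
```

For the circle problems of Section 2, the two empirical eigenvectors play the role of sin and cos in the intrinsic coordinate, so the one-dimensional angle search mirrors the ERM over H_M described in step 3.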

Note that if the empirical risk minimization was performed with the true eigenfunctions (φ and ψ respectively), then the resulting algorithm achieves the rate of Theorem 2. Since for large m the empirical eigenfunctions are close to the true ones by the result in Belkin and Niyogi (2006), we may expect the above algorithm to perform well. Thus we may compare the two manifold regularization algorithms (an empirical one with unlabeled data and an oracle one that knows the manifold):
\[ A(\mathbf z, \mathbf x) = \mathrm{sign}(\hat f_m) = \mathrm{sign}(\hat\alpha_m \phi_m + \hat\beta_m \psi_m) \]
and
\[ A_{\mathrm{oracle}}(\mathbf z, \mathcal M) = \mathrm{sign}(\hat f) = \mathrm{sign}(\hat\alpha \phi + \hat\beta \psi). \]

We can now state the following:


Theorem 7 For any ε > 0, we have
\[ \sup_{p \in \mathcal P_{\mathcal M}} E_{\mathbf z} \| m_p - A \|^2_{L^2(p_X)} \le 4 \cdot \frac{2}{\pi} \arcsin(\varepsilon) + \frac{1}{\varepsilon^2} \big( \|\phi - \phi_m\|^2 + \|\psi - \psi_m\|^2 \big) + 3 \sqrt{\frac{3 \log(n)}{n}}. \]

Proof Consider p ∈ P and let m_p = sign(α_p φ + β_p ψ). Let g_m = α_p φ_m + β_p ψ_m. Now, first note that by the fact of empirical risk minimization, we have
\[ \frac{1}{n} \sum_{z \in \mathbf z} V(A(x), y) \le \frac{1}{n} \sum_{z \in \mathbf z} V\big(\mathrm{sign}(g_m(x)), y\big). \]
Second, note that the set of functions F = {sign(f) | f = αφ_m + βψ_m} has VC dimension equal to 2. Therefore the empirical risk converges to the true risk uniformly over this class, so that with probability > 1 − δ, we have
\[ E_z[V(A(x), y)] - \sqrt{\frac{2\log(n) + \log(1/\delta)}{n}} \ \le\ \frac{1}{n} \sum_{z \in \mathbf z} V(A(x), y) \ \le\ \frac{1}{n} \sum_{z \in \mathbf z} V\big(\mathrm{sign}(g_m(x)), y\big) \ \le\ E_z\big[ V\big(\mathrm{sign}(g_m(x)), y\big) \big] + \sqrt{\frac{2\log(n) + \log(1/\delta)}{n}}. \]
Using the fact that $V(h(x), y) = \frac{1}{4}(y - h)^2$, we have in general, for any h,
\[ E_z[V(h(x), y)] = \frac{1}{4} E_z (y - m_p)^2 + \frac{1}{4} \| m_p - h \|^2_{L^2(p_X)}, \]
from which we obtain, with probability > 1 − δ over choices of labeled training sets z,
\[ \| m_p - A \|^2 \le \| m_p - \mathrm{sign}(g_m) \|^2 + 2 \sqrt{\frac{2\log(n) + \log(1/\delta)}{n}}. \]
Setting δ = 1/n and noting that ||m_p − A||^2 ≤ 1, we have, after some straightforward manipulations,
\[ E_{\mathbf z} \| m_p - A \|^2 \le \| m_p - \mathrm{sign}(g_m) \|^2 + 3 \sqrt{\frac{3 \log(n)}{n}}. \]
Using Lemma 8, we get for any ε > 0,
\[ \sup_{p \in \mathcal P_{\mathcal M}} E_{\mathbf z} \| m_p - A \|^2 \le 4 \cdot \frac{2}{\pi} \arcsin(\varepsilon) + \frac{1}{\varepsilon^2} \big( \|\phi - \phi_m\|^2 + \|\psi - \psi_m\|^2 \big) + 3 \sqrt{\frac{3 \log(n)}{n}}. \]

Lemma 8 Let f, g be any two functions. Then for any ε > 0,
\[ \| \mathrm{sign}(f) - \mathrm{sign}(g) \|^2_{L^2(p_X)} \le \mu(X_{\varepsilon, f}) + \frac{1}{\varepsilon^2} \| f - g \|^2_{L^2(p_X)}, \]


where X_{ε,f} = {x : |f(x)| ≤ ε} and μ is the measure corresponding to the marginal distribution p_X.

Further, if f = αφ + βψ (with α^2 + β^2 = 1) and g = αφ_m + βψ_m, where φ, ψ are eigenfunctions of the Laplacian on M while φ_m, ψ_m are eigenfunctions of the point cloud Laplacian as defined in the previous developments, then for any ε > 0,
\[ \| \mathrm{sign}(f) - \mathrm{sign}(g) \|^2_{L^2(p_X)} \le 4 \cdot \frac{2}{\pi} \arcsin(\varepsilon) + \frac{1}{\varepsilon^2} \big( \|\phi - \phi_m\|^2 + \|\psi - \psi_m\|^2 \big). \]

Proof We see that
\[ \| \mathrm{sign}(f) - \mathrm{sign}(g) \|^2_{L^2(p_X)} = \int_{X_{\varepsilon,f}} |\mathrm{sign}(f(x)) - \mathrm{sign}(g(x))|^2 + \int_{\mathcal M \setminus X_{\varepsilon,f}} |\mathrm{sign}(f(x)) - \mathrm{sign}(g(x))|^2 \le 4 \mu(X_{\varepsilon,f}) + \int_{\mathcal M \setminus X_{\varepsilon,f}} |\mathrm{sign}(f(x)) - \mathrm{sign}(g(x))|^2. \]
Note that if x ∈ M \ X_{ε,f}, we have that
\[ |\mathrm{sign}(f(x)) - \mathrm{sign}(g(x))| \le \frac{2}{\varepsilon}\, |f(x) - g(x)|. \]
Therefore,
\[ \int_{\mathcal M \setminus X_{\varepsilon,f}} |\mathrm{sign}(f(x)) - \mathrm{sign}(g(x))|^2 \le \frac{4}{\varepsilon^2} \int_{\mathcal M \setminus X_{\varepsilon,f}} |f(x) - g(x)|^2 \le \frac{4}{\varepsilon^2} \| f - g \|^2_{L^2(p_X)}. \]
This proves the first part. The second part follows by a straightforward calculation on the circle.

Some remarks:

1. While we have stated the above theorem for our running example of embeddings of the circle into R^D, it is clear that the results can be generalized to cover arbitrary k-manifolds, more general classes of functions H_M, noise, and loss functions V. Many of these extensions are already implicit in the proof and associated technical discussions.

2. A corollary of the above theorem is relevant for the m = ∞ case that has been covered in Castelli and Cover (1996). We will discuss this in the next section. The corollary is

Corollary 9 Let P = ∪_M P_M be a collection of learning problems with the structure described in Section 2, that is, each p ∈ P is such that the marginal p_X has support on a submanifold M of R^D which corresponds to a particular isometric embedding of the circle into Euclidean space. For each such p, the regression function m_p = E[y|x] belongs to a class of functions H_M which consists of thresholded bandlimited functions on M. Then no supervised learning algorithm exists that is guaranteed to converge for every problem in P (Theorem 4). Yet the semi-supervised manifold regularization algorithm described above (with an infinite amount of unlabeled data) converges at a fast rate as a function of labeled data. In other words,
\[ \sup_{\mathcal M} \lim_{m \to \infty} \sup_{\mathcal P_{\mathcal M}} \| m_p - A(\mathbf z, \mathbf x) \|^2_{L^2(p_X)} \le 3 \sqrt{\frac{3 \log(n)}{n}}. \]


3. It is natural to ask what it would take to move the limit m → ∞ outside. In order to do this, one will need to put additional constraints on the class of possible manifolds M that we are allowed to consider. By putting such constraints, we can construct classes of learning problems where, for any realistic number of labeled examples n, there is a gap between the performance of a supervised learner and the manifold based semi-supervised learner. An example of such a theorem is:

Theorem 10 Fix any number N. Then there exists a class of learning problems P_N such that for all n < N
\[ R(n, \mathcal P_N) = \inf_A \sup_{p \in \mathcal P_N} E_{\mathbf z} \| m_p - A(\mathbf z) \| \ge 1/100 \]
while
\[ Q(n, \mathcal P_N) = \lim_{m \to \infty} \sup_{p \in \mathcal P_N} \| m_p - A_{\mathrm{manreg}}(\mathbf z, \mathbf x) \|^2 \le \sqrt{\frac{3 \log(n)}{n}}. \]

Proof We provide only a sketch of the argument and avoid technical details. We begin by choosing a family of submanifolds of [−M, M]^D with a uniform bound on their curvature. One form such a bound can take is the following: let τ be the largest number such that the open normal bundle of radius r about M is an imbedding for any r < τ. This provides a bound on the norm of the second fundamental form (curvature) and nearness to self-intersection of the submanifold. Now P_N will contain probability distributions p such that p_X is supported on some M with a τ curvature bound and p(y|x) is 0 or 1 for every x, y, such that the regression function m_p = E[y|x] belongs to H_M. As before, we choose H_M to be the span of the first K eigenfunctions of the Laplacian Δ on M. For the lower bound on R(n), we follow Construction 1 and choose d = 2N (from the proof of the lower bound of Theorem 4). Following Construction 1, the circle can be embedded in [−M, M]^D by twisting in all directions. Let l be the length of a single segment of the chain. Since the τ condition needs to be respected for every embedding, the circle cannot twist too much and come too close to self-intersection. In particular, this will imply that 2^N l V_τ < M^D, where V_τ is the volume of the (D−1)-dimensional ball of radius τ. For the upper bound on Q(n), we follow the manifold regularization algorithm of the previous section and note that eigenfunctions of the Laplacian can be estimated for compact manifolds with a curvature bound.

However, asymptotically, R(n) and Q(n) have the same rate for n ≫ N. Since N can be chosen to be astronomically large, this asymptotic rate is of little consequence in practical learning situations. This suggests the limitations of asymptotic analysis without a careful consideration of the finite sample situation.

4. The Structure of Semi-supervised Learning

It is worthwhile to reflect on why the manifold regularization algorithm is able to display improved performance in semi-supervised learning. The manifold assumption is a device that allows us to link the marginal p_X with the conditional p(y|x). Through unlabeled data x, we can learn the manifold M, thereby greatly reducing the class of possible conditionals p(y|x) that we need to consider. More generally, semi-supervised learning will be feasible only if such a link is made. To clarify the structure of problems on which semi-supervised learning is likely to be meaningful, let us define a map π : p → p_X that takes any probability distribution p on X × Y and maps it to the marginal p_X.

Given any collection of learning problems P, we have
\[ \pi : \mathcal P \to \mathcal P_X \]
where P_X = {p_X | p ∈ P}. Consider the case in which the structure of P is such that for any q ∈ P_X, the family of conditionals π^{-1}(q) = {p ∈ P | p_X = q} is "small." For a situation like this, knowing the marginal tells us a lot about the conditional, and therefore unlabeled data can be useful.

4.1 Castelli and Cover Interpreted

Let us consider the structure of the class of learning problems considered by Castelli and Cover (1996). They consider a two-class problem with the following structure. The class of learning problems P is such that for each p ∈ P, the marginal q = p_X can be uniquely expressed as
\[ q = \mu f + (1 - \mu) g \]
where 0 ≤ μ ≤ 1 and f, g belong to some class G of possible probability distributions. In other words, the marginal is always an (identifiable) mixture of two distributions. Furthermore, the class P of possible probability distributions is such that there are precisely two probability distributions p_1, p_2 ∈ P such that their marginals are equal to q. In other words,
\[ \pi^{-1}(q) = \{ p_1, p_2 \} \]
where $p_1(y = 1 \mid x) = \frac{\mu f(x)}{q(x)}$ and $p_2(y = 1 \mid x) = \frac{(1-\mu) g(x)}{q(x)}$.

In this setting, unlabeled data allows the learner to estimate the marginal q. Once the marginal is obtained, the class of possible conditionals is reduced to exactly two functions. Castelli and Cover (1996) show that the risk now converges to the Bayes' risk exponentially as a function of labeled data (i.e., the analog of an upper bound on Q(n, P) is approximately e^{−n}). The reason semi-supervised learning is successful in this setting is that the marginal q tells us a great deal about the class of possible conditionals. It seems that a precise lower bound on purely supervised learning (the analog of R(n, P)) has never been clearly stated in that setting.
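The two-candidate structure is easy to see in a toy computation. In the sketch below, the Gaussian mixture components, the mixing weight, and the number of labeled examples are illustrative assumptions, not taken from Castelli and Cover; the point is only that once q = μf + (1 − μ)g is identified (here it is simply handed to us, playing the role of infinite unlabeled data), only two conditionals remain and a handful of labeled examples picks between them.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
mu, f, g = 0.3, norm(-2.0, 1.0), norm(2.0, 1.0)   # an identifiable two-component mixture

def q(x):
    # Marginal density, assumed already recovered from (infinite) unlabeled data.
    return mu * f.pdf(x) + (1 - mu) * g.pdf(x)

# The only two conditionals compatible with this marginal.
p1 = lambda x: mu * f.pdf(x) / q(x)               # P(y=1 | x) if class 1 corresponds to f
p2 = lambda x: (1 - mu) * g.pdf(x) / q(x)         # P(y=1 | x) if class 1 corresponds to g

# Draw a few labeled examples from the true model (here p1 is the truth).
n = 5
y = (rng.random(n) < mu).astype(int)
x = np.where(y == 1, f.rvs(size=n, random_state=rng), g.rvs(size=n, random_state=rng))

# Pick the candidate conditional with the higher likelihood on the labeled data.
def loglik(p):
    probs = np.where(y == 1, p(x), 1.0 - p(x))
    return np.sum(np.log(np.clip(probs, 1e-12, None)))

print("p1" if loglik(p1) > loglik(p2) else "p2")  # "p1" with probability -> 1, exponentially in n
```

This mirrors the role of the manifold in the rest of the paper: in both cases the marginal, learned from unlabeled data, collapses π^{-1}(q) to a small set over which very little labeled data is needed.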

4.2 Manifold Regularization Interpreted

In its most general form, manifold regularization encompasses a class of geometrically motivated approaches to learning. Spectral geometry provides the unifying point of view, and the spectral analysis of a suitable geometrically motivated operator yields a "distinguished basis." Since (i) only unlabeled examples are needed for the spectral analysis and the learning of this basis, and (ii) the target function is assumed to be compactly representable in this basis, the idea has the possibility to succeed in semi-supervised learning. Indeed, the previous theorems clarify the theoretical basis of this approach. This, together with the empirical success of algorithms based on these intuitions, suggests there is some merit in this point of view.

In general, let q be a probability density function on X = R^D. The support of q may be a submanifold of X (with possibly many connected components). Alternatively, it may lie close to a submanifold, it may be all of X, or it may be a subset of X. As long as q is far from uniform, that is, it has a "shape," one may consider the following "weighted Laplacian" (see Grigoryan, 2006) defined as
\[ \Delta_q f(x) = \frac{1}{q(x)}\, \mathrm{div}\big( q\, \mathrm{grad}\, f \big), \]
where the gradient (grad) and divergence (div) are with respect to the support of q (which may simply be all of X).

The heat kernel associated with this weighted Laplacian (essentially the Fokker-Planck operator) is given by $e^{-t\Delta_q}$. Laplacian eigenmaps and diffusion maps are thus defined in this more general setting.

If φ1,φ2, . . . represent an eigenbasis for this operator, then, one may consider the regressionfunctionmq to belong to the family (parameterized bys= (s1,s2, . . .)) where eachsi ∈ R∪∞.

HSq = h : X → R such thath= ∑

i

αiφi and∑i

α2i si < ∞.

Some natural choices ofs are (i)∀i > p,si = ∞ : this gives us bandlimited functions (ii)si = λti

is theith eigenvalue of∆q : this gives us spaces of Sobolev type (iii)∀i ∈A,si = 1, elsesi = ∞ whereA is a finite set: this gives us functions that are sparse in that basis.

The class of learning problems P^{(s)} may then be factored as
\[ \mathcal P^{(s)} = \cup_q\, \mathcal P^{(s)}_q \]
where
\[ \mathcal P^{(s)}_q = \{\, p \mid p_X = q \ \text{and}\ m_p \in H^s_q \,\}. \]
The logic of the geometric approach to semi-supervised learning is as follows:

1. Unlabeled data allow us to approximate q, the eigenvalues and eigenfunctions of Δ_q, and therefore the space H^s_q.

2. If s is such that π^{-1}(q) is "small" for every q, then a small number of labeled examples suffices to learn the regression function m_q.

In problems that have this general structure, we expect manifold regularization and related algorithms (that use the graph Laplacian or a suitable spectral approximation) to work well. Precise theorems showing the correctness of these algorithms for a variety of choices of s remain part of future work. The theorems in this paper establish results for some choices of s and are a step in a broader understanding of this question.

5. Conclusions

We have considered a minimax style framework within which we have investigated the potential role of manifold learning in learning from labeled and unlabeled examples. We demonstrated the natural structure of a class of problems on which knowing the manifold makes a big difference. On such problems, we see that manifold regularization is provably better than any supervised learning algorithm.

Our proof clarifies a potential source of confusion in the literature on manifold learning. We see that if data lives on an underlying manifold but this manifold is unknown and belongs to a class of possible smooth manifolds, it is possible that supervised learning (classification and regression problems) may be ineffective, even impossible. In contrast, if the manifold is fixed though unknown, it may be possible (e.g., Bickel and Li, 2007) to learn effectively by a classical method suitably modified. In between these two cases lie various situations that need to be properly explored for a greater understanding of the potential benefits and limitations of manifold methods and the need for manifold learning.

Our analysis allows us to see the role of manifold regularization in semi-supervised learning in a clear way. Several algorithms using manifold and associated graph-based methods have seen some empirical success recently. Our paper provides a framework within which we may be able to analyze and possibly motivate or justify such algorithms.

Acknowledgments

I would like to thank Misha Belkin for wide ranging discussions on the themes of this paper and Andrea Caponnetto for discussions leading to the proof of Theorem 4.

References

Robert A. Adams and John J. F. Fournier. Sobolev Spaces, volume 140. Academic Press, 2003.

Mikhail Belkin and Partha Niyogi. Laplacian eigenmaps for dimensionality reduction and data representation. Neural Computation, 15(6):1373–1396, 2003.

Mikhail Belkin and Partha Niyogi. Towards a theoretical foundation for Laplacian-based manifold methods. In Eighteenth Annual Conference on Learning Theory, pages 486–500. Springer, Bertinoro, Italy, 2005.

Mikhail Belkin and Partha Niyogi. Convergence of Laplacian eigenmaps. In NIPS, pages 129–136, 2006.

Mikhail Belkin and Partha Niyogi. Towards a theoretical foundation for Laplacian-based manifold methods. Journal of Computer and System Sciences, 74(8):1289–1308, 2008.

Mikhail Belkin, Partha Niyogi, and Vikas Sindhwani. Manifold regularization: A geometric framework for learning from labeled and unlabeled examples. Journal of Machine Learning Research, 7:2399–2434, 2006.

Peter J. Bickel and Bo Li. Local polynomial regression on unknown manifolds. IMS Lecture Notes-Monograph Series, pages 177–186, 2007.

Vittorio Castelli and Thomas M. Cover. The relative value of labeled and unlabeled samples in pattern recognition with an unknown mixing parameter. IEEE Transactions on Information Theory, 42(6):2102–2117, 1996.

Ronald R. Coifman and Stephane Lafon. Diffusion maps. Applied and Computational Harmonic Analysis, 21(1):5–30, 2006.


Theodoros Evgeniou, Massimiliano Pontil, and Tomaso Poggio. Regularization networks and support vector machines. Advances in Computational Mathematics, 13(1):1–50, 2000.

Evarist Gine and Vladimir Koltchinskii. Empirical graph Laplacian approximation of Laplace-Beltrami operators: Large sample results. Lecture Notes-Monograph Series, pages 238–259, 2006.

Alexander Grigoryan. Heat kernels on weighted manifolds and applications. Contemporary Mathematics, 398:93–191, 2006.

Matthias Hein, Jean-Yves Audibert, and Ulrike von Luxburg. From graphs to manifolds - weak and strong pointwise consistency of graph Laplacians. In COLT, pages 470–485, 2005.

John Lafferty and Larry Wasserman. Statistical analysis of semi-supervised regression. In Advances in Neural Information Processing Systems. NIPS Foundation, 2007.

Sam T. Roweis and Lawrence K. Saul. Nonlinear dimensionality reduction by locally linear embedding. Science, 290(5500):2323–2326, 2000.

Bernhard Scholkopf and Alexander J. Smola. Learning with Kernels: Support Vector Machines, Regularization, Optimization and Beyond. The MIT Press, 2002.

Grace Wahba. Spline Models for Observational Data, volume 59. Society for Industrial and Applied Mathematics, 1990.

Xiaojin Zhu. Semi-supervised learning literature survey. Technical Report TR 1530, University of Wisconsin–Madison, Computer Sciences Department, 2008.
