TRANSCRIPT
Adapting to a Non-Ideal World
Mehryar Mohri, Courant Institute and Google Research
Includes joint work with Corinna Cortes, Yishay Mansour, Andres Muñoz, and Afshin Rostamizadeh.
Ideal World
Standard learning assumptions:
• same distribution for training and test.
• distributions fixed over time.
• IID sampling.
Ideal vs Real World
[Figure: table contrasting the ideal world and the real world along three dimensions: sampling, domain, and time.]
This Talk
Domain adaptation
• Discrepancy distance
• Theoretical guarantees
• Algorithm
• Experiments
Drifting scenario
Domain Adaptation
Sentiment analysis: labeled appraisal data for some domains, e.g., movies, books, music, restaurants, but no labels for travel.
Language modeling, part-of-speech tagging.
Statistical parsing.
Speech recognition.
Computer vision.
Solution critical for applications.
Domain Adaptation Problem
Domains: source $(Q, f_Q)$, target $(P, f_P)$.
Input:
• labeled sample $S$ drawn from the source.
• unlabeled sample $T$ drawn from the target.
Problem: find a hypothesis $h$ in $H$ with small expected loss with respect to the target domain, that is,
$$L_P(h, f_P) = \mathbb E_{x\sim P}\big[L\big(h(x), f_P(x)\big)\big].$$
Related Work
Single-source adaptation:
• language modeling, probabilistic parsers, maxent models: source domain used to define a prior.
• relation between adaptation and the $d_A$ distance [Ben-David et al. (NIPS 2006); Blitzer et al. (NIPS 2007)].
• a few negative examples of adaptation [Ben-David et al. (AISTATS 2010)].
• analysis and learning guarantees for importance weighting [Cortes, Mansour, and MM (NIPS 2010)].
Related Work
Multiple-source:
• related problem: same input distribution, but different labels modulo disparity constraints [Crammer, Kearns, and Wortman (NIPS 2005, NIPS 2006)].
• theoretical analysis and method for multiple-source adaptation using the notion of Rényi divergence [Mansour, MM, and Rostamizadeh (NIPS 2008, UAI 2009)].
Distribution Mismatch
[Figure: the two distributions $P$ and $Q$.]
Which distance should we use to compare these distributions?
Simple Analysis
Proposition: assume that the loss $L$ is bounded by $M$. Then,
$$|L_Q(h, f) - L_P(h, f)| \le M\, L_1(Q, P).$$
Proof:
$$\begin{aligned}
|L_P(h, f) - L_Q(h, f)| &= \Big|\,\mathbb E_{x\sim P}\big[L(h(x), f(x))\big] - \mathbb E_{x\sim Q}\big[L(h(x), f(x))\big]\Big|\\
&= \Big|\sum_x \big(P(x) - Q(x)\big)\, L(h(x), f(x))\Big|\\
&\le M \sum_x \big|P(x) - Q(x)\big|.
\end{aligned}$$
But, is this bound informative?
Example - 0/1 Loss
[Figure: target function $f$ and a hypothesis $h_a$ that differ only on a region $a$.]
$$|L_Q(h, f) - L_P(h, f)| = |Q(a) - P(a)|.$$
Discrepancy
(Mansour, MM, Rostamizadeh, 2009)
Definition:
$$\operatorname{disc}(P, Q) = \max_{h, h'\in H}\ \big| L_P(h', h) - L_Q(h', h) \big|.$$
• symmetric, satisfies the triangle inequality, but in general not a distance.
• helps compare distributions for arbitrary losses, e.g., the hinge loss or the $L_p$ loss.
• generalization of the $d_A$ distance (Devroye et al., 1996; Kifer et al., 2004; Ben-David et al., 2007).
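As a quick illustration (my own sketch, not part of the talk), the empirical discrepancy can be estimated by brute force over a small finite pool of hypotheses; the hypothesis pool and data below are hypothetical.

```python
# Sketch (assumptions: squared loss, a small finite hypothesis pool standing in for H).
# Brute-force evaluation of disc(P_hat, Q_hat) = max_{h, h'} |L_Phat(h', h) - L_Qhat(h', h)|.
import numpy as np

def empirical_discrepancy(X_src, X_tgt, hypotheses):
    best = 0.0
    for h in hypotheses:
        for h2 in hypotheses:
            loss_src = np.mean((h(X_src) - h2(X_src)) ** 2)  # L_Qhat(h', h)
            loss_tgt = np.mean((h(X_tgt) - h2(X_tgt)) ** 2)  # L_Phat(h', h)
            best = max(best, abs(loss_tgt - loss_src))
    return best

# Hypothetical usage with linear hypotheses h_w(x) = w . x:
rng = np.random.default_rng(0)
X_src = rng.normal(0.0, 1.0, (50, 3))   # sample from the source Q
X_tgt = rng.normal(0.5, 1.0, (40, 3))   # sample from the target P
pool = [lambda X, w=w: X @ w for w in rng.normal(0.0, 1.0, (20, 3))]
print(empirical_discrepancy(X_src, X_tgt, pool))
```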
Discrepancy - Properties
Theorem: for the $L_q$ loss bounded by $M$, for any $\delta > 0$, with probability at least $1 - \delta$,
$$\operatorname{disc}(P, Q) \le \operatorname{disc}(\widehat P, \widehat Q) + 4q\big(\widehat{\mathfrak R}_S(H) + \widehat{\mathfrak R}_T(H)\big) + 3M\left(\sqrt{\frac{\log\frac{4}{\delta}}{2m}} + \sqrt{\frac{\log\frac{4}{\delta}}{2n}}\right).$$
This Talk
Domain adaptation
• Discrepancy distance
• Theoretical guarantees
• Algorithm
• Experiments
Drifting scenario
Theoretical Guarantees
Two types of questions:
• difference between the average loss of a hypothesis $h$ on $Q$ versus on $P$?
• difference of loss (measured on $P$) between the hypothesis $h$ obtained when training on $(\widehat Q, f_Q)$ versus the hypothesis $h'$ obtained when training on $(\widehat P, f_P)$?
Generalization Bound
[Mansour, MM, Rostamizadeh (COLT 2009)]
Notation:
• $L_Q(h^*_Q, f) = \min_{h\in H} L_Q(h, f)$
• $L_P(h^*_P, f) = \min_{h\in H} L_P(h, f)$
Theorem: assume that $L$ obeys the triangle inequality. Then the following holds:
$$L_P(h, f_P) \le L_Q(h, h^*_Q) + L_P(h^*_P, f_P) + \operatorname{disc}(P, Q) + L_Q(h^*_Q, h^*_P).$$
Some Natural Cases
When $h^* = h^*_Q = h^*_P$:
$$L_P(h, f_P) \le L_Q(h, h^*) + L_P(h^*, f_P) + \operatorname{disc}(P, Q).$$
When $f_P \in H$ (consistent case):
$$|L_P(h, f_P) - L_Q(h, f_P)| \le \operatorname{disc}(Q, P).$$
Bound of (Ben-David et al., NIPS 2006) & (Blitzer et al., NIPS 2007): always worse in these cases.
Kernel-Based Reg. (KBR) Algorithms
Objective function:
$$F_{\widehat Q}(h) = \lambda \|h\|_K^2 + \widehat R_{\widehat Q}(h),$$
where $K$ is a PDS kernel, $\lambda > 0$ is a trade-off parameter, and $\widehat R_{\widehat Q}(h)$ is the empirical error of $h$.
• broad family of algorithms including SVM, SVR, kernel ridge regression, etc.
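For concreteness, here is a minimal sketch (not code from the talk) of one member of this family, kernel ridge regression with a Gaussian kernel; all names and parameter values are illustrative.

```python
# Sketch: kernel ridge regression as an instance of F_Qhat(h) = lam*||h||_K^2 + R_Qhat(h),
# with R_Qhat the averaged empirical squared error. The minimizer has dual coefficients
# alpha = (K + lam*m*I)^{-1} y.
import numpy as np

def gaussian_kernel(X1, X2, gamma=1.0):
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def krr_fit(X, y, lam=0.1, gamma=1.0):
    K = gaussian_kernel(X, X, gamma)
    return np.linalg.solve(K + lam * len(y) * np.eye(len(y)), y)

def krr_predict(X_train, alpha, X_test, gamma=1.0):
    return gaussian_kernel(X_test, X_train, gamma) @ alpha
```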
Guarantees for KBR Algorithms
Theorem: let $K$ be a PDS kernel with $K(x, x) \le R^2$ and $L$ a loss function such that $L(\cdot, y)$ is $\mu$-Lipschitz. Assume that $f_P \in H$. Then, for all $(x, y) \in X \times Y$,
$$\big|L(h'(x), y) - L(h(x), y)\big| \le \mu R\, \sqrt{\frac{\operatorname{disc}(\widehat P, \widehat Q) + \mu\eta}{\lambda}},$$
where $\eta = \max\{L(f_Q(x), f_P(x)) : x \in \operatorname{supp}(\widehat Q)\}$.
[Cortes & MM (ALT 2011)]
Guarantees for KBR Algorithms
Theorem: let $K$ be a PDS kernel with $K(x, x) \le R^2$ and let the $L_2$ loss be bounded by $M$. Then, for all $(x, y) \in X \times Y$,
$$\big|L(h'(x), y) - L(h(x), y)\big| \le \frac{R\sqrt{M}}{\lambda}\Big(\delta + \sqrt{\delta^2 + 4\lambda\operatorname{disc}(\widehat P, \widehat Q)}\Big),$$
where
$$\delta = \min_{h\in H}\Big\| \mathbb E_{x\sim\widehat Q}\big[\big(h(x) - f_Q(x)\big)\Phi_K(x)\big] - \mathbb E_{x\sim\widehat P}\big[\big(h(x) - f_P(x)\big)\Phi_K(x)\big] \Big\|_K.$$
• for Gaussian kernels and $f_P = f_Q$ continuous: $\delta = 0$.
[Cortes & MM (2012)]
Empirical Discrepancy
The discrepancy distance $\operatorname{disc}(\widehat P, \widehat Q)$ is the critical term in the bounds.
A smaller empirical discrepancy guarantees closeness of the pointwise losses of $h'$ and $h$.
But, can we further reduce the discrepancy?
This Talk
Domain adaptation
• Discrepancy distance
• Theoretical guarantees
• Algorithm
• Experiments
Drifting scenario
Algorithm - Idea
Search for a new empirical distribution $q^*$ with the same support:
$$q^* = \operatorname*{argmin}_{\operatorname{supp}(q)\subseteq \operatorname{supp}(\widehat Q)} \operatorname{disc}(\widehat P, q).$$
Solve the modified KBR problem:
$$\min_h F_{q^*}(h) = \frac{1}{m}\sum_{i=1}^{m} q^*(x_i)\, L(h(x_i), y_i) + \lambda\|h\|_K^2.$$
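A minimal sketch (my own, assuming the squared loss) of the modified KBR step once a reweighting q* has been computed; only the data-fit term changes, so the solution is still a linear system.

```python
# Sketch: weighted kernel ridge regression. Minimizing
#   (1/m) * sum_i q_i * ((K a)_i - y_i)^2 + lam * a' K a
# and setting the gradient to zero gives (W K + lam*m*I) a = W y, with W = diag(q).
import numpy as np

def weighted_krr(K, y, q, lam):
    """K: kernel matrix on the source points; q: weights q*(x_i); returns dual coefficients."""
    m = len(y)
    W = np.diag(q)
    return np.linalg.solve(W @ K + lam * m * np.eye(m), W @ y)
```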
Case of Halfspaces
[Figure-only slide.]
Regression - L2 Loss
Problem:
$$\min_{\operatorname{supp}(q)\subseteq \operatorname{supp}(\widehat Q)}\ \max_{h, h'\in H}\ \Big| \mathbb E_{x\sim \widehat P}\big[(h'(x) - h(x))^2\big] - \mathbb E_{x\sim q}\big[(h'(x) - h(x))^2\big] \Big|.$$
Discrepancy Min. - Input space
For the $L_2$ loss and $H = \{x \mapsto w^\top x : \|w\| \le \Lambda\}$, the problem can be cast as an SDP:
$$\begin{aligned}
\text{minimize}\quad & \|M(z)\|_2\\
\text{subject to}\quad & M(z) = M_0 - \sum_{i=1}^{m} z_i M_i\\
& M_0 = \sum_{j=m+1}^{q} \widehat P(x_j)\, x_j x_j^\top\\
& M_i = x_i x_i^\top,\ \ i \in [1, m] \quad (x_1, \ldots, x_m\ \text{the elements of } \operatorname{supp}(\widehat Q))\\
& z^\top \mathbf 1 = 1 \ \wedge\ z \ge 0.
\end{aligned}$$
But what if we want to use kernels?
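To make the SDP data concrete, a small sketch (hypothetical variable names, uniform target weights by default) constructing $M_0$ and the $M_i$ from the two samples.

```python
# Sketch: building the matrices of the input-space SDP.
# X_src: the m labeled source points (supp(Q_hat)); X_tgt: the unlabeled target
# points x_{m+1}, ..., x_q with empirical weights P_hat.
import numpy as np

def build_sdp_matrices(X_src, X_tgt, p_hat=None):
    if p_hat is None:
        p_hat = np.full(len(X_tgt), 1.0 / len(X_tgt))
    M0 = sum(p * np.outer(x, x) for p, x in zip(p_hat, X_tgt))  # M_0
    Ms = [np.outer(x, x) for x in X_src]                         # M_i, i in [1, m]
    return M0, Ms
# The SDP then minimizes ||M0 - sum_i z_i * Ms[i]||_2 over the simplex in z.
```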
Discrepancy Min. with Kernels
For the $L_2$ loss and $H = \{h \in \mathbb H : \|h\|_K \le \Lambda\}$, it can be cast as a similar SDP:
$$\begin{aligned}
\text{minimize}\quad & \|M'(z)\|_2\\
\text{subject to}\quad & M'(z) = M'_0 - \sum_{i=1}^{m} z_i M'_i\\
& M'_0 = K^{1/2} D_0 K^{1/2}\\
& M'_i = K^{1/2} D_i K^{1/2}\\
& z^\top \mathbf 1 = 1 \ \wedge\ z \ge 0.
\end{aligned}$$
But this cannot be solved practically even for a few hundred points, even with the best public SDP solvers.
Discrepancy Min. SDP Algorithm
Smooth approximation [Nesterov (1983, 2005)]:
• $F : z \mapsto \|M(z)\|_2$ not differentiable.
• $G_p : z \mapsto \frac{1}{2}\operatorname{Tr}[M(z)^{2p}]^{1/p}$: smooth uniform approximation.
Algorithm (Fig. 2, smooth approximation algorithm):

$u_0 \leftarrow \operatorname{argmin}_{u \in C} u^\top J u$
for $k \ge 0$ do
  $v_k \leftarrow \operatorname{argmin}_{u \in C} \frac{2p-1}{2}(u - u_k)^\top J (u - u_k) + \nabla G_p(M(u_k))^\top u$
  $w_k \leftarrow \operatorname{argmin}_{u \in C} \frac{2p-1}{2}(u - u_0)^\top J (u - u_0) + \sum_{i=0}^{k} \frac{i+1}{2} \nabla G_p(M(u_i))^\top u$
  $u_{k+1} \leftarrow \frac{2}{k+3} w_k + \frac{k+1}{k+3} v_k$
end for

Excerpt from the accompanying paper [Cortes & MM (ALT 2011)]:

... with Lipschitz-continuous gradient. This is the technique we consider in the following. Recall the general form of the discrepancy minimization SDP in the feature space:
$$\text{minimize } \|M(z)\|_2 \quad\text{subject to } M(z) = \sum_{i=0}^{m} z_i M_i \ \wedge\ z_0 = -1 \ \wedge\ \sum_{i=1}^{m} z_i = 1 \ \wedge\ \forall i \in [1, m],\ z_i \ge 0, \qquad (15)$$
where $z \in \mathbb R^{m+1}$ and where the matrices $M_i \in S^N_+$, $i \in [0, m]$, are fixed SPSD matrices. Thus, here $C = \{z \in \mathbb R^{m+1} : z_0 = -1 \wedge \sum_{i=1}^{m} z_i = 1 \wedge \forall i \in [1, m],\ z_i \ge 0\}$. We further assume in the following that the matrices $M_i$ are linearly independent, since the problem can be reduced to that case straightforwardly. The symmetric matrix $J = [\langle M_i, M_j\rangle_F]_{i,j} \in \mathbb R^{(m+1)\times(m+1)}$ is then PDS, and we use the norm $x \mapsto \sqrt{\langle Jx, x\rangle} = \|x\|_J$ on $\mathbb R^{m+1}$.

A difficulty in solving this SDP is that the function $F : z \mapsto \|M(z)\|_2$ is not differentiable, since eigenvalues are not differentiable functions at points where they coalesce, which, by the nature of the minimization, is likely to be the case precisely at the optimum. Instead, we can seek a smooth approximation of that function. One natural candidate is the function $z \mapsto \|M(z)\|_F^2$. However, the Frobenius norm can lead to a very coarse approximation of the spectral norm. As suggested by Nesterov, the function $G_p : M \mapsto \frac12\operatorname{Tr}[M^{2p}]^{1/p}$, where $p \ge 1$ is an integer, can be used to give a smooth approximation. Indeed, let $\lambda_1(M) \ge \lambda_2(M) \ge \cdots \ge \lambda_N(M)$ denote the eigenvalues of a matrix $M \in S^N$ in decreasing order. By the definition of the trace, for all $M \in S^N$, $G_p(M) = \frac12\big(\sum_{i=1}^{N}\lambda_i^{2p}(M)\big)^{1/p}$, thus
$$\tfrac12\lambda^2 \le G_p(M) \le \tfrac12\big(\operatorname{rank}(M)\,\lambda^{2p}\big)^{1/p},$$
where $\lambda = \max\{\lambda_1(M), -\lambda_N(M)\} = \|M\|_2$. Thus, if we choose $r$ as the maximum rank, $r = \max_{z\in C}\operatorname{rank}(M(z)) \le \max\{N, \sum_{i=0}^{n}\operatorname{rank}(M_i)\}$, then for all $z \in C$,
$$\tfrac12\|M(z)\|_2^2 \le G_p(M(z)) \le \tfrac12 r^{1/p}\|M(z)\|_2^2. \qquad (16)$$
This leads to a smooth approximation algorithm for solving the SDP (15), derived from Algorithm 1 by replacing the objective function $F$ with $G_p$. Choosing the prox-function $d : u \mapsto \frac12\|u - u_0\|_J^2$ leads to the algorithm whose pseudocode is given in Figure 2, after some minor simplifications. The following theorem guarantees that its maximum number of iterations to achieve a relative accuracy of $\epsilon$ is in $O(\sqrt{r\log r}/\epsilon)$.
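For illustration, a short sketch (my own, not the paper's code) of evaluating the smooth surrogate $G_p$ and its gradient via equation (17) below; it uses an eigendecomposition rather than the binary powering method discussed in the paper, which is fine for small matrices.

```python
# Sketch: G_p(M(z)) = 1/2 * Tr[M(z)^{2p}]^{1/p} and its gradient
#   [grad G_p(M(z))]_i = <M(z)^{2p-1}, M_i>_F * Tr[M(z)^{2p}]^{1/p - 1},
# computed via an eigendecomposition of the symmetric matrix M(z).
import numpy as np

def smooth_objective_and_grad(z, Ms, p):
    """Ms: list of symmetric matrices M_0, ..., M_m; z: vector in R^{m+1}; p: integer >= 1."""
    M = sum(zi * Mi for zi, Mi in zip(z, Ms))
    w, V = np.linalg.eigh(M)
    M2p = (V * w ** (2 * p)) @ V.T          # M(z)^{2p}
    M2p_1 = (V * w ** (2 * p - 1)) @ V.T    # M(z)^{2p-1}
    tr = np.trace(M2p)
    gp = 0.5 * tr ** (1.0 / p)
    grad = np.array([np.sum(M2p_1 * Mi) for Mi in Ms]) * tr ** (1.0 / p - 1)
    return gp, grad
```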
Convergence Guarantee
Let $r = \max_{z\in C}\operatorname{rank}(M(z)) \le \max\{N, \sum_{i=0}^{n}\operatorname{rank}(M_i)\}$.
Theorem: for any $\epsilon > 0$, the algorithm solves the discrepancy minimization SDP with relative accuracy $\epsilon$ in $O(\sqrt{r\log r}/\epsilon)$ iterations.
This Talk
Domain adaptation
• Discrepancy distance
• Theoretical guarantees
• Algorithm
• Experiments
Drifting scenario
Experiments - Time
[Figure 4 (log-log scales). Left: run times (mean ± one standard deviation) for the QP optimization and for the computation of $\nabla G_p$, as a function of the sample size (about 80 seconds at $m+n = 2{,}000$). Right: total time taken by Algorithm 2 to compute the optimization solution, compared to SeDuMi.]
The cost per iteration is in $O((\log p)(m+n)^3)$, and thus depends on the combined size of the labeled and unlabeled data. Figure 4 shows typical timing results obtained for sample sizes in the range $m+n = 500$ to $m+n = 10{,}000$ with $p = 16$, which was empirically observed to guarantee convergence. For a sample size of $m+n = 2{,}000$ the time is about 80 seconds; with 5 iterations of Algorithm 2, the total time is $5 \times (80 + 2\times 10) + 10 = 510$ seconds.
In contrast, even the most efficient publicly available SDP solver, SeDuMi (http://sedumi.ie.lehigh.edu/), cannot solve our discrepancy minimization SDPs for more than a few hundred points in the kernelized version; in our experiments it simply failed for set sizes larger than $m+n = 750$. Figure 4 also compares typical run times of Algorithm 2 with 5 iterations to those of SeDuMi.
Experiments - Performance
• Multi-domain sentiment analysis data set (Blitzer et al., 2007): books, dvd, elec, kitchen.
• Treated as a regression task.
• Example: adaptation to books.
Experiments - Performance
[Figure 3: RMSE for the 12 adaptation tasks (panels: Adaptation to Books, Dvd, Elec, Kitchen; curves: source domains books, dvd, elec, kitchen) as a function of the size of the unlabeled data used. Note that the panels do not use the same y-scale.]
Excerpt from the accompanying paper [Cortes & MM (ALT 2011)]:

Theorem 5. For any $\epsilon > 0$, Algorithm 2 solves the SDP (15) with relative accuracy $\epsilon$ in at most $\frac{4}{\epsilon}\sqrt{(1+\epsilon)\, r\log r}$ iterations, using the objective function $G_p$ with $p \in [q_0, 2q_0)$ and $q_0 = (1+\epsilon)(\log r)/\epsilon$.
Proof. The proof follows directly [14]; it is given in Appendix A for completeness.

The first step of the algorithm consists of computing the vector $u_0$ by solving the simple QP of line 1. We now discuss how to efficiently compute the steps of each iteration in the case of our discrepancy minimization problems. Each iteration requires solving two simple QPs (lines 3 and 4). To do so, the computation of the gradient $\nabla G_p(M(u_k))$ is needed. This represents the main computational cost at each iteration, other than solving the QPs already mentioned, since the sum $\sum_{i=0}^{k}\frac{i+1}{2}\nabla G_p(M(u_i))^\top u$ required at line 4 can clearly be computed in constant time from its value at the previous iteration. Since for any $z$,
$$G_p(M(z)) = \tfrac12\operatorname{Tr}[M^{2p}(z)]^{1/p} = \tfrac12\operatorname{Tr}\Big[\Big(\sum_{i=0}^{m} z_i M_i\Big)^{2p}\Big]^{1/p},$$
using the linearity of the trace operator, the $i$th coordinate of the gradient is given by
$$[\nabla G_p(M(z))]_i = \langle M^{2p-1}(z), M_i\rangle_F\, \operatorname{Tr}[M^{2p}(z)]^{\frac1p - 1}, \qquad (17)$$
for all $i \in [0, m]$. Thus, the computation of the gradient can be reduced to that of the matrices $M^{2p-1}(z)$ and $M^{2p}(z)$. When the dimension $N$ of the feature space is not too large, both can be computed via $O(\log p)$ matrix multiplications using the binary decomposition method for matrix powers [5]. Since each matrix multiplication takes $O(N^3)$, the total computational cost for determining the gradient is then in $O((\log p)N^3)$. The cubic-time matrix multiplication can be replaced by more favorable complexity terms of the form $O(N^{2+\alpha})$, with $\alpha = 0.376$. Alternatively, for large values of $N$, that is $N \gg m+n$, in view of Theorem 3 we can instead solve the kernelized version of the problem. Since it is formulated as the same SDP, the same smooth optimization technique can be applied. Instead of $M(z)$, we need to consider the matrix $M'(z) = K^{1/2} D(z) K^{1/2}$. Now, observe that
$$M'^{\,2p}(z) = \big(K^{1/2} D(z) K^{1/2}\big)^{2p} = K^{1/2}\big(D(z) K\big)^{2p-1} D(z)\, K^{1/2}.$$
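A small consequence of this identity (my own note, not from the paper) is that the trace needed by $G_p$ can be computed from powers of $D(z)K$ alone, by cyclicity of the trace, without ever forming $K^{1/2}$.

```python
# Sketch: Tr[M'(z)^{2p}] = Tr[K^{1/2} (D(z)K)^{2p-1} D(z) K^{1/2}]
#                        = Tr[(D(z)K)^{2p}]   (cyclicity of the trace).
import numpy as np

def trace_Mprime_2p(D_z, K, p):
    """D_z: the diagonal matrix D(z); K: kernel matrix; p: integer >= 1."""
    return np.trace(np.linalg.matrix_power(D_z @ K, 2 * p))
```

np.linalg.matrix_power uses repeated squaring, in line with the $O(\log p)$ matrix multiplications mentioned above.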
This Talk
Domain adaptation
• Discrepancy distance
• Theoretical guarantees
• Algorithm
• Experiments
Drifting scenario
Generalized Discrepancy
Definition: given two distributions $D$ and $D'$ over $X \times Y$, the $Y$-discrepancy is defined as follows:
$$\operatorname{disc}_Y(D, D') = \sup_{h\in H}\ \big| L_{D'}(h) - L_D(h) \big|, \quad\text{where } L_D(h) = \mathbb E_{(x,y)\sim D}\big[L(h(x), y)\big].$$
Drifting PAC Scenario
[MM & Muñoz, 2012]
Model:
[Figure: a sequence of distributions $D_1, D_2, \ldots, D_T, D_{T+1}$; the training points $(x_1, y_1), (x_2, y_2), \ldots, (x_T, y_T)$ are drawn from $D_1, \ldots, D_T$, and the test point $(x, y)$ from $D_{T+1}$.]
Drifting PAC - Guarantee
Theorem: assume that the loss $L$ is bounded by $M$. Let $D_1, \ldots, D_{T+1}$ be a sequence of distributions and let $H_L = \{(x, y) \mapsto L(h(x), y) : h \in H\}$. Then, for any $\delta > 0$, with probability at least $1 - \delta$,
$$L_{D_{T+1}}(h) \le \frac{1}{T}\sum_{t=1}^{T} L(h(x_t), y_t) + 2\mathfrak R_T(H_L) + \frac{1}{T}\sum_{t=1}^{T}\operatorname{disc}_Y(D_t, D_{T+1}) + M\sqrt{\frac{\log\frac{2}{\delta}}{2T}}.$$
Drifting Tracking Scenario
[Bartlett, COLT 1992]
Model: perpetual training and testing.
[Figure: a sequence of distributions $D_1, D_2, \ldots, D_{T-1}, D_T$ with one sample point $(x_t, y_t)$ drawn from each $D_t$.]
• tracking:
$$\sup_{(D_t)}\Big\{ \mathbb E_{(x_T, y_T)\sim D_T}\big[L(h_{T-1}(x_T), y_T)\big] : \forall t,\ \operatorname{disc}_Y(D_t, D_{t+1}) \le \Delta \Big\} < \inf_{h\in H} L_{D_T}(h) + \epsilon.$$
Drifting Tracking Scenario
Expected mistake:
$$M_T(\mathcal D) = \mathbb E_{S\sim\mathcal D}\big[\widehat M_T\big] = \mathbb E_{S\sim\mathcal D}\big[L(h_{T-1}(x_T), y_T)\big], \qquad \mathcal D = \bigotimes_{t=1}^{T} D_t.$$
• supremum over all distribution sequences $\mathcal D$:
$$\overline M_T = \sup_{\operatorname{disc}_Y(D_t, D_{t+1}) \le \Delta} M_T(\mathcal D).$$
Algorithm $A$ $(\Delta, \epsilon)$-tracks $H$ if, for $T$ sufficiently large,
$$\overline M_T < \inf_{h\in H} L_{D_T}(h) + \epsilon.$$
Drifting Tracking - Guarantee
Theorem: let $H$ be a hypothesis set with VC-dimension $d$ and $\mathfrak R_T(H_L) \le O\big(\sqrt{d/T}\big)$, and let $(D_t)$ be a sequence of distributions such that $\operatorname{disc}_Y(D_t, D_{t+1}) \le \Delta$ for all $t$. If $\Delta = O(\epsilon^3/d)$, there exists an algorithm that $(\Delta, \epsilon)$-tracks $H$.
• improvement over the previous $L_1$-distance-based result of (Long, 1999).
General On-Line to Batch
Theorem: assume that the loss $L$ is bounded by $M$ and that it is convex with respect to its first argument. Let $h_1, \ldots, h_T$ be the hypotheses returned by the on-line algorithm $A$ and let $h = \sum_{t=1}^{T} w_t h_t$. Then, with probability at least $1 - \delta$,
$$L_{D_{T+1}}(h) \le \sum_{t=1}^{T} w_t L(h_t(x_t), y_t) + \sum_{t=1}^{T} w_t \operatorname{disc}_Y(D_t, D_{T+1}) + M\|w\|_2\sqrt{2\log\tfrac{1}{\delta}},$$
and
$$L_{D_{T+1}}(h) \le \inf_{h\in H} L(h) + \frac{R_T}{T} + \sum_{t=1}^{T} w_t \operatorname{disc}_Y(D_t, D_{T+1}) + M\|w - u_0\|_1 + 2M\|w\|_2\sqrt{2\log\tfrac{2}{\delta}}.$$
Drifting Algorithm
Idea: find the best mixture weights.
• deterministic setting:
$$\begin{aligned}
\min_{w}\quad & \lambda\|w\|_2^2 + \sum_{t=1}^{T} w_t\big(\operatorname{disc}_Y(D_t, D_{T+1}) + L(h_t(x_t), y_t)\big)\\
\text{subject to:}\quad & \Big(\sum_{t=1}^{T} w_t = 1\Big) \wedge \big(\forall t\in[1, T],\ w_t \ge 0\big).
\end{aligned}$$
• estimation of the discrepancies using independent samples, with $\operatorname{disc}_Y(D_t, D_{T+1}) \le \operatorname{disc}(D_t, D_{T+1})$.
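A minimal sketch (my own; disc and inst_loss are assumed to be precomputed estimates) of this weight-selection step: for this particular quadratic, the minimizer over the simplex is the Euclidean projection of $-c/(2\lambda)$.

```python
# Sketch: solve min_w lam*||w||_2^2 + c . w over the probability simplex,
# where c[t] = disc_Y(D_t, D_{T+1}) + L(h_t(x_t), y_t). Completing the square
# shows the solution is the projection of -c/(2*lam) onto the simplex.
import numpy as np

def project_simplex(v):
    """Euclidean projection of v onto {w : w >= 0, sum(w) = 1}."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u + (1.0 - css) / (np.arange(len(v)) + 1) > 0)[0][-1]
    theta = (1.0 - css[rho]) / (rho + 1)
    return np.maximum(v + theta, 0.0)

def mixture_weights(disc, inst_loss, lam):
    c = np.asarray(disc) + np.asarray(inst_loss)
    return project_simplex(-c / (2.0 * lam))
```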
Experiments
• base on-line algorithm: Widrow-Hoff.
• uniform weights, uniform over last 100, and weighted algorithms.
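For reference, a sketch (my own assumptions, not the talk's code) of the Widrow-Hoff (LMS) base learner and of combining its iterates with mixture weights such as those from the previous sketch.

```python
# Sketch: Widrow-Hoff / LMS produces one linear hypothesis w_t per round;
# the drifting algorithm then predicts with a weighted combination of them.
import numpy as np

def widrow_hoff(X, y, eta=0.01):
    """Return the iterates w_1, ..., w_T (one row per round)."""
    w = np.zeros(X.shape[1])
    iterates = []
    for x_t, y_t in zip(X, y):
        w = w - eta * (w @ x_t - y_t) * x_t  # gradient step on the squared error
        iterates.append(w.copy())
    return np.array(iterates)

def combined_predictor(iterates, mix):
    """h = sum_t mix[t] * h_t for linear hypotheses h_t(x) = w_t . x."""
    return (np.asarray(mix)[:, None] * iterates).sum(axis=0)
```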
[Figure: squared error as a function of sample size for the Weighted, Regular, and Fixed algorithms.]
Conclusion
Recent and upcoming:
• theoretical properties of discrepancy minimization.
• explicit learning guarantees in terms of size of unlabeled data.
• adaptation with a small amount of labeled data: analysis, theory, and algorithms.