TRANSCRIPT
Adapting to a Non-Ideal World
Mehryar Mohri, Courant Institute and Google Research
Includes joint work with Corinna Cortes, Yishay Mansour, Andres Muñoz, and Afshin Rostamizadeh.
Ideal World
Standard learning assumptions:
• same distribution for training and test.
• distributions fixed over time.
• IID sampling.
Ideal vs Real World
[Figure: table contrasting the ideal world and the real world along three dimensions: sampling, domain, and time.]
This Talk
Domain adaptation
• Discrepancy distance
• Theoretical guarantees
• Algorithm
• Experiments
Drifting scenario
Domain Adaptation
Sentiment analysis: labeled appraisal data for some domains, e.g., movies, books, music, restaurants, but no labels for travel.
Language modeling, part-of-speech tagging.
Statistical parsing.
Speech recognition.
Computer vision.
Solution critical for applications.
Domain Adaptation Problem
Domains: source $(Q, f_Q)$, target $(P, f_P)$.
Input:
• labeled sample $S$ drawn from the source.
• unlabeled sample $T$ drawn from the target.
Problem: find a hypothesis $h$ in $H$ with small expected loss with respect to the target domain, that is,
$$L_P(h, f_P) = \mathbb E_{x\sim P}\big[L\big(h(x), f_P(x)\big)\big].$$
Related Work
Single-source adaptation:
• language modeling, probabilistic parsers, maxent models: source domain used to define a prior.
• relation between adaptation and the $d_A$ distance [Ben-David et al. (NIPS 2006); Blitzer et al. (NIPS 2007)].
• a few negative examples of adaptation [Ben-David et al. (AISTATS 2010)].
• analysis and learning guarantees for importance weighting [Cortes, Mansour, and MM (NIPS 2010)].
Related Work
Multiple-source:
• related problem: same input distribution, but different labels modulo disparity constraints [Crammer, Kearns, and Wortman (NIPS 2005, NIPS 2006)].
• theoretical analysis and method for multiple-source adaptation using the notion of Rényi divergence [Mansour, MM, and Rostamizadeh (NIPS 2008, UAI 2009)].
Distribution Mismatch
[Figure: the two distributions $P$ and $Q$.]
Which distance should we use to compare these distributions?
Simple Analysis
Proposition: assume that the loss $L$ is bounded by $M$. Then,
$$|L_Q(h, f) - L_P(h, f)| \le M\, L_1(Q, P).$$
Proof:
$$\begin{aligned}
|L_P(h, f) - L_Q(h, f)| &= \Big|\,\mathbb E_{x\sim P}\big[L(h(x), f(x))\big] - \mathbb E_{x\sim Q}\big[L(h(x), f(x))\big]\Big|\\
&= \Big|\sum_x \big(P(x) - Q(x)\big)\, L(h(x), f(x))\Big|\\
&\le M \sum_x \big|P(x) - Q(x)\big|.
\end{aligned}$$
But, is this bound informative?
Example - 0/1 Loss
[Figure: target function $f$ and a hypothesis $h_a$ that differ only on a region $a$.]
$$|L_Q(h, f) - L_P(h, f)| = |Q(a) - P(a)|.$$
Discrepancy
(Mansour, MM, Rostamizadeh, 2009)
Definition:
$$\operatorname{disc}(P, Q) = \max_{h, h'\in H}\ \big| L_P(h', h) - L_Q(h', h) \big|.$$
• symmetric, satisfies the triangle inequality, but in general not a distance.
• helps compare distributions for arbitrary losses, e.g., the hinge loss or the $L_p$ loss.
• generalization of the $d_A$ distance (Devroye et al., 1996; Kifer et al., 2004; Ben-David et al., 2007).
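As a quick illustration (my own sketch, not part of the talk), the empirical discrepancy can be estimated by brute force over a small finite pool of hypotheses; the hypothesis pool and data below are hypothetical.

```python
# Sketch (assumptions: squared loss, a small finite hypothesis pool standing in for H).
# Brute-force evaluation of disc(P_hat, Q_hat) = max_{h, h'} |L_Phat(h', h) - L_Qhat(h', h)|.
import numpy as np

def empirical_discrepancy(X_src, X_tgt, hypotheses):
    best = 0.0
    for h in hypotheses:
        for h2 in hypotheses:
            loss_src = np.mean((h(X_src) - h2(X_src)) ** 2)  # L_Qhat(h', h)
            loss_tgt = np.mean((h(X_tgt) - h2(X_tgt)) ** 2)  # L_Phat(h', h)
            best = max(best, abs(loss_tgt - loss_src))
    return best

# Hypothetical usage with linear hypotheses h_w(x) = w . x:
rng = np.random.default_rng(0)
X_src = rng.normal(0.0, 1.0, (50, 3))   # sample from the source Q
X_tgt = rng.normal(0.5, 1.0, (40, 3))   # sample from the target P
pool = [lambda X, w=w: X @ w for w in rng.normal(0.0, 1.0, (20, 3))]
print(empirical_discrepancy(X_src, X_tgt, pool))
```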
Discrepancy - Properties
Theorem: for the $L_q$ loss bounded by $M$, for any $\delta > 0$, with probability at least $1 - \delta$,
$$\operatorname{disc}(P, Q) \le \operatorname{disc}(\widehat P, \widehat Q) + 4q\big(\widehat{\mathfrak R}_S(H) + \widehat{\mathfrak R}_T(H)\big) + 3M\left(\sqrt{\frac{\log\frac{4}{\delta}}{2m}} + \sqrt{\frac{\log\frac{4}{\delta}}{2n}}\right).$$
This Talk
Domain adaptation
• Discrepancy distance
• Theoretical guarantees
• Algorithm
• Experiments
Drifting scenario
Theoretical Guarantees
Two types of questions:
• difference between the average loss of a hypothesis $h$ on $Q$ versus on $P$?
• difference of loss (measured on $P$) between the hypothesis $h$ obtained when training on $(\widehat Q, f_Q)$ versus the hypothesis $h'$ obtained when training on $(\widehat P, f_P)$?
Generalization Bound
[Mansour, MM, Rostamizadeh (COLT 2009)]
Notation:
• $L_Q(h^*_Q, f) = \min_{h\in H} L_Q(h, f)$
• $L_P(h^*_P, f) = \min_{h\in H} L_P(h, f)$
Theorem: assume that $L$ obeys the triangle inequality. Then the following holds:
$$L_P(h, f_P) \le L_Q(h, h^*_Q) + L_P(h^*_P, f_P) + \operatorname{disc}(P, Q) + L_Q(h^*_Q, h^*_P).$$
Some Natural Cases
When $h^* = h^*_Q = h^*_P$:
$$L_P(h, f_P) \le L_Q(h, h^*) + L_P(h^*, f_P) + \operatorname{disc}(P, Q).$$
When $f_P \in H$ (consistent case):
$$|L_P(h, f_P) - L_Q(h, f_P)| \le \operatorname{disc}(Q, P).$$
Bound of (Ben-David et al., NIPS 2006) & (Blitzer et al., NIPS 2007): always worse in these cases.
Kernel-Based Reg. (KBR) Algorithms
Objective function:
$$F_{\widehat Q}(h) = \lambda \|h\|_K^2 + \widehat R_{\widehat Q}(h),$$
where $K$ is a PDS kernel, $\lambda > 0$ is a trade-off parameter, and $\widehat R_{\widehat Q}(h)$ is the empirical error of $h$.
• broad family of algorithms including SVM, SVR, kernel ridge regression, etc.
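For concreteness, here is a minimal sketch (not code from the talk) of one member of this family, kernel ridge regression with a Gaussian kernel; all names and parameter values are illustrative.

```python
# Sketch: kernel ridge regression as an instance of F_Qhat(h) = lam*||h||_K^2 + R_Qhat(h),
# with R_Qhat the averaged empirical squared error. The minimizer has dual coefficients
# alpha = (K + lam*m*I)^{-1} y.
import numpy as np

def gaussian_kernel(X1, X2, gamma=1.0):
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def krr_fit(X, y, lam=0.1, gamma=1.0):
    K = gaussian_kernel(X, X, gamma)
    return np.linalg.solve(K + lam * len(y) * np.eye(len(y)), y)

def krr_predict(X_train, alpha, X_test, gamma=1.0):
    return gaussian_kernel(X_test, X_train, gamma) @ alpha
```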
Guarantees for KBR Algorithms
Theorem: let $K$ be a PDS kernel with $K(x, x) \le R^2$ and $L$ a loss function such that $L(\cdot, y)$ is $\mu$-Lipschitz. Assume that $f_P \in H$. Then, for all $(x, y) \in X \times Y$,
$$\big|L(h'(x), y) - L(h(x), y)\big| \le \mu R\, \sqrt{\frac{\operatorname{disc}(\widehat P, \widehat Q) + \mu\eta}{\lambda}},$$
where $\eta = \max\{L(f_Q(x), f_P(x)) : x \in \operatorname{supp}(\widehat Q)\}$.
[Cortes & MM (ALT 2011)]
Guarantees for KBR Algorithms
Theorem: let $K$ be a PDS kernel with $K(x, x) \le R^2$ and let the $L_2$ loss be bounded by $M$. Then, for all $(x, y) \in X \times Y$,
$$\big|L(h'(x), y) - L(h(x), y)\big| \le \frac{R\sqrt{M}}{\lambda}\Big(\delta + \sqrt{\delta^2 + 4\lambda\operatorname{disc}(\widehat P, \widehat Q)}\Big),$$
where
$$\delta = \min_{h\in H}\Big\| \mathbb E_{x\sim\widehat Q}\big[\big(h(x) - f_Q(x)\big)\Phi_K(x)\big] - \mathbb E_{x\sim\widehat P}\big[\big(h(x) - f_P(x)\big)\Phi_K(x)\big] \Big\|_K.$$
• for Gaussian kernels and $f_P = f_Q$ continuous: $\delta = 0$.
[Cortes & MM (2012)]
Empirical Discrepancy
The discrepancy distance $\operatorname{disc}(\widehat P, \widehat Q)$ is the critical term in the bounds.
A smaller empirical discrepancy guarantees closeness of the pointwise losses of $h'$ and $h$.
But, can we further reduce the discrepancy?
This Talk
Domain adaptation
• Discrepancy distance
• Theoretical guarantees
• Algorithm
• Experiments
Drifting scenario
Algorithm - Idea
Search for a new empirical distribution $q^*$ with the same support:
$$q^* = \operatorname*{argmin}_{\operatorname{supp}(q)\subseteq \operatorname{supp}(\widehat Q)} \operatorname{disc}(\widehat P, q).$$
Solve the modified KBR problem:
$$\min_h F_{q^*}(h) = \frac{1}{m}\sum_{i=1}^{m} q^*(x_i)\, L(h(x_i), y_i) + \lambda\|h\|_K^2.$$
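A minimal sketch (my own, assuming the squared loss) of the modified KBR step once a reweighting q* has been computed; only the data-fit term changes, so the solution is still a linear system.

```python
# Sketch: weighted kernel ridge regression. Minimizing
#   (1/m) * sum_i q_i * ((K a)_i - y_i)^2 + lam * a' K a
# and setting the gradient to zero gives (W K + lam*m*I) a = W y, with W = diag(q).
import numpy as np

def weighted_krr(K, y, q, lam):
    """K: kernel matrix on the source points; q: weights q*(x_i); returns dual coefficients."""
    m = len(y)
    W = np.diag(q)
    return np.linalg.solve(W @ K + lam * m * np.eye(m), W @ y)
```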
Case of Halfspaces
[Figure-only slide.]
Regression - L2 Loss
Problem:
$$\min_{\operatorname{supp}(q)\subseteq \operatorname{supp}(\widehat Q)}\ \max_{h, h'\in H}\ \Big| \mathbb E_{x\sim \widehat P}\big[(h'(x) - h(x))^2\big] - \mathbb E_{x\sim q}\big[(h'(x) - h(x))^2\big] \Big|.$$
Discrepancy Min. - Input space
For the $L_2$ loss and $H = \{x \mapsto w^\top x : \|w\| \le \Lambda\}$, the problem can be cast as an SDP:
$$\begin{aligned}
\text{minimize}\quad & \|M(z)\|_2\\
\text{subject to}\quad & M(z) = M_0 - \sum_{i=1}^{m} z_i M_i\\
& M_0 = \sum_{j=m+1}^{q} \widehat P(x_j)\, x_j x_j^\top\\
& M_i = x_i x_i^\top,\ \ i \in [1, m] \quad (x_1, \ldots, x_m\ \text{the elements of } \operatorname{supp}(\widehat Q))\\
& z^\top \mathbf 1 = 1 \ \wedge\ z \ge 0.
\end{aligned}$$
But what if we want to use kernels?
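To make the SDP data concrete, a small sketch (hypothetical variable names, uniform target weights by default) constructing $M_0$ and the $M_i$ from the two samples.

```python
# Sketch: building the matrices of the input-space SDP.
# X_src: the m labeled source points (supp(Q_hat)); X_tgt: the unlabeled target
# points x_{m+1}, ..., x_q with empirical weights P_hat.
import numpy as np

def build_sdp_matrices(X_src, X_tgt, p_hat=None):
    if p_hat is None:
        p_hat = np.full(len(X_tgt), 1.0 / len(X_tgt))
    M0 = sum(p * np.outer(x, x) for p, x in zip(p_hat, X_tgt))  # M_0
    Ms = [np.outer(x, x) for x in X_src]                         # M_i, i in [1, m]
    return M0, Ms
# The SDP then minimizes ||M0 - sum_i z_i * Ms[i]||_2 over the simplex in z.
```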
Discrepancy Min. with Kernels
For the $L_2$ loss and $H = \{h \in \mathbb H : \|h\|_K \le \Lambda\}$, it can be cast as a similar SDP:
$$\begin{aligned}
\text{minimize}\quad & \|M'(z)\|_2\\
\text{subject to}\quad & M'(z) = M'_0 - \sum_{i=1}^{m} z_i M'_i\\
& M'_0 = K^{1/2} D_0 K^{1/2}\\
& M'_i = K^{1/2} D_i K^{1/2}\\
& z^\top \mathbf 1 = 1 \ \wedge\ z \ge 0.
\end{aligned}$$
But this cannot be solved practically even for a few hundred points, even with the best public SDP solvers.
Discrepancy Min. SDP Algorithm
Smooth approximation [Nesterov (1983, 2005)]:
• $F : z \mapsto \|M(z)\|_2$ not differentiable.
• $G_p : z \mapsto \frac{1}{2}\operatorname{Tr}[M(z)^{2p}]^{1/p}$: smooth uniform approximation.
Algorithm (Fig. 2, smooth approximation algorithm):

$u_0 \leftarrow \operatorname{argmin}_{u \in C} u^\top J u$
for $k \ge 0$ do
  $v_k \leftarrow \operatorname{argmin}_{u \in C} \frac{2p-1}{2}(u - u_k)^\top J (u - u_k) + \nabla G_p(M(u_k))^\top u$
  $w_k \leftarrow \operatorname{argmin}_{u \in C} \frac{2p-1}{2}(u - u_0)^\top J (u - u_0) + \sum_{i=0}^{k} \frac{i+1}{2} \nabla G_p(M(u_i))^\top u$
  $u_{k+1} \leftarrow \frac{2}{k+3} w_k + \frac{k+1}{k+3} v_k$
end for

Excerpt from the accompanying paper [Cortes & MM (ALT 2011)]:

... with Lipschitz-continuous gradient. This is the technique we consider in the following. Recall the general form of the discrepancy minimization SDP in the feature space:
$$\text{minimize } \|M(z)\|_2 \quad\text{subject to } M(z) = \sum_{i=0}^{m} z_i M_i \ \wedge\ z_0 = -1 \ \wedge\ \sum_{i=1}^{m} z_i = 1 \ \wedge\ \forall i \in [1, m],\ z_i \ge 0, \qquad (15)$$
where $z \in \mathbb R^{m+1}$ and where the matrices $M_i \in S^N_+$, $i \in [0, m]$, are fixed SPSD matrices. Thus, here $C = \{z \in \mathbb R^{m+1} : z_0 = -1 \wedge \sum_{i=1}^{m} z_i = 1 \wedge \forall i \in [1, m],\ z_i \ge 0\}$. We further assume in the following that the matrices $M_i$ are linearly independent, since the problem can be reduced to that case straightforwardly. The symmetric matrix $J = [\langle M_i, M_j\rangle_F]_{i,j} \in \mathbb R^{(m+1)\times(m+1)}$ is then PDS, and we use the norm $x \mapsto \sqrt{\langle Jx, x\rangle} = \|x\|_J$ on $\mathbb R^{m+1}$.

A difficulty in solving this SDP is that the function $F : z \mapsto \|M(z)\|_2$ is not differentiable, since eigenvalues are not differentiable functions at points where they coalesce, which, by the nature of the minimization, is likely to be the case precisely at the optimum. Instead, we can seek a smooth approximation of that function. One natural candidate is the function $z \mapsto \|M(z)\|_F^2$. However, the Frobenius norm can lead to a very coarse approximation of the spectral norm. As suggested by Nesterov, the function $G_p : M \mapsto \frac12\operatorname{Tr}[M^{2p}]^{1/p}$, where $p \ge 1$ is an integer, can be used to give a smooth approximation. Indeed, let $\lambda_1(M) \ge \lambda_2(M) \ge \cdots \ge \lambda_N(M)$ denote the eigenvalues of a matrix $M \in S^N$ in decreasing order. By the definition of the trace, for all $M \in S^N$, $G_p(M) = \frac12\big(\sum_{i=1}^{N}\lambda_i^{2p}(M)\big)^{1/p}$, thus
$$\tfrac12\lambda^2 \le G_p(M) \le \tfrac12\big(\operatorname{rank}(M)\,\lambda^{2p}\big)^{1/p},$$
where $\lambda = \max\{\lambda_1(M), -\lambda_N(M)\} = \|M\|_2$. Thus, if we choose $r$ as the maximum rank, $r = \max_{z\in C}\operatorname{rank}(M(z)) \le \max\{N, \sum_{i=0}^{n}\operatorname{rank}(M_i)\}$, then for all $z \in C$,
$$\tfrac12\|M(z)\|_2^2 \le G_p(M(z)) \le \tfrac12 r^{1/p}\|M(z)\|_2^2. \qquad (16)$$
This leads to a smooth approximation algorithm for solving the SDP (15), derived from Algorithm 1 by replacing the objective function $F$ with $G_p$. Choosing the prox-function $d : u \mapsto \frac12\|u - u_0\|_J^2$ leads to the algorithm whose pseudocode is given in Figure 2, after some minor simplifications. The following theorem guarantees that its maximum number of iterations to achieve a relative accuracy of $\epsilon$ is in $O(\sqrt{r\log r}/\epsilon)$.
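For illustration, a short sketch (my own, not the paper's code) of evaluating the smooth surrogate $G_p$ and its gradient via equation (17) below; it uses an eigendecomposition rather than the binary powering method discussed in the paper, which is fine for small matrices.

```python
# Sketch: G_p(M(z)) = 1/2 * Tr[M(z)^{2p}]^{1/p} and its gradient
#   [grad G_p(M(z))]_i = <M(z)^{2p-1}, M_i>_F * Tr[M(z)^{2p}]^{1/p - 1},
# computed via an eigendecomposition of the symmetric matrix M(z).
import numpy as np

def smooth_objective_and_grad(z, Ms, p):
    """Ms: list of symmetric matrices M_0, ..., M_m; z: vector in R^{m+1}; p: integer >= 1."""
    M = sum(zi * Mi for zi, Mi in zip(z, Ms))
    w, V = np.linalg.eigh(M)
    M2p = (V * w ** (2 * p)) @ V.T          # M(z)^{2p}
    M2p_1 = (V * w ** (2 * p - 1)) @ V.T    # M(z)^{2p-1}
    tr = np.trace(M2p)
    gp = 0.5 * tr ** (1.0 / p)
    grad = np.array([np.sum(M2p_1 * Mi) for Mi in Ms]) * tr ** (1.0 / p - 1)
    return gp, grad
```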
Convergence Guarantee
Let $r = \max_{z\in C}\operatorname{rank}(M(z)) \le \max\{N, \sum_{i=0}^{n}\operatorname{rank}(M_i)\}$.
Theorem: for any $\epsilon > 0$, the algorithm solves the discrepancy minimization SDP with relative accuracy $\epsilon$ in $O(\sqrt{r\log r}/\epsilon)$ iterations.
This Talk
Domain adaptation
• Discrepancy distance
• Theoretical guarantees
• Algorithm
• Experiments
Drifting scenario
Experiments - Time
[Figure 4 (log-log scales). Left: run times (mean ± one standard deviation) for the QP optimization and for the computation of $\nabla G_p$, as a function of the sample size (about 80 seconds at $m+n = 2{,}000$). Right: total time taken by Algorithm 2 to compute the optimization solution, compared to SeDuMi.]
The cost per iteration is in $O((\log p)(m+n)^3)$, and thus depends on the combined size of the labeled and unlabeled data. Figure 4 shows typical timing results obtained for sample sizes in the range $m+n = 500$ to $m+n = 10{,}000$ with $p = 16$, which was empirically observed to guarantee convergence. For a sample size of $m+n = 2{,}000$ the time is about 80 seconds; with 5 iterations of Algorithm 2, the total time is $5 \times (80 + 2\times 10) + 10 = 510$ seconds.
In contrast, even the most efficient publicly available SDP solver, SeDuMi (http://sedumi.ie.lehigh.edu/), cannot solve our discrepancy minimization SDPs for more than a few hundred points in the kernelized version; in our experiments it simply failed for set sizes larger than $m+n = 750$. Figure 4 also compares typical run times of Algorithm 2 with 5 iterations to those of SeDuMi.
Experiments - Performance
• Multi-domain sentiment analysis data set (Blitzer et al., 2007): books, dvd, elec, kitchen.
• Treated as a regression task.
• Example: adaptation to books.
Experiments - Performance
[Figure 3: RMSE for the 12 adaptation tasks (panels: Adaptation to Books, Dvd, Elec, Kitchen; curves: source domains books, dvd, elec, kitchen) as a function of the size of the unlabeled data used. Note that the panels do not use the same y-scale.]
Excerpt from the accompanying paper [Cortes & MM (ALT 2011)]:

Theorem 5. For any $\epsilon > 0$, Algorithm 2 solves the SDP (15) with relative accuracy $\epsilon$ in at most $\frac{4}{\epsilon}\sqrt{(1+\epsilon)\, r\log r}$ iterations, using the objective function $G_p$ with $p \in [q_0, 2q_0)$ and $q_0 = (1+\epsilon)(\log r)/\epsilon$.
Proof. The proof follows directly [14]; it is given in Appendix A for completeness.

The first step of the algorithm consists of computing the vector $u_0$ by solving the simple QP of line 1. We now discuss how to efficiently compute the steps of each iteration in the case of our discrepancy minimization problems. Each iteration requires solving two simple QPs (lines 3 and 4). To do so, the computation of the gradient $\nabla G_p(M(u_k))$ is needed. This represents the main computational cost at each iteration, other than solving the QPs already mentioned, since the sum $\sum_{i=0}^{k}\frac{i+1}{2}\nabla G_p(M(u_i))^\top u$ required at line 4 can clearly be computed in constant time from its value at the previous iteration. Since for any $z$,
$$G_p(M(z)) = \tfrac12\operatorname{Tr}[M^{2p}(z)]^{1/p} = \tfrac12\operatorname{Tr}\Big[\Big(\sum_{i=0}^{m} z_i M_i\Big)^{2p}\Big]^{1/p},$$
using the linearity of the trace operator, the $i$th coordinate of the gradient is given by
$$[\nabla G_p(M(z))]_i = \langle M^{2p-1}(z), M_i\rangle_F\, \operatorname{Tr}[M^{2p}(z)]^{\frac1p - 1}, \qquad (17)$$
for all $i \in [0, m]$. Thus, the computation of the gradient can be reduced to that of the matrices $M^{2p-1}(z)$ and $M^{2p}(z)$. When the dimension $N$ of the feature space is not too large, both can be computed via $O(\log p)$ matrix multiplications using the binary decomposition method for matrix powers [5]. Since each matrix multiplication takes $O(N^3)$, the total computational cost for determining the gradient is then in $O((\log p)N^3)$. The cubic-time matrix multiplication can be replaced by more favorable complexity terms of the form $O(N^{2+\alpha})$, with $\alpha = 0.376$. Alternatively, for large values of $N$, that is $N \gg m+n$, in view of Theorem 3 we can instead solve the kernelized version of the problem. Since it is formulated as the same SDP, the same smooth optimization technique can be applied. Instead of $M(z)$, we need to consider the matrix $M'(z) = K^{1/2} D(z) K^{1/2}$. Now, observe that
$$M'^{\,2p}(z) = \big(K^{1/2} D(z) K^{1/2}\big)^{2p} = K^{1/2}\big(D(z) K\big)^{2p-1} D(z)\, K^{1/2}.$$
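A small consequence of this identity (my own note, not from the paper) is that the trace needed by $G_p$ can be computed from powers of $D(z)K$ alone, by cyclicity of the trace, without ever forming $K^{1/2}$.

```python
# Sketch: Tr[M'(z)^{2p}] = Tr[K^{1/2} (D(z)K)^{2p-1} D(z) K^{1/2}]
#                        = Tr[(D(z)K)^{2p}]   (cyclicity of the trace).
import numpy as np

def trace_Mprime_2p(D_z, K, p):
    """D_z: the diagonal matrix D(z); K: kernel matrix; p: integer >= 1."""
    return np.trace(np.linalg.matrix_power(D_z @ K, 2 * p))
```

np.linalg.matrix_power uses repeated squaring, in line with the $O(\log p)$ matrix multiplications mentioned above.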
This Talk
Domain adaptation
• Discrepancy distance
• Theoretical guarantees
• Algorithm
• Experiments
Drifting scenario
Generalized Discrepancy
Definition: given two distributions $D$ and $D'$ over $X \times Y$, the $Y$-discrepancy is defined as follows:
$$\operatorname{disc}_Y(D, D') = \sup_{h\in H}\ \big| L_{D'}(h) - L_D(h) \big|, \quad\text{where } L_D(h) = \mathbb E_{(x,y)\sim D}\big[L(h(x), y)\big].$$
Drifting PAC Scenario
[MM & Muñoz, 2012]
Model:
[Figure: a sequence of distributions $D_1, D_2, \ldots, D_T, D_{T+1}$; the training points $(x_1, y_1), (x_2, y_2), \ldots, (x_T, y_T)$ are drawn from $D_1, \ldots, D_T$, and the test point $(x, y)$ from $D_{T+1}$.]
Drifting PAC - Guarantee
Theorem: assume that the loss $L$ is bounded by $M$. Let $D_1, \ldots, D_{T+1}$ be a sequence of distributions and let $H_L = \{(x, y) \mapsto L(h(x), y) : h \in H\}$. Then, for any $\delta > 0$, with probability at least $1 - \delta$,
$$L_{D_{T+1}}(h) \le \frac{1}{T}\sum_{t=1}^{T} L(h(x_t), y_t) + 2\mathfrak R_T(H_L) + \frac{1}{T}\sum_{t=1}^{T}\operatorname{disc}_Y(D_t, D_{T+1}) + M\sqrt{\frac{\log\frac{2}{\delta}}{2T}}.$$
Drifting Tracking Scenario
[Bartlett, COLT 1992]
Model: perpetual training and testing.
[Figure: a sequence of distributions $D_1, D_2, \ldots, D_{T-1}, D_T$ with one sample point $(x_t, y_t)$ drawn from each $D_t$.]
• tracking:
$$\sup_{(D_t)}\Big\{ \mathbb E_{(x_T, y_T)\sim D_T}\big[L(h_{T-1}(x_T), y_T)\big] : \forall t,\ \operatorname{disc}_Y(D_t, D_{t+1}) \le \Delta \Big\} < \inf_{h\in H} L_{D_T}(h) + \epsilon.$$
Drifting Tracking Scenario
Expected mistake:
$$M_T(\mathcal D) = \mathbb E_{S\sim\mathcal D}\big[\widehat M_T\big] = \mathbb E_{S\sim\mathcal D}\big[L(h_{T-1}(x_T), y_T)\big], \qquad \mathcal D = \bigotimes_{t=1}^{T} D_t.$$
• supremum over all distribution sequences $\mathcal D$:
$$\overline M_T = \sup_{\operatorname{disc}_Y(D_t, D_{t+1}) \le \Delta} M_T(\mathcal D).$$
Algorithm $A$ $(\Delta, \epsilon)$-tracks $H$ if, for $T$ sufficiently large,
$$\overline M_T < \inf_{h\in H} L_{D_T}(h) + \epsilon.$$
Drifting Tracking - Guarantee
Theorem: let $H$ be a hypothesis set with VC-dimension $d$ and $\mathfrak R_T(H_L) \le O\big(\sqrt{d/T}\big)$, and let $(D_t)$ be a sequence of distributions such that $\operatorname{disc}_Y(D_t, D_{t+1}) \le \Delta$ for all $t$. If $\Delta = O(\epsilon^3/d)$, there exists an algorithm that $(\Delta, \epsilon)$-tracks $H$.
• improvement over the previous $L_1$-distance-based result of (Long, 1999).
General On-Line to Batch
Theorem: assume that the loss $L$ is bounded by $M$ and that it is convex with respect to its first argument. Let $h_1, \ldots, h_T$ be the hypotheses returned by the on-line algorithm $A$ and let $h = \sum_{t=1}^{T} w_t h_t$. Then, with probability at least $1 - \delta$,
$$L_{D_{T+1}}(h) \le \sum_{t=1}^{T} w_t L(h_t(x_t), y_t) + \sum_{t=1}^{T} w_t \operatorname{disc}_Y(D_t, D_{T+1}) + M\|w\|_2\sqrt{2\log\tfrac{1}{\delta}},$$
and
$$L_{D_{T+1}}(h) \le \inf_{h\in H} L(h) + \frac{R_T}{T} + \sum_{t=1}^{T} w_t \operatorname{disc}_Y(D_t, D_{T+1}) + M\|w - u_0\|_1 + 2M\|w\|_2\sqrt{2\log\tfrac{2}{\delta}}.$$
Drifting Algorithm
Idea: find the best mixture weights.
• deterministic setting:
$$\begin{aligned}
\min_{w}\quad & \lambda\|w\|_2^2 + \sum_{t=1}^{T} w_t\big(\operatorname{disc}_Y(D_t, D_{T+1}) + L(h_t(x_t), y_t)\big)\\
\text{subject to:}\quad & \Big(\sum_{t=1}^{T} w_t = 1\Big) \wedge \big(\forall t\in[1, T],\ w_t \ge 0\big).
\end{aligned}$$
• estimation of the discrepancies using independent samples, with $\operatorname{disc}_Y(D_t, D_{T+1}) \le \operatorname{disc}(D_t, D_{T+1})$.
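A minimal sketch (my own; disc and inst_loss are assumed to be precomputed estimates) of this weight-selection step: for this particular quadratic, the minimizer over the simplex is the Euclidean projection of $-c/(2\lambda)$.

```python
# Sketch: solve min_w lam*||w||_2^2 + c . w over the probability simplex,
# where c[t] = disc_Y(D_t, D_{T+1}) + L(h_t(x_t), y_t). Completing the square
# shows the solution is the projection of -c/(2*lam) onto the simplex.
import numpy as np

def project_simplex(v):
    """Euclidean projection of v onto {w : w >= 0, sum(w) = 1}."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u + (1.0 - css) / (np.arange(len(v)) + 1) > 0)[0][-1]
    theta = (1.0 - css[rho]) / (rho + 1)
    return np.maximum(v + theta, 0.0)

def mixture_weights(disc, inst_loss, lam):
    c = np.asarray(disc) + np.asarray(inst_loss)
    return project_simplex(-c / (2.0 * lam))
```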
Experiments
• base on-line algorithm: Widrow-Hoff.
• uniform weights, uniform over last 100, and weighted algorithms.
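For reference, a sketch (my own assumptions, not the talk's code) of the Widrow-Hoff (LMS) base learner and of combining its iterates with mixture weights such as those from the previous sketch.

```python
# Sketch: Widrow-Hoff / LMS produces one linear hypothesis w_t per round;
# the drifting algorithm then predicts with a weighted combination of them.
import numpy as np

def widrow_hoff(X, y, eta=0.01):
    """Return the iterates w_1, ..., w_T (one row per round)."""
    w = np.zeros(X.shape[1])
    iterates = []
    for x_t, y_t in zip(X, y):
        w = w - eta * (w @ x_t - y_t) * x_t  # gradient step on the squared error
        iterates.append(w.copy())
    return np.array(iterates)

def combined_predictor(iterates, mix):
    """h = sum_t mix[t] * h_t for linear hypotheses h_t(x) = w_t . x."""
    return (np.asarray(mix)[:, None] * iterates).sum(axis=0)
```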
[Figure: squared error as a function of sample size for the Weighted, Regular, and Fixed algorithms.]
Conclusion
Recent and upcoming:
• theoretical properties of discrepancy minimization.
• explicit learning guarantees in terms of size of unlabeled data.
• adaptation with a small amount of labeled data: analysis, theory, and algorithms.