Near-Optimal Algorithms for Online Matrix Prediction
Elad Hazan (Technion)
Satyen Kale (Yahoo! Labs)
Shai Shalev-Shwartz (Hebrew University)
Three Prediction Problems: I. Online Collaborative Filtering
Users: {1, 2, …, m}
Movies: {1, 2, …, n}

On round t:
- User i_t arrives and is interested in movie j_t
- Output predicted rating p_t in [-1, 1]
- User responds with actual rating r_t in [-1, 1]
- Loss = (p_t − r_t)²

Comparison class: all m × n matrices with entries in [-1, 1] of trace norm ≤ τ.
For each such matrix W, predicted rating = W(i_t, j_t).
Regret = loss of alg – loss of best bounded trace-norm matrix
(Trace norm = sum of singular values.)
If no entry is queried twice, [Cesa-Bianchi, Shamir '11] obtain O(n^{3/2}) regret for τ = O(n).
Three Prediction Problems: II. Online Max Cut
2 political parties
Voters: {1, 2, …, n}

On round t:
- Voters i_t, j_t arrive
- Output prediction: votes agree or disagree
- Loss = 1 if prediction incorrect, 0 otherwise

Comparison class: all possible bipartitions.
Bipartition prediction = "agree" if i_t, j_t are in the same partition, "disagree" otherwise.
Regret = loss of alg – loss of best bipartition
Weight of an edge = #(disagree) − #(agree)
Best bipartition = Max Cut!
Inefficient alg using the 2^n bipartitions as experts: regret = O(√(nT))
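The inefficient baseline is the standard exponential-weights (Hedge) algorithm: with N experts it guarantees regret O(√(T log N)), and log 2^n = n·log 2 is what produces the √(nT) dependence. A minimal pure-Python sketch of Hedge (function and variable names are illustrative, not from the paper):

```python
import math

def hedge(n_experts, loss_rounds, eta):
    """Exponential weights over a finite expert class.

    With N experts and learning rate eta = sqrt(8 ln N / T), the
    cumulative (expected) loss exceeds the best expert's loss by at
    most sqrt((T/2) ln N); for the 2^n bipartitions, ln N = n ln 2.
    """
    w = [1.0] * n_experts
    alg_loss = 0.0
    expert_loss = [0.0] * n_experts
    for losses in loss_rounds:  # losses[i] in [0, 1] for expert i
        total = sum(w)
        alg_loss += sum(wi * li for wi, li in zip(w, losses)) / total
        for i, li in enumerate(losses):
            expert_loss[i] += li
            w[i] *= math.exp(-eta * li)
    return alg_loss, min(expert_loss)
```

The inefficiency is in maintaining 2^n weights, not in the regret: the guarantee itself is information-theoretically near-optimal.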
Three Prediction Problems: III. Online Gambling [Abernethy '10; Kleinberg, Niculescu-Mizil, Sharma '10]
Teams: {1, 2, …, n}
In round t:
- Teams i_t, j_t compete
- Output: prediction of which team will win
- Loss = 1 if prediction incorrect, 0 otherwise

Comparison class: all possible permutations π.
Permutation π predicts i_t if π(i_t) ≤ π(j_t); j_t otherwise.
Regret = loss of alg – loss of best permutation
Weight of edge (i, j) = #(i wins) − #(j wins)
Best permutation = Min Feedback Arc Set!
Inefficient alg using the n! permutations as experts: regret = O(√(Tn log n))
The trivial bound was considered hard to improve efficiently (e.g. [Kanade, Steinke '12]).
Results
Regret upper and lower bounds for each problem (bound expressions omitted here):
- Online Collaborative Filtering: stochastic setting; solves the $50 open problem of [Srebro, Shamir '11]
- Online Max Cut
- Online Gambling: lower bound by [Kleinberg, Niculescu-Mizil, Sharma '10]
One meta-problem to rule them all:
Online Matrix Prediction (OMP)
In round t:
- Receive pair (i_t, j_t) in [m] × [n]
- Output prediction p_t in [-1, 1]
- Receive true value y_t in [-1, 1]
- Suffer loss L(p_t, y_t)

Comparison class: set W of m × n matrices with entries in [-1, 1].
Prediction for matrix W: entry W(i_t, j_t).
Regret = loss of alg – loss of best comparison matrix
[Figure: an m × n matrix, rows indexed 1, …, m and columns indexed 1, …, n]
Online Collaborative Filtering as OMP
Users: {1, 2, …, m}
Movies: {1, 2, …, n}

On round t:
- User i_t arrives and is interested in movie j_t
- Output predicted rating p_t in [-1, 1]
- User responds with actual rating r_t
- Loss = (p_t − r_t)²

Comparison class: W = all m × n matrices with entries in [-1, 1] of trace norm ≤ τ.
For each such matrix W, predicted rating = W(i_t, j_t).
Online Max Cut as OMP
2 political parties
Voters: {1, 2, …, n}

On round t:
- Voters i_t, j_t arrive
- Output prediction: votes agree or disagree
- Loss = 1 if prediction incorrect, 0 otherwise

Comparison class: all possible bipartitions.
Bipartition prediction = "agree" if i_t, j_t are in the same partition, "disagree" otherwise.

W = all 2^n "cut matrices" W_S corresponding to subsets S of [n]:
W_S(i, j) = 0 if i, j are both in S or both in [n] \ S; = 1 otherwise.
Example for n = 5, S = {1, 2} (rows/columns ordered S first, then [n] \ S):

              S       [n] \ S
  S       [ 0 0  |  1 1 1 ]
          [ 0 0  |  1 1 1 ]
  [n] \ S [ 1 1  |  0 0 0 ]
          [ 1 1  |  0 0 0 ]
          [ 1 1  |  0 0 0 ]
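The block pattern above can be generated from a ±1 sign vector: writing s_i = +1 for i in S and −1 otherwise, W_S = (J − s sᵀ)/2, where J is the all-ones matrix. A small numpy sketch (the function name is illustrative):

```python
import numpy as np

def cut_matrix(n, S):
    """Cut matrix W_S: W_S(i, j) = 1 iff exactly one of i, j lies in S.

    With the sign vector s (s_i = +1 if i in S, else -1), the entry
    (1 - s_i * s_j) / 2 is 0 when i and j are on the same side and 1
    otherwise, so W_S = (J - s s^T) / 2.
    """
    s = np.array([1.0 if i in S else -1.0 for i in range(n)])
    return (np.ones((n, n)) - np.outer(s, s)) / 2.0
```

For n = 5 and S = {0, 1} (0-indexed) this reproduces the 5 × 5 block pattern shown above, and the rank-1 form ssᵀ is what makes the spectral decomposition of cut matrices easy.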
Online Gambling as OMP
Teams: {1, 2, …, n}

In round t:
- Teams i_t, j_t compete
- Output: prediction of which team will win
- Loss = 1 if prediction incorrect, 0 otherwise
In the order π(1), …, π(n), W_π is the all-1's upper-triangular matrix:

        π(1) π(2)  …    …   π(n)
π(1)  [  1    1    1    1    1 ]
π(2)  [  0    1    1    1    1 ]
  :   [  0    0    1    1    1 ]
  :   [  0    0    0    1    1 ]
π(n)  [  0    0    0    0    1 ]

The same matrix in the original team order 1, …, n:

        1    2    …    …    n
  1   [ 1    0    1    1    0 ]
  2   [ 1    1    1    1    1 ]
  :   [ 0    0    1    1    0 ]
  :   [ 0    0    0    1    0 ]
  n   [ 1    0    1    1    1 ]
Comparison class: all possible permutations π.
Permutation π predicts i_t if π(i_t) ≤ π(j_t); j_t otherwise.

W = all n! "permutation matrices" W_π corresponding to permutations π:
W_π(i, j) = 1 if π(i) ≤ π(j); = 0 otherwise.
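The definition of W_π translates into one numpy comparison over the rank vector (0-indexed here; the function name is illustrative):

```python
import numpy as np

def gambling_matrix(pi):
    """Comparison matrix W_pi for a permutation given as a rank list
    (pi[i] = rank of team i): W_pi(i, j) = 1 if pi(i) <= pi(j), else 0,
    i.e. the matrix predicts that team i beats every lower-ranked team j."""
    r = np.asarray(pi)
    return (r[:, None] <= r[None, :]).astype(float)
```

For the identity permutation this is exactly the all-1's upper-triangular matrix shown above; any other permutation gives the same pattern with rows and columns relabeled.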
Decomposability
W is (β, τ)-decomposable if

[ 0    W ]
[ Wᵀ   0 ]  =  P − N        (a symmetric square matrix of order m + n)

where
- P, N are positive semidefinite
- Diagonal entries P_ii, N_ii ≤ β
- Sum of traces Tr(P) + Tr(N) ≤ τ

The class W is (β, τ)-decomposable if every W in W is.
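The definition translates directly into a numerical check: given a candidate pair (P, N), verify each condition in turn. A sketch (tolerances and the function name are ours):

```python
import numpy as np

def is_decomposition(W, P, N, beta, tau, tol=1e-8):
    """Check that (P, N) witnesses (beta, tau)-decomposability of W:
    [[0, W], [W^T, 0]] = P - N, with P and N psd, every diagonal entry
    of P and N at most beta, and Tr(P) + Tr(N) at most tau."""
    m, n = W.shape
    H = np.zeros((m + n, m + n))
    H[:m, m:] = W
    H[m:, :m] = W.T
    return (np.allclose(H, P - N, atol=tol)
            and np.linalg.eigvalsh(P)[0] >= -tol   # smallest eigenvalue
            and np.linalg.eigvalsh(N)[0] >= -tol
            and max(P.diagonal().max(), N.diagonal().max()) <= beta + tol
            and np.trace(P) + np.trace(N) <= tau + tol)
```

For example, for the 1 × 1 matrix W = [1], H = [[0, 1], [1, 0]] has eigenvalues ±1, giving P = ½[[1, 1], [1, 1]] and N = ½[[1, −1], [−1, 1]]: a (½, 2)-decomposition.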
Main Result for (β,τ)-decomposable OMP
An efficient algorithm for OMP with (β, τ)-decomposable W and Lipschitz losses, with regret bound Õ(√(βτT)).
The Technology
Matrix Exponentiated Gradient [Tsuda, Rätsch, Warmuth '06] / Matrix Multiplicative Weights [Arora, Kale '07] algorithm.

Online learning problem: in round t,
- Learner chooses a density (i.e. psd, trace-1) matrix X_t
- Nature reveals a loss matrix M_t with eigenvalues in [-1, 1]
- Learner suffers loss Tr(M_t X_t)

Goal: minimize regret = loss of learner − loss of best fixed density matrix.

Theorem: regret = O(√(T log d)) for d × d matrices.
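The MMW learner keeps X_t proportional to the matrix exponential of the negated cumulative loss; since the loss matrices are symmetric, the exponential can be computed via eigendecomposition. A minimal sketch of one update (learning-rate tuning omitted; the function name is illustrative):

```python
import numpy as np

def mmw_iterate(loss_matrices, eta):
    """Matrix Multiplicative Weights: density matrix for the next round,
    X proportional to exp(-eta * sum_s M_s), for symmetric losses M_s."""
    S = np.sum(loss_matrices, axis=0)
    lam, V = np.linalg.eigh(S)            # S = V diag(lam) V^T
    w = np.exp(-eta * (lam - lam.min()))  # shift exponents for stability
    X = (V * w) @ V.T                     # V diag(w) V^T
    return X / np.trace(X)
```

With no losses this returns the maximally mixed state I/d; as eta grows, X concentrates on the bottom eigenspace of the cumulative loss, mirroring how Hedge concentrates on the best expert.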
Overview of Algorithm for OMP
K = all square symmetric X of order 2(m + n) s.t.
- X is positive semidefinite
- Diagonals X_ii ≤ β
- Trace Tr(X) ≤ τ

Each W in W maps into K: decompose

[ 0    W ]
[ Wᵀ   0 ]  =  P − N

and embed it as the block-diagonal matrix

[ P    0 ]
[ 0    N ]

Algorithm: Matrix MW + Bregman projections onto K.
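The constraints defining K are straightforward to test numerically; a sketch of the feasibility predicate (tolerances and the name are ours):

```python
import numpy as np

def in_K(X, beta, tau, tol=1e-8):
    """Membership in K: symmetric psd X with every diagonal entry at
    most beta and total trace at most tau.  (In the algorithm X has
    order 2(m+n), holding P and N as its diagonal blocks.)"""
    return (np.allclose(X, X.T, atol=tol)
            and np.linalg.eigvalsh(X)[0] >= -tol   # smallest eigenvalue
            and X.diagonal().max() <= beta + tol
            and np.trace(X) <= tau + tol)
```

The algorithm itself needs Bregman (relative-entropy) projections onto K after each MMW step; this predicate is only the feasibility test those projections enforce.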
Decomposability Theorems
- Online Collaborative Filtering: trace-norm ≤ τ matrices are (√(m + n), 2τ)-decomposable.
- Online Max Cut: cut matrices W_S are (½, 2n)-decomposable.
- Online Gambling: permutation matrices W_π are (O(log n), O(n log n))-decomposable.
Decomposability for OCF
Thm: Any symmetric matrix M of order n with entries in [-1, 1] and trace norm τ is (√n, τ)-decomposable
Proof:
- Eigenvalue decomposition: M = Σ_k λ_k v_k v_kᵀ.
- Define P = Σ_{k: λ_k > 0} λ_k v_k v_kᵀ and N = −Σ_{k: λ_k < 0} λ_k v_k v_kᵀ.
- Clearly Tr(P) + Tr(N) = trace-norm(M) = τ.
- The diagonals of (P + N)² = M² are bounded by n (each is a sum of n squared entries of M), so the diagonals of P + N are bounded by √n, hence the diagonals of P and N are bounded by √n.
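The proof is constructive: split the eigendecomposition of M by the sign of the eigenvalues. A numpy sketch of exactly this construction (the name is ours):

```python
import numpy as np

def split_psd(M):
    """Decompose a symmetric M as P - N with P, N psd, by collecting the
    positive and negative parts of M = sum_k lam_k v_k v_k^T."""
    lam, V = np.linalg.eigh(M)
    P = (V * np.clip(lam, 0.0, None)) @ V.T   # positive-eigenvalue part
    N = (V * np.clip(-lam, 0.0, None)) @ V.T  # negative-eigenvalue part
    return P, N
```

On a random symmetric M with entries in [-1, 1], Tr(P) + Tr(N) equals the trace norm of M and the diagonals of P and N stay below √n, matching the theorem.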
Decomposability for Online Gambling
Thm: The all 1’s upper triangular matrix of order n is (O(log n), O(n log n))-decomposable.
Recursion: T(n) = one rank-1 matrix + two non-overlapping copies of T(n/2).
Diagonal bound: B(n) = 1 + B(n/2), so B(n) = O(log n).
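The recursion is easy to render in code: the all-1's upper-triangular matrix is one rank-1 all-ones block (the upper-right quadrant) plus two half-size copies on the diagonal, so the nesting depth, which drives the diagonal bound B(n), grows like log n. A small sketch (names illustrative):

```python
import numpy as np

def tri_ones(n):
    """Build T(n), the all-1's upper-triangular matrix, via the recursion
    T(n) = one rank-1 block + two non-overlapping copies of T(n/2)."""
    if n == 1:
        return np.ones((1, 1))
    a = n // 2
    u = np.concatenate([np.ones(a), np.zeros(n - a)])
    v = np.concatenate([np.zeros(a), np.ones(n - a)])
    T = np.outer(u, v)            # rank-1: the upper-right all-ones block
    T[:a, :a] = tri_ones(a)       # first half, recursively
    T[a:, a:] = tri_ones(n - a)   # second half, recursively
    return T

def depth(n):
    """B(n) = 1 + B(n/2): number of recursion levels, which is O(log n)."""
    return 0 if n == 1 else 1 + depth((n + 1) // 2)
```

Each level contributes one rank-1 term per block, which is how the O(log n) diagonal bound and the O(n log n) trace bound arise.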
Concluding Remarks
- Gave near-optimal algorithms for various online matrix prediction problems
- Exploited spectral structure of comparison matrices to get near-tight convex relaxations
- Solved 2 COLT open problems, from [Abernethy '10] and [Shamir, Srebro '11]
- Open problem: close the logarithmic gap between upper and lower bounds
- The decompositions in the paper are optimal up to constant factors, so a fundamentally different algorithm seems necessary
Thanks!