Near-Optimal Algorithms for Online Matrix Prediction
Elad Hazan (Technion)
Satyen Kale (Yahoo! Labs)
Shai Shalev-Shwartz (Hebrew University)
Three Prediction Problems: I. Online Collaborative Filtering
Users: {1, 2, …, m}
Movies: {1, 2, …, n}

On round t:
- User i_t arrives and is interested in movie j_t
- Output predicted rating p_t in [-1, 1]
- User responds with actual rating r_t in [-1, 1]
- Loss = (p_t − r_t)²

Comparison class: all m × n matrices with entries in [-1, 1] of trace norm ≤ τ.
For each such matrix W, predicted rating = W(i_t, j_t).
Regret = loss of alg – loss of best bounded trace-norm matrix
(Trace norm = sum of singular values.)
If no entry is queried twice, [Cesa-Bianchi, Shamir '11] obtain O(n^{3/2}) regret for τ = O(n).
Three Prediction Problems: II. Online Max Cut
2 political parties
Voters: {1, 2, …, n}

On round t:
- Voters i_t, j_t arrive
- Output prediction: votes agree or disagree
- Loss = 1 if prediction incorrect, 0 otherwise

Comparison class: all possible bipartitions.
Bipartition prediction = "agree" if i_t, j_t are in the same partition, "disagree" otherwise.
Regret = loss of alg – loss of best bipartition
Weight of an edge = #(disagree) − #(agree)
Best bipartition = Max Cut!
Inefficient alg using the 2^n bipartitions as experts: regret = O(√(nT))
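The inefficient baseline is the standard exponential-weights (Hedge) algorithm: with N experts it guarantees regret O(√(T log N)), and log 2^n = n·log 2 is what produces the √(nT) dependence. A minimal pure-Python sketch of Hedge (function and variable names are illustrative, not from the paper):

```python
import math

def hedge(n_experts, loss_rounds, eta):
    """Exponential weights over a finite expert class.

    With N experts and learning rate eta = sqrt(8 ln N / T), the
    cumulative (expected) loss exceeds the best expert's loss by at
    most sqrt((T/2) ln N); for the 2^n bipartitions, ln N = n ln 2.
    """
    w = [1.0] * n_experts
    alg_loss = 0.0
    expert_loss = [0.0] * n_experts
    for losses in loss_rounds:  # losses[i] in [0, 1] for expert i
        total = sum(w)
        alg_loss += sum(wi * li for wi, li in zip(w, losses)) / total
        for i, li in enumerate(losses):
            expert_loss[i] += li
            w[i] *= math.exp(-eta * li)
    return alg_loss, min(expert_loss)
```

The inefficiency is in maintaining 2^n weights, not in the regret: the guarantee itself is information-theoretically near-optimal.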
Three Prediction Problems: III. Online Gambling [Abernethy '10; Kleinberg, Niculescu-Mizil, Sharma '10]
Teams: {1, 2, …, n}
In round t:
- Teams i_t, j_t compete
- Output: prediction of which team will win
- Loss = 1 if prediction incorrect, 0 otherwise

Comparison class: all possible permutations π.
Permutation π predicts i_t if π(i_t) ≤ π(j_t); j_t otherwise.
Regret = loss of alg – loss of best permutation
Weight of edge (i, j) = #(i wins) − #(j wins)
Best permutation = Min Feedback Arc Set!
Inefficient alg using the n! permutations as experts: regret = O(√(Tn log n))
The trivial bound was considered hard to improve efficiently (e.g. [Kanade, Steinke '12]).
Results
Regret upper and lower bounds for each problem (bound expressions omitted here):
- Online Collaborative Filtering: stochastic setting; solves the $50 open problem of [Srebro, Shamir '11]
- Online Max Cut
- Online Gambling: lower bound by [Kleinberg, Niculescu-Mizil, Sharma '10]
One meta-problem to rule them all:
Online Matrix Prediction (OMP)
In round t:
- Receive pair (i_t, j_t) in [m] × [n]
- Output prediction p_t in [-1, 1]
- Receive true value y_t in [-1, 1]
- Suffer loss L(p_t, y_t)

Comparison class: set W of m × n matrices with entries in [-1, 1].
Prediction for matrix W: entry W(i_t, j_t).
Regret = loss of alg – loss of best comparison matrix
[Figure: an m × n matrix, rows indexed 1, …, m and columns indexed 1, …, n]
Online Collaborative Filtering as OMP
Users: {1, 2, …, m}
Movies: {1, 2, …, n}

On round t:
- User i_t arrives and is interested in movie j_t
- Output predicted rating p_t in [-1, 1]
- User responds with actual rating r_t
- Loss = (p_t − r_t)²

Comparison class: W = all m × n matrices with entries in [-1, 1] of trace norm ≤ τ.
For each such matrix W, predicted rating = W(i_t, j_t).
Online Max Cut as OMP
2 political parties
Voters: {1, 2, …, n}

On round t:
- Voters i_t, j_t arrive
- Output prediction: votes agree or disagree
- Loss = 1 if prediction incorrect, 0 otherwise

Comparison class: all possible bipartitions.
Bipartition prediction = "agree" if i_t, j_t are in the same partition, "disagree" otherwise.

W = all 2^n "cut matrices" W_S corresponding to subsets S of [n]:
W_S(i, j) = 0 if i, j are both in S or both in [n] \ S; = 1 otherwise.
Example for n = 5, S = {1, 2} (rows/columns ordered S first, then [n] \ S):

              S       [n] \ S
  S       [ 0 0  |  1 1 1 ]
          [ 0 0  |  1 1 1 ]
  [n] \ S [ 1 1  |  0 0 0 ]
          [ 1 1  |  0 0 0 ]
          [ 1 1  |  0 0 0 ]
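The block pattern above can be generated from a ±1 sign vector: writing s_i = +1 for i in S and −1 otherwise, W_S = (J − s sᵀ)/2, where J is the all-ones matrix. A small numpy sketch (the function name is illustrative):

```python
import numpy as np

def cut_matrix(n, S):
    """Cut matrix W_S: W_S(i, j) = 1 iff exactly one of i, j lies in S.

    With the sign vector s (s_i = +1 if i in S, else -1), the entry
    (1 - s_i * s_j) / 2 is 0 when i and j are on the same side and 1
    otherwise, so W_S = (J - s s^T) / 2.
    """
    s = np.array([1.0 if i in S else -1.0 for i in range(n)])
    return (np.ones((n, n)) - np.outer(s, s)) / 2.0
```

For n = 5 and S = {0, 1} (0-indexed) this reproduces the 5 × 5 block pattern shown above, and the rank-1 form ssᵀ is what makes the spectral decomposition of cut matrices easy.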
Online Gambling as OMP
Teams: {1, 2, …, n}

In round t:
- Teams i_t, j_t compete
- Output: prediction of which team will win
- Loss = 1 if prediction incorrect, 0 otherwise
In the order π(1), …, π(n), W_π is the all-1's upper-triangular matrix:

        π(1) π(2)  …    …   π(n)
π(1)  [  1    1    1    1    1 ]
π(2)  [  0    1    1    1    1 ]
  :   [  0    0    1    1    1 ]
  :   [  0    0    0    1    1 ]
π(n)  [  0    0    0    0    1 ]

The same matrix in the original team order 1, …, n:

        1    2    …    …    n
  1   [ 1    0    1    1    0 ]
  2   [ 1    1    1    1    1 ]
  :   [ 0    0    1    1    0 ]
  :   [ 0    0    0    1    0 ]
  n   [ 1    0    1    1    1 ]
Comparison class: all possible permutations π.
Permutation π predicts i_t if π(i_t) ≤ π(j_t); j_t otherwise.

W = all n! "permutation matrices" W_π corresponding to permutations π:
W_π(i, j) = 1 if π(i) ≤ π(j); = 0 otherwise.
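The definition of W_π translates into one numpy comparison over the rank vector (0-indexed here; the function name is illustrative):

```python
import numpy as np

def gambling_matrix(pi):
    """Comparison matrix W_pi for a permutation given as a rank list
    (pi[i] = rank of team i): W_pi(i, j) = 1 if pi(i) <= pi(j), else 0,
    i.e. the matrix predicts that team i beats every lower-ranked team j."""
    r = np.asarray(pi)
    return (r[:, None] <= r[None, :]).astype(float)
```

For the identity permutation this is exactly the all-1's upper-triangular matrix shown above; any other permutation gives the same pattern with rows and columns relabeled.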
Decomposability
W is (β, τ)-decomposable if

[ 0    W ]
[ Wᵀ   0 ]  =  P − N        (a symmetric square matrix of order m + n)

where
- P, N are positive semidefinite
- Diagonal entries P_ii, N_ii ≤ β
- Sum of traces Tr(P) + Tr(N) ≤ τ

The class W is (β, τ)-decomposable if every W in W is.
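The definition translates directly into a numerical check: given a candidate pair (P, N), verify each condition in turn. A sketch (tolerances and the function name are ours):

```python
import numpy as np

def is_decomposition(W, P, N, beta, tau, tol=1e-8):
    """Check that (P, N) witnesses (beta, tau)-decomposability of W:
    [[0, W], [W^T, 0]] = P - N, with P and N psd, every diagonal entry
    of P and N at most beta, and Tr(P) + Tr(N) at most tau."""
    m, n = W.shape
    H = np.zeros((m + n, m + n))
    H[:m, m:] = W
    H[m:, :m] = W.T
    return (np.allclose(H, P - N, atol=tol)
            and np.linalg.eigvalsh(P)[0] >= -tol   # smallest eigenvalue
            and np.linalg.eigvalsh(N)[0] >= -tol
            and max(P.diagonal().max(), N.diagonal().max()) <= beta + tol
            and np.trace(P) + np.trace(N) <= tau + tol)
```

For example, for the 1 × 1 matrix W = [1], H = [[0, 1], [1, 0]] has eigenvalues ±1, giving P = ½[[1, 1], [1, 1]] and N = ½[[1, −1], [−1, 1]]: a (½, 2)-decomposition.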
Main Result for (β,τ)-decomposable OMP
An efficient algorithm for OMP with (β, τ)-decomposable W and Lipschitz losses, with regret bound Õ(√(βτT)).
The Technology
Matrix Exponentiated Gradient [Tsuda, Rätsch, Warmuth '06] / Matrix Multiplicative Weights [Arora, Kale '07] algorithm.

Online learning problem: in round t,
- Learner chooses a density (i.e. psd, trace-1) matrix X_t
- Nature reveals a loss matrix M_t with eigenvalues in [-1, 1]
- Learner suffers loss Tr(M_t X_t)

Goal: minimize regret = loss of learner − loss of best fixed density matrix.

Theorem: regret = O(√(T log d)) for d × d matrices.
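The MMW learner keeps X_t proportional to the matrix exponential of the negated cumulative loss; since the loss matrices are symmetric, the exponential can be computed via eigendecomposition. A minimal sketch of one update (learning-rate tuning omitted; the function name is illustrative):

```python
import numpy as np

def mmw_iterate(loss_matrices, eta):
    """Matrix Multiplicative Weights: density matrix for the next round,
    X proportional to exp(-eta * sum_s M_s), for symmetric losses M_s."""
    S = np.sum(loss_matrices, axis=0)
    lam, V = np.linalg.eigh(S)            # S = V diag(lam) V^T
    w = np.exp(-eta * (lam - lam.min()))  # shift exponents for stability
    X = (V * w) @ V.T                     # V diag(w) V^T
    return X / np.trace(X)
```

With no losses this returns the maximally mixed state I/d; as eta grows, X concentrates on the bottom eigenspace of the cumulative loss, mirroring how Hedge concentrates on the best expert.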
Overview of Algorithm for OMP
K = all square symmetric X of order 2(m + n) s.t.
- X is positive semidefinite
- Diagonals X_ii ≤ β
- Trace Tr(X) ≤ τ

Each W in W maps into K: decompose

[ 0    W ]
[ Wᵀ   0 ]  =  P − N

and embed it as the block-diagonal matrix

[ P    0 ]
[ 0    N ]

Algorithm: Matrix MW + Bregman projections onto K.
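The constraints defining K are straightforward to test numerically; a sketch of the feasibility predicate (tolerances and the name are ours):

```python
import numpy as np

def in_K(X, beta, tau, tol=1e-8):
    """Membership in K: symmetric psd X with every diagonal entry at
    most beta and total trace at most tau.  (In the algorithm X has
    order 2(m+n), holding P and N as its diagonal blocks.)"""
    return (np.allclose(X, X.T, atol=tol)
            and np.linalg.eigvalsh(X)[0] >= -tol   # smallest eigenvalue
            and X.diagonal().max() <= beta + tol
            and np.trace(X) <= tau + tol)
```

The algorithm itself needs Bregman (relative-entropy) projections onto K after each MMW step; this predicate is only the feasibility test those projections enforce.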
Decomposability Theorems
- Online Collaborative Filtering: trace-norm ≤ τ matrices are (√(m + n), 2τ)-decomposable.
- Online Max Cut: cut matrices W_S are (½, 2n)-decomposable.
- Online Gambling: permutation matrices W_π are (O(log n), O(n log n))-decomposable.
Decomposability for OCF
Thm: Any symmetric matrix M of order n with entries in [-1, 1] and trace norm τ is (√n, τ)-decomposable
Proof:
- Eigenvalue decomposition: M = Σ_k λ_k v_k v_kᵀ.
- Define P = Σ_{k: λ_k > 0} λ_k v_k v_kᵀ and N = −Σ_{k: λ_k < 0} λ_k v_k v_kᵀ.
- Clearly Tr(P) + Tr(N) = trace-norm(M) = τ.
- The diagonals of (P + N)² = M² are bounded by n (each is a sum of n squared entries of M), so the diagonals of P + N are bounded by √n, hence the diagonals of P and N are bounded by √n.
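The proof is constructive: split the eigendecomposition of M by the sign of the eigenvalues. A numpy sketch of exactly this construction (the name is ours):

```python
import numpy as np

def split_psd(M):
    """Decompose a symmetric M as P - N with P, N psd, by collecting the
    positive and negative parts of M = sum_k lam_k v_k v_k^T."""
    lam, V = np.linalg.eigh(M)
    P = (V * np.clip(lam, 0.0, None)) @ V.T   # positive-eigenvalue part
    N = (V * np.clip(-lam, 0.0, None)) @ V.T  # negative-eigenvalue part
    return P, N
```

On a random symmetric M with entries in [-1, 1], Tr(P) + Tr(N) equals the trace norm of M and the diagonals of P and N stay below √n, matching the theorem.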
Decomposability for Online Gambling
Thm: The all 1’s upper triangular matrix of order n is (O(log n), O(n log n))-decomposable.
Recursion: T(n) = one rank-1 matrix + two non-overlapping copies of T(n/2).
Diagonal bound: B(n) = 1 + B(n/2), so B(n) = O(log n).
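The recursion is easy to render in code: the all-1's upper-triangular matrix is one rank-1 all-ones block (the upper-right quadrant) plus two half-size copies on the diagonal, so the nesting depth, which drives the diagonal bound B(n), grows like log n. A small sketch (names illustrative):

```python
import numpy as np

def tri_ones(n):
    """Build T(n), the all-1's upper-triangular matrix, via the recursion
    T(n) = one rank-1 block + two non-overlapping copies of T(n/2)."""
    if n == 1:
        return np.ones((1, 1))
    a = n // 2
    u = np.concatenate([np.ones(a), np.zeros(n - a)])
    v = np.concatenate([np.zeros(a), np.ones(n - a)])
    T = np.outer(u, v)            # rank-1: the upper-right all-ones block
    T[:a, :a] = tri_ones(a)       # first half, recursively
    T[a:, a:] = tri_ones(n - a)   # second half, recursively
    return T

def depth(n):
    """B(n) = 1 + B(n/2): number of recursion levels, which is O(log n)."""
    return 0 if n == 1 else 1 + depth((n + 1) // 2)
```

Each level contributes one rank-1 term per block, which is how the O(log n) diagonal bound and the O(n log n) trace bound arise.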
Concluding Remarks
- Gave near-optimal algorithms for various online matrix prediction problems
- Exploited spectral structure of comparison matrices to get near-tight convex relaxations
- Solved 2 COLT open problems, from [Abernethy '10] and [Shamir, Srebro '11]
- Open problem: close the logarithmic gap between upper and lower bounds
- The decompositions in the paper are optimal up to constant factors, so a fundamentally different algorithm seems necessary
Thanks!