TRANSCRIPT
Exponential Computational Improvement by Reduction
John Langford @ Microsoft Research
Computational Challenges Workshop, May 2
The Empirical Age
Speech Recognition, ImageNet, Deep Learning, Neural Machine Translation
What is a theorist to do?
The Contextual Bandit Setting
For t = 1, . . . ,T :
1 The world produces some context x ∈ X
2 The learner chooses an action a ∈ A
3 The world reacts with reward r_a ∈ [0, 1]
Goal: Learn a good policy for choosing actions given context.
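A minimal sketch of this protocol as code; the env and learner objects and their context/reward/act/observe methods are illustrative assumptions of the sketch, not from the talk:

```python
def contextual_bandit(env, learner, T):
    """The contextual bandit protocol: only the reward of the
    chosen action is ever revealed to the learner."""
    for t in range(T):
        x = env.context()          # 1. world produces context x in X
        a = learner.act(x)         # 2. learner chooses action a in A
        r = env.reward(a)          # 3. world reacts with reward r_a in [0,1]
        learner.observe(x, a, r)   # partial feedback: r for the chosen a only
```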
Reduction Results
Algo   ε-greedy   Bagg    LinUCB    Online C.   Super.
Loss   0.095      0.059   0.128     0.053       0.051
Time   22         339     212×10³   17          6.9

Progressive validation loss (each example is predicted before being trained on) on RCV1.
The Offline Contextual Bandit Setting
Given exploration data (x, a, r, p)*
Learn a good policy for choosing actions given context.
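A standard way to use such logs is the inverse propensity score (IPS) estimate of a candidate policy's value. Below is a minimal sketch, reading p as the probability with which the logging policy chose action a (standard usage, not stated explicitly above) and taking policy to be any deterministic candidate:

```python
def ips_value(logs, policy):
    """Unbiased offline estimate of a policy's expected reward
    from exploration tuples (x, a, r, p)."""
    total = 0.0
    for x, a, r, p in logs:
        if policy(x) == a:      # importance weight is 1[policy(x) = a] / p
            total += r / p
    return total / len(logs)
```

Searching for a policy that maximizes such an estimate is one route to learning from logged exploration data.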
Offline Reduction Results
[Figure: classification error of the Offset Tree on UCI datasets (ecoli, glass, letter, optdigits, page-blocks, pendigits, satimage, vehicle, yeast).]
Agnostic Active Learning
For t = 1, . . . ,T :
1 The world produces some context x ∈ X
2 The learner predicts a label y ∈ Y
3 The learner chooses to request a label or not. If the label is requested:
1 observe y
2 update the learning algorithm
Goal: Compete with supervised learning using all labels while requesting as few as possible.
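A minimal sketch of this loop, with a simple margin-based query rule standing in for the agnostic strategy; the model interface and the threshold are illustrative assumptions:

```python
def active_learning(stream, model, threshold=0.1):
    """Request labels only where the current model is uncertain,
    aiming to match supervised accuracy with far fewer labels."""
    labels_used = 0
    for x, hidden_y in stream:          # label stays hidden unless requested
        p = model.predict_proba(x)      # confidence that the label is 1
        if abs(p - 0.5) < threshold:    # uncertain region: request the label
            labels_used += 1
            model.update(x, hidden_y)   # observe y, update the learner
    return labels_used
```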
AAL Reduction Results
[Figure: test error (0 to 0.05) vs. number of labels queried (0 to 5000) for Passive and Active learners.]
Logarithmic Time Prediction
Repeatedly:
1 See x
2 Predict y ∈ {1, . . . , K}
3 See y
Goal: Find h(x) minimizing error rate
Pr_{(x,y)∼D}(h(x) ≠ y)
with h(x) computed in time O(log K).
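The O(log K) bound comes from routing each prediction down a balanced binary tree of classifiers, one binary decision per level. A minimal sketch of that routing follows; the node fields are assumptions of the sketch, and the talk's LOMTree and Recall Tree additionally learn the tree structure online:

```python
def tree_predict(x, node):
    """Predict one of K classes with O(log K) classifier evaluations
    by descending a balanced binary tree of binary classifiers."""
    while node.left is not None:             # internal node: route the example
        if node.classifier.predict(x) <= 0:
            node = node.left
        else:
            node = node.right
    return node.label                        # each leaf holds one class
```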
Log-time prediction results
[Figure: statistical performance, shown as delta test error (%) from OAA on ALOI, ImageNet, and ODP, for LOMTree and Recall Tree.]
Summary
Problem           Learning Reductions   OCO      PAC
CB Explore        Yes                   Sorta?   No
CB Learn          Yes                   Sorta?   No
Agnostic Active   Yes                   Sorta?   No
Log-time          Yes                   No       No
Outline
1 Why Reductions
2 What is a learning reduction?
3 Exponential Improvements
Learning Reduction Basics
Goal: minimize ℓ on D.
Transform D into D_R, run an ℓ′ optimizer on D_R to obtain h,
then transform h with small ℓ′(h, D_R) into Rh with small ℓ(Rh, D),
such that if h does well on (D_R, ℓ′), Rh is guaranteed to do well on (D, ℓ).
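To make the two transforms concrete, here is a minimal sketch of the classic one-against-all reduction from K-class classification (ℓ) to binary classification (ℓ′); the (x, c) pair encoding of binary examples is an assumption of the sketch:

```python
def transform_dataset(multiclass_data, K):
    """The D -> D_R map: each multiclass example (x, y) becomes K binary
    examples ((x, c), 1[y = c]), asking 'is the label c?' for every class."""
    return [((x, c), 1 if y == c else 0)
            for x, y in multiclass_data
            for c in range(K)]

def transform_hypothesis(h, K):
    """The h -> Rh map: a binary scorer h on (x, c) pairs becomes a
    multiclass predictor by picking the highest-scoring class."""
    def Rh(x):
        return max(range(K), key=lambda c: h((x, c)))
    return Rh
```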
Error Reductions: the simplest possible
Prove: small ℓ′ error ⇒ small ℓ error.
An issue: if R introduces noise, small ℓ′ is not possible.
⇒ Must prove small ℓ′ is achievable for the statement to be nonvacuous.
⇒ Error reductions are weak for noisy problems.
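For instance, one-against-all admits exactly such a proof. A multiclass mistake by Rh on (x, y) requires at least one of the K induced binary predictions to be wrong, so a union bound over the induced examples gives

Pr_{(x,y)∼D}(Rh(x) ≠ y) ≤ Σ_{c=1}^{K} Pr(h errs on (x, c)) = K · ℓ′(h, D_R),

where D_R puts equal weight on the K induced binary examples: a "small ℓ′ error ⇒ small ℓ error" statement in the sense above.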
Regret Reductions: Dealing with noise
Let reg_{ℓ,D}(h) = ℓ(h, D) − min_{h′} ℓ(h′, D).
Prove: small reg_{ℓ′,D′} ⇒ small reg_{ℓ,D}.
(For example, if binary labels are flipped with probability η, every predictor has ℓ ≥ η, so small ℓ′ is unachievable, yet the Bayes-optimal predictor has zero regret.)
Note: min_{h′} is over all functions.
⇒ The user is responsible for choosing the right hypothesis space.
⇒ Unable to address information gathering.
Oracle Reductions: Information gathering
Assume an oracle which, given samples S, returns argmin_{h∈H} ℓ′(h, S).
Prove: if the oracle (approximately) works, we get computationally efficient learning with small online regret on the original problem.
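A minimal sketch of how exploration reduces to such an oracle, via ε-greedy: observed rewards are importance-weighted into a supervised dataset and the oracle is re-invoked. Here env and the fit oracle are illustrative assumptions, and practical algorithms invoke the oracle far more sparingly:

```python
import random

def oracle_cb(env, fit, K, T, epsilon=0.05):
    """Contextual bandit learning whose only learning primitive is a
    supervised oracle: fit(S) returns argmin over h in H of loss on S."""
    data, policy = [], None
    for t in range(T):
        x = env.context()
        greedy = policy(x) if policy is not None else 0
        a = random.randrange(K) if random.random() < epsilon else greedy
        # exact propensity of the chosen action under epsilon-greedy
        p = (1 - epsilon) * (a == greedy) + epsilon / K
        r = env.reward(a)
        data.append((x, a, r / p))   # importance-weighted exploration tuple
        policy = fit(data)           # the oracle call
    return policy
```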
Regret Transform Reductions

[Diagram: the map of regret transform reductions, each arrow labeled "regret multiplier / algorithm name" (e.g. 1 / Quicksort). Problems include Mean Regression, Quantile Regression, Classification, IW Classification, k-cost Classification, k-Partial Label, k-way Regression, AUC Ranking, T-step RL with State Visitation, T-step RL with Demonstration Policy, Dynamic Models, and Unsupervised by Self Prediction; algorithms include Probing, Quanting, Costing, Filter Tree, Offset Tree, PECOC, ECT, Searn, and PSDP, with regret multipliers such as 1, 4, k−1, k/2, Tk, and Tk ln T (some marked ??).]
Programming
Reductions ⇒ modularity and code reuse ⇒ good news for programming!
Vowpal Wabbit (http://hunch.net/~vw) uses this systematically.
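For example, multiclass reductions compose as command-line options. A sketch assuming the vowpalwabbit Python bindings (the Workspace API, example-string format, and --oaa flag reflect my understanding of the library, not the talk):

```python
from vowpalwabbit import Workspace

# one-against-all: reduce a 3-class problem to binary subproblems
model = Workspace("--oaa 3 --quiet")
model.learn("1 | a b")          # VW text format: label | features
model.learn("2 | b c")
model.learn("3 | a c")
print(model.predict("| a b"))   # predicted class in {1, 2, 3}
```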
An Open Problem: $1K reward!
Conditional Probability Estimation: distribution D over X × Y, where Y = {1, . . . , k}. Find a probability estimator h : X × Y → [0, 1] minimizing squared loss
ℓ(h, D) = E_{(x,y)∼D}[(h(y|x) − y)²]
The problem: how can you do this in time O(log k) with a constant regret ratio?
More Details
Beygelzimer, Langford, Daumé, Mineiro, "Learning Reductions that Really Work", Proceedings of the IEEE 104(1), 2016. https://arxiv.org/abs/1502.02704