TRANSCRIPT
Exponential Computational Improvement by Reduction
John Langford @ Microsoft Research
Computational Challenges Workshop, May 2
The Empirical Age
Speech Recognition, ImageNet, Deep Learning, Neural Machine Translation
What is a theorist to do?
The Contextual Bandit Setting
For t = 1, . . . ,T :
1 The world produces some context x ∈ X
2 The learner chooses an action a ∈ A
3 The world reacts with reward r_a ∈ [0, 1]
Goal: Learn a good policy for choosing actions given context.
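A minimal sketch of this protocol as code; the env and learner objects and their context/reward/act/observe methods are illustrative assumptions of the sketch, not from the talk:

```python
def contextual_bandit(env, learner, T):
    """The contextual bandit protocol: only the reward of the
    chosen action is ever revealed to the learner."""
    for t in range(T):
        x = env.context()          # 1. world produces context x in X
        a = learner.act(x)         # 2. learner chooses action a in A
        r = env.reward(a)          # 3. world reacts with reward r_a in [0,1]
        learner.observe(x, a, r)   # partial feedback: r for the chosen a only
```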
Reduction Results
Algo   ε-greedy   Bagg    LinUCB    Online C.   Super.
Loss   0.095      0.059   0.128     0.053       0.051
Time   22         339     212×10³   17          6.9

Progressive validation loss (each example is predicted before being trained on) on RCV1.
The Offline Contextual Bandit Setting
Given exploration data (x, a, r, p)*
Learn a good policy for choosing actions given context.
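A standard way to use such logs is the inverse propensity score (IPS) estimate of a candidate policy's value. Below is a minimal sketch, reading p as the probability with which the logging policy chose action a (standard usage, not stated explicitly above) and taking policy to be any deterministic candidate:

```python
def ips_value(logs, policy):
    """Unbiased offline estimate of a policy's expected reward
    from exploration tuples (x, a, r, p)."""
    total = 0.0
    for x, a, r, p in logs:
        if policy(x) == a:      # importance weight is 1[policy(x) = a] / p
            total += r / p
    return total / len(logs)
```

Searching for a policy that maximizes such an estimate is one route to learning from logged exploration data.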
Offline Reduction Results
[Figure: classification error of the Offset Tree on UCI datasets (ecoli, glass, letter, optdigits, page-blocks, pendigits, satimage, vehicle, yeast).]
Agnostic Active Learning
For t = 1, . . . ,T :
1 The world produces some context x ∈ X
2 The learner predicts a label y ∈ Y
3 The learner chooses to request a label or not. If the label is requested:
1 observe y
2 update the learning algorithm
Goal: Compete with supervised learning using all labels while requesting as few as possible.
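A minimal sketch of this loop, with a simple margin-based query rule standing in for the agnostic strategy; the model interface and the threshold are illustrative assumptions:

```python
def active_learning(stream, model, threshold=0.1):
    """Request labels only where the current model is uncertain,
    aiming to match supervised accuracy with far fewer labels."""
    labels_used = 0
    for x, hidden_y in stream:          # label stays hidden unless requested
        p = model.predict_proba(x)      # confidence that the label is 1
        if abs(p - 0.5) < threshold:    # uncertain region: request the label
            labels_used += 1
            model.update(x, hidden_y)   # observe y, update the learner
    return labels_used
```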
AAL Reduction Results
[Figure: test error (0 to 0.05) vs. number of labels queried (0 to 5000) for Passive and Active learners.]
Logarithmic Time Prediction
Repeatedly:
1 See x
2 Predict y ∈ {1, . . . , K}
3 See y
Goal: Find h(x) minimizing error rate
Pr_{(x,y)∼D}(h(x) ≠ y)
with h(x) computed in time O(log K).
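The O(log K) bound comes from routing each prediction down a balanced binary tree of classifiers, one binary decision per level. A minimal sketch of that routing follows; the node fields are assumptions of the sketch, and the talk's LOMTree and Recall Tree additionally learn the tree structure online:

```python
def tree_predict(x, node):
    """Predict one of K classes with O(log K) classifier evaluations
    by descending a balanced binary tree of binary classifiers."""
    while node.left is not None:             # internal node: route the example
        if node.classifier.predict(x) <= 0:
            node = node.left
        else:
            node = node.right
    return node.label                        # each leaf holds one class
```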
Log-time prediction results
[Figure: statistical performance, shown as delta test error (%) from OAA on ALOI, ImageNet, and ODP, for LOMTree and Recall Tree.]
Summary
Problem           Learning Reductions   OCO      PAC
CB Explore        Yes                   Sorta?   No
CB Learn          Yes                   Sorta?   No
Agnostic Active   Yes                   Sorta?   No
Log-time          Yes                   No       No
Outline
1 Why Reductions
2 What is a learning reduction?
3 Exponential Improvements
Learning Reduction Basics
Goal: minimize ℓ on D.
Transform D into D_R, run an ℓ′ optimizer on D_R to obtain h,
then transform h with small ℓ′(h, D_R) into Rh with small ℓ(Rh, D),
such that if h does well on (D_R, ℓ′), Rh is guaranteed to do well on (D, ℓ).
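To make the two transforms concrete, here is a minimal sketch of the classic one-against-all reduction from K-class classification (ℓ) to binary classification (ℓ′); the (x, c) pair encoding of binary examples is an assumption of the sketch:

```python
def transform_dataset(multiclass_data, K):
    """The D -> D_R map: each multiclass example (x, y) becomes K binary
    examples ((x, c), 1[y = c]), asking 'is the label c?' for every class."""
    return [((x, c), 1 if y == c else 0)
            for x, y in multiclass_data
            for c in range(K)]

def transform_hypothesis(h, K):
    """The h -> Rh map: a binary scorer h on (x, c) pairs becomes a
    multiclass predictor by picking the highest-scoring class."""
    def Rh(x):
        return max(range(K), key=lambda c: h((x, c)))
    return Rh
```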
Error Reductions: the simplest possible
Prove: small ℓ′ error ⇒ small ℓ error.
An issue: if R introduces noise, small ℓ′ is not possible.
⇒ Must prove small ℓ′ is achievable for the statement to be nonvacuous.
⇒ Error reductions are weak for noisy problems.
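For instance, one-against-all admits exactly such a proof. A multiclass mistake by Rh on (x, y) requires at least one of the K induced binary predictions to be wrong, so a union bound over the induced examples gives

Pr_{(x,y)∼D}(Rh(x) ≠ y) ≤ Σ_{c=1}^{K} Pr(h errs on (x, c)) = K · ℓ′(h, D_R),

where D_R puts equal weight on the K induced binary examples: a "small ℓ′ error ⇒ small ℓ error" statement in the sense above.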
Regret Reductions: Dealing with noise
Let reg_{ℓ,D}(h) = ℓ(h, D) − min_{h′} ℓ(h′, D).
Prove: small reg_{ℓ′,D′} ⇒ small reg_{ℓ,D}.
(For example, if binary labels are flipped with probability η, every predictor has ℓ ≥ η, so small ℓ′ is unachievable, yet the Bayes-optimal predictor has zero regret.)
Note: min_{h′} is over all functions.
⇒ The user is responsible for choosing the right hypothesis space.
⇒ Unable to address information gathering.
Oracle Reductions: Information gathering
Assume an oracle which, given samples S, returns argmin_{h∈H} ℓ′(h, S).
Prove: if the oracle (approximately) works, we get computationally efficient learning with small online regret on the original problem.
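A minimal sketch of how exploration reduces to such an oracle, via ε-greedy: observed rewards are importance-weighted into a supervised dataset and the oracle is re-invoked. Here env and the fit oracle are illustrative assumptions, and practical algorithms invoke the oracle far more sparingly:

```python
import random

def oracle_cb(env, fit, K, T, epsilon=0.05):
    """Contextual bandit learning whose only learning primitive is a
    supervised oracle: fit(S) returns argmin over h in H of loss on S."""
    data, policy = [], None
    for t in range(T):
        x = env.context()
        greedy = policy(x) if policy is not None else 0
        a = random.randrange(K) if random.random() < epsilon else greedy
        # exact propensity of the chosen action under epsilon-greedy
        p = (1 - epsilon) * (a == greedy) + epsilon / K
        r = env.reward(a)
        data.append((x, a, r / p))   # importance-weighted exploration tuple
        policy = fit(data)           # the oracle call
    return policy
```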
Regret Transform Reductions

[Diagram: the map of regret transform reductions, each arrow labeled "regret multiplier / algorithm name" (e.g. 1 / Quicksort). Problems include Mean Regression, Quantile Regression, Classification, IW Classification, k-cost Classification, k-Partial Label, k-way Regression, AUC Ranking, T-step RL with State Visitation, T-step RL with Demonstration Policy, Dynamic Models, and Unsupervised by Self Prediction; algorithms include Probing, Quanting, Costing, Filter Tree, Offset Tree, PECOC, ECT, Searn, and PSDP, with regret multipliers such as 1, 4, k−1, k/2, Tk, and Tk ln T (some marked ??).]
Programming
Reductions ⇒ modularity and code reuse ⇒ good news for programming!
Vowpal Wabbit (http://hunch.net/~vw) uses this systematically.
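For example, multiclass reductions compose as command-line options. A sketch assuming the vowpalwabbit Python bindings (the Workspace API, example-string format, and --oaa flag reflect my understanding of the library, not the talk):

```python
from vowpalwabbit import Workspace

# one-against-all: reduce a 3-class problem to binary subproblems
model = Workspace("--oaa 3 --quiet")
model.learn("1 | a b")          # VW text format: label | features
model.learn("2 | b c")
model.learn("3 | a c")
print(model.predict("| a b"))   # predicted class in {1, 2, 3}
```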
An Open Problem: $1K reward!
Conditional Probability Estimation: distribution D over X × Y, where Y = {1, . . . , k}. Find a probability estimator h : X × Y → [0, 1] minimizing squared loss
ℓ(h, D) = E_{(x,y)∼D}[(h(y|x) − y)²]
The problem: how can you do this in time O(log k) with a constant regret ratio?
More Details
Beygelzimer, Langford, Daumé, Mineiro, "Learning Reductions that Really Work", Proceedings of the IEEE 104(1), 2016. https://arxiv.org/abs/1502.02704