Page 1:

What else can we do with more data?

Shai Shalev-Shwartz

School of Computer Science and Engineering

The Hebrew University of Jerusalem

TTI, February 2011

Based on joint papers with:
Ohad Shamir and Karthik Sridharan (COLT 2010)
Nicolò Cesa-Bianchi and Ohad Shamir (ICML 2010)

Shai Ben-David and Ruth Urner (Submitted)

and, of course, Nati Srebro

Shai Shalev-Shwartz (Hebrew U) What else can we do with more data? Feb’11 1 / 37

Page 2:

What else can we do with more data?

[Diagram] More Data →

reduce error (Traditional)

speed up runtime (training runtime, prediction runtime)

compensate for missing information

Shai Shalev-Shwartz (Hebrew U) What else can we do with more data? Feb’11 2 / 37

Page 4:

Outline

How can more data speed up training runtime?

Learning using Stochastic Optimization (S. & Srebro 2008)
Will not talk about this today

Injecting Structure (S., Shamir, Sridharan 2010)

How can more data speed up prediction runtime?

Proper Semi-Supervised Learning (S., Ben-David, Urner 2011)

How can more data compensate for missing information?

Attribute Efficient Learning (Cesa-Bianchi, S., Shamir 2010)
Technique: Rely on Stochastic Optimization

Shai Shalev-Shwartz (Hebrew U) What else can we do with more data? Feb’11 3 / 37

Page 5:

Injecting Structure – Main Idea

Replace original hypothesis class with a larger hypothesis class

On one hand: Larger class has more structure ⇒ easier to optimize

On the other hand: Larger class ⇒ larger estimation error ⇒ need more examples

[Diagram: the Original Hypothesis Class is contained in the larger New Hypothesis Class]

Shai Shalev-Shwartz (Hebrew U) What else can we do with more data? Feb’11 4 / 37

Page 6:

Example — Learning 3-DNF

Goal: learn a 3-DNF Boolean function h : {0,1}^d → {0,1}
DNF is a simple way to describe a concept (e.g. “computer nerd”)

Variables are attributes. E.g.

x1 = can read binary code
x2 = runs Unix as the operating system on his home computer
x3 = has a girlfriend
x4 = blushes whenever he tells someone how big his hard drive is

h(x) = (x1 ∧ ¬x3) ∨ (x2 ∧ ¬x3) ∨ (x4 ∧ ¬x3)

Shai Shalev-Shwartz (Hebrew U) What else can we do with more data? Feb’11 5 / 37

Page 7:

Example — Learning 3-DNF

Kearns & Vazirani: If RP ≠ NP, it is not possible to efficiently learn an ε-accurate 3-DNF formula

Claim: if m ≥ d³/ε it is possible to find a predictor with error ≤ ε in polynomial time

Shai Shalev-Shwartz (Hebrew U) What else can we do with more data? Feb’11 6 / 37

Page 9:

Proof

Observation: a 3-DNF formula can be rewritten as ∧_{u∈T1, v∈T2, w∈T3} (u ∨ v ∨ w) for three sets of literals T1, T2, T3

Define: ψ : {0,1}^d → {0,1}^{2(2d)³} s.t. for each triplet of literals u, v, w there are two variables indicating whether u ∨ v ∨ w is true or false

Observation: Each 3-DNF can be represented as a single conjunction over ψ(x)

Easy to learn a single conjunction (greedy or LP)

3-DNF over x   ⇒   conjunction over ψ(x)

Shai Shalev-Shwartz (Hebrew U) What else can we do with more data? Feb’11 7 / 37
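
The reduction above is straightforward to code. A minimal sketch (not from the slides; assumes NumPy, 0/1 labels, and, for brevity, a single clause indicator per triplet instead of the two on the slide). ψ expands x into one indicator per triplet of literals, and the conjunction over ψ(x) is learned with the standard elimination rule, whose sample complexity is proportional to the number of features, i.e. O(d³/ε), matching the claim above.

```python
import itertools
import numpy as np

def psi(x):
    """Expand x in {0,1}^d into one indicator per triplet of literals (u, v, w):
    the coordinate is 1 iff the clause u OR v OR w holds on x."""
    lits = list(x) + [1 - xi for xi in x]          # positive and negated literals
    return np.array([max(u, v, w)
                     for u, v, w in itertools.product(lits, repeat=3)])

def learn_conjunction(Xpsi, y):
    """Elimination rule for conjunctions: keep exactly the features that equal 1
    on every positive example; predict 1 iff all kept features are 1."""
    pos = Xpsi[y == 1]
    return pos.min(axis=0) == 1 if len(pos) else np.ones(Xpsi.shape[1], dtype=bool)

# Toy usage: the target (x1 AND NOT x3) OR x2 is a conjunction of clauses over psi(x).
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(500, 4))
y = (X[:, 0] & (1 - X[:, 2])) | X[:, 1]
Xpsi = np.array([psi(x) for x in X])
keep = learn_conjunction(Xpsi, y)
preds = (Xpsi[:, keep].min(axis=1) == 1).astype(int)
print("training errors:", int(np.sum(preds != y)))   # 0 in the realizable case
```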

Page 10:

Trading samples for runtime

Algorithm                samples    runtime
3-DNF over x             d/ε        2^d
Conjunction over ψ(x)    d³/ε       poly(d)

[Plot: runtime as a function of the number of examples m, for the 3-DNF and the Conjunction approaches]

Shai Shalev-Shwartz (Hebrew U) What else can we do with more data? Feb’11 8 / 37

Page 11:

Disclaimer

Analysis is based on upper bounds

Open problem: establish gaps by deriving lower bounds

Studied by: “Computational Sample Complexity” (Decatur, Goldreich, Ron 1998)

Very few (if any) results on “real-world” problems, e.g. Rocco Servedio showed gaps for 1-decision lists

Shai Shalev-Shwartz (Hebrew U) What else can we do with more data? Feb’11 9 / 37

Page 12:

Agnostic learning of Halfspaces with 0-1 loss

Agnostic PAC:

D: arbitrary distribution over X × Y
Training set: S = (x_1, y_1), . . . , (x_m, y_m)

Goal: use S to find h_S s.t. w.p. 1 − δ,

err(h_S) ≤ min_{h∈H} err(h) + ε

Shai Shalev-Shwartz (Hebrew U) What else can we do with more data? Feb’11 10 / 37

Page 13:

Hypothesis class

H = {x ↦ φ(〈w,x〉) : ‖w‖₂ ≤ 1},   φ(z) = 1 / (1 + exp(−z/µ))

[Plot: the sigmoid transfer function φ over [−1, 1]]

Probabilistic classifier: Pr[h_w(x) = 1] = φ(〈w,x〉)

Loss function: err(w; (x, y)) = Pr[h_w(x) ≠ y] = |φ(〈w,x〉) − (y+1)/2|

Remark: Dimension can be infinite (kernel methods)

Shai Shalev-Shwartz (Hebrew U) What else can we do with more data? Feb’11 11 / 37
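
For concreteness, a minimal sketch of the transfer function and the loss defined above (my own, assuming NumPy and labels y ∈ {−1, +1}):

```python
import numpy as np

def phi(z, mu):
    """Sigmoid transfer function: phi(z) = 1 / (1 + exp(-z / mu))."""
    return 1.0 / (1.0 + np.exp(-z / mu))

def err(w, x, y, mu):
    """Probabilistic 0-1 loss of the fuzzy halfspace h_w on (x, y), y in {-1, +1}:
    err(w; (x, y)) = |phi(<w, x>) - (y + 1)/2| = Pr[h_w(x) != y]."""
    return abs(phi(np.dot(w, x), mu) - (y + 1) / 2.0)
```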

Page 14:

First approach — sub-sample covering

Claim: there exist 1/(εµ²) examples from which we can efficiently learn w⋆ up to error of ε

Proof idea:

S′ = {(x_i, y′_i) : y′_i = y_i if y_i〈w⋆, x_i〉 < −µ and else y′_i = −y_i}

Use the surrogate convex loss ½ max{0, 1 − y〈w, x〉/γ}

Minimizing the surrogate loss on S′ ⇒ minimizing the original loss on S
Sample complexity w.r.t. the surrogate loss is 1/(εµ²)

Analysis

Sample complexity: 1/(εµ)²

Time complexity: m^{1/(εµ²)} = (1/(εµ))^{1/(εµ²)}

Shai Shalev-Shwartz (Hebrew U) What else can we do with more data? Feb’11 12 / 37

Page 15:

Second Approach – IDPK (S, Shamir, Sridharan)

Learning fuzzy halfspaces using Infinite-Dimensional-Polynomial-Kernel

Original class: H = {x ↦ φ(〈w,x〉)}

Problem: Loss is non-convex w.r.t. w

Main idea: Work with a larger hypothesis class for which the loss becomes convex

x ↦ φ(〈w,x〉)   ⇒   x ↦ 〈v, ψ(x)〉

Shai Shalev-Shwartz (Hebrew U) What else can we do with more data? Feb’11 13 / 37

Page 18:

Step 2 – Learning fuzzy halfspaces with IDPK

Original class: H = {x ↦ φ(〈w,x〉) : ‖w‖ ≤ 1}
New class: H′ = {x ↦ 〈v, ψ(x)〉 : ‖v‖ ≤ B}, where ψ : X → ℝ^ℕ s.t. ∀j, ∀(i_1, . . . , i_j), ψ(x)_{(i_1,...,i_j)} = 2^{−j/2} x_{i_1} · · · x_{i_j}

Lemma (S, Shamir, Sridharan 2009)

If B = exp(O(1/µ)) then for all h ∈ H there exists h′ ∈ H′ s.t. for all x, h(x) ≈ h′(x).

Remark: The above is a pessimistic choice of B. In practice, a smaller B suffices. Is it tight? Even if it is, are there natural assumptions under which a better bound holds? (e.g. Kalai, Klivans, Mansour, Servedio 2005)

Shai Shalev-Shwartz (Hebrew U) What else can we do with more data? Feb’11 14 / 37
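
A small numerical check of the new class (my own, assuming NumPy): a truncated version of the feature map ψ, using the 2^{−j/2} scaling that matches Vovk's kernel on a later slide, and a comparison of the explicit inner product 〈ψ(x), ψ(x′)〉 against the closed-form kernel.

```python
import itertools
import numpy as np

def psi_truncated(x, max_degree):
    """Truncated feature map: one coordinate 2^{-j/2} * x_{i1} * ... * x_{ij}
    per index sequence (i1, ..., ij), for j = 0, ..., max_degree."""
    feats = []
    for j in range(max_degree + 1):
        for idx in itertools.product(range(len(x)), repeat=j):
            feats.append(2.0 ** (-j / 2) * np.prod([x[i] for i in idx]))
    return np.array(feats)

rng = np.random.default_rng(0)
x, xp = rng.normal(size=3), rng.normal(size=3)
x, xp = x / (2 * np.linalg.norm(x)), xp / (2 * np.linalg.norm(xp))   # ||x|| = 1/2
approx = psi_truncated(x, 8) @ psi_truncated(xp, 8)
exact = 1.0 / (1.0 - 0.5 * np.dot(x, xp))      # Vovk's kernel, see two slides ahead
print(approx, exact)                           # the two values should nearly coincide
```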

Page 21:

Proof idea

Polynomial approximation: φ(z) ≈ Σ_{j=0}^∞ β_j z^j

Therefore:

φ(〈w,x〉) ≈ Σ_{j=0}^∞ β_j (〈w,x〉)^j
         = Σ_{j=0}^∞ Σ_{k_1,...,k_j} (2^{j/2} β_j w_{k_1} · · · w_{k_j}) (2^{−j/2} x_{k_1} · · · x_{k_j})
         = 〈v_w, ψ(x)〉

To obtain a concrete bound we use the Chebyshev approximation technique: a family of orthogonal polynomials w.r.t. the inner product

〈f, g〉 = ∫_{−1}^{1} f(x) g(x) / √(1 − x²) dx

Shai Shalev-Shwartz (Hebrew U) What else can we do with more data? Feb’11 15 / 37
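
A rough numerical illustration of the polynomial-approximation step (my own, assuming NumPy; a least-squares Chebyshev fit on a grid rather than the exact expansion w.r.t. the weighted inner product). The steeper φ is (smaller µ), the higher the degree needed, which is where the exp(O(1/µ)) size of B comes from.

```python
import numpy as np
from numpy.polynomial import chebyshev as C

mu = 0.1
phi = lambda z: 1.0 / (1.0 + np.exp(-z / mu))

z = np.linspace(-1.0, 1.0, 2001)            # range of <w, x> when the norms are <= 1
for deg in (5, 10, 20, 40):
    coeffs = C.chebfit(z, phi(z), deg)      # least-squares fit in the Chebyshev basis
    max_err = np.max(np.abs(C.chebval(z, coeffs) - phi(z)))
    print(f"degree {deg:2d}: max |phi - poly| = {max_err:.4f}")
```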

Page 24:

Infinite-Dimensional-Polynomial-Kernel

Although the dimension is infinite, can be solved using the kernel trick

The corresponding kernel (a.k.a. Vovk’s infinite polynomial):

〈ψ(x), ψ(x′)〉 = K(x, x′) = 1 / (1 − ½〈x, x′〉)

Algorithm boils down to linear regression with the above kernel

Convex! Can be solved efficiently

Sample complexity: (B/ε)² = 2^{O(1/µ)}/ε²

Time complexity: m²

Shai Shalev-Shwartz (Hebrew U) What else can we do with more data? Feb’11 16 / 37
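
A minimal sketch of the kernel step (not from the slides; assumes NumPy, ‖x‖ ≤ 1 so the kernel is well defined, and regression targets (y+1)/2; the small ridge term is my addition for numerical stability only):

```python
import numpy as np

def vovk_kernel(A, B):
    """Vovk's infinite polynomial kernel K(x, x') = 1 / (1 - <x, x'>/2)."""
    return 1.0 / (1.0 - 0.5 * (A @ B.T))

def fit(X, targets, lam=1e-6):
    """Kernel linear regression: solve (K + lam*I) alpha = targets."""
    K = vovk_kernel(X, X)
    return np.linalg.solve(K + lam * np.eye(len(targets)), targets)

def predict(alpha, X_train, X_test):
    """<v, psi(x)> evaluated via the kernel trick."""
    return vovk_kernel(X_test, X_train) @ alpha
```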

Page 25:

Trading samples for time

Algorithm    sample complexity            time complexity
Covering     1/(ε²µ²)                     (1/(εµ))^{1/(εµ²)}
IDPK         (1/(εµ))^{1/µ} · 1/ε²        (1/(εµ))^{2/µ} · 1/ε⁴

Shai Shalev-Shwartz (Hebrew U) What else can we do with more data? Feb’11 17 / 37

Page 26:

Agnostic learning of Halfspaces with 0-1 loss

[Plot: runtime as a function of the number of examples m; curves: Covering and S., Shamir, Sridharan (2010)]

Shai Shalev-Shwartz (Hebrew U) What else can we do with more data? Feb’11 18 / 37

Page 27:

Outline

How can more data speed up training runtime?

Learning using Stochastic Optimization (S. & Srebro 2008)

Injecting Structure (S., Shamir, Sridharan 2010) ✓

How can more data speed up prediction runtime?

Proper Semi-Supervised Learning (S., Ben-David, Urner 2011)

How can more data compensate for missing information?

Attribute Efficient Learning (Cesa-Bianchi, S., Shamir 2010)
Technique: Rely on Stochastic Optimization

Shai Shalev-Shwartz (Hebrew U) What else can we do with more data? Feb’11 19 / 37

Page 28:

More data can speed up prediction time

Semi-Supervised Learning: Many unlabeled examples, few labeled examples

Most previous work: how can unlabeled data improve accuracy?

Our goal: how can unlabeled data help construct faster classifiers?

Modeling: Proper Semi-Supervised Learning, i.e. we must output a classifier from a predefined class H

Shai Shalev-Shwartz (Hebrew U) What else can we do with more data? Feb’11 20 / 37

Page 29:

Proper Semi-Supervised Learning

A simple two phase procedure:

Use labeled examples to learn an arbitrary classifier (which is as accurate as possible)

Apply the learned classifier to label the unlabeled examples

Feed the now-labeled examples to a proper supervised learner for H

Lemma

Agnostic learners are robust with respect to small changes in the input distribution:

P[h(x) ≠ f(x)] ≤ P[h(x) ≠ g(x)] + P[g(x) ≠ f(x)]

Shai Shalev-Shwartz (Hebrew U) What else can we do with more data? Feb’11 21 / 37
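
A minimal sketch of the two-phase procedure (assuming scikit-learn; the 1-NN teacher and the linear student mirror the demonstration on the next slides, but any accurate learner and any proper learner for H could be plugged in):

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression

def proper_ssl(X_lab, y_lab, X_unlab):
    """Two-phase proper semi-supervised learning:
    1) learn an arbitrary (possibly slow-to-evaluate) classifier on the labeled data,
    2) use it to label the unlabeled pool,
    3) run a proper supervised learner for H on the now-labeled pool."""
    teacher = KNeighborsClassifier(n_neighbors=1).fit(X_lab, y_lab)
    pseudo_labels = teacher.predict(X_unlab)
    student = LogisticRegression(max_iter=1000).fit(X_unlab, pseudo_labels)
    return student          # fast (linear) predictor from the predefined class H
```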

Page 31:

Demonstration

[Plot: test error (≈ 0.04 to 0.18) as a function of the number of labeled examples (10³ to 10⁴); curves: Linear, Linear with SSL, Chamfer]

Shai Shalev-Shwartz (Hebrew U) What else can we do with more data? Feb’11 22 / 37

Page 32:

Demonstration

[Plot: test error (≈ 0.04 to 0.18) as a function of the number of labeled examples (10³ to 10⁴); curves: Linear, Linear with SSL, 1-NN]

Shai Shalev-Shwartz (Hebrew U) What else can we do with more data? Feb’11 23 / 37

Page 33:

Outline

How can more data speed up training runtime?

Learning using Stochastic Optimization (S. & Srebro 2008)

Injecting Structure (S., Shamir, Sridharan 2010) ✓

How can more data speed up prediction runtime?

Proper Semi-Supervised Learning (S., Ben-David, Urner 2011) ✓

How can more data compensate for missing information?

Attribute Efficient Learning (Cesa-Bianchi, S., Shamir 2010)
Technique: Rely on Stochastic Optimization

Shai Shalev-Shwartz (Hebrew U) What else can we do with more data? Feb’11 24 / 37

Page 34:

Attribute efficient regression

Each training example is a pair (x, y) ∈ ℝ^d × ℝ
Partial information: can only view O(1) attributes of each individual example

Shai Shalev-Shwartz (Hebrew U) What else can we do with more data? Feb’11 25 / 37

Page 35:

How does more data help?

Three main techniques:

1 Missing information as noise

2 Active Exploration — try to “fish” the relevant information

3 Inject structure: the problem is hard in the original representation but becomes simple in another representation (different hypothesis class)

More data helps because:

1 It reduces variance — compensates for the noise

2 It allows more exploration

3 It compensates for the larger sample complexity due to using larger hypothesis classes

Shai Shalev-Shwartz (Hebrew U) What else can we do with more data? Feb’11 26 / 37

Page 36:

Attribute efficient regression

Formal problem statement:

Unknown distribution D over ℝ^d × ℝ
Goal: learn a linear predictor x ↦ 〈w,x〉 with low risk:

Risk: L_D(w) = E_D[(〈w,x〉 − y)²]
Training set: (x_1, y_1), . . . , (x_m, y_m)

Partial information: For each (x_i, y_i), the learner can view only k attributes of x_i

Active selection: the learner can choose which k attributes to see

Similar to “Learning with restricted focus of attention” (Ben-David & Dichterman 98)

Shai Shalev-Shwartz (Hebrew U) What else can we do with more data? Feb’11 27 / 37

Page 37:

Dealing with missing information

Usually difficult: exponentially many ways to complete the missing information

Popular approach — Expectation Maximization (EM)

Previous methods usually do not come with guarantees(neither sample complexity nor computational complexity)

Shai Shalev-Shwartz (Hebrew U) What else can we do with more data? Feb’11 28 / 37

Page 38:

Partial information as noise

Observation:

x = (x_1, x_2, . . . , x_d) = (1/d)·(d·x_1, 0, . . . , 0) + . . . + (1/d)·(0, . . . , 0, d·x_d)

Therefore, choosing i uniformly at random gives

E_i[d·x_i·e_i] = x .

If ‖x‖ ≤ 1 then ‖d·x_i·e_i‖ ≤ d (i.e. the variance increased)

Reduced missing information to unbiased noise

Many examples can compensate for the added noise

Shai Shalev-Shwartz (Hebrew U) What else can we do with more data? Feb’11 29 / 37
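
A quick numerical check of the observation (my own, assuming NumPy): sampling one coordinate uniformly and returning d·x_i·e_i is unbiased for x, at the price of a norm, and hence a variance, that can be d times larger.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 10
x = rng.uniform(-1.0, 1.0, d) / np.sqrt(d)        # ensures ||x|| <= 1

def one_attribute_estimate(x):
    """Pick i uniformly at random and return d * x_i * e_i (views a single attribute)."""
    i = rng.integers(len(x))
    e_i = np.zeros(len(x))
    e_i[i] = 1.0
    return len(x) * x[i] * e_i

avg = np.mean([one_attribute_estimate(x) for _ in range(200_000)], axis=0)
print(np.max(np.abs(avg - x)))                    # close to 0: the estimate is unbiased
```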

Page 39:

A Stochastic Optimization Approach

Our goal: minimize over w the true risk L_D(w) = E_{(x,y)∼D}[(〈w,x〉 − y)²]
We can only obtain i.i.d. samples from D

Goal: min_w L_D(w)

ERM (traditional): min_w (1/m) Σ_{i=1}^m (〈w, x_i〉 − y_i)²

SGD: w ← w − ηv, where E[v] = ∇L_D(w)

Shai Shalev-Shwartz (Hebrew U) What else can we do with more data? Feb’11 30 / 37

Page 41:

A Stochastic Optimization Approach

How to construct an unbiased estimate of the gradient:

Sample (x, y) ∼ D
Sample j uniformly from [d]

Sample i from [d] based on P[i] = |w_i|/‖w‖₁
Set v = 2(sign(w_i)·‖w‖₁·x_i − y)·d·x_j·e_j

Claim: E[v] = ∇L_D(w)

Shai Shalev-Shwartz (Hebrew U) What else can we do with more data? Feb’11 31 / 37
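
A minimal sketch of this gradient estimate and the resulting SGD step (not the full AER algorithm of the paper, which also projects onto a norm ball; assumes NumPy). Each call looks at only two attributes of x: coordinate i, drawn proportionally to |w_i|, estimates 〈w, x〉, and coordinate j, drawn uniformly, estimates the vector x.

```python
import numpy as np

def gradient_estimate(w, x, y, rng):
    """Unbiased estimate of the gradient of (<w,x> - y)^2 from two observed attributes."""
    d = len(x)
    norm1 = np.abs(w).sum()
    if norm1 == 0.0:
        inner = 0.0                                 # <w, x> is exactly 0
    else:
        i = rng.choice(d, p=np.abs(w) / norm1)      # P[i] = |w_i| / ||w||_1
        inner = np.sign(w[i]) * norm1 * x[i]        # E_i[inner] = <w, x>
    j = rng.integers(d)                             # uniform coordinate
    v = np.zeros(d)
    v[j] = 2.0 * (inner - y) * d * x[j]             # E_j[v] = 2(<w,x> - y) * x
    return v

def sgd_step(w, x, y, eta, rng):
    """One stochastic gradient step using the two-attribute estimate."""
    return w - eta * gradient_estimate(w, x, y, rng)
```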

Page 42:

A Stochastic Optimization Approach

Theorem (Cesa-Bianchi, S, Shamir)

Let w be the output of AER and let w⋆ be a competing vector. Then, with high probability

L_D(w) ≤ L_D(w⋆) + O( d·‖w⋆‖₂·‖w⋆‖₁ / √m ),

where d is the dimension and m is the number of examples.

Corollary

A factor of d² additional examples compensates for the lack of full information on each individual example.

Shai Shalev-Shwartz (Hebrew U) What else can we do with more data? Feb’11 32 / 37

Page 43:

Demonstration

Full information classifiers (top line) ⇒ error of ∼ 1.1%

Our algorithm (bottom line) ⇒ error of ∼ 3.5%

Shai Shalev-Shwartz (Hebrew U) What else can we do with more data? Feb’11 33 / 37

Page 44:

Demonstration

[Plots: test regression error (≈ 0.2 to 1.2) as a function of the number of examples (0 to 10,000) and as a function of the number of observed features (0 to 4×10⁴); methods: Ridge Reg., Lasso, AER, Baseline]

AER, Baseline: 4 pixels per example

Ridge, Lasso: 768 pixels per example

Shai Shalev-Shwartz (Hebrew U) What else can we do with more data? Feb’11 34 / 37

Page 45:

What to do with other loss functions?

General question: Given a r.v. X and a function f : ℝ → ℝ, how to construct an unbiased estimate of f(E[X])?

Claim (Paninski 2003): In general, not possible

Claim (Singh 1964, The Indian Journal of Statistics): Possible if the sample size is also a random number!

Shai Shalev-Shwartz (Hebrew U) What else can we do with more data? Feb’11 35 / 37

Page 48:

The key idea

Can construct Q_n(x) = Σ_{i=0}^n γ_{n,i} x^i  →  f(x)  as n → ∞

Let Q′_n(x_1, . . . , x_n) = Σ_{i=0}^n γ_{n,i} ∏_{j=1}^i x_j

Estimator:

draw a positive integer N according to Pr(N = n) = p_n
sample i.i.d. x_1, x_2, . . . , x_N
return (1/p_N) · ( Q′_N(x_1, . . . , x_N) − Q′_{N−1}(x_1, . . . , x_{N−1}) )

Claim: This is an unbiased estimator of f(E[X])

E_{N, x_1,...,x_N}[ (1/p_N) ( Q′_N(x_1, . . . , x_N) − Q′_{N−1}(x_1, . . . , x_{N−1}) ) ]

= Σ_{n=1}^∞ (p_n/p_n) · E_{x_1,...,x_n}[ Q′_n(x_1, . . . , x_n) − Q′_{n−1}(x_1, . . . , x_{n−1}) ]

= Σ_{n=1}^∞ ( Q_n(E[X]) − Q_{n−1}(E[X]) ) = f(E[X]).

Shai Shalev-Shwartz (Hebrew U) What else can we do with more data? Feb’11 36 / 37
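
A small numerical check of the construction (my own, assuming NumPy; f = exp with its Taylor polynomials Q_n(z) = Σ_{i≤n} z^i/i!, a geometric law for N, and X ~ Uniform(0, 1) are all illustrative choices). For this f, Q′_N − Q′_{N−1} = x_1···x_N / N!, and the constant term Q_0 = 1 is added back deterministically so that the telescoping sum recovers exactly f(E[X]).

```python
import math
import numpy as np

rng = np.random.default_rng(0)

def unbiased_exp_of_mean(sample_x, q=0.5):
    """One draw of an unbiased estimator of exp(E[X]).
    N is geometric: Pr(N = n) = (1 - q) * q**(n - 1) for n = 1, 2, ..."""
    n = int(rng.geometric(1.0 - q))
    p_n = (1.0 - q) * q ** (n - 1)
    xs = sample_x(n)                                # n i.i.d. draws of X
    return 1.0 + np.prod(xs) / (math.factorial(n) * p_n)

# X ~ Uniform(0, 1): E[X] = 0.5, so the estimates should average to exp(0.5) ~ 1.6487
sample_x = lambda n: rng.uniform(0.0, 1.0, n)
print(np.mean([unbiased_exp_of_mean(sample_x) for _ in range(200_000)]))
```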

Page 49:

Summary

Learning theory: Many examples ⇒ smaller error

This work: Many examples ⇒
speed up training time
speed up prediction time
compensate for missing information

Techniques:
1 Stochastic optimization
2 Inject structure
3 Missing information as noise
4 Active Exploration

Shai Shalev-Shwartz (Hebrew U) What else can we do with more data? Feb’11 37 / 37