
Online Learning

Fei Xia, Language Technology Institute

[email protected]

March 16, 2015


Outline

Introduction

Why online learning?
Basic stuff about online learning

Prediction with expert advice

Halving algorithm
Weighted majority algorithm
Randomized weighted majority algorithm
Exponential weighted average algorithm


Why online learning?

In many cases, data arrives sequentially while predictions are required on the fly

Online algorithms do not require any distributional assumption

Applicable in adversarial environments

Simple algorithms

Theoretical guarantees


Introduction

Basic Properties:

Instead of learning from a training set and then testing on a test set, the online learning scenario mixes the training and test phases.

Instead of assuming that the distribution over data points is fixed for both training and test points and that points are sampled in an i.i.d. fashion, online learning makes no distributional assumption.

Instead of learning a hypothesis with small generalization error, online learning algorithms are measured using a mistake model and the regret.


Introduction

Basic Setting:

For t = 1, 2, ..., T:

Receive an instance x_t ∈ X
Make a prediction ŷ_t ∈ Y
Receive the true label y_t ∈ Y
Suffer loss L(ŷ_t, y_t)

Objective:

min ∑_{t=1}^T L(ŷ_t, y_t)
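To make the protocol concrete, here is a minimal Python sketch of the loop above; the learner object, its predict/update methods, and the zero-one loss are illustrative assumptions, not something fixed by the slides.

def run_online(learner, stream, loss):
    # stream yields (x_t, y_t) pairs; the learner must predict before seeing y_t.
    total_loss = 0.0
    for x_t, y_t in stream:
        y_hat = learner.predict(x_t)    # make a prediction ŷ_t
        total_loss += loss(y_hat, y_t)  # suffer loss L(ŷ_t, y_t)
        learner.update(x_t, y_t)        # learn from the revealed true label
    return total_loss

def zero_one(y_hat, y):
    # example loss for binary prediction (an assumption for illustration)
    return float(y_hat != y)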


Prediction with Expert Advice

For t = 1, 2, ..., T:

Receive an instance x_t ∈ X
Receive advice ŷ_{t,i} ∈ Y, i ∈ [1, N], from N experts
Make a prediction ŷ_t ∈ Y
Receive the true label y_t ∈ Y
Suffer loss L(ŷ_t, y_t)

Figure: Weather forecast, an example of a prediction problem based on expert advice [Mohri et al., 2012]


Regret Analysis

Objective: minimize the regret R_T

R_T = ∑_{t=1}^T L(ŷ_t, y_t) − min_{i=1}^N ∑_{t=1}^T L(ŷ_{t,i}, y_t)

What does low regret mean?

It means that we don't lose much from not knowing future events

It means that we can perform almost as well as someone who observes the entire sequence and picks the best prediction strategy in hindsight

It means that we can compete with a changing environment
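As a concrete check of the definition, the regret of a run can be computed directly from recorded per-round losses; the list-based data layout here is an assumption made for illustration.

def regret(learner_losses, expert_losses):
    # learner_losses[t] = L(ŷ_t, y_t); expert_losses[t][i] = L(ŷ_{t,i}, y_t)
    n_experts = len(expert_losses[0])
    cumulative_per_expert = [sum(row[i] for row in expert_losses) for i in range(n_experts)]
    return sum(learner_losses) - min(cumulative_per_expert)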


Halving algorithm

Realizable case: after some number of rounds T, we will learn the concept and no longer make errors.

Mistake bound: how many mistakes do we make before we learn a particular concept?

Maximum number of mistakes a learning algorithm A makes for a concept c (over all sequences S):

M_A(c) = max_S |mistakes(A, c)|

Maximum number of mistakes a learning algorithm A makes for a concept class C:

M_A(C) = max_{c ∈ C} M_A(c)


Halving algorithm

Algorithm 1 HALVING(H)

1: H_1 ← H
2: for t ← 1 to T do
3:   RECEIVE(x_t)
4:   ŷ_t ← MAJORITYVOTE(H_t, x_t)
5:   RECEIVE(y_t)
6:   if ŷ_t ≠ y_t then
7:     H_{t+1} ← {c ∈ H_t : c(x_t) = y_t}
8: return H_{T+1}
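A short Python sketch of HALVING; representing each hypothesis as a callable c(x) is an assumption made for illustration.

def halving(hypotheses, stream):
    # hypotheses: list of callables c(x); stream yields (x_t, y_t) pairs
    active = list(hypotheses)                              # H_1 <- H
    for x_t, y_t in stream:
        votes = [c(x_t) for c in active]
        y_hat = max(set(votes), key=votes.count)           # majority vote over H_t
        if y_hat != y_t:                                   # on a mistake,
            active = [c for c in active if c(x_t) == y_t]  # keep only consistent hypotheses
    return active

In the realizable case the target concept is never removed, so the returned set is non-empty.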


Halving algorithm

Theorem

Let H be a finite hypothesis set. Then

M_Halving(H) ≤ log_2 |H|

Proof.

The algorithm makes predictions using a majority vote over the active set. Thus, at each mistake, the active set is reduced by at least half. Hence, after log_2 |H| mistakes, at most one active hypothesis can remain. Since we are in the realizable case, this hypothesis must coincide with the target concept, and we will not make any further mistakes.


Weighted majority algorithm

Algorithm 2 WEIGHTED-MAJORITY(N)

1: for i ← 1 to N do
2:   w_{1,i} ← 1
3: for t ← 1 to T do
4:   RECEIVE(x_t)
5:   if ∑_{i: ŷ_{t,i}=1} w_{t,i} ≥ ∑_{i: ŷ_{t,i}=0} w_{t,i} then
6:     ŷ_t ← 1
7:   else
8:     ŷ_t ← 0
9:   RECEIVE(y_t)
10:  if ŷ_t ≠ y_t then
11:    for i ← 1 to N do
12:      if ŷ_{t,i} ≠ y_t then
13:        w_{t+1,i} ← β w_{t,i}
14:      else w_{t+1,i} ← w_{t,i}
15: return w_{T+1}
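The same algorithm as a Python sketch, under the slides' setting of binary labels in {0, 1}; the list-based data layout is an assumption made for illustration.

def weighted_majority(expert_preds, labels, beta=0.5):
    # expert_preds[t][i] in {0, 1}: advice of expert i at round t; labels[t] in {0, 1}; 0 < beta < 1
    n = len(expert_preds[0])
    w = [1.0] * n                                          # w_{1,i} <- 1
    mistakes = 0
    for preds, y_t in zip(expert_preds, labels):
        weight_for_1 = sum(w[i] for i in range(n) if preds[i] == 1)
        weight_for_0 = sum(w[i] for i in range(n) if preds[i] == 0)
        y_hat = 1 if weight_for_1 >= weight_for_0 else 0   # weighted majority vote
        if y_hat != y_t:
            mistakes += 1
            # on a mistake round, penalize the experts that erred
            w = [w[i] * beta if preds[i] != y_t else w[i] for i in range(n)]
    return w, mistakes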


Weighted majority algorithm

Theorem

Fix β ∈ (0, 1). Let m_T be the number of mistakes made by algorithm WM after T ≥ 1 rounds, and m_T^* be the number of mistakes made by the best of the N experts. Then, the following inequality holds:

m_T ≤ (log N + m_T^* log(1/β)) / log(2/(1+β))

Proof.

Introduce the potential function W_t = ∑_{i=1}^N w_{t,i}, then derive its upper and lower bounds. Since the predictions are made by weighted majority vote, if the algorithm makes an error at round t, we have

W_{t+1} ≤ ((1+β)/2) W_t


Weighted majority algorithm

Proof (Cont.)

There are m_T mistakes after T rounds and W_1 = N, thus

W_T ≤ ((1+β)/2)^{m_T} N

Note that we also have

W_T ≥ w_{T,i} = β^{m_{T,i}}

where m_{T,i} is the number of mistakes made by the i-th expert. Thus,

β^{m_T^*} ≤ ((1+β)/2)^{m_T} N  ⇒  m_T ≤ (log N + m_T^* log(1/β)) / log(2/(1+β))


Weighted majority algorithm

m_T ≤ (log N + m_T^* log(1/β)) / log(2/(1+β))

m_T ≤ O(log N) + constant × (number of mistakes of the best expert)

No assumption about the sequence of samples

The number of mistakes is roughly a constant times that of the best expert in hindsight

When m_T^* = 0, the bound reduces to m_T ≤ O(log N), which is the same as for the Halving algorithm


Randomized weighted majority algorithm

Drawback of the weighted majority algorithm: with the zero-one loss, no deterministic algorithm can achieve regret R_T = o(T)

In the randomized scenario,

A set A = {1, ..., N} of N actions is available

At each round t ∈ [1, T], an online algorithm A selects a distribution p_t over the set of actions

It then receives a loss vector l_t, where l_{t,i} ∈ {0, 1} is the loss associated with action i

Define the expected loss for round t: L_t = ∑_{i=1}^N p_{t,i} l_{t,i}; the total loss over T rounds: L_T = ∑_{t=1}^T L_t

Define the total loss associated with action i: L_{T,i} = ∑_{t=1}^T l_{t,i}; the minimal loss of a single action: L_T^min = min_{i ∈ A} L_{T,i}


Randomized weighted majority algorithm

Algorithm 3 RANDOMIZED-WEIGHTED-MAJORITY(N)

1: for i ← 1 to N do
2:   w_{1,i} ← 1
3:   p_{1,i} ← 1/N
4: for t ← 1 to T do
5:   for i ← 1 to N do
6:     if l_{t,i} = 1 then
7:       w_{t+1,i} ← β w_{t,i}
8:     else w_{t+1,i} ← w_{t,i}
9:   W_{t+1} ← ∑_{i=1}^N w_{t+1,i}
10:  for i ← 1 to N do
11:    p_{t+1,i} ← w_{t+1,i} / W_{t+1}
12: return w_{T+1}

Note: Let w_0 be the total weight on outcome 0 and w_1 the total weight on outcome 1, with W = w_0 + w_1; then the prediction strategy is to predict outcome i with probability w_i / W.
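A Python sketch of RWM on a sequence of {0, 1} loss vectors, with the sampling step that the pseudocode leaves implicit made explicit; the interface is an assumption made for illustration.

import random

def randomized_weighted_majority(loss_vectors, beta=0.5):
    # loss_vectors[t][i] in {0, 1}: loss of action i at round t; 1/2 <= beta < 1
    n = len(loss_vectors[0])
    w = [1.0] * n                                          # w_{1,i} <- 1, so p_1 is uniform
    expected_loss = 0.0
    for losses in loss_vectors:
        total = sum(w)
        p = [w_i / total for w_i in w]                     # p_t proportional to w_t
        action = random.choices(range(n), weights=p)[0]    # the action actually played this round
        expected_loss += sum(p_i * l_i for p_i, l_i in zip(p, losses))  # expected loss L_t
        w = [w_i * beta if l_i == 1 else w_i for w_i, l_i in zip(w, losses)]  # discount losing actions
    return w, expected_loss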


Randomized weighted majority algorithm

Theorem

Fix β ∈ [1/2, 1). Then for any T ≥ 1, the loss of algorithm RWM on any sequence can be bounded as follows:

L_T ≤ log N / (1 − β) + (2 − β) L_T^min

In particular, for β = max{1/2, 1 − √((log N)/T)}, the loss can be bounded as:

L_T ≤ L_T^min + 2√(T log N)

Proof.

Define the potential function W_t = ∑_{i=1}^N w_{t,i}, t ∈ [1, T].


Proof (Cont.)

W_{t+1} = ∑_{i: l_{t,i}=0} w_{t,i} + β ∑_{i: l_{t,i}=1} w_{t,i}
        = W_t + (β − 1) W_t ∑_{i: l_{t,i}=1} p_{t,i}
        = W_t (1 − (1 − β) L_t)

⇒ W_{T+1} = N ∏_{t=1}^T (1 − (1 − β) L_t)

Note that we also have W_{T+1} ≥ max_{i ∈ [1,N]} w_{T+1,i} = β^{L_T^min}, thus

β^{L_T^min} ≤ N ∏_{t=1}^T (1 − (1 − β) L_t)

⇒ L_T^min log β ≤ log N − (1 − β) L_T    (using log(1 − x) ≤ −x)

⇒ L_T ≤ log N / (1 − β) + (2 − β) L_T^min    (using −log β ≤ (1 − β)(2 − β) for β ∈ [1/2, 1))

Since L_T^min ≤ T, this also implies

L_T ≤ log N / (1 − β) + (1 − β) T + L_T^min

By minimizing the right-hand side with respect to β, we get

L_T ≤ L_T^min + 2√(T log N)  ⇔  R_T ≤ 2√(T log N)


Exponential weighted average algorithm

We have extended the WM algorithm to other loss functions L taking values in [0, 1]. The EWA algorithm presented here is a further extension in which L is convex in its first argument.

Algorithm 4 EXPONENTIAL-WEIGHTED-AVERAGE(N)

1: for i ← 1 to N do
2:   w_{1,i} ← 1
3: for t ← 1 to T do
4:   RECEIVE(x_t)
5:   ŷ_t ← (∑_{i=1}^N w_{t,i} ŷ_{t,i}) / (∑_{i=1}^N w_{t,i})
6:   RECEIVE(y_t)
7:   for i ← 1 to N do
8:     w_{t+1,i} ← w_{t,i} e^{−η L(ŷ_{t,i}, y_t)}
9: return w_{T+1}
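A Python sketch of EWA; the list-based data layout and the loss argument are assumptions made for illustration (any loss convex in its first argument and bounded in [0, 1] fits the theorem below).

import math

def exponential_weighted_average(expert_preds, labels, eta, loss):
    # expert_preds[t][i]: advice of expert i at round t; loss maps into [0, 1]
    n = len(expert_preds[0])
    w = [1.0] * n                                          # w_{1,i} <- 1
    total_loss = 0.0
    for preds, y_t in zip(expert_preds, labels):
        total_w = sum(w)
        y_hat = sum(w_i * p for w_i, p in zip(w, preds)) / total_w  # weighted-average prediction
        total_loss += loss(y_hat, y_t)
        # exponential update: w_{t+1,i} = w_{t,i} * exp(-eta * L(ŷ_{t,i}, y_t))
        w = [w_i * math.exp(-eta * loss(p, y_t)) for w_i, p in zip(w, preds)]
    return w, total_loss

Setting eta = math.sqrt(8 * math.log(n) / T) when T is known corresponds to the tuned choice of η in the theorem below.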


Exponential weighted average algorithm

Theorem

Assume that the loss function L is convex in its first argument and takes values in [0, 1]. Then, for any η > 0 and any sequence y_1, ..., y_T ∈ Y, the regret of the EWA algorithm is bounded as:

R_T ≤ log N / η + η T / 8

In particular, for η = √(8 log N / T), the regret is bounded as:

R_T ≤ √((T/2) log N)

Proof.

Define the potential function Φ_t = log ∑_{i=1}^N w_{t,i}, t ∈ [1, T].


Exponential weighted average algorithm

Proof.

We can prove (using Hoeffding's lemma and the convexity of L in its first argument) that

Φ_{t+1} − Φ_t ≤ −η L(ŷ_t, y_t) + η²/8

⇒ Φ_{T+1} − Φ_1 ≤ −η ∑_{t=1}^T L(ŷ_t, y_t) + η² T / 8

Then we obtain a lower bound on Φ_{T+1} − Φ_1:

Φ_{T+1} − Φ_1 = log ∑_{i=1}^N e^{−η L_{T,i}} − log N
            ≥ log max_{i=1}^N e^{−η L_{T,i}} − log N
            = −η min_{i=1}^N L_{T,i} − log N

Combining the lower bound and the upper bound, we get

∑_{t=1}^T L(ŷ_t, y_t) − min_{i=1}^N L_{T,i} ≤ log N / η + η T / 8


Exponential weighted average algorithm

The optimal choice of η requires knowledge of T, which is a disadvantage of this analysis. How can this be addressed?

The doubling trick: divide time into periods [2^k, 2^{k+1} − 1] of length 2^k, with k = 0, ..., n, and choose η_k = √(8 log N / 2^k) in period k (a small code sketch of this schedule is given after the theorem below). This leads to the following theorem.

Theorem

Assume that the loss function L is convex in its first argument and takes values in [0, 1]. Then, for any T ≥ 1 and any sequence y_1, ..., y_T ∈ Y, the regret of the EWA algorithm after T rounds is bounded as follows:

R_T ≤ (√2 / (√2 − 1)) √((T/2) log N) + √((log N)/2)
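A sketch of the doubling trick wrapped around the EWA sketch given earlier (it reuses that exponential_weighted_average function); restarting the weights at each period boundary is the standard choice and an assumption here.

import math

def ewa_with_doubling(expert_preds, labels, loss):
    # Run EWA over periods [2^k, 2^{k+1} - 1], restarting with eta_k = sqrt(8 log N / 2^k).
    n = len(expert_preds[0])
    T = len(labels)
    t, k, total_loss = 0, 0, 0.0
    while t < T:
        length = 2 ** k                                    # period k has length 2^k
        eta_k = math.sqrt(8 * math.log(n) / length)
        _, period_loss = exponential_weighted_average(
            expert_preds[t:t + length], labels[t:t + length], eta_k, loss)
        total_loss += period_loss
        t += length
        k += 1
    return total_loss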


Summary

In many cases, data arrives sequentially while predictions are required on the fly

Online algorithms do not require any distributional assumption

Applicable in adversarial environments

Simple algorithms

Theoretical guarantees


Reference I

Mohri, M., Rostamizadeh, A., and Talwalkar, A. (2012). Foundations of Machine Learning. MIT Press.

Shalev-Shwartz, S. and Singer, Y. (2008). Tutorial on theory and applications of online learning. http://ttic.uchicago.edu/~shai/icml08tutorial/OLtutorial.pdf. [Online; accessed 15-Mar-2015].
