
Online Learning

Fei Xia, Language Technology Institute

[email protected]

March 16, 2015


Outline

Introduction

Why online learning?
Basic stuff about online learning

Prediction with expert advice

Halving algorithm
Weighted majority algorithm
Randomized weighted majority algorithm
Exponential weighted average algorithm


Why online learning?

In many cases, data arrives sequentially while predictions are required on the fly

Online algorithms do not require any distributional assumption

Applicable in adversarial environments

Simple algorithms

Theoretical guarantees


Introduction

Basic Properties:

Instead of learning from a training set and then testing on a test set, the online learning scenario mixes the training and test phases.

Instead of assuming that the distribution over data points is fixed for both training and test points and that points are sampled in an i.i.d. fashion, online learning makes no distributional assumption.

Instead of learning a hypothesis with small generalization error, online learning algorithms are measured using a mistake model and the regret.


Introduction

Basic Setting:

For t = 1, 2, ..., T:

Receive an instance x_t ∈ X
Make a prediction ŷ_t ∈ Y
Receive the true label y_t ∈ Y
Suffer loss L(ŷ_t, y_t)

Objective:

min ∑_{t=1}^T L(ŷ_t, y_t)
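To make the protocol concrete, here is a minimal Python sketch of the loop above; the learner object, its predict/update methods, and the zero-one loss are illustrative assumptions, not something fixed by the slides.

def run_online(learner, stream, loss):
    # stream yields (x_t, y_t) pairs; the learner must predict before seeing y_t.
    total_loss = 0.0
    for x_t, y_t in stream:
        y_hat = learner.predict(x_t)    # make a prediction ŷ_t
        total_loss += loss(y_hat, y_t)  # suffer loss L(ŷ_t, y_t)
        learner.update(x_t, y_t)        # learn from the revealed true label
    return total_loss

def zero_one(y_hat, y):
    # example loss for binary prediction (an assumption for illustration)
    return float(y_hat != y)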


Prediction with Expert Advice

For t = 1, 2, ..., T:

Receive an instance x_t ∈ X
Receive advice ŷ_{t,i} ∈ Y, i ∈ [1, N], from N experts
Make a prediction ŷ_t ∈ Y
Receive the true label y_t ∈ Y
Suffer loss L(ŷ_t, y_t)

Figure: Weather forecast, an example of a prediction problem based on expert advice [Mohri et al., 2012]


Regret Analysis

Objective: minimize the regret R_T

R_T = ∑_{t=1}^T L(ŷ_t, y_t) − min_{i=1}^N ∑_{t=1}^T L(ŷ_{t,i}, y_t)

What does low regret mean?

It means that we don't lose much from not knowing future events

It means that we can perform almost as well as someone who observes the entire sequence and picks the best prediction strategy in hindsight

It means that we can compete with a changing environment
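As a concrete check of the definition, the regret of a run can be computed directly from recorded per-round losses; the list-based data layout here is an assumption made for illustration.

def regret(learner_losses, expert_losses):
    # learner_losses[t] = L(ŷ_t, y_t); expert_losses[t][i] = L(ŷ_{t,i}, y_t)
    n_experts = len(expert_losses[0])
    cumulative_per_expert = [sum(row[i] for row in expert_losses) for i in range(n_experts)]
    return sum(learner_losses) - min(cumulative_per_expert)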


Halving algorithm

Realizable case: after some number of rounds T, we will learn the concept and no longer make errors.

Mistake bound: how many mistakes do we make before we learn a particular concept?

Maximum number of mistakes a learning algorithm A makes for a concept c (over all sequences S):

M_A(c) = max_S |mistakes(A, c)|

Maximum number of mistakes a learning algorithm A makes for a concept class C:

M_A(C) = max_{c ∈ C} M_A(c)


Halving algorithm

Algorithm 1 HALVING(H)

1: H_1 ← H
2: for t ← 1 to T do
3:   RECEIVE(x_t)
4:   ŷ_t ← MAJORITYVOTE(H_t, x_t)
5:   RECEIVE(y_t)
6:   if ŷ_t ≠ y_t then
7:     H_{t+1} ← {c ∈ H_t : c(x_t) = y_t}
8: return H_{T+1}
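A short Python sketch of HALVING; representing each hypothesis as a callable c(x) is an assumption made for illustration.

def halving(hypotheses, stream):
    # hypotheses: list of callables c(x); stream yields (x_t, y_t) pairs
    active = list(hypotheses)                              # H_1 <- H
    for x_t, y_t in stream:
        votes = [c(x_t) for c in active]
        y_hat = max(set(votes), key=votes.count)           # majority vote over H_t
        if y_hat != y_t:                                   # on a mistake,
            active = [c for c in active if c(x_t) == y_t]  # keep only consistent hypotheses
    return active

In the realizable case the target concept is never removed, so the returned set is non-empty.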


Halving algorithm

Theorem

Let H be a finite hypothesis set. Then

M_Halving(H) ≤ log_2 |H|

Proof.

The algorithm makes predictions using a majority vote over the active set. Thus, at each mistake, the active set is reduced by at least half. Hence, after log_2 |H| mistakes, at most one active hypothesis can remain. Since we are in the realizable case, this hypothesis must coincide with the target concept, and we will not make any further mistakes.


Weighted majority algorithm

Algorithm 2 WEIGHTED-MAJORITY(N)

1: for i ← 1 to N do
2:   w_{1,i} ← 1
3: for t ← 1 to T do
4:   RECEIVE(x_t)
5:   if ∑_{i: ŷ_{t,i}=1} w_{t,i} ≥ ∑_{i: ŷ_{t,i}=0} w_{t,i} then
6:     ŷ_t ← 1
7:   else
8:     ŷ_t ← 0
9:   RECEIVE(y_t)
10:  if ŷ_t ≠ y_t then
11:    for i ← 1 to N do
12:      if ŷ_{t,i} ≠ y_t then
13:        w_{t+1,i} ← β w_{t,i}
14:      else w_{t+1,i} ← w_{t,i}
15: return w_{T+1}
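The same algorithm as a Python sketch, under the slides' setting of binary labels in {0, 1}; the list-based data layout is an assumption made for illustration.

def weighted_majority(expert_preds, labels, beta=0.5):
    # expert_preds[t][i] in {0, 1}: advice of expert i at round t; labels[t] in {0, 1}; 0 < beta < 1
    n = len(expert_preds[0])
    w = [1.0] * n                                          # w_{1,i} <- 1
    mistakes = 0
    for preds, y_t in zip(expert_preds, labels):
        weight_for_1 = sum(w[i] for i in range(n) if preds[i] == 1)
        weight_for_0 = sum(w[i] for i in range(n) if preds[i] == 0)
        y_hat = 1 if weight_for_1 >= weight_for_0 else 0   # weighted majority vote
        if y_hat != y_t:
            mistakes += 1
            # on a mistake round, penalize the experts that erred
            w = [w[i] * beta if preds[i] != y_t else w[i] for i in range(n)]
    return w, mistakes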


Weighted majority algorithm

Theorem

Fix β ∈ (0, 1). Let m_T be the number of mistakes made by algorithm WM after T ≥ 1 rounds, and m_T^* be the number of mistakes made by the best of the N experts. Then, the following inequality holds:

m_T ≤ (log N + m_T^* log(1/β)) / log(2/(1+β))

Proof.

Introduce the potential function W_t = ∑_{i=1}^N w_{t,i}, then derive its upper and lower bounds. Since the predictions are made by weighted majority vote, if the algorithm makes an error at round t, we have

W_{t+1} ≤ ((1+β)/2) W_t


Weighted majority algorithm

Proof (Cont.)

There are m_T mistakes after T rounds and W_1 = N, thus

W_T ≤ ((1+β)/2)^{m_T} N

Note that we also have

W_T ≥ w_{T,i} = β^{m_{T,i}}

where m_{T,i} is the number of mistakes made by the i-th expert. Thus,

β^{m_T^*} ≤ ((1+β)/2)^{m_T} N  ⇒  m_T ≤ (log N + m_T^* log(1/β)) / log(2/(1+β))


Weighted majority algorithm

m_T ≤ (log N + m_T^* log(1/β)) / log(2/(1+β))

m_T ≤ O(log N) + constant × (number of mistakes of the best expert)

No assumption about the sequence of samples

The number of mistakes is roughly a constant times that of the best expert in hindsight

When m_T^* = 0, the bound reduces to m_T ≤ O(log N), which is the same as for the Halving algorithm


Randomized weighted majority algorithm

Drawback of the weighted majority algorithm: with the zero-one loss, no deterministic algorithm can achieve regret R_T = o(T)

In the randomized scenario,

A set A = {1, ..., N} of N actions is available

At each round t ∈ [1, T], an online algorithm A selects a distribution p_t over the set of actions

It then receives a loss vector l_t, where l_{t,i} ∈ {0, 1} is the loss associated with action i

Define the expected loss for round t: L_t = ∑_{i=1}^N p_{t,i} l_{t,i}; the total loss over T rounds: L_T = ∑_{t=1}^T L_t

Define the total loss associated with action i: L_{T,i} = ∑_{t=1}^T l_{t,i}; the minimal loss of a single action: L_T^min = min_{i ∈ A} L_{T,i}


Randomized weighted majority algorithm

Algorithm 3 RANDOMIZED-WEIGHTED-MAJORITY(N)

1: for i ← 1 to N do
2:   w_{1,i} ← 1
3:   p_{1,i} ← 1/N
4: for t ← 1 to T do
5:   for i ← 1 to N do
6:     if l_{t,i} = 1 then
7:       w_{t+1,i} ← β w_{t,i}
8:     else w_{t+1,i} ← w_{t,i}
9:   W_{t+1} ← ∑_{i=1}^N w_{t+1,i}
10:  for i ← 1 to N do
11:    p_{t+1,i} ← w_{t+1,i} / W_{t+1}
12: return w_{T+1}

Note: Let w_0 be the total weight on outcome 0 and w_1 the total weight on outcome 1, with W = w_0 + w_1; then the prediction strategy is to predict outcome i with probability w_i / W.
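A Python sketch of RWM on a sequence of {0, 1} loss vectors, with the sampling step that the pseudocode leaves implicit made explicit; the interface is an assumption made for illustration.

import random

def randomized_weighted_majority(loss_vectors, beta=0.5):
    # loss_vectors[t][i] in {0, 1}: loss of action i at round t; 1/2 <= beta < 1
    n = len(loss_vectors[0])
    w = [1.0] * n                                          # w_{1,i} <- 1, so p_1 is uniform
    expected_loss = 0.0
    for losses in loss_vectors:
        total = sum(w)
        p = [w_i / total for w_i in w]                     # p_t proportional to w_t
        action = random.choices(range(n), weights=p)[0]    # the action actually played this round
        expected_loss += sum(p_i * l_i for p_i, l_i in zip(p, losses))  # expected loss L_t
        w = [w_i * beta if l_i == 1 else w_i for w_i, l_i in zip(w, losses)]  # discount losing actions
    return w, expected_loss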


Randomized weighted majority algorithm

Theorem

Fix β ∈ [1/2, 1). Then for any T ≥ 1, the loss of algorithm RWM on any sequence can be bounded as follows:

L_T ≤ log N / (1 − β) + (2 − β) L_T^min

In particular, for β = max{1/2, 1 − √((log N)/T)}, the loss can be bounded as:

L_T ≤ L_T^min + 2√(T log N)

Proof.

Define the potential function W_t = ∑_{i=1}^N w_{t,i}, t ∈ [1, T].


Proof (Cont.)

W_{t+1} = ∑_{i: l_{t,i}=0} w_{t,i} + β ∑_{i: l_{t,i}=1} w_{t,i}
        = W_t + (β − 1) W_t ∑_{i: l_{t,i}=1} p_{t,i}
        = W_t (1 − (1 − β) L_t)

⇒ W_{T+1} = N ∏_{t=1}^T (1 − (1 − β) L_t)

Note that we also have W_{T+1} ≥ max_{i ∈ [1,N]} w_{T+1,i} = β^{L_T^min}, thus

β^{L_T^min} ≤ N ∏_{t=1}^T (1 − (1 − β) L_t)

⇒ L_T^min log β ≤ log N − (1 − β) L_T    (using log(1 − x) ≤ −x)

⇒ L_T ≤ log N / (1 − β) + (2 − β) L_T^min    (using −log β ≤ (1 − β)(2 − β) for β ∈ [1/2, 1))

Since L_T^min ≤ T, this also implies

L_T ≤ log N / (1 − β) + (1 − β) T + L_T^min

By minimizing the right-hand side with respect to β, we get

L_T ≤ L_T^min + 2√(T log N)  ⇔  R_T ≤ 2√(T log N)


Exponential weighted average algorithm

We have extended the WM algorithm to other loss functions L taking values in [0, 1]. The EWA algorithm presented here is a further extension in which L is convex in its first argument.

Algorithm 4 EXPONENTIAL-WEIGHTED-AVERAGE(N)

1: for i ← 1 to N do
2:   w_{1,i} ← 1
3: for t ← 1 to T do
4:   RECEIVE(x_t)
5:   ŷ_t ← (∑_{i=1}^N w_{t,i} ŷ_{t,i}) / (∑_{i=1}^N w_{t,i})
6:   RECEIVE(y_t)
7:   for i ← 1 to N do
8:     w_{t+1,i} ← w_{t,i} e^{−η L(ŷ_{t,i}, y_t)}
9: return w_{T+1}
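A Python sketch of EWA; the list-based data layout and the loss argument are assumptions made for illustration (any loss convex in its first argument and bounded in [0, 1] fits the theorem below).

import math

def exponential_weighted_average(expert_preds, labels, eta, loss):
    # expert_preds[t][i]: advice of expert i at round t; loss maps into [0, 1]
    n = len(expert_preds[0])
    w = [1.0] * n                                          # w_{1,i} <- 1
    total_loss = 0.0
    for preds, y_t in zip(expert_preds, labels):
        total_w = sum(w)
        y_hat = sum(w_i * p for w_i, p in zip(w, preds)) / total_w  # weighted-average prediction
        total_loss += loss(y_hat, y_t)
        # exponential update: w_{t+1,i} = w_{t,i} * exp(-eta * L(ŷ_{t,i}, y_t))
        w = [w_i * math.exp(-eta * loss(p, y_t)) for w_i, p in zip(w, preds)]
    return w, total_loss

Setting eta = math.sqrt(8 * math.log(n) / T) when T is known corresponds to the tuned choice of η in the theorem below.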


Exponential weighted average algorithm

Theorem

Assume that the loss function L is convex in its first argument and takes values in [0, 1]. Then, for any η > 0 and any sequence y_1, ..., y_T ∈ Y, the regret of the EWA algorithm is bounded as:

R_T ≤ log N / η + η T / 8

In particular, for η = √(8 log N / T), the regret is bounded as:

R_T ≤ √((T/2) log N)

Proof.

Define the potential function Φ_t = log ∑_{i=1}^N w_{t,i}, t ∈ [1, T].


Exponential weighted average algorithm

Proof.

We can prove (using Hoeffding's lemma and the convexity of L in its first argument) that

Φ_{t+1} − Φ_t ≤ −η L(ŷ_t, y_t) + η²/8

⇒ Φ_{T+1} − Φ_1 ≤ −η ∑_{t=1}^T L(ŷ_t, y_t) + η² T / 8

Then we obtain a lower bound on Φ_{T+1} − Φ_1:

Φ_{T+1} − Φ_1 = log ∑_{i=1}^N e^{−η L_{T,i}} − log N
            ≥ log max_{i=1}^N e^{−η L_{T,i}} − log N
            = −η min_{i=1}^N L_{T,i} − log N

Combining the lower bound and the upper bound, we get

∑_{t=1}^T L(ŷ_t, y_t) − min_{i=1}^N L_{T,i} ≤ log N / η + η T / 8


Exponential weighted average algorithm

The optimal choice of η requires knowledge of T, which is a disadvantage of this analysis. How can this be addressed?

The doubling trick: divide time into periods [2^k, 2^{k+1} − 1] of length 2^k, with k = 0, ..., n, and choose η_k = √(8 log N / 2^k) in period k (a small code sketch of this schedule is given after the theorem below). This leads to the following theorem.

Theorem

Assume that the loss function L is convex in its first argument and takes values in [0, 1]. Then, for any T ≥ 1 and any sequence y_1, ..., y_T ∈ Y, the regret of the EWA algorithm after T rounds is bounded as follows:

R_T ≤ (√2 / (√2 − 1)) √((T/2) log N) + √((log N)/2)
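A sketch of the doubling trick wrapped around the EWA sketch given earlier (it reuses that exponential_weighted_average function); restarting the weights at each period boundary is the standard choice and an assumption here.

import math

def ewa_with_doubling(expert_preds, labels, loss):
    # Run EWA over periods [2^k, 2^{k+1} - 1], restarting with eta_k = sqrt(8 log N / 2^k).
    n = len(expert_preds[0])
    T = len(labels)
    t, k, total_loss = 0, 0, 0.0
    while t < T:
        length = 2 ** k                                    # period k has length 2^k
        eta_k = math.sqrt(8 * math.log(n) / length)
        _, period_loss = exponential_weighted_average(
            expert_preds[t:t + length], labels[t:t + length], eta_k, loss)
        total_loss += period_loss
        t += length
        k += 1
    return total_loss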


Summary

In many cases, data arrives sequentially while predictions are required on the fly

Online algorithms do not require any distributional assumption

Applicable in adversarial environments

Simple algorithms

Theoretical guarantees


Reference I

Mohri, M., Rostamizadeh, A., and Talwalkar, A. (2012). Foundations of Machine Learning. MIT Press.

Shalev-Shwartz, S. and Singer, Y. (2008). Tutorial on theory and applications of online learning. http://ttic.uchicago.edu/~shai/icml08tutorial/OLtutorial.pdf. [Online; accessed 15-Mar-2015].
