Online Learning in Dynamic Environment

Lijun Zhang

Dept. of Computer Science and Technology, Nanjing University, China

September 29, 2016

http://cs.nju.edu.cn/zlj

Learning And Mining from DatA (LAMDA), Nanjing University


Page 2

Outline

1 Introduction: Online Learning; Regret

2 Online Learning in Dynamic Environment: Dynamic Regret; Online Multiple Gradient Descent; Online Multiple Newton Update

3 Conclusion and Future Work

Page 3

Big Data

http://www.ibmbigdatahub.com/infographic/four-vs-big-data


Papers about Online Learning in 2015

ICML 17+, NIPS 25+, COLT 13+

Page 6

Online Learning

Online Learning [Shalev-Shwartz, 2011]

Online learning is the process of answering a sequence of questions given (maybe partial) knowledge of answers to previous questions and possibly additional information.

Running example: multi-class classification. The learner outputs a predicted label each round. In full-information online learning it then observes the ground truth; in bandit online learning it only observes whether the prediction was correct or not.

Page 10

Formal Definitions

Online Learning

1: for t = 1, 2, . . . , T do
2:     Learner picks a decision x_t \in \mathcal{X}; Adversary chooses a function f_t(\cdot)
3:     Learner suffers loss f_t(x_t) and updates x_t
4: end for

(In online classification: the learner plays a classifier, the adversary supplies an example, and the learner suffers a loss and updates.)

Cumulative Loss

\text{Cumulative Loss} = \sum_{t=1}^{T} f_t(x_t)

Full-Information Online Learning: the learner observes the function f_t(\cdot) (online first-order optimization).

Bandit Online Learning: the learner only observes the value f_t(x_t) (online zero-order optimization).
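The protocol can be sketched in a few lines of code. This is a minimal simulation under illustrative assumptions: the quadratic losses and the naive follow-the-last-minimizer learner are invented for the example, not taken from the slides.

```python
# Minimal sketch of the online learning protocol (full-information setting).
# The quadratic losses and the naive "follow the last minimizer" learner are
# illustrative assumptions, not part of the original slides.

def online_learning(T, pick, reveal):
    """Run the protocol: pick x_t, observe f_t, suffer f_t(x_t)."""
    cumulative_loss = 0.0
    history = []
    for t in range(1, T + 1):
        x_t = pick(history)          # learner picks a decision
        f_t = reveal(t)              # adversary chooses a function
        cumulative_loss += f_t(x_t)  # learner suffers the loss
        history.append((x_t, f_t))   # full information: f_t itself is observed
    return cumulative_loss

# Adversary: f_t(x) = (x - c)^2 with a fixed target c = 1.
reveal = lambda t: (lambda x, c=1.0: (x - c) ** 2)

# Learner: play the minimizer of the last observed loss (0 on round 1),
# found here by scanning a grid so the sketch needs no calculus.
def pick(history):
    if not history:
        return 0.0
    _, f_last = history[-1]
    return min((x / 100.0 for x in range(-200, 201)), key=f_last)

loss = online_learning(10, pick, reveal)
print(loss)  # the first round costs (0 - 1)^2 = 1, later rounds cost ~0
```

Because the loss function is fixed, a learner that imitates the last round is already near-optimal; the interesting regimes in the rest of the talk are exactly those where f_t changes over time.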

Page 16

Regret

Cumulative Loss

\text{Cumulative Loss} = \sum_{t=1}^{T} f_t(x_t)

Regret

\text{Regret} = \underbrace{\sum_{t=1}^{T} f_t(x_t)}_{\text{Cumulative Loss of Online Learner}} - \underbrace{\min_{x \in \mathcal{X}} \sum_{t=1}^{T} f_t(x)}_{\text{Minimal Loss of Batch Learner}}

Hannan Consistency

\limsup_{T \to \infty} \frac{1}{T} \left( \sum_{t=1}^{T} f_t(x_t) - \min_{x \in \mathcal{X}} \sum_{t=1}^{T} f_t(x) \right) = 0, \quad \text{with probability } 1

or equivalently

\sum_{t=1}^{T} f_t(x_t) - \min_{x \in \mathcal{X}} \sum_{t=1}^{T} f_t(x) = o(T), \quad \text{with probability } 1

Page 21

State-of-the-Art in Full-Information Setting (I)

Online Gradient Descent

1: for t = 1, 2, . . . , T do
2:     Learner picks a decision x_t \in \mathcal{X}; Adversary chooses a function f_t(\cdot)
3:     Learner suffers loss f_t(x_t) and updates
           x_{t+1} = \Pi_{\mathcal{X}}(x_t - \eta_t \nabla f_t(x_t))
4: end for

At round t+1 the ideal update would be x_{t+1} = \operatorname{argmin}_{x \in \mathcal{X}} f_{t+1}(x), or at least a gradient step on the coming loss, x_{t+1} = \Pi_{\mathcal{X}}(x_t - \eta_t \nabla f_{t+1}(x_t)). Since f_{t+1} is unknown when x_{t+1} must be chosen, online gradient descent instead takes the gradient step on the last observed function f_t.
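A concrete sketch of the update, under illustrative assumptions: 1-D quadratic losses, the domain X = [-1, 1] (so the projection is a clamp), and the decreasing step size η_t = 1/√t used in the convex-case analysis.

```python
import math

# Projected online gradient descent sketch:
#   x_{t+1} = Pi_X(x_t - eta_t * grad f_t(x_t)).
# The interval domain and the quadratic losses are illustrative assumptions.

def project(x, lo=-1.0, hi=1.0):
    """Euclidean projection onto the interval X = [lo, hi] (a clamp in 1-D)."""
    return max(lo, min(hi, x))

def ogd(grads, x0=0.0):
    """Run OGD with eta_t = 1/sqrt(t); return the iterates x_1, ..., x_T."""
    xs, x = [], x0
    for t, grad in enumerate(grads, start=1):
        xs.append(x)                      # x_t is committed before f_t is seen
        eta = 1.0 / math.sqrt(t)          # decreasing step size
        x = project(x - eta * grad(x))    # gradient step on the *observed* f_t
    return xs

# Losses f_t(x) = (x - 0.5)^2 every round; gradient is 2 * (x - 0.5).
T = 200
grads = [(lambda x: 2.0 * (x - 0.5)) for _ in range(T)]
xs = ogd(grads)

regret = sum((x - 0.5) ** 2 for x in xs)  # comparator x* = 0.5 has zero loss
print(regret / T)  # average regret shrinks as T grows
```

With a stationary loss the iterates settle at the comparator and the average regret decays, matching the O(√T) total-regret guarantee for convex functions.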

Page 27

State-of-the-Art in Full-Information Setting (II)

Convex Functions [Zinkevich, 2003]

Online Gradient Descent with \eta_t = 1/\sqrt{t}:

\sum_{t=1}^{T} f_t(x_t) - \min_{x \in \mathcal{X}} \sum_{t=1}^{T} f_t(x) \le \frac{D^2}{2} \sqrt{T} + \left( \sqrt{T} - \frac{1}{2} \right) G^2 = O(\sqrt{T})

Strongly Convex Functions [Hazan et al., 2007]

Online Gradient Descent with \eta_t = 1/(\lambda t):

\sum_{t=1}^{T} f_t(x_t) - \min_{x \in \mathcal{X}} \sum_{t=1}^{T} f_t(x) \le \frac{G^2}{2\lambda} (1 + \log T) = O(\log T)

Further settings: exponentially concave functions [Hazan et al., 2007]; convex and smooth functions [Srebro et al., 2010]; sparse online learning [Langford et al., 2009]; online classification [Freund and Schapire, 1999].

Two Characteristics

Updating only once per round; a decreasing step size.

Page 32

A Paradox

The Ideal Algorithm

x_{t+1} = \Pi_{\mathcal{X}}(x_t - \eta_t \nabla f_{t+1}(x_t))

Online Gradient Descent is Imperfect

x_{t+1} = \Pi_{\mathcal{X}}(x_t - \eta_t \nabla f_t(x_t))

It may fail if f_t and f_{t+1} differ much. Yet its theoretical guarantees are strong:

\sum_{t=1}^{T} f_t(x_t) - \min_{x \in \mathcal{X}} \sum_{t=1}^{T} f_t(x) \le \begin{cases} O(\sqrt{T}), & \text{convex functions} \\ O(\log T), & \text{strongly convex functions} \end{cases}

Page 34

Dynamic Regret

Regret → Static Regret

\text{Regret} = \sum_{t=1}^{T} f_t(x_t) - \min_{x \in \mathcal{X}} \sum_{t=1}^{T} f_t(x)

The Rationale of Regret: a single fixed decision is reasonably good during all T rounds.

Dynamic Environment: different decisions will be good in different periods, e.g., online tracking, online advertising, portfolio investment.

Dynamic Regret

\text{Dynamic Regret} = \sum_{t=1}^{T} f_t(x_t) - \sum_{t=1}^{T} \min_{x \in \mathcal{X}} f_t(x)
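The difference between the two comparators can be seen numerically. In this hypothetical drifting example, f_t(x) = (x − c_t)^2 with a target that alternates between −1 and +1, so the best fixed decision is x = 0 while the per-round minimizers move every round.

```python
# Static vs. dynamic regret on a drifting loss sequence. Illustrative
# assumption: f_t(x) = (x - c_t)^2 with a moving target c_t.

def static_vs_dynamic(xs, cs):
    """Return (static regret, dynamic regret) for plays xs against targets cs."""
    losses = sum((x - c) ** 2 for x, c in zip(xs, cs))
    # static comparator: the single best fixed x is the mean of the targets
    x_bar = sum(cs) / len(cs)
    static = losses - sum((x_bar - c) ** 2 for c in cs)
    # dynamic comparator: min_x f_t(x) = 0 at x = c_t in every round
    dynamic = losses - 0.0
    return static, dynamic

cs = [(-1.0) ** t for t in range(1, 11)]   # target alternates between -1 and +1
xs = [0.0] * 10                            # learner always plays 0
static, dynamic = static_vs_dynamic(xs, cs)
print(static, dynamic)  # → 0.0 10.0
```

Here the static regret of always playing 0 is exactly zero, yet the dynamic regret equals T: a small static regret says nothing about tracking a moving minimizer.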

Page 39

The Challenge

Sublinear Dynamic Regret is Impossible in General!

An Example: A Random Adversary

1: for t = 1, 2, . . . , T do
2:     Learner picks a decision x_t \in \mathbb{R}
3:     Adversary chooses f_t(x) = (x - 1)^2 or f_t(x) = (x + 1)^2 uniformly at random
4: end for

The expected dynamic regret at the t-th round is

\frac{1}{2} \left( (x_t - 1)^2 + (x_t + 1)^2 \right) - \frac{1}{2} \left( \min_{x} (x - 1)^2 + \min_{x} (x + 1)^2 \right) = x_t^2 + 1 \ge 1

and summing over the T rounds gives

\mathbb{E} \left[ \sum_{t=1}^{T} f_t(x_t) - \sum_{t=1}^{T} \min_{x} f_t(x) \right] \ge T
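The lower bound is easy to check by simulation. The learner below, which replays the previous round's minimizer, is an illustrative choice; by the slide's calculation any strategy suffers expected per-round dynamic regret x_t^2 + 1 ≥ 1.

```python
import random

# Simulate the random adversary: f_t(x) = (x - 1)^2 or (x + 1)^2 with prob. 1/2.
# Whatever x_t the learner plays, the conditional expected dynamic regret of
# round t is ((x_t - 1)^2 + (x_t + 1)^2) / 2 - 0 = x_t^2 + 1 >= 1.

def simulate(T, seed=0):
    rng = random.Random(seed)
    dynamic_regret = 0.0
    x = 0.0                                # learner's current decision
    for _ in range(T):
        c = rng.choice([1.0, -1.0])        # adversary picks the minimizer +-1
        dynamic_regret += (x - c) ** 2     # min_x (x - c)^2 = 0 each round
        x = c                              # replay the previous minimizer
    return dynamic_regret

T = 1000
print(simulate(T) / T)  # per-round dynamic regret stays bounded away from 0
```

For this particular learner the per-round cost is 0 when the coin repeats and 4 when it flips, so the average hovers around 2, comfortably above the lower bound of 1.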

Page 43

Make the Impossible Possible

Dynamic Regret

\sum_{t=1}^{T} f_t(x_t) - \sum_{t=1}^{T} \min_{x \in \mathcal{X}} f_t(x) = \sum_{t=1}^{T} f_t(x_t) - \sum_{t=1}^{T} f_t(x_t^*)

where x_t^* \in \operatorname{argmin}_{x \in \mathcal{X}} f_t(x).

Assumptions

Functional Variation [Besbes et al., 2015]

F_T := \sum_{t=2}^{T} \| f_t(\cdot) - f_{t-1}(\cdot) \|_\infty = \sum_{t=2}^{T} \max_{x \in \mathcal{X}} | f_t(x) - f_{t-1}(x) |

Path-length [Mokhtari et al., 2016]

P_T^* := \sum_{t=2}^{T} \| x_t^* - x_{t-1}^* \|
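Both measures are straightforward to compute for a toy sequence. Illustrative assumptions: f_t(x) = (x − c_t)^2 on X = [−1, 1] with a slowly drifting target c_t, the sup-norm in F_T approximated on a grid, and the closed-form minimizers x_t^* = c_t used for P_T^*.

```python
# Compute the functional variation F_T (grid approximation of the sup-norm)
# and the path-length P_T^* for a drifting quadratic sequence. Illustrative
# assumptions: f_t(x) = (x - c_t)^2, domain X = [-1, 1], minimizers x_t^* = c_t.

def functional_variation(cs, grid_size=201):
    """F_T = sum_t max_{x in X} |f_t(x) - f_{t-1}(x)|, maximized over a grid."""
    grid = [-1.0 + 2.0 * i / (grid_size - 1) for i in range(grid_size)]
    F = 0.0
    for c_prev, c in zip(cs, cs[1:]):
        F += max(abs((x - c) ** 2 - (x - c_prev) ** 2) for x in grid)
    return F

def path_length(cs):
    """P_T^* = sum_t |x_t^* - x_{t-1}^*|, using x_t^* = c_t in closed form."""
    return sum(abs(c - c_prev) for c_prev, c in zip(cs, cs[1:]))

T = 100
cs = [0.5 * t / T for t in range(1, T + 1)]   # target drifts from 0.005 to 0.5
print(functional_variation(cs), path_length(cs))
```

Note the two measures need not agree: F_T compares whole functions in the sup-norm, while P_T^* only tracks how far the minimizers move.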

Page 45

Existing Results (I)

Functional Variation [Besbes et al., 2015]

For any f_1, . . . , f_T such that

F_T := \sum_{t=2}^{T} \max_{x \in \mathcal{X}} | f_t(x) - f_{t-1}(x) | \le V_T

the restarted online gradient descent (with input V_T) satisfies

\text{Dynamic Regret} \le \begin{cases} O(T^{2/3} V_T^{1/3}), & \text{convex functions} \\ O(\log T \sqrt{T V_T}), & \text{strongly convex functions} \end{cases}

The dynamic regret is sublinear whenever V_T is sublinear. However, the algorithm is not adaptive: it requires the bound V_T as input.

Page 47

Existing Results (II)

Path-length [Zinkevich, 2003, Mokhtari et al., 2016, Yang et al., 2016]

Online gradient descent satisfies

\text{Dynamic Regret} \le \begin{cases} O(\sqrt{T} \, P_T^*), & \text{convex functions} \\ O(P_T^*), & \text{strongly convex and smooth functions} \\ O(P_T^*), & \text{convex and smooth functions} \end{cases}

For any f_1, . . . , f_T such that

P_T^* := \sum_{t=2}^{T} \| x_t^* - x_{t-1}^* \| \le B_T

online gradient descent (with input B_T) satisfies

\text{Dynamic Regret} \le O(\sqrt{T B_T}), \quad \text{convex functions}

Page 48

Our Contribution [Zhang et al., 2016]

Squared Path-length

S_T^* := \sum_{t=2}^{T} \| x_t^* - x_{t-1}^* \|^2

Example 1 (Path-length vs Squared Path-length)

Suppose \| x_t^* - x_{t-1}^* \| = T^{-\alpha} for all t \ge 1 and \alpha > 0. Then

S_{T+1}^* = T^{1-2\alpha} \ll P_{T+1}^* = T^{1-\alpha}.

In particular, when \alpha = 1/2, we have S_{T+1}^* = 1 \ll P_{T+1}^* = \sqrt{T}.

Online Multiple Gradient Descent

For strongly convex and smooth functions, and for semi-strongly convex and smooth functions:

\text{Dynamic Regret} \le O(\min(P_T^*, S_T^*))

Online Multiple Newton Update

For self-concordant functions:

\text{Dynamic Regret} \le O(\min(P_T^*, S_T^*))
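A minimal sketch of the idea behind Online Multiple Gradient Descent: after observing f_t, take several projected gradient steps rather than one, so the iterate tracks the moving minimizer. The losses, constant step size, and number of inner steps below are illustrative assumptions, not the tuned constants of [Zhang et al., 2016].

```python
# Sketch of the "multiple updates per round" idea behind OMGD: after observing
# f_t, take K projected gradient steps instead of one, so x_{t+1} stays close
# to the moving minimizer x_t^*. All constants here are illustrative.

def project(x, lo=-1.0, hi=1.0):
    """Euclidean projection onto the interval X = [lo, hi]."""
    return max(lo, min(hi, x))

def omgd(cs, K=5, eta=0.25):
    """Losses f_t(x) = (x - c_t)^2 (2-strongly convex, 2-smooth); grad = 2(x - c_t)."""
    x, plays = 0.0, []
    for c in cs:
        plays.append(x)                        # x_t is committed before f_t is seen
        for _ in range(K):                     # K inner gradient steps on f_t
            x = project(x - eta * 2.0 * (x - c))
    return plays

cs = [0.5 + 0.001 * t for t in range(100)]     # slowly moving minimizers
plays = omgd(cs)
dynamic_regret = sum((x - c) ** 2 for x, c in zip(plays, cs))
print(dynamic_regret)  # small: dominated by the first round's miss
```

Each inner step contracts the distance to c_t by a constant factor (here 1/2), so after K steps the iterate is within roughly 2^-K of the current minimizer; the dynamic regret is then governed by how far the minimizers move between rounds.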



Strongly Convex and Smooth Functions

Definition 1
A function f : X ↦ R is λ-strongly convex if
f(y) ≥ f(x) + ⟨∇f(x), y − x⟩ + (λ/2)‖y − x‖², ∀x, y ∈ X.

Definition 2
A function f : X ↦ R is L-smooth if
f(y) ≤ f(x) + ⟨∇f(x), y − x⟩ + (L/2)‖y − x‖², ∀x, y ∈ X.

Example 2 (Strongly convex and smooth functions)
A quadratic form f(x) = x⊤Ax − 2b⊤x + c, where αI ⪯ A ⪯ βI, α > 0, and β < ∞;
The regularized logistic loss f(x) = log(1 + exp(b⊤x)) + (λ/2)‖x‖², where λ > 0.
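Since ∇²f(x) = 2A for the quadratic in Example 2, it is 2α-strongly convex and 2β-smooth. The quick NumPy check below (an illustration, not part of the slides) verifies the inequalities of Definitions 1 and 2 at random points:

```python
# Check Definitions 1 and 2 for f(x) = x^T A x - 2 b^T x + c with
# alpha*I <= A <= beta*I: here lambda = 2*alpha and L = 2*beta.
import numpy as np

rng = np.random.default_rng(0)
d = 5
Q = np.linalg.qr(rng.standard_normal((d, d)))[0]   # random orthogonal matrix
eigs = rng.uniform(0.5, 2.0, d)                    # eigenvalues in [alpha, beta]
A = Q @ np.diag(eigs) @ Q.T
b = rng.standard_normal(d)
lam, L = 2 * eigs.min(), 2 * eigs.max()            # strong convexity / smoothness

f = lambda x: x @ A @ x - 2 * b @ x + 1.0
grad = lambda x: 2 * A @ x - 2 * b

for _ in range(100):
    x, y = rng.standard_normal(d), rng.standard_normal(d)
    gap = f(y) - f(x) - grad(x) @ (y - x)          # equals (y - x)^T A (y - x)
    assert gap >= lam / 2 * np.sum((y - x) ** 2) - 1e-9   # Definition 1
    assert gap <= L / 2 * np.sum((y - x) ** 2) + 1e-9     # Definition 2
```

For a quadratic, the first-order remainder f(y) − f(x) − ⟨∇f(x), y − x⟩ equals (y − x)⊤A(y − x) exactly, so both inequalities reduce to the eigenvalue bounds on A.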



Online Multiple Gradient Descent

The Algorithm
1: Let x₁ be any point in X
2: for t = 1, …, T do
3:   z¹_t = x_t
4:   for j = 1, …, K do
5:     z^{j+1}_t = Π_X( z^j_t − η∇f_t(z^j_t) )
6:   end for
7:   x_{t+1} = z^{K+1}_t
8: end for

Multiple updates in each round
A constant step size
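The loop above can be sketched in a few lines of NumPy (an illustration, not the authors' code). The sketch assumes X = R^d, so the projection Π_X is the identity, and runs on hypothetical drifting quadratics f_t(x) = ‖x − m_t‖² whose minimizer m_t moves by 1/T per coordinate per round:

```python
# Online Multiple Gradient Descent, unconstrained case (Pi_X = identity).
import numpy as np

def omgd(grad, T, d, K, eta):
    """Run OMGD; grad(t, z) returns the gradient of f_t at z."""
    x = np.zeros(d)                    # x_1: any point in X
    played = []
    for t in range(T):
        played.append(x.copy())
        z = x.copy()
        for _ in range(K):             # K inner gradient steps on f_t
            z = z - eta * grad(t, z)   # projection omitted: X = R^d here
        x = z                          # x_{t+1} = z_t^{K+1}
    return played

T, d = 200, 2
minimizers = np.cumsum(np.full((T, d), 1.0 / T), axis=0)  # drifting m_t
grad = lambda t, z: 2 * (z - minimizers[t])
xs = omgd(grad, T, d, K=5, eta=0.5)
regret = sum(np.sum((x - m) ** 2) for x, m in zip(xs, minimizers))
print(regret)   # ~ 2/T: on the order of S*_T, far below P*_T ~ sqrt(2)
```

Each f_t here is 2-strongly convex and 2-smooth, so η = 0.5 = 1/L matches the step-size condition of Theorem 1, and the realized dynamic regret tracks the squared path-length rather than the path-length.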



Theoretical Guarantees

Theorem 1
By setting η ≤ 1/L and K = ⌈((1/η + λ)/(2λ)) ln 4⌉, we have

∑_{t=1}^T f_t(x_t) − f_t(x*_t)
≤ min( 2G P*_T + 2G‖x₁ − x*₁‖,
       (1/(2α)) ∑_{t=1}^T ‖∇f_t(x*_t)‖² + 2(L + α) S*_T + (L + α)‖x₁ − x*₁‖² )

Corollary 1
Suppose ∑_{t=1}^T ‖∇f_t(x*_t)‖² ≤ O(S*_T); then

∑_{t=1}^T f_t(x_t) − f_t(x*_t) ≤ O(min(P*_T, S*_T)).

Corollary 1
In particular, if x*_t belongs to the relative interior of X for all t (so that ∇f_t(x*_t) = 0),

∑_{t=1}^T f_t(x_t) − f_t(x*_t) ≤ min( 2G P*_T + 2G‖x₁ − x*₁‖, 2L S*_T + L‖x₁ − x*₁‖² ).


Semi-strongly Convex and Smooth Functions

Definition 3 ([Gong and Ye, 2014])
A function f : X ↦ R is semi-strongly convex over X if
f(x) − min_{x∈X} f(x) ≥ (β/2)‖x − Π_{X*}(x)‖², ∀x ∈ X,
where X* = {x ∈ X : f(x) ≤ min_{x′∈X} f(x′)} is the set of minimizers.

Example 3 (Semi-strongly convex and smooth functions)
min_{x∈X⊆R^d} f(x) = g(Ex) + b⊤x,
where g(·) is strongly convex and smooth, and X is either R^d or a polyhedral set [Luo and Tseng, 1993].
Semi-strong convexity and the error bound property are equivalent up to constant factors [Necoara et al., 2015].
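A concrete instance in the spirit of Example 3, with a rank-deficient E (a sketch, not from the slides): f(x) = ‖Ex‖² is not strongly convex, because its minimizers form the whole null space of E, yet it is semi-strongly convex with β = 2σ², where σ is the smallest nonzero singular value of E:

```python
# f(x) = ||E x||^2 with rank(E) < d: X* = null(E) is a subspace, so f is not
# strongly convex, but f(x) - 0 >= (beta/2) * dist(x, X*)^2 with
# beta = 2 * sigma_min+^2 (smallest nonzero singular value squared).
import numpy as np

rng = np.random.default_rng(1)
d, r = 6, 3
E = rng.standard_normal((r, d))          # rank 3 < d = 6 (generically)
f = lambda x: np.sum((E @ x) ** 2)       # min over R^d is 0, attained on null(E)

U, s, Vt = np.linalg.svd(E, full_matrices=False)
beta = 2 * s.min() ** 2
row_proj = Vt.T @ Vt                     # projector onto row(E) = null(E)^perp

for _ in range(100):
    x = rng.standard_normal(d)
    dist2 = np.sum((row_proj @ x) ** 2)  # ||x - Pi_{X*}(x)||^2
    assert f(x) >= beta / 2 * dist2 - 1e-9   # semi-strong convexity
```

The inequality is tight along the singular direction of E with smallest nonzero singular value, so β here cannot be improved.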


Theoretical Guarantees

Theorem 2
By setting η ≤ 1/L and K = ⌈((1/η + β)/β) ln 4⌉, we have

∑_{t=1}^T f_t(x_t) − ∑_{t=1}^T min_{x∈X} f_t(x)
≤ min( 2G P*_T + 2G‖x₁ − x̄₁‖,
       (1/(2α)) G*_T + 2(L + α) S*_T + (L + α)‖x₁ − x̄₁‖² ),

where G*_T = max_{{x*_t ∈ X*_t}_{t=1}^T} ∑_{t=1}^T ‖∇f_t(x*_t)‖² and x̄₁ = Π_{X*₁}(x₁).

Corollary 2
Suppose G*_T ≤ O(S*_T); then

∑_{t=1}^T f_t(x_t) − f_t(x*_t) ≤ O(min(P*_T, S*_T)).


Self-concordant Functions

Definition 4 ([Nemirovski, 2004])
A function f : X ↦ R is called self-concordant on X if
f(x_i) → ∞ along every sequence {x_i ∈ X} converging, as i → ∞, to a boundary point of X;
f satisfies the differential inequality
|D³f(x)[h, h, h]| ≤ 2( h⊤∇²f(x)h )^{3/2}
for all x ∈ X and all h ∈ R^d.

Example 4 (Self-concordant Functions)
The function f(x) = −log x;
A convex quadratic form f(x) = x⊤Ax − 2b⊤x + c, where A ∈ R^{d×d}, b ∈ R^d, and c ∈ R;
If f : R^d ↦ R is self-concordant, and A ∈ R^{d×k}, b ∈ R^d, then f(Ax + b) is self-concordant.


Online Multiple Newton Update

The Algorithm
1: Let x₁ be any point in X₁
2: for t = 1, …, T do
3:   z¹_t = x_t
4:   for j = 1, …, K do
5:     z^{j+1}_t = z^j_t − (1/(1 + λ_t(z^j_t))) [∇²f_t(z^j_t)]^{−1} ∇f_t(z^j_t),
     where the Newton decrement λ_t(z^j_t) = ( ∇f_t(z^j_t)⊤ [∇²f_t(z^j_t)]^{−1} ∇f_t(z^j_t) )^{1/2}
6:   end for
7:   x_{t+1} = z^{K+1}_t
8: end for
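The damped Newton loop above can be sketched as follows (an illustration, not the authors' code). It assumes unconstrained, hypothetical self-concordant losses f_t(x) = ⟨a_t, x⟩ − ∑_i log x_i on x > 0, with a slowly drifting a_t whose minimizer is x*_t = 1/a_t:

```python
# Online Multiple Newton Update with the damped step 1/(1 + lambda_t).
import numpy as np

def omnu(grad, hess, T, x1, K):
    """grad(t, z), hess(t, z) give the gradient/Hessian of f_t at z."""
    x, played = x1.astype(float), []
    for t in range(T):
        played.append(x.copy())
        z = x.copy()
        for _ in range(K):
            g, H = grad(t, z), hess(t, z)
            step = np.linalg.solve(H, g)      # [hess f_t]^{-1} grad f_t
            lam = np.sqrt(g @ step)           # Newton decrement lambda_t(z)
            z = z - step / (1.0 + lam)        # damped Newton update
        x = z
    return played

T, d = 50, 3
rng = np.random.default_rng(2)
a = 1.0 + 0.01 * rng.standard_normal((T, d))  # slowly drifting a_t
grad = lambda t, z: a[t] - 1.0 / z
hess = lambda t, z: np.diag(1.0 / z**2)
xs = omnu(grad, hess, T, x1=np.ones(d), K=5)
err = np.max(np.abs(xs[-1] - 1.0 / a[T - 2]))
print(err)   # small: the iterates track the moving minimizer x*_t = 1/a_t
```

With the comparator drifting by only ~0.01 per round, K = 5 damped Newton steps suffice for the iterate to lock onto each x*_t, matching the small-drift assumption of Theorem 3.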


Theoretical Guarantees

Theorem 3
Assume ‖x*_{t−1} − x*_t‖²_t ≤ 1/144, ∀t ≥ 2, where ‖·‖_t denotes the local norm associated with f_t.
When t = 1, we choose K = O(1)·(f₁(x₁) − f₁(x*₁) + log log μ) such that ‖x₂ − x*₁‖²₁ ≤ 1/(144μ).
For t ≥ 2, we set K = ⌈log₄(16μ)⌉; then

∑_{t=1}^T f_t(x_t) − f_t(x*_t) ≤ min( 16 P*_T, 4 S*_T + 1/36 ) + f₁(x₁) − f₁(x*₁).

As a result,

∑_{t=1}^T f_t(x_t) − f_t(x*_t) ≤ O(min(P*_T, S*_T)).

Conclusion and Future Work

Conclusion
Squared path-length: S*_T := ∑_{t=2}^T ‖x*_t − x*_{t−1}‖²
Online multiple gradient descent (strongly/semi-strongly convex and smooth functions):
∑_{t=1}^T f_t(x_t) − f_t(x*_t) ≤ O(min(P*_T, S*_T))
Online multiple Newton update (self-concordant functions)

Future Work
A deeper understanding of functional variation, path-length, and squared path-length
Beyond dynamic regret
Static regret of semi-strongly convex functions

Reference I

Besbes, O., Gur, Y., and Zeevi, A. (2015). Non-stationary stochastic optimization. Operations Research, 63(5):1227–1244.

Freund, Y. and Schapire, R. E. (1999). Large margin classification using the perceptron algorithm. Machine Learning, 37(3):277–296.

Gong, P. and Ye, J. (2014). Linear convergence of variance-reduced stochastic gradient without strong convexity. ArXiv e-prints, arXiv:1406.1102.

Hazan, E., Agarwal, A., and Kale, S. (2007). Logarithmic regret algorithms for online convex optimization. Machine Learning, 69(2-3):169–192.

Langford, J., Li, L., and Zhang, T. (2009). Sparse online learning via truncated gradient. Journal of Machine Learning Research, 10:777–801.

Luo, Z.-Q. and Tseng, P. (1993). Error bounds and convergence analysis of feasible descent methods: a general approach. Annals of Operations Research, 46-47(1):157–178.

Mokhtari, A., Shahrampour, S., Jadbabaie, A., and Ribeiro, A. (2016). Online optimization in dynamic environments: Improved regret rates for strongly convex problems. ArXiv e-prints, arXiv:1603.04954.

Thanks!

Reference II

Necoara, I., Nesterov, Y., and Glineur, F. (2015). Linear convergence of first order methods for non-strongly convex optimization. ArXiv e-prints, arXiv:1504.06298.

Nemirovski, A. (2004). Interior point polynomial time methods in convex programming. Lecture notes, Technion – Israel Institute of Technology.

Shalev-Shwartz, S. (2011). Online learning and online convex optimization. Foundations and Trends in Machine Learning, 4(2):107–194.

Srebro, N., Sridharan, K., and Tewari, A. (2010). Smoothness, low noise and fast rates. In Advances in Neural Information Processing Systems 23, pages 2199–2207.

Yang, T., Zhang, L., Jin, R., and Yi, J. (2016). Tracking slowly moving clairvoyant: Optimal dynamic regret of online learning with true and noisy gradient. In Proceedings of the 33rd International Conference on Machine Learning.

Zhang, L., Yang, T., Yi, J., Jin, R., and Zhou, Z.-H. (2016). Improved dynamic regret for non-degenerate functions. ArXiv e-prints, arXiv:1608.03933.

Zinkevich, M. (2003). Online convex programming and generalized infinitesimal gradient ascent. In Proceedings of the 20th International Conference on Machine Learning, pages 928–936.