Tutorial: PART 1
Optimization for machine learning
Elad Hazan, Princeton University
+ help from Sanjeev Arora, Yoram Singer
ML paradigm
Distribution over {a} ∈ R^d
Label b = f_parameters(a) (e.g. chair / car)
Machine
This tutorial: training the machine
• Efficiency
• Generalization
Agenda
1. Learning as mathematical optimization
   • Stochastic optimization, ERM, online regret minimization
   • Offline/online/stochastic gradient descent
2. Regularization
   • AdaGrad and optimal regularization
3. Gradient Descent++
   • Frank-Wolfe, acceleration, variance reduction, second-order methods, non-convex optimization

NOT touched upon:
• Parallelism / distributed computation (asynchronous optimization, HOGWILD! etc.), Bayesian inference in graphical models, Markov chain Monte Carlo, partial information and bandit algorithms
Mathematical optimization
Input: function f: K ↦ R, for K ⊆ R^d
Output: minimizer x ∈ K, such that f(x) ≤ f(y) ∀y ∈ K
How do we access f? (values, differentials, …)
Generally NP-hard, given full access to the function.
What is Optimization
But generally speaking... we're screwed:
• Local (non-global) minima of f₀
• All kinds of constraints (even restricting to continuous functions): h(x) = sin(2πx) = 0
[Figure: 3D surface plot of a non-convex objective over [−3, 3]², values ranging from −50 to 250]
Duchi (UC Berkeley), Convex Optimization for Machine Learning, Fall 2009
Learning = optimization over data (a.k.a. Empirical Risk Minimization)
Fitting the parameters of the model ("training") = optimization problem:

arg min_{x ∈ R^d} (1/m) Σ_{i=1..m} ℓ_i(x, a_i, b_i) + R(x)

m = # of examples, (a, b) = (features, label), d = dimension
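As a concrete instance, the ERM objective above can be written directly in code. This is a minimal sketch, assuming squared loss per example and an ℓ2 regularizer R(x) = λ‖x‖²; the function name and the tiny data set are illustrative only:

```python
import numpy as np

def erm_objective(x, A, b, lam=0.1):
    """Empirical risk: (1/m) * sum_i loss(x, a_i, b_i) + R(x).

    Assumed instance: squared per-example loss and R(x) = lam * ||x||^2.
    A is the m x d matrix whose rows are the feature vectors a_i,
    b the vector of labels b_i.
    """
    residuals = A @ x - b                 # x^T a_i - b_i for each example
    empirical_risk = np.mean(residuals ** 2)
    regularizer = lam * np.dot(x, x)
    return empirical_risk + regularizer

# Tiny synthetic problem: m = 4 examples in d = 2 dimensions.
A = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [1.0, -1.0]])
b = np.array([1.0, -1.0, 0.0, 2.0])
value = erm_objective(np.zeros(2), A, b)  # risk of the all-zeros model
```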
Example: linear classification
Given a sample S = {(a_1, b_1), …, (a_m, b_m)}, find a hyperplane (through the origin w.l.o.g.) such that:

x = arg min_{‖x‖≤1} # of mistakes
  = arg min_{‖x‖≤1} |{i s.t. sign(x^T a_i) ≠ b_i}|
  = arg min_{‖x‖≤1} (1/m) Σ_i ℓ(x, a_i, b_i), for ℓ(x, a_i, b_i) = 1 if sign(x^T a_i) ≠ b_i, 0 if sign(x^T a_i) = b_i

NP-hard!
Sum of signs → global optimization NP-hard! But locally verifiable…
Is there a local property that ensures global optimality?
Convexity
A function f: R^d ↦ R is convex if and only if:
f(½x + ½y) ≤ ½f(x) + ½f(y)
• Informally: smiley :)
• Alternative definition:
f(y) ≥ f(x) + ∇f(x)^T (y − x)
Convex sets
A set K is convex if and only if:
x, y ∈ K ⇒ (½x + ½y) ∈ K
Loss functions ℓ(x, a_i, b_i) = ℓ(x^T a_i ⋅ b_i)
Convex relaxations for linear (& kernel) classification
For the hard problem x = arg min_{‖x‖≤1} |{i s.t. sign(x^T a_i) ≠ b_i}|, replace the 0-1 loss by a convex surrogate:
1. Ridge / linear regression: ℓ(x^T a_i, b_i) = (x^T a_i − b_i)²
2. SVM: ℓ(x^T a_i, b_i) = max{0, 1 − b_i x^T a_i}
3. Logistic regression: ℓ(x^T a_i, b_i) = log(1 + e^{−b_i ⋅ x^T a_i})
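The three surrogate losses can be sketched as plain functions of the prediction x^T a_i and label b_i (a minimal sketch; the function names are illustrative, not from the slides):

```python
import numpy as np

def ridge_loss(pred, b):
    """Squared loss: (x^T a_i - b_i)^2."""
    return (pred - b) ** 2

def hinge_loss(pred, b):
    """SVM hinge loss: max{0, 1 - b_i * x^T a_i}."""
    return max(0.0, 1.0 - b * pred)

def logistic_loss(pred, b):
    """Logistic loss: log(1 + e^{-b_i * x^T a_i})."""
    return float(np.log1p(np.exp(-b * pred)))

# All three are convex in the prediction, unlike the 0-1 loss.
margin_correct = logistic_loss(2.0, 1.0)   # confidently correct: small loss
margin_wrong = logistic_loss(-2.0, 1.0)    # confidently wrong: large loss
```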
We have: cast learning as mathematical optimization, and argued convexity is algorithmically important.
Next → algorithms!
Gradient descent, constrained set
y_{t+1} ← x_t − η∇f(x_t)
x_{t+1} = arg min_{x ∈ K} |y_{t+1} − x|
where [∇f(x)]_i = ∂f(x)/∂x_i
Convergence of gradient descent
y_{t+1} ← x_t − η∇f(x_t)
x_{t+1} = arg min_{x ∈ K} |y_{t+1} − x|

Theorem: for step size η = D/(G√T),
f((1/T) Σ_t x_t) ≤ min_{x* ∈ K} f(x*) + DG/√T
where:
• G = upper bound on the norm of the gradients: |∇f(x_t)| ≤ G
• D = diameter of the constraint set: ∀x, y ∈ K, |x − y| ≤ D
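The projected update above might look as follows in code. This is a sketch, not the slides' implementation: projection onto a Euclidean ball is used as an example constraint set, and the quadratic test function at the end is illustrative:

```python
import numpy as np

def project_ball(y, radius=1.0):
    """Euclidean projection onto K = {x : ||x|| <= radius}."""
    norm = np.linalg.norm(y)
    return y if norm <= radius else y * (radius / norm)

def gradient_descent(grad_f, x0, eta, T, project=project_ball):
    """Projected GD: y_{t+1} = x_t - eta * grad f(x_t); x_{t+1} = Proj_K(y_{t+1})."""
    x = x0
    iterates = [x0]
    for _ in range(T):
        x = project(x - eta * grad_f(x))
        iterates.append(x)
    return np.mean(iterates, axis=0)   # the theorem bounds f at the average iterate

# Example: f(x) = ||x - c||^2 with minimizer c inside the unit ball.
c = np.array([0.3, -0.4])
x_avg = gradient_descent(lambda x: 2 * (x - c), np.zeros(2), eta=0.1, T=500)
```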
Proof:
(recall y_{t+1} ← x_t − η∇f(x_t), x_{t+1} = arg min_{x ∈ K} |y_{t+1} − x|)
1. Observation 1:
|x* − y_{t+1}|² = |x* − x_t|² − 2η∇f(x_t)^T(x_t − x*) + η²|∇f(x_t)|²
2. Observation 2: |x* − x_{t+1}|² ≤ |x* − y_{t+1}|²
This is the Pythagorean theorem (for projection onto a convex set).
Proof (continued):
1. Observation 1:
|x* − y_{t+1}|² = |x* − x_t|² − 2η∇f(x_t)^T(x_t − x*) + η²|∇f(x_t)|²
2. Observation 2: |x* − x_{t+1}|² ≤ |x* − y_{t+1}|²
Thus: |x* − x_{t+1}|² ≤ |x* − x_t|² − 2η∇f(x_t)^T(x_t − x*) + η²G²
And hence:
f((1/T) Σ_t x_t) − f(x*)
  ≤ (1/T) Σ_t [f(x_t) − f(x*)]
  ≤ (1/T) Σ_t ∇f(x_t)^T(x_t − x*)
  ≤ (1/T) Σ_t [ (1/2η)(|x* − x_t|² − |x* − x_{t+1}|²) + (η/2)G² ]
  ≤ D²/(2ηT) + (η/2)G² ≤ DG/√T
Recap
Theorem: for step size η = D/(G√T),
f((1/T) Σ_t x_t) ≤ min_{x* ∈ K} f(x*) + DG/√T
Thus, to get an ε-approximate solution, apply O(1/ε²) gradient iterations.
Gradient Descent: caveat
For ERM problems
arg min_{x ∈ R^d} (1/m) Σ_{i=1..m} ℓ_i(x, a_i, b_i) + R(x)
1. The gradient depends on all the data
2. What about generalization?
Next few slides:
Simultaneous optimization and generalization → faster optimization! (single example per iteration)
Statistical (PAC) learning
Nature: i.i.d. from distribution D over A × B = {(a, b)}
Learner: hypothesis h, with loss e.g. ℓ(h, (a, b)) = (h(a) − b)²
Hypothesis class H: X → Y is learnable if ∀ε, δ > 0 there exists an algorithm s.t. after seeing m examples, for m = poly(δ, ε, dimension(H)), it finds h s.t. w.p. 1 − δ:
err(h) ≤ min_{h* ∈ H} err(h*) + ε
where err(h) = E_{(a,b)∼D}[ℓ(h, (a, b))]
More powerful setting: Online Learning in Games
Iteratively, for t = 1, 2, …, T:
Player: h_t ∈ H
Adversary: (a_t, b_t) ∈ A × B
Loss ℓ(h_t, (a_t, b_t))
Goal: minimize (average, expected) regret:
(1/T) [ Σ_t ℓ(h_t, (a_t, b_t)) − min_{h* ∈ H} Σ_t ℓ(h*, (a_t, b_t)) ] → 0 as T → ∞
Vanishing regret → generalization in the PAC setting! (online-to-batch)
Can we minimize regret efficiently?
From this point onwards: f_t(x) = ℓ(x, a_t, b_t) = loss for one example
Online gradient descent [Zinkevich '05]
y_{t+1} = x_t − η∇f_t(x_t)
x_{t+1} = arg min_{x ∈ K} |y_{t+1} − x|
Theorem: Regret = Σ_t f_t(x_t) − Σ_t f_t(x*) = O(√T)
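A minimal online gradient descent loop following the update above. The stream of losses at the end is a made-up illustrative example (quadratic losses on the interval [−1, 1] with an alternating target), not from the slides:

```python
import numpy as np

def online_gradient_descent(grads, x0, eta, project):
    """OGD: play x_t, observe the gradient of f_t at x_t, step and project.

    `grads` is a list of per-round gradient oracles g_t(x) = grad f_t(x).
    Returns the sequence of played points x_1, ..., x_T.
    """
    x = x0
    played = []
    for grad_ft in grads:
        played.append(x)
        y = x - eta * grad_ft(x)   # gradient step on the current loss
        x = project(y)             # project back onto the feasible set K
    return played

# Illustrative stream: K = [-1, 1], losses f_t(x) = (x - z_t)^2 with
# targets z_t alternating between 1 and 0 (best fixed point: x* = 0.5).
project = lambda y: float(np.clip(y, -1.0, 1.0))
targets = [1.0 if t % 2 == 0 else 0.0 for t in range(100)]
grads = [(lambda x, z=z: 2 * (x - z)) for z in targets]
played = online_gradient_descent(grads, 0.0, 0.1, project)
```

The played points hover around the best fixed decision in hindsight, which is exactly what a vanishing-regret guarantee promises on average.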
Analysis
Denote ∇_t ≔ ∇f_t(x_t).
Observation 1:
|y_{t+1} − x*|² = |x_t − x*|² − 2η∇_t^T(x_t − x*) + η²|∇_t|²
Observation 2 (Pythagoras):
|x_{t+1} − x*| ≤ |y_{t+1} − x*|
Thus:
|x_{t+1} − x*|² ≤ |x_t − x*|² − 2η∇_t^T(x_t − x*) + η²|∇_t|²
Convexity:
Σ_t [f_t(x_t) − f_t(x*)] ≤ Σ_t ∇_t^T(x_t − x*)
  ≤ (1/2η) Σ_t (|x_t − x*|² − |x_{t+1} − x*|²) + (η/2) Σ_t |∇_t|²
  ≤ (1/2η) |x_1 − x*|² + (η/2) T G² = O(√T)
Lower bound
• 2 loss functions, T iterations:
• K = [−1, 1], f_1(x) = x, f_2(x) = −x
• Second expert's loss = first's × (−1)
• Expected loss = 0 (for any algorithm)
• Regret (compared to either −1 or 1):
E[|# of 1's − # of (−1)'s|] = Ω(√T), so Regret = Ω(√T)
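The Ω(√T) lower bound rests on the fact that T random ±1 losses drift by about √T. A quick Monte Carlo estimate of that quantity (illustrative only, not part of the slides' argument; the √(2T/π) constant is the standard asymptotic for this expectation):

```python
import numpy as np

rng = np.random.default_rng(0)

def expected_drift(T, trials=2000):
    """Estimate E[|#1's - #(-1)'s|] over T fair coin flips.

    For signs eps_t in {-1, +1}, |#1's - #(-1)'s| = |sum_t eps_t|,
    whose expectation grows as sqrt(2T/pi) = Theta(sqrt(T)).
    """
    signs = rng.choice([-1.0, 1.0], size=(trials, T))
    return float(np.mean(np.abs(signs.sum(axis=1))))

# The drift should roughly quadruple when T grows by a factor of 16.
d_small, d_large = expected_drift(64), expected_drift(1024)
```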
Stochastic gradient descent
Learning problem: arg min_{x ∈ R^d} F(x) = E_{(a_i,b_i)}[ℓ_i(x, a_i, b_i)]
Random example: f_t(x) = ℓ_i(x, a_i, b_i)
1. We have proved (for any sequence of ∇_t):
(1/T) Σ_t ∇_t^T x_t ≤ min_{x* ∈ K} (1/T) Σ_t ∇_t^T x* + DG/√T
2. Taking (conditional) expectation:
E[F((1/T) Σ_t x_t)] − min_{x* ∈ K} F(x*) ≤ E[(1/T) Σ_t ∇_t^T (x_t − x*)] ≤ DG/√T
One example per step, same convergence as GD, & gives direct generalization! (formally needs martingales)
O(d/ε²) vs. O(md/ε²) total running time for ε generalization error.
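A minimal SGD sketch for this setting: each step touches one random example, and (as in the GD theorem) the average iterate is returned. The least-squares data at the end is synthetic and illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)

def sgd(examples, loss_grad, x0, eta, T):
    """SGD: each step uses the gradient of the loss on ONE random example."""
    x = x0
    iterates = []
    for _ in range(T):
        a, b = examples[rng.integers(len(examples))]  # sample one (a_i, b_i)
        iterates.append(x)
        x = x - eta * loss_grad(x, a, b)
    return np.mean(iterates, axis=0)  # average iterate, as in the theorem

# Least-squares example: loss (x^T a - b)^2, data generated from x_true.
d, m = 3, 200
x_true = np.array([1.0, -2.0, 0.5])
A = rng.standard_normal((m, d))
examples = [(A[i], A[i] @ x_true) for i in range(m)]
grad = lambda x, a, b: 2 * (x @ a - b) * a
x_avg = sgd(examples, grad, np.zeros(d), eta=0.01, T=5000)
```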
Stochastic vs. full gradient descent
Regularization & Gradient Descent++
Why "regularize"?
• Statistical learning theory / Occam's razor: # of examples needed to learn a hypothesis class ~ its "dimension"
  • VC dimension
  • Fat-shattering dimension
  • Rademacher width
  • Margin/norm of linear/kernel classifier
• PAC theory: regularization ↔ reduce complexity
• Regret minimization: regularization ↔ stability
Minimize regret: best-in-hindsight
Regret = Σ_t f_t(x_t) − min_{x* ∈ K} Σ_t f_t(x*)
• Most natural (Follow-The-Leader, FTL):
x_t = arg min_{x ∈ K} Σ_{i=1}^{t−1} f_i(x)
• Provably works [Kalai-Vempala '05]:
x_t′ = arg min_{x ∈ K} Σ_{i=1}^{t} f_i(x) = x_{t+1}
• So if x_t ≈ x_{t+1}, we get a regret bound
• But with instability, x_t − x_{t+1} can be large!
Fixing FTL: Follow-The-Regularized-Leader (FTRL)
• Linearize: replace f_t by a linear function, ∇f_t(x_t)^T x
• Add regularization:
x_t = arg min_{x ∈ K} Σ_{i=1…t−1} ∇_i^T x + (1/η) R(x)
• R(x) is a strongly convex function, ensures stability:
∇_t^T (x_t − x_{t+1}) = O(η)
FTRL vs. gradient descent
• R(x) = ½‖x‖²
x_t = arg min_{x ∈ K} Σ_{i=1}^{t−1} ∇f_i(x_i)^T x + (1/η) R(x)
    = Π_K(−η Σ_{i=1}^{t−1} ∇f_i(x_i))
• Essentially OGD: starting with y_1 = 0, for t = 1, 2, …
x_t = Π_K(y_t)
y_{t+1} = y_t − η∇f_t(x_t)
FTRL vs. Multiplicative Weights
• Experts setting: K = Δ, the simplex of distributions over experts
• f_t(x) = c_t^T x, where c_t is the vector of losses
• R(x) = Σ_i x_i log x_i: negative entropy
x_t = arg min_{x ∈ K} Σ_{i=1}^{t−1} ∇f_i(x_i)^T x + (1/η) R(x)
    = exp(−η Σ_{i=1}^{t−1} c_i) / Z_t
(entrywise exponential; Z_t is the normalization constant)
• Gives the Multiplicative Weights method!
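The closed-form entropic-FTRL update above is exactly the multiplicative weights rule. A small sketch over three experts; the loss vectors are made up for illustration:

```python
import numpy as np

def multiplicative_weights(costs, eta):
    """FTRL with the negative-entropy regularizer over the simplex.

    x_t is proportional to exp(-eta * sum of past cost vectors c_i);
    equivalently, each round multiplies the weights entrywise by
    exp(-eta * c_t).  Returns the sequence of distributions played.
    """
    n = costs.shape[1]
    cumulative = np.zeros(n)
    played = []
    for c in costs:
        w = np.exp(-eta * cumulative)   # entrywise exponential
        played.append(w / w.sum())      # normalize by Z_t
        cumulative += c
    return played

# Expert 0 is consistently better; MW concentrates mass on it.
costs = np.tile(np.array([0.1, 0.9, 0.9]), (50, 1))
played = multiplicative_weights(costs, eta=0.3)
```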
FTRL ⇔ Online Mirror Descent
x_t = arg min_{x ∈ K} Σ_{i=1}^{t−1} ∇f_i(x_i)^T x + (1/η) R(x)
is equivalent to:
y_{t+1} = (∇R)^{−1}(∇R(y_t) − η∇f_t(x_t))
x_t = Π_K^R(y_t)
Bregman projection: Π_K^R(y) = arg min_{x ∈ K} B_R(x‖y)
B_R(x‖y) ≔ R(x) − R(y) − ∇R(y)^T (x − y)
Adaptive Regularization: AdaGrad
• Consider a generalized linear model; the prediction is a function of a^T x:
∇f_t(x) = ℓ′(a_t, b_t, x) ⋅ a_t (a scalar times the feature vector)
• OGD update: x_{t+1} = x_t − η∇_t = x_t − η ℓ′(a_t, b_t, x) a_t
• All features are treated equally in updating the parameter vector
• In typical text classification tasks, the feature vectors a_t are very sparse → slow learning!
• Adaptive regularization: per-feature learning rates
Optimal regularization
• The general RFTL form:
x_t = arg min_{x ∈ K} Σ_{i=1…t−1} f_i(x) + (1/η) R(x)
• Which regularizer to pick?
• AdaGrad: treat this as a learning problem! Family of regularizations:
R(x) = ‖x‖²_A  s.t.  A ≽ 0, Trace(A) = d
• Objective in matrix world: best regret in hindsight!
AdaGrad (diagonal form)
• Set x_1 ∈ K arbitrarily
• For t = 1, 2, …:
  1. use x_t, obtain f_t
  2. compute x_{t+1} as follows:
     G_t = diag(Σ_{i=1}^t ∇f_i(x_i) ∇f_i(x_i)^T)
     y_{t+1} = x_t − η G_t^{−1/2} ∇f_t(x_t)
     x_{t+1} = arg min_{x ∈ K} (y_{t+1} − x)^T G_t (y_{t+1} − x)
• Regret bound [Duchi, Hazan, Singer '10]:
O(Σ_i √(Σ_t ∇_{t,i}²)), can be √d better than SGD
• Infrequently occurring, or small-scale, features have small influence on regret (and therefore, on convergence to the optimal parameters)
Agenda
1. Learning as mathematical optimization
   • Stochastic optimization, ERM, online regret minimization
   • Offline/stochastic/online gradient descent
2. Regularization
   • AdaGrad and optimal regularization
3. Gradient Descent++
   • Frank-Wolfe, acceleration, variance reduction, second-order methods, non-convex optimization