machine learning - cse.iitkgp.ac.in
TRANSCRIPT
Algorithm for SVM training
• SVM dual objective function:
$\min_\alpha f(\alpha) = \frac{1}{2}\alpha^T Q \alpha - e^T \alpha$, subject to $0 \le \alpha_i \le C$ and $y^T \alpha = 0$, where $Q_{ij} = y_i y_j K(x_i, x_j)$ and $e$ is the vector of all ones.
• One could use the gradient projection method; see the sketch below.
• But gradient descent is slow.
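A minimal sketch (not from the lecture) of projected gradient descent on this dual, assuming a bias-free SVM so that the feasible set is just the box $0 \le \alpha_i \le C$; with the equality constraint $y^T \alpha = 0$ the projection step is more involved. K, y, C, lr and n_iter are illustrative placeholders:

import numpy as np

def svm_dual_pgd(K, y, C=1.0, lr=1e-3, n_iter=5000):
    # K: kernel matrix, y: labels in {-1, +1}
    Q = (y[:, None] * y[None, :]) * K     # Q_ij = y_i y_j K_ij
    alpha = np.zeros(len(y))
    for _ in range(n_iter):
        grad = Q @ alpha - 1.0            # grad f(alpha) = Q alpha - e
        alpha = np.clip(alpha - lr * grad, 0.0, C)  # step, then project onto the box
    return alpha

Each iteration costs a dense matrix-vector product over all variables, which is one reason plain (projected) gradient descent is slow on large problems.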
Algorithm for SVM training
• Number of variables = number of examples.
• The larger the data, the harder the problem: $O(n^3)$.
• But most of the time the solution is sparse.
• Hence use decomposition methods: iteratively solve small sub-problems until the KKT conditions are satisfied (see the sketch after the reference below).
• Reference: Rong-En Fan, Pai-Hsuen Chen, Chih-Jen Lin. Working Set Selection Using Second Order Information for Training Support Vector Machines. JMLR 2005.
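A minimal sketch of one such decomposition scheme, assuming the dual above; it uses the simpler first-order maximal-violating-pair selection rule rather than the second-order rule of the cited paper, and assumes both classes appear in y so the index sets stay non-empty:

import numpy as np

def svm_decomposition(K, y, C=1.0, tol=1e-3, max_iter=10000):
    # K: kernel matrix, y: labels in {-1, +1}
    Q = (y[:, None] * y[None, :]) * K
    alpha = np.zeros(len(y))
    grad = -np.ones(len(y))                  # grad f(alpha) = Q alpha - e, at alpha = 0
    for _ in range(max_iter):
        F = -y * grad
        up = ((alpha < C) & (y == 1)) | ((alpha > 0) & (y == -1))
        low = ((alpha < C) & (y == -1)) | ((alpha > 0) & (y == 1))
        i = np.where(up)[0][np.argmax(F[up])]
        j = np.where(low)[0][np.argmin(F[low])]
        if F[i] - F[j] < tol:                # KKT conditions satisfied (within tol)
            break
        # Two-variable sub-problem along the direction alpha_i += y_i t,
        # alpha_j -= y_j t, which keeps y^T alpha unchanged.
        kappa = Q[i, i] + Q[j, j] - 2.0 * y[i] * y[j] * Q[i, j]
        t = (F[i] - F[j]) / max(kappa, 1e-12)
        # Clip t so that both variables stay inside the box [0, C].
        lo_i, hi_i = (-alpha[i], C - alpha[i]) if y[i] > 0 else (alpha[i] - C, alpha[i])
        lo_j, hi_j = (alpha[j] - C, alpha[j]) if y[j] > 0 else (-alpha[j], C - alpha[j])
        t = np.clip(t, max(lo_i, lo_j), min(hi_i, hi_j))
        alpha[i] += y[i] * t
        alpha[j] -= y[j] * t
        grad += (Q[:, i] * y[i] - Q[:, j] * y[j]) * t  # incremental gradient update
    return alpha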
KKT conditions
• A vector $\alpha$ is a stationary point if there exist $\lambda_i$, $\mu_i$, $b$ such that:
$\nabla f(\alpha) + b y = \lambda - \mu$, $\quad \lambda_i \alpha_i = 0$, $\quad \mu_i (C - \alpha_i) = 0$, $\quad \lambda_i, \mu_i \ge 0 \;\; \forall i$.
• Where $\nabla f(\alpha) = Q\alpha - e$. Or, equivalently:
$m(\alpha) = \max_{i \in I_{up}(\alpha)} -y_i \nabla f(\alpha)_i \;\le\; \min_{i \in I_{low}(\alpha)} -y_i \nabla f(\alpha)_i = M(\alpha)$,
where $I_{up}(\alpha) = \{i \mid \alpha_i < C, y_i = 1 \text{ or } \alpha_i > 0, y_i = -1\}$ and $I_{low}(\alpha) = \{i \mid \alpha_i < C, y_i = -1 \text{ or } \alpha_i > 0, y_i = 1\}$.
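The second condition doubles as the practical stopping test in decomposition methods; here is a sketch of a checker, assuming the same Q, y, alpha, C as above (the point is stationary when the returned violation is at most 0; solvers stop once it falls below a small tolerance):

import numpy as np

def kkt_violation(Q, y, alpha, C):
    grad = Q @ alpha - 1.0                   # grad f(alpha) = Q alpha - e
    F = -y * grad
    up = ((alpha < C) & (y == 1)) | ((alpha > 0) & (y == -1))
    low = ((alpha < C) & (y == -1)) | ((alpha > 0) & (y == 1))
    return F[up].max() - F[low].min()        # m(alpha) - M(alpha)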
Online Learning
• The traditional machine learning assumption is that all the data points are available at the outset.
• This may not be the case: an Internet portal, for example, does not have all the data in advance.
• A “working” model is required even when there is not yet enough data.
Perceptron (Rosenblatt 1962)
• Linear model: $y(x) = f(w^T \phi(x))$
• Where $f(a) = +1$ if $a \ge 0$ and $-1$ otherwise, $\phi$ is a fixed feature map, and the targets are $t_n \in \{-1, +1\}$.
• Perceptron error: $w^T \phi(x_n)\, t_n < 0$
• Define $x'_n = t_n \phi(x_n)$. Then an error means $w^T x'_n < 0$.
Perceptron algorithm:
1. Initialize $w_0$ randomly.
2. For each training data point $(x_n, t_n)$, update $w \leftarrow w + \eta x'_n$ if there is an error.
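A minimal sketch in NumPy, assuming $\phi$ is the identity map (the slides leave the basis function unspecified) and $t_n \in \{-1, +1\}$:

import numpy as np

def perceptron(X, t, eta=1.0, max_epochs=1000, seed=0):
    rng = np.random.default_rng(seed)
    w = rng.normal(size=X.shape[1])          # 1. initialize w_0 randomly
    for _ in range(max_epochs):
        updated = False
        for x_n, t_n in zip(X, t):
            if (w @ x_n) * t_n < 0:          # error: w^T x_n t_n < 0
                w += eta * t_n * x_n         # 2. w <- w + eta * x'_n
                updated = True
        if not updated:                      # a full pass with no errors:
            break                            # converged
    return w

For linearly separable data, the convergence theorem below guarantees the inner update fires only finitely many times.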
Perceptron convergence theorem
• Given a linearly separable dataset $D = \{(x_n, t_n) \mid n = 1 \ldots N\}$ such that $w^{*T} x'_n > 0$ for $n = 1 \ldots N$ and some $w^*$, the perceptron learning algorithm converges in a finite number of updates.
• Reference: http://www.cems.uvm.edu/~rsnapp/teaching/cs295ml/notes/perceptron.pdf
Perceptron convergence proof
• $w(k)$: weight vector at the $k$-th update.
• $x'(k)$: the $x'_n = t_n \phi(x_n)$ used at the $k$-th update.
• We will show that:
$A k^2 \le \|w(k) - w(0)\|^2 \le B k$
• Since a quadratic lower bound cannot stay below a linear upper bound forever, the updates must end after at most $k_{max} = B/A$ updates.
Perceptron convergence proof
Proof of the lower bound:
• The updates are of the form: $w(k) = w(k-1) + \eta x'(k)$, $k = 1 \ldots k_{max}$.
• Adding them up, we get:
$w(k) - w(0) = \eta \sum_{i=1}^{k} x'(i)$
• Hence:
$w^{*T}(w(k) - w(0)) = \eta\, w^{*T} \Big( \sum_{i=1}^{k} x'(i) \Big)$
Perceptron convergence proof
• Let: $\gamma = \min_{x'} w^{*T} x' > 0$
• Then: $w^{*T}(w(k) - w(0)) \ge \eta k \gamma$
• Hence, using the Cauchy-Schwarz inequality: $\|w^*\|^2 \, \|w(k) - w(0)\|^2 \ge (\eta k \gamma)^2$
• Thus:
$\|w(k) - w(0)\|^2 \ge \dfrac{\eta^2 \gamma^2}{\|w^*\|^2}\, k^2$, i.e. $A = \dfrac{\eta^2 \gamma^2}{\|w^*\|^2}$.
Perceptron convergence proof
Proof of the upper bound:
• Subtracting $w(0)$ from the updates:
$w(1) - w(0) = \eta x'(1)$
$w(k) - w(0) = (w(k-1) - w(0)) + \eta x'(k)$
• Taking squared norms:
$\|w(1) - w(0)\|^2 = \eta^2 \|x'(1)\|^2$
$\|w(k) - w(0)\|^2 = \|w(k-1) - w(0)\|^2 + 2\eta\, (w(k-1) - w(0))^T x'(k) + \eta^2 \|x'(k)\|^2$
Perceptron convergence proof
• Since an update was made at $x'(k)$: $w(k-1)^T x'(k) < 0$
• Hence:
$\|w(k) - w(0)\|^2 \le \|w(k-1) - w(0)\|^2 - 2\eta\, w(0)^T x'(k) + \eta^2 \|x'(k)\|^2$
• Adding these inequalities (the $k = 1$ step contributes no cross term):
$\|w(k) - w(0)\|^2 \le \eta^2 \big( \|x'(1)\|^2 + \cdots + \|x'(k)\|^2 \big) - 2\eta\, w(0)^T \big( x'(2) + \cdots + x'(k) \big)$
Perceptron convergence proof
• Let:
$M = \max_{x'} \|x'\|^2$
$\mu = 2 \min_{x'} w(0)^T x'$ (taken $< 0$, the worst case)
• Then:
$\|w(k) - w(0)\|^2 \le (\eta^2 M - \eta \mu)\, k$, i.e. $B = \eta^2 M - \eta \mu$.
• Hence both bounds hold; combining them (below) proves the theorem.
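The final step, as a short worked derivation from the constants above:

% Quadratic growth from below, linear growth from above:
\[
\frac{\eta^2 \gamma^2}{\|w^*\|^2}\, k^2
\;\le\; \|w(k) - w(0)\|^2
\;\le\; (\eta^2 M - \eta\mu)\, k
\]
% With A = \eta^2\gamma^2/\|w^*\|^2 and B = \eta^2 M - \eta\mu,
% A k^2 <= B k can hold only while
\[
k \;\le\; \frac{B}{A} \;=\; \frac{(\eta M - \mu)\,\|w^*\|^2}{\eta\, \gamma^2},
\]
% so the number of updates is finite.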
Stochastic gradient descent
• Reference: http://alex.smola.org/teaching/10-701-15/math.html
• Given dataset $D = \{(x_1, y_1), \ldots, (x_m, y_m)\}$
• Loss function: $L(\theta, D) = \frac{1}{m} \sum_{i=1}^{m} l(\theta; x_i, y_i)$
• For linear models: $l(\theta; x_i, y_i) = l(y_i, \theta^T \phi(x_i))$
• Assumption: $D$ is drawn IID from some distribution $\mathcal{P}$.
• Problem: $\min_\theta L(\theta, D)$
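For a concrete instance (an illustration, not from the slides), take the squared loss for a linear model:

% Squared loss and its gradient, used in the SGD sketch below:
\[
l(y_i, \theta^T \phi(x_i)) = \tfrac{1}{2}\big(\theta^T \phi(x_i) - y_i\big)^2,
\qquad
\nabla_\theta\, l = \big(\theta^T \phi(x_i) - y_i\big)\, \phi(x_i).
\]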
Stochastic gradient descent
• Input: $D$
• Output: $\theta$
Algorithm:
• Initialize $\theta_0$
• For $t = 1, \ldots, T$: $\quad \theta_{t+1} = \theta_t - \eta_t \nabla_\theta\, l(y_t, \theta_t^T \phi(x_t))$
• Output the weighted average $\theta = \dfrac{\sum_{t=1}^{T} \eta_t \theta_t}{\sum_{t=1}^{T} \eta_t}$.
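A minimal sketch, assuming $\phi$ is the identity map and that grad_loss(y_t, s) returns $\partial l(y_t, s)/\partial s$ for the pointwise loss (both are assumptions, since the slides keep $\phi$ and $l$ abstract):

import numpy as np

def sgd(X, y, grad_loss, etas, theta0):
    theta = theta0.astype(float)
    weighted_sum = np.zeros_like(theta)          # accumulates eta_t * theta_t
    eta_sum = 0.0                                # accumulates eta_t
    for x_t, y_t, eta_t in zip(X, y, etas):
        g = grad_loss(y_t, theta @ x_t) * x_t    # chain rule for a linear model
        theta = theta - eta_t * g                # theta_{t+1} = theta_t - eta_t * grad
        weighted_sum += eta_t * theta
        eta_sum += eta_t
    return weighted_sum / eta_sum                # eta-weighted average of the iterates

With the squared loss from the earlier illustration one would pass grad_loss=lambda y_t, s: s - y_t and, say, etas=[0.1 / np.sqrt(t + 1) for t in range(len(X))].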
SGD convergence
• Expected loss: $R(\theta) = E_{\mathcal{P}}[\, l(y, \theta^T \phi(x)) \,]$
• Optimal expected loss: $R^* = R(\theta^*) = \min_\theta R(\theta)$
• Convergence (for the averaged output $\theta$):
$E[R(\theta)] - R^* \le \dfrac{r^2 + L^2 \sum_{t=1}^{T} \eta_t^2}{2 \sum_{t=1}^{T} \eta_t}$
• Where: $r = \|\theta_0 - \theta^*\|$
• $L = \max \|\nabla l(y, \theta^T \phi(x))\|$
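As a worked special case (not on the slides), take a constant step size $\eta_t = \eta$:

% The bound specialises to
\[
E[R(\theta)] - R^* \;\le\; \frac{r^2 + L^2 T \eta^2}{2 T \eta}
\;=\; \frac{r^2}{2\eta T} + \frac{L^2 \eta}{2},
\]
% which is minimised at \eta = r/(L\sqrt{T}), giving the familiar rate
\[
E[R(\theta)] - R^* \;\le\; \frac{r L}{\sqrt{T}}.
\]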
SGD convergence proof
• Define $\delta_t = \theta_t - \theta^*$ and $g_t = \nabla_\theta\, l(y_t, \theta_t^T \phi(x_t))$
• $\|\delta_{t+1}\|^2 = \|\delta_t\|^2 + \eta_t^2 \|g_t\|^2 - 2\eta_t (\theta_t - \theta^*)^T g_t$
• Taking expectation w.r.t. $\mathcal{P}$ and using convexity, $R^* - R(\theta_t) \ge g_t^T(\theta^* - \theta_t)$ in expectation, we get:
$E[\, \|\delta_{t+1}\|^2 - \|\delta_t\|^2 \,] \le \eta_t^2 L^2 + 2\eta_t \big( R^* - E[R(\theta_t)] \big)$
• Taking the sum over $t = 0, \ldots, T-1$ and telescoping:
$E[\|\delta_T\|^2] - \|\delta_0\|^2 \le L^2 \sum_{t=0}^{T-1} \eta_t^2 + 2 \sum_{t=0}^{T-1} \eta_t \big( R^* - E[R(\theta_t)] \big)$
SGD convergence proof
• Using convexity of $R$ and the definition of the averaged output $\theta$ (Jensen's inequality):
$\Big( \sum_{t=0}^{T-1} \eta_t \Big) E[R(\theta)] \le E\Big[ \sum_{t=0}^{T-1} \eta_t R(\theta_t) \Big]$
• Substituting into the expression from the previous slide:
$E[\|\delta_T\|^2] - \|\delta_0\|^2 \le L^2 \sum_{t=0}^{T-1} \eta_t^2 + 2 \Big( \sum_{t=0}^{T-1} \eta_t \Big) \big( R^* - E[R(\theta)] \big)$
• Rearranging the terms proves the result; the last step is spelled out below.
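The rearrangement, spelled out (using $E[\|\delta_T\|^2] \ge 0$ and $\|\delta_0\| = r$):

% Drop the non-negative E ||delta_T||^2 and move terms across:
\[
2 \Big( \sum_{t=0}^{T-1} \eta_t \Big) \big( E[R(\theta)] - R^* \big)
\;\le\; \|\delta_0\|^2 + L^2 \sum_{t=0}^{T-1} \eta_t^2
\;=\; r^2 + L^2 \sum_{t=0}^{T-1} \eta_t^2,
\]
% and dividing both sides by 2 * (sum of eta_t) gives the stated bound.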