Machine Learning
Sourangshu Bhattacharya
cse.iitkgp.ac.in

Algorithm for SVM training

β€’ SVM dual objective function:

β€’ One could use gradient projection method.

β€’ But gradient descent is slow.

Algorithm for SVM training

β€’ Number of variables = number of examples.

β€’ Larger data harder the problem. 𝑂(𝑛3).

β€’ But most of the times, solution is sparse.

β€’ Hence use decomposition methods:β€’ Iteratively solve sub-problems till KKT conditions are

satisfied.

β€’ Reference: Working Set Selection Using Second Order Information for Training Support Vector Machines. Rong-En Fan. Pai-Hsuen Chen, Chih-Jen Lin. JMLR 2005.

SMO Algorithm

KKT conditions

• A vector α is a stationary point if there exist λ_i, μ_i, b such that:

  ∇f(α) + by = λ − μ, λ_i α_i = 0, μ_i (C − α_i) = 0, λ_i ≥ 0, μ_i ≥ 0

• where ∇f(α) = Qα − e. Or:

  ∇f(α)_i + by_i ≥ 0 if α_i < C
  ∇f(α)_i + by_i ≤ 0 if α_i > 0

• Or:

  max_{i∈I_up(α)} −y_i ∇f(α)_i ≤ min_{j∈I_low(α)} −y_j ∇f(α)_j

• where

  I_up(α) = {t | α_t < C, y_t = 1 or α_t > 0, y_t = −1}
  I_low(α) = {t | α_t < C, y_t = −1 or α_t > 0, y_t = 1}

Working set selection

β€’ Select(i,j) using maximum violating pair:

Online Learning

β€’ Traditional machine learning assumption is that all the data points are available in the beginning.

β€’ May not be the case:β€’ Internet portal does not have all the data.

β€’ A β€œworking” model is required even when there is not enough data.

Perceptron (Rosenblatt 1962)

β€’ Linear model:

β€’ Where:

β€’ Perceptron error:π‘€π‘‡πœ™ π‘₯𝑛 𝑑𝑛 < 0

Perceptron error:

Define: π‘₯𝑛′ = π‘‘π‘›πœ™(π‘₯𝑛).

Error: 𝑀𝑇π‘₯𝑛′ < 0

Perceptron algorithm:

1. Initialize w₀ randomly.

2. For each training data point (x_n, t_n): if there is an error (wᵀx_n′ < 0), update

   w ← w + η x_n′
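The algorithm above can be sketched in NumPy (a minimal version; the slides initialize w₀ randomly, while this sketch uses zeros for reproducibility and tests wᵀx_n′ ≤ 0 so the first point triggers an update):

```python
import numpy as np

def perceptron(X, t, eta=1.0, max_epochs=100):
    """Perceptron learning (Rosenblatt 1962).

    X: (N, d) matrix whose rows are phi(x_n); t: labels in {-1, +1}.
    Returns the learned weight vector w.
    """
    Xp = X * t[:, None]              # x'_n = t_n * phi(x_n)
    w = np.zeros(X.shape[1])
    for _ in range(max_epochs):
        errors = 0
        for xp in Xp:
            if w @ xp <= 0:          # misclassified point
                w = w + eta * xp     # update: w <- w + eta * x'_n
                errors += 1
        if errors == 0:              # all points classified correctly
            break
    return w
```

On a linearly separable set the loop terminates with wᵀx_n′ > 0 for all n.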

Perceptron - Illustration

Perceptron convergence theorem

β€’ Given a linearly separable dataset 𝐷 ={(π‘₯𝑛, 𝑑𝑛)|𝑛 = 1…𝑁} such that wβˆ—Tπ‘₯𝑛

β€² > 0, 𝑛 =1…𝑁, for some π‘€βˆ—. The perceptron learning algorithm converges in a finite updates.

β€’ Reference:http://www.cems.uvm.edu/~rsnapp/teaching/cs295ml/notes/perceptron.pdf

Perceptron convergence proof

β€’ 𝑀(π‘˜): weight at the π‘˜π‘‘β„Ž update.

β€’ π‘₯β€²(π‘˜): π‘‘π‘›πœ™(π‘₯𝑛) at the π‘˜π‘‘β„Ž update.

β€’ We will show that:

π΄π‘˜2 ≀ 𝑀 π‘˜ βˆ’ 𝑀 02≀ π΅π‘˜

β€’ Hence the updates must end after π‘˜π‘šπ‘Žπ‘₯ = 𝐡/𝐴updates.

Perceptron convergence proof

Proof for the lower bound:

• The updates are of the form:

  w(k) = w(k−1) + η x′(k), k = 1…k_max

• Adding, we get:

  w(k) − w(0) = η (x′(1) + … + x′(k))

• Hence:

  w*ᵀ(w(k) − w(0)) = η w*ᵀ(x′(1) + … + x′(k))

Perceptron convergence proof

β€’ Let:π‘Ž = min

π‘₯β€²π‘€βˆ—π‘‡π‘₯β€² > 0

β€’ Then:π‘€βˆ—π‘‡ 𝑀 π‘˜ βˆ’ 𝑀 0 > πœ‚π‘Žπ‘˜

β€’ Hence, using Cauchy Schwartz inequality:π‘€βˆ—2𝑀 π‘˜ βˆ’ 𝑀 0

2β‰₯ πœ‚π‘Žπ‘˜ 2

β€’ Thus:

𝑀 π‘˜ βˆ’ 𝑀 02β‰₯πœ‚π‘Ž

π‘€βˆ—

2

π‘˜2

Perceptron convergence proof

Proof for the upper bound:

• Subtracting w(0) from the updates:

  w(1) − w(0) = η x′(1)
  w(k) − w(0) = (w(k−1) − w(0)) + η x′(k)

• Taking squared norms:

  ‖w(1) − w(0)‖² = η² ‖x′(1)‖²
  ‖w(k) − w(0)‖² = ‖w(k−1) − w(0)‖² + 2η (w(k−1) − w(0))ᵀx′(k) + η² ‖x′(k)‖²

Perceptron convergence proof

β€’ Since, update was done at π‘₯β€²(π‘˜):𝑀 π‘˜ βˆ’ 1 π‘₯β€² π‘˜ < 0

β€’ Hence:

𝑀 π‘˜ βˆ’ 𝑀 02≀ 𝑀 π‘˜ βˆ’ 1 βˆ’ 𝑀 0

2

βˆ’2πœ‚π‘€ 0 𝑇π‘₯β€² π‘˜ + πœ‚2 π‘₯β€² π‘˜2

β€’ Adding:

𝑀 π‘˜ βˆ’ 𝑀 02≀ πœ‚2 π‘₯β€² 1

2+β‹―+ π‘₯β€² π‘˜

2

βˆ’2πœ‚π‘€ 0 𝑇(π‘₯β€² 2 +β‹―+ π‘₯β€² π‘˜ )

Perceptron convergence proof

β€’ Let:

𝑀 = maxπ‘₯β€²π‘₯β€²2

πœ‡ = 2minπ‘₯′𝑀 0 𝑇π‘₯β€² < 0

β€’ Then:

𝑀 π‘˜ βˆ’ 𝑀 02≀ πœ‚2𝑀 βˆ’ πœ‚πœ‡ π‘˜

β€’ Hence proved.

Stochastic gradient descent

β€’ Reference: http://alex.smola.org/teaching/10-701-15/math.html

β€’ Given dataset 𝐷 = { π‘₯1, 𝑦1 , … , π‘₯π‘š, π‘¦π‘š }

β€’ Loss function: 𝐿 πœƒ, 𝐷 =1

𝑁 𝑖=1𝑁 𝑙(πœƒ; π‘₯𝑖 , 𝑦𝑖)

β€’ For linear models: 𝑙 πœƒ; π‘₯𝑖 , 𝑦𝑖 = 𝑙(𝑦𝑖 , πœƒπ‘‡πœ™ π‘₯𝑖 )

β€’ Assumption 𝐷 is drawn IID from some distribution 𝒫.

β€’ Problem:minπœƒπΏ(πœƒ, 𝐷)

Stochastic gradient descent

β€’ Input: 𝐷

β€’ Output: πœƒ

Algorithm:

β€’ Initialize πœƒ0

β€’ For 𝑑 = 1,… , π‘‡πœƒπ‘‘+1 = πœƒπ‘‘ βˆ’ πœ‚π‘‘π›»πœƒπ‘™(𝑦𝑑 , πœƒ

π‘‡πœ™ π‘₯𝑑 )

β€’ πœƒ = 𝑑=1𝑇 πœ‚π‘‘πœƒ

𝑑

𝑑=1𝑇 πœ‚π‘‘

.
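A minimal NumPy sketch of the algorithm above. The step-size schedule η_t = η₀/√t and the uniform sampling of examples are illustrative assumptions (the slides leave η_t unspecified):

```python
import numpy as np

def sgd_averaged(X, y, loss_grad, eta0=0.1, T=1000, seed=0):
    """Stochastic gradient descent returning the eta_t-weighted average.

    loss_grad(theta, x, y) must return grad_theta l(y, theta^T x).
    """
    rng = np.random.default_rng(seed)
    theta = np.zeros(X.shape[1])
    num = np.zeros_like(theta)        # accumulates sum_t eta_t * theta_t
    den = 0.0                         # accumulates sum_t eta_t
    for t in range(1, T + 1):
        eta = eta0 / np.sqrt(t)       # assumed schedule eta_t = eta0 / sqrt(t)
        num += eta * theta
        den += eta
        i = rng.integers(len(X))      # draw one example
        theta = theta - eta * loss_grad(theta, X[i], y[i])
    return num / den                  # theta_bar = sum eta_t theta_t / sum eta_t
```

For the squared loss l(y, θᵀx) = (1/2)(θᵀx − y)², for example, loss_grad is (θᵀx − y)·x.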

SGD convergence

β€’ Expected loss: 𝑠 πœƒ = 𝐸𝒫[𝑙(𝑦, πœƒπ‘‡πœ™ π‘₯ ]

β€’ Optimal Expected loss: π‘ βˆ— = 𝑠 πœƒβˆ— = minπœƒπ‘ (πœƒ)

β€’ Convergence:

𝐸 πœƒ 𝑠 πœƒ βˆ’ π‘ βˆ— ≀𝑅2 + 𝐿2 𝑑=1

𝑇 πœ‚π‘‘2

2 𝑑=1𝑇 πœ‚π‘‘

β€’ Where: 𝑅 = πœƒ0 βˆ’ πœƒβˆ—

β€’ 𝐿 = max𝛻𝑙(𝑦, πœƒπ‘‡πœ™ π‘₯ )

SGD convergence proof

β€’ Define π‘Ÿπ‘‘ = πœƒπ‘‘ βˆ’ πœƒβˆ— and 𝑔𝑑 = π›»πœƒπ‘™(𝑦𝑑 , πœƒ

π‘‡πœ™ π‘₯𝑑 )

β€’ π‘Ÿπ‘‘+12 = π‘Ÿπ‘‘

2 + πœ‚π‘‘2 𝑔𝑑

2 βˆ’ 2πœ‚π‘‘ πœƒπ‘‘ βˆ’ πœƒβˆ— 𝑇𝑔𝑑

β€’ Taking expectation w.r.t 𝒫, πœƒ and using π‘ βˆ— βˆ’π‘  πœƒπ‘‘ β‰₯ 𝑔𝑑

𝑇(πœƒβˆ— βˆ’ πœƒπ‘‘), we get:𝐸 πœƒ π‘Ÿπ‘‘+1

2 βˆ’ π‘Ÿπ‘‘2 ≀ πœ‚π‘‘

2𝐿2 + 2πœ‚π‘‘(π‘ βˆ— βˆ’ 𝐸 πœƒ[𝑠 πœƒ

𝑑 ])

β€’ Taking sum over 𝑑 = 1,… , 𝑇 and using𝐸 πœƒ π‘Ÿπ‘‘+1

2 βˆ’ π‘Ÿ02

≀ 𝐿2

𝑑=0

π‘‡βˆ’1

πœ‚π‘‘2 + 2

𝑑=0

π‘‡βˆ’1

πœ‚π‘‘(π‘ βˆ— βˆ’ 𝐸 πœƒ[𝑠 πœƒ

𝑑 ])

SGD convergence proof

β€’ Using convexity of 𝑠:

𝑑=0

π‘‡βˆ’1

πœ‚π‘‘ 𝐸 πœƒ 𝑠 πœƒ ≀ 𝐸 πœƒ[

𝑑=0

π‘‡βˆ’1

πœ‚π‘‘π‘  πœƒπ‘‘ ]

β€’ Substituting in the expression from previous slide:𝐸 πœƒ π‘Ÿπ‘‘+1

2 βˆ’ π‘Ÿ02

≀ 𝐿2

𝑑=0

π‘‡βˆ’1

πœ‚π‘‘2 + 2

𝑑=0

π‘‡βˆ’1

πœ‚π‘‘(π‘ βˆ— βˆ’ 𝐸 πœƒ[𝑠 πœƒ ])

β€’ Rearranging the terms proves the result.