04 Machine Learning - Supervised Linear Classifier
TRANSCRIPT
Machine Learning for Data Mining: Linear Classifiers
Andres Mendez-Vazquez
May 23, 2016
Outline

1. Introduction
   - The Simplest Functions
   - Splitting the Space
   - The Decision Surface
2. Developing an Initial Solution
   - Gradient Descent Procedure
   - The Geometry of a Two-Category Linearly-Separable Case
   - Basic Method
   - Minimum Squared Error Procedure
   - The Error Idea
   - The Final Error Equation
   - The Data Matrix
   - Multi-Class Solution
   - Issues with Least Squares!!!
   - What about Numerical Stability?
What is it?

First of all, we have a parametric model! Here, the model is a hyperplane:

$$g(\mathbf{x}) = \mathbf{w}^{T}\mathbf{x} + w_0 \qquad (1)$$

In the case of $\mathbb{R}^2$, we have the following function:

$$g(\mathbf{x}) = w_1 x_1 + w_2 x_2 + w_0 \qquad (2)$$
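The discriminant in Eq. (2) can be sketched in a few lines; the weights below ($w_1 = 1$, $w_2 = 2$, $w_0 = -3$) are illustrative values, not taken from the slides:

```python
import numpy as np

# Linear discriminant g(x) = w^T x + w0 in R^2, with made-up weights.
w = np.array([1.0, 2.0])   # w1, w2
w0 = -3.0                  # bias / threshold

def g(x):
    """Positive on one side of the line, negative on the other."""
    return w @ x + w0

# Classify by the sign of g(x): class omega_1 if g(x) > 0, else omega_2.
x = np.array([2.0, 2.0])
label = 1 if g(x) > 0 else 2
print(g(x), label)   # g(x) = 1*2 + 2*2 - 3 = 3.0, so class 1
```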
Splitting the Space $\mathbb{R}^2$

Using a simple straight line. [Figure: two classes separated by a straight line.]
Defining a Decision Surface

The equation $g(\mathbf{x}) = 0$ defines a decision surface separating the elements of the classes $\omega_1$ and $\omega_2$.

When $g(\mathbf{x})$ is linear, the decision surface is a hyperplane. Given that $\mathbf{x}_1$ and $\mathbf{x}_2$ are both on the decision surface:

$$\mathbf{w}^{T}\mathbf{x}_1 + w_0 = 0, \qquad \mathbf{w}^{T}\mathbf{x}_2 + w_0 = 0$$

Thus

$$\mathbf{w}^{T}\mathbf{x}_1 + w_0 = \mathbf{w}^{T}\mathbf{x}_2 + w_0 \qquad (3)$$
Defining a Decision Surface

Thus

$$\mathbf{w}^{T}(\mathbf{x}_1 - \mathbf{x}_2) = 0 \qquad (4)$$

Remark: any vector lying in the hyperplane is perpendicular to $\mathbf{w}$, i.e., $\mathbf{w}$ is normal to the hyperplane.
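A quick numerical check of Eq. (4), again with made-up weights: any two points on the decision surface give a difference vector orthogonal to $\mathbf{w}$.

```python
import numpy as np

# Toy check that w is normal to H: for any x1, x2 with g(x) = 0,
# the difference x1 - x2 lies in the hyperplane, so w^T (x1 - x2) = 0.
w = np.array([1.0, 2.0])
w0 = -3.0

# Two points on the line x1 + 2*x2 - 3 = 0.
x1 = np.array([1.0, 1.0])   # 1 + 2 - 3 = 0
x2 = np.array([3.0, 0.0])   # 3 + 0 - 3 = 0

print(w @ (x1 - x2))        # 0.0: orthogonal to w
```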
Therefore

The space is split into two regions (example in $\mathbb{R}^3$) by the hyperplane $H$. [Figure: the hyperplane $H$ dividing the space into two regions.]
Some Properties of the Hyperplane

Given that $g(\mathbf{x}) > 0$ if $\mathbf{x} \in R_1$.
There is more: we can say the following

- Any $\mathbf{x} \in R_1$ is on the positive side of $H$.
- Any $\mathbf{x} \in R_2$ is on the negative side of $H$.

In addition, $g(\mathbf{x})$ gives us a way to obtain the distance from $\mathbf{x}$ to the hyperplane $H$. First, we express any $\mathbf{x}$ as follows:

$$\mathbf{x} = \mathbf{x}_p + r\,\frac{\mathbf{w}}{\|\mathbf{w}\|}$$

where

- $\mathbf{x}_p$ is the normal projection of $\mathbf{x}$ onto $H$;
- $r$ is the desired distance:
  - positive, if $\mathbf{x}$ is on the positive side;
  - negative, if $\mathbf{x}$ is on the negative side.
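This decomposition can be sketched numerically; the weights below are illustrative, and the formula $r = g(\mathbf{x})/\|\mathbf{w}\|$ used for the distance is the one derived on the following slides.

```python
import numpy as np

# Sketch of the decomposition x = x_p + r * w/||w|| with toy values.
w = np.array([1.0, 2.0])
w0 = -3.0

def g(x):
    return w @ x + w0

x = np.array([4.0, 3.0])
r = g(x) / np.linalg.norm(w)            # signed distance from x to H
x_p = x - r * w / np.linalg.norm(w)     # normal projection of x onto H

print(abs(g(x_p)) < 1e-9)   # True: x_p lies on the hyperplane
print(r > 0)                # True: x is on the positive side of H
```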
Now, since $g(\mathbf{x}_p) = 0$, we have that

$$\begin{aligned}
g(\mathbf{x}) &= g\left(\mathbf{x}_p + r\,\frac{\mathbf{w}}{\|\mathbf{w}\|}\right) = \mathbf{w}^{T}\left(\mathbf{x}_p + r\,\frac{\mathbf{w}}{\|\mathbf{w}\|}\right) + w_0 \\
&= \mathbf{w}^{T}\mathbf{x}_p + w_0 + r\,\frac{\mathbf{w}^{T}\mathbf{w}}{\|\mathbf{w}\|} \\
&= g(\mathbf{x}_p) + r\,\frac{\|\mathbf{w}\|^{2}}{\|\mathbf{w}\|} = r\,\|\mathbf{w}\|
\end{aligned}$$

Then, we have

$$r = \frac{g(\mathbf{x})}{\|\mathbf{w}\|} \qquad (5)$$
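Equation (5) translates directly into code; a minimal sketch with made-up weights:

```python
import numpy as np

# Signed distance r = g(x)/||w|| from Eq. (5), with made-up weights.
w = np.array([3.0, 4.0])   # ||w|| = 5
w0 = -5.0

def signed_distance(x):
    return (w @ x + w0) / np.linalg.norm(w)

print(signed_distance(np.array([3.0, 4.0])))   # (9 + 16 - 5) / 5 = 4.0
print(signed_distance(np.array([0.0, 0.0])))   # -5 / 5 = -1.0: negative side of H
```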
In particular

The distance from the origin to $H$:

$$r = \frac{g(\mathbf{0})}{\|\mathbf{w}\|} = \frac{\mathbf{w}^{T}\mathbf{0} + w_0}{\|\mathbf{w}\|} = \frac{w_0}{\|\mathbf{w}\|} \qquad (6)$$

Remarks

- If $w_0 > 0$, the origin is on the positive side of $H$.
- If $w_0 < 0$, the origin is on the negative side of $H$.
- If $w_0 = 0$, $g$ has the homogeneous form $\mathbf{w}^{T}\mathbf{x}$ and the hyperplane passes through the origin.
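A small sketch of these remarks, with a toy weight vector: the sign of $w_0$ decides on which side of $H$ the origin lies.

```python
import numpy as np

# The sign of w0 decides the side of H the origin falls on (toy weights).
w = np.array([3.0, 4.0])                 # ||w|| = 5

for w0 in (2.0, -2.0, 0.0):
    r0 = w0 / np.linalg.norm(w)          # distance from the origin to H, Eq. (6)
    if r0 > 0:
        side = "positive side"
    elif r0 < 0:
        side = "negative side"
    else:
        side = "H passes through the origin"
    print(w0, r0, side)
```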
In addition...

If we do the following:

$$g(\mathbf{x}) = w_0 + \sum_{i=1}^{d} w_i x_i = \sum_{i=0}^{d} w_i x_i \qquad (7)$$

by making $x_0 = 1$ and

$$\mathbf{y} = \begin{pmatrix} 1 \\ x_1 \\ \vdots \\ x_d \end{pmatrix} = \begin{pmatrix} 1 \\ \mathbf{x} \end{pmatrix}$$

where $\mathbf{y}$ is called an augmented feature vector.
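The augmentation can be sketched directly (toy values): with $\mathbf{y} = [1, \mathbf{x}]$ and $\mathbf{w}_{aug} = [w_0, \mathbf{w}]$, the discriminant becomes a single dot product.

```python
import numpy as np

# Augmented vectors: g(x) = w^T x + w0 collapses to w_aug^T y.
w = np.array([1.0, 2.0])
w0 = -3.0
x = np.array([2.0, 2.0])

y = np.concatenate(([1.0], x))       # augmented feature vector
w_aug = np.concatenate(([w0], w))    # augmented weight vector

print(w_aug @ y)                     # 3.0
print(w @ x + w0)                    # 3.0: identical to the non-augmented form
```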
In a similar way

We have the augmented weight vector:

$$\mathbf{w}_{aug} = \begin{pmatrix} w_0 \\ w_1 \\ \vdots \\ w_d \end{pmatrix} = \begin{pmatrix} w_0 \\ \mathbf{w} \end{pmatrix}$$

Remarks

- The addition of a constant component to $\mathbf{x}$ preserves all the distance relationships between samples.
- The resulting $\mathbf{y}$ vectors all lie in a $d$-dimensional subspace, which is the $\mathbf{x}$-space itself.
More Remarks

In addition:

- The hyperplane decision surface $H$ defined by $\mathbf{w}_{aug}^{T}\mathbf{y} = 0$ passes through the origin in $\mathbf{y}$-space, even though the corresponding hyperplane $H$ can be in any position in $\mathbf{x}$-space.
- The distance from $\mathbf{y}$ to $H$ is $\frac{|\mathbf{w}_{aug}^{T}\mathbf{y}|}{\|\mathbf{w}_{aug}\|}$, or $\frac{|g(\mathbf{x})|}{\|\mathbf{w}_{aug}\|}$.
- Since $\|\mathbf{w}_{aug}\| \geq \|\mathbf{w}\|$, this distance is less than or equal to the distance from $\mathbf{x}$ to $H$.

This mapping is quite useful, because we only need to find one weight vector $\mathbf{w}_{aug}$ instead of finding both the weight vector $\mathbf{w}$ and the threshold $w_0$.
Initial Supposition

Suppose we have $n$ samples $\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_n$, some labeled $\omega_1$ and some labeled $\omega_2$.

We want a weight vector $\mathbf{w}$ such that:

- $\mathbf{w}^{T}\mathbf{x}_i > 0$, if $\mathbf{x}_i \in \omega_1$;
- $\mathbf{w}^{T}\mathbf{x}_i < 0$, if $\mathbf{x}_i \in \omega_2$.

We suggest the following normalization: we replace all the samples $\mathbf{x}_i \in \omega_2$ by their negative vectors!
The Usefulness of the Normalization

Once the normalization is done, we only need a weight vector $\mathbf{w}$ such that $\mathbf{w}^{T}\mathbf{x}_i > 0$ for all the samples. This weight vector is called a separating vector or solution vector.
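The normalization trick can be sketched on toy data: negating the $\omega_2$ samples turns the pair of conditions $\mathbf{w}^{T}\mathbf{x}_i > 0$ / $\mathbf{w}^{T}\mathbf{x}_i < 0$ into the single condition $\mathbf{w}^{T}\mathbf{x}_i > 0$.

```python
import numpy as np

# Toy data: two samples per class, chosen so a separating vector exists.
X1 = np.array([[2.0, 1.0], [1.5, 2.0]])       # samples labeled omega_1
X2 = np.array([[-1.0, -2.0], [-2.0, -0.5]])   # samples labeled omega_2

X = np.vstack([X1, -X2])                      # replace omega_2 samples by their negatives

w = np.array([1.0, 1.0])                      # a candidate separating vector
print(np.all(X @ w > 0))                      # True: w is a solution vector
```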
Here, we have the solution region for $\mathbf{w}$

Do not confuse this region with the decision region! [Figure: the separating plane and the solution space for $\mathbf{w}$.]

Remark: $\mathbf{w}$ is not unique! Different $\mathbf{w}$'s can solve the problem.
Here, we have the solution region for $\mathbf{w}$ under normalization

Do not confuse this region with the decision region! [Figure: the "separating" plane and the solution space after normalization.]

Remark: $\mathbf{w}$ is not unique!
How do we get this $\mathbf{w}$?

In order to do this, we need to impose constraints on the problem.

Possible constraints:

- Find a unit-length weight vector that maximizes the minimum distance from the samples to the separating plane.
- Find the minimum-length weight vector satisfying $\mathbf{w}^{T}\mathbf{x}_i \geq b$ for all $i$, where $b$ is a positive constant called the margin.
- Here, the solution region resulting from the intersection of the half-spaces with $\mathbf{w}^{T}\mathbf{x}_i \geq b > 0$ lies within the previous solution region.
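The margin constraint is easy to check on toy data; every weight vector satisfying $\mathbf{w}^{T}\mathbf{x}_i \geq b$ also satisfies $\mathbf{w}^{T}\mathbf{x}_i > 0$, so the margin region sits inside the plain solution region.

```python
import numpy as np

# Margin constraint w^T x_i >= b on normalized toy samples.
X = np.array([[2.0, 1.0], [1.5, 2.0], [1.0, 2.0]])
b = 1.0                                         # the margin

def satisfies_margin(w):
    return bool(np.all(X @ w >= b))

print(satisfies_margin(np.array([1.0, 1.0])))   # True:  min_i w^T x_i = 3.0 >= 1
print(satisfies_margin(np.array([0.2, 0.2])))   # False: min_i w^T x_i = 0.6 <  1
```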
We have then a new boundary, shifted by a distance $\frac{b}{\|\mathbf{x}_i\|}$. [Figure: the solution region shrunk by the margin $b$.]
Gradient Descent

For this, we will define a criterion function $J(\mathbf{w})$: a classic optimization.

The basic procedure is as follows:

1. Start with a random weight vector $\mathbf{w}(1)$.
2. Compute the gradient vector $\nabla J(\mathbf{w}(1))$.
3. Obtain the value $\mathbf{w}(2)$ by moving from $\mathbf{w}(1)$ in the direction of steepest descent, i.e., along the negative of the gradient, using the following equation:

$$\mathbf{w}(k+1) = \mathbf{w}(k) - \eta(k)\,\nabla J(\mathbf{w}(k)) \qquad (8)$$
What is η (k)?

Here, η (k) is a positive scale factor, the learning rate!

The basic algorithm looks like this:

Algorithm 1 (Basic gradient descent)
1. begin initialize w, criterion θ, η (·), k = 0
2. do k = k + 1
3.    w = w − η (k) ∇J (w)
4. until ‖η (k) ∇J (w)‖ < θ
5. return w

Problem! How to choose the learning rate?
- If η (k) is too small, convergence is quite slow.
- If η (k) is too large, the correction will overshoot and can even diverge!

29 / 85
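As a concrete sketch, Algorithm 1 can be written in a few lines of Python. The quadratic criterion J (w) = ‖w‖² used to exercise it is a hypothetical example, not from the slides:

```python
import numpy as np

def gradient_descent(grad_J, w0, eta, theta=1e-8, max_iter=10_000):
    """Basic gradient descent (Algorithm 1): step along the negative
    gradient until the update eta * grad_J(w) falls below theta."""
    w = np.asarray(w0, dtype=float)
    for _ in range(max_iter):
        step = eta * grad_J(w)
        w = w - step
        if np.linalg.norm(step) < theta:
            break
    return w

# J(w) = ||w||^2 has gradient 2w and its minimum at the origin.
w_star = gradient_descent(lambda w: 2 * w, w0=[3.0, -4.0], eta=0.1)
```

With a fixed η = 0.1 the iterate shrinks geometrically toward the origin; a larger η (here, anything above 1.0) would make the update overshoot and diverge, exactly the failure mode described above.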
![Page 79: 04 Machine Learning - Supervised Linear Classifier](https://reader031.vdocuments.site/reader031/viewer/2022021920/586e84db1a28aba0038b61b1/html5/thumbnails/79.jpg)
Using Taylor's second-order expansion around the value w (k)

We do the following:

J (w) ≅ J (w (k)) + ∇Jᵀ (w − w (k)) + ½ (w − w (k))ᵀ H (w − w (k))   (9)

Remark: this is known as the second-order Taylor expansion!

Here, we have:
- ∇J is the vector of partial derivatives ∂J/∂wᵢ evaluated at w (k).
- H is the Hessian matrix of second partial derivatives ∂²J/∂wᵢ∂wⱼ evaluated at w (k).

30 / 85
![Page 83: 04 Machine Learning - Supervised Linear Classifier](https://reader031.vdocuments.site/reader031/viewer/2022021920/586e84db1a28aba0038b61b1/html5/thumbnails/83.jpg)
Then

We substitute (Eq. 8) into (Eq. 9):

w (k + 1) − w (k) = −η (k) ∇J (w (k))   (10)

We have then:

J (w (k + 1)) ≅ J (w (k)) + ∇Jᵀ (−η (k) ∇J (w (k))) + ½ (−η (k) ∇J (w (k)))ᵀ H (−η (k) ∇J (w (k)))

Finally, we have:

J (w (k + 1)) ≅ J (w (k)) − η (k) ‖∇J‖² + ½ η² (k) ∇Jᵀ H ∇J   (11)

31 / 85
![Page 86: 04 Machine Learning - Supervised Linear Classifier](https://reader031.vdocuments.site/reader031/viewer/2022021920/586e84db1a28aba0038b61b1/html5/thumbnails/86.jpg)
Differentiate with respect to η (k) and set the result equal to zero

We have then:

−‖∇J‖² + η (k) ∇Jᵀ H ∇J = 0   (12)

Finally:

η (k) = ‖∇J‖² / (∇Jᵀ H ∇J)   (13)

Remark: this is the optimal step size!

Problem! Calculating H can be quite expensive!

32 / 85
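A minimal numerical check of Eq. 13, using a hypothetical quadratic criterion J (w) = ½ wᵀHw (so ∇J = Hw). A known property of the exact step is that the gradient at the new point is orthogonal to the old gradient:

```python
import numpy as np

# Hypothetical quadratic criterion J(w) = 0.5 * w^T H w, so grad J = H w.
H = np.array([[4.0, 1.0],
              [1.0, 3.0]])
w = np.array([1.0, 2.0])
g = H @ w  # gradient at the current w

# Optimal step size from Eq. 13: eta = ||grad J||^2 / (grad J^T H grad J)
eta = (g @ g) / (g @ H @ g)
w_next = w - eta * g

# By construction, the new gradient H w_next is orthogonal to g.
g_next = H @ w_next
```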
![Page 89: 04 Machine Learning - Supervised Linear Classifier](https://reader031.vdocuments.site/reader031/viewer/2022021920/586e84db1a28aba0038b61b1/html5/thumbnails/89.jpg)
We can have an adaptive line search!

We can keep everything fixed except η (k). Then, we have the following function of the step size:

f (η (k)) = J (w (k) − η (k) ∇J (w (k)))

We can optimize it using line-search methods:
- Backtracking line search
- Bisection method
- Golden ratio
- Etc.

33 / 85
![Page 95: 04 Machine Learning - Supervised Linear Classifier](https://reader031.vdocuments.site/reader031/viewer/2022021920/586e84db1a28aba0038b61b1/html5/thumbnails/95.jpg)
Example: Golden Ratio

Imagine that you have a function f : L → R defined on an interval L.

Where: choose a and b such that (a + b)/a = a/b (the Golden Ratio).

34 / 85
![Page 97: 04 Machine Learning - Supervised Linear Classifier](https://reader031.vdocuments.site/reader031/viewer/2022021920/586e84db1a28aba0038b61b1/html5/thumbnails/97.jpg)
The process is as follows

Given f1, f2, f3, where:
- f1 = f (x1)
- f2 = f (x2)
- f3 = f (x3)

If f2 is smaller than both f1 and f3, then the minimum lies in [x1, x3].

Now, we generate x4 with f4 = f (x4) in the largest subinterval, here [x2, x3].

35 / 85
![Page 100: 04 Machine Learning - Supervised Linear Classifier](https://reader031.vdocuments.site/reader031/viewer/2022021920/586e84db1a28aba0038b61b1/html5/thumbnails/100.jpg)
Finally

Two cases:
- Case (a): if f4 > f2, the minimum lies between x1 and x4, and the new triplet is x1, x2, x4.
- Case (b): if f4 < f2, the minimum lies between x2 and x3, and the new triplet is x2, x4, x3.

Then repeat the procedure!

For more, please read the paper "Sequential Minimax Search for a Maximum" by J. Kiefer.

36 / 85
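The triplet-shrinking procedure above is the golden-section search. A sketch in Python, with the unimodal test function f(x) = (x − 2)² as a hypothetical example:

```python
import math

def golden_section_min(f, a, b, tol=1e-8):
    """Golden-section search for the minimum of a unimodal f on [a, b].
    Each step shrinks the bracket by the golden ratio and reuses one
    interior evaluation, as in the triplet update described above."""
    invphi = (math.sqrt(5) - 1) / 2  # 1/phi, about 0.618
    x1 = b - invphi * (b - a)        # lower interior point
    x2 = a + invphi * (b - a)        # upper interior point
    f1, f2 = f(x1), f(x2)
    while b - a > tol:
        if f1 < f2:                  # minimum lies in [a, x2]
            b, x2, f2 = x2, x1, f1
            x1 = b - invphi * (b - a)
            f1 = f(x1)
        else:                        # minimum lies in [x1, b]
            a, x1, f1 = x1, x2, f2
            x2 = a + invphi * (b - a)
            f2 = f(x2)
    return (a + b) / 2

x_min = golden_section_min(lambda x: (x - 2.0) ** 2, 0.0, 5.0)
```

Only one new function evaluation is needed per iteration, which is what makes this bracketing scheme attractive for the step-size search f (η (k)).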
![Page 104: 04 Machine Learning - Supervised Linear Classifier](https://reader031.vdocuments.site/reader031/viewer/2022021920/586e84db1a28aba0038b61b1/html5/thumbnails/104.jpg)
We have another method...

Differentiate the second-order Taylor expansion with respect to w:

J (w) ≅ J (w (k)) + ∇Jᵀ (w − w (k)) + ½ (w − w (k))ᵀ H (w − w (k))

We get:

∇J + Hw − Hw (k) = 0   (14)

Thus:

Hw = Hw (k) − ∇J
H⁻¹Hw = H⁻¹Hw (k) − H⁻¹∇J
w = w (k) − H⁻¹∇J

37 / 85
![Page 107: 04 Machine Learning - Supervised Linear Classifier](https://reader031.vdocuments.site/reader031/viewer/2022021920/586e84db1a28aba0038b61b1/html5/thumbnails/107.jpg)
The Newton-Raphson Algorithm

We have the following algorithm:

Algorithm 2 (Newton descent)
1. begin initialize w, criterion θ, k = 0
2. do k = k + 1
3.    w = w − H⁻¹∇J (w)
4. until ‖H⁻¹∇J (w)‖ < θ
5. return w

38 / 85
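A sketch of Algorithm 2 in Python. The quadratic criterion below is a hypothetical example, chosen because Newton descent reaches its minimum in a single step:

```python
import numpy as np

def newton_descent(grad_J, hess_J, w0, theta=1e-10, max_iter=100):
    """Newton descent (Algorithm 2): step by -H^{-1} grad J until the
    step size drops below the criterion theta."""
    w = np.asarray(w0, dtype=float)
    for _ in range(max_iter):
        # Solve H step = grad J instead of forming H^{-1} explicitly.
        step = np.linalg.solve(hess_J(w), grad_J(w))
        w = w - step
        if np.linalg.norm(step) < theta:
            break
    return w

# Quadratic J(w) = 0.5 w^T H w - b^T w: grad J = H w - b, Hessian = H,
# so one Newton step lands exactly on the minimizer H w = b.
H = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 1.0])
w_star = newton_descent(lambda w: H @ w - b, lambda w: H, [10.0, -7.0])
```

The convergence in one step for quadratics is the upside; the cost of forming and solving with H at every iteration is the expense flagged earlier.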
![Page 114: 04 Machine Learning - Supervised Linear Classifier](https://reader031.vdocuments.site/reader031/viewer/2022021920/586e84db1a28aba0038b61b1/html5/thumbnails/114.jpg)
Outline

1. Introduction
   - The Simplest Functions
   - Splitting the Space
   - The Decision Surface
2. Developing an Initial Solution
   - Gradient Descent Procedure
   - The Geometry of a Two-Category Linearly-Separable Case
     - Basic Method
   - Minimum Squared Error Procedure
     - The Error Idea
     - The Final Error Equation
     - The Data Matrix
     - Multi-Class Solution
     - Issues with Least Squares!!!
     - What about Numerical Stability?

39 / 85
![Page 115: 04 Machine Learning - Supervised Linear Classifier](https://reader031.vdocuments.site/reader031/viewer/2022021920/586e84db1a28aba0038b61b1/html5/thumbnails/115.jpg)
Initial Setup

Important: we now move away from our initial normalization of the samples!

We are going to use the method known as Minimum Squared Error.

40 / 85
![Page 117: 04 Machine Learning - Supervised Linear Classifier](https://reader031.vdocuments.site/reader031/viewer/2022021920/586e84db1a28aba0038b61b1/html5/thumbnails/117.jpg)
Now, assume the following

Imagine that your problem has two classes ω1 and ω2 in R²:
1. They are linearly separable!
2. You need to label them.

We have a problem! Which one? We do not know the hyperplane, so: what distance does each point have to the hyperplane?

41 / 85
![Page 121: 04 Machine Learning - Supervised Linear Classifier](https://reader031.vdocuments.site/reader031/viewer/2022021920/586e84db1a28aba0038b61b1/html5/thumbnails/121.jpg)
A Simple Solution For Our Quandary

Label the classes:
- ω1 ⟹ +1
- ω2 ⟹ −1

We produce the following labels:
1. If x ∈ ω1 then y_ideal = g_ideal (x) = +1.
2. If x ∈ ω2 then y_ideal = g_ideal (x) = −1.

Remark: we have a problem with these labels!

42 / 85
![Page 126: 04 Machine Learning - Supervised Linear Classifier](https://reader031.vdocuments.site/reader031/viewer/2022021920/586e84db1a28aba0038b61b1/html5/thumbnails/126.jpg)
Now, What?

Assume the true function is given by:

y_noise = g_noise (x) = wᵀx + w0 + ε   (15)

where the noise term satisfies ε ∼ N (µ, σ²).

Thus, we can write:

y_noise = g_noise (x) = g_ideal (x) + ε   (16)

44 / 85
![Page 129: 04 Machine Learning - Supervised Linear Classifier](https://reader031.vdocuments.site/reader031/viewer/2022021920/586e84db1a28aba0038b61b1/html5/thumbnails/129.jpg)
Thus, we have

What to do?

ε = y_noise − g_ideal (x)   (17)

Graphically: (figure on slide)

45 / 85
![Page 132: 04 Machine Learning - Supervised Linear Classifier](https://reader031.vdocuments.site/reader031/viewer/2022021920/586e84db1a28aba0038b61b1/html5/thumbnails/132.jpg)
Sum Over All Errors

We can do the following:

J (w) = Σ_{i=1}^{N} εᵢ² = Σ_{i=1}^{N} (yᵢ − g_ideal (xᵢ))²   (18)

Remark: this is known as least squares (fitting the vertical offsets!).

Generalize: if the dimensionality of each sample (data point) is d, you can extend each sample vector to xᵀ = (1, x′), and we have:

Σ_{i=1}^{N} (yᵢ − xᵢᵀw)² = (y − Xw)ᵀ (y − Xw) = ‖y − Xw‖₂²   (19)

47 / 85
![Page 137: 04 Machine Learning - Supervised Linear Classifier](https://reader031.vdocuments.site/reader031/viewer/2022021920/586e84db1a28aba0038b61b1/html5/thumbnails/137.jpg)
What is X
It is the Data Matrix

X = \begin{pmatrix}
1 & (x_1)_1 & \cdots & (x_1)_j & \cdots & (x_1)_d \\
\vdots & & & \vdots & & \vdots \\
1 & (x_i)_1 & \cdots & (x_i)_j & \cdots & (x_i)_d \\
\vdots & & & \vdots & & \vdots \\
1 & (x_N)_1 & \cdots & (x_N)_j & \cdots & (x_N)_d
\end{pmatrix}   (20)

We know the following

\frac{d\, x^T A x}{dx} = Ax + A^T x, \qquad \frac{d\, Ax}{dx} = A   (21)
49 / 85
![Page 139: 04 Machine Learning - Supervised Linear Classifier](https://reader031.vdocuments.site/reader031/viewer/2022021920/586e84db1a28aba0038b61b1/html5/thumbnails/139.jpg)
Note about other representations
We could have x^T = (x_1, x_2, ..., x_d, 1), thus

X = \begin{pmatrix}
(x_1)_1 & \cdots & (x_1)_j & \cdots & (x_1)_d & 1 \\
\vdots & & \vdots & & \vdots & \\
(x_i)_1 & \cdots & (x_i)_j & \cdots & (x_i)_d & 1 \\
\vdots & & \vdots & & \vdots & \\
(x_N)_1 & \cdots & (x_N)_j & \cdots & (x_N)_d & 1
\end{pmatrix}   (22)
50 / 85
![Page 140: 04 Machine Learning - Supervised Linear Classifier](https://reader031.vdocuments.site/reader031/viewer/2022021920/586e84db1a28aba0038b61b1/html5/thumbnails/140.jpg)
We can expand our quadratic formula!!!
Thus

(y - Xw)^T (y - Xw) = y^T y - w^T X^T y - y^T X w + w^T X^T X w   (23)

Differentiating with respect to w, setting the result to zero, and assuming that X^T X is invertible, we obtain

w = (X^T X)^{-1} X^T y   (24)

Note: X^T X is always positive semi-definite. If it is also invertible, it is positive definite.
Thus, how do we get the discriminant function? Any ideas?
51 / 85
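Eq. (24) can be computed literally with an explicit inverse, although in practice a least-squares solver is preferred for numerical reasons. A minimal sketch on hypothetical random data, comparing both routes:

```python
import numpy as np

# Hypothetical data: N = 20 samples, d = 3 features, extended with a bias column.
rng = np.random.default_rng(1)
N, d = 20, 3
X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, d))])
y = rng.normal(size=N)

# Literal closed form of Eq. (24): w = (X^T X)^{-1} X^T y.
w_normal = np.linalg.inv(X.T @ X) @ X.T @ y

# The numerically preferable route: a least-squares solver.
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

assert np.allclose(w_normal, w_lstsq)
```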
![Page 143: 04 Machine Learning - Supervised Linear Classifier](https://reader031.vdocuments.site/reader031/viewer/2022021920/586e84db1a28aba0038b61b1/html5/thumbnails/143.jpg)
The Final Discriminant Function
Very Simple!!!

g(x) = x^T w = x^T (X^T X)^{-1} X^T y   (25)
52 / 85
![Page 144: 04 Machine Learning - Supervised Linear Classifier](https://reader031.vdocuments.site/reader031/viewer/2022021920/586e84db1a28aba0038b61b1/html5/thumbnails/144.jpg)
Pseudo-inverse of a Matrix
Definition
Suppose that A ∈ R^{m×n} has full column rank, i.e. rank(A) = n. We call the matrix

A^+ = (A^T A)^{-1} A^T

the pseudo-inverse of A.

Reason
A^+ inverts A on its image.

What?
If w ∈ image(A), then there is some v ∈ R^n such that w = Av. Hence:

A^+ w = A^+ A v = (A^T A)^{-1} A^T A v = v
53 / 85
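The defining property above (A^+ inverts A on its image) is easy to verify numerically. A sketch on a hypothetical random matrix, which has full column rank with probability 1:

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.normal(size=(6, 3))  # tall matrix, full column rank almost surely

# A^+ = (A^T A)^{-1} A^T
A_plus = np.linalg.inv(A.T @ A) @ A.T

# A^+ inverts A on its image: A^+ (A v) = v for any v in R^n.
v = rng.normal(size=3)
assert np.allclose(A_plus @ (A @ v), v)

# It agrees with NumPy's SVD-based pinv when A has full column rank.
assert np.allclose(A_plus, np.linalg.pinv(A))
```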
![Page 147: 04 Machine Learning - Supervised Linear Classifier](https://reader031.vdocuments.site/reader031/viewer/2022021920/586e84db1a28aba0038b61b1/html5/thumbnails/147.jpg)
What lives where?
We have
X ∈ R^{N×(d+1)}
image(X) = span{X^{col}_1, ..., X^{col}_{d+1}}
x_i ∈ R^d
w ∈ R^{d+1}
X^{col}_i, y ∈ R^N

Basically y, the vector of desired outputs, is being projected into

span{X^{col}_1, ..., X^{col}_{d+1}}   (26)

by the projection operator X (X^T X)^{-1} X^T.
54 / 85
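The two characteristic facts of the projection operator X(X^T X)^{-1}X^T can be checked directly: it is idempotent, and it leaves a residual orthogonal to every column of X. A sketch on hypothetical random data:

```python
import numpy as np

rng = np.random.default_rng(3)
X = np.hstack([np.ones((8, 1)), rng.normal(size=(8, 2))])
y = rng.normal(size=8)

# Projection operator onto image(X).
P = X @ np.linalg.inv(X.T @ X) @ X.T
y_hat = P @ y  # the projection of y onto span of the columns of X

# Idempotent: projecting twice changes nothing (P^2 = P).
assert np.allclose(P @ y_hat, y_hat)

# The residual y - y_hat is orthogonal to every column of X.
assert np.allclose(X.T @ (y - y_hat), 0)
```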
![Page 153: 04 Machine Learning - Supervised Linear Classifier](https://reader031.vdocuments.site/reader031/viewer/2022021920/586e84db1a28aba0038b61b1/html5/thumbnails/153.jpg)
Geometric Interpretation
We have
1 The image of the mapping w ↦ Xw is a linear subspace of R^N.
2 As w runs through all points of R^{d+1}, the function value Xw runs through all points in the image space image(X) = span{X^{col}_1, ..., X^{col}_{d+1}}.
3 Each w defines one point Xw = \sum_{j=0}^{d} w_j X^{col}_j.
4 The least-squares solution w is the one for which Xw minimizes the distance d(y, image(X)).
55 / 85
![Page 157: 04 Machine Learning - Supervised Linear Classifier](https://reader031.vdocuments.site/reader031/viewer/2022021920/586e84db1a28aba0038b61b1/html5/thumbnails/157.jpg)
Geometrically
Ahhhh!!!
56 / 85
![Page 159: 04 Machine Learning - Supervised Linear Classifier](https://reader031.vdocuments.site/reader031/viewer/2022021920/586e84db1a28aba0038b61b1/html5/thumbnails/159.jpg)
Multi-Class Solution
What to do?
1 We might reduce the problem to c - 1 two-class problems.
2 We might use c(c-1)/2 linear discriminants, one for every pair of classes.

However
58 / 85
![Page 162: 04 Machine Learning - Supervised Linear Classifier](https://reader031.vdocuments.site/reader031/viewer/2022021920/586e84db1a28aba0038b61b1/html5/thumbnails/162.jpg)
What to do?
Define c linear discriminant functions

g_i(x) = w_i^T x + w_{i0}  for i = 1, ..., c   (27)

This is known as a linear machine
Rule: if g_k(x) > g_j(x) for all j ≠ k =⇒ x ∈ ω_k

Nice Properties (It can be proved!!!)
1 Decision regions are singly connected.
2 Decision regions are convex.
59 / 85
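The decision rule above is just an argmax over the c discriminants. A minimal sketch, with hypothetical hand-picked weights (the values are illustrative, not from the slides):

```python
import numpy as np

# Hypothetical weights for c = 3 classes in d = 2 dimensions,
# with the bias folded into the first column (extended x = (1, x')).
W = np.array([[0.0,  1.0, 0.0],    # row i holds (w_i0, w_i1, w_i2)
              [0.0, -1.0, 0.0],
              [0.5,  0.0, 1.0]])

def classify(x):
    x_ext = np.concatenate([[1.0], x])
    g = W @ x_ext                # g_i(x) = w_i^T x + w_i0 for every class at once
    return int(np.argmax(g))     # rule: pick k with g_k(x) > g_j(x) for all j != k

# g = (2, -2, 0.5), so class 0 wins for this point.
assert classify(np.array([2.0, 0.0])) == 0
```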
![Page 165: 04 Machine Learning - Supervised Linear Classifier](https://reader031.vdocuments.site/reader031/viewer/2022021920/586e84db1a28aba0038b61b1/html5/thumbnails/165.jpg)
Proof of Properties
Proof
Actually quite simple. Given two points x_A and x_B that both lie in decision region k, consider any point on the segment between them:

y = λ x_A + (1 - λ) x_B

with λ ∈ (0, 1).
60 / 85
![Page 167: 04 Machine Learning - Supervised Linear Classifier](https://reader031.vdocuments.site/reader031/viewer/2022021920/586e84db1a28aba0038b61b1/html5/thumbnails/167.jpg)
Proof of Properties
We know that

g_k(y) = w_k^T (λ x_A + (1 - λ) x_B) + w_{k0}
       = λ w_k^T x_A + λ w_{k0} + (1 - λ) w_k^T x_B + (1 - λ) w_{k0}
       = λ g_k(x_A) + (1 - λ) g_k(x_B)
       > λ g_j(x_A) + (1 - λ) g_j(x_B)
       = g_j(λ x_A + (1 - λ) x_B)
       = g_j(y)

for all j ≠ k, since g_k(x_A) > g_j(x_A) and g_k(x_B) > g_j(x_B).

Or...
y belongs to region k as defined by the rule!!! This region is convex and singly connected by the definition of y.
61 / 85
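The key step of this proof is that an affine discriminant commutes with convex combinations. That step can be checked numerically; this is a minimal sketch with arbitrary (hypothetical) weights:

```python
import numpy as np

rng = np.random.default_rng(4)
w, w0 = rng.normal(size=2), rng.normal()

def g(x):
    # An affine discriminant g(x) = w^T x + w0.
    return w @ x + w0

xA, xB = rng.normal(size=2), rng.normal(size=2)
lam = 0.3
y = lam * xA + (1 - lam) * xB

# The step the proof relies on: g(lam*xA + (1-lam)*xB) = lam*g(xA) + (1-lam)*g(xB).
assert np.isclose(g(y), lam * g(xA) + (1 - lam) * g(xB))
```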
![Page 175: 04 Machine Learning - Supervised Linear Classifier](https://reader031.vdocuments.site/reader031/viewer/2022021920/586e84db1a28aba0038b61b1/html5/thumbnails/175.jpg)
However!!!
Not so nice properties!!!
It limits the power of classification in multi-class problems.
62 / 85
![Page 176: 04 Machine Learning - Supervised Linear Classifier](https://reader031.vdocuments.site/reader031/viewer/2022021920/586e84db1a28aba0038b61b1/html5/thumbnails/176.jpg)
How do we train this Linear Machine?
We know that each class ω_k is described by

g_k(x) = w_k^T x + w_{k0}  where k = 1, ..., c

We then design a single machine

g(x) = W^T x   (28)
63 / 85
![Page 178: 04 Machine Learning - Supervised Linear Classifier](https://reader031.vdocuments.site/reader031/viewer/2022021920/586e84db1a28aba0038b61b1/html5/thumbnails/178.jpg)
Where
We have the following (row i holds the bias w_{i0} followed by the weights of class i)

W^T = \begin{pmatrix}
w_{10} & w_{11} & w_{12} & \cdots & w_{1d} \\
w_{20} & w_{21} & w_{22} & \cdots & w_{2d} \\
w_{30} & w_{31} & w_{32} & \cdots & w_{3d} \\
\vdots & \vdots & \vdots & & \vdots \\
w_{c0} & w_{c1} & w_{c2} & \cdots & w_{cd}
\end{pmatrix}   (29)

What about the labels?
OK, we know how to do it with 2 classes. What about many classes?
64 / 85
![Page 180: 04 Machine Learning - Supervised Linear Classifier](https://reader031.vdocuments.site/reader031/viewer/2022021920/586e84db1a28aba0038b61b1/html5/thumbnails/180.jpg)
How do we train this Linear Machine?
Use a vector t_i with dimensionality c to identify the class of each element
We then have the following dataset

{x_i, t_i} for i = 1, 2, ..., N

We build the following matrix of target vectors

T = \begin{pmatrix} t_1^T \\ t_2^T \\ \vdots \\ t_N^T \end{pmatrix}   (30)
65 / 85
![Page 182: 04 Machine Learning - Supervised Linear Classifier](https://reader031.vdocuments.site/reader031/viewer/2022021920/586e84db1a28aba0038b61b1/html5/thumbnails/182.jpg)
Thus, we create the following Matrix
A Matrix containing all the required information

XW - T   (31)

where row i of XW is the vector

[x_i^T w_1, x_i^T w_2, x_i^T w_3, ..., x_i^T w_c]   (32)

Remark: it is the result of multiplying row i of X against W.

That row is compared to the vector t_i^T in T by subtraction of vectors:

ε_i = [x_i^T w_1, x_i^T w_2, x_i^T w_3, ..., x_i^T w_c] - t_i^T   (33)
66 / 85
![Page 185: 04 Machine Learning - Supervised Linear Classifier](https://reader031.vdocuments.site/reader031/viewer/2022021920/586e84db1a28aba0038b61b1/html5/thumbnails/185.jpg)
What do we want?
We want the quadratic error

\frac{1}{2} \|ε_i\|^2

These specific quadratic errors are on the diagonal of the matrix

(XW - T)^T (XW - T)

We can use the trace function to generate the desired total error

J(·) = \frac{1}{2} \sum_{i=1}^{N} \|ε_i\|^2   (34)
67 / 85
![Page 188: 04 Machine Learning - Supervised Linear Classifier](https://reader031.vdocuments.site/reader031/viewer/2022021920/586e84db1a28aba0038b61b1/html5/thumbnails/188.jpg)
Then
The trace allows us to express the total error

J(W) = \frac{1}{2} Trace{(XW - T)^T (XW - T)}   (35)

Thus, we have by the same derivative method

W = (X^T X)^{-1} X^T T = X^+ T   (36)
68 / 85
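Eq. (36) gives the whole multi-class training in one least-squares solve. A sketch on hypothetical synthetic data, using one-hot vectors as a common choice for the target vectors t_i (the class-dependent shift of the features is just a way to make the toy classes distinguishable):

```python
import numpy as np

rng = np.random.default_rng(5)
N, d, c = 30, 2, 3
labels = rng.integers(0, c, size=N)

# Extended data matrix X (leading bias column), with a crude class-dependent shift.
X = np.hstack([np.ones((N, 1)),
               rng.normal(size=(N, d)) + labels[:, None]])

# One-hot target matrix T, whose rows are the t_i^T of Eq. (30).
T = np.eye(c)[labels]

# Eq. (36): W = (X^T X)^{-1} X^T T = X^+ T, computed via a least-squares solver.
W = np.linalg.lstsq(X, T, rcond=None)[0]

# Classify every sample with the linear-machine rule: argmax over g(x) = W^T x.
pred = np.argmax(X @ W, axis=1)
assert pred.shape == (N,)
```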
![Page 190: 04 Machine Learning - Supervised Linear Classifier](https://reader031.vdocuments.site/reader031/viewer/2022021920/586e84db1a28aba0038b61b1/html5/thumbnails/190.jpg)
How do we train this Linear Machine?
Thus, we obtain the discriminant

g(x) = W^T x = T^T (X^+)^T x   (37)
69 / 85
![Page 192: 04 Machine Learning - Supervised Linear Classifier](https://reader031.vdocuments.site/reader031/viewer/2022021920/586e84db1a28aba0038b61b1/html5/thumbnails/192.jpg)
Issues with Least Squares
Robustness
1 Least squares works only if X has full column rank, i.e. if X^T X is invertible.
2 If X^T X is almost not invertible, least squares is numerically unstable.
  Statistical consequence: high variance of predictions.

Not suited for high-dimensional data
1 Modern problems: many dimensions/features/predictors (possibly thousands).
2 Only a few of these may be important, so:
  1 It needs some form of feature selection.
  2 Possibly some type of regularization.

Why?
1 It treats all dimensions equally.
2 Relevant dimensions are averaged with irrelevant ones.
71 / 85
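The instability when X^T X is "almost not invertible" shows up directly in its condition number. A sketch with two nearly collinear (hypothetical) features; the ridge remedy at the end is one common fix, mentioned here as an illustration of the regularization the slide alludes to, not as part of the slides' derivation:

```python
import numpy as np

rng = np.random.default_rng(6)
N = 50
x1 = rng.normal(size=N)
x2 = x1 + 1e-8 * rng.normal(size=N)   # nearly collinear copy of x1
X = np.column_stack([np.ones(N), x1, x2])

# X^T X is almost singular, so its condition number explodes.
cond = np.linalg.cond(X.T @ X)
assert cond > 1e10

# A common remedy (ridge regularization): replace X^T X with X^T X + lam * I,
# which is invertible for any lam > 0 and far better conditioned.
lam = 1e-3
A = X.T @ X + lam * np.eye(3)
assert np.linalg.cond(A) < cond
```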
![Page 201: 04 Machine Learning - Supervised Linear Classifier](https://reader031.vdocuments.site/reader031/viewer/2022021920/586e84db1a28aba0038b61b1/html5/thumbnails/201.jpg)
Issues with Least Squares
Problem with Outliers (figure: the fitted line with no outliers vs. with outliers)
72 / 85
![Page 202: 04 Machine Learning - Supervised Linear Classifier](https://reader031.vdocuments.site/reader031/viewer/2022021920/586e84db1a28aba0038b61b1/html5/thumbnails/202.jpg)
Issues with Least Squares
What about the Linear Machine?
Please, run the algorithm and tell me...
73 / 85
![Page 203: 04 Machine Learning - Supervised Linear Classifier](https://reader031.vdocuments.site/reader031/viewer/2022021920/586e84db1a28aba0038b61b1/html5/thumbnails/203.jpg)
What to Do About Numerical Stability?
Regularity
A matrix which is not invertible is also called a singular matrix. A matrix which is invertible (not singular) is called regular.

In computations
Intuitions:
1 A singular matrix maps an entire linear subspace into a single point.
2 If a matrix maps points far away from each other to points very close to each other, it almost behaves like a singular matrix.

This behavior is related to the eigenvalues!!!
Large positive eigenvalues ⇒ the mapping stretches a lot along that direction!!!
Small positive eigenvalues ⇒ the mapping shrinks a lot along that direction!!!
74 / 85
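These intuitions can be sketched numerically (hypothetical data, assuming NumPy): two nearly collinear columns make XᵀX almost singular, its smallest eigenvalue becomes tiny, and the least-squares coefficients computed from the normal equations become unstable.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100

# Two nearly collinear features: X^T X is almost singular.
x1 = rng.normal(size=n)
X = np.column_stack([np.ones(n), x1, x1 + 1e-6 * rng.normal(size=n)])
y = 2.0 + 3.0 * x1 + 0.1 * rng.normal(size=n)

XtX = X.T @ X
eigvals = np.linalg.eigvalsh(XtX)   # sorted ascending; XtX is symmetric PSD
cond = eigvals[-1] / abs(eigvals[0])
print("smallest eigenvalue:", eigvals[0])
print("condition number:", cond)    # huge => near-singular

# Plain least squares through the normal equations: unstable coefficients.
w_ls = np.linalg.solve(XtX, X.T @ y)
print("least-squares w:", w_ls)     # typically far from the true (2, 3, 0)
```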
![Page 208: 04 Machine Learning - Supervised Linear Classifier](https://reader031.vdocuments.site/reader031/viewer/2022021920/586e84db1a28aba0038b61b1/html5/thumbnails/208.jpg)
Outline
1 IntroductionThe Simplest FunctionsSplitting the SpaceThe Decision Surface
2 Developing an Initial SolutionGradient Descent Procedure
The Geometry of a Two-Category Linearly-Separable CaseBasic Method
Minimum Squared Error ProcedureThe Error IdeaThe Final Error EquationThe Data MatrixMulti-Class SolutionIssues with Least Squares!!!What about Numerical Stability?
75 / 85
![Page 209: 04 Machine Learning - Supervised Linear Classifier](https://reader031.vdocuments.site/reader031/viewer/2022021920/586e84db1a28aba0038b61b1/html5/thumbnails/209.jpg)
What to Do About Numerical Stability?
All this comes from the following statement
A positive semi-definite matrix A is singular ⇐⇒ its smallest eigenvalue is 0.

Consequence for Statistics
If a statistical prediction involves the inverse of an almost-singular matrix, the predictions become unreliable (high variance).
76 / 85
![Page 211: 04 Machine Learning - Supervised Linear Classifier](https://reader031.vdocuments.site/reader031/viewer/2022021920/586e84db1a28aba0038b61b1/html5/thumbnails/211.jpg)
What can be done?

Ridge Regression
Ridge regression is a modification of least squares. It tries to make least squares more robust when XᵀX is almost singular.

The solution

w_Ridge = (XᵀX + λI)⁻¹ Xᵀy   (38)

where λ is a tuning parameter.

Thus, we can do the following, given that XᵀX is positive semi-definite
Assume that ξ₁, ξ₂, ..., ξ_{d+1} are eigenvectors of XᵀX with eigenvalues λ₁, λ₂, ..., λ_{d+1}:

(XᵀX + λI) ξᵢ = (λᵢ + λ) ξᵢ   (39)

i.e. λᵢ + λ is an eigenvalue of (XᵀX + λI).

77 / 85
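A minimal sketch of Eq. (38) and of the eigenvalue shift in Eq. (39), assuming NumPy and a hypothetical random design matrix:

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge solution, Eq. (38): w = (X^T X + lam*I)^{-1} X^T y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3))
y = rng.normal(size=50)

lam = 0.5
w = ridge_fit(X, y, lam)

# Eq. (39): adding lam*I shifts every eigenvalue of X^T X up by exactly lam.
eig_before = np.linalg.eigvalsh(X.T @ X)
eig_after = np.linalg.eigvalsh(X.T @ X + lam * np.eye(3))
print(eig_after - eig_before)   # approximately [0.5, 0.5, 0.5]
```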
![Page 214: 04 Machine Learning - Supervised Linear Classifier](https://reader031.vdocuments.site/reader031/viewer/2022021920/586e84db1a28aba0038b61b1/html5/thumbnails/214.jpg)
What does this mean?

Something Notable
You can control the singularity by detecting the smallest eigenvalue.

Thus
We add an appropriate tuning value λ.
78 / 85
![Page 216: 04 Machine Learning - Supervised Linear Classifier](https://reader031.vdocuments.site/reader031/viewer/2022021920/586e84db1a28aba0038b61b1/html5/thumbnails/216.jpg)
Thus, what do we need to do?

Process
1 Find the eigenvalues of XᵀX.
2 If all of them are bigger than zero, we are fine!!!
3 Otherwise, find the smallest one, then tune λ if necessary.
4 Build w_Ridge = (XᵀX + λI)⁻¹ Xᵀy.
79 / 85
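The four steps of this process can be sketched as follows (a hypothetical helper, assuming NumPy; the threshold eps and fallback lam are illustrative choices, not prescribed by the slides):

```python
import numpy as np

def ridge_with_check(X, y, eps=1e-8, lam=1.0):
    XtX = X.T @ X
    eigvals = np.linalg.eigvalsh(XtX)        # step 1: eigenvalues of X^T X
    if eigvals[0] > eps * eigvals[-1]:       # step 2: all safely above zero?
        lam = 0.0                            # plain least squares is fine
    # steps 3-4: otherwise keep lam > 0 and build w_Ridge
    return np.linalg.solve(XtX + lam * np.eye(X.shape[1]), X.T @ y)

rng = np.random.default_rng(2)
X = rng.normal(size=(40, 3))                 # well-conditioned design
y = X @ np.array([1.0, -1.0, 0.5]) + 0.01 * rng.normal(size=40)
w = ridge_with_check(X, y)                   # here lam is set to 0
print(w)
```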
![Page 219: 04 Machine Learning - Supervised Linear Classifier](https://reader031.vdocuments.site/reader031/viewer/2022021920/586e84db1a28aba0038b61b1/html5/thumbnails/219.jpg)
Thus, what we need to do?
Process1 Find the eigenvalues of XT X
2 If all of them are bigger than zero we are fine!!!3 Find the smallest one, then tune if necessary.4 Build wRidge =
(XT X + λI
)−1XT y.
79 / 85
![Page 220: 04 Machine Learning - Supervised Linear Classifier](https://reader031.vdocuments.site/reader031/viewer/2022021920/586e84db1a28aba0038b61b1/html5/thumbnails/220.jpg)
What about Thousands of Features?

There is a technique for that
The Least Absolute Shrinkage and Selection Operator (LASSO), invented by Robert Tibshirani, uses the penalty L1 = Σ_{i=1}^{d} |w_i|.

The least squared error then takes the form of

Σ_{i=1}^{N} (y_i − x_iᵀ w)² + λ Σ_{j=1}^{d} |w_j|   (40)

However
You have other regularizations, such as L2 = √(Σ_{i=1}^{d} |w_i|²).
80 / 85
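A minimal coordinate-descent sketch of the LASSO objective in Eq. (40) (hypothetical code, assuming NumPy; the penalty weight lam and the fixed iteration count are illustrative, not the tuned values a real solver would use):

```python
import numpy as np

def soft_threshold(z, t):
    # Shrinks z toward zero by t; this is what the |w_j| penalty induces.
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_cd(X, y, lam, n_iter=200):
    n, d = X.shape
    w = np.zeros(d)
    col_sq = (X ** 2).sum(axis=0)
    for _ in range(n_iter):
        for j in range(d):
            # Residual with feature j's current contribution removed.
            r = y - X @ w + X[:, j] * w[j]
            w[j] = soft_threshold(X[:, j] @ r, lam) / col_sq[j]
    return w

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 5))
y = X @ np.array([3.0, -2.0, 0.0, 0.0, 0.0]) + 0.1 * rng.normal(size=100)
w = lasso_cd(X, y, lam=5.0)
print(w)   # the three irrelevant coefficients are driven to (near) zero
```

This per-coordinate soft-thresholding update is the basic idea behind the pathwise coordinate optimization method mentioned at the end of these slides.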
![Page 223: 04 Machine Learning - Supervised Linear Classifier](https://reader031.vdocuments.site/reader031/viewer/2022021920/586e84db1a28aba0038b61b1/html5/thumbnails/223.jpg)
Graphically
The first area corresponds to the L1 regularization; and the second one?
81 / 85
![Page 224: 04 Machine Learning - Supervised Linear Classifier](https://reader031.vdocuments.site/reader031/viewer/2022021920/586e84db1a28aba0038b61b1/html5/thumbnails/224.jpg)
Graphically
Yes, the circle defined as L2 = √(Σ_{i=1}^{d} |w_i|²)
82 / 85
![Page 225: 04 Machine Learning - Supervised Linear Classifier](https://reader031.vdocuments.site/reader031/viewer/2022021920/586e84db1a28aba0038b61b1/html5/thumbnails/225.jpg)
The seminal paper by Robert Tibshirani
An initial study of this regularization can be seen in
"Regression Shrinkage and Selection via the LASSO" by Robert Tibshirani, 1996.
83 / 85
![Page 226: 04 Machine Learning - Supervised Linear Classifier](https://reader031.vdocuments.site/reader031/viewer/2022021920/586e84db1a28aba0038b61b1/html5/thumbnails/226.jpg)
This is outside the scope of this class

However, it is worth noticing that the most efficient method for solving LASSO problems is
"Pathwise Coordinate Optimization" by Jerome Friedman, Trevor Hastie, Holger Höfling and Robert Tibshirani.
NeverthelessIt will be a great seminar paper!!!
84 / 85
![Page 228: 04 Machine Learning - Supervised Linear Classifier](https://reader031.vdocuments.site/reader031/viewer/2022021920/586e84db1a28aba0038b61b1/html5/thumbnails/228.jpg)
Exercises

Duda and Hart
Chapter 5: 1, 3, 4, 7, 13, 17

Bishop
Chapter 4: 4.1, 4.4, 4.7

Theodoridis
Chapter 3 - Problems: 3.6, using Python
Chapter 3 - Computer Experiments: 3.1 using Python; 3.2 using Python and Newton's method
85 / 85