TRANSCRIPT
EE-559 – Deep learning
3a. Linear classifiers, perceptron
Francois Fleuret
https://fleuret.org/dlc/
[version of: May 17, 2018]
École Polytechnique Fédérale de Lausanne
A bit of history, the perceptron
The first mathematical model for a neuron was the Threshold Logic Unit, with Boolean inputs and outputs:

f(x) = 1{w ∑_i x_i + b ≥ 0}.

It can in particular implement

or(u, v) = 1{u + v − 0.5 ≥ 0} (w = 1, b = −0.5)
and(u, v) = 1{u + v − 1.5 ≥ 0} (w = 1, b = −1.5)
not(u) = 1{−u + 0.5 ≥ 0} (w = −1, b = 0.5)

Hence, any Boolean function can be built with such units.
(McCulloch and Pitts, 1943)
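A minimal sketch (not from the slides) of such a unit and of the three gates above, in plain Python:

def tlu(xs, w, b):
    # f(x) = 1{w sum_i x_i + b >= 0}
    return 1 if w * sum(xs) + b >= 0 else 0

def OR(u, v): return tlu((u, v), w=1, b=-0.5)
def AND(u, v): return tlu((u, v), w=1, b=-1.5)
def NOT(u): return tlu((u,), w=-1, b=0.5)

assert [OR(u, v) for u in (0, 1) for v in (0, 1)] == [0, 1, 1, 1]
assert [AND(u, v) for u in (0, 1) for v in (0, 1)] == [0, 0, 0, 1]
assert [NOT(u) for u in (0, 1)] == [1, 0]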
The perceptron is very similar:

f(x) = 1 if ∑_i w_i x_i + b ≥ 0, and 0 otherwise,

but the inputs are real values and the weights can be different.

This model was originally motivated by biology, with w_i being the synaptic weights, and x_i and f firing rates.

It is a (very) crude biological model.
(Rosenblatt, 1957)
To make things simpler we take responses ±1. Let

σ(x) = 1 if x ≥ 0, and −1 otherwise.

The perceptron classification rule boils down to

f(x) = σ(w · x + b).

For neural networks, the function σ that follows a linear operator is called the activation function.
We can represent this “neuron” as follows:

[Diagram: the input values x1, x2, x3 are multiplied (×) by the parameters w1, w2, w3, summed (Σ) together with the bias b, and passed through σ to produce y.]
We can also use tensor operations, as in

f(x) = σ(w · x + b).

[Diagram: x goes through “·” with parameter w, then “+” with parameter b, then σ, producing y.]
Given a training set

(x_n, y_n) ∈ R^D × {−1, 1}, n = 1, …, N,

a very simple scheme to train such a linear operator for classification is the perceptron algorithm:

1. Start with w⁰ = 0,
2. while ∃ n_k s.t. y_{n_k} (w^k · x_{n_k}) ≤ 0, update w^{k+1} = w^k + y_{n_k} x_{n_k}.

The bias b can be introduced as one of the ws by adding a constant component to x equal to 1.
(Rosenblatt, 1957)
import torch

def train_perceptron(x, y, nb_epochs_max):
    # The bias is handled by a constant component of x equal to 1
    w = torch.zeros(x.size(1))
    for e in range(nb_epochs_max):
        nb_changes = 0
        for i in range(x.size(0)):
            # Sample i is misclassified when y_i (w . x_i) <= 0
            if x[i].dot(w) * y[i] <= 0:
                w = w + y[i] * x[i]
                nb_changes = nb_changes + 1
        # Stop as soon as a full pass makes no update
        if nb_changes == 0: break
    return w
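As a sanity check, one might run it on synthetic linearly separable data (a hypothetical usage sketch, not part of the course material):

torch.manual_seed(0)
x = torch.randn(100, 2)
y = (x[:, 0] + x[:, 1] > 0).float() * 2 - 1    # labels in {-1, 1}
x = torch.cat((x, torch.ones(100, 1)), 1)      # constant component for the bias
w = train_perceptron(x, y, 100)
print((torch.sign(x @ w) == y).float().mean()) # training accuracy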
This crude algorithm often works surprisingly well, for instance with MNIST’s “0”s as the negative class, and its “1”s as the positive one:
epoch 0 nb_changes 64 train_error 0.23% test_error 0.19%
epoch 1 nb_changes 24 train_error 0.07% test_error 0.00%
epoch 2 nb_changes 10 train_error 0.06% test_error 0.05%
epoch 3 nb_changes 6 train_error 0.03% test_error 0.14%
epoch 4 nb_changes 5 train_error 0.03% test_error 0.09%
epoch 5 nb_changes 4 train_error 0.02% test_error 0.14%
epoch 6 nb_changes 3 train_error 0.01% test_error 0.14%
epoch 7 nb_changes 2 train_error 0.00% test_error 0.14%
epoch 8 nb_changes 0 train_error 0.00% test_error 0.14%
We can get a convergence result under two assumptions:

[Figure: the two populations inside a sphere of radius R, separated with margin γ by the hyperplane normal to w∗.]

1. The x_n are in a sphere of radius R: ∃R > 0, ∀n, ‖x_n‖ ≤ R.
2. The two populations can be separated with a margin γ > 0: ∃w∗, ‖w∗‖ = 1, ∃γ > 0, ∀n, y_n (x_n · w∗) ≥ γ/2.
To prove the convergence, let us assume that there is still a misclassified sample at iteration k, and that w^{k+1} is the weight vector updated with it. We have

w^{k+1} · w∗ = (w^k + y_{n_k} x_{n_k}) · w∗
             = w^k · w∗ + y_{n_k} (x_{n_k} · w∗)
             ≥ w^k · w∗ + γ/2
             ≥ (k + 1) γ/2.

Since

‖w^k‖ ‖w∗‖ ≥ w^k · w∗,

we get

‖w^k‖² ≥ (w^k · w∗)² / ‖w∗‖² ≥ k² γ² / 4.
And

‖w^{k+1}‖² = w^{k+1} · w^{k+1}
           = (w^k + y_{n_k} x_{n_k}) · (w^k + y_{n_k} x_{n_k})
           = w^k · w^k + 2 y_{n_k} w^k · x_{n_k} + ‖x_{n_k}‖²
           ≤ ‖w^k‖² + R²   (since y_{n_k} w^k · x_{n_k} ≤ 0 and ‖x_{n_k}‖² ≤ R²)
           ≤ (k + 1) R².
Putting these two results together, we get

k² γ² / 4 ≤ ‖w^k‖² ≤ k R²,

hence k ≤ 4R² / γ², and no misclassified sample can remain after ⌊4R² / γ²⌋ iterations.

This result makes sense:

• The bound does not change if the population is scaled, and
• the larger the margin, the more quickly the algorithm classifies all the samples correctly.
The perceptron stops as soon as it finds a separating boundary.

Other algorithms maximize the distance of samples to the decision boundary, which improves robustness to noise.

Support Vector Machines (SVM) achieve this by minimizing

L(w, b) = λ ‖w‖² + (1/N) ∑_n max(0, 1 − y_n (w · x_n + b)),

which is convex and has a global optimum.
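A minimal sketch (an illustrative example, not the course’s implementation) of minimizing this loss by gradient descent with PyTorch’s autograd:

import torch

def train_svm(x, y, lam=1e-2, lr=1e-1, nb_steps=1000):
    w = torch.zeros(x.size(1), requires_grad=True)
    b = torch.zeros(1, requires_grad=True)
    for k in range(nb_steps):
        # L(w, b) = lambda ||w||^2 + 1/N sum_n max(0, 1 - y_n (w . x_n + b))
        loss = lam * w.pow(2).sum() + torch.clamp(1 - y * (x @ w + b), min=0).mean()
        loss.backward()
        with torch.no_grad():
            w -= lr * w.grad
            b -= lr * b.grad
        w.grad.zero_()
        b.grad.zero_()
    return w.detach(), b.detach()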
L(w, b) = λ ‖w‖² + (1/N) ∑_n max(0, 1 − y_n (w · x_n + b))

[Figure: the decision boundary, the two planes w · x + b = ±1 at distance 2/‖w‖ from each other, and the support vectors lying on them.]

Minimizing max(0, 1 − y_n (w · x_n + b)) pushes the n-th sample beyond the plane w · x + b = y_n, and minimizing ‖w‖² increases the distance between the planes w · x + b = ±1.

At convergence, only a small number of samples matter, the “support vectors”.
The term max(0, 1 − α) is the so-called “hinge loss”.
Probabilistic interpretation of linear classifiers
The Linear Discriminant Analysis (LDA) algorithm provides a nice bridge between these linear classifiers and probabilistic modeling.

Consider the following class populations: ∀y ∈ {0, 1}, x ∈ R^D,

μ_{X|Y=y}(x) = 1/√((2π)^D |Σ|) exp(−(1/2) (x − m_y) Σ⁻¹ (x − m_y)ᵀ).

That is, they are Gaussian with the same covariance matrix Σ. This is the homoscedasticity assumption.
We have

P(Y = 1 | X = x) = μ_{X|Y=1}(x) P(Y = 1) / μ_X(x)
                 = μ_{X|Y=1}(x) P(Y = 1) / (μ_{X|Y=0}(x) P(Y = 0) + μ_{X|Y=1}(x) P(Y = 1))
                 = 1 / (1 + (μ_{X|Y=0}(x) / μ_{X|Y=1}(x)) (P(Y = 0) / P(Y = 1))).

It follows that, with

σ(x) = 1 / (1 + e⁻ˣ),

we get

P(Y = 1 | X = x) = σ(log(μ_{X|Y=1}(x) / μ_{X|Y=0}(x)) + log(P(Y = 1) / P(Y = 0))).
So with our Gaussians μ_{X|Y=y} of same Σ, we get, writing a = log(P(Y = 1) / P(Y = 0)),

P(Y = 1 | X = x)
 = σ(log(μ_{X|Y=1}(x) / μ_{X|Y=0}(x)) + a)
 = σ(log μ_{X|Y=1}(x) − log μ_{X|Y=0}(x) + a)
 = σ(−(1/2)(x − m₁) Σ⁻¹ (x − m₁)ᵀ + (1/2)(x − m₀) Σ⁻¹ (x − m₀)ᵀ + a)
 = σ(−(1/2) x Σ⁻¹ xᵀ + m₁ Σ⁻¹ xᵀ − (1/2) m₁ Σ⁻¹ m₁ᵀ + (1/2) x Σ⁻¹ xᵀ − m₀ Σ⁻¹ xᵀ + (1/2) m₀ Σ⁻¹ m₀ᵀ + a)
 = σ((m₁ − m₀) Σ⁻¹ xᵀ + (1/2)(m₀ Σ⁻¹ m₀ᵀ − m₁ Σ⁻¹ m₁ᵀ) + a)
 = σ(w · x + b),

with w = (m₁ − m₀) Σ⁻¹ and b = (1/2)(m₀ Σ⁻¹ m₀ᵀ − m₁ Σ⁻¹ m₁ᵀ) + a.

The homoscedasticity makes the second-order terms vanish.
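Since w and b are given in closed form by the class means and the shared covariance, one can compute them directly from data. A sketch under these assumptions (labels in {0, 1}, covariance estimated by pooling; an illustration, not the course’s code):

import torch

def lda_parameters(x, y, p1=0.5):
    m0, m1 = x[y == 0].mean(0), x[y == 1].mean(0)
    # Pooled covariance estimate: the homoscedasticity assumption
    xc = torch.cat((x[y == 0] - m0, x[y == 1] - m1))
    sigma_inv = torch.inverse(xc.t() @ xc / (x.size(0) - 2))
    w = sigma_inv @ (m1 - m0)                     # w = (m1 - m0) Sigma^-1
    b = 0.5 * (m0 @ sigma_inv @ m0 - m1 @ sigma_inv @ m1) \
        + torch.log(torch.tensor(p1 / (1 - p1)))  # a = log P(Y=1)/P(Y=0)
    return w, b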
[Figure: the two class densities μ_{X|Y=0} and μ_{X|Y=1}, and the resulting posterior P(Y = 1 | X = x).]
Note that the (logistic) sigmoid function

σ(x) = 1 / (1 + e⁻ˣ)

looks like a “soft Heaviside” [plot: σ rising from 0 to 1], so the overall model

f(x; w, b) = σ(w · x + b)

looks very similar to the perceptron.
We can use the model from LDA,

f(x; w, b) = σ(w · x + b),

but instead of modeling the densities and deriving the values of w and b from them, compute them directly by maximizing their probability given the training data.

First, to simplify the next slide, note that we have

1 − σ(x) = 1 − 1/(1 + e⁻ˣ) = σ(−x),

hence if Y takes values in {−1, 1}, then

∀y ∈ {−1, 1}, P(Y = y | X = x) = σ(y (w · x + b)).
We have

log μ_{W,B}(w, b | D = d)
 = log(μ_D(d | W = w, B = b) μ_{W,B}(w, b) / μ_D(d))
 = log μ_D(d | W = w, B = b) + log μ_{W,B}(w, b) − log Z
 = ∑_n log σ(y_n (w · x_n + b)) + log μ_{W,B}(w, b) − log Z′.

This is logistic regression, whose loss amounts to minimizing, over the samples,

− log σ(y_n f(x_n)).
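Concretely, this means maximizing ∑_n log σ(y_n (w · x_n + b)). A minimal sketch with autograd, assuming a flat prior so that only the data term remains (an illustration, not the course’s code):

import torch

def train_logistic(x, y, lr=1e-1, nb_steps=1000):
    w = torch.zeros(x.size(1), requires_grad=True)
    b = torch.zeros(1, requires_grad=True)
    for k in range(nb_steps):
        # -log sigma(s) is softplus(-s), which is numerically stable
        loss = torch.nn.functional.softplus(-y * (x @ w + b)).mean()
        loss.backward()
        with torch.no_grad():
            w -= lr * w.grad
            b -= lr * b.grad
        w.grad.zero_()
        b.grad.zero_()
    return w.detach(), b.detach()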
Although the probabilistic and Bayesian formulations may be helpful in certain contexts, the bulk of deep learning is disconnected from such modeling.

We will come back to probabilistic interpretations at times, but most of the methods will be envisioned from the signal-processing and optimization angles.
Multi-dimensional output
We can combine multiple linear predictors into a “layer” that takes several inputs and produces several outputs:

∀i = 1, …, M, y_i = σ(∑_{j=1}^{N} w_{i,j} x_j + b_i),

where b_i is the “bias” of the i-th unit, and w_{i,1}, …, w_{i,N} are its weights.
With M = 2 and N = 3, we can picture such a layer as

[Diagram: the inputs x1, x2, x3 are each multiplied (×) by the weights w_{1,1}, …, w_{2,3}; two sums (Σ) with the biases b1 and b2 feed two activations σ, producing y1 and y2.]
If we forget the historical interpretation as “neurons”, we can use a clearer algebraic / tensorial formulation:

y = σ(w x + b),

where x ∈ R^N, w ∈ R^{M×N}, b ∈ R^M, y ∈ R^M, and σ denotes the component-wise extension of the R → R mapping:

σ : (y₁, …, y_M) ↦ (σ(y₁), …, σ(y_M)).
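In PyTorch, such a layer is a couple of tensor operations (a sketch, with the logistic sigmoid as an example choice for σ):

import torch

N, M = 3, 2
w, b = torch.randn(M, N), torch.randn(M)  # the parameters of the layer
x = torch.randn(N)
y = torch.sigmoid(w @ x + b)              # component-wise sigma of w x + b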
With “×” for the matrix-vector product, the “tensorial block figure” remains almost identical to that of the single neuron.

[Diagram: x ∈ R^N goes through “×” with parameter w ∈ R^{M×N}, then “+” with parameter b ∈ R^M, then σ, producing y ∈ R^M.]
Limitations of linear predictors, feature design
The main weakness of linear predictors is their lack of capacity. For classification, the populations have to be linearly separable.

[Figure: the “xor” configuration, which no linear boundary can separate.]
The xor example can be solved by pre-processing the data to make the two populations linearly separable:

Φ : (x_u, x_v) ↦ (x_u, x_v, x_u x_v).

So we can model the xor with

f(x) = σ(w · Φ(x) + b).
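A quick check on the four xor points; the weights below are one hand-picked separating solution (an illustrative assumption, not given in the slides):

import torch

x = torch.tensor([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
phi = torch.cat((x, (x[:, 0] * x[:, 1]).unsqueeze(1)), 1)  # Phi(x) = (xu, xv, xu xv)
w, b = torch.tensor([1., 1., -2.]), -0.5
print((phi @ w + b >= 0).float())  # 0., 1., 1., 0., which is xor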
[Diagram: x goes through Φ, then “×” with parameter w, then “+” with parameter b, then σ, producing y.]
This is similar to polynomial regression. If we have

Φ : x ↦ (1, x, x², …, x^D)

and

α = (α₀, …, α_D),

then

∑_{d=0}^{D} α_d x^d = α · Φ(x).

By increasing D, we can approximate any continuous real function on a compact space (Stone-Weierstrass theorem).

It means that we can make the capacity as high as we want.
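For instance, fitting α by least squares on the expanded features (a sketch; the target function is an arbitrary example):

import torch

D = 5
x = torch.linspace(-1, 1, 100)
phi = torch.stack([x ** d for d in range(D + 1)], 1)  # Phi(x) = (1, x, ..., x^D)
target = torch.sin(3 * x)
alpha = torch.linalg.lstsq(phi, target.unsqueeze(1)).solution.squeeze()
approx = phi @ alpha  # alpha . Phi(x); the fit improves as D increases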
We can apply the same to a more realistic binary classification problem: MNIST’s “8” vs. the other classes, with a perceptron.

The original 28 × 28 features are supplemented with the products of pairs of features taken at random.

[Plot: train error and test error (%) as a function of the number of features, from 10³ to 10⁴.]
Remember the bias-variance tradeoff:

E((Y − y)²) = (E(Y) − y)² + V(Y),

where the first term is the bias and the second the variance. The right class of models reduces the bias more and increases the variance less.

Besides increasing capacity to reduce the bias, “feature design” may also be a way of reducing capacity without hurting the bias, or even while improving it.

In particular, good features should be invariant to perturbations of the signal known to keep the value to predict unchanged.
[Figures: a k-nearest-neighbors illustration (K = 11): training points, votes, and prediction, first with the raw coordinates, then with an added radial feature.]
A classical example is the “Histogram of Oriented Gradients” (HOG) descriptor, initially designed for person detection.

Roughly: divide the image into 8 × 8 blocks, and compute in each the distribution of edge orientations over 9 bins.

Dalal and Triggs (2005) combined them with an SVM, and Dollar et al. (2009) extended them with other modalities into the “channel features”.
Many methods (perceptron, SVM, k-means, PCA, etc.) only require computing κ(x, x′) = Φ(x) · Φ(x′) for any pair (x, x′).

So one needs to specify κ alone, and may keep Φ undefined.

This is the kernel trick, which we will not talk about in this course.
Training a model composed of manually engineered features and a parametric model such as logistic regression is now referred to as “shallow learning”.

The signal goes through a single processing step trained from data.
The end
References
N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In Conference on Computer Vision and Pattern Recognition (CVPR), pages 886–893, 2005.
P. Dollar, Z. Tu, P. Perona, and S. Belongie. Integral channel features. In British Machine Vision Conference (BMVC), pages 91.1–91.11, 2009.
W. S. McCulloch and W. Pitts. A logical calculus of the ideas immanent in nervous activity. The Bulletin of Mathematical Biophysics, 5(4):115–133, 1943.
F. Rosenblatt. The perceptron: A perceiving and recognizing automaton. Technical Report 85-460-1, Cornell Aeronautical Laboratory, 1957.