Perceptrons and Optimal Hyperplanes
Example: Majority-Vote Function
• Definition: Majority-Vote Function f_majority
– N binary attributes, i.e. x ∈ {0,1}^N
– If more than N/2 attributes in x are true, then f_majority(x) = +1, else f_majority(x) = −1.
• How can we represent this function as a decision tree?
– Huge and awkward tree!
• Is there an “easier” representation of fmajority?
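One "easier" representation is a linear threshold: weight every attribute equally and compare the sum against N/2. A minimal Python sketch (the function name is ours, not from the slides):

```python
def f_majority(x):
    # Majority vote as a linear threshold function:
    # equivalent to sign(w.x + b) with w = (1, ..., 1) and b = -N/2.
    n = len(x)
    return 1 if sum(x) > n / 2 else -1

print(f_majority([1, 1, 0]))     # 2 of 3 attributes true -> 1
print(f_majority([1, 0, 0, 0]))  # 1 of 4 attributes true -> -1
```

A single weight vector replaces the huge and awkward decision tree.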
Example: Spam Filtering
• Instance Space X:
– Feature vector of word occurrences ⇒ binary features
– N features (N typically > 50,000)
• Target Concept c:
– Spam (+1) / Ham (−1)
• Type of function to learn:
– Set of Spam words S, set of Ham words H
– Classify as Spam (+1) if the example contains more Spam words than Ham words.
[Table: example feature vectors over the vocabulary (viagra, learning, the, dating, lottery) with a "spam?" label column]
Example: Spam Filtering
• Use weight vector w = (+1, −1, 0, +1, +1)
– Compute sign(w·x)
• More generally, we can use real-valued weights to express the "spamminess" of each word, e.g. w = (+10, −1, −0.3, +1, +5)
• Which vector is most likely to be spam under this weighting? A = x1, B = x2, C = x3
[Table: feature vectors x1, x2, x3 over the vocabulary (viagra, learning, the, dating, lottery) with a "spam?" label column]
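The weighted spam score can be sketched as follows; the mail vector below is a made-up illustration over the slide's five-word vocabulary, not the quiz's actual x1, x2, x3:

```python
# Vocabulary order: (viagra, learning, the, dating, lottery)
w = (10, -1, -0.3, 1, 5)

def spam_score(x):
    # Linear score w.x; classify as spam iff the score is positive.
    return sum(wi * xi for wi, xi in zip(w, x))

x = (1, 0, 1, 0, 1)   # hypothetical mail containing "viagra", "the", "lottery"
print(spam_score(x))  # 10 - 0.3 + 5 = 14.7 -> spam
```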
Linear Classification Rules
• Hypotheses of the form
– unbiased: h(x) = sign(w·x)
– biased: h(x) = sign(w·x + b)
– Parameter vector w, scalar b
• Hypothesis space H
– unbiased: H = {h : h(x) = sign(w·x), w ∈ ℝ^N}
– biased: H = {h : h(x) = sign(w·x + b), w ∈ ℝ^N, b ∈ ℝ}
• Notation
– w·x = Σᵢ wᵢxᵢ denotes the inner product
– sign(z) = +1 if z > 0, −1 otherwise
– ||w|| = √(w·w) denotes the Euclidean norm
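The two hypothesis forms translate directly into code; this sketch adopts the convention sign(0) = −1, which is our assumption rather than something the slides fix:

```python
def sign(z):
    # Convention here: sign(0) = -1 (an assumption, not fixed by the slides).
    return 1 if z > 0 else -1

def h_unbiased(w, x):
    # Unbiased linear classification rule: h(x) = sign(w.x)
    return sign(sum(wi * xi for wi, xi in zip(w, x)))

def h_biased(w, b, x):
    # Biased linear classification rule: h(x) = sign(w.x + b)
    return sign(sum(wi * xi for wi, xi in zip(w, x)) + b)

print(h_unbiased((1, 2), (1, 1)))    # sign(3)  -> 1
print(h_biased((1, 2), -4, (1, 1)))  # sign(-1) -> -1
```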
Geometry of Hyperplane Classifiers
• Linear classifiers divide the instance space with a hyperplane
• One side is classified positive, the other negative
Homogeneous Coordinates
Biased representation: x = (x1, x2), w = (w1, w2), offset b ⇒ h(x) = sign(w·x + b)
Homogeneous representation: x′ = (x1, x2, 1), w′ = (w1, w2, b) ⇒ h(x) = sign(w′·x′)
Appending a constant 1 to each input absorbs the bias b into the weight vector.
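The equivalence of the two representations is easy to check numerically (the particular w, b, x values are arbitrary):

```python
def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

w, b = (2.0, -1.0), 0.5   # arbitrary example values
x = (3.0, 4.0)

# Homogeneous trick: append the bias to w and a constant 1 to x.
w_aug = w + (b,)
x_aug = x + (1.0,)

print(dot(w, x) + b)      # biased form:      2.5
print(dot(w_aug, x_aug))  # homogeneous form: 2.5
```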
(Batch) Perceptron Algorithm
• Input: training examples (x1, y1), …, (xn, yn); learning rate η
• Initialize: w = 0
• Training epoch: for each (xi, yi)
– If yi(w·xi) ≤ 0, update w ← w + η yi xi
• Repeat training epochs until an epoch makes no mistakes
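The epoch loop can be sketched as follows (the function name and the epoch cap are our choices, not the slides'):

```python
def perceptron(data, eta=1.0, max_epochs=100):
    # Batch perceptron: sweep the training set in epochs; on each
    # mistake (y * w.x <= 0) add eta * y * x to the weight vector.
    w = [0.0] * len(data[0][0])
    for _ in range(max_epochs):
        mistakes = 0
        for x, y in data:
            if y * sum(wi * xi for wi, xi in zip(w, x)) <= 0:
                w = [wi + eta * y * xi for wi, xi in zip(w, x)]
                mistakes += 1
        if mistakes == 0:  # an epoch with no mistakes: converged
            break
    return w

# The four training examples from the worked example that follows:
data = [((1, 2), 1), ((3, 1), 1), ((-1, -1), -1), ((-1, 1), -1)]
print(perceptron(data))  # [2.0, 1.0]
```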
Example: Perceptron
Training data (as used in the updates below): (x1, y1) = ((1, 2), +1), (x2, y2) = ((3, 1), +1), (x3, y3) = ((−1, −1), −1), (x4, y4) = ((−1, 1), −1)
Updates to the weight vector:
• Init: w0 = 0, η = 1
• w0·x1 = 0 ⇒ incorrect
w1 = w0 + η y1 x1 = 0 + 1·1·(1, 2) = (1, 2)
Check: w1·x1 = (w0 + η y1 x1)·x1 = w0·x1 + η y1 (x1·x1) = 0 + 5 = 5
• w1·x2 = (1, 2)·(3, 1) = 5 ⇒ correct
• w1·x3 = (1, 2)·(−1, −1) = −3 ⇒ correct
• w1·x4 = (1, 2)·(−1, 1) = 1 ⇒ incorrect
w2 = w1 + η y4 x4 = (1, 2) − (−1, 1) = (2, 1)
Check: w2·x4 = (w1 + η y4 x4)·x4 = w1·x4 + η y4 (x4·x4) = 1 − 2 = −1
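The check lines rely on the identity (w + η y x)·x = w·x + η y (x·x). A quick numerical confirmation with the example's numbers:

```python
def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

eta = 1

# Update on (x1, y1) = ((1, 2), +1) starting from w0 = 0:
w0, x1, y1 = (0, 0), (1, 2), 1
w1 = tuple(wi + eta * y1 * xi for wi, xi in zip(w0, x1))
print(dot(w1, x1))                           # 5
print(dot(w0, x1) + eta * y1 * dot(x1, x1))  # 5, same by the identity

# Update on (x4, y4) = ((-1, 1), -1) starting from w1 = (1, 2):
x4, y4 = (-1, 1), -1
w2 = tuple(wi + eta * y4 * xi for wi, xi in zip(w1, x4))
print(dot(w2, x4))                           # -1
print(dot(w1, x4) + eta * y4 * dot(x4, x4))  # -1
```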
Example: Reuters Text Classification
[Figure: separable documents with the "optimal hyperplane"]
Optimal Hyperplanes
Assumption: The training examples are linearly separable.
Hard-Margin Separation
Goal: Find the hyperplane with the largest distance δ to the closest training examples.
Support Vectors: Examples with minimal distance (i.e. examples lying exactly at the margin).
Optimization Problem (Primal):
min(w,b) ½ w·w  s.t.  ∀i: yi(w·xi + b) ≥ 1
Why min ½ w·w?
• Maximizing δ while constraining w is equivalent to constraining δ and minimizing w
– We want the maximum margin δ
• We do not care about the size of w itself
• But because the functional margin y(w·x + b) scales linearly with w, simply requiring a large margin would yield arbitrarily large w
– So we ask for maximum δ but constrain w
• Equivalently, we constrain the margin of the closest examples to 1 and minimize w
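The scaling argument can be written out explicitly; this derivation assumes the standard canonical-hyperplane normalization:

```latex
d_i = \frac{y_i \,(w \cdot x_i + b)}{\lVert w \rVert}
\qquad \text{(geometric distance of } x_i \text{ from the hyperplane)}

\text{Fix the scale of } (w, b) \text{ by } \min_i \, y_i (w \cdot x_i + b) = 1
\quad\Longrightarrow\quad
\delta = \frac{1}{\lVert w \rVert}

\text{so} \quad \max_{w,b} \, \delta
\;\equiv\;
\min_{w,b} \, \tfrac{1}{2}\, w \cdot w
```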
Non-Separable Training Data
Limitations of hard-margin formulation
– For some training data, there is no separating hyperplane.
– Complete separation (i.e. zero training error) can lead to suboptimal prediction error.
Slack
Soft-Margin Separation
Idea: Maximize the margin and minimize the training error at the same time.
Soft-Margin OP (Primal): min(w,b,ξ) ½ w·w + C Σᵢ ξᵢ  s.t.  ∀i: yi(w·xi + b) ≥ 1 − ξᵢ, ξᵢ ≥ 0
Hard-Margin OP (Primal): min(w,b) ½ w·w  s.t.  ∀i: yi(w·xi + b) ≥ 1
• Slack variable ξᵢ measures by how much (xᵢ, yᵢ) fails to achieve the margin
• Σξᵢ is an upper bound on the number of training errors
• C is a parameter that controls the trade-off between margin size and training error.
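One way to see the objective in action is its equivalent unconstrained hinge-loss form, where ξᵢ = max(0, 1 − yᵢ(w·xᵢ + b)) at the optimum. The subgradient-descent sketch below is our illustration of that objective, not the QP solver an SVM package would use:

```python
def soft_margin_sgd(data, C=1.0, lr=0.01, epochs=2000):
    # Minimize 1/2 w.w + C * sum_i max(0, 1 - y_i (w.x_i + b))
    # by plain (sub)gradient descent over the whole training set.
    w = [0.0] * len(data[0][0])
    b = 0.0
    for _ in range(epochs):
        gw = list(w)   # gradient of the regularizer 1/2 w.w
        gb = 0.0
        for x, y in data:
            if y * (sum(wi * xi for wi, xi in zip(w, x)) + b) < 1:
                # Example violates the margin: its slack term is active.
                gw = [gi - C * y * xi for gi, xi in zip(gw, x)]
                gb -= C * y
        w = [wi - lr * gi for wi, gi in zip(w, gw)]
        b -= lr * gb
    return w, b

# Tiny separable toy set: a large C pushes the training error to zero.
w, b = soft_margin_sgd([((2, 2), 1), ((-2, -2), -1)], C=10.0)
print(w, b)
```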
Soft-Margin OP (Primal): min(w,b,ξ) ½ w·w + C Σᵢ ξᵢ  s.t.  ∀i: yi(w·xi + b) ≥ 1 − ξᵢ, ξᵢ ≥ 0
[Figure: two soft-margin classifiers, A and B]
Which of these two classifiers was produced using a larger value of C?
Controlling Soft-Margin Separation
• Σξᵢ is an upper bound on the number of training errors
• C is a parameter that controls the trade-off between margin size and training error.
Soft-Margin OP (Primal): min(w,b,ξ) ½ w·w + C Σᵢ ξᵢ  s.t.  ∀i: yi(w·xi + b) ≥ 1 − ξᵢ, ξᵢ ≥ 0
Example Reuters “acq”: Varying C
Example: Margin in High Dimensions

|           | x1 | x2 | x3 | x4 | x5 | x6 | x7 | y  |
|-----------|----|----|----|----|----|----|----|----|
| Example 1 | 1  | 0  | 0  | 1  | 0  | 0  | 0  | +1 |
| Example 2 | 1  | 0  | 0  | 0  | 1  | 0  | 0  | +1 |
| Example 3 | 0  | 1  | 0  | 0  | 0  | 1  | 0  | −1 |
| Example 4 | 0  | 1  | 0  | 0  | 0  | 0  | 1  | −1 |

|              | w1   | w2    | w3 | w4   | w5   | w6    | w7    | b |
|--------------|------|-------|----|------|------|-------|-------|---|
| Hyperplane 1 | 1    | 1     | 0  | 0    | 0    | 0     | 0     | 2 |
| Hyperplane 2 | 0    | 0     | 0  | 1    | 1    | −1    | −1    | 0 |
| Hyperplane 3 | 1    | −1    | 1  | 0    | 0    | 0     | 0     | 0 |
| Hyperplane 4 | 1    | −1    | 0  | 0    | 0    | 0     | 0     | 0 |
| Hyperplane 5 | 0.95 | −0.95 | 0  | 0.05 | 0.05 | −0.05 | −0.05 | 0 |
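The table can be checked mechanically by computing each hyperplane's smallest geometric margin over the four examples (a negative value means some example is misclassified). With the numbers as given, Hyperplane 5 attains the largest minimum margin:

```python
import math

examples = [  # (x, y) rows from the table above
    ((1, 0, 0, 1, 0, 0, 0),  1),
    ((1, 0, 0, 0, 1, 0, 0),  1),
    ((0, 1, 0, 0, 0, 1, 0), -1),
    ((0, 1, 0, 0, 0, 0, 1), -1),
]
hyperplanes = {  # name: (w, b) rows from the table above
    "Hyperplane 1": ((1, 1, 0, 0, 0, 0, 0), 2),
    "Hyperplane 2": ((0, 0, 0, 1, 1, -1, -1), 0),
    "Hyperplane 3": ((1, -1, 1, 0, 0, 0, 0), 0),
    "Hyperplane 4": ((1, -1, 0, 0, 0, 0, 0), 0),
    "Hyperplane 5": ((0.95, -0.95, 0, 0.05, 0.05, -0.05, -0.05), 0),
}

def min_margin(w, b):
    # Smallest geometric margin y * (w.x + b) / ||w|| over all examples.
    norm = math.sqrt(sum(wi * wi for wi in w))
    return min(y * (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm
               for x, y in examples)

for name, (w, b) in hyperplanes.items():
    print(name, round(min_margin(w, b), 3))
```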