announcements homework 4 is due on this thursday (02/27/2004) project proposal is due on 03/02
Post on 20-Dec-2015
TRANSCRIPT
Announcements Homework 4 is due on this Thursday (02/27/2004) Project proposal is due on 03/02
Unconstrained Optimization
Rong Jin
Logistic Regression

Regularized log-likelihood:

$$l_{reg}(D_{train}) = l(D_{train}) - s\sum_{i=1}^{m} w_i^2 = \sum_{i=1}^{N} \log\frac{1}{1+\exp\left(-y_i\,(b + \vec{x}_i\cdot\vec{w})\right)} - s\sum_{i=1}^{m} w_i^2$$
The optimization problem is to find weights w and threshold b that maximize the above log-likelihood.
How can we do it efficiently?
Gradient Ascent

Compute the gradient, then increase weights w and threshold b in the gradient direction:

$$\vec{w} \leftarrow \vec{w} + \eta\,\frac{\partial}{\partial \vec{w}}\left[\sum_{i=1}^{N}\log p(y_i|\vec{x}_i) - s\sum_{i=1}^{m} w_i^2\right], \qquad b \leftarrow b + \eta\,\frac{\partial}{\partial b}\left[\sum_{i=1}^{N}\log p(y_i|\vec{x}_i) - s\sum_{i=1}^{m} w_i^2\right]$$

where η is the learning rate. The two gradients are:

$$\frac{\partial}{\partial \vec{w}}\left[\sum_{i=1}^{N}\log p(y_i|\vec{x}_i) - s\sum_{i=1}^{m} w_i^2\right] = -2s\vec{w} + \sum_{i=1}^{N}\vec{x}_i\,y_i\left(1 - p(y_i|\vec{x}_i)\right)$$

$$\frac{\partial}{\partial b}\left[\sum_{i=1}^{N}\log p(y_i|\vec{x}_i) - s\sum_{i=1}^{m} w_i^2\right] = \sum_{i=1}^{N} y_i\left(1 - p(y_i|\vec{x}_i)\right)$$
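The gradient-ascent update can be sketched in a few lines. This is a minimal illustration, not the course code: labels are assumed in {-1, +1}, the learning rate is fixed, and the names `gradient_ascent`, `eta`, `s` and the toy data are mine.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_ascent(X, y, s=0.01, eta=0.1, iters=500):
    """Maximize the regularized log-likelihood of logistic regression.
    X: (N, m) data matrix; y: labels in {-1, +1};
    s: regularization strength; eta: learning rate."""
    N, m = X.shape
    w = np.zeros(m)
    b = 0.0
    for _ in range(iters):
        p = sigmoid(y * (X @ w + b))            # p(y_i | x_i) under current model
        resid = y * (1.0 - p)                   # y_i (1 - p(y_i|x_i))
        w += eta * (X.T @ resid - 2.0 * s * w)  # gradient step on w
        b += eta * resid.sum()                  # gradient step on b
    return w, b

# Made-up 1-D toy data: negatives at x < 0, positives at x > 0
X = np.array([[-2.0], [-1.0], [1.0], [2.0]])
y = np.array([-1.0, -1.0, 1.0, 1.0])
w, b = gradient_ascent(X, y)
```

After training, the learned weight separates the two classes with the correct signs.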
Problem with Gradient Ascent

Difficult to find the appropriate step size:
- Too small: slow convergence
- Too large: oscillation or "bubbling"

Convergence conditions: the Robbins-Monro conditions

$$\sum_{t=0}^{\infty}\eta_t = \infty, \qquad \sum_{t=0}^{\infty}\eta_t^2 < \infty$$

Along with a "regular" objective function, these conditions ensure convergence.
Newton Method

Utilizing the second-order derivative, expand the objective function to second order around x0:

$$f(x) \approx f(x_0) + a\,(x - x_0) + \frac{b}{2}\,(x - x_0)^2, \qquad a = f'(x_0),\; b = f''(x_0)$$

The minimum point of this quadratic is $x = x_0 - a/b$, giving the Newton method for optimization:

$$x^{new} = x^{old} - \frac{f'(x^{old})}{f''(x^{old})}$$

Guaranteed to converge when the objective function is convex.
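A minimal 1-D sketch of the update above; the function name and the example objective f(x) = (x - 3)^2 + 1 are my own choices for illustration.

```python
def newton_1d(f_prime, f_double_prime, x0, iters=20):
    """1-D Newton's method: repeatedly jump to the minimum of the
    local quadratic approximation."""
    x = x0
    for _ in range(iters):
        x = x - f_prime(x) / f_double_prime(x)
    return x

# Minimize f(x) = (x - 3)^2 + 1, so f'(x) = 2(x - 3) and f''(x) = 2.
# Because f is exactly quadratic, a single Newton step already lands
# on the minimum x = 3.
x_min = newton_1d(lambda x: 2.0 * (x - 3.0), lambda x: 2.0, x0=10.0)
```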
Multivariate Newton Method

The objective function comprises multiple variables. Example: the logistic regression model

$$l_{reg}(D_{train}) = \sum_{i=1}^{N} \log\frac{1}{1+\exp\left(-y_i\,(b + \vec{x}_i\cdot\vec{w})\right)} - s\sum_{i=1}^{m} w_i^2$$

In text categorization, thousands of words mean thousands of variables.

Multivariate Newton Method
- Multivariate function: $f(\vec{x}) = f(x_1, x_2, \ldots, x_m)$
- First-order derivative is a vector:

$$\frac{\partial f}{\partial \vec{x}} = \left(\frac{\partial f}{\partial x_1}, \frac{\partial f}{\partial x_2}, \ldots, \frac{\partial f}{\partial x_m}\right)$$

- Second-order derivative is the Hessian matrix, an m×m matrix whose elements are defined as:

$$\mathbf{H}_{i,j} = \frac{\partial^2 f}{\partial x_i\,\partial x_j}$$
Multivariate Newton Method

Updating equation:

$$\vec{x}^{new} = \vec{x}^{old} - \mathbf{H}^{-1}\,\frac{\partial f(\vec{x})}{\partial \vec{x}}$$

Hessian matrix for the logistic regression model:

$$\mathbf{H} = \sum_{i=1}^{n} p(y_i|\vec{x}_i)\left(1 - p(y_i|\vec{x}_i)\right)\vec{x}_i\vec{x}_i^T + s\,\mathbf{I}_{m\times m}$$

This can be expensive to compute. Example: text categorization with 10,000 words gives a Hessian of size 10,000 × 10,000, i.e., 100 million entries. Even worse, we have to compute the inverse of the Hessian matrix H⁻¹.
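One Newton update for regularized logistic regression can be sketched as follows. This is a hedged illustration, not the lecture's code: the threshold b is folded into w via a constant feature, the regularizer is -s·||w||² (so the Hessian picks up 2sI), and the toy data is made up.

```python
import numpy as np

def newton_step(X, y, w, s=0.01):
    """One multivariate Newton update for regularized logistic regression.
    Assumes the threshold b is folded into w via a constant feature."""
    p = 1.0 / (1.0 + np.exp(-y * (X @ w)))        # p(y_i | x_i)
    grad = X.T @ (y * (1.0 - p)) - 2.0 * s * w    # gradient of the objective
    R = p * (1.0 - p)
    # Hessian of the negative objective: sum_i p_i (1 - p_i) x_i x_i^T + 2s I
    H = (X * R[:, None]).T @ X + 2.0 * s * np.eye(X.shape[1])
    return w + np.linalg.solve(H, grad)           # solve; never invert H explicitly

# Toy data: the constant first feature plays the role of the threshold b
X = np.array([[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([-1.0, -1.0, 1.0, 1.0])
w = np.zeros(2)
for _ in range(5):
    w = newton_step(X, y, w)
```

Using `np.linalg.solve` instead of forming H⁻¹ is the standard way to sidestep the explicit inverse mentioned above.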
Quasi-Newton Method

Approximate the Hessian matrix H with another matrix B:

$$\vec{x}^{new} = \vec{x}^{old} - \mathbf{B}^{-1}\,\frac{\partial f(\vec{x})}{\partial \vec{x}}$$

B is updated iteratively (BFGS), utilizing the derivatives of previous iterations:

$$\mathbf{B}_{k+1} = \mathbf{B}_k - \frac{\mathbf{B}_k p_k p_k^T \mathbf{B}_k}{p_k^T \mathbf{B}_k p_k} + \frac{y_k y_k^T}{y_k^T p_k}, \qquad p_k = x_{k+1} - x_k,\; y_k = g_{k+1} - g_k$$
Limited-Memory Quasi-Newton

Quasi-Newton avoids computing the inverse of the Hessian matrix, but it still requires computing the B matrix, which demands large storage.

Limited-Memory Quasi-Newton (L-BFGS) even avoids explicitly computing the B matrix:
- B can be expressed as a product of vectors
- Only keep the most recent (3~20) vector pairs $\{p_k, y_k\}$

$$\mathbf{B}_{k+1} = \mathbf{B}_k - \frac{\mathbf{B}_k p_k p_k^T \mathbf{B}_k}{p_k^T \mathbf{B}_k p_k} + \frac{y_k y_k^T}{y_k^T p_k}, \qquad p_k = x_{k+1} - x_k,\; y_k = g_{k+1} - g_k$$
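A quick way to sanity-check the BFGS update formula is the secant condition: after absorbing a pair (p_k, y_k), the new B reproduces the observed curvature along the step, i.e., B_{k+1} p_k = y_k. A small numpy sketch (the quadratic test matrix A and the random directions are made up for illustration):

```python
import numpy as np

def bfgs_update(B, p, y):
    """BFGS update: mix the old curvature estimate B with the new
    gradient-difference information (p_k, y_k)."""
    Bp = B @ p
    return B - np.outer(Bp, Bp) / (p @ Bp) + np.outer(y, y) / (y @ p)

rng = np.random.default_rng(0)
A = np.array([[3.0, 1.0], [1.0, 2.0]])  # true Hessian of a toy quadratic
B = np.eye(2)
for _ in range(3):
    p = rng.standard_normal(2)
    y = A @ p                           # for a quadratic, y_k = A p_k exactly
    B = bfgs_update(B, p, y)
```

After each update, `B @ p` equals `y`, and B stays symmetric.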
Linear Conjugate Gradient Method

Consider optimizing the quadratic function

$$\vec{x}^* = \arg\min_{\vec{x}}\; \frac{1}{2}\vec{x}^T\mathbf{A}\vec{x} - \vec{b}^T\vec{x}$$

Conjugate vectors: the set of vectors {p1, p2, …, pl} is said to be conjugate with respect to a matrix A if

$$p_i^T \mathbf{A}\, p_j = 0 \quad \text{for any } i \neq j$$

Important property: the quadratic function can be optimized by simply optimizing along each individual direction in the conjugate set. Optimal solution:

$$\vec{x}^* = \alpha_1 p_1 + \alpha_2 p_2 + \ldots + \alpha_l p_l$$

where αk is the minimizer along the kth conjugate direction.
Example

Minimize the following function:

$$f(x_1, x_2) = x_1^2 + x_2^2 - x_1 x_2 - x_1 - x_2$$

Matrix A:

$$\mathbf{A} = \begin{pmatrix} 1 & -0.5 \\ -0.5 & 1 \end{pmatrix}$$

Conjugate directions:

$$p_1 = \begin{pmatrix}1\\1\end{pmatrix}, \qquad p_2 = \begin{pmatrix}1\\-1\end{pmatrix}$$

Optimization:
- First direction, x1 = x2 = x: f(x1, x2) = x² − 2x, minimizer x = 1
- Second direction, x1 = −x2 = x: f(x1, x2) = 3x², minimizer x = 0
- Solution: x1 = x2 = 1
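The example can be verified numerically: check the conjugacy condition, then minimize along each direction once. The helper `line_min` (my name, not from the slides) is the closed-form line minimum for a quadratic of the form f(x) = xᵀAx − bᵀx.

```python
import numpy as np

# Quadratic from the example: f(x) = x^T A x - b^T x with b = (1, 1)
A = np.array([[1.0, -0.5], [-0.5, 1.0]])
b = np.array([1.0, 1.0])

p1 = np.array([1.0, 1.0])
p2 = np.array([1.0, -1.0])

conj = p1 @ A @ p2            # conjugacy: p1^T A p2 should be 0

def line_min(p, x0):
    """Minimize f(x0 + t p) over t; setting the derivative to zero
    gives t = (b - 2 A x0) . p / (2 p^T A p)."""
    return ((b - 2.0 * A @ x0) @ p) / (2.0 * p @ A @ p)

x = np.zeros(2)
x = x + line_min(p1, x) * p1  # first direction: t = 1
x = x + line_min(p2, x) * p2  # second direction: t = 0
```

Two line searches along the conjugate pair land exactly on the solution (1, 1).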
How to Efficiently Find a Set of Conjugate Directions

Iterative procedure: given conjugate directions {p1, p2, …, pk−1}, set pk as follows:

$$p_k = -r_k + \beta_k\, p_{k-1}, \qquad \beta_k = \frac{r_k^T \mathbf{A}\, p_{k-1}}{p_{k-1}^T \mathbf{A}\, p_{k-1}}, \qquad \text{where } r_k = \left.\frac{\partial f(\vec{x})}{\partial \vec{x}}\right|_{\vec{x} = \vec{x}_k}$$

Theorem: the direction generated in the above step is conjugate to all previous directions {p1, p2, …, pk−1}, i.e.,

$$p_k^T \mathbf{A}\, p_i = 0 \quad \text{for any } i \in \{1, 2, \ldots, k-1\}$$

Note: computing the kth direction pk only requires the previous direction pk−1.
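This property makes linear conjugate gradient very cheap to implement: each iteration keeps only the current residual and one previous direction. A compact sketch follows; note it uses the residual-norm ratio for βk (the Fletcher-Reeves form), which for a quadratic is algebraically equivalent to the rᵀAp/pᵀAp expression above. The function name and tolerance are my own.

```python
import numpy as np

def linear_cg(A, b, x0=None, tol=1e-10):
    """Linear conjugate gradient for min (1/2) x^T A x - b^T x,
    i.e., solving A x = b for symmetric positive definite A.
    Each new direction uses only the current residual and the
    single previous direction."""
    n = len(b)
    x = np.zeros(n) if x0 is None else x0.astype(float)
    r = A @ x - b                         # gradient of the quadratic at x
    p = -r
    for _ in range(n):
        if np.linalg.norm(r) < tol:
            break
        Ap = A @ p
        alpha = (r @ r) / (p @ Ap)        # exact line minimum along p
        x = x + alpha * p
        r_new = r + alpha * Ap
        beta = (r_new @ r_new) / (r @ r)  # Fletcher-Reeves form of beta_k
        p = -r_new + beta * p
        r = r_new
    return x

A = np.array([[3.0, 1.0], [1.0, 2.0]])    # made-up SPD test matrix
b = np.array([1.0, 1.0])
x = linear_cg(A, b)
```

For an n-dimensional quadratic, CG terminates in at most n iterations.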
Nonlinear Conjugate Gradient

Even though conjugate gradient is derived for a quadratic objective function, it can be applied directly to other nonlinear functions.

Several variants:
- Fletcher-Reeves conjugate gradient (FR-CG)
- Polak-Ribiere conjugate gradient (PR-CG): more robust than FR-CG

Compared to the Newton method: no need to compute or store the Hessian matrix.
Generalizing Decision Trees

- A decision tree with simple data partition: each node splits on a single attribute (Attribute 1, Attribute 2)
- A decision tree using classifiers for data partition: each node is a linear classifier
Generalized Decision Trees

Each node is a linear classifier. Pros:
- Usually results in shallow trees
- Introduces nonlinearity into linear classifiers (e.g., logistic regression)
- Overcomes overfitting through the regularization mechanism within the classifier
- A better way to deal with real-valued attributes

Examples: neural networks, the Hierarchical Mixture Expert Model
Example

[Figure: the same data handled two ways, a kernel method (splitting around x = 0 after mapping) vs. a generalized tree]
Hierarchical Mixture Expert Model (HME)

Group Layer: r(x) routes input X to Group 1 (g1(x)) or Group 2 (g2(x))
Expert Layer: m1,1(x), m1,2(x) under Group 1; m2,1(x), m2,2(x) under Group 2; output y

• Ask r(x): which group should be used for classifying input x?
• If group 1 is chosen, which classifier m(x) should be used?
• Classify input x using the chosen classifier m(x)
Hierarchical Mixture Expert Model (HME): Probabilistic Description

(diagram: r(x) selects Group 1 (g1(x)) or Group 2 (g2(x)); experts m1,1(x), m1,2(x), m2,1(x), m2,2(x))

$$p(y|x) = \sum_{g,m} p(y, g, m|x) = r(+1|x)\left[g_1(+1|x)\,m_{1,1}(y|x) + g_1(-1|x)\,m_{1,2}(y|x)\right] + r(-1|x)\left[g_2(+1|x)\,m_{2,1}(y|x) + g_2(-1|x)\,m_{2,2}(y|x)\right]$$

Two hidden variables:
- The hidden variable for groups: g = {1, 2}
- The hidden variable for classifiers: m = {11, 12, 21, 22}
Hierarchical Mixture Expert Model (HME): Example

(diagram: r(x) selects Group 1 (g1(x)) or Group 2 (g2(x)); experts m1,1(x), m1,2(x), m2,1(x), m2,2(x))

r(+1|x) = ¾, r(−1|x) = ¼
g1(+1|x) = ¼, g1(−1|x) = ¾
g2(+1|x) = ½, g2(−1|x) = ½

|          | +1 | −1 |
|----------|----|----|
| m1,1(x)  | ¼  | ¾  |
| m1,2(x)  | ¾  | ¼  |
| m2,1(x)  | ¼  | ¾  |
| m2,2(x)  | ¾  | ¼  |

p(+1|x) = ?, p(−1|x) = ?
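Carrying out this computation with the earlier probabilistic description, using exact fractions for clarity (the variable names are mine):

```python
from fractions import Fraction as F

# Gate and expert outputs from the example
r = {+1: F(3, 4), -1: F(1, 4)}
g1 = {+1: F(1, 4), -1: F(3, 4)}
g2 = {+1: F(1, 2), -1: F(1, 2)}
m = {                                  # m[(i, j)][y] = m_{i,j}(y | x)
    (1, 1): {+1: F(1, 4), -1: F(3, 4)},
    (1, 2): {+1: F(3, 4), -1: F(1, 4)},
    (2, 1): {+1: F(1, 4), -1: F(3, 4)},
    (2, 2): {+1: F(3, 4), -1: F(1, 4)},
}

def p(y):
    """Marginalize over the hidden group g and expert m."""
    return (r[+1] * (g1[+1] * m[(1, 1)][y] + g1[-1] * m[(1, 2)][y])
          + r[-1] * (g2[+1] * m[(2, 1)][y] + g2[-1] * m[(2, 2)][y]))
```

This gives p(+1|x) = 19/32 and p(−1|x) = 13/32.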
Training HME

In the training examples {xi, yi} there is no information about r(x) or g(x) for each example. The random variables g, m are called hidden variables since they are not exposed in the training data. How do we train a model with hidden variables?
Start with Random Guess …

(diagram: x → r(x); groups g1(x), g2(x); experts m1,1(x), m1,2(x), m2,1(x), m2,2(x))

+: {1, 2, 3, 4, 5}
−: {6, 7, 8, 9}

Random assignment:
• Randomly assign points to each group and expert
• Learn classifiers r(x), g(x), m(x) using the randomly assigned points

Group memberships: {1,2} {6,7} | {3,4,5} {8,9}
Expert memberships: {1}{6} | {2}{7} | {3}{9} | {5,4}{8}
Adjust Group Memberships

(diagram: x → r(x); groups g1(x), g2(x); experts m1,1(x), m1,2(x), m2,1(x), m2,2(x))

+: {1, 2, 3, 4, 5}
−: {6, 7, 8, 9}

Group memberships: {1,2} {6,7} | {3,4,5} {8,9}
Expert memberships: {1}{6} | {2}{7} | {3}{9} | {5,4}{8}

• The key is to assign each data point to the group that classifies it correctly with the largest probability
• How?
Adjust Group Memberships

(diagram: x → r(x); groups g1(x), g2(x); experts m1,1(x), m1,2(x), m2,1(x), m2,2(x))

+: {1, 2, 3, 4, 5}
−: {6, 7, 8, 9}

• The key is to assign each data point to the group that classifies it correctly with the largest confidence
• Compute p(g=1|x, y) and p(g=2|x, y)

Posterior Prob. for Groups

| point | Group 1 | Group 2 |
|-------|---------|---------|
| 1     | 0.8     | 0.2     |
| 2     | 0.4     | 0.6     |
| 3     | 0.3     | 0.7     |
| 4     | 0.1     | 0.9     |
| 5     | 0.65    | 0.35    |
Adjust Memberships for Classifiers

(diagram: x → r(x); groups g1(x), g2(x); experts m1,1(x), m1,2(x), m2,1(x), m2,2(x))

+: {1, 2, 3, 4, 5}
−: {6, 7, 8, 9}

Group memberships: {1,5} {6,7} | {2,3,4} {8,9}

• The key is to assign each data point to the classifier that classifies it correctly with the largest confidence
• Compute p(m=1,1|x, y), p(m=1,2|x, y), p(m=2,1|x, y), p(m=2,2|x, y)
Adjust Memberships for Classifiers

(diagram: x → r(x); groups g1(x), g2(x); experts m1,1(x), m1,2(x), m2,1(x), m2,2(x))

+: {1, 2, 3, 4, 5}
−: {6, 7, 8, 9}

Group memberships: {1,5} {6,7} | {2,3,4} {8,9}

Posterior Prob. for Classifiers

|      | 1    | 2   | 3    | 4   | 5    |
|------|------|-----|------|-----|------|
| m1,1 | 0.7  | 0.1 | 0.15 | 0.1 | 0.05 |
| m1,2 | 0.2  | 0.2 | 0.20 | 0.1 | 0.55 |
| m2,1 | 0.05 | 0.5 | 0.60 | 0.1 | 0.3  |
| m2,2 | 0.05 | 0.2 | 0.05 | 0.7 | 0.1  |

• The key is to assign each data point to the classifier that classifies it correctly with the largest confidence
• Compute p(m=1,1|x, y), p(m=1,2|x, y), p(m=2,1|x, y), p(m=2,2|x, y)
Adjust Memberships for Classifiers

Posterior Prob. for Classifiers

|      | 1    | 2   | 3    | 4   | 5    |
|------|------|-----|------|-----|------|
| m1,1 | 0.7  | 0.1 | 0.15 | 0.1 | 0.05 |
| m1,2 | 0.2  | 0.2 | 0.20 | 0.1 | 0.55 |
| m2,1 | 0.05 | 0.5 | 0.60 | 0.1 | 0.3  |
| m2,2 | 0.05 | 0.2 | 0.05 | 0.7 | 0.1  |

• The key is to assign each data point to the classifier that classifies it correctly with the largest confidence
• Compute p(m=1,1|x, y), p(m=1,2|x, y), p(m=2,1|x, y), p(m=2,2|x, y)

(diagram: x → r(x); groups g1(x), g2(x); experts m1,1(x), m1,2(x), m2,1(x), m2,2(x))

+: {1, 2, 3, 4, 5}
−: {6, 7, 8, 9}

Group memberships: {1,5} {6,7} | {2,3,4} {8,9}
New expert memberships: {1}{6} | {5}{7} | {2,3}{9} | {4}{8}
Retrain the Model

• Retrain r(x), g(x), m(x) using the new memberships

(diagram: x → r(x); groups g1(x), g2(x); experts m1,1(x), m1,2(x), m2,1(x), m2,2(x))

+: {1, 2, 3, 4, 5}
−: {6, 7, 8, 9}

Group memberships: {1,5} {6,7} | {2,3,4} {8,9}
Expert memberships: {1}{6} | {5}{7} | {2,3}{9} | {4}{8}
Expectation Maximization

Two things to estimate:
- Logistic regression models for r(x; θr), g(x; θg) and m(x; θm)
- Unknown group and expert memberships: p(g=1,2|x), p(m=11,12|x, g=1), p(m=21,22|x, g=2)

E-step
1. Estimate p(g=1|x, y) and p(g=2|x, y) for all training examples, given the guessed r(x; θr), g(x; θg) and m(x; θm)
2. Estimate p(m=11,12|x, y) and p(m=21,22|x, y) for all training examples, given the guessed r(x; θr), g(x; θg) and m(x; θm)

M-step
1. Train r(x; θr) using weighted examples: for each x, a p(g=1|x) fraction as a positive example and a p(g=2|x) fraction as a negative example
2. Train g1(x; θg) using weighted examples: for each x, a p(g=1|x)p(m=11|x, g=1) fraction as a positive example and a p(g=1|x)p(m=12|x, g=1) fraction as a negative example. Train g2(x; θg) similarly
3. Train m(x; θm) with appropriately weighted examples
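As a concrete illustration of E-step step 1, the group posterior is just Bayes' rule over the hidden variable g. This sketch reuses the toy gate/expert numbers from the earlier HME example slide; the function name is mine.

```python
from fractions import Fraction as F

# Toy gate/expert outputs (from the earlier HME example slide)
r = {+1: F(3, 4), -1: F(1, 4)}
g1 = {+1: F(1, 4), -1: F(3, 4)}
g2 = {+1: F(1, 2), -1: F(1, 2)}
m = {
    (1, 1): {+1: F(1, 4), -1: F(3, 4)},
    (1, 2): {+1: F(3, 4), -1: F(1, 4)},
    (2, 1): {+1: F(1, 4), -1: F(3, 4)},
    (2, 2): {+1: F(3, 4), -1: F(1, 4)},
}

def e_step_group_posterior(y):
    """E-step: p(g | x, y) ∝ p(choose group g | x) * p(y | x, g)."""
    like1 = r[+1] * (g1[+1] * m[(1, 1)][y] + g1[-1] * m[(1, 2)][y])
    like2 = r[-1] * (g2[+1] * m[(2, 1)][y] + g2[-1] * m[(2, 2)][y])
    z = like1 + like2
    return like1 / z, like2 / z
```

For an observed label y = +1 these numbers give p(g=1|x, y) = 15/19, which is the fractional weight that point would contribute to group 1 in the M-step.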
Comparison of Different Classification Models

The goal of all classifiers: predicting the class label y for an input x, i.e., estimating p(y|x).

Gaussian generative model
- p(y|x) ∝ p(x|y) p(y): posterior ∝ likelihood × prior
- p(x|y) describes the input patterns for each class y
- Difficult to estimate if x is of high dimensionality
- Naïve Bayes: p(x|y) ≈ p(x1|y) p(x2|y) … p(xm|y), essentially a linear model

Linear discriminative model
- Directly estimate p(y|x)
- Focus on finding the decision boundary
Comparison of Different Classification Models

Logistic regression model
- A linear decision boundary: $\vec{w}\cdot\vec{x} + b > 0 \Rightarrow$ positive; $\vec{w}\cdot\vec{x} + b < 0 \Rightarrow$ negative
- A probabilistic model:

$$p(y|\vec{x}) = \frac{1}{1+\exp\left(-y\,(\vec{w}\cdot\vec{x}+b)\right)}$$

- Maximum likelihood approach for estimating weights w and threshold b:

$$l(D_{train}) = \sum_{i=1}^{N_+}\log p(+|\vec{x}_i^{(+)}) + \sum_{i=1}^{N_-}\log p(-|\vec{x}_i^{(-)}) = \sum_{i=1}^{N_+}\log\frac{1}{1+\exp\left(-(\vec{w}\cdot\vec{x}_i^{(+)}+b)\right)} + \sum_{i=1}^{N_-}\log\frac{1}{1+\exp\left(\vec{w}\cdot\vec{x}_i^{(-)}+b\right)}$$
Comparison of Different Classification Models

Logistic regression model: overfitting issue
- Example: text classification, where every word is assigned a different weight
- Words that appear in only one document will be assigned an infinitely large weight
- Solution: regularization

$$l(D_{train}) = \sum_{i=1}^{N_+}\log p(+|\vec{x}_i^{(+)}) + \sum_{i=1}^{N_-}\log p(-|\vec{x}_i^{(-)}) - s\sum_{j=1}^{m} w_j^2$$

where the last term is the regularization term.
Comparison of Different Classification Models

Conditional exponential model
- An extension of the logistic regression model to the multi-class case
- A different set of weights wy and threshold by for each class y:

$$p(y|\vec{x}) = \frac{1}{Z(\vec{x})}\exp\left(\vec{x}\cdot\vec{w}_y + b_y\right), \qquad Z(\vec{x}) = \sum_y \exp\left(\vec{x}\cdot\vec{w}_y + b_y\right)$$

Maximum entropy model
- Finding the simplest model that matches the data:

$$\max_{p(y|x)} H(y|x) = \max_{p} -\sum_{i=1}^{N}\sum_y p(y|x_i)\log p(y|x_i)$$

subject to

$$\sum_{i=1}^{N} p(y|x_i)\,\vec{x}_i = \sum_{i=1}^{N} \vec{x}_i\,\delta(y_i, y), \qquad \sum_y p(y|x_i) = 1$$

- Maximizing entropy prefers a uniform distribution
- The constraints enforce the model to be consistent with the observed data
Comparison of Different Classification Models

Support vector machine
- Classification margin
- Maximum margin principle: separate data far away from the decision boundary $\vec{w}\cdot\vec{x} + b = 0$
- Two objectives: minimize the classification error over training data; maximize the classification margin
- Support vectors: only support vectors have impact on the location of the decision boundary

(figure: points labeled +1 and −1, with the support vectors lying on the margin)
Comparison of Different Classification Models

Separable case:

$$\{\vec{w}^*, b^*\} = \arg\min_{\vec{w}, b}\; \sum_{i=1}^{m} w_i^2 \quad \text{subject to} \quad y_i\,(\vec{w}\cdot\vec{x}_i + b) \geq 1, \;\; i = 1, \ldots, N$$

Noisy case:

$$\{\vec{w}^*, b^*\} = \arg\min_{\vec{w}, b}\; \sum_{i=1}^{m} w_i^2 + c\sum_{j=1}^{N}\xi_j \quad \text{subject to} \quad y_i\,(\vec{w}\cdot\vec{x}_i + b) \geq 1 - \xi_i, \;\; \xi_i \geq 0, \;\; i = 1, \ldots, N$$

Quadratic programming!
Comparison of Classification Models

Logistic regression model vs. support vector machine:

$$\{\vec{w}, b\}^* = \arg\max_{\vec{w}, b}\; \sum_{i=1}^{N}\log\frac{1}{1+\exp\left(-y_i\,(\vec{w}\cdot\vec{x}_i+b)\right)} - s\sum_{j=1}^{m} w_j^2$$

$$\{\vec{w}^*, b^*\} = \arg\min_{\vec{w}, b}\; c\sum_{j=1}^{m} w_j^2 + \sum_{i=1}^{N}\xi_i \quad \text{subject to} \quad y_i\,(\vec{w}\cdot\vec{x}_i+b) \geq 1 - \xi_i, \;\; \xi_i \geq 0$$

- The log-likelihood can be viewed as a measurement of accuracy
- The two objectives contain identical (regularization) terms
Comparison of Different Classification Models

[Plot: loss as a function of w·x+b, comparing the loss function for logistic regression with the loss function for SVM]

Logistic regression differs from the support vector machine only in the loss function.
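The two losses in the plot can be written as functions of the margin z = y(w·x + b); this small sketch (names mine) makes the difference concrete: hinge loss is exactly zero once the margin z ≥ 1 is met, while the logistic loss decays smoothly and never reaches zero.

```python
import math

def logistic_loss(z):
    """Negative log-likelihood of logistic regression, z = y(w.x + b)."""
    return math.log(1.0 + math.exp(-z))

def hinge_loss(z):
    """SVM hinge loss, z = y(w.x + b)."""
    return max(0.0, 1.0 - z)

margins = [-2.0, 0.0, 1.0, 3.0]
log_losses = [logistic_loss(z) for z in margins]
hinge_losses = [hinge_loss(z) for z in margins]
```

Both penalize points on the wrong side of the boundary (z < 0) roughly linearly, which is why the two models behave so similarly in practice.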
Comparison of Different Classification Models

[Plot: histograms of the positive and negative data over x, with a fitting curve for each class]

• Generative models have trouble at the decision boundary
• Classification boundary that achieves the least training error
• Classification boundary that achieves a large margin
Nonlinear Models

Kernel methods
- Add additional dimensions to help separate the data: $\vec{x} \rightarrow \Phi(\vec{x})$
- Efficiently compute the dot product in the high-dimensional space: $\Phi(\vec{w})\cdot\Phi(\vec{x}) = K(\vec{w}, \vec{x})$
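A quick numerical check of the kernel idea, using the quadratic kernel K(u, v) = (u·v)², whose explicit feature map in two dimensions has a known closed form (the helper names are mine):

```python
import numpy as np

def phi(v):
    """Explicit feature map for K(u, v) = (u.v)^2 in two dimensions:
    (x1^2, sqrt(2) x1 x2, x2^2)."""
    x1, x2 = v
    return np.array([x1 * x1, np.sqrt(2.0) * x1 * x2, x2 * x2])

def K(u, v):
    """Kernel evaluated directly in the original space: O(d) work."""
    return float(np.dot(u, v)) ** 2

u = np.array([1.0, 2.0])
v = np.array([3.0, -1.0])
lhs = float(phi(u) @ phi(v))   # dot product in the expanded feature space
rhs = K(u, v)                  # same value, without ever forming phi
```

The kernel computes the expanded dot product without ever materializing the higher-dimensional vectors, which is what makes kernelized classifiers tractable.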
Nonlinear Models

Decision trees
- Nonlinearly combine different features through a tree structure

Hierarchical Mixture Expert Model
- Replace each node with a logistic regression model
- Nonlinearly combines multiple linear models

(diagram: r(x) → Group 1 (g1(x)), Group 2 (g2(x)); Expert Layer: m1,1(x), m1,2(x), m2,1(x), m2,2(x))