CES 514 – Data Mining, Lecture 8
Classification (contd.)
Example: PEBLS
PEBLS: Parallel Exemplar-Based Learning System (Cost & Salzberg)
– Works with both continuous and nominal features. For nominal features, the distance between two nominal values is computed using the modified value difference metric (MVDM).
– Each record is assigned a weight factor.
– Number of nearest neighbors: k = 1
Example: PEBLS
Class counts by Marital Status:

Class   Single   Married   Divorced
Yes     2        0         1
No      2        4         1
Distance between nominal attribute values:

    d(V1, V2) = Σ_i | n1i/n1 − n2i/n2 |

where n_ji is the number of examples with value V_j that belong to class i, and n_j is the total number of examples with value V_j.
d(Single, Married)   = |2/4 − 0/4| + |2/4 − 4/4| = 1
d(Single, Divorced)  = |2/4 − 1/2| + |2/4 − 1/2| = 0
d(Married, Divorced) = |0/4 − 1/2| + |4/4 − 1/2| = 1
d(Refund=Yes, Refund=No) = |0/3 − 3/7| + |3/3 − 4/7| = 6/7
Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes
Class counts by Refund:

Class   Yes   No
Yes     0     3
No      3     4
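The MVDM computations above can be sketched in a few lines of Python (function and variable names are illustrative, not from PEBLS itself):

```python
# Minimal sketch of the MVDM nominal-value distance, using the
# class-count tables above. Each value is represented by its tuple of
# per-class counts (here: counts for class Yes and class No).

def mvdm(counts_v1, counts_v2):
    """d(V1, V2) = sum over classes i of |n1i/n1 - n2i/n2|."""
    n1, n2 = sum(counts_v1), sum(counts_v2)
    return sum(abs(a / n1 - b / n2) for a, b in zip(counts_v1, counts_v2))

# Class counts (Yes, No) from the tables:
single, married, divorced = (2, 2), (0, 4), (1, 1)
refund_yes, refund_no = (0, 3), (3, 4)

print(mvdm(single, married))          # 1.0
print(mvdm(single, divorced))         # 0.0
print(mvdm(refund_yes, refund_no))    # 6/7 ≈ 0.857
```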
Example: PEBLS
Distance between record X and record Y:

    Δ(X, Y) = w_X · w_Y · Σ_{i=1..d} d(X_i, Y_i)²

Tid  Refund  Marital Status  Taxable Income  Cheat
X    Yes     Single          125K            No
Y    No      Married         100K            No

where:

    w_X = (number of times X is used for prediction) / (number of times X correctly predicts)

w_X ≈ 1 if X makes accurate predictions most of the time
w_X > 1 if X is not reliable for making predictions
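A hypothetical sketch of the record distance, combining the per-attribute nominal distances computed earlier (names and the worked numbers are illustrative, assuming w_X = w_Y = 1):

```python
# Delta(X, Y) = w_X * w_Y * sum over attributes of d(X_i, Y_i)^2

def record_distance(X, Y, w_X, w_Y, dists):
    """dists[i] is the per-attribute nominal distance function."""
    return w_X * w_Y * sum(d(xi, yi) ** 2
                           for d, xi, yi in zip(dists, X, Y))

# Per-attribute distances from the tables above (symmetric):
def d_refund(a, b):
    return 0.0 if a == b else 6 / 7  # d(Refund=Yes, Refund=No) = 6/7

_marital = {frozenset(("Single", "Married")): 1.0,
            frozenset(("Single", "Divorced")): 0.0,
            frozenset(("Married", "Divorced")): 1.0}

def d_marital(a, b):
    return 0.0 if a == b else _marital[frozenset((a, b))]

# X = (Yes, Single), Y = (No, Married), weights assumed to be 1:
delta = record_distance(("Yes", "Single"), ("No", "Married"),
                        1.0, 1.0, [d_refund, d_marital])
print(delta)  # (6/7)**2 + 1**2 = 85/49 ≈ 1.735
```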
Support Vector Machines
Find a linear hyperplane (decision boundary) that will separate the data
Support Vector Machines
One possible solution: hyperplane B1
Support Vector Machines
Another possible solution: hyperplane B2
Support Vector Machines
Other possible solutions
Support Vector Machines
Which one is better, B1 or B2? How do you define "better"?
Support Vector Machines
Find the hyperplane that maximizes the margin (e.g., B1 is better than B2).
Support Vector Machines
For hyperplane B1, with margin boundaries b11 and b12:

    w · x + b = 0                       (decision boundary B1)
    w · x + b = 1,  w · x + b = −1      (margin boundaries b11, b12)

    f(x) =  1 if w · x + b ≥ 1
           −1 if w · x + b ≤ −1

    Margin = 2 / ||w||
Support Vector Machines
We want to maximize:

    Margin = 2 / ||w||

– This is equivalent to minimizing:

    L(w) = ||w||² / 2

– Subject to the constraints:

    f(x_i) =  1 if w · x_i + b ≥ 1
             −1 if w · x_i + b ≤ −1

This is a constrained optimization problem.
– Numerical approaches can solve it (e.g., quadratic programming).
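As a small numeric illustration (data and candidate hyperplane are invented for this sketch, not from the slides), we can check the constraints y_i(w · x_i + b) ≥ 1 for a given hyperplane and compute its margin 2/||w||:

```python
import numpy as np

# Toy linearly separable data with labels +1 / -1 (illustrative):
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1, 1, -1, -1])

# Assumed candidate hyperplane w . x + b = 0:
w = np.array([0.5, 0.5])
b = -1.0

feasible = bool(np.all(y * (X @ w + b) >= 1))  # all constraints satisfied?
margin = 2 / np.linalg.norm(w)                 # 2 / ||w||
print(feasible, margin)  # True, 2*sqrt(2) ≈ 2.828
```

A quadratic-programming solver would search over (w, b) for the feasible hyperplane whose margin is largest; here we only verify one candidate.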
Overview of optimization
Simplest optimization problem: maximize f(x) (one variable).
If the function has nice properties (e.g., it is differentiable), we can use calculus: solve the equation f′(x) = 0. Suppose a root is a; then if f″(a) < 0, a is a maximum.
Tricky issues:
• How do we solve the equation f′(x) = 0?
• What if there are many solutions? Each is a "local" optimum.
How to solve g(x) = 0
Even polynomial equations are hard to solve exactly.
The quadratic has a closed-form solution. What about higher degrees? (In general, degree five and higher have no closed-form solution.)
Numerical techniques (iteration):
• bisection
• secant
• Newton-Raphson, etc.
Challenges:
• choosing the initial guess
• rate of convergence
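The Newton-Raphson iteration applied to f′(x) = 0 can be sketched as follows (the example function is invented for illustration):

```python
def newton(f_prime, f_double_prime, x0, tol=1e-10, max_iter=100):
    """Newton-Raphson on f'(x) = 0: x_{k+1} = x_k - f'(x_k) / f''(x_k)."""
    x = x0
    for _ in range(max_iter):
        step = f_prime(x) / f_double_prime(x)
        x -= step
        if abs(step) < tol:
            break
    return x

# Example: maximize f(x) = -(x - 3)**2 + 5.
# f'(x) = -2(x - 3), f''(x) = -2 < 0, so the root x = 3 is a maximum.
x_star = newton(lambda x: -2 * (x - 3), lambda x: -2.0, x0=0.0)
print(x_star)  # 3.0
```

Note how both tricky issues show up: the iteration needs an initial guess x0, and it only finds one root of f′, which may be a local optimum.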
Functions of several variables
Consider a function F(x, y) of two variables.
To find the maximum of F(x, y), we solve the equations

    ∂F/∂x = 0   and   ∂F/∂y = 0

If we can solve this system of equations, then we have found a local maximum or minimum of F.
We can solve the equations using numerical techniques similar to the one-dimensional case.
When is the solution maximum or minimum?
• Hessian:

    H(x, y) = | ∂²f/∂x²    ∂²f/∂x∂y |
              | ∂²f/∂y∂x   ∂²f/∂y²  |

• If the Hessian is positive definite in a neighborhood of a, then a is a minimum.
• If the Hessian is negative definite in a neighborhood of a, then a is a maximum.
• If it is neither, then a is a saddle point.
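A quick sketch of this test using eigenvalues (a symmetric matrix is positive definite iff all its eigenvalues are positive; the example functions are invented for illustration):

```python
import numpy as np

def classify_critical_point(H):
    """Classify a critical point from its (symmetric) Hessian matrix."""
    eig = np.linalg.eigvalsh(H)
    if np.all(eig > 0):
        return "minimum"        # positive definite
    if np.all(eig < 0):
        return "maximum"        # negative definite
    return "saddle point"       # indefinite (or semidefinite)

# f(x, y) = x**2 + y**2 has Hessian [[2, 0], [0, 2]] -> minimum at origin.
print(classify_critical_point(np.array([[2.0, 0.0], [0.0, 2.0]])))
# f(x, y) = x**2 - y**2 has Hessian [[2, 0], [0, -2]] -> saddle point.
print(classify_critical_point(np.array([[2.0, 0.0], [0.0, -2.0]])))
```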
Application - linear regression
Problem: given (x1, y1), …, (xn, yn), find the best linear relation between x and y.
Assume y = Ax + B. To find A and B, we minimize

    E(A, B) = Σ_{j=1..n} (y_j − A x_j − B)²

Since this is a function of two variables, we can solve it by setting

    ∂E/∂A = 0   and   ∂E/∂B = 0
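Setting these two partial derivatives to zero gives the familiar closed-form least-squares solution, sketched below (the sample data are invented for illustration):

```python
def fit_line(xs, ys):
    """Solve dE/dA = 0 and dE/dB = 0 in closed form:
    A = cov(x, y) / var(x),  B = mean(y) - A * mean(x)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    A = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    B = my - A * mx
    return A, B

# Data lying exactly on y = 2x + 1:
A, B = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
print(A, B)  # 2.0 1.0
```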
Constrained optimization
Maximize f(x, y) subject to g(x, y) = c.
Using a Lagrange multiplier λ, the problem is formulated as maximizing:

    h(x, y, λ) = f(x, y) + λ (g(x, y) − c)

Now solve the equations:

    ∂h/∂x = 0,   ∂h/∂y = 0,   ∂h/∂λ = 0
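A worked example (invented for this sketch): maximize f(x, y) = xy subject to x + y = 10. The Lagrange conditions ∂h/∂x = y + λ = 0, ∂h/∂y = x + λ = 0, ∂h/∂λ = x + y − 10 = 0 happen to be linear, so we can solve them directly:

```python
import numpy as np

# Coefficient matrix for the unknowns (x, y, lam):
#   y + lam = 0;  x + lam = 0;  x + y = 10
A = np.array([[0.0, 1.0, 1.0],
              [1.0, 0.0, 1.0],
              [1.0, 1.0, 0.0]])
rhs = np.array([0.0, 0.0, 10.0])

x, y_, lam = np.linalg.solve(A, rhs)
print(x, y_, lam)  # 5.0 5.0 -5.0  (maximum of xy on the line is at x = y = 5)
```

In general the system is nonlinear and must be solved with the numerical techniques discussed above.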
Support Vector Machines (contd)
What if the problem is not linearly separable?
Support Vector Machines
What if the problem is not linearly separable?
– Introduce slack variables ξ_i.
Need to minimize:

    L(w) = ||w||²/2 + C Σ_{i=1..N} ξ_i^k

Subject to:

    f(x_i) =  1 if w · x_i + b ≥ 1 − ξ_i
             −1 if w · x_i + b ≤ −1 + ξ_i
Nonlinear Support Vector Machines
What if the decision boundary is not linear?
Nonlinear Support Vector Machines
Transform the data into a higher-dimensional space where it becomes linearly separable.
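As a toy illustration of such a transform (the data and the mapping are invented for this sketch): 1-D points whose class depends on |x| cannot be separated by any threshold on x, but mapping x → (x, x²) makes them linearly separable.

```python
import numpy as np

# Class +1 for points far from 0, class -1 for points near 0:
x = np.array([-3.0, -2.0, 2.0, 3.0, -0.5, 0.0, 0.5])
y = np.array([1, 1, 1, 1, -1, -1, -1])

# Transform into 2-D feature space: phi(x) = (x, x**2)
phi = np.column_stack([x, x ** 2])

# In the transformed space, the horizontal line (second coordinate = 2)
# separates the classes:
pred = np.where(phi[:, 1] > 2, 1, -1)
print(np.array_equal(pred, y))  # True
```

Kernel SVMs perform this kind of transform implicitly, without computing the mapped coordinates explicitly.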
Artificial Neural Networks (ANN)
X1  X2  X3  Y
1   0   0   0
1   0   1   1
1   1   0   1
1   1   1   1
0   0   1   0
0   1   0   0
0   1   1   1
0   0   0   0
[Diagram: inputs X1, X2, X3 feed a black box that produces output Y]
Output Y is 1 if at least two of the three inputs are equal to 1.
Artificial Neural Networks (ANN)
X1  X2  X3  Y
1   0   0   0
1   0   1   1
1   1   0   1
1   1   1   1
0   0   1   0
0   1   0   0
0   1   1   1
0   0   0   0
[Diagram: input nodes X1, X2, X3, each with weight 0.3, feed an output node Y with threshold t = 0.4]
    Y = I(0.3 X1 + 0.3 X2 + 0.3 X3 − 0.4 > 0)

where I(z) = 1 if z is true, and 0 otherwise.
Artificial Neural Networks (ANN)
The model is an assembly of inter-connected nodes and weighted links.
The output node sums up its input values according to the weights of its links.
The weighted sum is compared against some threshold t.
[Diagram: input nodes X1, X2, X3 with weights w1, w2, w3 feed an output node Y with threshold t]
Perceptron model:

    Y = I(Σ_i w_i X_i − t > 0)   or   Y = sign(Σ_i w_i X_i − t)
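The perceptron from the slide can be checked directly against the truth table (a minimal sketch, using the weights and threshold given above):

```python
def perceptron(x, w, t):
    """Y = I(sum_i w_i * x_i - t > 0)."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) - t > 0 else 0

# Weights and threshold from the slide: w = (0.3, 0.3, 0.3), t = 0.4.
w, t = (0.3, 0.3, 0.3), 0.4

# Truth table: Y = 1 iff at least two of the three inputs are 1.
table = {(1, 0, 0): 0, (1, 0, 1): 1, (1, 1, 0): 1, (1, 1, 1): 1,
         (0, 0, 1): 0, (0, 1, 0): 0, (0, 1, 1): 1, (0, 0, 0): 0}

print(all(perceptron(x, w, t) == y for x, y in table.items()))  # True
```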
General Structure of ANN
[Diagram: neuron i receives inputs I1, I2, I3 with weights wi1, wi2, wi3; the weighted sum Si passes through an activation function g(Si), with threshold t, to produce output Oi]
[Diagram: a multilayer network with an input layer (x1, x2, x3, x4, x5), a hidden layer, and an output layer producing y]
Training an ANN means learning the weights of the neurons.
Algorithm for learning ANN
Initialize the weights (w0, w1, …, wk).
Adjust the weights so that the output of the ANN is consistent with the class labels of the training examples.
– Objective function:

    E = Σ_i [Y_i − f(w, X_i)]²

– Find the weights w_i that minimize this objective function, e.g., with the backpropagation algorithm.
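For a single-layer network, the weight-adjustment idea can be sketched with the perceptron learning rule (a simple special case; backpropagation generalizes this update to multilayer networks). The data here is the "at least two of three inputs" truth table from the earlier slides:

```python
# Minimal sketch of weight learning via the perceptron rule:
# after each example, nudge the weights in the direction that
# reduces the error (y - pred).

def train_perceptron(data, lr=0.1, epochs=100):
    w, b = [0.0, 0.0, 0.0], 0.0
    for _ in range(epochs):
        for x, y in data:
            pred = 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0
            err = y - pred
            w = [wi + lr * err * xi for wi, xi in zip(w, x)]
            b += lr * err
    return w, b

data = [((1, 0, 0), 0), ((1, 0, 1), 1), ((1, 1, 0), 1), ((1, 1, 1), 1),
        ((0, 0, 1), 0), ((0, 1, 0), 0), ((0, 1, 1), 1), ((0, 0, 0), 0)]

w, b = train_perceptron(data)
preds = [1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0
         for x, _ in data]
print(preds == [y for _, y in data])  # True: the learned weights fit the table
```

Convergence is guaranteed here because the majority function is linearly separable; backpropagation instead follows the gradient of the squared-error objective E above.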
WEKA
WEKA implementations
WEKA has implementations of all the major data mining algorithms, including:
• decision trees (CART, C4.5, etc.)
• the naïve Bayes algorithm and its variants
• nearest-neighbor classifiers
• linear classifiers
• support vector machines
• clustering algorithms
• boosting algorithms, etc.