MORE CLASSIFIERS
AGENDA
Key concepts for all classifiers
Precision vs. recall
Biased sample sets
Linear classifiers
Intro to neural networks
RECAP: DECISION BOUNDARIES
With continuous attributes, a decision boundary is the surface in example space that splits positive from negative examples.
[Figure: a decision tree with tests x1 >= 20, x2 >= 10, and x2 >= 15 at its nodes, and the axis-aligned decision boundary it induces in the (x1, x2) plane]
BEYOND ERROR RATES
Predicting security risk: predicting “low risk” for a terrorist is far worse than predicting “high risk” for an innocent bystander (but maybe not for 5 million of them).
Searching for images: returning irrelevant images is worse than omitting relevant ones.
BIASED SAMPLE SETS
Often there are orders of magnitude more negative examples than positive ones, e.g., all images of Kris on Facebook.
If I classify all images as “not Kris,” I’ll have >99.99% accuracy.
Examples of Kris should count much more than non-Kris examples!
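A tiny numeric sketch of this trap, with made-up counts (10,000 non-Kris images, 1 Kris image):

```python
# With a heavily biased sample set, the trivial "always negative" classifier
# looks nearly perfect by raw accuracy, despite never finding a positive.
negatives, positives = 10_000, 1
correct = negatives               # "not Kris" is right on every negative example
total = negatives + positives
accuracy = correct / total
print(f"{accuracy:.4%}")          # prints "99.9900%"
```

This is why error rate alone is a poor score here: the one example that matters contributes almost nothing to it.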
FALSE POSITIVES
[Figure: true decision boundary vs. learned decision boundary in the (x1, x2) plane]
FALSE POSITIVES
A false positive: a new query incorrectly predicted to be positive.
[Figure: the new query falls between the true and learned decision boundaries in the (x1, x2) plane]
FALSE NEGATIVES
A false negative: a new query incorrectly predicted to be negative.
[Figure: the new query falls between the true and learned decision boundaries in the (x1, x2) plane]
PRECISION VS. RECALL
Precision = (# of relevant documents retrieved) / (# of total documents retrieved)
Recall = (# of relevant documents retrieved) / (# of total relevant documents)
Both are numbers between 0 and 1.
PRECISION VS. RECALL
Precision = (# true positives) / (# true positives + # false positives)
Recall = (# true positives) / (# true positives + # false negatives)
A precise classifier is selective; a classifier with high recall is inclusive.
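The two definitions compute directly from confusion counts; the counts in the usage example below are hypothetical (8 true positives, 2 false positives, 4 false negatives):

```python
def precision_recall(tp, fp, fn):
    """Precision and recall from true-positive, false-positive,
    and false-negative counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return precision, recall

# 8 relevant items retrieved, 2 irrelevant retrieved, 4 relevant missed
p, r = precision_recall(tp=8, fp=2, fn=4)
print(p, r)  # precision 0.8, recall 2/3
```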
REDUCING FALSE POSITIVE RATE
[Figure: the learned decision boundary shifted relative to the true boundary so the classifier predicts positive less often]
REDUCING FALSE NEGATIVE RATE
[Figure: the learned decision boundary shifted relative to the true boundary so the classifier predicts negative less often]
PRECISION-RECALL CURVES
Measure precision vs. recall as the decision boundary is tuned.
[Figure: precision-recall plot; a perfect classifier sits at precision = recall = 1, while actual performance traces a curve below that point]
PRECISION-RECALL CURVES
[Figure: along the curve, one end penalizes false negatives, the other penalizes false positives, with equal weight in between]
PRECISION-RECALL CURVES
[Figure: two precision-recall curves; the curve closer to the top-right corner indicates better learning performance]
OPTION 1: CLASSIFICATION THRESHOLDS
Many learning algorithms (e.g., probabilistic models, linear models) give a real-valued output v(x) that needs thresholding for classification:
v(x) > t => positive label given to x
v(x) < t => negative label given to x
We may want to tune the threshold t to get fewer false positives or false negatives.
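A sketch of the trade-off: raising the threshold t trades false positives for false negatives. The scores v(x), labels, and thresholds below are made-up values.

```python
# Hypothetical real-valued scores v(x) and true labels for five examples.
scores = [0.9, 0.7, 0.6, 0.4, 0.2]
labels = [1,   1,   0,   1,   0]

def confusion(t):
    """False-positive and false-negative counts at threshold t."""
    fp = sum(1 for v, y in zip(scores, labels) if v > t and y == 0)
    fn = sum(1 for v, y in zip(scores, labels) if v <= t and y == 1)
    return fp, fn

print(confusion(0.5))  # → (1, 1): a lower threshold admits a false positive
print(confusion(0.8))  # → (0, 2): a higher threshold trades it for false negatives
```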
OPTION 2: WEIGHTED DATASETS
Weighted datasets: attach a weight w to each example to indicate how important it is.
Instead of counting the “# of errors,” count the “sum of weights of errors.”
Or construct a resampled dataset D’ where each example is duplicated proportionally to its w.
As the relative weight of positive vs. negative examples is tuned from 0 to 1, the precision-recall curve is traced out.
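A minimal sketch of the weighted error count described above; the predictions, labels, and weights are made-up values, with one missed positive weighted 5x:

```python
# Each example is (predicted, true, weight); errors are scored by the
# sum of the weights of the misclassified examples, not by their count.
examples = [
    (1, 1, 1.0),
    (0, 1, 5.0),   # a missed positive counts 5x
    (1, 0, 1.0),
    (0, 0, 1.0),
]
weighted_error = sum(w for pred, true, w in examples if pred != true)
print(weighted_error)  # → 6.0, versus a plain error count of 2
```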
LINEAR CLASSIFIERS: MOTIVATION
A decision tree produces axis-aligned decision boundaries. Can we accurately classify data like this?
[Figure: data in the (x1, x2) plane separable only by a non-axis-aligned boundary]
PLANE GEOMETRY
Any line in 2D can be expressed as the set of solutions (x, y) to the equation ax + by + c = 0 (an implicit surface).
ax + by + c > 0 is one side of the line; ax + by + c < 0 is the other; ax + by + c = 0 is the line itself.
[Figure: a line in the (x, y) plane with its normal direction (a, b)]
PLANE GEOMETRY
In 3D, a plane can be expressed as the set of solutions (x, y, z) to the equation ax + by + cz + d = 0.
ax + by + cz + d > 0 is one side of the plane; ax + by + cz + d < 0 is the other; ax + by + cz + d = 0 is the plane itself.
[Figure: a plane in (x, y, z) space with its normal direction (a, b, c)]
LINEAR CLASSIFIER
In d dimensions, c0 + c1·x1 + … + cd·xd = 0 is a hyperplane.
Idea:
Use c0 + c1·x1 + … + cd·xd > 0 to denote positive classifications.
Use c0 + c1·x1 + … + cd·xd < 0 to denote negative classifications.
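A minimal sketch of this decision rule; the coefficient vector c and the query points are made-up values:

```python
def linear_classify(c, x):
    """c = [c0, c1, ..., cd]; x = [x1, ..., xd].
    Returns +1 if c0 + c1*x1 + ... + cd*xd > 0, else -1."""
    v = c[0] + sum(ci * xi for ci, xi in zip(c[1:], x))
    return 1 if v > 0 else -1

c = [-1.0, 1.0, 1.0]                 # hyperplane x1 + x2 - 1 = 0
print(linear_classify(c, [2, 0]))    # → 1  (positive side of the line)
print(linear_classify(c, [0, 0]))    # → -1 (negative side of the line)
```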
PERCEPTRON
[Figure: a perceptron unit; inputs x1, …, xn are multiplied by weights wi, summed (Σ), and passed through a threshold function g to produce the output y]
y = f(x, w) = g(Σi=1,…,n wi xi)
[Figure: + and − examples in the (x1, x2) plane separated by the line w1 x1 + w2 x2 = 0; g(u) is a step function of u]
A SINGLE PERCEPTRON CAN LEARN
A disjunction of boolean literals x1 ∨ x2 ∨ x3
The majority function
XOR?
PERCEPTRON LEARNING RULE
θ ← θ + x(i) (y(i) − g(θT x(i)))   (g outputs either 0 or 1; y is either 0 or 1)
If the output is correct, the weights are unchanged.
If g is 0 but y is 1, the weight on each active attribute i is increased.
If g is 1 but y is 0, the weight on each active attribute i is decreased.
Converges if the data is linearly separable, but oscillates otherwise.
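A minimal sketch of this learning rule (with a constant-1 input appended so a bias weight is learned too), trained on the linearly separable OR function:

```python
def g(u):
    """0/1 threshold function."""
    return 1 if u > 0 else 0

def train_perceptron(data, epochs=20):
    """data: list of (x, y) with x a feature tuple and y in {0, 1}.
    Applies theta <- theta + x * (y - g(theta . x)) per example."""
    d = len(data[0][0])
    theta = [0.0] * (d + 1)             # theta[0] is the bias weight
    for _ in range(epochs):
        for x, y in data:
            xb = (1.0,) + tuple(x)      # prepend the constant-1 feature
            pred = g(sum(t * xi for t, xi in zip(theta, xb)))
            theta = [t + xi * (y - pred) for t, xi in zip(theta, xb)]
    return theta

# OR of two booleans -- linearly separable, so the rule converges
data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]
theta = train_perceptron(data)
print([g(theta[0] + theta[1] * a + theta[2] * b) for (a, b), _ in data])
# → [0, 1, 1, 1]
```

On XOR, by contrast, the same loop would keep updating forever: the weights oscillate because no linear separator exists.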
PERCEPTRON
y = f(x, w) = g(Σi=1,…,n wi xi)
[Figure: + and − examples in the (x1, x2) plane that no single line separates; the label of a new query point (?) is ambiguous]
UNIT (NEURON)
[Figure: a unit with inputs x1, …, xn, weights wi, summation Σ, and activation g producing output y]
y = g(Σi=1,…,n wi xi),  g(u) = 1 / [1 + exp(−u)]
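The unit above computes a weighted sum squashed through the sigmoid; the weights and inputs in this sketch are arbitrary illustrative values:

```python
import math

def unit(w, x):
    """Sigmoid unit: y = g(sum_i w_i x_i), with g(u) = 1 / (1 + e^-u)."""
    u = sum(wi * xi for wi, xi in zip(w, x))
    return 1.0 / (1.0 + math.exp(-u))

# Weighted sum is 1.0*0.5 + (-2.0)*0.25 = 0, and g(0) = 0.5
print(unit([1.0, -2.0], [0.5, 0.25]))  # → 0.5
```

Unlike the perceptron's step function, g is smooth, which is what makes gradient-based training possible later.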
NEURAL NETWORK
A network of interconnected neurons.
[Figure: two units chained together, the output y of one feeding an input of the next]
Acyclic (feed-forward) vs. recurrent networks
TWO-LAYER FEED-FORWARD NEURAL NETWORK
[Figure: inputs connect through weights w1j to a hidden layer, whose outputs connect through weights w2k to an output layer]
NETWORKS WITH HIDDEN LAYERS
Can represent XORs and other nonlinear functions.
Common neuron types: soft perceptron (sigmoid), radial basis functions, linear, …
As the number of hidden units increases, so does the network’s capacity to learn functions with more nonlinear features.
How to train hidden layers?
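A sketch of the XOR claim: with hand-picked (not learned) weights, two sigmoid hidden units plus one output unit reproduce XOR, which a single perceptron cannot. The specific weight values are illustrative choices.

```python
import math

def sigmoid(u):
    return 1.0 / (1.0 + math.exp(-u))

def xor_net(x1, x2):
    """Two-layer network computing XOR with hand-picked weights."""
    h1 = sigmoid(20 * x1 + 20 * x2 - 10)    # ≈ OR(x1, x2)
    h2 = sigmoid(20 * x1 + 20 * x2 - 30)    # ≈ AND(x1, x2)
    return sigmoid(20 * h1 - 20 * h2 - 10)  # ≈ OR but not AND

print([round(xor_net(a, b)) for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]])
# → [0, 1, 1, 0]
```

The open question on the slide is how to *learn* such hidden-layer weights from data, which is what backpropagation addresses next.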
BACKPROPAGATION (PRINCIPLE)
Treat the problem as one of minimizing the error between the example label and the network output, given the example and the network weights as input:
Error(xi, yi, w) = (yi − f(xi, w))²
Sum this error term over all examples:
E(w) = Σi Error(xi, yi, w) = Σi (yi − f(xi, w))²
Minimize the error using an optimization algorithm; stochastic gradient descent is typically used.
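A sketch of this error sum for a toy one-weight model f(x, w) = w·x standing in for the network's forward pass; the data points are made up:

```python
def E(w, data):
    """Summed squared error E(w) = sum_i (y_i - f(x_i, w))^2
    for the toy model f(x, w) = w * x."""
    return sum((y - w * x) ** 2 for x, y in data)

data = [(1.0, 2.0), (2.0, 4.0), (3.0, 5.0)]
print(E(2.0, data))  # → 1.0: at w = 2 only the last point misses, by 1
```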
The gradient direction is orthogonal to the level sets (contours) of E and points in the direction of steepest increase.
Gradient descent: iteratively move in the direction opposite the gradient of E.
[Figure: successive gradient descent steps crossing the contours of E toward a minimum]
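A sketch of gradient descent on a one-dimensional quadratic E(w) = (w − 3)², whose gradient is dE/dw = 2(w − 3); the step size α = 0.1 is an arbitrary choice:

```python
def grad(w):
    """Gradient of E(w) = (w - 3)^2."""
    return 2 * (w - 3)

w, alpha = 0.0, 0.1
for _ in range(100):
    w -= alpha * grad(w)   # step opposite the gradient: steepest decrease
print(round(w, 4))  # → 3.0, the minimizer of E
```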
STOCHASTIC GRADIENT DESCENT
For each example (xi, yi), take a gradient descent step to reduce the error for (xi, yi) only.
STOCHASTIC GRADIENT DESCENT
Objective function values (measured over all examples) settle into a local minimum over time.
The step size must be reduced over time, e.g., O(1/t).
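A sketch of stochastic gradient descent with an O(1/t) step size, fitting the toy one-weight model y ≈ w·x on one sampled example per step; the dataset and the 1/(16 + t) schedule are illustrative choices:

```python
import random

random.seed(0)
data = [(x, 2.0 * x) for x in [1.0, 2.0, 3.0, 4.0]]  # true slope is 2

w = 0.0
for t in range(1, 2001):
    x, y = random.choice(data)       # one example per step
    g = -2 * x * (y - w * x)         # gradient of (y - w*x)^2 w.r.t. w
    w -= g / (16 + t)                # O(1/t) step size, shrinking over time
print(round(w, 3))  # → 2.0
```

The decaying step size is what lets the iterates settle instead of bouncing around the minimum indefinitely.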
NEURAL NETWORKS: PROS AND CONS
Pros:
Bioinspiration is nifty.
Can represent a wide variety of decision boundaries.
Complexity is easily tunable (number of hidden nodes, topology).
Easily extendable to regression tasks.
Cons:
Haven’t gotten close to unlocking the power of the human (or cat) brain.
Complex boundaries need lots of data.
Slow training.
Mostly lukewarm feelings in mainstream ML (although the “deep learning” variant is en vogue now).
NEXT CLASS Another guest lecture