Business Systems Intelligence:
5. Classification 2
Dr. Brian Mac Namee (www.comp.dit.ie/bmacnamee)

Post on 20-Dec-2015

TRANSCRIPT
Acknowledgments

These notes are based (heavily) on those provided by the authors to accompany "Data Mining: Concepts & Techniques" by Jiawei Han and Micheline Kamber

Some slides are also based on trainer's kits provided by

More information about the book is available at: www-sal.cs.uiuc.edu/~hanj/bk2/

And information on SAS is available at: www.sas.com
Classification & Prediction

Today we will look at:
– What are classification & prediction?
– Issues regarding classification and prediction
– Classification techniques:
  • Case based reasoning (k-nearest neighbour algorithm)
  • Decision tree induction
  • Bayesian classification
  • Neural networks
  • Support vector machines (SVM)
  • Classification based on association rule mining concepts
  • Other classification methods
– Prediction
– Classification accuracy
Classification

Classification:
– Predicts categorical class labels

Typical applications:
– {CreditHistory, Salary} -> CreditApproval (Yes/No)
– {Temp, Humidity} -> Rain (Yes/No)

Mathematically:

  h : X -> Y, where x ∈ X = {0,1}^n and y ∈ Y = {0,1}, and y = h(x)
Linear Classification

Binary classification problem

The data above the red line belongs to class 'x'

The data below the red line belongs to class 'o'

Examples: SVM, Perceptron, Probabilistic Classifiers

[Figure: scatter plot of 'x' and 'o' points separated by a red line]
Discriminative Classifiers

Advantages:
– Prediction accuracy is generally high
– Robust: works even when training examples contain errors
– Fast evaluation of the learned target function

Criticisms:
– Long training time
– Difficult to understand the learned function (weights)
– Not easy to incorporate domain knowledge
Artificial Neural Networks

A biologically inspired classification technique

Formed from interconnected layers of simple artificial neurons

ANN history:
– 1943: McCulloch & Pitts
– 1959: Rosenblatt (Perceptron)
– 1959: Widrow & Hoff (ADALINE and MADALINE)
– 1969: Marvin Minsky and Seymour Papert
– 1974: Werbos (Backprop)
– 1982: John Hopfield
An Artificial Neuron

The n-dimensional input vector x is mapped into variable y by means of the scalar product and a nonlinear function mapping

[Figure: neuron with inputs x0, x1, …, xn, weights w0, w1, …, wn, a bias, and output f(x)]

  f(x) = thresh(bias + Σ_{i=0}^{n} w_i x_i)
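The formula above can be sketched directly in code: a weighted sum of the inputs plus a bias, passed through a threshold activation. The function names and the example weights below are illustrative choices, not something from the slides.

```python
def thresh(v):
    """Step activation: fire (1) if the net input is positive, else 0."""
    return 1 if v > 0 else 0

def neuron_output(weights, bias, inputs):
    """f(x) = thresh(bias + sum_i w_i * x_i)"""
    net = bias + sum(w * x for w, x in zip(weights, inputs))
    return thresh(net)

# Example: with these weights the neuron acts like a logical AND
weights = [1.0, 1.0]
bias = -1.5
print(neuron_output(weights, bias, [1, 1]))  # -> 1
print(neuron_output(weights, bias, [1, 0]))  # -> 0
```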
ANN: Multi-Layer Perceptrons (MLPs)

Multi-Layer Perceptrons (MLPs) are one of the best known ANN types

Composed of layers of fully interconnected artificial neurons

Training involves repeatedly presenting a series of training cases to the network and adjusting neurons' weights and biases to minimise classification error

Typically the backpropagation of error algorithm is used for training
MLP Example

Remember our surfing example

An MLP can be built and trained to perform classification for this problem

[Figure: MLP with an input layer (Wind Speed, Wind Direction, Temperature, Wave Size, Wave Period), a hidden layer, and an output layer (Good Surf)]
Network Training

The ultimate objective of training:
– Obtain a set of weights that makes almost all of the tuples in the training data classified correctly

Steps:
– Initialize weights with random values
– Feed the input tuples into the network one by one
– For each unit:
  • Compute the net input to the unit as a linear combination of all the inputs to the unit
  • Compute the output value using the activation function
  • Compute the error
  • Update the weights and the bias
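The steps above can be sketched for a single unit using the simple perceptron update rule (a full multi-layer network would instead use backpropagation to distribute the error across layers). The data, learning rate, and seed are illustrative.

```python
import random

def train_unit(data, epochs=20, lr=0.1, seed=42):
    """data: list of (inputs, target) pairs with binary targets."""
    rng = random.Random(seed)
    n = len(data[0][0])
    # Initialize weights and bias with small random values
    w = [rng.uniform(-0.5, 0.5) for _ in range(n)]
    b = rng.uniform(-0.5, 0.5)
    for _ in range(epochs):
        for x, target in data:                                  # feed tuples one by one
            net = b + sum(wi * xi for wi, xi in zip(w, x))      # net input
            out = 1 if net > 0 else 0                           # activation function
            err = target - out                                  # error
            w = [wi + lr * err * xi for wi, xi in zip(w, x)]    # update the weights
            b = b + lr * err                                    # update the bias
    return w, b

# Learn logical OR (linearly separable, so the perceptron converges)
data = [([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 1)]
w, b = train_unit(data)
preds = [1 if b + sum(wi * xi for wi, xi in zip(w, x)) > 0 else 0
         for x, _ in data]
print(preds)  # -> [0, 1, 1, 1]
```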
Summary of ANN Classification

Strengths:
– Fast classification
– Very good generalization capacity

Weaknesses:
– No explanation capability (black box)
– Training can be slow (eager learning)
– Retraining is difficult

Lots of other network types, but MLP is probably the most common
Support Vector Machines (SVM)

In classification problems we try to create decision boundaries between classes

A choice must be made between possible boundaries

[Figure: two clusters of points, Class 1 and Class 2, with several candidate decision boundaries]
SVMs (cont…)

The decision boundary should be as far away from the data of both classes as possible

[Figure: Class 1 and Class 2 separated by a boundary with margin m on each side]
Margins

[Figure: two separating hyperplanes and their support vectors, one with a small margin and one with a large margin]
Linear Support Vector Machine

Given a set of points x_i ∈ R^n with labels y_i ∈ {-1, +1}

The SVM finds a hyperplane defined by the pair (w, b), where w is the normal to the plane and b is the distance from the origin, such that:

  y_i(w · x_i + b) ≥ 1,  i = 1, …, N

Where:
• x – feature vector
• b – bias
• y – class label
• ||w|| – margin
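The constraint above is easy to check numerically. The toy points and hyperplane below are made-up values for illustration, not output of a trained SVM:

```python
def margin_constraint(w, b, x, y):
    """Return y * (w . x + b); it must be >= 1 for every training point."""
    return y * (sum(wi * xi for wi, xi in zip(w, x)) + b)

# A hand-picked separating hyperplane and four toy points
w, b = (1.0, 1.0), -3.0
data = [((3, 1), +1), ((3, 3), +1), ((1, 1), -1), ((0, 1), -1)]

for x, y in data:
    print(x, y, margin_constraint(w, b, x, y) >= 1)  # True for all four
```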
SVMs: The Clever Bit!

What about when classes are not linearly separable?

Kernel functions and the kernel trick are used to transform data into a different, linearly separable feature space

[Figure: a mapping Φ from input space, where the classes are not linearly separable, into a feature space where they are]
SVMs: The Clever Bit! (cont…)

What if the data is not linearly separable?

Project the data to a high dimensional space where it is linearly separable and then use a linear SVM (using kernels)

[Figure: 1-D and 2-D examples of non-linearly-separable data, including the points (0,0), (0,1) and (1,0), which become linearly separable after projection]
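The projection idea can be sketched on the classic XOR labelling: the four corner points cannot be separated by a line in 2-D, but adding the product feature x1*x2 lifts them into 3-D, where a plane separates them. The particular mapping and the hand-picked plane (w, b) below are illustrative choices, not the kernel or hyperplane an SVM would actually learn.

```python
def phi(x1, x2):
    """Map a 2-D point into 3-D feature space by adding the product term."""
    return (x1, x2, x1 * x2)

# XOR labelling: (0,0) and (1,1) in one class, (0,1) and (1,0) in the other
points = [((0, 0), +1), ((1, 1), +1), ((0, 1), -1), ((1, 0), -1)]

# A separating plane in feature space: sign(w . phi(x) + b)
w = (-1.0, -1.0, 3.0)
b = 0.5

def classify(x1, x2):
    f = phi(x1, x2)
    score = b + sum(wi * fi for wi, fi in zip(w, f))
    return +1 if score > 0 else -1

print([classify(*p) == y for p, y in points])  # -> [True, True, True, True]
```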
SVM Example

Example of Non-linear SVM
SVM Example (cont…)

Results
Summary of SVM Classification

Strengths:
– Over-fitting is not common
– Works well with high dimensional data
– Fast classification
– Good generalization capacity

Weaknesses:
– Retraining is difficult
– No explanation capability
– Slow training

At the cutting edge of machine learning
SVM vs. ANN

SVM:
– Relatively new concept
– Nice generalization properties
– Hard to learn – learned in batch mode using quadratic programming techniques
– Using kernels can learn very complex functions

ANN:
– Quite old
– Generalizes well but doesn't have a strong mathematical foundation
– Can easily be learned in incremental fashion
– To learn complex functions use a multilayer perceptron (not that trivial)
SVM Related Links

http://svm.dcs.rhbnc.ac.uk/

http://www.kernel-machines.org/

C. J. C. Burges. A Tutorial on Support Vector Machines for Pattern Recognition. Knowledge Discovery and Data Mining, 2(2), 1998.

SVMlight – software (in C): http://ais.gmd.de/~thorsten/svm_light

BOOK: An Introduction to Support Vector Machines. N. Cristianini and J. Shawe-Taylor. Cambridge University Press.
Association-Based Classification

Several methods for association-based classification:
– ARCS: Quantitative association mining and clustering of association rules (Lent et al ’97)
  • It beats C4.5 in (mainly) scalability and also accuracy
– Associative classification (Liu et al ’98)
  • It mines high support and high confidence rules in the form of "cond_set => y", where y is a class label
– CAEP: Classification by aggregating emerging patterns (Dong et al ’99)
  • Emerging patterns (EPs): itemsets whose support increases significantly from one class to another
  • Mine EPs based on minimum support and growth rate
What Is Prediction?

Prediction is similar to classification:
– First, construct a model
– Second, use the model to predict unknown values
  • The major method for prediction is regression
    – Linear and multiple regression
    – Non-linear regression

Prediction is different from classification:
– Classification predicts categorical class labels
– Prediction models continuous-valued functions
Regression Analysis and Log-Linear Models in Prediction

Linear regression: Y = α + βX
– Two parameters, α and β, specify the line and are to be estimated using the data at hand
– Use the least squares criterion on the known values of Y1, Y2, …, X1, X2, …

Multiple regression: Y = b0 + b1X1 + b2X2
– Many nonlinear functions can be transformed into the above

Log-linear models:
– The multi-way table of joint probabilities is approximated by a product of lower-order tables
– Probability: p(a, b, c, d) = αab βac χad δbcd
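The least squares estimates of α and β have a simple closed form, which can be sketched as follows (the data points are made up so the true line is recovered exactly):

```python
def fit_line(xs, ys):
    """Least squares: beta = cov(x, y) / var(x), alpha = ybar - beta * xbar."""
    n = len(xs)
    xbar = sum(xs) / n
    ybar = sum(ys) / n
    beta = (sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
            / sum((x - xbar) ** 2 for x in xs))
    alpha = ybar - beta * xbar
    return alpha, beta

# Points lying exactly on the line Y = 2 + 3X are recovered exactly
xs = [0, 1, 2, 3]
ys = [2, 5, 8, 11]
alpha, beta = fit_line(xs, ys)
print(alpha, beta)  # -> 2.0 3.0
```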
Prediction: Numerical Data

[Figure]
Prediction: Categorical Data

[Figure]
Concerns Over Classification Techniques

When choosing a technique for a specific classification problem we must consider the following issues:
– Classification accuracy
– Training speed
– Classification speed
– Danger of over-fitting
– Generalisation capacity
– Implications for retraining
– Explanation capability
Evaluating Classification Accuracy

During development, and in testing before deploying a classifier in the wild, we need to be able to quantify the performance of the classifier:
– How accurate is the classifier?
– When the classifier is wrong, how is it wrong?

This is useful for deciding which classifier (and which parameters) to use, and for estimating what the performance of the system will be
Evaluating Classifiers (cont…)

How we do this depends on how much data is available

If there is unlimited data available then there is no problem

Usually we have less data than we would like, so we have to compromise:
– Use hold-out testing sets
– Cross validation
  • K-fold cross validation
  • Leave-one-out validation
– Parallel live test
Hold-Out Testing Sets

Split the available data into a training set and a test set

Train the classifier on the training set and evaluate based on the test set

A couple of drawbacks:
– We may not have enough data
– We may happen upon an unfortunate split

[Figure: the total number of available examples divided into a training set and a test set]
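A minimal sketch of a hold-out split: shuffle the available examples once, then reserve a fraction for testing. The test fraction and seed below are arbitrary choices.

```python
import random

def holdout_split(examples, test_fraction=0.25, seed=0):
    shuffled = examples[:]
    random.Random(seed).shuffle(shuffled)  # guard against an ordered data set
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]  # (training set, test set)

examples = list(range(20))
train, test = holdout_split(examples)
print(len(train), len(test))  # -> 15 5
```

Note that a single split can still be unfortunate, which is exactly the drawback cross validation addresses.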
K-Fold Cross Validation

Divide the entire data set into k folds

For each of k experiments, use the kth fold for testing and everything else for training

[Figure: the available examples divided into k = 4 folds, with a different fold (K = 0, 1, 2, 3) used as the test set in each experiment]
K-Fold Cross Validation (cont…)

The accuracy of the system is calculated as the average accuracy across the k folds

The main advantages of k-fold cross validation are that every example is used in testing at some stage and the problem of an unfortunate split is avoided

Any value can be used for k:
– 10 is most common
– Depends on the data set
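The procedure can be sketched as follows. Here `evaluate` is a stand-in for training and testing a real classifier; every example lands in exactly one test fold, and the score is averaged over the k experiments.

```python
def k_fold_indices(n, k):
    """Split indices 0..n-1 into k contiguous folds of (nearly) equal size."""
    folds, start = [], 0
    for i in range(k):
        size = n // k + (1 if i < n % k else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def cross_validate(examples, k, evaluate):
    scores = []
    for fold in k_fold_indices(len(examples), k):
        fold_set = set(fold)
        test = [examples[i] for i in fold]
        train = [e for i, e in enumerate(examples) if i not in fold_set]
        scores.append(evaluate(train, test))
    return sum(scores) / k  # average score across the k folds

# Dummy evaluation: report the test-fold size as a fraction of the data
examples = list(range(10))
avg = cross_validate(examples, 5, lambda tr, te: len(te) / 10)
print(avg)  # -> 0.2
```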
Leave-One-Out Cross Validation

An extreme case of k-fold cross validation

With N data examples, perform N experiments with N-1 training cases and 1 test case

[Figure: N experiments (K = 0, 1, 2, …, N), each holding out a single example as the test case]
Classifier Accuracy

The accuracy of a classifier on a given test set is the percentage of test set tuples that are correctly classified by the classifier
– Often also referred to as the recognition rate
– Error rate (or misclassification rate) is the opposite of accuracy
False Positives Vs False Negatives

While it is useful to generate the simple accuracy of a classifier, sometimes we need more

When is the classifier wrong?
– False positives vs false negatives
– Related to type I and type II errors in statistics

Often there is a different cost associated with false positives and false negatives
– Think about diagnosing diseases
Confusion Matrix

A device used to illustrate how a classifier is performing in terms of false positives and false negatives

Gives us more information than a single accuracy figure

Allows us to think about the cost of mistakes

Can be extended to any number of classes

                            Classifier Result
                            Class A (yes)   Class B (no)
Expected   Class A (yes)    tp              fn
Result     Class B (no)     fp              tn
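Building the two-class matrix from expected labels and classifier results can be sketched as follows ("yes" is taken as the positive class; the labels are illustrative):

```python
def confusion_matrix(expected, predicted, positive="yes"):
    """Count true/false positives and negatives for a two-class problem."""
    counts = {"tp": 0, "fn": 0, "fp": 0, "tn": 0}
    for e, p in zip(expected, predicted):
        if e == positive:
            counts["tp" if p == positive else "fn"] += 1
        else:
            counts["fp" if p == positive else "tn"] += 1
    return counts

expected  = ["yes", "yes", "yes", "no", "no", "no"]
predicted = ["yes", "no",  "yes", "no", "yes", "no"]
print(confusion_matrix(expected, predicted))
# -> {'tp': 2, 'fn': 1, 'fp': 1, 'tn': 2}
```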
Other Accuracy Measures

Sometimes a simple accuracy measure is not enough:

  sensitivity = t_pos / pos

  specificity = t_neg / neg

  precision = t_pos / (t_pos + f_pos)
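The three measures above are straightforward to compute from the counts of true positives (t_pos), true negatives (t_neg) and false positives (f_pos), and the class totals pos and neg. The numbers in the example are made up:

```python
def sensitivity(t_pos, pos):
    return t_pos / pos              # recognition rate on the positive class

def specificity(t_neg, neg):
    return t_neg / neg              # recognition rate on the negative class

def precision(t_pos, f_pos):
    return t_pos / (t_pos + f_pos)  # how trustworthy a positive prediction is

# e.g. 90 positives (80 found) and 110 negatives (99 rejected, 11 false alarms)
print(sensitivity(80, 90))
print(specificity(99, 110))  # -> 0.9
print(precision(80, 11))
```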
ROC Curves

Receiver Operating Characteristic (ROC) curves were originally used to make sense of noisy radio signals

They can be used to help us talk about classifier performance and determine the best operating point for a classifier
ROC Curves (cont…)

[Figure: ROC curve with false positives on the x-axis and true positives on the y-axis, each ranging from 0 to 1.0]

For some great ROC curve examples have a look here

Consider how the relationship between true positives and false positives can change

We need to choose the best operating point
ROC Curves (cont…)

[Figure: ROC curves for two classifiers, with false positives on the x-axis and true positives on the y-axis]

ROC curves can be used to compare classifiers

The greater the area under the curve, the more accurate the classifier
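One way to compute the area under the curve can be sketched as follows: sweep a threshold over the classifier's scores, record (false positive rate, true positive rate) points, and integrate with the trapezoidal rule. The scores and labels are made up, and tied scores get no special handling here.

```python
def roc_points(scores, labels):
    """Return (fpr, tpr) points from the highest threshold to the lowest."""
    pos = sum(1 for y in labels if y == 1)
    neg = len(labels) - pos
    ranked = sorted(zip(scores, labels), reverse=True)
    points, tp, fp = [(0.0, 0.0)], 0, 0
    for _, y in ranked:
        if y == 1:
            tp += 1
        else:
            fp += 1
        points.append((fp / neg, tp / pos))
    return points

def auc(points):
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2  # trapezoid between adjacent points
    return area

# A classifier that ranks all positives above all negatives: AUC = 1.0
print(auc(roc_points([0.9, 0.8, 0.3, 0.2], [1, 1, 0, 0])))  # -> 1.0
```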
Over-Fitting

When we train a classifier we are trying to learn a function approximated by the training data we happen to use
– What if the training data doesn't cover the whole problem space?

We can learn the training data too closely, which hampers the ability to generalise

This problem is known as over-fitting

Depending on the type of classifier used there are different approaches to avoiding it
Ensembles

In order to improve classification accuracy we can aggregate the results of an ensemble of classifiers

[Figure: an input passed to Classifier 0, Classifier 1, …, Classifier n, whose outputs are combined by an aggregation step]
Bagging

Given a set S of s samples

Generate a bootstrap sample T from S
– Cases in S may not appear in T or may appear more than once

Repeat this sampling procedure, getting a sequence of k independent training sets

A corresponding sequence of classifiers C1, C2, …, Ck is constructed for each of these training sets, by using the same classification algorithm
Bagging (cont…)

To classify an unknown sample X, let each classifier predict or vote

The bagged classifier C* counts the votes and assigns X to the class with the "most" votes
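The bootstrap-and-vote machinery above can be sketched as follows. The "learner" here is a trivial stand-in that always predicts its training sample's majority class, so the point of the example is the sampling and voting, not the base classifier; the data and seed are arbitrary.

```python
import random
from collections import Counter

def bootstrap(S, rng):
    """Sample len(S) cases from S with replacement."""
    return [rng.choice(S) for _ in S]

def train_majority(sample):
    """Stand-in learner: always predict the sample's most common label."""
    label = Counter(y for _, y in sample).most_common(1)[0][0]
    return lambda x: label

def bagged_classifier(S, k, seed=0):
    rng = random.Random(seed)
    # k bootstrap samples, one classifier per sample, same algorithm each time
    classifiers = [train_majority(bootstrap(S, rng)) for _ in range(k)]
    def classify(x):
        votes = Counter(c(x) for c in classifiers)
        return votes.most_common(1)[0][0]  # class with the most votes
    return classify

# 10 examples of class "a" and 1 of class "b"
S = [((i,), "a") for i in range(10)] + [((10,), "b")]
C_star = bagged_classifier(S, k=11)
print(C_star((0,)))  # -> a
```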
Boosting Technique – Algorithm

Assign every example an equal weight 1/N

For t = 1, 2, …, T do:
– Obtain a hypothesis (classifier) h(t) under w(t)
– Calculate the error of h(t) and re-weight the examples based on the error. Each classifier is dependent on the previous ones. Samples that are incorrectly predicted are weighted more heavily
– Normalize w(t+1) to sum to 1 (the weights assigned to the examples sum to 1)

Output a weighted sum of all the hypotheses, with each hypothesis weighted according to its accuracy on the training set
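The loop above can be sketched in an AdaBoost-style scheme on toy 1-D data, using threshold "stumps" as the weak hypotheses. The data, the stump learner, and the number of rounds are all illustrative choices; no single stump fits this labelling, but the weighted combination does.

```python
import math

def best_stump(xs, ys, w):
    """Pick the threshold/direction with the lowest weighted error."""
    best = None
    for t in sorted(set(xs)):
        for sign in (+1, -1):
            preds = [sign if x >= t else -sign for x in xs]
            err = sum(wi for wi, p, y in zip(w, preds, ys) if p != y)
            if best is None or err < best[0]:
                best = (err, t, sign)
    return best

def boost(xs, ys, rounds=5):
    n = len(xs)
    w = [1.0 / n] * n                                # equal initial weights 1/N
    hypotheses = []
    for _ in range(rounds):
        err, t, sign = best_stump(xs, ys, w)         # hypothesis h(t) under w(t)
        err = max(err, 1e-10)
        alpha = 0.5 * math.log((1 - err) / err)      # weight by training accuracy
        preds = [sign if x >= t else -sign for x in xs]
        # Re-weight: incorrectly predicted samples are weighted more heavily
        w = [wi * math.exp(-alpha * p * y) for wi, p, y in zip(w, preds, ys)]
        total = sum(w)
        w = [wi / total for wi in w]                 # normalize w(t+1) to sum to 1
        hypotheses.append((alpha, t, sign))
    def classify(x):                                 # weighted sum of hypotheses
        score = sum(a * (s if x >= t else -s) for a, t, s in hypotheses)
        return 1 if score >= 0 else -1
    return classify

xs = [0, 1, 2, 3, 4, 5]
ys = [1, 1, -1, -1, 1, 1]   # no single stump can fit this pattern
h = boost(xs, ys, rounds=5)
print([h(x) for x in xs])  # -> [1, 1, -1, -1, 1, 1]
```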
Summary

Classification is an extensively studied problem
– Mainly in statistics and machine learning

Classification is probably one of the most widely used data mining techniques

Scalability is still an important issue for database applications

Research directions: classification of non-relational data, e.g. text, spatial, multimedia, etc.
Questions?