Data Mining: Concepts and Techniques
(3rd ed.)
— Chapter 8 —
Chapter 8. Classification: Basic Concepts
Classification: Basic Concepts
Decision Tree Induction
Bayes Classification Methods
Rule-Based Classification
Model Evaluation and Selection
Techniques to Improve Classification Accuracy:
Ensemble Methods
Summary
Model Evaluation and Selection
Evaluation metrics: How can we measure accuracy? Other
metrics to consider?
Use a validation/test set of class-labeled tuples, instead of the training set, when assessing accuracy
Methods for estimating a classifier’s accuracy:
Holdout method, random subsampling
Cross-validation
Bootstrap
Comparing classifiers:
Confidence intervals
Cost-benefit analysis and ROC Curves
Classifier Evaluation Metrics: Confusion Matrix
Given m classes, an entry CM_{i,j} in a confusion matrix indicates the # of tuples in class i that were labeled by the classifier as class j
May have extra rows/columns to provide totals

Confusion Matrix:

Actual class \ Predicted class    C1                      ¬C1
C1                                True Positives (TP)     False Negatives (FN)
¬C1                               False Positives (FP)    True Negatives (TN)

Example of Confusion Matrix:

Actual class \ Predicted class    buy_computer = yes    buy_computer = no    Total
buy_computer = yes                              6954                   46     7000
buy_computer = no                                412                 2588     3000
Total                                           7366                 2634    10000
Classifier Evaluation Metrics: Accuracy, Error Rate, Sensitivity and Specificity
Classifier Accuracy, or recognition rate: percentage of test set tuples that are correctly classified
Accuracy = (TP + TN)/All
Error rate: 1 – accuracy, or
Error rate = (FP + FN)/All
Class Imbalance Problem:
One class may be rare, e.g., fraud or HIV-positive cases
The negative class forms the significant majority, the positive class only a small minority
Sensitivity: True Positive recognition rate
Sensitivity = TP/P
Specificity: True Negative recognition rate
Specificity = TN/N
A \ P     C      ¬C
C         TP     FN     P
¬C        FP     TN     N
Total     P'     N'     All
Classifier Evaluation Metrics: Precision and Recall, and F-measures
Precision: exactness – what % of tuples that the classifier labeled as positive are actually positive
Precision = TP/(TP + FP)
Recall: completeness – what % of positive tuples did the classifier label as positive?
Recall = TP/(TP + FN) = TP/P
Perfect score is 1.0
Inverse relationship between precision & recall
F measure (F1 or F-score): harmonic mean of precision and recall
F = 2 × precision × recall / (precision + recall)
Fβ: weighted measure of precision and recall; assigns β times as much weight to recall as to precision
Fβ = (1 + β²) × precision × recall / (β² × precision + recall)
Classifier Evaluation Metrics: Example
Actual class \ Predicted class    cancer = yes    cancer = no    Total    Recognition (%)
cancer = yes                                90            210      300    30.00 (sensitivity)
cancer = no                                140           9560     9700    98.56 (specificity)
Total                                      230           9770    10000    96.50 (accuracy)

Precision = 90/230 = 39.13%    Recall = 90/300 = 30.00%
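To make the example concrete, here is a minimal Python sketch (not part of the original slides) that recomputes each metric directly from the four counts in the cancer confusion matrix above:

```python
# Counts taken from the cancer confusion matrix above.
TP, FN = 90, 210      # actual class: cancer = yes
FP, TN = 140, 9560    # actual class: cancer = no

P = TP + FN           # actual positives (300)
N = FP + TN           # actual negatives (9700)

accuracy    = (TP + TN) / (P + N)     # 0.9650
error_rate  = (FP + FN) / (P + N)     # 0.0350
sensitivity = TP / P                  # 0.30   (true positive rate / recall)
specificity = TN / N                  # 0.9856 (true negative rate)
precision   = TP / (TP + FP)          # 0.3913
recall      = sensitivity
f1          = 2 * precision * recall / (precision + recall)   # ~0.34

print(accuracy, sensitivity, specificity, precision, recall, f1)
```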
Evaluating Classifier Accuracy: Holdout & Cross-Validation Methods
Holdout method
Given data is randomly partitioned into two independent sets
Training set (e.g., 2/3) for model construction
Test set (e.g., 1/3) for accuracy estimation
Random subsampling: a variation of holdout
Repeat holdout k times; accuracy = avg. of the accuracies obtained
Cross-validation (k-fold, where k = 10 is most popular)
Randomly partition the data into k mutually exclusive subsets D1, …, Dk, each of approximately equal size
At the i-th iteration, use Di as the test set and the others as the training set
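As an illustration of the k-fold procedure, here is a minimal sketch (not from the slides); train_fn and predict_fn are placeholder hooks standing in for an arbitrary classifier:

```python
import random

def k_fold_cross_validation(data, labels, train_fn, predict_fn, k=10, seed=0):
    """Estimate classifier accuracy by k-fold cross-validation.

    train_fn(train_data, train_labels) -> model   (placeholder learner)
    predict_fn(model, tuple) -> predicted label
    """
    idx = list(range(len(data)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]     # k mutually exclusive, ~equal-size subsets

    accuracies = []
    for i in range(k):
        test_idx = folds[i]
        train_idx = [j for f in folds[:i] + folds[i + 1:] for j in f]
        model = train_fn([data[j] for j in train_idx], [labels[j] for j in train_idx])
        correct = sum(predict_fn(model, data[j]) == labels[j] for j in test_idx)
        accuracies.append(correct / len(test_idx))
    return sum(accuracies) / k                # average accuracy over the k iterations
```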
Evaluating Classifier Accuracy: Bootstrap
Bootstrap
Samples the given training tuples uniformly with replacement
i.e., each time a tuple is selected, it is equally likely to be selected
again and re-added to the training set
Works well with small data sets
Several bootstrap methods; a common one is the .632 bootstrap
A data set with d tuples is sampled d times, with replacement, resulting in a training set of d samples. The data tuples that did not make it into the training set end up forming the test set. About 63.2% of the original data end up in the bootstrap sample, and the remaining 36.8% form the test set (since (1 − 1/d)^d ≈ e^(−1) ≈ 0.368 for large d)
Repeat the sampling procedure k times; the overall accuracy of the model combines the test-set and training-set accuracies of each round:
Acc(M) = Σ_{i=1}^{k} ( 0.632 × Acc(Mi)_test_set + 0.368 × Acc(Mi)_train_set )
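A minimal sketch of one bootstrap round (hypothetical code, not from the slides), which also makes the 63.2% / 36.8% split easy to verify empirically:

```python
import random

def bootstrap_split(d, seed=0):
    """Sample d tuple indices with replacement for the training set; the
    indices never drawn form the test set (~36.8% of the data on average)."""
    rng = random.Random(seed)
    train_idx = [rng.randrange(d) for _ in range(d)]   # d draws, with replacement
    test_idx = [i for i in range(d) if i not in set(train_idx)]
    return train_idx, test_idx

train_idx, test_idx = bootstrap_split(10000)
print(len(set(train_idx)) / 10000)   # ≈ 0.632 distinct tuples in the bootstrap sample
print(len(test_idx) / 10000)         # ≈ 0.368 of the tuples form the test set
```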
Estimating Confidence Intervals: Classifier Models M1 vs. M2
Suppose we have 2 classifiers, M1 and M2; which one is better?
Use 10-fold cross-validation to obtain the mean error rates err(M1) and err(M2)
These mean error rates are just estimates of error on the true
population of future data cases
What if the difference between the 2 error rates is just
attributed to chance?
Use a test of statistical significance
Obtain confidence limits for our error estimates
Estimating Confidence Intervals: Null Hypothesis
Perform 10-fold cross-validation
Assume samples follow a t distribution with k–1 degrees of
freedom (here, k=10)
Use t-test (or Student’s t-test)
Null Hypothesis: M1 & M2 are the same
If we can reject null hypothesis, then
we conclude that the difference between M1 & M2 is
statistically significant
Choose the model with the lower error rate
Estimating Confidence Intervals: t-test
If only 1 test set available: pairwise comparison
For ith round of 10-fold cross-validation, the same cross partitioning is used to obtain err(M1)i and err(M2)i
Average over 10 rounds to get the mean error rates err(M1) and err(M2)
t-test computes the t-statistic with k−1 degrees of freedom:
t = ( err(M1) − err(M2) ) / sqrt( var(M1 − M2) / k )
where the variance of the per-round differences is
var(M1 − M2) = (1/k) Σ_{i=1}^{k} [ err(M1)_i − err(M2)_i − ( err(M1) − err(M2) ) ]²
with err(M1)_i, err(M2)_i the error rates in round i and err(M1), err(M2) the means over the k rounds
If two test sets available: use non-paired t-test
t = ( err(M1) − err(M2) ) / sqrt( var(M1)/k1 + var(M2)/k2 )
where k1 & k2 are the # of cross-validation samples used for M1 & M2, resp.
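A minimal sketch of the paired t-test computation (illustrative code with hypothetical per-fold error rates, not from the slides):

```python
import math

def paired_t_statistic(err1, err2):
    """t-statistic (k-1 degrees of freedom) for per-fold error rates of M1 and M2
    obtained with the same k-fold cross partitioning."""
    k = len(err1)
    diffs = [e1 - e2 for e1, e2 in zip(err1, err2)]
    mean_diff = sum(diffs) / k
    var = sum((d - mean_diff) ** 2 for d in diffs) / k   # variance of the fold differences
    return mean_diff / math.sqrt(var / k)

# Hypothetical per-fold error rates from 10-fold cross-validation:
err_m1 = [0.12, 0.10, 0.14, 0.11, 0.13, 0.12, 0.15, 0.10, 0.12, 0.11]
err_m2 = [0.10, 0.09, 0.12, 0.10, 0.11, 0.10, 0.13, 0.09, 0.11, 0.10]
t = paired_t_statistic(err_m1, err_m2)
print(t)   # compare |t| with the t-table entry for 9 degrees of freedom at sig/2 = 0.025
```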
Estimating Confidence Intervals: Table for t-distribution
Symmetric
Significance level, e.g., sig = 0.05 or 5% means M1 & M2 are significantly different for 95% of population
Confidence limit, z = sig/2
Estimating Confidence Intervals: Statistical Significance
Are M1 & M2 significantly different?
Compute t. Select significance level (e.g. sig = 5%)
Consult table for t-distribution: Find t value corresponding to k-1 degrees of freedom (here, 9)
t-distribution is symmetric: typically upper % points of distribution shown → look up value for confidence limit z=sig/2 (here, 0.025)
If t > z or t < −z, then the t value lies in the rejection region:
Reject null hypothesis that mean error rates of M1 & M2 are same
Conclude: statistically significant difference between M1 & M2
Otherwise, conclude that any difference is due to chance
Model Selection: ROC Curves
ROC (Receiver Operating Characteristics) curves: for visual comparison of classification models
Originated from signal detection theory
Shows the trade-off between the true positive rate and the false positive rate
The area under the ROC curve is a measure of the accuracy of the model
Rank the test tuples in decreasing order: the one that is most likely to belong to the positive class appears at the top of the list
The closer to the diagonal line (i.e., the closer the area is to 0.5), the less accurate is the model
Vertical axis represents the true positive rate
Horizontal axis represents the false positive rate
The plot also shows a diagonal line
A model with perfect accuracy will have an area of 1.0
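A minimal sketch of how the ROC points and the area under the curve can be computed from ranked test tuples (hypothetical scores and labels, not from the slides):

```python
def roc_curve(scores, labels):
    """Return the ROC points (FPR, TPR) obtained by sweeping the decision
    threshold down the tuples ranked by decreasing score, plus the AUC."""
    ranked = sorted(zip(scores, labels), key=lambda sl: -sl[0])
    P = sum(1 for _, y in ranked if y == 1)
    N = len(ranked) - P
    tp = fp = 0
    points = [(0.0, 0.0)]
    for _, y in ranked:
        if y == 1:
            tp += 1
        else:
            fp += 1
        points.append((fp / N, tp / P))
    auc = sum((x2 - x1) * (y1 + y2) / 2                 # trapezoidal rule
              for (x1, y1), (x2, y2) in zip(points, points[1:]))
    return points, auc

# Hypothetical probability scores for ten test tuples (1 = positive class):
scores = [0.95, 0.90, 0.80, 0.75, 0.60, 0.55, 0.50, 0.40, 0.30, 0.20]
labels = [1,    1,    0,    1,    1,    0,    0,    1,    0,    0]
_, auc = roc_curve(scores, labels)
print(auc)   # 1.0 = perfect ranking; 0.5 ≈ the diagonal line (random guessing)
```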
Issues Affecting Model Selection
Accuracy
classifier accuracy: predicting class label
Speed
time to construct the model (training time)
time to use the model (classification/prediction time)
Robustness: handling noise and missing values
Scalability: efficiency in disk-resident databases
Interpretability
understanding and insight provided by the model
Other measures, e.g., goodness of rules, such as decision tree size or compactness of classification rules
Chapter 8. Classification: Basic Concepts
Classification: Basic Concepts
Decision Tree Induction
Bayes Classification Methods
Rule-Based Classification
Model Evaluation and Selection
Techniques to Improve Classification Accuracy:
Ensemble Methods
Summary
Ensemble Methods: Increasing the Accuracy
Ensemble methods
Use a combination of models to increase accuracy
Combine a series of k learned models, M1, M2, …, Mk, with the aim of creating an improved model M*
Popular ensemble methods
Bagging: averaging the prediction over a collection of classifiers
Boosting: weighted vote with a collection of classifiers
Ensemble: combining a set of heterogeneous classifiers
Bagging: Bootstrap Aggregation
Analogy: Diagnosis based on multiple doctors’ majority vote
Training
Given a set D of d tuples, at each iteration i, a training set Di of d tuples is sampled with replacement from D (i.e., bootstrap)
A classifier model Mi is learned for each training set Di
Classification: classify an unknown sample X
Each classifier Mi returns its class prediction
The bagged classifier M* counts the votes and assigns the class with the most votes to X
Prediction: can be applied to the prediction of continuous values by taking the average value of each prediction for a given test tuple
Accuracy
Often significantly better than a single classifier derived from D
For noisy data: not considerably worse, more robust
Proven to give improved accuracy in prediction
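A minimal sketch of the bagging procedure described above (hypothetical code; train_fn and predict_fn are placeholders for an arbitrary base classifier):

```python
import random
from collections import Counter

def bagging(data, labels, train_fn, predict_fn, k=10, seed=0):
    """Learn k models on bootstrap samples of D and return a bagged classifier M*
    that predicts by majority vote."""
    rng = random.Random(seed)
    d = len(data)
    models = []
    for _ in range(k):
        idx = [rng.randrange(d) for _ in range(d)]          # bootstrap sample Di
        models.append(train_fn([data[j] for j in idx], [labels[j] for j in idx]))

    def bagged_predict(x):
        votes = Counter(predict_fn(m, x) for m in models)   # each Mi casts one vote
        return votes.most_common(1)[0][0]                   # class with the most votes
    return bagged_predict
```

For prediction of continuous values, the majority vote inside bagged_predict would simply be replaced by the average of the k predicted values.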
Boosting
Analogy: Consult several doctors, based on a combination of weighted diagnoses—weight assigned based on the previous diagnosis accuracy
How does boosting work?
Weights are assigned to each training tuple
A series of k classifiers is iteratively learned
After a classifier Mi is learned, the weights are updated to allow the subsequent classifier, Mi+1, to pay more attention to the training tuples that were misclassified by Mi
The final M* combines the votes of each individual classifier, where the weight of each classifier's vote is a function of its accuracy
Comparing with bagging: Boosting tends to have greater accuracy, but it also risks overfitting the model to misclassified data
AdaBoost (Freund and Schapire, 1997)
Given a set of d class-labeled tuples, (X1, y1), …, (Xd, yd)
Initially, all the weights of tuples are set the same (1/d)
Generate k classifiers in k rounds. At round i,
Tuples from D are sampled (with replacement) to form a training set Di of the same size
Each tuple’s chance of being selected is based on its weight
A classification model Mi is derived from Di
Its error rate is calculated using Di as a test set
If a tuple is misclassified, its weight is increased; otherwise it is decreased
Error rate: err(Xj) is the misclassification error of tuple Xj (1 if Xj is misclassified, 0 otherwise). Classifier Mi's error rate is the sum of the weights of the misclassified tuples:
error(Mi) = Σ_{j=1}^{d} wj × err(Xj)
The weight of classifier Mi's vote is
log( (1 − error(Mi)) / error(Mi) )
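A minimal sketch of the AdaBoost loop above (hypothetical code; train_fn and predict_fn are placeholders for an arbitrary weak learner):

```python
import math
import random

def adaboost(data, labels, train_fn, predict_fn, k=10, seed=0):
    """Simplified AdaBoost: k rounds of weighted sampling, weight updates,
    and a final classifier M* that combines weighted votes."""
    rng = random.Random(seed)
    d = len(data)
    w = [1.0 / d] * d                                      # all tuple weights start at 1/d
    models, alphas = [], []
    for _ in range(k):
        # Sample Di of size d with replacement; each tuple's chance is based on its weight.
        idx = rng.choices(range(d), weights=w, k=d)
        m = train_fn([data[j] for j in idx], [labels[j] for j in idx])
        miss = [predict_fn(m, data[j]) != labels[j] for j in range(d)]
        error = sum(wj for wj, mj in zip(w, miss) if mj)   # weighted error on D
        if error == 0 or error >= 0.5:                     # discard degenerate rounds
            continue
        models.append(m)
        alphas.append(math.log((1 - error) / error))       # weight of Mi's vote
        # Decrease weights of correctly classified tuples, then renormalize, so the
        # next classifier pays more attention to the tuples Mi misclassified.
        w = [wj if mj else wj * error / (1 - error) for wj, mj in zip(w, miss)]
        total = sum(w)
        w = [wj / total for wj in w]

    def boosted_predict(x):
        votes = {}
        for m, a in zip(models, alphas):
            c = predict_fn(m, x)
            votes[c] = votes.get(c, 0.0) + a               # weighted vote per class
        return max(votes, key=votes.get)
    return boosted_predict
```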
Classification of Class-Imbalanced Data Sets
Class-imbalance problem: Rare positive example but numerous negative ones, e.g., medical diagnosis, fraud, oil-spill, fault, etc.
Traditional methods assume a balanced distribution of classes and equal error costs: not suitable for class-imbalanced data
Typical methods for class-imbalanced data in 2-class classification:
Oversampling: re-sampling of data from positive class until there are an equal number of positive and negative tuples
Under-sampling: randomly eliminate tuples from negative class
Threshold-moving: moves the decision threshold, t, so that the rare class tuples are easier to classify, and hence, less chance of costly false negative errors
Still difficult for class imbalance problem on multiclass tasks
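A minimal sketch of random oversampling of the rare (positive) class (hypothetical code; it assumes the positive class is the minority):

```python
import random

def oversample_positive(data, labels, positive_label, seed=0):
    """Duplicate positive-class tuples (sampling with replacement) until the
    positive and negative classes are the same size."""
    rng = random.Random(seed)
    pos = [(x, y) for x, y in zip(data, labels) if y == positive_label]
    neg = [(x, y) for x, y in zip(data, labels) if y != positive_label]
    extra = [rng.choice(pos) for _ in range(len(neg) - len(pos))]   # assumes pos is the minority
    balanced = pos + extra + neg
    rng.shuffle(balanced)
    xs, ys = zip(*balanced)
    return list(xs), list(ys)
```

Under-sampling would instead randomly drop negative tuples, and threshold-moving leaves the data unchanged, only lowering the decision threshold t for the rare class.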
Chapter 8. Classification: Basic Concepts
Classification: Basic Concepts
Decision Tree Induction
Bayes Classification Methods
Rule-Based Classification
Model Evaluation and Selection
Techniques to Improve Classification Accuracy:
Ensemble Methods
Summary
Summary (I)
Classification is a form of data analysis that extracts models
describing important data classes.
Effective and scalable methods have been developed for decision
tree induction, Naive Bayesian classification, rule-based
classification, and many other classification methods.
Evaluation metrics include: accuracy, sensitivity, specificity,
precision, recall, F measure, and Fβ measure.
Stratified k-fold cross-validation is recommended for accuracy
estimation. Bagging and boosting can be used to increase overall
accuracy by learning and combining a series of individual models.
Summary (II)
Significance tests and ROC curves are useful for model selection.
There have been numerous comparisons of the different
classification methods; the matter remains a research topic
No single method has been found to be superior over all others
for all data sets
Issues such as accuracy, training time, robustness, scalability,
and interpretability must be considered and can involve trade-
offs, further complicating the quest for an overall superior
method