Spooky Stuff in Metric Space
Data Mining in Metric Space
Rich Caruana, Alex Niculescu
Cornell University
Motivation #1
Motivation #1: Pneumonia Risk Prediction
Motivation #1: Many Learning Algorithms
– Neural nets
– Logistic regression
– Linear perceptron
– K-nearest neighbor
– Decision trees
– ILP (Inductive Logic Programming)
– SVMs (Support Vector Machines)
– Bagging
– Boosting
– Rule learners (CN2, …)
– Ripper
– Random Forests (forests of decision trees)
– Gaussian Processes
– Bayes Nets
– …
No single learning method (or small set of methods) dominates the others.
Motivation #2
Motivation #2: SLAC B/Bbar
– Particle accelerator generates B/Bbar particles
– Use machine learning to classify tracks as B or Bbar
– Domain-specific performance measure: SLQ score
– A 5% increase in SLQ can save $1M in accelerator time

– SLAC researchers tried various DM/ML methods: good, but not great, SLQ performance
– We tried standard methods and got similar results
– We studied the SLQ metric:
  – it is similar to probability calibration
  – so we tried bagged probabilistic decision trees (good on C-Section)
Motivation #2: Bagged Probabilistic Trees
– Draw N bootstrap samples of the data
– Train a tree on each sample ==> N trees
– Final prediction = average prediction of the N trees
…
Average prediction = (0.23 + 0.19 + 0.34 + 0.22 + 0.26 + … + 0.31) / #Trees = 0.24
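As a rough illustration of this procedure (not the authors' original code), bagged probabilistic decision trees could be sketched with scikit-learn; X_train, y_train, X_test, and n_trees are placeholders:

```python
# Minimal sketch of bagged probabilistic decision trees (assumes scikit-learn).
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.utils import resample

def bagged_tree_probs(X_train, y_train, X_test, n_trees=100, seed=0):
    """Average the class-1 probabilities of n_trees trees, each grown on a bootstrap sample.
    Assumes both classes appear in every bootstrap sample."""
    rng = np.random.RandomState(seed)
    avg = np.zeros(len(X_test))
    for _ in range(n_trees):
        Xb, yb = resample(X_train, y_train, random_state=rng)  # draw one bootstrap sample
        tree = DecisionTreeClassifier(random_state=0).fit(Xb, yb)
        avg += tree.predict_proba(X_test)[:, 1]                # tree's probability of class 1
    return avg / n_trees                                       # final prediction = average
```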
Motivation #2: Improves Calibration by an Order of Magnitude
[Calibration plots: a single tree shows poor calibration; 100 bagged trees show excellent calibration]
Motivation #2: Significantly Improves SLQ
[Plot: SLQ score for 100 bagged trees vs. a single tree; bagging significantly improves SLQ]
Motivation #2
Can we automate this analysis of performance metrics so that it’s easier to recognize which metrics are similar to each other?
Motivation #3
Normalized scores by metric. Threshold metrics: Accuracy (ACC), F-Score (FSC), Lift (LFT). Rank/ordering metrics: ROC Area (AUC), Average Precision (APR), Break Even Point (BEP). Probability metrics: Squared Error (RMS), Cross-Entropy (MXE), Calibration (CAL).

Model      ACC     FSC     LFT     AUC     APR     BEP     RMS     MXE     CAL     SAR     Mean
SVM        0.8134  0.9092  0.9480  0.9621  0.9335  0.9377  0.8767  0.8778  0.9824  0.9055  0.9156
ANN        0.8769  0.8752  0.9487  0.9552  0.9167  0.9142  0.8532  0.8634  0.9881  0.8956  0.9102
BAG-DT     0.8114  0.8609  0.9465  0.9674  0.9416  0.9220  0.8588  0.8942  0.9744  0.9036  0.9086
BST-DT     0.8904  0.8986  0.9574  0.9778  0.9597  0.9427  0.6066  0.6107  0.9241  0.8710  0.8631
KNN        0.7557  0.8463  0.9095  0.9370  0.8847  0.8890  0.7612  0.7354  0.9843  0.8470  0.8559
DT         0.5261  0.7891  0.8503  0.8678  0.7674  0.7954  0.5564  0.6243  0.9647  0.7445  0.7491
BST-STMP   0.7319  0.7903  0.9046  0.9187  0.8610  0.8336  0.3038  0.2861  0.9410  0.6589  0.7303
Scary Stuff
In an ideal world:
– Learn a model that predicts the correct conditional probabilities (Bayes optimal)
– It yields optimal performance on any reasonable metric

In the real world:
– Finite data
– 0/1 targets instead of conditional probabilities
– Hard to learn this ideal model
– Don't have good metrics for recognizing the ideal model
– The ideal model isn't always needed

In practice:
– Do learning using many different metrics: ACC, AUC, CXE, RMS, …
– Each metric represents different tradeoffs
– Because of this, it is usually important to optimize to the appropriate metric
Scary Stuff
In this work we compare nine commonly used performance metrics by applying data mining to the results of a massive empirical study.
Goals:
– Discover relationships between performance metrics
– Are the metrics really that different?
– If you optimize to metric X, do you also get good performance on metric Y?
– If you need to optimize to metric Y, which metric X should you optimize to?
– Which metrics are more/less robust?
– Design new, better metrics?
10 Binary Classification Performance Metrics
Threshold Metrics:
– Accuracy
– F-Score
– Lift

Ordering/Ranking Metrics:
– ROC Area
– Average Precision
– Precision/Recall Break-Even Point

Probability Metrics:
– Root-Mean-Squared-Error
– Cross-Entropy
– Probability Calibration

SAR = ((1 - Squared Error) + Accuracy + ROC Area) / 3
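For concreteness, here is a minimal sketch of SAR (assuming scikit-learn, with Squared Error taken as RMS to match the probability metric above; all names are illustrative):

```python
# Minimal sketch of SAR = ((1 - RMS) + ACC + AUC) / 3, assuming scikit-learn.
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

def sar(y_true, p_pred, threshold=0.5):
    """y_true: 0/1 targets; p_pred: predicted probabilities in [0, 1]."""
    y_true = np.asarray(y_true); p_pred = np.asarray(p_pred)
    rms = np.sqrt(np.mean((y_true - p_pred) ** 2))                     # root-mean-squared error
    acc = accuracy_score(y_true, (p_pred >= threshold).astype(int))    # accuracy at a fixed threshold
    auc = roc_auc_score(y_true, p_pred)                                # ROC area
    return ((1.0 - rms) + acc + auc) / 3.0
```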
Accuracy

           Predicted 1     Predicted 0
True 1     a (correct)     b (incorrect)
True 0     c (incorrect)   d (correct)

accuracy = (a + d) / (a + b + c + d), computed at a fixed prediction threshold
Lift
– Not interested in accuracy on the entire dataset
– Want accurate predictions for 5%, 10%, or 20% of the dataset
– Don't care about the remaining 95%, 90%, 80%, respectively
– Typical application: marketing
– Measures how much better than random the predictions are on the fraction of the dataset predicted true (f(x) > threshold):

lift(threshold) = (% positives > threshold) / (% dataset > threshold)
Lift

           Predicted 1   Predicted 0
True 1     a             b
True 0     c             d

lift = ( a / (a + b) ) / ( (a + c) / (a + b + c + d) ), computed at a fixed threshold
lift = 3.5 if mailings sent to 20% of the customers
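A minimal sketch of lift at a fixed report fraction, matching the 20% mailing example above (numpy assumed; names illustrative):

```python
import numpy as np

def lift_at_fraction(y_true, scores, fraction=0.20):
    """lift = (% of positives in the top fraction) / (% of dataset in the top fraction)."""
    y = np.asarray(y_true); s = np.asarray(scores)
    k = int(np.ceil(fraction * len(y)))          # size of the predicted-true set
    top = np.argsort(-s)[:k]                     # indices of the k highest-scored cases
    pct_positives = y[top].sum() / y.sum()       # fraction of all positives captured
    return pct_positives / (k / len(y))          # divide by fraction of dataset selected
```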
Precision/Recall, F, Break-Even Point

PRECISION = a / (a + c)
RECALL = a / (a + b)
F = 2 × (PRECISION × RECALL) / (PRECISION + RECALL)   (harmonic average of precision and recall)
Break-Even Point: the point where PRECISION = RECALL

[Precision/recall plot: curves toward the upper right show better performance, toward the lower left worse performance]
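A tiny sketch of these quantities computed from the confusion-matrix counts a, b, c, d defined above (a hypothetical helper, not from the PERF code):

```python
def precision_recall_f(a, b, c, d):
    """a = true positives, b = false negatives, c = false positives, d = true negatives
    (d is unused; included only to mirror the slide's confusion matrix)."""
    precision = a / (a + c)
    recall = a / (a + b)
    f = 2 * precision * recall / (precision + recall)  # harmonic average of precision and recall
    return precision, recall, f
```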
The four cells of the confusion matrix go by many names:

           Predicted 1       Predicted 0
True 1     true positive     false negative
True 0     false positive    true negative

           Predicted 1       Predicted 0
True 1     hits              misses
True 0     false alarms      correct rejections

           Predicted 1       Predicted 0
True 1     P(pr1|tr1)        P(pr0|tr1)
True 0     P(pr1|tr0)        P(pr0|tr0)

           Predicted 1       Predicted 0
True 1     TP                FN
True 0     FP                TN
ROC Plot and ROC Area
– Receiver Operating Characteristic
– Developed in WWII to statistically model false positive and false negative detections of radar operators
– Better statistical foundations than most other measures
– Standard measure in medicine and biology
– Becoming more popular in ML

Sweep the threshold and plot:
– TPR vs. FPR
– Sensitivity vs. 1 − Specificity
– P(true | true) vs. P(true | false)
– Sensitivity = a/(a+b) = Recall = LIFT numerator
– 1 − Specificity = 1 − d/(c+d)

The diagonal line is random prediction.
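A minimal sketch of the threshold sweep that traces the ROC curve and integrates its area (numpy assumed; not the PERF implementation):

```python
import numpy as np

def roc_area(y_true, scores):
    """Sweep the threshold from high to low, collect (FPR, TPR) points, integrate by trapezoids."""
    y = np.asarray(y_true); s = np.asarray(scores)
    pos, neg = (y == 1).sum(), (y == 0).sum()
    fpr, tpr = [0.0], [0.0]                                # start above every score: (0, 0)
    for t in np.sort(np.unique(s))[::-1]:                  # high threshold -> low
        pred = s >= t
        fpr.append((pred & (y == 0)).sum() / neg)          # 1 - specificity
        tpr.append((pred & (y == 1)).sum() / pos)          # sensitivity / recall
    return sum((fpr[i + 1] - fpr[i]) * (tpr[i + 1] + tpr[i]) / 2
               for i in range(len(fpr) - 1))               # trapezoid rule
```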
Calibration

Good calibration:
– If 1000 x's have pred(x) = 0.2, ~200 of them should be positive
– Formally: for all x, prediction(x) = p(x)
Calibration
– A model can be accurate but poorly calibrated: a good threshold with uncalibrated probabilities
– A model can have good ROC but be poorly calibrated: ROC is insensitive to scaling/stretching; only the ordering has to be correct, not the probabilities themselves
– A model can have very high variance, but be well calibrated
– A model can be stupid, but be well calibrated
– Calibration is a real oddball
Measuring Calibration: the Bucket Method
In each bucket:
– measure the observed c-section rate
– measure the predicted c-section rate (average of the predicted probabilities)
– if the observed rate is similar to the predicted rate => good calibration in that bucket
[Diagram: predictions sorted into 10 equal-width buckets spanning 0.0–1.0, with bucket centers at 0.05, 0.15, …, 0.95]
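A small sketch of the bucket method under these assumptions (numpy arrays of 0/1 targets and predicted probabilities, 10 equal-width buckets):

```python
import numpy as np

def calibration_buckets(y_true, p_pred, n_buckets=10):
    """For each probability bucket, compare observed positive rate to mean predicted probability."""
    y = np.asarray(y_true); p = np.asarray(p_pred)
    edges = np.linspace(0.0, 1.0, n_buckets + 1)
    report = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (p >= lo) & ((p < hi) | (hi == 1.0))   # last bucket also includes p == 1.0
        if mask.any():
            report.append((lo, hi, y[mask].mean(), p[mask].mean()))
    return report  # good calibration: observed rate ~ predicted rate in every bucket
```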
Calibration Plot
Experiments
Base-Level Learning Methods
– Decision trees
– K-nearest neighbor
– Neural nets
– SVMs
– Bagged decision trees
– Boosted decision trees
– Boosted stumps

– Each optimizes different things
– Each is best in different regimes
– Each algorithm has many variations and free parameters
– We generate about 2000 models on each test problem
Data Sets
7 binary classification data sets:
– Adult
– Cover Type
– Letter.p1 (balanced)
– Letter.p2 (unbalanced)
– Pneumonia (University of Pittsburgh)
– Hyper Spectral (NASA Goddard Space Center)
– Particle Physics (Stanford Linear Accelerator)

4k train sets; large final test sets (usually 20k)
Massive Empirical Comparison
7 base-level learning methods
× 100's of parameter settings per method
≈ 2,000 models per problem
× 7 test problems
= 14,000 models
× 10 performance metrics
= 140,000 model performance evaluations
COVTYPE: Calibration vs. Accuracy
Multi Dimensional Scaling
        M1   M2   M3   M4   M5   M6   M7   ...  M14,000
ACC     -    -    -    -    -    -    -    ...  -
FSC     -    -    -    -    -    -    -    ...  -
LFT     -    -    -    -    -    -    -    ...  -
AUC     -    -    -    -    -    -    -    ...  -
APR     -    -    -    -    -    -    -    ...  -
BEP     -    -    -    -    -    -    -    ...  -
RMS     -    -    -    -    -    -    -    ...  -
MXE     -    -    -    -    -    -    -    ...  -
CAL     -    -    -    -    -    -    -    ...  -
SAR     -    -    -    -    -    -    -    ...  -
Scaling, Ranking, and Normalizing
Problem:
– for some metrics, 1.00 is best (e.g. ACC)
– for some metrics, 0.00 is best (e.g. RMS)
– for some metrics, the baseline is 0.50 (e.g. AUC)
– for some problems/metrics, 0.60 is excellent performance
– for some problems/metrics, 0.99 is poor performance

Solution 1: Normalized scores (see the sketch below):
– baseline performance => 0.00
– best observed performance => 1.00 (proxy for Bayes optimal)
– puts all metrics on an equal footing

Solution 2: Scale by standard deviation
Solution 3: Rank correlation
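A one-function sketch of Solution 1; baseline and best_observed are supplied per problem and metric:

```python
def normalized_score(perf, baseline, best_observed):
    """Map baseline performance to 0.0 and the best observed performance to 1.0.
    For loss-like metrics where lower is better (e.g. RMS), baseline > best_observed,
    and the same formula still maps better performance closer to 1.0."""
    return (perf - baseline) / (best_observed - baseline)
```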
Multi Dimensional Scaling
– Find a low-dimensional embedding of the 10 × 14,000 data
– The 10 metrics span a 2–5 dimensional subspace
Multi Dimensional Scaling
Look at 2-D MDS plots:
– scaled by standard deviation
– normalized scores
– MDS of rank correlations (sketched below)

MDS on each problem individually, and MDS averaged across all problems
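A minimal sketch of the rank-correlation variant, assuming scipy and scikit-learn and a 10 × 14,000 scores array (one row per metric):

```python
from scipy.stats import spearmanr
from sklearn.manifold import MDS

METRICS = ["ACC", "FSC", "LFT", "AUC", "APR", "BEP", "RMS", "MXE", "CAL", "SAR"]

def mds_embedding(scores):
    """scores: 10 x 14000 array of model performances, one row per metric.
    Distance between metrics = 1 - rank correlation; embed in 2-D with MDS."""
    rho = spearmanr(scores, axis=1)[0]      # 10 x 10 rank-correlation matrix
    dist = 1.0 - rho                        # highly correlated metrics sit close together
    coords = MDS(n_components=2, dissimilarity="precomputed",
                 random_state=0).fit_transform(dist)
    return dict(zip(METRICS, coords))
```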
2-D Multi-Dimensional Scaling
[2-D MDS plots under normalized-score scaling and rank-correlation distance, one panel per problem: Adult, Covertype, Hyper-Spectral, Letter, Medis, SLAC]
Correlation Analysis
– 2000 performances for each metric on each problem
– Correlation between all pairs of metrics: 10 metrics => 45 pairwise correlations
– Average the correlations over the 7 test problems
– Both standard correlation and rank correlation were computed; we present rank correlations here (see the sketch below)
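A sketch of the 45 pairwise rank correlations averaged over problems (scipy assumed; the data layout is illustrative):

```python
from itertools import combinations
from scipy.stats import spearmanr

def avg_rank_correlations(per_problem_scores, metric_names):
    """per_problem_scores: one dict per test problem, mapping metric name -> ~2000 model scores.
    Returns the rank correlation of every metric pair, averaged over the problems."""
    avg = {}
    for m1, m2 in combinations(metric_names, 2):        # 45 pairs for 10 metrics
        rhos = [spearmanr(p[m1], p[m2])[0] for p in per_problem_scores]
        avg[(m1, m2)] = sum(rhos) / len(rhos)           # average over the 7 problems
    return avg
```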
Rank Correlations

Metric  ACC   FSC   LFT   AUC   APR   BEP   RMS   MXE   CAL   SAR   Mean
ACC     1.00  0.87  0.85  0.88  0.89  0.93  0.87  0.75  0.56  0.92  0.852
FSC     0.87  1.00  0.77  0.81  0.82  0.87  0.79  0.69  0.50  0.84  0.796
LFT     0.85  0.77  1.00  0.96  0.91  0.89  0.82  0.73  0.47  0.92  0.832
AUC     0.88  0.81  0.96  1.00  0.95  0.92  0.85  0.77  0.51  0.96  0.861
APR     0.89  0.82  0.91  0.95  1.00  0.92  0.86  0.75  0.50  0.93  0.853
BEP     0.93  0.87  0.89  0.92  0.92  1.00  0.87  0.75  0.52  0.93  0.860
RMS     0.87  0.79  0.82  0.85  0.86  0.87  1.00  0.92  0.79  0.95  0.872
MXE     0.75  0.69  0.73  0.77  0.75  0.75  0.92  1.00  0.81  0.86  0.803
CAL     0.56  0.50  0.47  0.51  0.50  0.52  0.79  0.81  1.00  0.65  0.631
SAR     0.92  0.84  0.92  0.96  0.93  0.93  0.95  0.86  0.65  1.00  0.896
– The correlation analysis is consistent with the MDS analysis
– Ordering metrics have high correlations with each other
– ACC, AUC, and RMS have the best correlations within their metric classes
– RMS has good correlation to the other metrics
– SAR has the best correlation to the other metrics
Summary
– The 10 metrics span a 2–5 dimensional subspace
– Consistent results across problems and scalings
– Ordering metrics cluster: AUC ~ APR ~ BEP
– CAL is far from the ordering metrics; CAL is nearest to RMS/MXE
– RMS ~ MXE, but RMS is much more centrally located
– Threshold metrics ACC and FSC do not cluster as tightly as the ordering metrics and RMS/MXE
– Lift behaves more like an ordering metric than a threshold metric
– Old friends ACC, AUC, and RMS are the most representative
– The new SAR metric is good, but not much better than RMS
New Resources
Want to borrow 14,000 models?
– margin analysis
– comparison to new algorithm X
– …

PERF code: software that calculates ~2 dozen performance metrics:
– Accuracy (at different thresholds)
– ROC Area and ROC plots
– Precision and Recall plots
– Break-even point, F-score, Average Precision
– Squared Error
– Cross-Entropy
– Lift
– …
– Currently, most metrics are for boolean classification problems
– We are willing to add new metrics and new capabilities
– Available at: http://www.cs.cornell.edu/~caruana
Future Work
Future/Related Work
– Ensemble method that optimizes any metric (ICML'04)
– Getting good probabilities from boosted trees (AISTATS'05)
– Comparison of learning algorithms across metrics (ICML'06)
– This is a first step in analyzing different performance metrics

Develop new metrics with better properties:
– SAR is a good general-purpose metric
– Does optimizing to SAR yield better models?
– but RMS is nearly as good
– attempts to make SAR better did not help much

Extend to multi-class or hierarchical problems, where evaluating performance is more difficult
Thank You.
Spooky Stuff in Metric Space
Which learning methods perform best on each metric?
Normalized Scores of the Best Single Models. Threshold metrics: Accuracy (ACC), F-Score (FSC), Lift (LFT). Rank/ordering metrics: ROC Area (AUC), Average Precision (APR), Break Even Point (BEP). Probability metrics: Squared Error (RMS), Cross-Entropy (MXE), Calibration (CAL).

Model      ACC     FSC     LFT     AUC     APR     BEP     RMS     MXE     CAL     SAR     Mean
SVM        0.8134  0.9092  0.9480  0.9621  0.9335  0.9377  0.8767  0.8778  0.9824  0.9055  0.9156
ANN        0.8769  0.8752  0.9487  0.9552  0.9167  0.9142  0.8532  0.8634  0.9881  0.8956  0.9102
BAG-DT     0.8114  0.8609  0.9465  0.9674  0.9416  0.9220  0.8588  0.8942  0.9744  0.9036  0.9086
BST-DT     0.8904  0.8986  0.9574  0.9778  0.9597  0.9427  0.6066  0.6107  0.9241  0.8710  0.8631
KNN        0.7557  0.8463  0.9095  0.9370  0.8847  0.8890  0.7612  0.7354  0.9843  0.8470  0.8559
DT         0.5261  0.7891  0.8503  0.8678  0.7674  0.7954  0.5564  0.6243  0.9647  0.7445  0.7491
BST-STMP   0.7319  0.7903  0.9046  0.9187  0.8610  0.8336  0.3038  0.2861  0.9410  0.6589  0.7303
– SVM predictions transformed to posterior probabilities via Platt scaling
– SVM and ANN tied for first place; bagged trees nearly as good
– Boosted trees win 5 of 6 threshold and rank metrics, but yield lousy probs!
– Boosting weaker stumps does not compare to boosting full trees
– KNN and plain decision trees usually not competitive (with 4k train sets)
– Other interesting things; see the papers.
Platt Scaling
– SVM predictions: [-inf, +inf]
– Probability metrics require [0, 1]
– Platt scaling transforms SVM predictions by fitting a sigmoid
– This gives SVMs good probability performance
[Plot: sigmoid mapping raw SVM outputs in [-15, 15] to probabilities in [0, 1]]
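A minimal sketch of Platt scaling via logistic regression on the SVM margins (scikit-learn assumed; Platt's regularization of the 0/1 targets is omitted):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def platt_scale(train_margins, y_train, test_margins):
    """Fit p(y=1 | f) = 1 / (1 + exp(A*f + B)) on the training margins,
    then map test margins into [0, 1]."""
    lr = LogisticRegression(C=1e6)  # weak regularization: approximate the plain sigmoid fit
    lr.fit(np.asarray(train_margins).reshape(-1, 1), y_train)
    return lr.predict_proba(np.asarray(test_margins).reshape(-1, 1))[:, 1]
```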
Outline
– Motivation: The One True Model
– Ten Performance Metrics
– Experiments
– Multidimensional Scaling (MDS) Analysis
– Correlation Analysis
– Learning Algorithm vs. Metric
– Summary
Base-Level Learners
Each optimizes different things:
– ANN: minimize squared error or cross-entropy (good for probs)
– SVM, Boosting: optimize margin (good for accuracy, poor for probs)
– DT: optimize information gain
– KNN: ?

Each is best in different regimes:
– SVM: high-dimensional data
– DT, KNN: large data sets
– ANN: non-linear prediction from many correlated features

Each algorithm has many variations and free parameters:
– SVM: margin parameter, kernel, kernel parameters (gamma, …)
– ANN: # hidden units, # hidden layers, learning rate, early stopping point
– DT: splitting criterion, pruning options, smoothing options, …
– KNN: K, distance metric, distance-weighted averaging, …

We generate about 2000 models on each test problem.
Motivation
Holy Grail of Supervised Learning:
– One True Model (a.k.a. Bayes optimal model)
– Predicts the correct conditional probability for each case
– Yields optimal performance on all reasonable metrics
– Hard to learn given finite data: train sets rarely have conditional probabilities, usually just 0/1 targets
– Isn't always necessary

Many different performance metrics:
– ACC, AUC, CXE, RMS, PRE/REC, …
– Each represents different tradeoffs
– Usually important to optimize to the appropriate metric
– Not all metrics are created equal
Motivation
In an ideal world:
– Learn a model that predicts the correct conditional probabilities
– It yields optimal performance on any reasonable metric

In the real world:
– Finite data
– 0/1 targets instead of conditional probabilities
– Hard to learn this ideal model
– Don't have good metrics for recognizing the ideal model
– The ideal model isn't always necessary

In practice:
– Do learning using many different metrics: ACC, AUC, CXE, RMS, …
– Each metric represents different tradeoffs
– Because of this, it is usually important to optimize to the appropriate metric
Accuracy
– Target: 0/1, -1/+1, True/False, …
– Prediction = f(inputs) = f(x): 0/1 or real-valued
– Threshold: f(x) > thresh => 1, else => 0; threshold(f(x)): 0/1
– Accuracy = #right / #total = p("correct") = p(threshold(f(x)) = target)

accuracy = 1 − (1/N) · Σ_{i=1..N} (target_i − threshold(f(x_i)))²
Precision and Recall
– Typically used in document retrieval
– Precision: how many of the returned documents are correct; precision(threshold)
– Recall: how many of the positives the model returns; recall(threshold)
– Precision/Recall curve: sweep the thresholds (see the sketch below)
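A small sketch of that threshold sweep (numpy assumed; names illustrative):

```python
import numpy as np

def precision_recall_curve_points(y_true, scores):
    """Sweep the threshold over every distinct score, high to low,
    recording (threshold, precision, recall) at each step."""
    y = np.asarray(y_true); s = np.asarray(scores)
    points = []
    for t in np.sort(np.unique(s))[::-1]:
        pred = s >= t
        precision = y[pred].mean()          # fraction of returned cases that are positive
        recall = pred[y == 1].mean()        # fraction of positives that are returned
        points.append((t, precision, recall))
    return points
```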
Precision/Recall

           Predicted 1   Predicted 0
True 1     a             b
True 0     c             d

PRECISION = a / (a + c)
RECALL = a / (a + b)
(both computed at a fixed threshold)