
Page 1:

• Discriminant functions are trained on a finite set of data

• How much fitting should we do? What should the model's dimension be?

• The model must be used to identify a piece of evidence (data) it was not trained with.

• Accurate estimates of the decision model's error rates are critical in forensic science applications.

• The simplest estimate is the apparent error rate: the error rate on the training set

• Lousy estimate, but better than nothing

Decision Model Validation

Page 2:

• Cross-validation: systematically hold out chunks of the data set for testing

• Most common: hold-one-out (HOO-CV)

1. Omit a data vector from X

2. Train the model on the remaining data

3. Classify the held-out observation

4. Repeat for all data vectors

• Simple, but gives a good estimate

• Lots of literature to back up its efficacy

Decision Model Validation
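A minimal hold-one-out CV sketch in R, using MASS::lda on the built-in iris data as an assumed stand-in for whatever discriminant model and data are actually in play (the course's own scripts are not reproduced here):

```r
# Hold-one-out cross-validation sketch (assumes LDA + iris as stand-ins)
library(MASS)

X <- iris[, 1:4]
y <- iris$Species
pred <- character(nrow(X))
for (i in seq_len(nrow(X))) {
  fit <- lda(X[-i, ], grouping = y[-i])   # train without observation i
  pred[i] <- as.character(predict(fit, X[i, , drop = FALSE])$class)
}
hoo_error <- mean(pred != y)              # hold-one-out error estimate
```

Note the loop refits the model n times, which is where c-fold CV (next slide) can save time.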

Page 3:

• C-fold cross-validation: hold out data chunks of size c

• Can become time consuming

• Typically, performance is not much better than with simple HOO-CV

• Caution! If the decision model is sensitive to group sizes (e.g. CVA), cross-validation may not work well.

• You should have, at the very least, 5 replicates per group

Decision Model Validation

DON’T ARGUE WITH ME!!!!!!!!!!

Page 4:

• Bootstrap: make up data sets with randomly selected observation vectors (drawn with replacement)

• Each bootstrap sample is the same size as X

• You'll get repeats

1. Train a decision model with the bootstrapped set

• The model should not be sensitive to repeated observations!

• CVA is out!!!!

2. Test the model with the original X and compute the error err(X, t*_b), where t*_b denotes the decision rules built with the bootstrapped data set b

Decision Model Validation

Page 5:

3. Test the model with the bootstrapped data set X*_b and compute the error err(X*_b, t*_b)

4. Repeat steps 1-3 B times (B should be at least 200)

5. Compute the average "optimism":

$\bar{\omega} = \frac{1}{B}\sum_{b=1}^{B}\left[\mathrm{err}(X, t^*_b) - \mathrm{err}(X^*_b, t^*_b)\right]$

6. Compute the "refined" bootstrap error rate:

$\widehat{\mathrm{err}}_{\mathrm{refined}} = \overline{\mathrm{err}}_{\mathrm{apparent}} + \bar{\omega}$

Decision Model Validation

[Equation annotation: the error computations track the number of times each observation vector occurs in X*.]
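A hedged R sketch of steps 1-6, again with MASS::lda on iris (X and y from the earlier sketches) as an assumed stand-in; LDA, unlike CVA, tolerates repeated observations:

```r
# "Refined" bootstrap error rate sketch (Efron-style optimism correction)
set.seed(1)
B <- 200
optimism <- numeric(B)
fit_all  <- lda(X, grouping = y)
app_err  <- mean(predict(fit_all, X)$class != y)          # apparent error rate
for (b in seq_len(B)) {
  idx      <- sample(seq_len(nrow(X)), replace = TRUE)    # bootstrap sample X*
  fit      <- lda(X[idx, ], grouping = y[idx])            # 1. train on X*
  err_orig <- mean(predict(fit, X)$class != y)            # 2. error on original X
  err_boot <- mean(predict(fit, X[idx, ])$class != y[idx])# 3. error on X*
  optimism[b] <- err_orig - err_boot                      # per-draw optimism
}
refined_err <- app_err + mean(optimism)   # 6. "refined" bootstrap error rate
```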

*Now Exercise: Explore some data sets with: boostrap.R and cv_boot_testset.R

Page 6:

• t is a test for association between:

• x_unk, data from an unknown

• Could be from crime scene

• Could be from suspect

• A group of data from a source

• Could be from suspect

• Could be from crime scene

• ANY decision rule output by a pattern recognition program can be considered a test for association

Probabilities

Page 7:

• Codes:

• t+/- : test indicates inclusion/exclusion

• S+/- : Evidence is/is not associated with a source

• Four probabilities are of interest:

• Probability that a test yields a positive association given that there is truly an association between evidence and a source: Pr(t+ | S+)

• TPR is very important for forensic applications!

Probabilities

Pr(t+ | S+) = probability of a true positive (TP) = true positive rate (TPR) = probability of a true inclusion = sensitivity

Page 8:

• Probability that a test yields a positive association given that there is truly no association between evidence and a source: Pr(t+ | S-)

• FPR is very important for forensic applications!

• In traditional hypothesis testing, the FPR is sometimes called the significance level, α (the Type I error rate)

• 1-FPR = specificity (TNR): rate at which true exclusions are correctly excluded

Probabilities

Pr(t+ | S-) = probability of a false positive (FP) = false positive rate (FPR) = probability of a false inclusion

Page 9:

• Probability that a test yields a negative association given that there is truly no association between evidence and a source: Pr(t- | S-)

• TNR estimates may be the most useful (and trustworthy) numbers that come out of applications of probability to physical evidence...

Probabilities

Pr(t- | S-) = probability of a true negative (TN) = true negative rate (TNR) = probability of a true exclusion = specificity

Page 10:

• Probability that a test yields a negative association given that there is truly an association between evidence and a source: Pr(t- | S+)

• In traditional hypothesis testing, the FNR is sometimes called β (the Type II error rate)

• 1-FNR = sensitivity (TPR): rate at which true inclusions are correctly included

Probabilities

Pr(t- | S+) = probability of a false negative (FN) = false negative rate (FNR) = probability of a false exclusion

Page 11:

• Summary:

• 1 - FNR (i.e., the TPR) is called the test's power

• Remember, these are all only ESTIMATES!

                                   An association truly    An association truly
                                   exists, S+              does not exist, S-

Test indicates an inclusion, t+    True Positive Rate      False Positive Rate
                                                           (Type I error)

Test indicates an exclusion, t-    False Negative Rate     True Negative Rate
                                   (Type II error)

Probabilities
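To make the table concrete, here is a toy R sketch computing the four rates from an assumed 2x2 confusion matrix (the counts are invented for illustration):

```r
# Toy confusion matrix: rows = test result, columns = ground truth
tab <- matrix(c(45, 5,     # t+ row: TP, FP
                 5, 45),   # t- row: FN, TN
              nrow = 2, byrow = TRUE,
              dimnames = list(test = c("t+", "t-"), truth = c("S+", "S-")))

TPR <- tab["t+", "S+"] / sum(tab[, "S+"])   # sensitivity
FPR <- tab["t+", "S-"] / sum(tab[, "S-"])   # Type I error rate
TNR <- 1 - FPR                              # specificity
FNR <- 1 - TPR                              # Type II error rate
```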

Page 12:

Probabilities

• Much more difficult to objectively estimate, but of more interest in legal applications:

• Probability that an association exists given that a test indicates an association (also called the positive predictive value, PV+):

Pr(S+ | t+) = Pr(t+ | S+) Pr(S+) / [Pr(t+ | S+) Pr(S+) + Pr(t+ | S-) Pr(S-)]   (Bayes' Rule again)

Here Pr(S+) is the prior probability that there is an association between evidence and a source.

• Probability that no association exists given a test indicates an association: Pr(S- | t+) = 1 - Pr(S+ | t+)
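A one-line R illustration of the PV+ computation, using the TPR and FPR from the toy confusion matrix above and an assumed prior Pr(S+) = 0.01:

```r
# Positive predictive value via Bayes' rule; the prior is an assumed value
prior <- 0.01                                             # assumed Pr(S+)
PPV <- (TPR * prior) / (TPR * prior + FPR * (1 - prior))  # Pr(S+ | t+)
```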

Page 13:

Probabilities

• Dividing these two posterior probabilities, the Pr(t+) denominators cancel and we get the "famous" (positive) likelihood ratio, LR+:

LR+ = Pr(t+ | S+) / Pr(t+ | S-)

• LR+ can be expressed through the odds form of Bayes' Rule:

Pr(S+ | t+) / Pr(S- | t+) = LR+ x [Pr(S+) / Pr(S-)]

(posterior odds in favor of association given the test indicates inclusion = likelihood ratio x prior odds in favor of association)
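Continuing the toy numbers from above, the odds form can be checked directly in R (TPR, FPR and prior as previously defined):

```r
LR_plus    <- TPR / FPR                  # LR+ = Pr(t+|S+) / Pr(t+|S-)
prior_odds <- prior / (1 - prior)        # Pr(S+) / Pr(S-)
post_odds  <- LR_plus * prior_odds       # odds form of Bayes' rule
post_odds / (1 + post_odds)              # recovers Pr(S+ | t+), i.e. PPV
```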

Page 14:

Probabilities

• LR+ interpretations:

• Ratio of the probability the test indicates inclusion given a true association vs. the probability the test indicates inclusion given a true exclusion

• LR+ serves as a multiplier for the prior odds in favor of an association

• LR+ gives the relative change in the odds of same-source origin given a positive test result

Page 15:

Probabilities

• Note: in building a decision model, TPR, TNR, FPR, FNR and LR+ are computed on a per-group basis

• There is no overall TPR, TNR, FPR, FNR or LR+!

• Value comes into forensic science if one of the groups is a known suspect or crime-scene group, AND:

• Unknowns are tested against the suspect/crime-scene group

• Confidence measures in the results are: TPR, FPR and LR+ computed on the suspect/crime-scene group

Page 16:

Probabilities

• How can these be used/stated in court?

• A striation pattern is found at a crime scene (CS):

• Same class characteristics as the CS mark

• Subclass characteristics eliminated from the data

• Many striation patterns are generated by a tool associated with a suspect (SP)

• Include the SP set in a database (DB) and compute/test the discrimination model

• Get TP, FP and LR+ for SP with respect to the DB

• I.D. the CS pattern with the discrimination model

• The result is an inclusion or an exclusion

• TP, FP and LR+ for SP apply to the result

• State these in court along with the size of the DB

Page 17:

Receiver Operating Characteristic

• In general, a classification rule t applied to a data point x yields a score, t(x)

• For two groups, consider the two score distributions

• The two groups can be right vs. wrong, pos vs. neg, assoc. vs. no assoc., one vs. rest, one vs. one, etc.

[Figure: two overlapping score distributions along the score axis, with an adjustable cut-off score.]

Page 18:

Receiver Operating Characteristic

• The cut-off score is adjustable: different choices give different TPR and FPR

• The cut-off is related to the prior

• Changing the cut-off traces out a curve on a graph of TPR vs. FPR: the ROC curve

[Figure: ROC curve plotted as TPR vs. FPR on the unit square, with the "chance" diagonal; the area under the curve (AUC) equals the scaled Mann-Whitney U statistic.]

*Now Exercise: Source roc_utilities.R and play with roc.R for PLS-DA
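Separate from the course's roc.R, here is a self-contained R sketch that traces out an ROC curve from two simulated score distributions and checks the AUC = Mann-Whitney identity (the data are invented stand-ins):

```r
# ROC curve by sweeping the cut-off over two simulated score distributions
set.seed(2)
s_pos <- rnorm(100, mean = 1)    # scores when an association truly exists
s_neg <- rnorm(100, mean = 0)    # scores when it truly does not
cuts  <- sort(unique(c(s_pos, s_neg)), decreasing = TRUE)
TPRs  <- sapply(cuts, function(k) mean(s_pos >= k))
FPRs  <- sapply(cuts, function(k) mean(s_neg >= k))
plot(c(0, FPRs, 1), c(0, TPRs, 1), type = "l",
     xlab = "FPR", ylab = "TPR")         # ROC curve
abline(0, 1, lty = 2)                    # "chance" diagonal
AUC <- mean(outer(s_pos, s_neg, ">"))    # P(pos score > neg score)
```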

Page 19:

Receiver Operating Characteristic

• "Chance" diagonal: if your ROC curve looks like this

• The score distributions for the two groups are right on top of each other

• There is a 50/50 chance of assigning an unknown to the correct group.

• Area under the curve (AUC): the probability that a randomly chosen member of one group scores higher than a randomly chosen member of the other group

• AUC range = 0 to 1 (really 0.5 to 1)*

• Gini coefficient: degree of "inequality" of the ROC curve relative to the chance diagonal, = 2·AUC - 1

Page 20:

How good of a "match" is it? Conformal Prediction [Vovk]

• Data should be IID, but that's it

[Figure: cumulative # of errors vs. the sequence of unknown observation vectors. Error-count lines have slope 0.2 at 80% confidence (20% error), slope 0.05 at 95% confidence (5% error), and slope 0.01 at 99% confidence (1% error).]

• Can give a judge or jury an easy-to-understand measure of the reliability of a classification result

• This is an orthodox "frequentist" approach

• Roots in Algorithmic Information Theory

• Confidence on a scale of 0%-100%

• Testable claim: the long-run I.D. error rate should equal the chosen significance level

Page 21:

How Conformal Prediction works for us [Vovk]

• Given a "bag" of observations with known identities and one observation of unknown identity

• Estimate how “wrong” labelings are for each observation with a non-conformity score (“wrong-iness”)

• Looking at the "wrong-iness" of known observations in the bag:

• Does labeling i for the unknown have an unusual amount of "wrong-iness"?

• For us, one-vs-one SVMs:

• If not:

• p_{ID_i} ≥ the chosen level of significance, α

• Put ID_i in the (1 - α)·100% confidence interval
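A minimal R sketch of the conformal p-value for one candidate labeling; the nonconformity scores are assumed to come from elsewhere (e.g., one-vs-one SVM decision values), which is not shown:

```r
# Conformal p-value: how unusual is the unknown's "wrong-iness" relative
# to the bag? The +1 terms give the usual finite-sample validity.
conformal_p <- function(nc_bag, nc_unk) {
  (sum(nc_bag >= nc_unk) + 1) / (length(nc_bag) + 1)
}

# If conformal_p(nc_bag, nc_unk) >= alpha for labeling i, put ID_i in
# the (1 - alpha)*100% confidence set.
```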

Page 22:

Conformal Prediction

Theoretical (Long Run) Error Rate: 5%

Empirical Error Rate: 5.3%

14D PCA-SVM decision model for screwdriver striation patterns

• For a 95%-CPT (PCA-SVM), confidence intervals will not contain the correct I.D. 5% of the time in the long run

• A straight-forward validation/explanation picture for court

Page 23:

Conformal Prediction Drawbacks

• CPT is an interval method

• It can (and does) produce multi-label I.D. intervals

• An interval containing all labels is always "correct"

• This doesn't happen often in practice…

• Empty intervals count as "errors"

• Well…, what if the "correct" answer isn't in the database?

• This is an "open-set" problem, which Champod, Gantz and Saunders have pointed out

• Must be run in “on-line” mode for LRG

• In practice, we noticed it can be run in "off-line" mode after 500+ I.D. attempts

Page 24:

How good of a "match" is it? Empirical Bayes' [Efron]

• An I.D. is output for each questioned toolmark

• This is a computer "match"

• What's the probability it is truly not a "match"?

• Similar problem in genomics for detecting disease from microarray data

• They use data and Bayes' theorem to get an estimate

• "No disease" (genomics) = not a true "match" (toolmarks)

Page 25:

Random Match Probability

[Figure: the distribution of nDs from fragments at the crime scene overlaid on the distribution of nDs from fragments in the population. The interval covering 99% of the nDs from crime-scene fragments defines the RMP "window"; the shaded area is the probability that a random fragment from the population would be ID'd as a crime-scene fragment.]

Page 26:

Random Match Probability

• Example: RMP ≈ 0.26 + 0.14 + 0.06 = 0.46, i.e. about 46%

[Figure: distribution of nDs from glass fragments at the crime scene overlaid on the distribution of nDs from glass fragments in the population.]

Page 27:

Random Match Probability

• Problems with random match probability computations:

• To get reliable probabilities, you need accurate probability density functions (pdfs)

• Higher-dimensional pdfs require exponential amounts of data to fit accurately (the curse of dimensionality)

• Overlap in higher dimensions??

• How wide should RMP “windows” be?

• Use distributions for univariate “similarity” measures?

• Different measures correspond to different RMPs!

• No natural choice!

Page 28:

Empirical Bayes'

• We use Efron's machinery for the "empirical Bayes' two-groups model" [Efron]

• Surprisingly simple!

• Use binned data to do a Poisson regression

• Some notation:

• S-, truly no association, Null hypothesis

• S+, truly an association, Non-null hypothesis

• z, a score derived from a machine-learning task to I.D. an unknown pattern with a group

• z is a Gaussian random variate under the Null

Page 29:

Empirical Bayes'

• From Bayes' Theorem we can get [Efron]:

Pr(S- | z) = Pr(S-) f0(z) / f(z)

where f0 is the Null (KNM) z-density, f is the "mixture" density of all z-scores, and Pr(S-) is the prior probability of no association. This is the estimated probability of not a true "match" given the algorithm's output z-score associated with its "match".

• Names: posterior error probability (PEP) [Käll]; local false discovery rate (lfdr) [Efron]

• Suggested interpretation for casework:

• We agree with Gelman and Shalizi [Gelman]:

1 - Pr(S- | z) = estimated "believability" of a machine-made association

"…posterior model probabilities …[are]… useful as tools for prediction and for understanding structure in data, as long as these probabilities are not taken too seriously."

Page 30:

Empirical Bayes'

• Bootstrap procedure to get an estimate of the KNM distribution of "Platt scores" [Platt, e1071]

• Use a "Training" set, with an SVM, to get the KM and KNM "Platt-score" distributions

• Use this to get p-values/z-values on a "Validation" set

• Inspired by Storey and Tibshirani's Null estimation method [Storey]

[Figure: histogram of z-scores fit by Efron's method. From the fit we get the "mixture" density f(z), the z-density given KNM (the Null, which should be Gaussian), and an estimate of the prior for KNM. What's the point? We can test the fits to the Null density and the Null prior.]

Page 31:

[Worked example: a table of Platt scores for 10 observations; starred rows mark observations drawn into the bootstrap sample. Successive panels toss the KM scores and the in-sample observations, then randomly select one KNM score from each remaining observation.]

Bootstrap algorithm to estimate the KNM distribution (the Null):

1. Draw a bootstrap sample; train an SVM; get Platt scores on the whole set
2. Toss the KM Platt scores
3. Toss the observations in the bootstrap sample
4. Randomly select a KNM score from each remaining observation and collect it
5. Repeat
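A hedged R sketch of this bootstrap, using e1071::svm with probability = TRUE for Platt-type scores; iris is again an assumed stand-in data set:

```r
# Bootstrap estimate of the KNM (Null) Platt-score distribution
library(e1071)

set.seed(4)
X <- iris[, 1:4]; y <- iris$Species
knm_scores <- c()
for (b in 1:200) {                                     # 5. repeat
  idx <- sample(seq_len(nrow(X)), replace = TRUE)      # bootstrap sample
  fit <- svm(X[idx, ], y[idx], probability = TRUE)     # 1. train SVM
  pr  <- attr(predict(fit, X, probability = TRUE),
              "probabilities")                         # Platt scores, whole set
  oob <- setdiff(seq_len(nrow(X)), unique(idx))        # 3. toss in-sample obs.
  for (i in oob) {
    knm <- pr[i, colnames(pr) != as.character(y[i])]   # 2. toss KM scores
    knm_scores <- c(knm_scores,
                    knm[sample(length(knm), 1)])       # 4. pick one KNM score
  }
}
# knm_scores now approximates the Null (KNM) Platt-score distribution
```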

Page 32:

Estimate of the log KNM Platt-score distribution

• Fitting the log(KNM) scores to a parametric form helps us avoid a plethora of 0 p-values for the KM validation set

• "Problem" p-values are now small but nonzero

Page 33:

Validation Set

• Sample to get a set of IID simulated log(KNM scores) ("reusing the data" less, too…??)

• Lump these together as the "validation set"

• Compute p-values for the validation set from the fit Null

• Check assumptions on the Null:

• Null p-values should be Uniform

• Null z-values should be close to N(0,1)

Page 34:

Fit local-fdr models:

• Use locfdr [locfdr]: fit the classic Poisson regression for f(z)

• Use modified locfdr/JAGS [JAGS, Plummer] or Stan [Stan]: fit Bayesian hierarchical Poisson regressions

[Figure: fitted local-fdr models plotted against z, side by side.]
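A hedged sketch of the classic fit using the CRAN locfdr package; the z-scores here are simulated stand-ins for the validation-set z-values described above:

```r
# Classic Poisson-regression fit for f(z) and the local fdr via locfdr
library(locfdr)

set.seed(5)
z <- c(rnorm(950), rnorm(50, mean = 3))   # simulated stand-in z-scores
fit  <- locfdr(z, bre = 120, df = 7)      # Poisson-regression fit for f(z)
lfdr <- fit$fdr                           # local fdr = estimated Pr(S- | z)
```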

Page 35:

Posterior Association Probability: Believability Curve

[Figure: 12D PCA-SVM locfdr fit for Glock primer shear patterns, with +/- 2 standard error bands.]

Page 36:

[Figure: four fits on the test set: Bayesian over-dispersed Poisson with intercept, Bayesian Poisson with intercept, Poisson (Efron), and Bayesian Poisson.]

Page 37:

Bayes Factors/Likelihood Ratios

• In the "Forensic Bayesian Framework", the likelihood ratio is the measure of the weight of evidence.

• LRs are called Bayes factors by most statisticians.

• LRs give the measure of support the "evidence" lends to the "prosecution hypothesis" vs. the "defense hypothesis"

• From Bayes' Theorem:

Pr(Hp | E) / Pr(Hd | E) = [Pr(E | Hp) / Pr(E | Hd)] x [Pr(Hp) / Pr(Hd)]

(posterior odds = likelihood ratio x prior odds)

Page 38:

Bayes Factors/Likelihood Ratios

• Once the "fits" for the empirical Bayes method are obtained, it is easy to compute the corresponding likelihood ratios.

o Using the identity Pr(S+ | z) = 1 - Pr(S- | z) = 1 - lfdr(z), the likelihood ratio can be computed as:

LR(z) = [Pr(S+ | z) / Pr(S- | z)] x [Pr(S-) / Pr(S+)] = [(1 - lfdr(z)) / lfdr(z)] x [Pr(S-) / Pr(S+)]
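In R, given the locfdr fit above (its fdr component is the lfdr vector), the LR follows in one line; the prior Pr(S-) is an assumed value here (locfdr's fp0 output is one place to get an estimate):

```r
# Likelihood ratios from the fit lfdr; the prior is an assumed value
prior_Sminus <- 0.95                                  # assumed Pr(S-)
LR <- ((1 - lfdr) / lfdr) * (prior_Sminus / (1 - prior_Sminus))
```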

Page 39:

Bayes Factors/Likelihood Ratios

• Using the fit posteriors and priors we can obtain the likelihood ratios [Tippett, Ramos]

[Figure: distributions of LR values for known matches and known non-matches.]

Page 40:

Empirical Bayes’: Some Things That Bother Me

• Need a lot of z-scores

• Big data sets in forensic science largely don't exist

• z-scores should be fairly independent

• Especially necessary for interval estimates around the lfdr [Efron]

• Requires "binning" in an arbitrary number of intervals

• Also suffers from the "open-set" problem

• Interpretation of the prior probability for this application: should Pr(S-) be 1 or very close to it? How close?