Interpreting a Test - Massachusetts Institute of Technology (people.csail.mit.edu/psz/6.872/n/naive...)
TRANSCRIPT
Diagnosis and Predictive Modeling (Bayesian Perspective)
Interpreting a Test
• Relationship between a diagnostic conclusion and a diagnostic test
                    Test Positive     Test Negative
Disease Present     True Positive     False Negative    TP+FN
Disease Absent      False Positive    True Negative     FP+TN
                    TP+FP             FN+TN
Definitions
Sensitivity (true positive rate): TP/(TP+FN)
False negative rate: 1-Sensitivity = FN/(TP+FN)
Specificity (true negative rate): TN/(FP+TN)
False positive rate: 1-Specificity = FP/(FP+TN)
Positive Predictive Value: TP/(TP+FP)
Negative Predictive Value: TN/(FN+TN)
Test Thresholds
[Figure: overlapping distributions of test values for diseased and healthy patients; a threshold T labels results + or -, creating FP and FN regions on either side of T]
Wonderful Test
[Figure: the same plot with well-separated distributions, so a threshold T produces almost no FP or FN]
Test Thresholds Change the Trade-off between Sensitivity and Specificity
[Figure: moving the threshold T trades FP against FN]
Receiver Operating Characteristic (ROC) Curve
[Figure: TPR (sensitivity) plotted against FPR (1-specificity), both axes from 0 to 1; each setting of the threshold T gives one point on the curve]
What makes a better test?
[Figure: ROC curves on the same TPR-vs-FPR axes; the diagonal line is a “worthless” test, a modestly bowed curve is “OK”, and a curve hugging the upper-left corner is “superb”]
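A sketch of how an ROC curve and its area under the curve (AUC) arise from sweeping the test threshold; the scores and labels here are invented for illustration:

```python
# Sketch: trace an ROC curve by sweeping the threshold over test scores.
def roc_points(scores, labels):
    """Return (FPR, TPR) points, one per threshold, starting at (0, 0)."""
    pos = sum(labels)               # number of diseased cases (label 1)
    neg = len(labels) - pos         # number of healthy cases (label 0)
    pts = [(0.0, 0.0)]
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        pts.append((fp / neg, tp / pos))
    return pts

def auc(pts):
    """Area under the curve by the trapezoidal rule."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(pts, pts[1:]))

# A "superb" test separates the two classes perfectly -> AUC = 1.0
perfect = auc(roc_points([0.9, 0.8, 0.2, 0.1], [1, 1, 0, 0]))
# A "worthless" test assigns everyone the same score -> AUC = 0.5
worthless = auc(roc_points([0.5, 0.5, 0.5, 0.5], [1, 0, 1, 0]))
```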
What are Models for?
• Classification
  • Usually discrete (categorical)
  • Diagnosis
  • Selection, …
• Regression
  • Usually continuous
  • Prognosis (how long will you live?)
  • Estimation (what would this lab value be if I measured it?)
• Sequential diagnostic reasoning
  • State of knowledge: a probability distribution over a set of possible diseases
  • ∑i P(Di) = 1, P(Di and Dj) = 0 for i ≠ j
  • In the binary case, it’s just P(D) and P(~D)
  • We can observe n different symptoms, S1, ..., Sn, any one of which may revise the distribution
Diagnostic Reasoning with Naive Bayes
S1 = Cough
        P(Cough|Di)
D1      0.001
D2      0.9
...
Dm      0.4

S2 = Fever
        P(F=none|Di)   P(F=mild|Di)   P(F=severe|Di)
D1      0.05           0.8            0.15
D2      0.9            0.07           0.03
...
Dm      0.4            0.4            0.2
How certain are we after a test?
[Figure: probability tree. D? branches to D+ with p(D+) and to D- with p(D-) = 1 - p(D+); each branch then splits on the test result: TP = p(T+|D+), FN = p(T-|D+), FP = p(T+|D-), TN = p(T-|D-)]
Bayes’ Rule: P(D+|T+) = P(T+|D+) P(D+) / [P(T+|D+) P(D+) + P(T+|D-) P(D-)]
• Imagine P(D+) = .001 (it’s a rare disease)
• Accuracy of test = P(T+|D+) = P(T-|D-) = .95
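Working the slide’s numbers through Bayes’ rule shows why a positive result from an accurate test can still leave the disease unlikely; a minimal sketch:

```python
# The slide's numbers worked through Bayes' rule:
# P(D+|T+) = P(T+|D+) P(D+) / [P(T+|D+) P(D+) + P(T+|D-) P(D-)]
p_d = 0.001          # prior: it's a rare disease
sens = 0.95          # P(T+|D+)
spec = 0.95          # P(T-|D-), so P(T+|D-) = 1 - spec = 0.05
p_t_pos = sens * p_d + (1 - spec) * (1 - p_d)   # total P(T+)
posterior = sens * p_d / p_t_pos                # P(D+|T+)
# Despite a "95% accurate" test, a positive result leaves the
# disease probability under 2%: the false positives from the large
# healthy population swamp the true positives.
```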
• Exploit assumption of conditional independence among symptoms
• Sequence of observations of symptoms, Si, each revising the distribution via Bayes’ Rule
Diagnostic Reasoning with Naive Bayes
[Figure: the distribution over diseases is revised after each observation, e.g.
  (D1: 0.12, D2: 0.37, ..., Dn: 0.03) --obs Si--> (D1: 0.19, D2: 0.30, ..., Dn: 0.01)
  --obs Sj--> (D1: 0.08, D2: 0.59, ..., Dn: 0.05) --obs Sk--> (D1: 0.01, D2: 0.96, ..., Dn: 0.00)]
• After the j-th observation, P(Di | S1, ..., Sj) ∝ P(Di) ∏k≤j P(Sk|Di), renormalized so the probabilities sum to 1
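A minimal sketch of this sequential update, using the P(symptom|disease) values from the tables above; the disease priors are made up for illustration:

```python
# Sketch of the sequential naive-Bayes update: each observed symptom
# multiplies each disease's probability by P(symptom|disease), then
# the distribution is renormalized.
def bayes_update(prior, likelihoods):
    """Revise a distribution over diseases given one observed symptom.

    prior: {disease: P(disease)}
    likelihoods: {disease: P(observed symptom value | disease)}
    """
    unnorm = {d: prior[d] * likelihoods[d] for d in prior}
    z = sum(unnorm.values())
    return {d: p / z for d, p in unnorm.items()}

prior = {"D1": 0.3, "D2": 0.3, "Dm": 0.4}              # assumed priors
p_cough = {"D1": 0.001, "D2": 0.9, "Dm": 0.4}          # from the S1 table
post = bayes_update(prior, p_cough)                    # observe cough
p_fever_severe = {"D1": 0.15, "D2": 0.03, "Dm": 0.2}   # from the S2 table
post = bayes_update(post, p_fever_severe)              # then severe fever
```

Conditional independence is what justifies multiplying the likelihoods one symptom at a time.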
Entropy Redux
• How to choose which observation to make next?
• Compute the expected entropy of P(Di) after requesting each possible observation
• For each observation, Sj, we can get nj possible answers
• For each answer, we can compute the revised (by Bayes rule) posterior probability distribution
• For that distribution, we compute its entropy
• The expected entropy weights these entropies by the probability that we would get that answer if we asked that question, namely
  E[H after asking Sj] = ∑v P(Sj = v) · H(P(Di | Sj = v))
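The question-selection rule above can be sketched as follows; the prior and the answer likelihoods in the usage lines are invented to show the two extremes:

```python
import math

# Sketch: score a candidate question by the expected entropy of the
# posterior distribution over diseases, weighting each possible answer
# by its probability.
def entropy(dist):
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

def expected_entropy(prior, answer_likelihoods):
    """answer_likelihoods: {answer: {disease: P(answer|disease)}}"""
    exp_h = 0.0
    for ans, lik in answer_likelihoods.items():
        p_ans = sum(prior[d] * lik[d] for d in prior)        # P(answer)
        if p_ans == 0:
            continue
        posterior = {d: prior[d] * lik[d] / p_ans for d in prior}
        exp_h += p_ans * entropy(posterior)                  # weighted H
    return exp_h

prior = {"D1": 0.5, "D2": 0.5}
# A perfectly discriminating yes/no question drives expected entropy to 0;
# a question whose answer is independent of the disease leaves it at 1 bit.
decisive = {"yes": {"D1": 1.0, "D2": 0.0}, "no": {"D1": 0.0, "D2": 1.0}}
useless = {"yes": {"D1": 0.5, "D2": 0.5}, "no": {"D1": 0.5, "D2": 0.5}}
```

The program would ask next whichever observation minimizes this expected entropy.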
Odds-Likelihood
• In gambling, “3-to-1” odds means 75% chance of success
  • O = P/(1-P); P = O/(1+O)
• P = 0.5 means O = 1
• Likelihood ratio: L = P(S|D) / P(S|~D)
• Odds-likelihood form of Bayes rule: O(D|S) = L · O(D)
• Log transform: log O(D|S) = log L + log O(D)
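A quick numerical check, with made-up numbers, that the odds-likelihood form gives the same posterior as applying Bayes’ rule directly:

```python
import math

# Compare direct Bayes' rule with the odds-likelihood form.
p_d = 0.10                         # assumed prior probability of disease
p_s_given_d = 0.8                  # assumed P(S|D)
p_s_given_not_d = 0.2              # assumed P(S|~D)

# Direct Bayes' rule
post = (p_s_given_d * p_d /
        (p_s_given_d * p_d + p_s_given_not_d * (1 - p_d)))

# Odds-likelihood form: O(D|S) = L * O(D)
prior_odds = p_d / (1 - p_d)
lr = p_s_given_d / p_s_given_not_d          # likelihood ratio L
post_odds = lr * prior_odds
post_from_odds = post_odds / (1 + post_odds)  # convert odds back to P

# Log transform turns the product into a sum
log_post_odds = math.log(lr) + math.log(prior_odds)
```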
Acute Renal Failure Program
• Differential Diagnosis of Acute Oliguric Renal Failure
  • “stop peeing”
• 14 potential causes, exhaustive and mutually exclusive
• 27 tests/questions/observations relevant to differential
  • “cheap”; therefore, ordering based on expected information gain
• 3 invasive tests (biopsy, retrograde pyelography, renal arteriography)
  • “expensive”; ordering based on (very naive) utility model
• 8 treatments (conservative, IV fluids, surgery for obstruction, steroids, antibiotics, surgery for clots, antihypertensive drugs, heparin)
  • expected outcomes are “better”, “unchanged”, “worse”
• Gorry, G. A., Kassirer, J. P., Essig, A., & Schwartz, W. B. (1973). Decision analysis as the basis for computer-aided management of acute renal failure. The American Journal of Medicine, 55(3), 473–484.
• Demo of ARF Program (reconstructed only the diagnostic portion, in Java, with added graphics)
DECISION ANALYSIS--GORRY ET AL.
Question 5. What is the kidney size on plain film of the abdomen?
1. Small 2. Normal 3. Large 4. Very Large
Reply: 3
The current distribution is
Disease   Probability
OBSTR     0.80
FARF      0.12
PYE       0.04
Question 6. Was there a large fluid loss preceding the onset of oliguria?
Reply: No
The current distribution is
Disease   Probability
OBSTR     0.88
PYE       0.05
FARF      0.03
Question 7. What is the degree of proteinuria?
1. 0 2. trace to 2+ 3. 3+ to 4+
Reply: 1
The current distribution is
Disease   Probability
OBSTR     0.94
FARF      0.03
PYE       0.03
Question 8. Is there a history of prolonged hypotension preceding the onset of oliguria?
Reply: No
The current distribution is
Disease   Probability
OBSTR     0.96
PYE       0.03
Figure 1. Typical interactive dialogue between the physician and the phase I computer program. The final diagnosis, which was arrived at after eight questions were asked, was urinary tract obstruction.
computer program which operates in the interactive mode and which usually can arrive at a diagnosis quickly by requesting only the most critical information [4,5]. This latter program, like its predecessors, still has the serious deficiency that it is indifferent to the risks and pain involved in various tests and has no way of balancing the dangers and discomforts of a procedure against the value of the information to be gained. In this sense it lacks a key element that characterizes the practice of a good physician.
We describe an interactive computer program which deals with this problem by incorporating the potential risks and potential benefits of tests and treatments into the decision-making process, utilizing the discipline of decision analysis [2].* As a prototype for study we chose acute oliguric renal failure.
The program is divided into two portions: phase I, which considers only tests that involve little risk or discomfort, e.g., historic data, chemical tests of blood, and phase II, which utilizes tests or treatments for which the potential risks are significant.
We also describe the structure of the program, the way in which it has performed in the diagnosis and management of simulated clinical cases, and the problems that must be resolved if the technic is to have value as a “consultant” to the practicing physician.
The system to be described has been implemented on a time-sharing facility at the Massachusetts Institute of Technology, utilizing Fortran 4 as a programming language.
*In an accompanying paper we have shown how the discipline of decision analysis can be utilized without the aid of a computer in the management of complex clinical disorders [3].
METHODS
Selection of the Clinical Problem. The clinical problem of acute renal failure was selected for several reasons. First, the number of diseases causing acute oliguric renal failure is relatively small and manageable. Second, the problem is within the field of our expertise. Third, the clinical characteristics and the therapy of the diseases causing acute renal failure are rather well defined.
The Phase I Program. The phase I portion of the program, as mentioned earlier, considers only tests for which the risk or cost is negligible so that the potential benefit can therefore be measured solely in terms of the expected amount of information to be gained. The program operates in a sequential mode, engaging in an interactive dialogue with the physician (Figure 1) and has two basic functions. The first, the inference function, evaluates the diagnostic significance of new attributes (signs, symptoms and laboratory results) in light of the facts already available about a patient. The second function, the question selection function, determines which question should be asked next in order to maximize the expected gain in information. The underlying concepts of both of these functions will be discussed subsequently. The computer programs have been described elsewhere and will not be considered in detail here [5].
The inference function: The inference function is the means by which the program interprets diagnostic evidence about a patient. Given the a priori
October 1973 The American Journal of Medicine Volume 55 475
Utility Theory and Decision Analysis
• Goal
  • In search, binary
  • In the real world, better or worse
• Utility measures the value of an outcome
  • $$$ in investments
  • Years of (quality-adjusted) life in healthcare
• Principle of rationality
  • Choose the action that maximizes expected utility.
  • No guarantee of an instant “win”, but in the long run it maximizes rewards
Case of a Man with Gangrene
• From Pauker’s “Decision Analysis Service” at New England Medical Center Hospital, late 1970’s.
• Man with gangrene of foot
• Choose to amputate foot or treat medically
• If medical treatment fails, patient may die or may have to amputate whole leg
• What to do? How to reason about it?
Decision Tree for Gangrene Case (Different sense of “Decision Tree” from ML/Classification!)
[Figure: decision tree, redrawn as text; bracketed numbers are the folded-back expected utilities]
Choice [871.5]:
  amputate foot: Chance [841.5] = .99 × 850 + .01 × 0
    live (.99): utility 850
    die (.01): utility 0
  medicine: Chance [871.5] = .7 × 1000 + .25 × 686 + .05 × 0
    full recovery (.7): utility 1000
    worse (.25): Choice [686]:
      amputate leg: Chance [686] = .98 × 700 + .02 × 0
        live (.98): utility 700
        die (.02): utility 0
      medicine: Chance [597] = .6 × 995 + .4 × 0
        live (.6): utility 995
        die (.4): utility 0
    die (.05): utility 0
“Folding back” a Decision Tree
• The value of an outcome node is its utility
• The value of a chance node is the expected value of its alternative branches; i.e., their values weighted by their probabilities
• The value of a choice node is the maximum value of any of its branches
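The three folding-back rules can be sketched directly; the tree below encodes the gangrene case from the earlier slide, with outcome nodes as bare utility numbers:

```python
# Sketch: "folding back" a decision tree. A node is either a bare
# utility number (outcome), ("chance", [(p, subtree), ...]), or
# ("choice", [subtree, ...]).
def fold_back(node):
    if isinstance(node, (int, float)):
        return node                                   # outcome: its utility
    kind, branches = node
    if kind == "chance":                              # expected value
        return sum(p * fold_back(sub) for p, sub in branches)
    if kind == "choice":                              # best branch
        return max(fold_back(sub) for sub in branches)
    raise ValueError(f"unknown node kind: {kind}")

worse = ("choice", [
    ("chance", [(0.98, 700), (0.02, 0)]),      # amputate leg
    ("chance", [(0.6, 995), (0.4, 0)]),        # medicine again
])
tree = ("choice", [
    ("chance", [(0.99, 850), (0.01, 0)]),      # amputate foot
    ("chance", [(0.7, 1000), (0.25, worse), (0.05, 0)]),  # medicine
])
best = fold_back(tree)   # expected utility of the best initial choice
```

Folding back says medicine is the better initial choice here (871.5 vs. 841.5 for immediate amputation).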
Where Do Utilities Come From?
• Standard gamble
  • Would you prefer (choose one of the following two):
    1. I chop off your foot
    2. We play a game in which a fair process produces a random number r between 0 and 1
       • If r > 0.8, I kill you; otherwise, you live on, healthy
  • I vary the 0.8 threshold until you are indifferent.
  • If you’re indifferent, that’s the value of living without your foot!
• Alas, difficult ascertainment problems!
The Lady Tasting Tea
• R. A. Fisher & the Lady
• B. Muriel Bristol claimed she prefers tea added to milk rather than milk added to tea
• Fisher was skeptical that she could distinguish
• Possible resolutions
• Reason about the chemistry of tea and milk
• Milk first: a little tea interacts with a lot of milk
• Tea first: vice versa
• Perform a “clinical trial”
• Ask her to determine order for a series of test cups
• Calculate probability that her answers could have occurred by chance guessing; if small, she “wins”
• ... Fisher’s Exact Test
• Significance testing
• Reject the null hypothesis (that it happened by chance) if its probability is < 0.1, 0.05, 0.01, 0.001, ..., 0.000001, ..., ????
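Fisher’s exact calculation for the classic design (8 cups, 4 of each kind, and the lady knows the counts) can be sketched as a hypergeometric tail sum:

```python
from math import comb

# Sketch of Fisher's exact test for the tea experiment: 8 cups,
# 4 milk-first and 4 tea-first; the lady must say which 4 are milk-first.
def p_at_least_k_correct(k):
    """P(she labels >= k of the 4 milk-first cups correctly by guessing)."""
    total = comb(8, 4)  # ways to pick which 4 cups she calls "milk first"
    return sum(comb(4, j) * comb(4, 4 - j) for j in range(k, 5)) / total

p_perfect = p_at_least_k_correct(4)  # all 8 cups right: 1/70, about 0.014
p_three = p_at_least_k_correct(3)    # one pair swapped or better: 17/70
```

A perfect score is significant at the 0.05 level; getting 3 of 4 right (p about 0.24) is not.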
How to deal with multiple testing
• Suppose Ms. Bristol had tried this test 100 times, and passed once. Would you be convinced of her ability to distinguish?
• Bonferroni correction: for n trials, insist on a p-value that is 1/n of what you would demand for a single trial
• Random permutations of data yield distribution of possible results;
• check to see if actual result is an “outlier” in this distribution
• if so, then it’s unlikely to be due to random chance
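A quick calculation of why one success in 100 attempts is unconvincing, reusing the tea-test numbers (p = 1/70 for a perfect score on a single trial is the classic figure; the rest is arithmetic):

```python
# Under the null hypothesis (pure guessing), what is the chance of at
# least one "significant" perfect score somewhere in 100 trials?
p_single = 1 / 70                        # P(perfect score by chance), one trial
p_any = 1 - (1 - p_single) ** 100        # P(>= 1 perfect score in 100 trials)
# p_any is roughly 0.76 -- one success in 100 tries is what chance predicts.

# Bonferroni correction: demand p < alpha/n on each individual trial.
alpha, n = 0.05, 100
bonferroni_threshold = alpha / n         # 0.0005
# A single perfect score (p = 1/70, about 0.014) no longer clears this bar.
```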
Cross-validation
• Any number of times
• Train on some subset of the training data
• Test on the remainder, called the validation set
• Choose best meta-parameters
• Train, with those meta-parameters, on all training data
• Test on Test data, once!
[Figure: the Training Data is split into “Real” Training Data and Validation Data; the Test Data is kept entirely separate.]
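The data-splitting part of this procedure can be sketched as follows; model fitting and the choice of meta-parameters are left abstract, since they depend on the model family:

```python
# Sketch: k-fold splits of the training data. Each fold serves once as
# the validation set while the rest is the "real" training data; the
# held-out test set is never touched until the very end.
def k_fold_splits(data, k):
    """Yield (train, validation) pairs for k-fold cross-validation."""
    folds = [data[i::k] for i in range(k)]
    for i in range(k):
        validation = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, validation

splits = list(k_fold_splits(list(range(10)), k=5))
```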
Need to explore many models
• Remember:
• training set => model
• model + test set => measure of performance
• But
• How do we choose the best family of models?
• How do we choose the important features?
• Models may have structural parameters
• Number of hidden units in ANN
• Max number of parents in Bayes Net
• Parameters (like the betas in LR), and meta-parameters
• Not legitimate to “try all” and report the best!
Google’s Lessons
• Much of human knowledge is not like physics!
• “... invariably, simple models and a lot of data trump more elaborate models based on less data”
• “... simple n-gram models or linear classifiers based on millions of specific features perform better than elaborate models that try to discover general rules”
• “... all the experimental evidence from the last decade suggests that throwing away rare events is almost always a bad idea, because much Web data consists of individually rare but collectively frequent events”
Volume of the unit ball in n dimensions, n = 1 ... 50:
  1  2.000000e+00    2  3.141593e+00    3  4.188790e+00    4  4.934802e+00    5  5.263789e+00
  6  5.167713e+00    7  4.724766e+00    8  4.058712e+00    9  3.298509e+00   10  2.550164e+00
 11  1.884104e+00   12  1.335263e+00   13  9.106288e-01   14  5.992645e-01   15  3.814433e-01
 16  2.353306e-01   17  1.409811e-01   18  8.214589e-02   19  4.662160e-02   20  2.580689e-02
 21  1.394915e-02   22  7.370431e-03   23  3.810656e-03   24  1.929574e-03   25  9.577224e-04
 26  4.663028e-04   27  2.228721e-04   28  1.046381e-04   29  4.828782e-05   30  2.191535e-05
 31  9.787140e-06   32  4.303070e-06   33  1.863467e-06   34  7.952054e-07   35  3.345288e-07
 36  1.387895e-07   37  5.680829e-08   38  2.294843e-08   39  9.152231e-09   40  3.604731e-09
 41  1.402565e-09   42  5.392665e-10   43  2.049436e-10   44  7.700707e-11   45  2.861553e-11
 46  1.051847e-11   47  3.825461e-12   48  1.376865e-12   49  4.905322e-13   50  1.730219e-13
Brian Hayes, http://www.americanscientist.org/issues/pub/an-adventure-in-the-nth-dimension
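The table above can be regenerated from the closed form for the volume of the unit n-ball, V(n) = π^(n/2) / Γ(n/2 + 1); the volume peaks near n = 5 and then collapses toward zero, which is the point of Hayes’s article:

```python
import math

# Volume of the unit ball in n dimensions.
def unit_ball_volume(n):
    return math.pi ** (n / 2) / math.gamma(n / 2 + 1)

volumes = {n: unit_ball_volume(n) for n in range(1, 51)}
peak = max(volumes, key=volumes.get)   # the dimension with maximum volume
```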
Can We Deal with Publication Bias?
• Extrapolate from published studies to (perhaps) unpublished ones
• Estimate the population of studies being performed
  • Federal grant register
  • ClinicalTrials.gov
    • required registration
• Public availability of study data allows alternative analyses
• “Journal of Negative Results”
Potential Goals of a Study
• Decision support in a clinical case
  • Maximize expected outcome for this patient
• Policy to establish standards of care
  • FDA regulation of drugs, devices, …
  • Diagnostic and treatment recommendations
    • e.g., hormone replacement therapy, mammograms for breast cancer detection, prostate-specific antigen to detect prostate cancer, …
• D.A.R.E.
• Scientific discovery