Interpreting a Test - Massachusetts Institute of Technology (people.csail.mit.edu/psz/6.872/n/naive...)
TRANSCRIPT
Diagnosis and Predictive Modeling (Bayesian Perspective)
Interpreting a Test
• Relationship between a diagnostic conclusion and a diagnostic test
                    Test Positive     Test Negative
Disease Present     True Positive     False Negative    TP+FN
Disease Absent      False Positive    True Negative     FP+TN
                    TP+FP             FN+TN
Definitions
Sensitivity (true positive rate): TP/(TP+FN)
False negative rate: 1-Sensitivity = FN/(TP+FN)
Specificity (true negative rate): TN/(FP+TN)
False positive rate: 1-Specificity = FP/(FP+TN)
Positive Predictive Value: TP/(TP+FP)
Negative Predictive Value: TN/(FN+TN)
Test Thresholds
[Figure: overlapping distributions of test values for diseased and healthy patients; a threshold T labels results + or -, creating FP and FN regions on either side of T]
Wonderful Test
[Figure: the same plot with well-separated distributions, so a threshold T produces almost no FP or FN]
Test Thresholds Change the Trade-off between Sensitivity and Specificity
[Figure: moving the threshold T trades FP against FN]
Receiver Operating Characteristic (ROC) Curve
[Figure: TPR (sensitivity) plotted against FPR (1-specificity), both axes from 0 to 1; each setting of the threshold T gives one point on the curve]
What makes a better test?
[Figure: ROC curves on the same TPR-vs-FPR axes; the diagonal line is a “worthless” test, a modestly bowed curve is “OK”, and a curve hugging the upper-left corner is “superb”]
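A sketch of how an ROC curve and its area under the curve (AUC) arise from sweeping the test threshold; the scores and labels here are invented for illustration:

```python
# Sketch: trace an ROC curve by sweeping the threshold over test scores.
def roc_points(scores, labels):
    """Return (FPR, TPR) points, one per threshold, starting at (0, 0)."""
    pos = sum(labels)               # number of diseased cases (label 1)
    neg = len(labels) - pos         # number of healthy cases (label 0)
    pts = [(0.0, 0.0)]
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        pts.append((fp / neg, tp / pos))
    return pts

def auc(pts):
    """Area under the curve by the trapezoidal rule."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(pts, pts[1:]))

# A "superb" test separates the two classes perfectly -> AUC = 1.0
perfect = auc(roc_points([0.9, 0.8, 0.2, 0.1], [1, 1, 0, 0]))
# A "worthless" test assigns everyone the same score -> AUC = 0.5
worthless = auc(roc_points([0.5, 0.5, 0.5, 0.5], [1, 0, 1, 0]))
```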
What are Models for?
• Classification
  • Usually discrete (categorical)
  • Diagnosis
  • Selection, …
• Regression
  • Usually continuous
  • Prognosis (how long will you live?)
  • Estimation (what would this lab value be if I measured it?)
• Sequential diagnostic reasoning
  • State of knowledge: a probability distribution over a set of possible diseases
  • ∑i P(Di) = 1, P(Di and Dj) = 0 for i ≠ j
  • In the binary case, it’s just P(D) and P(~D)
  • We can observe n different symptoms, S1, ..., Sn, any one of which may revise the distribution
Diagnostic Reasoning with Naive Bayes
S1 = Cough
        P(Cough|Di)
D1      0.001
D2      0.9
...
Dm      0.4

S2 = Fever
        P(F=none|Di)   P(F=mild|Di)   P(F=severe|Di)
D1      0.05           0.8            0.15
D2      0.9            0.07           0.03
...
Dm      0.4            0.4            0.2
How certain are we after a test?
[Figure: probability tree. D? branches to D+ with p(D+) and to D- with p(D-) = 1 - p(D+); each branch then splits on the test result: TP = p(T+|D+), FN = p(T-|D+), FP = p(T+|D-), TN = p(T-|D-)]
Bayes’ Rule: P(D+|T+) = P(T+|D+) P(D+) / [P(T+|D+) P(D+) + P(T+|D-) P(D-)]
• Imagine P(D+) = .001 (it’s a rare disease)
• Accuracy of test = P(T+|D+) = P(T-|D-) = .95
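Working the slide’s numbers through Bayes’ rule shows why a positive result from an accurate test can still leave the disease unlikely; a minimal sketch:

```python
# The slide's numbers worked through Bayes' rule:
# P(D+|T+) = P(T+|D+) P(D+) / [P(T+|D+) P(D+) + P(T+|D-) P(D-)]
p_d = 0.001          # prior: it's a rare disease
sens = 0.95          # P(T+|D+)
spec = 0.95          # P(T-|D-), so P(T+|D-) = 1 - spec = 0.05
p_t_pos = sens * p_d + (1 - spec) * (1 - p_d)   # total P(T+)
posterior = sens * p_d / p_t_pos                # P(D+|T+)
# Despite a "95% accurate" test, a positive result leaves the
# disease probability under 2%: the false positives from the large
# healthy population swamp the true positives.
```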
• Exploit assumption of conditional independence among symptoms
• Sequence of observations of symptoms, Si, each revising the distribution via Bayes’ Rule
Diagnostic Reasoning with Naive Bayes
[Figure: the distribution over diseases is revised after each observation, e.g.
  (D1: 0.12, D2: 0.37, ..., Dn: 0.03) --obs Si--> (D1: 0.19, D2: 0.30, ..., Dn: 0.01)
  --obs Sj--> (D1: 0.08, D2: 0.59, ..., Dn: 0.05) --obs Sk--> (D1: 0.01, D2: 0.96, ..., Dn: 0.00)]
• After the j-th observation, P(Di | S1, ..., Sj) ∝ P(Di) ∏k≤j P(Sk|Di), renormalized so the probabilities sum to 1
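A minimal sketch of this sequential update, using the P(symptom|disease) values from the tables above; the disease priors are made up for illustration:

```python
# Sketch of the sequential naive-Bayes update: each observed symptom
# multiplies each disease's probability by P(symptom|disease), then
# the distribution is renormalized.
def bayes_update(prior, likelihoods):
    """Revise a distribution over diseases given one observed symptom.

    prior: {disease: P(disease)}
    likelihoods: {disease: P(observed symptom value | disease)}
    """
    unnorm = {d: prior[d] * likelihoods[d] for d in prior}
    z = sum(unnorm.values())
    return {d: p / z for d, p in unnorm.items()}

prior = {"D1": 0.3, "D2": 0.3, "Dm": 0.4}              # assumed priors
p_cough = {"D1": 0.001, "D2": 0.9, "Dm": 0.4}          # from the S1 table
post = bayes_update(prior, p_cough)                    # observe cough
p_fever_severe = {"D1": 0.15, "D2": 0.03, "Dm": 0.2}   # from the S2 table
post = bayes_update(post, p_fever_severe)              # then severe fever
```

Conditional independence is what justifies multiplying the likelihoods one symptom at a time.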
Entropy Redux
• How to choose which observation to make next?
• Compute the expected entropy of P(Di) after requesting each possible observation
• For each observation, Sj, we can get nj possible answers
• For each answer, we can compute the revised (by Bayes rule) posterior probability distribution
• For that distribution, we compute its entropy
• The expected entropy weights these entropies by the probability that we would get that answer if we asked that question, namely
  E[H after asking Sj] = ∑v P(Sj = v) · H(P(Di | Sj = v))
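The question-selection rule above can be sketched as follows; the prior and the answer likelihoods in the usage lines are invented to show the two extremes:

```python
import math

# Sketch: score a candidate question by the expected entropy of the
# posterior distribution over diseases, weighting each possible answer
# by its probability.
def entropy(dist):
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

def expected_entropy(prior, answer_likelihoods):
    """answer_likelihoods: {answer: {disease: P(answer|disease)}}"""
    exp_h = 0.0
    for ans, lik in answer_likelihoods.items():
        p_ans = sum(prior[d] * lik[d] for d in prior)        # P(answer)
        if p_ans == 0:
            continue
        posterior = {d: prior[d] * lik[d] / p_ans for d in prior}
        exp_h += p_ans * entropy(posterior)                  # weighted H
    return exp_h

prior = {"D1": 0.5, "D2": 0.5}
# A perfectly discriminating yes/no question drives expected entropy to 0;
# a question whose answer is independent of the disease leaves it at 1 bit.
decisive = {"yes": {"D1": 1.0, "D2": 0.0}, "no": {"D1": 0.0, "D2": 1.0}}
useless = {"yes": {"D1": 0.5, "D2": 0.5}, "no": {"D1": 0.5, "D2": 0.5}}
```

The program would ask next whichever observation minimizes this expected entropy.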
Odds-Likelihood
• In gambling, “3-to-1” odds means 75% chance of success
  • O = P/(1-P); P = O/(1+O)
• P = 0.5 means O = 1
• Likelihood ratio: L = P(S|D) / P(S|~D)
• Odds-likelihood form of Bayes rule: O(D|S) = L · O(D)
• Log transform: log O(D|S) = log L + log O(D)
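A quick numerical check, with made-up numbers, that the odds-likelihood form gives the same posterior as applying Bayes’ rule directly:

```python
import math

# Compare direct Bayes' rule with the odds-likelihood form.
p_d = 0.10                         # assumed prior probability of disease
p_s_given_d = 0.8                  # assumed P(S|D)
p_s_given_not_d = 0.2              # assumed P(S|~D)

# Direct Bayes' rule
post = (p_s_given_d * p_d /
        (p_s_given_d * p_d + p_s_given_not_d * (1 - p_d)))

# Odds-likelihood form: O(D|S) = L * O(D)
prior_odds = p_d / (1 - p_d)
lr = p_s_given_d / p_s_given_not_d          # likelihood ratio L
post_odds = lr * prior_odds
post_from_odds = post_odds / (1 + post_odds)  # convert odds back to P

# Log transform turns the product into a sum
log_post_odds = math.log(lr) + math.log(prior_odds)
```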
Acute Renal Failure Program
• Differential Diagnosis of Acute Oliguric Renal Failure
  • “stop peeing”
• 14 potential causes, exhaustive and mutually exclusive
• 27 tests/questions/observations relevant to differential
  • “cheap”; therefore, ordering based on expected information gain
• 3 invasive tests (biopsy, retrograde pyelography, renal arteriography)
  • “expensive”; ordering based on (very naive) utility model
• 8 treatments (conservative, IV fluids, surgery for obstruction, steroids, antibiotics, surgery for clots, antihypertensive drugs, heparin)
  • expected outcomes are “better”, “unchanged”, “worse”
• Gorry, G. A., Kassirer, J. P., Essig, A., & Schwartz, W. B. (1973). Decision analysis as the basis for computer-aided management of acute renal failure. The American Journal of Medicine, 55(3), 473–484.
• Demo of ARF Program (reconstructed only the diagnostic portion, in Java, with added graphics)
DECISION ANALYSIS--GORRY ET AL.
Question 5. What is the kidney size on plain film of the abdomen?
1. Small 2. Normal 3. Large 4. Very Large
Reply: 3
The current distribution is
Disease   Probability
OBSTR     0.80
FARF      0.12
PYE       0.04
Question 6. Was there a large fluid loss preceding the onset of oliguria?
Reply: No
The current distribution is
Disease   Probability
OBSTR     0.88
PYE       0.05
FARF      0.03
Question 7. What is the degree of proteinuria?
1. 0 2. trace to 2+ 3. 3+ to 4+
Reply: 1
The current distribution is
Disease   Probability
OBSTR     0.94
FARF      0.03
PYE       0.03
Question 8. Is there a history of prolonged hypotension preceding the onset of oliguria?
Reply: No
The current distribution is
Disease   Probability
OBSTR     0.96
PYE       0.03
Figure 1. Typical interactive dialogue between the physician and the phase I computer program. The final diagnosis, which was arrived at after eight questions were asked, was urinary tract obstruction.
computer program which operates in the interactive mode and which usually can arrive at a diagnosis quickly by requesting only the most critical information [4,5]. This latter program, like its predecessors, still has the serious deficiency that it is indifferent to the risks and pain involved in various tests and has no way of balancing the dangers and discomforts of a procedure against the value of the information to be gained. In this sense it lacks a key element that characterizes the practice of a good physician.
We describe an interactive computer program which deals with this problem by incorporating the potential risks and potential benefits of tests and treatments into the decision-making process, utilizing the discipline of decision analysis [2].* As a prototype for study we chose acute oliguric renal failure.
The program is divided into two portions: phase I, which considers only tests that involve little risk or discomfort, e.g., historic data, chemical tests of blood, and phase II, which utilizes tests or treatments for which the potential risks are significant.
We also describe the structure of the program, the way in which it has performed in the diagnosis and management of simulated clinical cases, and the problems that must be resolved if the technic is to have value as a “consultant” to the practicing physician.
The system to be described has been implemented on a time-sharing facility at the Massachusetts Institute of Technology, utilizing Fortran 4 as a programming language.
*In an accompanying paper we have shown how the discipline of decision analysis can be utilized without the aid of a computer in the management of complex clinical disorders [3].
METHODS
Selection of the Clinical Problem. The clinical problem of acute renal failure was selected for several reasons. First, the number of diseases causing acute oliguric renal failure is relatively small and manageable. Second, the problem is within the field of our expertise. Third, the clinical characteristics and the therapy of the diseases causing acute renal failure are rather well defined.
The Phase I Program. The phase I portion of the program, as mentioned earlier, considers only tests for which the risk or cost is negligible so that the potential benefit can therefore be measured solely in terms of the expected amount of information to be gained. The program operates in a sequential mode, engaging in an interactive dialogue with the physician (Figure 1) and has two basic functions. The first, the inference function, evaluates the diagnostic significance of new attributes (signs, symptoms and laboratory results) in light of the facts already available about a patient. The second function, the question selection function, determines which question should be asked next in order to maximize the expected gain in information. The underlying concepts of both of these functions will be discussed subsequently. The computer programs have been described elsewhere and will not be considered in detail here [5].
The inference function: The inference function is the means by which the program interprets diagnostic evidence about a patient. Given the a priori
October 1973 The American Journal of Medicine Volume 55 475
Utility Theory and Decision Analysis
• Goal
  • In search, binary
  • In the real world, better or worse
• Utility measures the value of an outcome
  • $$$ in investments
  • Years of (quality-adjusted) life in healthcare
• Principle of rationality
  • Choose the action that maximizes expected utility.
  • No guarantee of an instant “win”, but in the long run it maximizes rewards
Case of a Man with Gangrene
• From Pauker’s “Decision Analysis Service” at New England Medical Center Hospital, late 1970’s.
• Man with gangrene of foot
• Choose to amputate foot or treat medically
• If medical treatment fails, patient may die or may have to amputate whole leg
• What to do? How to reason about it?
Decision Tree for Gangrene Case (Different sense of “Decision Tree” from ML/Classification!)
[Figure: decision tree, redrawn as text; bracketed numbers are the folded-back expected utilities]
Choice [871.5]:
  amputate foot: Chance [841.5] = .99 × 850 + .01 × 0
    live (.99): utility 850
    die (.01): utility 0
  medicine: Chance [871.5] = .7 × 1000 + .25 × 686 + .05 × 0
    full recovery (.7): utility 1000
    worse (.25): Choice [686]:
      amputate leg: Chance [686] = .98 × 700 + .02 × 0
        live (.98): utility 700
        die (.02): utility 0
      medicine: Chance [597] = .6 × 995 + .4 × 0
        live (.6): utility 995
        die (.4): utility 0
    die (.05): utility 0
“Folding back” a Decision Tree
• The value of an outcome node is its utility
• The value of a chance node is the expected value of its alternative branches; i.e., their values weighted by their probabilities
• The value of a choice node is the maximum value of any of its branches
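The three folding-back rules can be sketched directly; the tree below encodes the gangrene case from the earlier slide, with outcome nodes as bare utility numbers:

```python
# Sketch: "folding back" a decision tree. A node is either a bare
# utility number (outcome), ("chance", [(p, subtree), ...]), or
# ("choice", [subtree, ...]).
def fold_back(node):
    if isinstance(node, (int, float)):
        return node                                   # outcome: its utility
    kind, branches = node
    if kind == "chance":                              # expected value
        return sum(p * fold_back(sub) for p, sub in branches)
    if kind == "choice":                              # best branch
        return max(fold_back(sub) for sub in branches)
    raise ValueError(f"unknown node kind: {kind}")

worse = ("choice", [
    ("chance", [(0.98, 700), (0.02, 0)]),      # amputate leg
    ("chance", [(0.6, 995), (0.4, 0)]),        # medicine again
])
tree = ("choice", [
    ("chance", [(0.99, 850), (0.01, 0)]),      # amputate foot
    ("chance", [(0.7, 1000), (0.25, worse), (0.05, 0)]),  # medicine
])
best = fold_back(tree)   # expected utility of the best initial choice
```

Folding back says medicine is the better initial choice here (871.5 vs. 841.5 for immediate amputation).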
Where Do Utilities Come From?
• Standard gamble
  • Would you prefer (choose one of the following two):
    1. I chop off your foot
    2. We play a game in which a fair process produces a random number r between 0 and 1
       • If r > 0.8, I kill you; otherwise, you live on, healthy
  • I vary the 0.8 threshold until you are indifferent.
  • If you’re indifferent, that’s the value of living without your foot!
• Alas, difficult ascertainment problems!
The Lady Tasting Tea
• R. A. Fisher & the Lady
• B. Muriel Bristol claimed she prefers tea added to milk rather than milk added to tea
• Fisher was skeptical that she could distinguish
• Possible resolutions
• Reason about the chemistry of tea and milk
• Milk first: a little tea interacts with a lot of milk
• Tea first: vice versa
• Perform a “clinical trial”
• Ask her to determine order for a series of test cups
• Calculate probability that her answers could have occurred by chance guessing; if small, she “wins”
• ... Fisher’s Exact Test
• Significance testing
• Reject the null hypothesis (that it happened by chance) if its probability is < 0.1, 0.05, 0.01, 0.001, ..., 0.000001, ..., ????
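Fisher’s exact calculation for the classic design (8 cups, 4 of each kind, and the lady knows the counts) can be sketched as a hypergeometric tail sum:

```python
from math import comb

# Sketch of Fisher's exact test for the tea experiment: 8 cups,
# 4 milk-first and 4 tea-first; the lady must say which 4 are milk-first.
def p_at_least_k_correct(k):
    """P(she labels >= k of the 4 milk-first cups correctly by guessing)."""
    total = comb(8, 4)  # ways to pick which 4 cups she calls "milk first"
    return sum(comb(4, j) * comb(4, 4 - j) for j in range(k, 5)) / total

p_perfect = p_at_least_k_correct(4)  # all 8 cups right: 1/70, about 0.014
p_three = p_at_least_k_correct(3)    # one pair swapped or better: 17/70
```

A perfect score is significant at the 0.05 level; getting 3 of 4 right (p about 0.24) is not.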
How to deal with multiple testing
• Suppose Ms. Bristol had tried this test 100 times, and passed once. Would you be convinced of her ability to distinguish?
• Bonferroni correction: for n trials, insist on a p-value that is 1/n of what you would demand for a single trial
• Random permutations of data yield distribution of possible results;
• check to see if actual result is an “outlier” in this distribution
• if so, then it’s unlikely to be due to random chance
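A quick calculation of why one success in 100 attempts is unconvincing, reusing the tea-test numbers (p = 1/70 for a perfect score on a single trial is the classic figure; the rest is arithmetic):

```python
# Under the null hypothesis (pure guessing), what is the chance of at
# least one "significant" perfect score somewhere in 100 trials?
p_single = 1 / 70                        # P(perfect score by chance), one trial
p_any = 1 - (1 - p_single) ** 100        # P(>= 1 perfect score in 100 trials)
# p_any is roughly 0.76 -- one success in 100 tries is what chance predicts.

# Bonferroni correction: demand p < alpha/n on each individual trial.
alpha, n = 0.05, 100
bonferroni_threshold = alpha / n         # 0.0005
# A single perfect score (p = 1/70, about 0.014) no longer clears this bar.
```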
Cross-validation
• Any number of times
• Train on some subset of the training data
• Test on the remainder, called the validation set
• Choose best meta-parameters
• Train, with those meta-parameters, on all training data
• Test on Test data, once!
[Figure: the Training Data is split into “Real” Training Data and Validation Data; the Test Data is kept entirely separate.]
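The data-splitting part of this procedure can be sketched as follows; model fitting and the choice of meta-parameters are left abstract, since they depend on the model family:

```python
# Sketch: k-fold splits of the training data. Each fold serves once as
# the validation set while the rest is the "real" training data; the
# held-out test set is never touched until the very end.
def k_fold_splits(data, k):
    """Yield (train, validation) pairs for k-fold cross-validation."""
    folds = [data[i::k] for i in range(k)]
    for i in range(k):
        validation = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, validation

splits = list(k_fold_splits(list(range(10)), k=5))
```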
Need to explore many models
• Remember:
• training set => model
• model + test set => measure of performance
• But
• How do we choose the best family of models?
• How do we choose the important features?
• Models may have structural parameters
• Number of hidden units in ANN
• Max number of parents in Bayes Net
• Parameters (like the betas in LR), and meta-parameters
• Not legitimate to “try all” and report the best!
Google’s Lessons
• Much of human knowledge is not like physics!
• “... invariably, simple models and a lot of data trump more elaborate models based on less data”
• “... simple n-gram models or linear classifiers based on millions of specific features perform better than elaborate models that try to discover general rules”
• “... all the experimental evidence from the last decade suggests that throwing away rare events is almost always a bad idea, because much Web data consists of individually rare but collectively frequent events”
Volume of the unit ball in n dimensions, n = 1 ... 50:
  1  2.000000e+00    2  3.141593e+00    3  4.188790e+00    4  4.934802e+00    5  5.263789e+00
  6  5.167713e+00    7  4.724766e+00    8  4.058712e+00    9  3.298509e+00   10  2.550164e+00
 11  1.884104e+00   12  1.335263e+00   13  9.106288e-01   14  5.992645e-01   15  3.814433e-01
 16  2.353306e-01   17  1.409811e-01   18  8.214589e-02   19  4.662160e-02   20  2.580689e-02
 21  1.394915e-02   22  7.370431e-03   23  3.810656e-03   24  1.929574e-03   25  9.577224e-04
 26  4.663028e-04   27  2.228721e-04   28  1.046381e-04   29  4.828782e-05   30  2.191535e-05
 31  9.787140e-06   32  4.303070e-06   33  1.863467e-06   34  7.952054e-07   35  3.345288e-07
 36  1.387895e-07   37  5.680829e-08   38  2.294843e-08   39  9.152231e-09   40  3.604731e-09
 41  1.402565e-09   42  5.392665e-10   43  2.049436e-10   44  7.700707e-11   45  2.861553e-11
 46  1.051847e-11   47  3.825461e-12   48  1.376865e-12   49  4.905322e-13   50  1.730219e-13
Brian Hayes, http://www.americanscientist.org/issues/pub/an-adventure-in-the-nth-dimension
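The table above can be regenerated from the closed form for the volume of the unit n-ball, V(n) = π^(n/2) / Γ(n/2 + 1); the volume peaks near n = 5 and then collapses toward zero, which is the point of Hayes’s article:

```python
import math

# Volume of the unit ball in n dimensions.
def unit_ball_volume(n):
    return math.pi ** (n / 2) / math.gamma(n / 2 + 1)

volumes = {n: unit_ball_volume(n) for n in range(1, 51)}
peak = max(volumes, key=volumes.get)   # the dimension with maximum volume
```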
Can We Deal with Publication Bias?
• Extrapolate from published studies to (perhaps) unpublished ones
• Estimate the population of studies being performed
  • Federal grant register
  • ClinicalTrials.gov
    • required registration
• Public availability of study data allows alternative analyses
• “Journal of Negative Results”
Potential Goals of a Study
• Decision support in a clinical case
  • Maximize expected outcome for this patient
• Policy to establish standards of care
  • FDA regulation of drugs, devices, …
  • Diagnostic and treatment recommendations
    • e.g., hormone replacement therapy, mammograms for breast cancer detection, prostate-specific antigen to detect prostate cancer, …
• D.A.R.E.
• Scientific discovery