TRANSCRIPT
Medical Imaging Informatics Lecture #10:
Clinical Perspective: Single Subject Classification
Susanne Mueller M.D.
Center for Imaging of Neurodegenerative Diseases
Dept. of Radiology and Biomedical Imaging
Overview
1. Single Subject Classification/Characterization: Motivation and Problems
2. Bayesian Networks for Single Subject Classification/Characterization
1. Single Subject Classification/Characterization
Quantitative Neuroimaging: Group Comparisons
Temporal Lobe Epilepsy Posttraumatic Stress Disorder
Major Depression
Implicit assumptions of Group Comparisons:
1. Abnormal regions are relevant/characteristic for the disease process.
2. Abnormalities are present in all patients, i.e., a subject showing abnormalities with a disease-specific distribution is likely to have the disease.
Quantitative Neuroimaging: Do the Assumptions hold up?
Temporal Lobe Epilepsy Posttraumatic Stress Disorder
Major Depression
Motivation
1. Identification of different variants and/or degrees of the disease process.
2. Translation into clinical application.
Requirements
1. Identification and extraction of discriminating feature(s):
- Single region.
- Combination of regions.
2. Definition of a threshold for “abnormality”.
Goal: High sensitivity and specificity.
Sensitivity and Specificity: Definitions I
Sensitivity: Probability that test is positive if the patient indeed has the disease.
P (Test positive|Patient has disease)
Test ideally always detects disease.
Sensitivity and Specificity: Definitions II
Specificity: Probability that test is negative if the patient does not have the disease.
P (Test negative|Patient does not have disease)
Test ideally detects only this disease and not some other non-disease-related state or another disease.
Sensitivity and Specificity
Sensitivity and specificity provide information about a test result given that the patient’s disease state is known.
In the clinic, however, the patient’s disease state is unknown; that is why the test was done in the first place.
=> positive and negative predictive value of the test
Positive and Negative Predictive Value: Definition
Positive predictive value (PPV):
P (Patient has disease|Test positive)
Negative predictive value (NPV):
P (Patient does not have disease|Test negative)
Example
            Disease: pos        Disease: neg       Total
Test: pos   200 (true pos)      20 (false pos)     220 (all pos)
Test: neg   50 (false neg)      300 (true neg)     350 (all neg)
Total       250 (all disease)   320 (no disease)   570 (total patients)

Sensitivity: 0.80   Specificity: 0.94
PPV: 0.90   NPV: 0.86
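The four quantities can be checked directly from the 2×2 table above; a minimal sketch using the example’s counts:

```python
# Test metrics from the lecture's 2x2 example table.
tp, fp = 200, 20   # test positive: true positives / false positives
fn, tn = 50, 300   # test negative: false negatives / true negatives

sensitivity = tp / (tp + fn)   # P(test pos | disease)
specificity = tn / (tn + fp)   # P(test neg | no disease)
ppv = tp / (tp + fp)           # P(disease | test pos); ~0.91, slide rounds to 0.90
npv = tn / (tn + fn)           # P(no disease | test neg)

print(f"sens={sensitivity:.2f} spec={specificity:.2f} ppv={ppv:.2f} npv={npv:.2f}")
```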
Receiver Operating Characteristic Curve: ROC
Sensitivity and Specificity are good candidates to assess test accuracy. However, they vary with the threshold (test pos/test neg) used.
ROC is a means to compare the accuracy of diagnostic tests over a range of thresholds.
The ROC plots the sensitivity vs. 1 − specificity of the test.
Example: ROC
[Figure: ROC curve plotting sensitivity (y-axis) vs. 1 − specificity (x-axis) at four thresholds]
High threshold: good specificity (0.92), medium sensitivity (0.52)
Medium threshold: medium specificity (0.7), medium sensitivity (0.83)
Low threshold: low specificity (0.45), good sensitivity (0.95)
Extremely low threshold: no specificity (0), perfect sensitivity (1)
Example: ROC II
[Figure: example ROC curve with the chance line; the optimal threshold is indicated by an arrow. The ROC of a good test approaches the upper left corner of the plot.]
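The threshold sweep behind an ROC curve can be sketched as follows. The control and patient values here are hypothetical, chosen only to show how sensitivity and 1 − specificity move together as the threshold changes:

```python
# Illustrative ROC sketch (hypothetical data, not from the lecture):
# lower values indicate abnormality (e.g. atrophy), so "test positive"
# means the value falls below the threshold.
controls = [3.2, 3.5, 3.8, 4.0, 4.1, 4.3, 4.5]
patients = [2.1, 2.6, 2.9, 3.0, 3.3, 3.6, 4.0]

def roc_point(threshold):
    sens = sum(p < threshold for p in patients) / len(patients)
    spec = sum(c >= threshold for c in controls) / len(controls)
    return 1 - spec, sens  # (x, y) coordinates on the ROC curve

for thr in (2.5, 3.4, 4.2):  # low, medium, high threshold
    x, y = roc_point(thr)
    print(f"threshold {thr}: 1-spec={x:.2f}, sens={y:.2f}")
```

Sweeping the threshold over the whole value range traces out the full curve; the point closest to the upper left corner marks the optimal trade-off.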
Feature
Definition: Information extracted from the image.
The usefulness of a feature to detect the disease is determined by:
1. Convenience of measurement.
2. Accuracy of measurement.
3. Specificity for the disease (e.g., CK-MB).
4. Number of features (single < several/feature map).
Features and Thresholds used in Imaging for Single-Subject Analyses I
A. Single feature = Region of interest (ROI) analysis
Previous knowledge that the ROI is affected by the disease comes either from previous imaging studies or from other sources, e.g., histopathology.
Approaches used to detect abnormality for ROI analyses:
z-score: z = (x_s − mean(x_c)) / SD_c
t-score*: t = (x_s − mean(x_c)) / (SD_c · sqrt((n + 1)/n))
Bayesian estimate**: one-tailed probability obtained from the Bayesian posterior distribution of the control data
Crawford and Howell 1998*; Crawford and Garthwaite 2006**
Example: ROI Analyses and Thresholds
Hippocampal volumes corrected for intracranial volume obtained from T1 images of 49 age-matched healthy controls (mean: 3.92 ± 0.60), and the hippocampal volume of a patient with medial temporal lobe epilepsy: 3.29.
z-score: −1.05, corresponds to one-tailed p = 0.147
t-score: −1.04, corresponds to one-tailed p = 0.152
Bayesian one-tailed probability: 0.152, i.e., about 15% of control hippocampal volumes are expected to fall below the patient’s volume.
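The z and t statistics of this example can be reproduced directly from the slide’s values. A minimal sketch (the exact one-tailed p for the t-score requires the Student-t CDF with df = n − 1 = 48, which is why it differs slightly from the normal-based p shown here):

```python
from math import sqrt
from statistics import NormalDist

# Single-case ROI statistics; values taken from the lecture's example.
n = 49                       # number of controls
mean_c, sd_c = 3.92, 0.60    # control mean and SD of hippocampal volume
x_s = 3.29                   # patient's volume

z = (x_s - mean_c) / sd_c
t = (x_s - mean_c) / (sd_c * sqrt((n + 1) / n))  # Crawford & Howell (1998)

p_z = NormalDist().cdf(z)    # one-tailed p for the z-score

print(round(z, 2), round(t, 2), round(p_z, 3))
```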
Features and Thresholds used in Imaging for Single-Subject Analyses II
B. Multiple features from the same source = map that encodes severity and distribution of the disease associated abnormality.
Previous knowledge about the distribution/severity of the abnormalities is not mandatory to generate “abnormality” map, i.e., typically whole brain search strategy is employed. However, previous knowledge can be helpful for the correct interpretation.
Approaches used to generate abnormality maps :
z-score maps (continuous or thresholded)
Single-case modification of the General Linear Model used for group analyses.
Features and Thresholds used in Imaging for Single-Subject Analyses III
Problems:
1. Differences may reflect normal individual variability rather than disease effects.
2. Assumption that the single subject represents the mean of a hypothetical population with the same variance as observed in the control group.
3. Higher numbers of comparisons (multiple ROIs/voxel-wise) require:
a. Correction for multiple comparisons.
b. Adjustment of the result at the ROI/voxel level for results in the immediate neighborhood, e.g., correction at the cluster level.
4. Interpretation of the resulting maps.
Influence of Correction for Multiple Comparisons
[Figure: single-subject maps of increases and decreases at FWE p < 0.05, p < 0.01, and p < 0.001]
Scarpazza et al. Neuroimage 2013; 70: 175-188
Interpretation of Single Subject Maps
Potential strategies for map interpretation:
1. Visual inspection using knowledge about the typical distribution of abnormalities in group comparisons.
2. Quantitative comparison with known abnormalities from group comparisons, e.g., calculation of the Dice coefficient for the whole map.
Problems:
1. Requires the existence of a “disease-typical pattern”.
2. Requires selection of a “threshold” indicating whether or not the map matches the typical pattern.
3. Difficulty interpreting severe abnormalities that do not match the typical pattern: atypical presentation? Different disease?
Examples
Gray matter loss in TLE compared to controls
2. Bayesian Networks for Single Subject
Classification/Characterization
Characteristics of an Ideal Classification System
1. Uses non-parametric, non-linear statistics.
2. Identifies characteristic severe and mild brain abnormalities distinguishing between two groups, based on their spatial proximity and the strength of their association with a clinical variable (e.g., group membership).
3. Weights abnormal regions according to their ability to discriminate between the two groups.
4. Provides a probability of group membership and an objective threshold based on the congruence of individual abnormalities with group-specific abnormalities.
5. Uses expert a priori knowledge to combine information from different sources (other imaging modalities, clinical information) for the determination of final group membership.
Bayesian Networks: Basics
Definition: Probabilistic graphical model defined as:
B = (G, Θ)
G is a directed acyclic graph (DAG) defined as G = (V, E), where V represents the set of nodes in the network and E the set of directed edges that describe the probabilistic associations among the nodes.
Θ is the set of all conditional probability distributions that the nodes in the network can assume.
Bayesian Networks: Basics: Simple Network
Event A
Event B
A B Prob (A,B)
true true 0.10
false true 0.40
true false 0.35
false false 0.15
DAG G Joint Probability Distribution
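The joint distribution of this two-node example can be queried directly; a minimal sketch using the table’s values:

```python
# Joint probability distribution over (A, B) from the slide's table.
joint = {
    (True,  True):  0.10,
    (False, True):  0.40,
    (True,  False): 0.35,
    (False, False): 0.15,
}

p_a_true = sum(p for (a, b), p in joint.items() if a)   # marginal P(A = true)
p_b_true = sum(p for (a, b), p in joint.items() if b)   # marginal P(B = true)
p_a_given_b = joint[(True, True)] / p_b_true            # P(A = true | B = true)

print(p_a_true, p_b_true, round(p_a_given_b, 2))
```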
Bayesian Networks: Basics: Slightly more complex Network
Event A Event B
Event C
Bayesian Networks: Basics: It is getting more complicated
A
B C
D E
F
I(V, Parents(V), Non-Descendants(V)), V = any variable in the DAG
Markovian assumption of the DAG: every node is independent of its non-descendants given its parents.
Bayesian Networks: Inference I: Probability of Evidence Query
[Figure: example network (nodes A-E)]
Example query result: Prob(true, true) = 0.30
Bayesian Networks: Inference II: Prior and Posterior Marginal Query
[Figure: example network (nodes A-E). Prior marginals at the nodes, e.g.: true = 0.52 / false = 0.48; true = 0.60 / false = 0.40; true = 0.42 / false = 0.58; true = 0.70 / false = 0.30; true = 0.36 / false = 0.64. After entering evidence at one node (true = 1.0, false = 0.00), the posterior marginals update, e.g., to: true = 0.92 / false = 0.08; true = 0.24 / false = 0.76; true = 0.84 / false = 0.16.]
Definition: Marginal: projection of the joint distribution onto a smaller set of variables (prior marginal vs. posterior marginal).
If the joint probability distribution is Pr(x1, …, xn), then the marginal distribution Pr(x1, …, xm), m ≤ n, is defined as:
Pr(x1, …, xm) = Σ_{x_{m+1}, …, x_n} Pr(x1, …, xn)
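This summing-out operation can be written generically; a minimal sketch over a toy joint distribution (the uniform three-variable joint below is hypothetical):

```python
from itertools import product

# Marginalization per the slide's definition: sum the joint over the
# variables that are not kept.
def marginal(joint, keep):
    """joint: dict mapping full assignments (tuples) to probabilities;
    keep: indices of the variables to keep."""
    out = {}
    for assign, p in joint.items():
        key = tuple(assign[i] for i in keep)
        out[key] = out.get(key, 0.0) + p
    return out

# Toy joint over three binary variables (uniform, hypothetical).
joint = {a: 1 / 8 for a in product([False, True], repeat=3)}
print(marginal(joint, keep=[0]))
```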
Bayesian Networks: Inference III: Most Probable Explanation (MPE) and Maximum a Posteriori Hypothesis (MAP)
[Figure: example network (nodes A-E)]
Definition: MPE = Given evidence for one network variable, the instantiation of all other network variables for which the probability of the given variable is maximal.
MAP = Given evidence for one network variable, the instantiation of a subset of network variables for which the probability of the given variable is maximal.
Example: Evidence MPE: D = true
Bayesian Networks: Inference IV:
Different algorithms have been developed to update the remaining network after observation of other network variables.
Examples for exact inference algorithms:
Variable or factor elimination
Recursive conditioning
Clique tree propagation
Belief propagation
Examples for approximate inference algorithms:
Generalized belief propagation
Loopy belief propagation
Importance sampling
Mini-bucket elimination
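These algorithms all improve on brute-force enumeration, which is nevertheless the clearest way to see what a probability-of-evidence or posterior-marginal query computes. A sketch on a toy chain network A → B → C; all conditional probability tables are hypothetical:

```python
from itertools import product

# Toy chain network A -> B -> C with hypothetical CPTs.
p_a = {True: 0.6, False: 0.4}
p_b_given_a = {True: {True: 0.7, False: 0.3}, False: {True: 0.2, False: 0.8}}
p_c_given_b = {True: {True: 0.9, False: 0.1}, False: {True: 0.4, False: 0.6}}

def joint(a, b, c):
    # Chain rule for a Bayesian network: product of the local CPT entries.
    return p_a[a] * p_b_given_a[a][b] * p_c_given_b[b][c]

# Probability-of-evidence query: P(C = true)
p_c_true = sum(joint(a, b, True) for a, b in product([True, False], repeat=2))

# Posterior marginal query: P(A = true | C = true)
p_a_and_c = sum(joint(True, b, True) for b in [True, False])
print(round(p_c_true, 3), round(p_a_and_c / p_c_true, 3))
```

Exact algorithms such as variable elimination compute the same quantities while avoiding this exponential enumeration.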
Bayesian Networks: Learning I: Parameter/Structure
A
B C
D E
Bayesian Networks: Learning II:
Parameter Learning
1. Expert Knowledge
2. Data driven
a. Maximum likelihood (complete data)
b. Expectation maximization (incomplete data).
c. Bayesian approach
Structure Learning
1. Expert Knowledge
2. Data driven:
a. Local search approach
b. Score-based approach: greedy search (K2, K3), optimal search
c. Bayesian approach
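For complete data, maximum-likelihood parameter learning reduces to counting conditional frequencies; a minimal sketch with hypothetical observations of two binary variables:

```python
# Maximum-likelihood parameter learning from complete data:
# CPT entries are conditional frequencies. Observations are hypothetical.
data = [  # (A, B) observations
    (True, True), (True, True), (True, False),
    (False, True), (False, False), (False, False),
]

n_a = sum(1 for a, _ in data if a)
p_a = n_a / len(data)                                       # P(A = true)
p_b_given_a = sum(1 for a, b in data if a and b) / n_a      # P(B = true | A = true)

print(p_a, round(p_b_given_a, 2))
```

With incomplete data, these counts are replaced by expected counts, which is what expectation maximization iterates over.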
Bayesian Networks: Application to Image Analysis?
YES
1. Identification of features distinguishing between groups.
2. Combination of different distinguishing imaging features, e.g., volumetric and functional imaging.
Bayesian Network: Basics: Feature Identification I
Characterization of the problem
1. Parameter and structure learning:
a. Representative training data set.
b. Information reduction.
c. Definition of network nodes.
d. Definition of possible node states.
e. Calculation of the strength of association between image feature and variable of interest.
2. Network query:
a. Calculation of group affiliation based on concordance with the feature set identified during the learning process.
[Figure: workflow: preparatory steps, structure learning, parameter learning]
GAMMA: Graphical Model-Based Morphometric Analysis*
Bayesian Network: Basics: Feature Identification II
Chen R, Herskovits E. IEEE Transactions on Medical Imaging: 2005; 24: 1237 – 1248
GAMMA: Preparatory Steps I
1. Identification of the training set:
Images of patients and controls, or of subjects with and without the functional variable of interest for the Bayesian network.
Representative of the population, i.e., encompassing the variability typically found in each population.
GAMMA: Preparatory Steps II
2. Data Reduction
Use of prior knowledge regarding the nature of the feature, e.g., reduction of the information in the image to regions with relative volume loss if the disease is associated with atrophy.
Creation of binary images: each individual image is compared to a mean image, and voxels with intensities below a predefined threshold, e.g., 1 SD below the control mean, are set to 1; all other voxels are set to 0.
GAMMA: Preparatory Steps II
Each binary map can be represented as {F, V1, V2, V3, …, Vm}, where F represents the state, i.e., patient or control, and Vi represents the voxel at location i. Given this definition, a voxel Vi with the value 1 indicates volume loss. The choice of images used to generate the mean/SD image and the threshold for binarization are crucial for performance.
Data Reduction
[Figure: mean and SD images; original control and patient images; binarized control and patient maps (1 SD below mean)]
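The binarization step can be sketched on toy data; the 4-voxel “images” below are hypothetical:

```python
from statistics import mean, stdev

# Binarization sketch: a voxel is set to 1 ("abnormal") if the patient's
# value lies more than 1 SD below the control mean at that voxel.
control_maps = [          # hypothetical 4-voxel control images
    [4.0, 3.8, 4.2, 4.1],
    [3.9, 4.1, 4.0, 4.3],
    [4.1, 4.0, 3.9, 4.0],
]
patient = [3.1, 4.0, 4.1, 3.0]

mu = [mean(v) for v in zip(*control_maps)]    # voxel-wise control mean
sd = [stdev(v) for v in zip(*control_maps)]   # voxel-wise control SD

binary = [1 if x < m - s else 0 for x, m, s in zip(patient, mu, sd)]
print(binary)
```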
GAMMA: Structure Learning: Theoretical Steps
1. Generate Bayesian Network that identifies the probabilistic relationship among {Vi} and F.
2. Generate cluster(s) of representative voxels R (output: label map) such that all voxels in a cluster have similar probabilistic associations with F (output: belief map). All clusters are independent of each other, and each cluster corresponds to a node.
GAMMA: Structure Learning Practical I
Step 1
a. Definition of the search space V, e.g., all voxels where at least one subject has a value that differs from every other subject’s value for that voxel.
b. Identification of the first search-space voxel(s) that provide optimal distinction between the states F, e.g., all controls 0, all patients 1. These voxels are assigned to the putative group of representative voxels A.
Group A, n = 10, “Controls”; Group B, n = 10, “Patients”
Disease characterized by atrophy, i.e., “1” voxels, compared to controls.
Toy grid; each cell shows the number of Group A subjects / Group B subjects with voxel value 1:
0/0  0/1  3/6  1/1  0/0
1/1  3/9  4/9  3/9  1/0
0/1  1/10 2/9  0/9  2/1
0/0  1/9  1/9  0/9  3/4
0/0  3/0  1/0  1/1  4/3
[Figure: the toy grid repeated, highlighting the search space and the representative voxels of the 1st iteration]
GAMMA: Structure Learning Practical II
Step 1 (cont.)
c. Identification of voxel(s) whose addition to A increases the ability of A to correctly distinguish between the states F. The process is repeated until no voxel is left that fulfills this condition.
d. Identification of all voxels Rn in A that maximize the distinction between the states F. The Rn of the first iteration corresponds to R (the Rn of later iterations are added to R). Voxels belonging to Rn are removed from the search space V.
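The greedy selection in steps 1b-d can be sketched on toy counts. The counts below are hypothetical (chosen for illustration, not taken from the lecture’s grid), in the same (controls with voxel = 1, patients with voxel = 1) format, with n = 10 subjects per group:

```python
# Greedy voxel selection sketch: pick the voxel whose "voxel = 1 => patient"
# rule classifies the most subjects correctly. Counts are hypothetical.
counts = {"v1": (1, 9), "v2": (2, 9), "v3": (0, 9), "v4": (3, 4)}

def accuracy(voxel):
    c, p = counts[voxel]
    # controls correctly called negative + patients correctly called positive
    return ((10 - c) + p) / 20

best = max(counts, key=accuracy)
print(best, accuracy(best))
```

The full algorithm repeats this selection, each time asking whether adding another voxel improves the discrimination of the set A as a whole.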
GAMMA: Structure Learning Practical III
Step 2 (iteration 2 and higher)
a. Calculation of the similarity s between the voxels in A and the voxels in Rn-1. The similarity s for one voxel Vi in A is defined as
s(Vi, Rn-1) = P(Vi = 1, Rn-1 = 1) + P(Vi = 0, Rn-1 = 0)
The similarity of all n voxels in A is expressed as a similarity map S:
S = {s(V1, Rn-1), s(V2, Rn-1), …, s(Vn, Rn-1)}
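The similarity measure can be checked on toy data; the binary values across six hypothetical subjects below are illustrative only:

```python
# Similarity sketch: s(Vi, R) = P(Vi = 1, R = 1) + P(Vi = 0, R = 0),
# i.e., the fraction of subjects where voxel Vi agrees with the
# representative voxel R. Data are hypothetical.
R  = [1, 1, 1, 0, 0, 0]   # representative voxel across six subjects
Vi = [1, 1, 0, 0, 0, 1]   # candidate voxel across the same subjects

n = len(R)
both_one  = sum(1 for v, r in zip(Vi, R) if v == 1 and r == 1) / n  # P(Vi=1, R=1)
both_zero = sum(1 for v, r in zip(Vi, R) if v == 0 and r == 0) / n  # P(Vi=0, R=0)
s = both_one + both_zero
print(round(s, 3))
```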
GAMMA: Structure Learning Practical IV
Step 2 (iteration 2 and higher), cont.
b. Initial random assignment of a label L (patient or control) to each voxel in A. Voxels with the same label are in the same cluster. Initially there are at most two clusters: a cluster of voxels with 100% probability of being “patient” and a cluster of voxels with 100% probability of being “control”. During the optimization the probabilities are adjusted, and so is the number of clusters; the global variance criterion is used to determine the optimal number of clusters. The labels L of all n voxels define the label map L.
[Figure: the toy grids for Group A (“Controls”) and Group B (“Patients”), with initial cluster labels assigned to individual voxels]
GAMMA: Structure Learning Practical IV
c. Using the similarity map S and the initial label map L as input, the problem reduces to finding a posterior MAP estimate of L given S.
The Markov random field (MRF) component sets a penalty for different patterns: a low penalty for combining voxels with similar probabilistic associations and spatial closeness.
The “Bayesian component” describes the relationship between the similarity map and the label map.
Loopy belief propagation is used for label map/belief map inference.
GAMMA: Structure/Parameter Learning Practical V
Step 3. Update of the cluster(s) of representative voxels R from the previous iterations (n−1) by adding Ln and Bn to generate Lall and Ball. Voxels belonging to Rn are removed from the search space V.
A new iteration (steps 1-3) is started until no voxels are left in the remaining search space V. The Lall and Ball of the last iteration are defined as Lfinal and Bfinal.
GAMMA: Structure/Parameter Learning Practical VI
Step 4. Validation of Lfinal and Bfinal using the jackknife method. The resulting sampling distribution is used to generate a p-map in which each p-value indicates the likelihood of an outcome as extreme as, or more extreme than, that observed.
Step 5. Regional state inference: group assignment of each subject in the training set based on the correspondence of the individual abnormalities with Lfinal. The observed group membership of the training set and the inferred group membership based on regional state inference are used as the parameter set for the DAG.
GAMMA: Structure/Parameter Learning: Outputs
[Figure: output example: network (Event A, Event B) with state probabilities C: 0.729, P: 0.270; label map Lfinal; belief map Bfinal]
GAMMA vs GLM
[Figure: GAMMA label map vs. SPM GLM map (FDR 0.05)]
GAMMA vs. GLM I
GLM:
- normal distribution
- parametric statistics
- linear state-image feature association
- detects: segregation
GAMMA:
- normal/non-normal distribution
- non-parametric statistics
- probabilistic state-image feature association
- detects: segregation, degeneration, integration
GAMMA vs. GLM II
[Figure: Group A vs. Group B patterns analyzed with GLM and GAMMA for segregation, degeneration, and integration]
Bayesian Networks: Combination of Features
GAMMA uses a Bayesian network approach to identify features of a single image modality that distinguish between two groups, e.g., patients vs. controls. However, this scenario does not fully reflect the questions that need to be answered in clinical practice:
A. The question is often not only whether a subject is a patient, but also what type of patient the subject is.
B. Often there is information from several sources (imaging, other exams) that can be confirmatory but also conflicting.
=> A classical problem for the “conventional” Bayesian network approach.
Bayesian Networks: Multi-Level Application I
Example:
Three types of focal non-lesional epilepsy with similar clinical manifestations, and controls with a matching imaging protocol:
A. Temporal lobe epilepsy with mesial temporal sclerosis
B. Temporal lobe epilepsy with normal MRI
C. Frontal lobe epilepsy with normal MRI and different semiology
Two MR imaging modalities:
A. Structural whole-brain T1 for volumetry = gray matter loss
B. Whole-brain DTI = white matter abnormalities
GOAL: A Bayesian network classifier that calculates the probability of a patient belonging to one of the three types based on imaging features.
Bayesian Networks: Multi-Level Application II
Strategy:
1. First level: full characterization of GM and WM imaging features in each group using GAMMA. Each group is compared against each other group, i.e., a total of 12 whole-brain comparisons and 1 region-of-interest (hippocampus) comparison.
2. Second level: combination of the imaging information, including one clinical variable (seizures yes/no), into a Bayesian network that makes it possible to calculate the probability of a patient belonging to one of the three epilepsy types (simple evidence query).
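The second-level combination can be sketched as a simple Bayesian fusion of per-modality likelihoods. All numbers below are hypothetical, chosen only to illustrate the mechanics of the evidence query, not taken from the study:

```python
# Naive Bayesian fusion sketch: combine per-modality evidence into a
# posterior over epilepsy types. All probabilities are hypothetical.
types = ["TLE-MTS", "TLE-normal", "FLE-normal"]
prior = {t: 1 / 3 for t in types}                              # uniform prior
p_gm = {"TLE-MTS": 0.9, "TLE-normal": 0.4, "FLE-normal": 0.3}  # P(GM pattern | type)
p_wm = {"TLE-MTS": 0.7, "TLE-normal": 0.5, "FLE-normal": 0.6}  # P(WM pattern | type)

unnorm = {t: prior[t] * p_gm[t] * p_wm[t] for t in types}
z = sum(unnorm.values())
post = {t: unnorm[t] / z for t in types}                       # P(type | GM, WM)

print(max(post, key=post.get))
```

The actual network additionally conditions on the clinical variable (seizures yes/no) and on the pattern of hippocampal findings, but the query has the same form.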
Bayesian Networks: First Level: Characterization of GM Loss
Bayesian Networks: First Level: Characterization of WM Integrity Loss
Bayesian Networks: Second Level
Results I
TLE with sclerosis: 84.5% correctly classified, 15.8% incorrectly classified, 0% not classified
TLE with normal MRI: 59.1% correctly classified, 22.7% incorrectly classified, 18.2% not classified
FLE with normal MRI: 50% correctly classified, 28.6% incorrectly classified, 21.4% not classified
Not classified: abnormalities in both modalities not exceeding those found in controls.
Results II
Summary: Classifier using Bayesian Networks
Bayesian networks can be used at several stages of the image processing and analysis.
Bayesian networks are ideal for combining information from different imaging modalities, but also from other sources, e.g., clinical, metabolomic, genetic, etc.
Bayesian networks do not depend on the assumptions of classical parametric statistics.
Bayesian networks provide the probability of belonging to a certain group, i.e., they are threshold-free.
Bayesian networks show some promise of being useful for “subtype” identification.
References
Crawford JR, Howell DC. Comparing an individual’s test score against norms derived from small samples. The Clinical Neuropsychologist 1998; 12: 482-486.
Crawford JR, Garthwaite PH. Comparison of a single case to a control or normative sample in neuropsychology: Development of a Bayesian approach. Cognitive Neuropsychology 2007; 24: 343-372.
Scarpazza C, Sartori G, De Simone MS, Mechelli A. When the single matters more than the group: Very high false positive rates in single case voxel-based morphometry. Neuroimage 2013; 70: 175-188.
Darwiche A. Modeling and Reasoning with Bayesian Networks. Cambridge University Press 2009.
Chen R, Herskovits EH. Graphical-Model-Based Morphometric Analysis. IEEE Transactions on Medical Imaging 2005; 24: 1237-1248.
Chen R, Herskovits EH. Graphical-model-based multivariate analysis of functional magnetic resonance data. Neuroimage 2007; 35: 635-647.
Chen R, Herskovits EH. Graphical-model-based multivariate analysis (GAMMA): An open source, cross-platform neuroimaging data analysis software package. Neuroinformatics. DOI 10.1007/s12021-011-9129-7.
Mueller SG, Young K, Hartig M, Barakos J, Garcia P, Laxer KD. A two-level multimodality imaging Bayesian network approach for classification of partial epilepsy: Preliminary data. Neuroimage 2013; 71: 224-232.
Software
http://homepages.abdn.ac.uk/j.crawford/pages/dept/SingleCaseMethodsComputerPrograms.HTM
http://genie.sis.pitt.edu/
http://reasoning.cs.ucla.edu/samiam/
GAMMA: http://www.nitrc.org