TRANSCRIPT
Medical Imaging Informatics Lecture #10:
Clinical Perspective: Single Subject Classification
Susanne Mueller M.D.
Center for Imaging of Neurodegenerative Diseases
Dept. of Radiology and Biomedical Imaging
Overview
1. Single Subject Classification/Characterization: Motivation and Problems
2. Bayesian Networks for Single Subject Classification/Characterization
1. Single Subject Classification/Characterization
Quantitative Neuroimaging: Group Comparisons
Temporal Lobe Epilepsy Posttraumatic Stress Disorder
Major Depression
Implicit assumptions of Group Comparisons:
1. Abnormal regions are relevant/characteristic for the disease process.
2. Abnormalities are present in all patients, i.e., a subject showing abnormalities with a disease-specific distribution is likely to have the disease.
Quantitative Neuroimaging: Do the Assumptions hold up?
Temporal Lobe Epilepsy Posttraumatic Stress Disorder
Major Depression
Motivation
1. Identification of different variants and/or degrees of the disease process.
2. Translation into clinical application.
Requirements
1. Identification and extraction of discriminating feature(s):
- Single region.
- Combination of regions.
2. Definition of a threshold for “abnormality”.
Goal: High sensitivity and specificity.
Sensitivity and Specificity: Definitions I
Sensitivity: Probability that test is positive if the patient indeed has the disease.
P (Test positive|Patient has disease)
Test ideally always detects disease.
Sensitivity and Specificity: Definitions II
Specificity: Probability that test is negative if the patient does not have the disease.
P (Test negative|Patient does not have disease)
Test ideally detects only this disease and not some other non-disease-related state or another disease.
Sensitivity and Specificity
Sensitivity and specificity provide information about a test result given that the patient’s disease state is known.
In the clinic, however, the patient’s disease state is unknown; that is why the test was done in the first place.
=> positive and negative predictive value of the test
Positive and Negative Predictive Value: Definition
Positive predictive value (PPV):
P (Patient has disease|Test positive)
Negative predictive value (NPV):
P (Patient does not have disease|Test negative)
Example
            Disease: pos        Disease: neg       Total
Test: pos   200 (true pos)      20 (false pos)     220 (all pos)
Test: neg   50 (false neg)      300 (true neg)     350 (all neg)
Total       250 (all disease)   320 (no disease)   570 (total patients)

Sensitivity: 0.80   Specificity: 0.94
PPV: 0.90   NPV: 0.86
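The four quantities can be checked directly from the 2×2 table above; a minimal sketch using the example’s counts:

```python
# Test metrics from the lecture's 2x2 example table.
tp, fp = 200, 20   # test positive: true positives / false positives
fn, tn = 50, 300   # test negative: false negatives / true negatives

sensitivity = tp / (tp + fn)   # P(test pos | disease)
specificity = tn / (tn + fp)   # P(test neg | no disease)
ppv = tp / (tp + fp)           # P(disease | test pos); ~0.91, slide rounds to 0.90
npv = tn / (tn + fn)           # P(no disease | test neg)

print(f"sens={sensitivity:.2f} spec={specificity:.2f} ppv={ppv:.2f} npv={npv:.2f}")
```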
Receiver Operating Characteristic Curve: ROC
Sensitivity and Specificity are good candidates to assess test accuracy. However, they vary with the threshold (test pos/test neg) used.
ROC is a means to compare the accuracy of diagnostic tests over a range of thresholds.
The ROC plots the sensitivity vs. 1 − specificity of the test.
Example: ROC
[Figure: ROC curve plotting sensitivity (y-axis) vs. 1 − specificity (x-axis) at four thresholds]
High threshold: good specificity (0.92), medium sensitivity (0.52)
Medium threshold: medium specificity (0.7), medium sensitivity (0.83)
Low threshold: low specificity (0.45), good sensitivity (0.95)
Extremely low threshold: no specificity (0), perfect sensitivity (1)
Example: ROC II
[Figure: example ROC curve with the chance line; the optimal threshold is indicated by an arrow. The ROC of a good test approaches the upper left corner of the plot.]
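The threshold sweep behind an ROC curve can be sketched as follows. The control and patient values here are hypothetical, chosen only to show how sensitivity and 1 − specificity move together as the threshold changes:

```python
# Illustrative ROC sketch (hypothetical data, not from the lecture):
# lower values indicate abnormality (e.g. atrophy), so "test positive"
# means the value falls below the threshold.
controls = [3.2, 3.5, 3.8, 4.0, 4.1, 4.3, 4.5]
patients = [2.1, 2.6, 2.9, 3.0, 3.3, 3.6, 4.0]

def roc_point(threshold):
    sens = sum(p < threshold for p in patients) / len(patients)
    spec = sum(c >= threshold for c in controls) / len(controls)
    return 1 - spec, sens  # (x, y) coordinates on the ROC curve

for thr in (2.5, 3.4, 4.2):  # low, medium, high threshold
    x, y = roc_point(thr)
    print(f"threshold {thr}: 1-spec={x:.2f}, sens={y:.2f}")
```

Sweeping the threshold over the whole value range traces out the full curve; the point closest to the upper left corner marks the optimal trade-off.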
Feature
Definition: Information extracted from the image.
The usefulness of a feature to detect the disease is determined by:
1. Convenience of measurement.
2. Accuracy of measurement.
3. Specificity for the disease (e.g., CK-MB).
4. Number of features (single < several/feature map).
Features and Thresholds used in Imaging for Single-Subject Analyses I
A. Single feature = Region of interest (ROI) analysis
Previous knowledge that the ROI is affected by the disease comes either from previous imaging studies or from other sources, e.g., histopathology.
Approaches used to detect abnormality for ROI analyses:
z-score: z = (x_s − mean(x_c)) / SD_c
t-score*: t = (x_s − mean(x_c)) / (SD_c · sqrt((n + 1)/n))
Bayesian estimate**: one-tailed probability obtained from the Bayesian posterior distribution of the control data
Crawford and Howell 1998*; Crawford and Garthwaite 2006**
Example: ROI Analyses and Thresholds
Hippocampal volumes corrected for intracranial volume obtained from T1 images of 49 age-matched healthy controls (mean: 3.92 ± 0.60), and the hippocampal volume of a patient with medial temporal lobe epilepsy: 3.29.
z-score: −1.05, corresponds to one-tailed p = 0.147
t-score: −1.04, corresponds to one-tailed p = 0.152
Bayesian one-tailed probability: 0.152, i.e., about 15% of control hippocampal volumes are expected to fall below the patient’s volume.
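The z and t statistics of this example can be reproduced directly from the slide’s values. A minimal sketch (the exact one-tailed p for the t-score requires the Student-t CDF with df = n − 1 = 48, which is why it differs slightly from the normal-based p shown here):

```python
from math import sqrt
from statistics import NormalDist

# Single-case ROI statistics; values taken from the lecture's example.
n = 49                       # number of controls
mean_c, sd_c = 3.92, 0.60    # control mean and SD of hippocampal volume
x_s = 3.29                   # patient's volume

z = (x_s - mean_c) / sd_c
t = (x_s - mean_c) / (sd_c * sqrt((n + 1) / n))  # Crawford & Howell (1998)

p_z = NormalDist().cdf(z)    # one-tailed p for the z-score

print(round(z, 2), round(t, 2), round(p_z, 3))
```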
Features and Thresholds used in Imaging for Single-Subject Analyses II
B. Multiple features from the same source = map that encodes severity and distribution of the disease associated abnormality.
Previous knowledge about the distribution/severity of the abnormalities is not mandatory to generate “abnormality” map, i.e., typically whole brain search strategy is employed. However, previous knowledge can be helpful for the correct interpretation.
Approaches used to generate abnormality maps :
z-score maps (continuous or thresholded)
Single-case modification of the General Linear Model used for group analyses.
Features and Thresholds used in Imaging for Single-Subject Analyses III
Problems:
1. Differences may reflect normal individual variability rather than disease effects.
2. Assumption that the single subject represents the mean of a hypothetical population with the same variance as observed in the control group.
3. Higher numbers of comparisons (multiple ROIs/voxel-wise) require:
a. Correction for multiple comparisons.
b. Adjustment of the result at the ROI/voxel level for results in the immediate neighborhood, e.g., correction at the cluster level.
4. Interpretation of the resulting maps.
Influence of Correction for Multiple Comparisons
[Figure: single-subject maps of increases and decreases at FWE p < 0.05, p < 0.01, and p < 0.001]
Scarpazza et al. Neuroimage 2013; 70: 175-188
Interpretation of Single Subject Maps
Potential strategies for map interpretation:
1. Visual inspection using knowledge about the typical distribution of abnormalities in group comparisons.
2. Quantitative comparison with known abnormalities from group comparisons, e.g., calculation of the Dice coefficient for the whole map.
Problems:
1. Requires the existence of a “disease-typical pattern”.
2. Requires selection of a “threshold” indicating whether or not the map matches the typical pattern.
3. Difficulty interpreting severe abnormalities that do not match the typical pattern: atypical presentation? Different disease?
Examples
Gray matter loss in TLE compared to controls
2. Bayesian Networks for Single Subject
Classification/Characterization
Characteristics of an Ideal Classification System
1. Uses non-parametric, non-linear statistics.
2. Identifies characteristic severe and mild brain abnormalities distinguishing between two groups, based on their spatial proximity and the strength of their association with a clinical variable (e.g., group membership).
3. Weights abnormal regions according to their ability to discriminate between the two groups.
4. Provides a probability of group membership and an objective threshold based on the congruence of individual abnormalities with group-specific abnormalities.
5. Uses expert a priori knowledge to combine information from different sources (other imaging modalities, clinical information) for the determination of final group membership.
Bayesian Networks: Basics
Definition: Probabilistic graphical model defined as:
B = (G, Θ)
G is a directed acyclic graph (DAG) defined as G = (V, E), where V represents the set of nodes in the network and E the set of directed edges that describe the probabilistic associations among the nodes.
Θ is the set of all conditional probability distributions that the nodes in the network can assume.
Bayesian Networks: Basics: Simple Network
Event A
Event B
A B Prob (A,B)
true true 0.10
false true 0.40
true false 0.35
false false 0.15
DAG G Joint Probability Distribution
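The joint distribution of this two-node example can be queried directly; a minimal sketch using the table’s values:

```python
# Joint probability distribution over (A, B) from the slide's table.
joint = {
    (True,  True):  0.10,
    (False, True):  0.40,
    (True,  False): 0.35,
    (False, False): 0.15,
}

p_a_true = sum(p for (a, b), p in joint.items() if a)   # marginal P(A = true)
p_b_true = sum(p for (a, b), p in joint.items() if b)   # marginal P(B = true)
p_a_given_b = joint[(True, True)] / p_b_true            # P(A = true | B = true)

print(p_a_true, p_b_true, round(p_a_given_b, 2))
```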
Bayesian Networks: Basics: Slightly more complex Network
Event A Event B
Event C
Bayesian Networks: Basics: It is getting more complicated
A
B C
D E
F
I(V, Parents(V), Non-Descendants(V)), V = any variable in the DAG
Markovian assumption of the DAG: every node is independent of its non-descendants given its parents.
Bayesian Networks: Inference I: Probability of Evidence Query
[Figure: example network (nodes A-E)]
Example query result: Prob(true, true) = 0.30
Bayesian Networks: Inference II: Prior and Posterior Marginal Query
[Figure: example network (nodes A-E). Prior marginals at the nodes, e.g.: true = 0.52 / false = 0.48; true = 0.60 / false = 0.40; true = 0.42 / false = 0.58; true = 0.70 / false = 0.30; true = 0.36 / false = 0.64. After entering evidence at one node (true = 1.0, false = 0.00), the posterior marginals update, e.g., to: true = 0.92 / false = 0.08; true = 0.24 / false = 0.76; true = 0.84 / false = 0.16.]
Definition: Marginal: projection of the joint distribution onto a smaller set of variables (prior marginal vs. posterior marginal).
If the joint probability distribution is Pr(x1, …, xn), then the marginal distribution Pr(x1, …, xm), m ≤ n, is defined as:
Pr(x1, …, xm) = Σ_{x_{m+1}, …, x_n} Pr(x1, …, xn)
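This summing-out operation can be written generically; a minimal sketch over a toy joint distribution (the uniform three-variable joint below is hypothetical):

```python
from itertools import product

# Marginalization per the slide's definition: sum the joint over the
# variables that are not kept.
def marginal(joint, keep):
    """joint: dict mapping full assignments (tuples) to probabilities;
    keep: indices of the variables to keep."""
    out = {}
    for assign, p in joint.items():
        key = tuple(assign[i] for i in keep)
        out[key] = out.get(key, 0.0) + p
    return out

# Toy joint over three binary variables (uniform, hypothetical).
joint = {a: 1 / 8 for a in product([False, True], repeat=3)}
print(marginal(joint, keep=[0]))
```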
Bayesian Networks: Inference III: Most Probable Explanation (MPE) and Maximum a Posteriori Hypothesis (MAP)
[Figure: example network (nodes A-E)]
Definition: MPE = Given evidence for one network variable, the instantiation of all other network variables for which the probability of the given variable is maximal.
MAP = Given evidence for one network variable, the instantiation of a subset of network variables for which the probability of the given variable is maximal.
Example: Evidence MPE: D = true
Bayesian Networks: Inference IV:
Different algorithms have been developed to update the remaining network after observation of other network variables.
Examples for exact inference algorithms:
Variable or factor elimination
Recursive conditioning
Clique tree propagation
Belief propagation
Examples for approximate inference algorithms:
Generalized belief propagation
Loopy belief propagation
Importance sampling
Mini-bucket elimination
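These algorithms all improve on brute-force enumeration, which is nevertheless the clearest way to see what a probability-of-evidence or posterior-marginal query computes. A sketch on a toy chain network A → B → C; all conditional probability tables are hypothetical:

```python
from itertools import product

# Toy chain network A -> B -> C with hypothetical CPTs.
p_a = {True: 0.6, False: 0.4}
p_b_given_a = {True: {True: 0.7, False: 0.3}, False: {True: 0.2, False: 0.8}}
p_c_given_b = {True: {True: 0.9, False: 0.1}, False: {True: 0.4, False: 0.6}}

def joint(a, b, c):
    # Chain rule for a Bayesian network: product of the local CPT entries.
    return p_a[a] * p_b_given_a[a][b] * p_c_given_b[b][c]

# Probability-of-evidence query: P(C = true)
p_c_true = sum(joint(a, b, True) for a, b in product([True, False], repeat=2))

# Posterior marginal query: P(A = true | C = true)
p_a_and_c = sum(joint(True, b, True) for b in [True, False])
print(round(p_c_true, 3), round(p_a_and_c / p_c_true, 3))
```

Exact algorithms such as variable elimination compute the same quantities while avoiding this exponential enumeration.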
Bayesian Networks: Learning I: Parameter/Structure
A
B C
D E
Bayesian Networks: Learning II:
Parameter Learning
1. Expert Knowledge
2. Data driven
a. Maximum likelihood (complete data)
b. Expectation maximization (incomplete data).
c. Bayesian approach
Structure Learning
1. Expert Knowledge
2. Data driven:
a. Local search approach
b. Score-based approach: greedy search (K2, K3), optimal search
c. Bayesian approach
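For complete data, maximum-likelihood parameter learning reduces to counting conditional frequencies; a minimal sketch with hypothetical observations of two binary variables:

```python
# Maximum-likelihood parameter learning from complete data:
# CPT entries are conditional frequencies. Observations are hypothetical.
data = [  # (A, B) observations
    (True, True), (True, True), (True, False),
    (False, True), (False, False), (False, False),
]

n_a = sum(1 for a, _ in data if a)
p_a = n_a / len(data)                                       # P(A = true)
p_b_given_a = sum(1 for a, b in data if a and b) / n_a      # P(B = true | A = true)

print(p_a, round(p_b_given_a, 2))
```

With incomplete data, these counts are replaced by expected counts, which is what expectation maximization iterates over.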
Bayesian Networks: Application to Image Analysis?
YES
1. Identification of features distinguishing between groups.
2. Combination of different distinguishing imaging features, e.g., volumetric and functional imaging.
Bayesian Network: Basics: Feature Identification I
Characterization of the problem
1. Parameter and structure learning:
a. Representative training data set.
b. Information reduction.
c. Definition of network nodes.
d. Definition of possible node states.
e. Calculation of the strength of association between image feature and variable of interest.
2. Network query:
a. Calculation of group affiliation based on concordance with the feature set identified during the learning process.
[Figure: workflow: preparatory steps, structure learning, parameter learning]
GAMMA: Graphical Model-Based Morphometric Analysis*
Bayesian Network: Basics: Feature Identification II
Chen R, Herskovits E. IEEE Transactions on Medical Imaging: 2005; 24: 1237 – 1248
GAMMA: Preparatory Steps I
1. Identification of the training set:
Images of patients and controls, or of subjects with and without the functional variable of interest for the Bayesian network.
Representative of the population, i.e., encompassing the variability typically found in each population.
GAMMA: Preparatory Steps II
2. Data Reduction
Use of prior knowledge regarding the nature of the feature, e.g., reduction of the information in the image to regions with relative volume loss if the disease is associated with atrophy.
Creation of binary images: each individual image is compared to a mean image, and voxels with intensities below a predefined threshold, e.g., 1 SD below the control mean, are set to 1; all other voxels are set to 0.
GAMMA: Preparatory Steps II
Each binary map can be represented as {F, V1, V2, V3, …, Vm}, where F represents the state, i.e., patient or control, and Vi represents the voxel at location i. Given this definition, a voxel Vi with the value 1 indicates volume loss. The choice of images used to generate the mean/SD image and the threshold for binarization are crucial for performance.
Data Reduction
[Figure: mean and SD images; original control and patient images; binarized control and patient maps (1 SD below mean)]
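The binarization step can be sketched on toy data; the 4-voxel “images” below are hypothetical:

```python
from statistics import mean, stdev

# Binarization sketch: a voxel is set to 1 ("abnormal") if the patient's
# value lies more than 1 SD below the control mean at that voxel.
control_maps = [          # hypothetical 4-voxel control images
    [4.0, 3.8, 4.2, 4.1],
    [3.9, 4.1, 4.0, 4.3],
    [4.1, 4.0, 3.9, 4.0],
]
patient = [3.1, 4.0, 4.1, 3.0]

mu = [mean(v) for v in zip(*control_maps)]    # voxel-wise control mean
sd = [stdev(v) for v in zip(*control_maps)]   # voxel-wise control SD

binary = [1 if x < m - s else 0 for x, m, s in zip(patient, mu, sd)]
print(binary)
```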
GAMMA: Structure Learning: Theoretical Steps
1. Generate Bayesian Network that identifies the probabilistic relationship among {Vi} and F.
2. Generate cluster(s) of representative voxels R (output: label map) such that all voxels in a cluster have similar probabilistic associations with F (output: belief map). All clusters are independent of each other, and each cluster corresponds to a node.
GAMMA: Structure Learning Practical I
Step 1
a. Definition of the search space V, e.g., all voxels where at least one subject has a value that differs from every other subject’s value for that voxel.
b. Identification of the first search-space voxel(s) that provide optimal distinction between the states F, e.g., all controls 0, all patients 1. These voxels are assigned to the putative group of representative voxels A.
Group A, n = 10, “Controls”; Group B, n = 10, “Patients”
Disease characterized by atrophy, i.e., “1” voxels, compared to controls.
Toy grid; each cell shows the number of Group A subjects / Group B subjects with voxel value 1:
0/0  0/1  3/6  1/1  0/0
1/1  3/9  4/9  3/9  1/0
0/1  1/10 2/9  0/9  2/1
0/0  1/9  1/9  0/9  3/4
0/0  3/0  1/0  1/1  4/3
[Figure: the toy grid repeated, highlighting the search space and the representative voxels of the 1st iteration]
GAMMA: Structure Learning Practical II
Step 1 (cont.)
c. Identification of voxel(s) whose addition to A increases the ability of A to correctly distinguish between the states F. The process is repeated until no voxel is left that fulfills this condition.
d. Identification of all voxels Rn in A that maximize the distinction between the states F. The Rn of the first iteration corresponds to R (the Rn of later iterations are added to R). Voxels belonging to Rn are removed from the search space V.
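The greedy selection in steps 1b-d can be sketched on toy counts. The counts below are hypothetical (chosen for illustration, not taken from the lecture’s grid), in the same (controls with voxel = 1, patients with voxel = 1) format, with n = 10 subjects per group:

```python
# Greedy voxel selection sketch: pick the voxel whose "voxel = 1 => patient"
# rule classifies the most subjects correctly. Counts are hypothetical.
counts = {"v1": (1, 9), "v2": (2, 9), "v3": (0, 9), "v4": (3, 4)}

def accuracy(voxel):
    c, p = counts[voxel]
    # controls correctly called negative + patients correctly called positive
    return ((10 - c) + p) / 20

best = max(counts, key=accuracy)
print(best, accuracy(best))
```

The full algorithm repeats this selection, each time asking whether adding another voxel improves the discrimination of the set A as a whole.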
GAMMA: Structure Learning Practical III
Step 2 (iteration 2 and higher)
a. Calculation of the similarity s between the voxels in A and the voxels in Rn-1. The similarity s for one voxel Vi in A is defined as
s(Vi, Rn-1) = P(Vi = 1, Rn-1 = 1) + P(Vi = 0, Rn-1 = 0)
The similarity of all n voxels in A is expressed as a similarity map S:
S = {s(V1, Rn-1), s(V2, Rn-1), …, s(Vn, Rn-1)}
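The similarity measure can be checked on toy data; the binary values across six hypothetical subjects below are illustrative only:

```python
# Similarity sketch: s(Vi, R) = P(Vi = 1, R = 1) + P(Vi = 0, R = 0),
# i.e., the fraction of subjects where voxel Vi agrees with the
# representative voxel R. Data are hypothetical.
R  = [1, 1, 1, 0, 0, 0]   # representative voxel across six subjects
Vi = [1, 1, 0, 0, 0, 1]   # candidate voxel across the same subjects

n = len(R)
both_one  = sum(1 for v, r in zip(Vi, R) if v == 1 and r == 1) / n  # P(Vi=1, R=1)
both_zero = sum(1 for v, r in zip(Vi, R) if v == 0 and r == 0) / n  # P(Vi=0, R=0)
s = both_one + both_zero
print(round(s, 3))
```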
GAMMA: Structure Learning Practical IV
Step 2 (iteration 2 and higher), cont.
b. Initial random assignment of a label L (patient or control) to each voxel in A. Voxels with the same label are in the same cluster. Initially there are at most two clusters: a cluster of voxels with 100% probability of being “patient” and a cluster of voxels with 100% probability of being “control”. During the optimization the probabilities are adjusted, and so is the number of clusters; the global variance criterion is used to determine the optimal number of clusters. The labels L of all n voxels define the label map L.
[Figure: the toy grids for Group A (“Controls”) and Group B (“Patients”), with initial cluster labels assigned to individual voxels]
GAMMA: Structure Learning Practical IV
c. Using the similarity map S and the initial label map L as input, the problem reduces to finding a posterior MAP estimate of L given S.
The Markov random field (MRF) component sets a penalty for different patterns: a low penalty for combining voxels with similar probabilistic associations and spatial closeness.
The “Bayesian component” describes the relationship between the similarity map and the label map.
Loopy belief propagation is used for label map/belief map inference.
GAMMA: Structure/Parameter Learning Practical V
Step 3. Update of the cluster(s) of representative voxels R from the previous iterations (n−1) by adding Ln and Bn to generate Lall and Ball. Voxels belonging to Rn are removed from the search space V.
A new iteration (steps 1-3) is started until no voxels are left in the remaining search space V. The Lall and Ball of the last iteration are defined as Lfinal and Bfinal.
GAMMA: Structure/Parameter Learning Practical VI
Step 4. Validation of Lfinal and Bfinal using the jackknife method. The resulting sampling distribution is used to generate a p-map in which each p-value indicates the likelihood of an outcome as extreme as, or more extreme than, that observed.
Step 5. Regional state inference: group assignment of each subject in the training set based on the correspondence of the individual abnormalities with Lfinal. The observed group membership of the training set and the inferred group membership based on regional state inference are used as the parameter set for the DAG.
GAMMA: Structure/Parameter Learning: Outputs
[Figure: output example: network (Event A, Event B) with state probabilities C: 0.729, P: 0.270; label map Lfinal; belief map Bfinal]
GAMMA vs GLM
[Figure: GAMMA label map vs. SPM GLM map (FDR 0.05)]
GAMMA vs. GLM I
GLM:
- normal distribution
- parametric statistics
- linear state-image feature association
- detects: segregation
GAMMA:
- normal/non-normal distribution
- non-parametric statistics
- probabilistic state-image feature association
- detects: segregation, degeneration, integration
GAMMA vs. GLM II
[Figure: Group A vs. Group B patterns analyzed with GLM and GAMMA for segregation, degeneration, and integration]
Bayesian Networks: Combination of Features
GAMMA uses a Bayesian network approach to identify features of a single image modality that distinguish between two groups, e.g., patients vs. controls. However, this scenario does not fully reflect the questions that need to be answered in clinical practice:
A. The question is often not only whether a subject is a patient, but also what type of patient the subject is.
B. Often there is information from several sources (imaging, other exams) that can be confirmatory but also conflicting.
=> A classical problem for the “conventional” Bayesian network approach.
Bayesian Networks: Multi-Level Application I
Example:
Three types of focal non-lesional epilepsy with similar clinical manifestations, and controls with a matching imaging protocol:
A. Temporal lobe epilepsy with mesial temporal sclerosis
B. Temporal lobe epilepsy with normal MRI
C. Frontal lobe epilepsy with normal MRI and different semiology
Two MR imaging modalities:
A. Structural whole-brain T1 for volumetry = gray matter loss
B. Whole-brain DTI = white matter abnormalities
GOAL: A Bayesian network classifier that calculates the probability of a patient belonging to one of the three types based on imaging features.
Bayesian Networks: Multi-Level Application II
Strategy:
1. First level: full characterization of GM and WM imaging features in each group using GAMMA. Each group is compared against each other group, i.e., a total of 12 whole-brain comparisons and 1 region-of-interest (hippocampus) comparison.
2. Second level: combination of the imaging information, including one clinical variable (seizures yes/no), into a Bayesian network that makes it possible to calculate the probability of a patient belonging to one of the three epilepsy types (simple evidence query).
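The second-level combination can be sketched as a simple Bayesian fusion of per-modality likelihoods. All numbers below are hypothetical, chosen only to illustrate the mechanics of the evidence query, not taken from the study:

```python
# Naive Bayesian fusion sketch: combine per-modality evidence into a
# posterior over epilepsy types. All probabilities are hypothetical.
types = ["TLE-MTS", "TLE-normal", "FLE-normal"]
prior = {t: 1 / 3 for t in types}                              # uniform prior
p_gm = {"TLE-MTS": 0.9, "TLE-normal": 0.4, "FLE-normal": 0.3}  # P(GM pattern | type)
p_wm = {"TLE-MTS": 0.7, "TLE-normal": 0.5, "FLE-normal": 0.6}  # P(WM pattern | type)

unnorm = {t: prior[t] * p_gm[t] * p_wm[t] for t in types}
z = sum(unnorm.values())
post = {t: unnorm[t] / z for t in types}                       # P(type | GM, WM)

print(max(post, key=post.get))
```

The actual network additionally conditions on the clinical variable (seizures yes/no) and on the pattern of hippocampal findings, but the query has the same form.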
Bayesian Networks: First Level: Characterization of GM Loss
Bayesian Networks: First Level: Characterization of WM Integrity Loss
Bayesian Networks: Second Level
Results I
TLE with sclerosis: 84.5% correctly classified, 15.8% incorrectly classified, 0% not classified
TLE with normal MRI: 59.1% correctly classified, 22.7% incorrectly classified, 18.2% not classified
FLE with normal MRI: 50% correctly classified, 28.6% incorrectly classified, 21.4% not classified
Not classified: abnormalities in both modalities not exceeding those found in controls.
Results II
Summary: Classifier using Bayesian Networks
Bayesian networks can be used at several stages of the image processing and analysis.
Bayesian networks are ideal for combining information from different imaging modalities, but also from other sources, e.g., clinical, metabolomic, genetic, etc.
Bayesian networks do not depend on the assumptions of classical parametric statistics.
Bayesian networks provide the probability of belonging to a certain group, i.e., they are threshold-free.
Bayesian networks show some promise of being useful for “subtype” identification.
References
Crawford JR, Howell DC. Comparing an individual’s test score against norms derived from small samples. The Clinical Neuropsychologist 1998; 12: 482-486.
Crawford JR, Garthwaite PH. Comparison of a single case to a control or normative sample in neuropsychology: Development of a Bayesian approach. Cognitive Neuropsychology 2007; 24: 343-372.
Scarpazza C, Sartori G, De Simone MS, Mechelli A. When the single matters more than the group: Very high false positive rates in single case voxel-based morphometry. Neuroimage 2013; 70: 175-188.
Darwiche A. Modeling and Reasoning with Bayesian Networks. Cambridge University Press 2009.
Chen R, Herskovits EH. Graphical-Model-Based Morphometric Analysis. IEEE Transactions on Medical Imaging 2005; 24: 1237-1248.
Chen R, Herskovits EH. Graphical-model-based multivariate analysis of functional magnetic resonance data. Neuroimage 2007; 35: 635-647.
Chen R, Herskovits EH. Graphical-model-based multivariate analysis (GAMMA): An open source, cross-platform neuroimaging data analysis software package. Neuroinformatics. DOI 10.1007/s12021-011-9129-7.
Mueller SG, Young K, Hartig M, Barakos J, Garcia P, Laxer KD. A two-level multimodality imaging Bayesian network approach for classification of partial epilepsy: Preliminary data. Neuroimage 2013; 71: 224-232.
Software
http://homepages.abdn.ac.uk/j.crawford/pages/dept/SingleCaseMethodsComputerPrograms.HTM
http://genie.sis.pitt.edu/
http://reasoning.cs.ucla.edu/samiam/
GAMMA: http://www.nitrc.org