
Multivariate models of inter-subject anatomical variability

John Ashburner

Wellcome Trust Centre for Neuroimaging, UCL Institute of Neurology,
12 Queen Square, London WC1N 3BG, UK.


“The only relevant test of the validity of a hypothesis is comparison of prediction with experience.”

Milton Friedman





Choosing models/hypotheses/theories

MacKay, DJC. “Bayesian interpolation.” Neural Computation 4, no. 3 (1992): 415-447.





Evidence-based Science

...also just known as “science”.

Researchers claim to find differences between groups. Do those findings actually discriminate?

How can we most accurately diagnose a disorder from image data?

Pharma wants biomarkers. How do we most effectively identify them?

There are lots of potential imaging biomarkers. Which are most (cost) effective?

Pattern recognition provides a framework to compare data (or preprocessing strategy) to determine the most accurate approach.





Biological variability is multivariate





A generative classification approach

[Figure: four panels, each plotted over Feature 1 and Feature 2, titled as follows.]

p(x, y=0) = p(x|y=0) p(y=0)

p(x, y=1) = p(x|y=1) p(y=1)

p(x) = p(x, y=0) + p(x, y=1)

p(y=0|x) = p(x, y=0) / p(x)
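The four quantities on this slide can be sketched numerically. The example below is a minimal illustration, assuming Gaussian class-conditional densities; the means, covariances and the helper names `gaussian_pdf` and `posterior_class0` are made up for illustration and are not taken from the slides.

```python
import numpy as np

def gaussian_pdf(x, mean, cov):
    """Density of a multivariate Gaussian evaluated at each row of x."""
    d = x - mean
    inv = np.linalg.inv(cov)
    norm = np.sqrt((2 * np.pi) ** len(mean) * np.linalg.det(cov))
    return np.exp(-0.5 * np.einsum("ni,ij,nj->n", d, inv, d)) / norm

def posterior_class0(x, params0, params1, prior0=0.5):
    """p(y=0|x) = p(x,y=0) / p(x), where p(x) = p(x,y=0) + p(x,y=1)."""
    joint0 = gaussian_pdf(x, *params0) * prior0          # p(x|y=0) p(y=0)
    joint1 = gaussian_pdf(x, *params1) * (1.0 - prior0)  # p(x|y=1) p(y=1)
    return joint0 / (joint0 + joint1)

# Hypothetical class-conditional Gaussians (assumed values, not from the slides).
params0 = (np.array([1.0, -5.0]), np.array([[1.0, 0.3], [0.3, 0.5]]))
params1 = (np.array([3.0, -3.0]), np.array([[1.0, 0.3], [0.3, 0.5]]))

# Evaluate at the two class means and at their midpoint.
points = np.array([[1.0, -5.0], [3.0, -3.0], [2.0, -4.0]])
post = posterior_class0(points, params0, params1)
print(post)
```

With equal priors and a shared covariance, the midpoint between the two means lies on the decision boundary, so its posterior is 0.5.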





Discriminative classification approaches

[Figure: four panels, each plotted over Feature 1 and Feature 2: Ground truth, FLDA, SVC, Simple Logistic Regression.]





Bayesian classification

[Figure: panels plotted over Feature 1 and Feature 2: Simple Logistic Regression, Hyperplane Uncertainty, Bayesian Logistic Regression.]





Why Bayesian?

To deal with different priors.

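Consider a method with 90% sensitivity and specificity.
Consider using this to screen for a disease afflicting 1% of the population.
On average, out of 100 people there would be about 10 wrongly assigned to the disease group, against roughly one correctly assigned.
A positive diagnosis therefore suggests less than a 10% chance (about 8%) of having the disease.

\[
P(\text{Disease} \mid \text{Pred+})
= \frac{P(\text{Pred+} \mid \text{Disease})\,P(\text{Disease})}
       {P(\text{Pred+} \mid \text{Disease})\,P(\text{Disease}) + P(\text{Pred+} \mid \text{Healthy})\,P(\text{Healthy})}
= \frac{\text{Sensitivity} \times P(\text{Disease})}
       {\text{Sensitivity} \times P(\text{Disease}) + (1 - \text{Specificity}) \times P(\text{Healthy})}
\]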

Better decision-making by accounting for utility functions.
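The screening example can be checked with a few lines of arithmetic. This is a small sketch of Bayes' rule for that case; the function name `positive_predictive_value` is chosen here for illustration.

```python
def positive_predictive_value(sensitivity, specificity, prevalence):
    """P(Disease | Pred+) from Bayes' rule."""
    true_pos = sensitivity * prevalence                    # P(Pred+|Disease) P(Disease)
    false_pos = (1.0 - specificity) * (1.0 - prevalence)   # P(Pred+|Healthy) P(Healthy)
    return true_pos / (true_pos + false_pos)

# 90% sensitivity and specificity, disease afflicting 1% of the population.
ppv = positive_predictive_value(0.9, 0.9, 0.01)
print(f"P(Disease | Pred+) = {ppv:.3f}")  # prints 0.083
```

Changing the prior (the prevalence) while keeping the classifier fixed changes the answer substantially, which is the point of the slide.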





Curse of dimensionality

Large p, small n.





Nearest-neighbour classification

[Figure: nearest-neighbour decision regions plotted over Feature 1 and Feature 2.]

Not nice smooth separations.

Lots of sharp corners.

May be improved with K-nearest neighbours.
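A minimal sketch of the difference between 1-nearest-neighbour and K-nearest-neighbour classification follows. The tiny data set, including one deliberately mislabelled point, is invented for illustration, and `knn_predict` is a hypothetical helper rather than a library function.

```python
import numpy as np

def knn_predict(train_x, train_y, query_x, k=1):
    """Label each query point by majority vote among its k nearest training points."""
    # Pairwise squared Euclidean distances, shape (n_query, n_train).
    d2 = ((query_x[:, None, :] - train_x[None, :, :]) ** 2).sum(axis=2)
    nearest = np.argsort(d2, axis=1)[:, :k]
    votes = train_y[nearest]                    # binary labels (0 or 1)
    return (votes.mean(axis=1) > 0.5).astype(int)

# Two clusters, plus one point labelled 1 sitting inside the class-0 cluster.
train_x = np.array([[0, 0], [0, 1], [1, 0], [1, 1],
                    [4, 4], [4, 5], [5, 4], [5, 5],
                    [0.5, 0.5]], dtype=float)
train_y = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1])

query_x = np.array([[0.6, 0.6], [4.5, 4.5]])
labels_1nn = knn_predict(train_x, train_y, query_x, k=1)
labels_3nn = knn_predict(train_x, train_y, query_x, k=3)
print(labels_1nn, labels_3nn)
```

With k=1 the first query inherits the label of the stray point; with k=3 its neighbours outvote it, which is one way the decision regions become less jagged.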





Rule-based approaches

[Figure: rule-based decision region in the (x, y) plane, both axes from 0 to 4.]

((x<0.3) & (y<2)) | ((x<0.75) & (y<1.25)) | (y<0.4)

Not nice smooth separations.

Lots of sharp corners.





Corners matter in high dimensions

[Figure: volume of a hyper-sphere (r = 1/2) against number of dimensions, from 0 to 20; the volume axis runs from 0 to 1.]

Circle area = π r²

Sphere volume = 4/3 π r³
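The curve in this figure can be reproduced from the standard formula for the volume of an n-dimensional ball, V_n(r) = π^(n/2) r^n / Γ(n/2 + 1). The sketch below uses r = 1/2, so the sphere is inscribed in a unit hyper-cube of volume 1; the function name is chosen for illustration.

```python
import math

def hypersphere_volume(n_dims, radius=0.5):
    """Volume of an n-dimensional ball of the given radius."""
    return math.pi ** (n_dims / 2) * radius ** n_dims / math.gamma(n_dims / 2 + 1)

for n in (1, 2, 3, 5, 10, 20):
    # The enclosing unit hyper-cube always has volume 1, so this is also
    # the fraction of the cube that is NOT in the corners.
    print(n, hypersphere_volume(n))
```

For n = 2 and n = 3 this reduces to the circle and sphere formulas on the slide. By 20 dimensions the inscribed sphere occupies a negligible fraction of the cube, so almost all of the volume lies in the corners.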





Dimensionality ≠ number of voxels

Little evidence to suggest that most voxel-based feature selection methods help.

Little or no increase in predictive accuracy.
Commonly perceived as being more “interpretable”.

Prior knowledge derived from independent data is the most reliable way to improve accuracy.

e.g. search the literature for clues about which regions to weight more heavily.

Cuingnet, Remi, Emilie Gerardin, Jerome Tessieras, Guillaume Auzias, Stephane Lehericy, Marie-Odile Habert, Marie Chupin, Habib Benali, and Olivier Colliot. “Automatic classification of patients with Alzheimer’s disease from structural MRI: a comparison of ten methods using the ADNI database.” Neuroimage 56, no. 2 (2011): 766-781.

Chu, Carlton, Ai-Ling Hsu, Kun-Hsien Chou, Peter Bandettini, and ChingPo Lin. “Does feature selection improve classification accuracy? Impact of sample size and feature selection on classification using anatomical magnetic resonance images.” Neuroimage 60, no. 1 (2012): 59-70.

See winning strategies in http://www.ebc.pitt.edu/PBAIC.html




Linear versus Nonlinear methods

Linear methods are more interpretable.

Nonlinear methods usually increase dimensionality.

Better to preprocess to obtain features that behave more linearly.

[Figure: Body Mass Index (linear plot). Height (m) against Weight (kg), with contours at BMI=18.5, BMI=25 and BMI=30.]

[Figure: Body Mass Index (log-log plot). Same axes and contours on logarithmic scales.]

[Figure: Raw Data, plotted over Feature 1 and Feature 2.]

[Figure: Log-Transformed data, plotted over Feature 1 and Feature 2 on logarithmic axes.]
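The Body Mass Index example can be sketched numerically. Since BMI = weight / height², a contour of constant BMI is curved in (weight, height) coordinates, but taking logarithms gives log(height) = 0.5 log(weight) − 0.5 log(BMI), which is a straight line. The snippet below is an illustrative check using a line fit; the weight range and BMI value are assumptions chosen to resemble the plots.

```python
import numpy as np

bmi = 25.0
weight = np.linspace(40.0, 160.0, 50)   # kg
height = np.sqrt(weight / bmi)          # m, from BMI = weight / height**2

# Fit a straight line to the contour in raw and in log coordinates.
raw_fit = np.polyfit(weight, height, 1)
log_fit = np.polyfit(np.log(weight), np.log(height), 1)

# Largest deviation of the contour from its best-fitting straight line.
raw_resid = np.abs(np.polyval(raw_fit, weight) - height).max()
log_resid = np.abs(np.polyval(log_fit, np.log(weight)) - np.log(height)).max()

print("slope in log-log coordinates:", log_fit[0])
print("max residual (raw):", raw_resid)
print("max residual (log):", log_resid)
```

The log-log fit is exact up to rounding error, with slope 0.5, whereas the raw fit leaves a visible residual. A linear classifier applied to the log-transformed features can therefore represent a BMI threshold exactly.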





Transformed images fall on manifolds

Rotating an image leads to points on a 1D manifold.

[Figure: plot illustrating the 1D manifold traced out by rotating an image.]

Rigid-body motion leads to a 6-dimensional manifold (not shown).





Local linearisation through smoothing

[Figure: two plots.]

Spatial smoothing can make the manifolds more linear with respect to small misregistrations.

Some information is inevitably lost.
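One way to illustrate this effect is with a 1D signal: compare a signal shifted by one sample (a small misregistration) with its first-order Taylor approximation, before and after smoothing. The sketch below is an assumption-laden toy example, not the computation behind the figure; the box-shaped signal, the smoothing width and the helper names are all invented for illustration.

```python
import numpy as np

def gaussian_smooth(signal, sigma):
    """Convolve a 1D signal with a normalised Gaussian kernel."""
    radius = int(4 * sigma)
    t = np.arange(-radius, radius + 1)
    kernel = np.exp(-0.5 * (t / sigma) ** 2)
    kernel /= kernel.sum()
    return np.convolve(signal, kernel, mode="same")

def linearisation_error(signal, shift=1):
    """Error of a first-order Taylor approximation to the shifted signal,
    relative to the size of the change caused by the shift."""
    shifted = np.roll(signal, shift)
    linear = signal - shift * np.gradient(signal)
    return np.linalg.norm(shifted - linear) / np.linalg.norm(shifted - signal)

# A box-shaped "structure" with sharp edges.
raw = np.zeros(100)
raw[40:60] = 1.0
smooth = gaussian_smooth(raw, sigma=4.0)

err_raw = linearisation_error(raw)
err_smooth = linearisation_error(smooth)
print(err_raw, err_smooth)
```

The linear approximation is much poorer for the sharp-edged signal than for the smoothed one, at the cost of the fine detail removed by the smoothing.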





One mode of geometric variability

[Figure panels: Simulated images; Principal components.]

A suitable model would reduce these data to a single dimension.
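A toy version of this situation can be sketched as follows, under the assumption that the single mode of variability is the position of a feature. The simulated "images" are 1D Gaussian bumps, which is an invented stand-in for the simulated images on the slide. Although only one parameter varies, a linear decomposition needs several principal components to describe the data.

```python
import numpy as np

grid = np.arange(100)
centres = np.arange(30, 71)   # the single mode of variability: bump position

# Each row is one simulated 1D "image": a Gaussian bump at a different position.
images = np.exp(-0.5 * ((grid[None, :] - centres[:, None]) / 3.0) ** 2)

# Principal component analysis via the SVD of the mean-centred data.
centred = images - images.mean(axis=0)
singular_values = np.linalg.svd(centred, compute_uv=False)
explained = np.cumsum(singular_values ** 2) / np.sum(singular_values ** 2)

# Number of components needed to explain 95% of the variance.
n_components_95 = int(np.searchsorted(explained, 0.95) + 1)
print(n_components_95)
```

A model that represents the position explicitly would need only one number per image, which is the sense in which a suitable model reduces these data to a single dimension.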





Two modes of geometric variability

[Figure panels: Simulated images; Principal components.]

A suitable model would reduce these data to two dimensions.





Similarity Measures

Many methods are based on similarity measures.

A common similarity measure is the dot product.

Similarity: $k(x, y) = \sum_k x_k y_k$

Nonlinear methods are often based on distances.

Distance: $d(x, y) = \sqrt{\sum_k (x_k - y_k)^2}$

Similarity: $k(x, y) = \exp(-\lambda\, d(x, y)^2)$
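The three measures above can be written directly in code. This is a small sketch with invented example vectors; the function names are chosen for illustration.

```python
import numpy as np

def dot_similarity(x, y):
    """k(x, y) = sum_k x_k y_k"""
    return float(np.dot(x, y))

def euclidean_distance(x, y):
    """d(x, y) = sqrt(sum_k (x_k - y_k)^2)"""
    return float(np.sqrt(np.sum((x - y) ** 2)))

def rbf_similarity(x, y, lam=1.0):
    """k(x, y) = exp(-lambda d(x, y)^2)"""
    return float(np.exp(-lam * euclidean_distance(x, y) ** 2))

x = np.array([1.0, 2.0, 3.0])
y = np.array([1.0, 0.0, 3.0])

print(dot_similarity(x, y))         # 10.0
print(euclidean_distance(x, y))     # 2.0
print(rbf_similarity(x, y, 0.5))    # exp(-2)

# The squared Euclidean distance can be recovered from dot products alone.
d2_from_dots = dot_similarity(x, x) - 2 * dot_similarity(x, y) + dot_similarity(y, y)
print(d2_from_dots)                 # 4.0
```

The last line shows why the choice of distance matters: any method built on dot products implicitly uses the Euclidean distance between feature vectors.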

How do we best measure distances between brain images?





Image Registration

Image registration measures distances between images.

Often involves minimising the sum of two terms:

Distance between the image intensities.
Distance of the deformation from zero.

The sum of these terms gives the distance.

[Figure: panels labelled µ ∘ θ, f ∘ φ, φ, |Jφ|, θ and |Jθ|.]





Different ways of measuring distances





[Figure: two simulated images.]





Metrics

Distances need to satisfy the properties of a metric:

1. d(x, y) ≥ 0 (non-negativity)

2. d(x, y) = 0 if and only if x = y (identity of indiscernibles)

3. d(x, y) = d(y, x) (symmetry)

4. d(x, z) ≤ d(x, y) + d(y, z) (triangle inequality)

Satisfying (3) requires inverse-consistent image registration.

Satisfying (4) requires a specific family of image registration algorithms.
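The four properties can be tested empirically on a finite set of points. The sketch below is a brute-force check with a hypothetical helper `check_metric`; it uses simple vector distances rather than registration-based ones, and passing on a sample of points does not prove that a function is a metric in general.

```python
import itertools
import numpy as np

def check_metric(dist, points, tol=1e-9):
    """Return the set of metric properties (numbered 1 to 4) violated on `points`."""
    violated = set()
    for x, y in itertools.product(points, repeat=2):
        d = dist(x, y)
        if d < -tol:
            violated.add(1)                          # non-negativity
        if (abs(d) <= tol) != np.array_equal(x, y):
            violated.add(2)                          # identity of indiscernibles
        if abs(d - dist(y, x)) > tol:
            violated.add(3)                          # symmetry
    for x, y, z in itertools.product(points, repeat=3):
        if dist(x, z) > dist(x, y) + dist(y, z) + tol:
            violated.add(4)                          # triangle inequality
    return violated

points = [np.array([0.0]), np.array([1.0]), np.array([2.0])]

euclidean = lambda x, y: float(np.linalg.norm(x - y))
squared = lambda x, y: float(np.sum((x - y) ** 2))

print(check_metric(euclidean, points))   # set()
print(check_metric(squared, points))     # {4}
```

The squared Euclidean distance fails the triangle inequality (from 0 to 2 costs 4, but going via 1 costs only 2), which shows that a plausible-looking dissimilarity is not automatically a metric.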





Non-Euclidean geometry

Distances are not always measured along a straight line.

“Shapes are the ultimate non-linear sort of thing”





Linear approximations to nonlinear problems





Example Images

Some example (non-brain) images.





Registered Images

We could register the images to their average shape...





Deformations

...and study the deformations...





Jacobian Determinants

...or the relative volumes...





Scalar Momentum

... or “scalar momentum”





Reconstructed Images

Reconstructions from template and scalar momenta.





Real data

Used 550 T1w brain MRI from the IXI (Information eXtraction from Images) dataset: http://www.brain-development.org/

Data from three different hospitals in London:

Hammersmith Hospital using a Philips 3T system

Guy’s Hospital using a Philips 1.5T system

Institute of Psychiatry using a GE 1.5T system




Grey and White Matter

Segmented into GM and WM.

Approximately aligned via rigid-body.

Ashburner, J & Friston, KJ. Unified segmentation. NeuroImage 26(3):839–851 (2005).





Diffeomorphic Alignment

All GM and WM were diffeomorphically aligned to their common average-shaped template.

Ashburner, J & Friston, KJ. Diffeomorphic registration using geodesic shooting and Gauss-Newton optimisation. NeuroImage 55(3):954–967 (2011).

Ashburner, J & Friston, KJ. Computing average shaped tissue probability templates. NeuroImage 45(2):333–341 (2009).





Volumetric Features

A number of features were used for pattern recognition.

Firstly, two features relating to relative volumes.

Initial velocity divergence is similar to logarithms of Jacobian determinants.

[Figure panels: Jacobian Determinants; Initial Velocity Divergence.]





Grey Matter Features

[Figure panels: Rigidly Registered GM; Nonlinearly Registered GM; Registered and Jacobian Scaled GM.]





“Scalar Momentum” Features

“Scalar momentum” actually has two components because GM was matched with GM and WM was matched with WM.

[Figure panels: First Momentum Component; Second Momentum Component.]





Age Regression

Linear Gaussian Process Regression to predict subject ages.

[Figure: Bayesian Model Evidence. Log Likelihood against Smoothing (mm), 0 to 20, for six feature sets: Jacobians, Divergences, Rigid GM, Unmodulated GM, Modulated GM, Scalar Momentum.]

[Figure: 8-Fold Cross-Validation. RMS error (years) against Smoothing (mm), 0 to 20, for the same six feature sets.]

Rasmussen, CE & Williams, CKI. Gaussian processes for machine learning. Springer (2006).
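For readers unfamiliar with the quantities plotted above, the following is a minimal sketch of Gaussian process regression with a linear kernel, using the standard log marginal likelihood, log p(y|X) = −½ yᵀK⁻¹y − ½ log|K| − (n/2) log 2π. It is not the pipeline used for these results: the data are random numbers, the hyperparameters `signal_var` and `noise_var` are fixed by hand rather than optimised, and the function names are invented for illustration.

```python
import numpy as np

def linear_gp_fit(X, y, signal_var=1.0, noise_var=0.1):
    """Zero-mean GP with linear kernel: returns (log marginal likelihood, K^{-1} y)."""
    n = len(y)
    K = signal_var * X @ X.T + noise_var * np.eye(n)   # covariance of the targets
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    log_evidence = (-0.5 * y @ alpha
                    - np.log(np.diag(L)).sum()         # equals 0.5 * log|K|
                    - 0.5 * n * np.log(2 * np.pi))
    return log_evidence, alpha

def linear_gp_predict(X_train, alpha, X_test, signal_var=1.0):
    """Predictive mean at the test points."""
    return signal_var * X_test @ X_train.T @ alpha

# Hypothetical data: 40 "subjects" with 5 features each (random, for illustration).
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 5))
w_true = rng.normal(size=5)
y = X @ w_true + 0.1 * rng.normal(size=40)

log_ev, alpha = linear_gp_fit(X, y, signal_var=1.0, noise_var=0.01)
pred = linear_gp_predict(X, alpha, X[:3])
print(log_ev, pred)
```

The log marginal likelihood plays the role of the "Bayesian model evidence" axis: it can be compared across feature sets or smoothing levels without holding out data, whereas cross-validation estimates predictive error directly.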





Sex Classification

Linear Gaussian Process Classification (EP) to predict sexes.

[Figure: Bayesian Model Evidence. Log Likelihood against Smoothing (mm), 0 to 20, for six feature sets: Jacobians, Divergences, Rigid GM, Unmodulated GM, Modulated GM, Scalar Momentum.]

[Figure: Gaussian Process (EP). AUC against Smoothing (mm), 0 to 20, for the same six feature sets.]

Rasmussen, CE & Williams, CKI. Gaussian processes for machine learning. Springer (2006).





Predictive Accuracies

[Figure: Age. Scalar Momentum (10mm FWHM): Predicted Age (years) against Actual Age (years), both axes from 0 to 100.]

[Figure: Sex. ROC Curve (AUC=0.9769), with axes labelled Sensitivity and Specificity.]





Conclusions

Scalar momentum (with about 10mm smoothing) appears to be a useful feature set.

Jacobian-scaled warped GM is surprisingly poor.

Amount of spatial smoothing makes a big difference.

Further dependencies on the details of the registration still need exploring.



