

YNIMG-05036; No. of pages: 12; 4C:

www.elsevier.com/locate/ynimg

ARTICLE IN PRESS

NeuroImage xx (2007) xxx–xxx

The impact of skull-stripping and radio-frequency bias correction on grey-matter segmentation for voxel-based morphometry

Julio Acosta-Cabronero a,⁎, Guy B. Williams a, João M.S. Pereira a, George Pengas b and Peter J. Nestor b

a Wolfson Brain Imaging Centre, Department of Clinical Neurosciences, University of Cambridge School of Clinical Medicine, Addenbrooke’s Hospital, Cambridge, UK

b Neurology Unit, Department of Clinical Neurosciences, University of Cambridge School of Clinical Medicine, Addenbrooke’s Hospital, Cambridge, UK

Received 9 March 2007; revised 30 October 2007; accepted 31 October 2007

This study evaluates the application of (i) skull-stripping methods (hybrid watershed algorithm (HWA), brain surface extractor (BSE) and brain-extraction tool (BET2)) and (ii) bias correction algorithms (nonparametric nonuniform intensity normalisation (N3), bias field corrector (BFC) and FMRIB’s automated segmentation tool (FAST)) as pre-processing pipelines for the technique of voxel-based morphometry (VBM) using statistical parametric mapping v.5 (SPM5). The pipelines were evaluated using a BrainWeb phantom, and those that performed consistently were further assessed using artificial-lesion masks applied to 10 healthy controls compared to the original unlesioned scans, and finally, 20 Alzheimer’s disease (AD) patients versus 23 controls. In each case, pipelines were compared to each other and to those from default SPM5 methodology. The BET2+N3 pipeline was found to produce the least miswarping to template induced by real abnormalities, and performed consistently better than the other methods for the above experiments. Occasionally, the clusters of significant differences located close to the boundary were dragged out of the glass-brain projections; this could be corrected by adding background noise to low-probability voxels in the grey matter segments. This method was confirmed in a one-dimensional simulation and was preferable to threshold and explicit (simple) masking which excluded true abnormalities.

© 2007 Elsevier Inc. All rights reserved.

⁎ Corresponding author. Wolfson Brain Imaging Centre, Addenbrooke’s Hospital (Box 65), Hills Road, Cambridge CB2 0QQ, UK. Fax: +44 1223 331826.

E-mail addresses: [email protected] (J. Acosta-Cabronero), [email protected] (G.B. Williams), [email protected] (J.M.S. Pereira), [email protected] (G. Pengas), [email protected] (P.J. Nestor).

Available online on ScienceDirect (www.sciencedirect.com).

1053-8119/$ - see front matter © 2007 Elsevier Inc. All rights reserved. doi:10.1016/j.neuroimage.2007.10.051

Please cite this article as: Acosta-Cabronero, J., et al., The impact of skull-stripping and radio-frequency bias correction on grey-matter segmentation for voxel-based morphometry, NeuroImage (2007), doi:10.1016/j.neuroimage.2007.10.051

Introduction

Voxel-based morphometry or VBM (Ashburner and Friston, 2000) is a frequently employed method for evaluating regional differences in grey-matter density. Data sets can be compared on a voxel-by-voxel basis since the structural magnetic resonance (MR) images are normalised to the same standard space, and segmented into grey matter (GM) and white matter (WM) prior to statistical analysis. The latest statistical parametric mapping (SPM) release, SPM5, enables spatial normalisation, tissue classification and radio-frequency (r.f.) bias correction (BC) to be combined within the same model (Ashburner and Friston, 2005).

Spatial normalisation inaccuracy is an important limiting factor for the validity of the VBM results (Bookstein, 2001). Systematic miswarping to template of a given structure across groups may lead to either false positives or false negatives. Therefore, spatial smoothing is also required prior to statistical analysis in order to cope, not only with miswarping, but also with inter-subject variations in anatomy, to improve the signal-to-noise ratio and to render the data more normally distributed. Smoothing MR data with a Gaussian filter, however, limits spatial selectivity and significant clusters should not be interpreted as anatomically precise results.

A recent study found that VBM using SPM2 can be improved by skull-stripping MR images prior to analysis (Fein et al., 2006). Many other studies have contrasted the performance of different skull-stripping algorithms (Boesen et al., 2004; Fennema-Notestine et al., 2006; Rex et al., 2004; Segonne et al., 2004; Smith, 2002; and Zhuang et al., 2006), but did not assess the impact on VBM. It is also clear that correcting for nonuniform tissue intensities as a result of r.f. bias may improve VBM sensitivity and accuracy; several bias correction methods have been reported and quantitatively compared to each other by Arnold et al. (2001), Shattuck et al. (2001), Sled et al. (1998) and Vovk et al. (2007). Given that SPM5 has incorporated nonuniformity correction as well as image registration and tissue classification in its generative model, the purpose of the present study was to investigate the impact that this new SPM approach would have on smoothed, modulated, normalised GM segments in comparison to those derived from the most widely used pre-processing methods; these are enumerated below:

Skull-stripping

• Hybrid watershed algorithm (HWA) using atlas information in FreeSurfer v.3.04 (http://surfer.nmr.mgh.harvard.edu). HWA makes use of local statistics for the template deformation and integrates an atlas-based term constraining the shape of the brain (Segonne et al., 2004).

• Brain surface extractor (BSE) in BrainSuite v.2.0 (http://brainsuite.usc.edu). BSE combines edge-detection and morphology-based techniques. Adaptive anisotropic diffusion, edge detection and morphological erosions are used to identify the brain component (Shattuck et al., 2001).

• Brain extraction tool v.2.1 (BET2) in FSL (http://www.fmrib.ox.ac.uk/fsl). BET2 is based on regional properties of the image; the forces pushing the template outward are locally computed at each vertex (Smith et al., 2002).

Bias correction

• Nonparametric nonuniform intensity normalisation v.1.10 (N3) in FreeSurfer v.3.04. N3 corrects intensity nonuniformities without requiring a model of tissue classes. It uses a deconvolution kernel to sharpen the histogram plots that have been smoothed by the bias field (Sled et al., 1998).

• Bias field corrector (BFC) in BrainSuite v.2.0. BFC computes local estimates of bias fields uniformly spaced throughout the volume using an adaptive partial volume tissue model (Shattuck et al., 2001).

• FMRIB’s Automated Segmentation Tool v.3.53 (FAST) in FSL. FAST uses a hidden Markov random field model and an associated expectation–maximisation algorithm to classify the brain into different tissue types and to correct for intensity nonuniformities (Zhang et al., 2001).

The pre-processing pipelines were first evaluated with a phantom, and those that showed the best results were then used for VBM analyses of simulated lesions in healthy controls and real patient data.

Fig. 1. Input volumes and GM segments optimised by the unified segmentation model in SPM5. Inputs were generated from the T1-weighted MRI BrainWeb phantom. From left to right: full volume (SPM5 default), manually stripped and pre-processed volumes.

Phantom study

Methods

The pre-processing pipelines were evaluated on the T1-weighted MRI BrainWeb phantom (http://www.bic.mni.mcgill.ca/brainweb/selection_normal.html) with a noise level of 3% and 40% bias as shown in Fig. 1. This phantom was chosen because it resembles, in terms of signal-to-noise ratio (SNR) and intensity r.f. bias, the average scan used in this study, acquired with a 1.5-T GE Signa MRI scanner (GE Medical Systems, Milwaukee, WI, USA) using a T1-weighted 3-dimensional (3-D) inversion-recovery fast spoiled gradient-echo (IR-FSPGR) sequence (echo time: 4.2 ms, inversion time: 650 ms and flip angle: 20°) with voxel size 0.86×0.86×1.5 mm. The gold standard used to compare the pipelines was based on the 3-D “fuzzy” GM tissue volume (http://www.bic.mni.mcgill.ca/brainweb/anatomic_normal.html). The phantom fuzzy (1×1×1 mm) was re-sampled and co-registered to match the International Consortium for Brain Mapping (ICBM) GM template (2×2×2 mm) to which all volumes were warped in the unified segmentation step. The fuzzy GM segment was first recalculated using a linear combination of sample-weighted sinc functions. Next, the volume was convolved with an 8-mm full-width at half maximum (FWHM) isotropic Gaussian kernel to match the GM template’s smoothness prior to registration. Subsequent affine and nonrigid co-registrations of the smoothed images were performed using the visualisation toolkit (VTK, http://www.vtk.org), and the resulting transformation was applied to the re-sampled phantom fuzzy to produce the gold standard for this study. A manually skull-stripped data set was obtained by painstaking delineation of the cortical surface on the ground-truth image (noise: 0%; bias: 0%) to provide a representation of what an “ideal” skull-stripping algorithm should achieve and applying the resulting mask to the unbiased 3%-noise phantom.

The order of pre-processing steps was recently investigated by Fennema-Notestine et al. (2006); although they found some exceptions for strong biases, it was reported that prior bias correction does not significantly improve performance of skull-stripping methods. It is also known that N3 and FAST are more accurate when the brain is segmented from background (Sled et al., 1998; Zhang et al., 2001). Therefore, the phantom volume was first skull-stripped and then intensity-corrected prior to being processed by SPM5. The full volume without pre-processing (default) was also segmented using SPM5 in order to assess the need for such steps. Note that HWA was fully automated using default arguments. BSE had to be manually optimised, since BSE’s default settings produced a nonstripped output. The optimal BSE volume was obtained by computing the Jaccard similarity coefficient (J, see below) for several edge constants (0.53–0.58) with three diffusion iterations, diffusion constant set to 1 and erosion size, 3. An edge constant of 0.56 yielded the highest similarity coefficient. BET2 was used with fixed arguments; fractional intensity threshold, f, was set to 0.4 and 0.5, and vertical gradient, g, was set to 0. N3 and BFC were also automated using default arguments.

Table 1. Summary of phantom study RMS differences (% of the maximum ground-truth signal intensity)

Method        BC off                             BC on
              N3      BFC     FAST    None(a)    N3      None(a)
Full volume   –       –       –       –          –       4.37
Manual        –       –       –       1.49       –       –
HWAatl        4.26    7.34    9.62    –          4.43    4.40
BSEopt        4.22    4.60    4.80    –          4.49    4.44
BET2f=0.4     4.23    5.57    9.63    –          4.49    4.41
BET2f=0.5     4.22    5.51    9.64    –          4.46    4.42

(a) Bias correction was not performed prior to the unified segmentation step.

Bias correction

Previous investigations suggested that SPM5 is more accurate if it does not attempt to estimate bias fields when nonuniformities are not present (Ashburner and Friston, 2005). Hence, the performance of the bias correction algorithms was evaluated by computing the root-mean-squared (RMS) difference between the unbiased BrainWeb phantom, and the nonuniformity corrected (BC on, the SPM5 default) and uncorrected (BC off) SPM5 (intensity-corrected) outputs in native space as a percentage of the maximum signal intensity from the ground-truth volume. The bias correction was disabled by providing parameter settings that caused a negligible effect over the volume of interest; that is, bias regularisation was set to 10 and bias FWHM was set to 150-mm cut-off.
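The RMS measure described above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions, not the authors' code: the function name `rms_difference_percent` and the optional `mask` argument are hypothetical, and the text does not say whether the difference was restricted to a region of interest.

```python
import numpy as np

def rms_difference_percent(corrected, ground_truth, mask=None):
    """RMS difference between two volumes as a percentage of the
    maximum ground-truth intensity.

    `mask` is an assumption of this sketch (an optional boolean region
    of interest); with mask=None every voxel is used.
    """
    corrected = np.asarray(corrected, dtype=float)
    ground_truth = np.asarray(ground_truth, dtype=float)
    if corrected.shape != ground_truth.shape:
        raise ValueError("volumes must have the same shape")

    diff = corrected - ground_truth
    if mask is not None:
        diff = diff[np.asarray(mask, dtype=bool)]

    rms = np.sqrt(np.mean(diff ** 2))
    return 100.0 * rms / ground_truth.max()
```

For example, a volume that differs from a ground truth with maximum intensity 100 by a constant offset of 3 gives a value of 3.0, which is the scale on which the Table 1 entries are reported.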

Tissue segmentation

The accuracy in GM-tissue classification was evaluated in native space by considering the overlap between the calculated GM and that derived from the gold standard. The Jaccard similarity coefficient (J) is defined as the ratio of the size (i.e. number of voxels, N) of the intersection between GM segment for each pipeline and for the gold standard, divided by the size of their union:

$$J = \frac{N(\mathrm{GM}_A \cap \mathrm{GM}_T)}{N(\mathrm{GM}_A \cup \mathrm{GM}_T)},$$

where GM_A is the binarised GM segment (threshold set to 0.5) and GM_T is the ground-truth GM segment (also thresholded at 0.5). One could argue that if Prob({grey matter, white matter, cerebro-spinal fluid})={0.4, 0.3, 0.3} at a voxel, then it should be accepted as GM. However, the corollary of this approach is that the probability of such a voxel being non-GM is 60%. Hence, the threshold used here imposed that only strong candidates (≥0.5) of being GM would be counted. False-negative (FN) and false-positive rates (FP) as a percentage of the size of the binarised gold-standard GM segment were also calculated. In addition, the difference between the size (ΔN) of the pre-processed and gold-standard final outcomes as a percentage of the latter was computed.
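As a concrete reading of these definitions, the four overlap metrics could be computed from two GM probability maps as follows. This is a minimal sketch: the function name `overlap_metrics` and the dictionary keys are hypothetical, and the maps are assumed to be already on the same voxel grid.

```python
import numpy as np

def overlap_metrics(gm_pipeline, gm_truth, threshold=0.5):
    """Jaccard coefficient, FN, FP and size difference for two GM maps.

    Both maps are binarised at `threshold` (>= 0.5 counts as GM, as in
    the text). FN, FP and dN are percentages of the gold-standard size.
    """
    a = np.asarray(gm_pipeline) >= threshold   # binarised pipeline segment
    t = np.asarray(gm_truth) >= threshold      # binarised gold standard
    n_truth = int(t.sum())
    if n_truth == 0:
        raise ValueError("gold-standard segment is empty")

    n_inter = int((a & t).sum())
    n_union = int((a | t).sum())
    n_fn = int((~a & t).sum())   # GM voxels missed by the pipeline
    n_fp = int((a & ~t).sum())   # non-GM voxels labelled as GM

    return {
        "J": n_inter / n_union,
        "FN_pct": 100.0 * n_fn / n_truth,
        "FP_pct": 100.0 * n_fp / n_truth,
        "dN_pct": 100.0 * (int(a.sum()) - n_truth) / n_truth,
    }
```

Because every rate is normalised by the gold-standard size, a pipeline that removes as many true GM voxels as it adds false ones yields a size difference of zero even though J is below 1.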

Warping to template

The impact on spatial normalisation was evaluated by calculating J, FN, FP and ΔN for binarised (threshold set to 0.5), unmodulated normalised GM segments for each pipeline compared to the gold standard.

Results

Bias correction

N3 outperformed BFC and FAST for all skull-stripping methods as shown in Table 1. FAST’s performance was poor although its nonuniformity correction was significantly improved when the phantom was skull-stripped using optimised BSE. BFC also showed better performance when used after BSE. N3 performed consistently well for all brain-extraction methods, but the RMS error was lower when the bias correction within SPM5 was disabled (BC off). Table 1 also shows that any skull-stripping method and N3 correction used as pre-processing pipelines outperformed the bias correction implemented in SPM5 for the full volume. It is also important to mention here that although a perfectly bias-corrected output volume would be expected from the unbiased manually stripped input (BC off in SPM5), the Gaussian noise field added to the real and imaginary channel data (3% noise, calculated relative to the brightest tissue) induced an RMS error percentage of almost 1.5% in the resulting volume when compared to the ground truth.

Tissue segmentation

Although all methods (excepting when BET2/HWA was run in conjunction with FAST) yielded similar J as shown in Table 2 (native space), the distribution of dissimilarities was different in each case. BSE was the most specific skull-stripping algorithm, but it generated GM segments with the highest FN rate (except for FAST). Consequently, ΔN was negative, indicating that more GM voxels were erroneously excluded than were falsely included. The BET2/HWA+FAST segments were very poor, whereas BSE’s specificity suited the FAST method. This result suggests that FAST requires very accurate brain extraction and therefore, two tissue classes only in order to perform well. Because the HWA method aims to conservatively bound the pial surface, it was noted that parts of the venous sinuses were included in the volume; these, in turn, were classified as GM tissue in the SPM5 unified segmentation step. However, it was surprising to find that HWAatl+N3 performed very well in terms of similarity to the phantom fuzzy in native space. HWAatl produced the highest volume difference (except when used with FAST) relative to the reference method due to its high FP and low FN. Default SPM5 settings (full volume) performed well; comparatively, the FP rate was low and the FN rate was not too high. The most consistent tissue classification was produced when N3 was combined with either HWAatl or BET2f=0.4. BET2 can perform in a more specific manner by increasing the intensity threshold; f=0.5 discriminated more voxels from the cortical surface and also false positives were excluded from the GM segment without a severe decrease in similarity. It is clear from Table 2 (native space) that any skull-stripping method+N3 and default SPM5 settings systematically yielded more accurate GM segments compared to those bias-corrected using BFC or FAST.

Table 2. Summary of phantom study similarity metrics (FN, FP and ΔN in %)

Pipeline          Native space                      Standard space
                  J       FN     FP     ΔN          J       FN     FP     ΔN
Full volume       0.866   7.6    6.6    −1.0        0.706   18.4   15.6   −2.8
Manual            0.864   7.9    6.6    −1.3        0.712   17.7   15.6   −2.1
BET2f=0.4+N3      0.870   7.1    6.8    −0.4        0.706   16.1   18.5   2.4
BET2f=0.5+N3      0.862   8.1    6.7    −1.4        0.714   16.5   17.0   0.5
BSEopt+N3         0.846   9.3    7.3    −2.0        0.715   17.3   15.6   −1.7
HWAatl+N3         0.875   6.0    7.4    1.4         0.692   16.4   20.8   4.5
BET2f=0.4+BFC     0.821   11.8   7.4    −4.3        0.700   19.1   15.6   −3.6
BET2f=0.5+BFC     0.827   11.3   7.3    −4.0        0.702   18.9   15.5   −3.4
BSEopt+BFC        0.804   13.7   7.4    −6.3        0.686   21.0   15.1   −5.9
HWAatl+BFC        0.786   12.2   11.7   −0.5        0.675   18.8   20.3   1.5
BET2f=0.4+FAST    0.673   27.2   8.2    −19.0       0.472   42.6   21.6   −21.1
BET2f=0.5+FAST    0.674   27.1   8.2    −18.8       0.483   41.2   21.8   −19.4
BSEopt+FAST       0.830   10.7   7.6    −3.2        0.689   20.9   14.9   −6.0
HWAatl+FAST       0.670   26.3   10.0   −16.4       0.553   34.7   18.1   −16.6
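The four metrics reported in Table 2 are not independent. Because FN, FP and ΔN are all expressed relative to the gold-standard size, two identities follow directly from the definitions given in the Methods; they are offered here as a consistency check on the tabulated values, not as part of the original analysis. The counts N_∩, N_FN and N_FP are notation introduced for this sketch.

```latex
% N_A, N_T: sizes of the binarised pipeline and gold-standard segments
% N_\cap: size of their intersection; N_{FN}, N_{FP}: missed and extra voxels
\begin{align*}
N_T &= N_\cap + N_{FN}, \qquad N_A = N_\cap + N_{FP}, \\
\Delta N &= \frac{N_A - N_T}{N_T} = \frac{N_{FP} - N_{FN}}{N_T} = \mathrm{FP} - \mathrm{FN}, \\
J &= \frac{N_\cap}{N_\cap + N_{FN} + N_{FP}}
   = \frac{N_T - N_{FN}}{N_T + N_{FP}}
   = \frac{1 - \mathrm{FN}}{1 + \mathrm{FP}}.
\end{align*}
% Full volume, native space: FN = 0.076, FP = 0.066
%   \Delta N = 0.066 - 0.076 = -0.010   (reported: -1.0 %)
%   J = 0.924 / 1.066 \approx 0.867     (reported: 0.866)
```

The small discrepancy in J for the full-volume entry is within what rounding FN and FP to one decimal place allows.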

Warping to template

Table 2 (Standard space) also shows the impact that the different methods have on the correspondence after spatial normalisation; optimal BSE and BET2f=0.5+N3 had the best similarity with regard to the gold-standard boundary in standard space. Note that the FN rate for the full-volume method (SPM5 default) increased substantially after normalisation. HWAatl produced the highest volume difference and highest FP rate (except when used with FAST) and, although the HWAatl segment had a low FN rate in native space, the redundant GM tissue induced substantial misregistration in standard space; this generated a large increase in false negatives, which together with the large FP, resulted in a significant decrease in similarity. As mentioned above, BSE is the most specific of all methods; it also yielded low FPs and negative ΔNs for normalised segments. BET2 kept low FN rates, and FP and ΔN could be lowered by increasing f. In summary, although the differences are not substantial, Table 2 shows that the default SPM5 settings produced lower error rates than HWA+N3, but worse than those of the BSE+N3 and BET2+N3 pipelines.

The manual skull-stripping method of the unbiased phantom generated an input data set that was very similar in appearance to the BET2f=0.5+N3 volume (Table 2), suggesting that the latter is very close to what one would want from a pre-processing methodology.

It is also important to mention here that occasionally, automated skull-stripping methods fail to remove neck tissue; this may have a negative impact on the unified segmentation process, but can be solved by recursively running the algorithm with conservative settings (e.g. BET2 with f=0.4). It was found that, although difficult to automate, this method may also improve VBM by more accurately excluding venous sinuses and cerebro-spinal fluid (CSF) prior to processing in SPM5. In contrast, it was found that multi-pass N3 and BFC did not improve bias correction.

In summary, this experiment evaluated the performance of SPM5 using different methods to pre-process artificial structural MR data. The aim was to evaluate the most widely used pre-processing pipelines that are easy to automate and that require minimal user intervention. Running BET2 to skull-strip and N3 to bias-correct the BrainWeb phantom performed especially well. BSE+N3 also appeared to perform better than SPM5 default and HWA+N3 method. In addition, BSE’s specificity allowed BFC and FAST to perform more accurately. However, closer inspection of the results revealed that its high specificity came at the price of GM tissue removal; as the ultimate aim of VBM, in this context, is to compare GM segments, a high FN value (GM removal) is far less desirable than a high FP value (inclusion of extra-cerebral voxels) in that it could adversely impact on statistical analyses. BSE was therefore not taken forward to the subsequent studies. It was also demonstrated that including nonbrain voxels in the inferred GM segment may have a negative effect on the warping process.

The phantom study also showed that BET2/HWA+N3/BFC performs well using default settings (in the case of BET2, its specificity could be controlled with a single parameter), whereas BSE needed undesirable user interaction. In addition, BSE/BFC in BrainSuite2 could not be automated for multi-subject studies; thus, BFC was also dropped for the remainder of the study as was FAST due to its poor performance in the phantom experiment.

Simulated-lesion study

Although results derived from phantom studies can offer useful insights, they do not necessarily capture the full complexity of real scan conditions (e.g. motion artefacts, complex r.f. bias fields, etc.). Therefore the next step was to evaluate the performance of the pipelines using real data. However, assessing the impact of these methods on real MR data with patient data sets is limited by the lack of a ground-truth T-map. Therefore, we created simulated lesions in the temporal lobes (right worse than left) and then evaluated the ability of each pre-processing method to detect these lesions.

Methods

Ten healthy controls were scanned coronally on a 1.5-T GE Signa MRI scanner (GE Medical Systems, Milwaukee, WI, USA) using a T1-weighted 3-D IR-FSPGR sequence (echo time: 4.2 ms, inversion time: 650 ms and flip angle: 20°) with voxel size 0.86×0.86×1.5 mm. All scans were re-sampled to 256×256×256 (1-mm isotropic) using sinc interpolation. The GUI tkmedit in FreeSurfer v.3.04 was used to manually mask GM voxels in the temporal lobe and insula of all subjects; the removal being more intense on the right side (Fig. 2). The two populations (original and manually edited) were (i) not pre-processed (Full Volume, SPM5 default); and (ii) skull-stripped prior to analysis using (iia) BET2 with fixed arguments (fractional intensity threshold, f, set to 0.4 and vertical gradient, g, set to 0) and (iib) HWA using atlas information. The stripped volumes were then bias-corrected using N3 and subsequently, all volumes were normalised, segmented and modulated using the unified segmentation tool provided in SPM5. As demonstrated above, bias correction is more accurate if SPM5 does not correct for intensity nonuniformities from pre-processed volumes. Hence, bias regularisation was set to 10 and bias FWHM was increased to 150-mm cut-off for (ii), and default settings (bias regularisation: 0.001; bias FWHM: 60-mm cut-off) were kept for (i). GM segments were then smoothed using an 8-mm FWHM isotropic Gaussian kernel, the two populations (original and manually edited) were entered into a design matrix and the relationships between changes in GM content were statistically compared by performing two-sample t-tests. The analyses were run both with, and without, masking; two types of masking were performed: (i) threshold and (ii) explicit (simple) masking. Relative threshold masking (RTM) only allows statistical analysis of voxels at which all images exceed the value of the threshold as a proportion of the global value. The proportionality constant (relative threshold) used in this study was set to 0.6. Simple masking, on the other hand, directly excludes voxels which are not to be assessed; the mask was generated by binarising (threshold at Prob({GM})=0.05) and smoothing (8-mm FWHM kernel) the GM-tissue probability map (“tpm/grey.nii” in SPM). Results are reported at a statistical threshold of p=0.001 uncorrected.

Fig. 2. Artificial temporal lobe and insula lesions (right worse than left) in a healthy control.

Results

Fig. 3 (Raw) shows the VBM results projected onto glass brains for the methods described above. HWAatl+N3 and full-volume (SPM5 default) methods were inferior to BET2f=0.4+N3 in that they both failed to identify the left temporal lesion. However, the statistical effect using BET2 with N3 (and using default SPM5 settings to some extent) was dragged out from the brain to the edge of the bounding box. Historically, this problem had been solved by threshold or simple masking. Fig. 3 (BET2+N3, RTM 0.6), however, shows that although BET2f=0.4+N3 (Raw) was the only method that successfully detected the left temporal lesion, the left side abnormality was entirely erased and right side peaks were excluded from the output with RTM 0.6. BET2+N3 (simple masking) was more inclusive and although both clusters were reduced, the right side primary peak was included. However, the primary peak of the left side cluster, which was dragged out to the edge of the bounding box in BET2+N3 (Raw), was completely removed by simple masking, leaving the secondary peak as the maximum T-statistic.

Fig. 3. VBM results projected onto glass-brains for the simulated lesions with statistical threshold set to p=0.001 uncorrected.

As the differences in GM segments for the T-map generation were in both temporal lobes and insulae, we hypothesised that the T-statistics formed from the smoothed data were affected by the nonlinear interaction of the Gaussian kernel with the voxel variances, and the generation of low-probability voxels in the segmentation procedure meant that the statistical effect at the GM boundary could be dragged out of the brain. Given that the T-map is purely a reflection on local statistical differences in voxel value, regions of negligible voxel intensity may be influenced by adjacent true statistical effects after application of the smoothing kernel. In addition, smoothing may also cause neighbouring GM regions to be averaged and therefore result in a minimum variance between the regions.
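The mechanism hypothesised above can be illustrated with a one-dimensional toy model. This sketch is not the simulation reported by the authors; the group size, the 5% amplitude variability, the lesion depth and the helper names `gaussian_kernel` and `two_sample_t` are all assumptions made for illustration. A GM "slab" ends at voxel 100, one group has a lesion in its last 10 voxels, and both groups are smoothed with an 8-mm FWHM kernel on 1-mm voxels before a pooled-variance two-sample t-test.

```python
import numpy as np

def gaussian_kernel(fwhm_vox):
    """Normalised 1-D Gaussian kernel, truncated at 4 sigma."""
    sigma = fwhm_vox / (2.0 * np.sqrt(2.0 * np.log(2.0)))
    radius = int(np.ceil(4 * sigma))
    x = np.arange(-radius, radius + 1)
    k = np.exp(-0.5 * (x / sigma) ** 2)
    return k / k.sum()

def two_sample_t(a, b):
    """Voxel-wise pooled-variance t-statistic (rows are subjects)."""
    na, nb = a.shape[0], b.shape[0]
    diff = a.mean(axis=0) - b.mean(axis=0)
    pooled = ((na - 1) * a.var(axis=0, ddof=1)
              + (nb - 1) * b.var(axis=0, ddof=1)) / (na + nb - 2)
    se = np.sqrt(pooled * (1.0 / na + 1.0 / nb))
    t = np.zeros_like(diff)
    np.divide(diff, se, out=t, where=se > 0)  # t = 0 where variance is 0
    return t

def smooth(group, kernel):
    return np.array([np.convolve(row, kernel, mode="same") for row in group])

rng = np.random.default_rng(0)
n_subj, n_vox, edge = 10, 200, 100            # assumed toy dimensions
kernel = gaussian_kernel(8.0)                 # 8-mm FWHM, 1-mm voxels

inside = (np.arange(n_vox) < edge).astype(float)
controls = rng.normal(1.0, 0.05, (n_subj, 1)) * inside
patients = rng.normal(1.0, 0.05, (n_subj, 1)) * inside
patients[:, edge - 10:edge] *= 0.5            # lesion next to the boundary

s_controls = smooth(controls, kernel)
s_patients = smooth(patients, kernel)
t_map = two_sample_t(s_controls, s_patients)

outside = edge + 10                           # 10 voxels beyond the GM edge
print("mean smoothed control signal outside:", s_controls[:, outside].mean())
print("t inside lesion:", t_map[edge - 5], " t outside:", t_map[outside])
```

In this toy setting the smoothed signal 10 voxels outside the boundary is a tiny fraction of the GM value, yet the group difference and the between-subject variance shrink by the same factor, so the t-statistic stays large there. That is one way to read the "dragging out" of boundary effects described in the text.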

To test this hypothesis, we ran the following experiment in which random background noise was introduced to low-probability GM voxels with the prediction that this would correct the observed artefactual displacement of significant clusters in the T-map.

Fig. 4. Significant clusters (Haircut method) overlaid onto the smoothed population average (templates/T1.nii in SPM) at four coronal depths. The top row shows the lesion map (average mask) generated directly from the artificial mask at identical depths.

Haircut experiment on the simulated lesion data

Methods

The same volumes and processing steps described in the simulated-lesion study were used in this experiment. However, the Haircut procedure followed tissue classification; random noise was added to low-probability voxels from the GM segments prior to smoothing. An empirical threshold an order of magnitude lower than expected GM probabilities in GM regions was set to 0.05. Noise uniformly distributed between 0 and 0.05 was only added to probabilities lying below this threshold. In addition, a lesion map (average mask) was employed in order to estimate the sensitivity of the results obtained using the method proposed in this study. The ground-truth map was generated by the sum of subtracted (original minus manually edited) binarised (threshold at 0.5) smoothed GM segments in standard space. The added noise could be seen as imposing a soft minimum on GM probability, which is justifiable due to errors which are not accounted for in the mixture model of segmentation.
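The Haircut step described above can be sketched in a few lines. This is an illustration only, assuming the GM segment is available as a NumPy array of probabilities; the function name `haircut`, its arguments and the toy array are hypothetical and are not taken from the authors' software.

```python
import numpy as np

def haircut(gm_prob, threshold=0.05, rng=None):
    """Add uniform noise in [0, threshold) to GM probabilities below threshold."""
    rng = np.random.default_rng() if rng is None else rng
    gm = np.asarray(gm_prob, dtype=float)
    out = gm.copy()
    low = gm < threshold                      # low-probability voxels only
    out[low] += rng.uniform(0.0, threshold, size=int(low.sum()))
    return out                                # smoothing would be applied afterwards

# Toy GM segment (illustrative values, not data from the study)
rng = np.random.default_rng(0)
segment = np.array([[0.00, 0.01, 0.60],
                    [0.04, 0.90, 0.05]])
noisy = haircut(segment, threshold=0.05, rng=rng)
```

Voxels at or above the threshold are returned unchanged, so the step only alters regions where the segmentation assigns almost no GM; the threshold and noise amplitude are tied together here, as in the text, but they could be exposed as separate parameters.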

Results

Comparing the VBM results from Fig. 3, it is clear that adding background noise to low-probability voxels in GM segments prior to smoothing reduces the spreading effect from clusters of significant differences located close to the external glass-brain boundary. It is also clear that, although the blobs were not completely restored to the glass brain, the Haircut method is preferable to threshold masking for the BET2+N3 method because it does not exclude real abnormalities from the statistical analysis. The left side cluster (BET2+N3, Haircut) contained three peaks yet, as mentioned above, when simple masking was applied, the most significant of these peaks was omitted. Fig. 4 compares the areas of significant atrophy obtained using the Haircut method with the average mask described in the Methods section at four different coronal depths. It is encouraging to observe that the VBM clusters and the lesion mask are highly concordant, especially for the BET2+N3 pipeline (see Supplementary material for unthresholded T-map comparisons). Although noise was also added to voxels within the brain boundary (WM tissue), the method neither improved nor worsened localisation accuracy in internal areas due to the mild severity of the artificial lesions compared to the size of the kernel (8-mm FWHM). However, study populations with severe atrophy might benefit from this restoration technique on both internal and external boundaries of GM tissue. The threshold and noise amplitude were chosen empirically, but it was noticed that much lower and much higher thresholds resulted in inaccurate location, shape or size of the blobs. This result suggests that there is an optimal threshold and noise amplitude that should be used for the specific volumes under study. Intuitively, the statistical effect of noise, with mean and standard deviation an order of magnitude lower than Prob({GM})=0.5, being smoothed into GM tissue, can be neglected. The next section illustrates the restoration effect of the Haircut method on synthetic data and the impact of the choice of threshold and noise properties in the results of two-sample t-tests.

Haircut experiment: synthetic data

Methods

We designed a synthetic, 1-D, “controls versus patients” experiment, where the central row of the mid-axial slice from the re-sampled phantom fuzzy (91 voxels) was used as a baseline to generate data as shown in Fig. 5. “Healthy” regions of synthetic patients and all synthetic controls were obtained as follows:

• if $P_{\text{baseline}} > 0.025$, then $P = |P_{\text{baseline}} - P_s|$, where $\langle P_s \rangle = 0$, $\sigma = 0.2$

• if $P_{\text{baseline}} < 0.025$, then $P = P_{\text{baseline}}$

$P_s$ was a random number from the normal distribution with mean parameter, $\langle P_s \rangle$, and standard deviation, $\sigma$. Baseline probabilities below 0.025 were kept unchanged in order to simulate the effect of probability gradients at the edge of the brain boundary. The artificial lesion was created at voxel #76 to #80 (see Fig. 5, ground truth); that is, at the right hand edge of brain boundary. The lesion consisted of subtracting higher probabilities (compared to those that were subtracted to create the “healthy” tissue) in order to simulate loss of GM density at this location: $P = |P_{\text{baseline}} - P_s|$, where $\langle P_s \rangle = 0.5$, $\sigma = 0.4$.

Fig. 5 shows a sample synthetic “control” and “patient” profile generated as described above. In total, a population of 20 synthetic controls and 20 synthetic patients were generated for this experiment, and three different procedures were followed prior to performing the usual unpaired t-test: (i) default: all profiles were smoothed using an 8-mm FWHM Gaussian kernel (see Fig. 5, smoothed control/patient); (ii) Haircut-0.4: random noise uniformly distributed between 0 and 0.4 was added to probabilities below 0.4 from the GM segments prior to smoothing (see Fig. 5, Haircut-0.4, smoothed control/patient); and (iii) Haircut-0.05: random noise with mean 0.025 was only added to voxels with Prob({GM}) < 0.05. Threshold (masking at relative threshold of 0.6) and simple masking (ideal brain mask: 0s at voxels #1–#11 and #81–#91) were also used to exclude voxels from the statistical analysis.

Fig. 5. Synthetic control and patient profiles generated from the central row of the mid-axial slice from the 2-mm isotropic, BrainWeb phantom fuzzy (baseline); voxels #76–#80 where the lesion was created (ground truth); an example of one synthetic subject from each group (synthetic control/patient); their appearance after smoothing (smoothed synthetic control/patient) and their appearance after Haircut-0.4 and smoothing. t-tests: T-statistics as a result of the unpaired t-test (20 synthetic controls versus 20 synthetic patients) with p<0.001 uncorrected for each comparison as labelled in the figure.
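The synthetic experiment above can be approximated with the following sketch. It is not the authors' code: the BrainWeb baseline profile is not reproduced in the text, so a hypothetical step-shaped `baseline` is assumed (zero at voxels #1–#11 and #81–#91, a GM ribbon at each edge), the Haircut-0.05 noise is assumed to be uniform on [0, 0.05] (which has the stated mean of 0.025), and all function names are illustrative.

```python
import numpy as np

def gaussian_smooth(profiles, fwhm_vox):
    """Smooth each row with a normalised Gaussian kernel (zero-padded edges)."""
    sigma = fwhm_vox / (2.0 * np.sqrt(2.0 * np.log(2.0)))
    radius = int(np.ceil(4.0 * sigma))
    x = np.arange(-radius, radius + 1)
    kernel = np.exp(-0.5 * (x / sigma) ** 2)
    kernel /= kernel.sum()
    return np.array([np.convolve(row, kernel, mode="same") for row in profiles])

def make_group(baseline, n, rng, lesion=None):
    """Rows are synthetic subjects built from the rules given in the text."""
    group = np.tile(baseline, (n, 1))
    brain = baseline > 0.025                       # voxels below 0.025 stay unchanged
    healthy = rng.normal(0.0, 0.2, size=(n, int(brain.sum())))
    group[:, brain] = np.abs(baseline[brain] - healthy)
    if lesion is not None:                         # stronger subtraction inside the lesion
        width = lesion.stop - lesion.start
        loss = rng.normal(0.5, 0.4, size=(n, width))
        group[:, lesion] = np.abs(baseline[lesion] - loss)
    return group

def haircut(group, threshold, rng):
    """Add uniform noise in [0, threshold) to probabilities below threshold."""
    out = group.copy()
    low = group < threshold
    out[low] += rng.uniform(0.0, threshold, size=int(low.sum()))
    return out

def unpaired_t(a, b, eps=1e-12):
    """Pooled-variance two-sample T-statistic per voxel (eps avoids 0/0)."""
    na, nb = len(a), len(b)
    pooled = ((na - 1) * a.var(axis=0, ddof=1)
              + (nb - 1) * b.var(axis=0, ddof=1)) / (na + nb - 2)
    return (a.mean(axis=0) - b.mean(axis=0)) / np.sqrt(pooled * (1.0 / na + 1.0 / nb) + eps)

rng = np.random.default_rng(0)
baseline = np.zeros(91)
baseline[11:80] = 0.2            # hypothetical interior (assumption)
baseline[11:18] = 0.9            # hypothetical GM ribbon, left edge (assumption)
baseline[72:80] = 0.9            # hypothetical GM ribbon, right edge (assumption)
lesion = slice(75, 80)           # voxels #76-#80 in the 1-based numbering of the text
fwhm_vox = 8.0 / 2.0             # 8-mm FWHM on a 2-mm grid

controls = make_group(baseline, 20, rng)
patients = make_group(baseline, 20, rng, lesion)

def t_map(threshold=None):
    """T-map for the default procedure, or after Haircut with the given threshold."""
    c, p = controls, patients
    if threshold is not None:
        c, p = haircut(c, threshold, rng), haircut(p, threshold, rng)
    return unpaired_t(gaussian_smooth(c, fwhm_vox), gaussian_smooth(p, fwhm_vox))

t_default = t_map()
t_haircut_005 = t_map(0.05)
t_haircut_04 = t_map(0.4)
# 1-based voxel index of the maximum T-statistic for each procedure
print(np.argmax(t_default) + 1, np.argmax(t_haircut_005) + 1, np.argmax(t_haircut_04) + 1)
```

Because the baseline here is a stand-in, the peak locations printed by this sketch need not match those reported for the phantom profile; the point is only to show where the noise enters the pipeline (after profile generation, before smoothing) and how the three procedures differ.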

Results

Unpaired t-tests for each method confirmed the hypothesis outlined in the previous section. The Haircut-0.4 method clearly restricted the voxels of significance to their true location as shown in Fig. 5 (t-test, Haircut-0.4), and even a more conservative addition of noise (Haircut-0.05) showed a significant improvement of the blob location and size compared to the default VBM processing. This statistical effect appears to be stronger at the external boundary, because at internal structures, the effect is reduced by the presence of adjacent GM tissue with high GM probabilities. The Haircut method has therefore proven to reduce the effect of both external and internal T-statistics leaking. Alternatively, simple masking (t-test, default, simple mask) cropped the default blob to more accurately match the spatial location of the ground-truth lesion. This method, however, depends not only on the accuracy of the brain-extraction algorithm, but also on cluster maxima being included in the brain mask. We saw this effect earlier in the manual-lesion experiment; here the peak for the default procedure was located at voxel #81, hence simple masking failed to include the maximum T-statistics. In contrast, Haircut (0.05 and 0.4) restored the maximum T-value to voxel #79. Consequently, simple masking may be more precise when used after blob restoration using Haircut (t-test, Haircut-0.05, simple mask) as it may benefit from both external and internal cluster restoration. Fig. 5 (t-test, default, RTM 0.6) also shows the impact of threshold masking; RTM 0.6 was the most conservative method and erroneously excluded all voxels in the lesion region.

A noise magnitude of 0.4 was empirically needed to define the synthetic lesion with accuracy; however, such high magnitude of added noise may be smoothed into brain regions and may significantly reduce statistical power, therefore the less invasive 0.05 Haircut noise was used for the remainder of the study. Higher threshold values were also applied; interestingly, it was noticed that, although the region of significance was underestimated, the blob maximum was located at the centre of the ground-truth lesion. In addition, it should be noted that lighter and stronger lesions did not change the behaviour of the results described above.

Patient study

Having demonstrated the comparative validity and accuracy of the methods in phantom, simulated-lesion and synthetic data, the following study compared the impact on real patient data.

Methods

Forty-three subjects, 20 patients diagnosed with incipient Alzheimer’s disease or AD (age: 70±6 years old) and 23 healthy controls (age: 64±8 years old), were scanned using the same protocol, and analysed using the same statistical model and threshold described in the previous section. The patient scans came from a longitudinal study of mild cognitive impairment (MCI): at the time of scanning, all met criteria for amnesic MCI, but subsequently declined over a 2–3 year follow-up period indicative of probable AD. When scanned, the patient group’s mean mini-mental state examination score was 26.9/30 (σ=1.5).

VBM analyses were performed for default SPM5 settings (full volumes), and pre-processed volumes using HWAatl/BET2f = 0.4 and N3. Four different methods followed segmentation: (i) Raw (default SPM5 procedure and analysis); (ii) Haircut (default analysis); (iii) threshold masking the Raw analysis at a relative threshold of 0.6 (RTM 0.6) and (iv) threshold masking the Raw analysis at a relative threshold of 0.8 (RTM 0.8, the default setting in SPM2). Although SPM5 does not perform threshold masking by default, we have used the old SPM2 default setting (RTM 0.8) for comparison purposes. In order to provide an independent yardstick against which to compare results, manual volumetric measurements of the left and right hippocampi were performed in each subject. The hippocampus was chosen because previous volumetric studies have demonstrated that it is atrophic in MCI-stage AD (Jack et al., 1999). Native-space MRI scans were oriented into stereotaxic alignment and then manual tracings taken in coronal plane from the rostral-most extremity through to the first slice on which the posterior commissure was visible. These steps were performed using Analyze version 6 software (Biomedical Imaging Resource, Mayo Foundation, Rochester, MN, USA). In this cohort, the mean volume reduction in patients compared to controls was 23.2% (p=0.00002) on the left and 20.8% (p=0.0001) on the right. Thus, to assess the accuracy of each VBM method, this enabled us to ask, (i) could the VBM methods detect this independently verified atrophy and (ii) if so, could each confirm that the effect size on the left side was marginally greater than the right?
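The relative threshold masking used in methods (iii) and (iv) can be illustrated with a small sketch. It assumes the usual SPM convention that a voxel enters the analysis only if it exceeds the given fraction of the global value in every image; the global value is simplified here to the plain image mean, which is an assumption and may differ from SPM's own global estimate. The function name and the toy profiles are hypothetical.

```python
import numpy as np

def relative_threshold_mask(images, fraction):
    """Keep voxels where every image exceeds fraction * that image's global mean."""
    images = np.asarray(images, dtype=float)             # shape: (subjects, voxels...)
    axes = tuple(range(1, images.ndim))
    global_mean = images.mean(axis=axes, keepdims=True)  # simplified global (assumption)
    return np.all(images > fraction * global_mean, axis=0)

# Two toy smoothed GM profiles (illustrative values only)
profiles = np.array([[0.00, 0.25, 0.50, 0.85],
                     [0.00, 0.30, 0.60, 0.70]])
mask_06 = relative_threshold_mask(profiles, 0.6)   # RTM 0.6
mask_08 = relative_threshold_mask(profiles, 0.8)   # RTM 0.8
print(mask_06, mask_08)
```

In this toy case the second voxel survives RTM 0.6 but is excluded by RTM 0.8, which mirrors how a stricter relative threshold removes low-intensity voxels near the GM boundary from the statistical analysis.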

Results

Fig. 6 shows the VBM glass-brain projections for each method. An overall assessment of all projections indicated that considerable differences in sensitivity existed across methods, although there was reasonable concordance in the observed relative regional patterns of atrophy. All VBM methods successfully detected bilateral hippocampal atrophy at a statistical cut-off of p=0.001 uncorrected, but interestingly, not at p=0.05 false discovery rate (FDR) corrected. However, results were more variable when the statistical magnitude of right versus left hippocampus was examined. T-statistics and effect size for each method were calculated using a small volume correction (SVC) with a 3-mm radius sphere centred at the local maxima for each hippocampus in each method. Consistent with the independent volumetric results, the T-statistic and effect size were marginally higher on the left, compared to the right, hippocampus for BET2f = 0.4 and HWAatl (T-stats, left/right: 4.61/4.47 and 4.72/4.49; effect size, left/right: 11.7%/10.6% and 12.5%/12.3%, respectively), but were found to have the opposite behaviour for the default SPM5 analysis (T-stats: 4.82/5.09; effect size: 9.8%/13.4%). The results of the BET2+N3 analysis were particularly striking in their clarity: with RTM applied (primarily for viewing ease, Fig. 7), the hippocampi are identified as a virtually perfect cast of their anatomical contour. As an interesting aside, all the analyses (best seen in Fig. 6 with RTM 0.6) identified posterior cingulate atrophy. This area has been shown repeatedly to be the locus of most severe (18F)-fluorodeoxyglucose positron emission tomography (PET) hypometabolism in incipient AD (Minoshima et al., 1997; Nestor et al., 2003) and, along with the hippocampus, the only site of accelerated volume loss in pre-symptomatic autosomal dominant-AD mutation carriers (Scahill et al., 2002). This observation is being investigated further for a future clinical report.

Fig. 6. VBM clusters on glass-brain projections for the Alzheimer's disease analysis at a statistical threshold of p<0.001 (uncorrected) and an extent threshold of 200 voxels (see Methods for details).

Turning to the cortical surface, Fig. 6 illustrates that the Haircut technique attenuated the spread of statistical abnormalities outside the brain boundary. Unlike the VBM results for the simulated-lesion study (Fig. 3), the spreading effect was greatest for the HWAatl+N3 method and least evident for full volume (SPM5 default). This suggests that the spreading effect is very idiosyncratic to the volumes under study; that is, it depends on the location of the largest differences in relation to the brain boundary, and the size and/or mean variance of the clusters. For these reasons, it is not possible to predict a priori which method will produce more or less spreading after smoothing.

Fig. 7. Overlays of the hippocampal and posterior cingulate clusters at six coronal depths for the BET2f = 0.4+N3 method with relative threshold set to 0.6.

Although the default SPM5 methodology (full volume) apparently shows higher sensitivity than pre-processing methods, it was of concern to the authors that given the results of the phantom study (see Table 2, standard space), this apparent sensitivity could be due to template miswarping (false positives). We were also concerned because some aspects of the cortical distribution of atrophy appeared at odds with prior knowledge of cortical pathology in AD. Overall, the results looked plausible – maximal atrophy in posterior association cortex, slightly less significant prefrontal atrophy and relatively preserved primary sensorimotor cortex – however, the finding of severe atrophy in visual association cortex near the occipital pole was concerning as cortical atrophy in AD is generally considered to be of greater severity in more rostral parieto-temporal and occipito-temporal association cortices.

To investigate the possibility that the observed greater involvement of occipital pole may have been due to miswarping, measurements of spatial normalisation accuracy across methods were performed by calculating the regional mean, $\langle S \rangle$, and standard deviation, σ, of the local mean values inferred from the binarised (threshold set to 0.5), unmodulated normalised GM segments (after excluding all voxels with probabilities of 1 and 0 that are common to all techniques). A larger mean value for an individual voxel indicates lower inter-subject variation, hence the regional mean increases as the overall variability across normalised segments decreases in such region. The standard deviation is a measure of the degree of dispersion of miswarping. The number of voxels remaining after excluding 1s and 0s, L, is also an interesting measure because it represents the number of voxels which are found to be discrepant across subjects; a larger L indicates stronger miswarping. It should be noted that large FN/FP rates might have a strong impact on this measure. Table 3 shows the above measurements for the complete data set, the right occipital pole (20, −96, 2) and left frontal pole (−12, 66, 4). These were contrasted to volumes of interest (VoIs) in areas where one would predict maximal cortical atrophy: the medial occipital–parietal junction (14, −78, 36), then more laterally the posterior–parietal region (32, −76, 36), and most laterally the lateral posterior–parietal region (52, −48, 34). Note that the anatomical locations are expressed as Montreal Neurological Institute (MNI) co-ordinates in millimetres. All VoIs, except the complete set, are cubes of dimensions 11×11×11 voxels centred at the co-ordinates given above.

Table 3
Warping error metrics in patient data

| VoI (MNI co-ordinates, mm) | Metric | BET2f=0.4+N3 | HWAatl+N3 | Full volume |
|---|---|---|---|---|
| Complete set | L | $1.94 \times 10^{5}$ | $2.01 \times 10^{5}$ | $2.04 \times 10^{5}$ |
| | $\langle S \rangle$ | 0.508 | 0.502 | 0.502 |
| | σ | 0.295 | 0.300 | 0.302 |
| R occipital pole (20, −96, 2) | L | 1222 | 1240 | 1223 |
| | $\langle S \rangle$ | 0.437 | 0.441 | 0.390 |
| | σ | 0.234 | 0.235 | 0.226 |
| R medial occipital–parietal junction (14, −78, 36) | L | 1314 | 1317 | 1323 |
| | $\langle S \rangle$ | 0.516 | 0.513 | 0.511 |
| | σ | 0.162 | 0.164 | 0.170 |
| R posterior–parietal (32, −76, 36) | L | 1247 | 1260 | 1254 |
| | $\langle S \rangle$ | 0.537 | 0.531 | 0.535 |
| | σ | 0.180 | 0.179 | 0.186 |
| R lateral posterior–parietal (52, −48, 34) | L | 1294 | 1304 | 1300 |
| | $\langle S \rangle$ | 0.577 | 0.585 | 0.601 |
| | σ | 0.208 | 0.213 | 0.223 |
| L frontal pole (−12, 66, 4) | L | 811 | 825 | 822 |
| | $\langle S \rangle$ | 0.500 | 0.512 | 0.513 |
| | σ | 0.281 | 0.286 | 0.282 |

The BET2 method systematically produced the smallest number of discrepant voxels, which indicates the best warping to template, although the percentage difference between it and other methods was small in each case. For the complete set, the segments generated by the standard settings produced the highest L, which might explain the apparent high sensitivity of the default SPM5 method observed in Fig. 6, and is in agreement with previous findings regarding the better warping to template obtained by pre-processing using BET2f = 0.4 and N3. The most lateral region presented the highest mean value, $\langle S \rangle$, which suggests that there was a better registration compared to less lateral areas. Specifically, the occipital pole (right side) showed a very low mean and high standard deviation for all methods, especially for full volume (SPM5 default) suggesting that this implausible abnormality was indeed an artefact of miswarping. It was also found that the frontal pole (left side) region has a mean of the normalised sums of probabilities similar to those at the medial occipital–parietal junction and the more lateral region; these are all more plausibly abnormal areas in AD cases (Graham et al., 2002). Additionally, it is important to note that the standard deviation (i.e. dispersion of miswarping) is much larger at occipital and frontal poles, which suggests that the rostral and caudal extremities are more susceptible to registration errors.
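The three warping metrics described above (L, $\langle S \rangle$ and σ) can be sketched as follows. This is one reading of the description, not the authors' implementation: voxels are excluded when the binarised segments agree across all subjects within a single method, and the population standard deviation is used; both choices are assumptions, as are the function name and the toy data.

```python
import numpy as np

def warping_metrics(gm_segments, threshold=0.5):
    """Return (L, regional mean, standard deviation) over discrepant voxels."""
    binary = np.asarray(gm_segments) > threshold      # axis 0 indexes subjects
    local_mean = binary.mean(axis=0)                  # fraction of subjects labelled GM
    discrepant = (local_mean > 0.0) & (local_mean < 1.0)
    values = local_mean[discrepant]                   # assumes at least one such voxel
    return int(discrepant.sum()), float(values.mean()), float(values.std())

# Four toy subjects, four voxels (illustrative probabilities only)
segments = np.array([[0.9, 0.9, 0.1, 0.9],
                     [0.9, 0.2, 0.1, 0.9],
                     [0.9, 0.2, 0.1, 0.9],
                     [0.9, 0.2, 0.1, 0.2]])
n_discrepant, mean_s, sd_s = warping_metrics(segments)
print(n_discrepant, mean_s, sd_s)
```

In the toy example the first and third voxels are labelled identically in every subject and are dropped, leaving two discrepant voxels with local means of 0.25 and 0.75. For a VoI, the same function could be applied to an 11×11×11 sub-array cut from the normalised segments around the voxel corresponding to the MNI co-ordinate.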

Discussion

It is known that clusters of statistically significant differences found in VBM may lead to ambiguous results due to systematic miswarping of a particular region across groups (Bookstein, 2001); this is a topic of ongoing discussion among the scientific community. This study was not intended to solve this problem, but rather, to evaluate the impact that pre-VBM pipelines (skull-stripping and bias correction) have on VBM results by comparing them to each other and to those derived from unpre-processed MR images (full volumes), where standard VBM procedures were used.

The first experiment investigated the impact that pre-processing pipelines had on bias correction and spatial normalisation using BrainWeb phantoms. N3 outperformed FAST, BFC and bias correction in SPM5. It is important to note that previous investigations (Shattuck et al., 2001) suggested that BFC performs slightly better than N3 when strong bias fields are present, therefore using BFC instead of N3 should be considered for such cases. With regard to tissue segmentation and warping to template, it was found that although HWAatl/BET2f = 0.4+N3 outperformed other methods in native space, optimised BSE and BET2f = 0.5+N3 yielded the most accurate segments compared to the gold standard in standard space. However, BSE required impractical user intervention and demonstrated visibly erroneous brain-tissue removal. Thus, for the remainder of the study, only a conservative BET2 setting (BET2f = 0.4) and HWAatl+N3 were used as pre-processing methods. BFC and FAST were also dropped due to poor automation and performance, respectively. It was also found that multi-pass skull-stripping improves brain extraction; however, multi-pass bias correction resulted in inaccurate calculations of nonuniformity fields. Future studies may certainly benefit from the use of multi-pass skull-stripping. It should also be noted that a recent comparative study (Zaidi et al., 2006) suggested that other segmentation methods were more effective than an earlier version of SPM (SPM2). Future studies, comparing the performance of different segmentation methods (including SPM5) using pre-processing pipelines, would also be of interest.

As lack of a gold standard is an inherent problem in real data sets, the performance of the pre-processing pipelines as opposed to the default SPM5 procedure (full volume) was, first, tested with simulated lesions to the temporal lobes (right worse than left). The BET2f = 0.4+N3 pipeline outperformed HWAatl+N3 and default SPM5 methods, since it produced the only VBM result that included the left temporal lesion. However, the clusters of significant differences (left and right) were significantly blurred away from the known lesion sites into extra-cerebral space. In the past, this problem was solved by threshold or explicit (simple) masking; however, the results of the BET2f = 0.4+N3 method showed that both masking procedures excluded real statistical differences from the analysis. In contrast, the Haircut method, named because of the observed impact, successfully confined the external significant clusters close to the brain boundary for all methods in the manual-lesion study and for a synthetic data set. Haircut is more inclusive than masking as it includes all the information from the original T-map, whereas masking can remove true peaks which, in turn, can move the location of the apparent cluster maximum and can attenuate the T-stats size (e.g. simple masking excluded the primary peak of the left temporal simulated lesion, leaving the secondary peak as the maximum T value). Masking can also reduce cluster size (e.g. the single, contiguous lesion on the left side was broken up into separate fragments with threshold masking). The Haircut technique needs further work in order to address some issues: (i) this method might also improve VBM accuracy in internal structures of real data sets as seen for synthetic data and (ii) the performance of the technique varies depending on both the threshold that determines what voxels are considered to be of low probability and the properties of the noise added to such voxels.

Other investigators (Reimold et al., 2006) have suggested a technique to correct for smoothing artefacts and improve spatial precision in VBM by combining contrast images and T-maps. This is an interesting approach because although contrast varies with FWHM, it is not affected by the nonlinear interaction of the kernel with the voxel variances. Reimold et al.’s method, therefore, physically relocates the cluster peaks to better match the effect-size maxima, whilst Haircut reduces the significance of voxels with very low GM probability. Further work related to the effect of smoothing on T-maps should compare the performance of both approaches.

Finally, the VBM results from the BET2+N3 and HWA+N3 methods were in agreement with manual volumetric measurements in the patient study; however, the results from the default SPM5 procedure failed to identify that the effect size for the left hippocampus has marginally greater abnormality than the right. The strong concordance between the results from the independent manual measurements and those of BET2+N3 and HWA+N3 raises concerns over the validity of the relative distribution of T-statistics when using default SPM5 (i.e. if this had been a clinical study using default SPM5 settings, one would have erroneously concluded that the more atrophic hippocampus was actually the more preserved).

As an aside, it was interesting that although the manual volumetric measurements showed highly significant hippocampal atrophy, the VBM statistical threshold that identified this atrophy required the most liberal, ‘uncorrected’ method (p<0.001). The more widely used FDR-corrected threshold proved too conservative to identify this atrophy in spite of this being a reasonably well-powered sample size. Deciding on an appropriate statistical threshold for the identification of biologically meaningful clusters is a perennial problem in SPM analyses. These patient data results highlight how this can be particularly pertinent to the risk of generating type II errors. It also suggests that calibrating the statistical threshold to an independent measurement of a discrete structure could provide some justification for choosing a given threshold.
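To make the contrast between the two thresholds concrete, the sketch below applies the Benjamini–Hochberg step-up procedure, the standard way of controlling the FDR, to a set of p-values. The source does not state which FDR implementation was used, so this is an assumption, and the p-values are invented purely for illustration; `bh_threshold` is a hypothetical helper name.

```python
import numpy as np

def bh_threshold(p_values, q=0.05):
    """Largest p-value cut-off that controls the FDR at level q (0.0 if none)."""
    p = np.sort(np.asarray(p_values, dtype=float).ravel())
    m = p.size
    passed = p <= q * np.arange(1, m + 1) / m      # step-up criterion p(k) <= k*q/m
    return float(p[passed].max()) if passed.any() else 0.0

# Invented example: 5 voxels with p = 0.0009 among 1000 tests, the rest clearly null
p_values = np.concatenate([np.full(5, 0.0009), np.linspace(0.1, 1.0, 995)])
n_uncorrected = int(np.sum(p_values < 0.001))      # voxels passing p<0.001 uncorrected
cut = bh_threshold(p_values, q=0.05)
n_fdr = int(np.sum(p_values <= cut))               # voxels surviving FDR at q=0.05
print(n_uncorrected, n_fdr)
```

In this invented case all five voxels pass the uncorrected cut-off but none survives FDR control, because the step-up criterion for the fifth smallest p-value is 5 × 0.05/1000 = 0.00025. A small, spatially confined effect among many tested voxels can therefore be detectable at p<0.001 uncorrected and yet missed at an FDR of 0.05.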

From this experiment, we again found that the Haircut technique was useful in correcting for artefactual displacement of clusters from the external cortical surface into the extra-cerebral space; however, it is notable that, unlike the simulated lesion experiment, this phenomenon was particularly evident using the HWA+N3 pipeline. We could not identify a pattern of spreading across methods and we hypothesised that this effect could be related to the location, amplitude and size of the clusters in relation to the GM boundary.

Major differences in VBM sensitivity were observed across methods, raising concerns that inaccurate registration might be generating artefactual abnormalities in the VBM results. In particular, although BET2+N3 performed very well in the phantom study and demonstrated considerably better sensitivity in detecting the simulated lesions, the full-volume SPM5 and HWA+N3 method detected a far greater volume of significant clusters at the cortical surface in the patient data. Taken together, these observations suggest that the latter two methods may have been generating false positives. To measure warping accuracy, the unmodulated normalised GM segments were binarised and compared across subjects for each method. Complete set and VoI measurements again showed that HWA+N3 and standard methods generated the poorest registrations, being particularly inaccurate in external structures. This might explain the apparent high sensitivity inferred from their VBM results and the implausible atrophy observed in the occipital lobe for all methods.

Conclusion

In conclusion, the experiments indicated that although SPM5 includes segmentation, spatial normalisation and bias correction in the same model, pre-processing with BET2+N3 prior to SPM5 appeared to improve VBM. We also reported that artefactual displacement of significant clusters from the cortical surface can be successfully corrected by addition of background noise to low-probability GM voxels (Haircut method); this has advantages over threshold and explicit masking in that it does not exclude real abnormalities. However, at least as importantly, the results highlight that there remains considerable scope for improvement in VBM methodology, particularly with reference to registration errors. The massive differences found in external cortical clusters show that choice of pre-processing method can have a major impact on the VBM results.

Acknowledgments

We gratefully acknowledge Professor John R. Hodges for identifying patients as well as the participants themselves and their relatives for their continued support with our research. This research was funded by the Medical Research Council (MRC), U.K.

Appendix A. Supplementary data

Supplementary data associated with this article can be found, in the online version, at doi:10.1016/j.neuroimage.2007.10.051.

References

Arnold, J.B., Liow, J.S., Schaper, K.A., Stern, J.J., Sled, J.G., Shattuck, D.W., Worth, A.J., Cohen, M.S., Leahy, R.M., Mazziotta, J.C., Rottenberg, D.A., 2001. Qualitative and quantitative evaluation of six algorithms for correcting intensity non-uniformity effects. NeuroImage 13, 931–943.

Ashburner, J., Friston, K.J., 2000. Voxel-based morphometry—the methods. NeuroImage 11, 805–821.

Ashburner, J., Friston, K.J., 2005. Unified segmentation. NeuroImage 26, 839–851.

Boesen, K., Rehm, K., Schaper, K., Stoltzner, S., Woods, R., Luders, E., Rottenberg, D., 2004. Quantitative comparison of four brain extraction algorithms. NeuroImage 22, 1255–1261.

Bookstein, F.L., 2001. Voxel-based morphometry should not be used with imperfectly registered images. NeuroImage 14, 1454–1462.

Fein, G., Landman, B., Tran, H., Barakos, J., Moon, K., Sclafani, V.D., Shumway, R., 2006. Statistical parametric mapping of brain morphology: sensitivity is dramatically increased by using brain-extracted images as inputs. NeuroImage 30, 1187–1195.

Fennema-Notestine, C., Ozyurt, I.B., Clark, C.P., Morris, S., Bischoff-Grethe, A., Bondi, M.W., Jernigan, T.L., Fischl, B., Segonne, F., Shattuck, D.W., Leahy, R.M., Rex, D.E., Toga, A.W., Zou, K.H., Brown, G.G., 2006. Quantitative evaluation of automated skull-stripping methods applied to contemporary and legacy images: effects of diagnosis, bias correction, and slice location. Hum. Brain Mapp. 27, 99–113.

Graham, D.I., Lantos, P.L. (Eds.), 2002. Greenfield’s neuropathology, 7th ed. Arnold, London.

Jack Jr., C.R., Petersen, R.C., Xu, Y.C., O’Brien, P.C., Smith, G.E., Ivnik, R.J., Boeve, B.F., Waring, S.C., Tangalos, E.G., Kokmen, E., 1999. Prediction of AD with MRI-based hippocampal volume in mild cognitive impairment. Neurology 52, 1397–1403.

Minoshima, S., Giordani, B., Berent, S., Frey, K.A., Foster, N.L., Kuhl, D.E., 1997. Metabolic reduction in the posterior cingulate cortex in very early Alzheimer’s disease. Ann. Neurol. 42, 85–94.

Nestor, P.J., Fryer, T.D., Smielewski, P., Hodges, J.R., 2003. Limbic hypometabolism in Alzheimer’s disease and mild cognitive impairment. Ann. Neurol. 54, 343–351.

Reimold, M., Slifstein, M., Heinz, A., Mueller-Schauenburg, W., Bares, R., 2006. Effect of spatial smoothing on t-maps: arguments for going back from t-maps to masked contrast images. J. Cereb. Blood Flow Metab. 26, 751–759.

Rex, D.E., Shattuck, D.W., Woods, R.P., Narr, K.L., Luders, E., Rehm, K., Stolzner, S.E., Rottenberg, D.A., Toga, A.W., 2004. A meta-algorithm for brain extraction in MRI. NeuroImage 23, 625–637.

Scahill, R.I., Schott, J.M., Stevens, J.M., Rossor, M.N., Fox, N.C., 2002. Mapping the evolution of regional atrophy in Alzheimer’s disease: unbiased analysis of fluid-registered serial MRI. Proc. Natl. Acad. Sci. U. S. A. 99, 4703–4707.

Segonne, F., Dale, A.M., Busa, E., Glessner, M., Salat, D., Hahn, H.K., Fischl, B., 2004. A hybrid approach to the skull stripping problem in MRI. NeuroImage 22, 1060–1075.

Shattuck, D.W., Sandor-Leahy, S.R., Schaper, K.A., Rottenberg, D.A., Leahy, R.M., 2001. Magnetic resonance image tissue classification using a partial volume model. NeuroImage 13, 856–876.

Sled, J.G., Zijdenbos, A.P., Evans, A.C., 1998. A nonparametric method for automatic correction of intensity non-uniformity in MRI data. IEEE Trans. Med. Imag. 17, 87–97.

Smith, S.M., 2002. Fast robust automated brain extraction. Hum. Brain Mapp. 17, 143–155.

Vovk, U., Pernuš, F., Likar, B., 2007. A review of methods for correction of intensity inhomogeneity in MRI. IEEE Trans. Med. Imag. 26, 405–421.

Zaidi, H., Ruest, T., Schoenahl, F., Montandon, M.-L., 2006. Comparative evaluation of statistical brain MR image segmentation algorithms and their impact on partial volume effect correction in PET. NeuroImage 32, 1591–1607.

Zhang, Y., Brady, M., Smith, S., 2001. Segmentation of brain MR images through a hidden Markov random field model and the expectation maximization algorithm. IEEE Trans. Med. Imag. 20, 45–57.

Zhuang, A.H., Valentino, D.J., Toga, A.W., 2006. Skull-stripping magnetic resonance brain images using a model-based level set. NeuroImage 32, 79–92.
