
DEGREE PROJECT IN COMPUTER SCIENCE AND ENGINEERING, SECOND CYCLE, 30 CREDITS

STOCKHOLM, SWEDEN 2020

Multimodal Brain Age Estimation

OSCAR DANIELSSON

KTH ROYAL INSTITUTE OF TECHNOLOGY
SCHOOL OF ELECTRICAL ENGINEERING AND COMPUTER SCIENCE


Multimodal Brain Age Estimation

OSCAR DANIELSSON

Master in Computer Science
Date: July 22, 2020
Supervisor: Kevin Smith
Co-supervisor: Lennart Van der Goten
Examiner: Erik Fransén
School of Electrical Engineering and Computer Science
Swedish title: Multimodal estimering av hjärnans ålder


Abstract

Machine learning models trained on MRI brain scans of healthy subjects can be used to predict age. Accurate estimation of brain age is important for reliably detecting abnormal aging in the brain. One way to increase the accuracy of predicted brain age is to use multimodal data. Previous research using multimodal data has largely been non-deep learning-based; in this thesis, we examine a deep learning model that can effectively utilize several modalities. Three baseline models were trained. Two used T1-weighted and T2-weighted data, respectively. The third model was trained on both T1- and T2-weighted data using high-level fusion. We found that using multimodal data reduced the mean absolute error of predicted ages. A fourth model utilized disentanglement to create a representation robust to missing T1- or T2-weighted data. Our results showed that this model performed similarly to the baselines, meaning that it is robust to missing data at no significant cost in prediction accuracy.


Sammanfattning (Swedish abstract)

Machine learning models trained on MR data of healthy people can be used to estimate age. Accurate estimation of the brain's age is important for reliably detecting abnormal aging of the brain. One way to increase accuracy is to use multimodal data. Previous research using multimodal data has largely not been based on deep learning; in this degree project we examine a deep learning model that can effectively exploit several modalities. Three baseline models were trained. Two used T1-weighted and T2-weighted data, respectively. The third model was trained on both T1- and T2-weighted data using high-level fusion. We found that the use of multimodal data reduced the mean absolute error of the estimated ages. A fourth model used disentanglement to create a representation that is robust to missing T1- or T2-weighted data. The results were similar for this model and the baselines, meaning that the model is robust to missing data without any significant loss of accuracy.


Contents

1 Introduction
  1.1 Research Question
  1.2 Delimitations

2 Background
  2.1 MRI
    2.1.1 T1- and T2-weighted data
    2.1.2 Other modalities
    2.1.3 MRI Protocols
  2.2 Related Work
    2.2.1 Preprocessing
    2.2.2 Evaluation Methods
    2.2.3 Unimodal Methods
    2.2.4 Multimodal Methods
  2.3 Disentanglement

3 Methods
  3.1 Methodological Approach
  3.2 Dataset
    3.2.1 Sessions
    3.2.2 Subject Data
    3.2.3 Data Details
  3.3 Data Processing
    3.3.1 Normalization
    3.3.2 Brain Extraction and Neck Cropping
    3.3.3 Image Registration
    3.3.4 Performance
  3.4 Models
    3.4.1 Unimodal Models
    3.4.2 Fusion
    3.4.3 Multimodal Disentanglement
  3.5 Training
    3.5.1 Transforms
    3.5.2 Splits

4 Results
  4.1 Mean Absolute Error: Unimodal
  4.2 Mean Absolute Error: High-Level Fusion
  4.3 Mean Absolute Error: Multimodal Disentanglement
  4.4 Missing Modalities
  4.5 Effect of Using Multimodal Data on Accuracy
  4.6 Findings

5 Discussion
  5.1 General discussion
  5.2 Societal Aspects and Sustainability
    5.2.1 Ethics
  5.3 Limitations
    5.3.1 Generalizability
    5.3.2 Limited Number of Modalities
    5.3.3 Preprocessing
  5.4 Future Work
    5.4.1 Improving Preprocessing
    5.4.2 Ablation Study
    5.4.3 Multiple Datasets
    5.4.4 Multiple Modalities

6 Conclusions

Bibliography

A Network Details


Chapter 1

Introduction

Aging is a highly intricate, multi-factorial process in which cognitive and physical abilities deteriorate over time. While we have witnessed increasing lifespans over the last century, healthspans, the part of our lives spent in good health, have not increased as significantly, because we spend a growing portion of our lives in hospitals [1, 2]. Healthspans vary drastically between individuals; some retain normal cognitive and physical abilities throughout their lives, while others experience early onset of, for example, memory loss.

By measuring biological age, we learn more about a subject's aging process than from chronological age alone. The difference between the actual age and the biological age of a subject can be used to monitor aging. External factors can increase an individual's biological age at a higher rate than normal; for example, the progression of cognitive diseases such as Alzheimer's disease and dementia is in some ways similar to an accelerated aging of the brain [3]. Several different methods for measuring biological age exist, including blood tests, retinal examination, and examining changes in brain structure. MRI (magnetic resonance imaging) is the preferred method for high-resolution imaging of soft tissue, such as the brain. Using models trained on MRI scans of brain tissue together with the subjects' ages, it has been shown that having a higher predicted brain age than chronological age is correlated with increased mortality [2], cognitive disease [4] and schizophrenia [4]. Having a lower predicted age than actual age, on the other hand, is correlated with physical exercise and meditation [5].

Being able to accurately estimate brain age is important for detecting and monitoring cognitive diseases. Accurate monitoring of disease progression could help, for example, in evaluating new treatments. Several different MRI modalities exist that provide different structural information about the brain. Thus far, research exploring the potential of using multiple modalities, in order to create a richer representation and make more accurate predictions, has been limited to non-deep learning architectures. Utilizing multiple modalities presents several new challenges, one being missing data, since each imaging session does not involve all MRI modalities. This thesis draws inspiration from previous work on multimodal brain tumor segmentation in order to address the problem of missing modalities. The aim of the thesis is to explore the potential of using multimodal data for brain age prediction. Due to computational limitations, 10-fold cross-validation could not be performed for all models trained.

1.1 Research Question

The thesis aims to answer the following two research questions.

• What are the advantages of using T1- and T2-weighted MRI data overusing either modality alone for brain age estimation?

• How well does a multimodal deep learning-based approach perform un-der missing modalities?

1.2 Delimitations

We delimit our work in two ways: (1) modalities are restricted to T1- and T2-weighted data; (2) the effect of different preprocessing methods on performance is not explored. These delimitations are inherently linked; going beyond T1- and T2-weighted data would inevitably mean more time spent on preprocessing, since the pipeline would have to be adapted to each modality.


Chapter 2

Background

2.1 MRI

MRI is a non-invasive procedure for generating imagery of organs or physiological functions. Structural MRI (often referred to as just MRI or sMRI) is used for generating imagery of organs, while functional MRI (fMRI) is often used for measuring brain activity. This thesis focuses on different modalities of structural MRI. Imagery can be generated either slice by slice (2D MRI) or as a volume (3D MRI), but the output is a 3D volume regardless. Voxels can be isotropic, though the slice dimension is often slightly larger when using 2D MRI. Commonly, voxels occupy a volume of around 0.5 mm³ to 1.5 mm³.

A basic MRI experiment can be explained in three steps. A strong magnetic field B0 magnetizes the tissue (changing the spin of its particles). A radio frequency pulse is then emitted in a direction B1, different from B0. Once the pulse is turned off, the magnetization returns to B0 after a short while. When returning from this excited state to equilibrium, radio frequency energy is emitted; this signal is measured and used to construct the image.

By varying parameters such as the signal measuring time and the time between radio frequency bursts, the intensity of the signal produced by a certain tissue changes. These different settings are often referred to as different protocols or modalities of MRI, each having different clinical applications due to differences in signal response over tissue.


Figure 2.1: Side-by-side comparison of T1- and T2-weighted imaging data. White matter appears bright in T2-weighted data but darker in T1-weighted data. Gray matter, on the other hand, appears brighter in T1- than in T2-weighted data.

2.1.1 T1- and T2-weighted data

T1- and T2-weighted imaging data represent the most commonly used modalities for examining neurological structure. T1 and T2 refer to the relaxation time of tissue according to the component of the spin aligned with the base magnetic field and the component aligned with the frequency pulse, respectively. In T1-weighted data, gray matter gives a high-intensity signal while white matter appears darker. In T2-weighted data, white matter appears brighter than gray matter (see Figure 2.1). T1- and T2-weighted data are often used for analysing anatomical structure. T2-weighted data is better for pathological tissue; the increased water content of such tissue gives off a high-intensity signal in T2-weighted data [6]. Using both modalities together can help doctors differentiate pathology by looking at differences in the water/fat concentration.

2.1.2 Other modalities

Other commonly used MRI modalities include Fluid-Attenuated Inversion Recovery (FLAIR), Diffusion-Weighted Imaging (DWI), Susceptibility-Weighted Imaging (SWI) and Double Inversion Recovery (DIR).

2.1.3 MRI Protocols

MRI protocols detail the modalities collected during one MRI session. The protocol used depends on the purpose of the experiment: (brain) screening, epilepsy, tumour growth and neurodegenerative disease all imply different MRI protocols. Table 2.1 shows which MRI modalities are recommended for different protocols relevant to this thesis. The modalities actually collected might differ, depending on the institution, the scanner used and the preferences of the doctor [7].

Protocol            T1   T2   FLAIR   DWI   SWI   DIR
Screening           x    x    x       x     x     -
Neurodegenerative   x    x    x       x     x     -
Epilepsy            x    -    x       x     x     x

Table 2.1: Recommended MRI modalities for different protocols.

2.2 Related Work

Brain-predicted age is the estimated age of the brain of a given subject. The difference between the estimated brain age and the chronological age of the subject is often referred to in the literature as the predicted brain age difference (PAD). A positive PAD, meaning that the estimated age of the brain is higher than the chronological age, has been shown to be associated with reduced cognitive and physical abilities, higher mortality [2], schizophrenia [4] and Alzheimer's disease [8]. A negative PAD, on the other hand, is associated with better cognitive function, a higher level of attained education, and physical exercise [5].
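Concretely, PAD is just the signed difference between the model's predicted age and the chronological age. A minimal sketch (the ages below are illustrative, not from the thesis):

```python
import numpy as np

# Hypothetical predicted brain ages and chronological ages, in years.
predicted_age = np.array([72.1, 65.4, 80.3])
chronological_age = np.array([70.0, 68.0, 75.0])

# Predicted brain age difference (PAD): positive values indicate a brain
# that appears older than the subject's chronological age.
pad = predicted_age - chronological_age
print(pad)  # approximately [ 2.1 -2.6  5.3]
```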

The interest in brain age estimation has mainly focused on two aspects: firstly, as a biomarker for the aging process, and secondly, to increase our knowledge of the brain's aging process through visualization. Creating a reliable biomarker is important, since it could enable us to identify people at risk of developing cognitive diseases, monitor disease progression, and evaluate treatments. Another use case is the identification of genes or other variables related to aging [9].

This section surveys previous unimodal and multimodal approaches to brain age estimation and related tasks, such as brain segmentation, and gives a brief overview of the evaluation and preprocessing methods used.

2.2.1 Preprocessing

Noise introduced by the imaging process, variations in signal intensity over different regions of the image, and machine-dependent intensities are three things that preprocessing methods such as denoising, bias field correction and intensity standardization try to correct [10, 11]. Using multiple MRI scanners often results in data with different fields of view, caused by different machine settings. Image registration can then be applied to transform the images into a standard coordinate system and align them. Aligning the data is also useful when using a single scanner, due to variations in the position of the brain between experiments (caused by, for example, head tilting). Aligning data of different MRI modalities is also possible with commonly used libraries like FSL or ANTs; differences between modalities, however, make this more challenging [12]. An assortment of preprocessing pipelines exists, and the relative effectiveness of each is debated [13]. Two tools commonly used for feature extraction, as well as for the aforementioned preprocessing and more, are FreeSurfer and FSL [14, 15, 16, 17].

2.2.2 Evaluation MethodsA reliable biomarker should give consistent age estimates when varying theMRI scanner model. When evaluated on longitudinal data, estimated ageshould be higher for reexaminations made at later dates. Models trained onpreprocessed data have been shown to be robust to scanner differences and re-examinations [9, 13, 18]. When trained on raw data, models were found to beless robust to scanner differences [9]. Comparing the performance of modelsimplemented in different papers is not straightforward. While most work onbrain age estimation has used the same evaluation metric, the mean absoluteerror (MAE), simply comparing this metric between different reports is insuf-ficient to draw accurate conclusions. Often, different datasets are used, whichmight have very different underlying age distributions of the subjects, whichin turn affects the MAE [19], as errors are not equally distributed over age.Performing comparisons on the same dataset might also not be possible, sincedatasets are often kept private due to privacy reasons.

Since training data use the chronological age, the underlying true brain ageremains unseen and a lower bound on the MAE is unclear. Instead of evalu-ating the performance based on measured chronological age, an alternativeapproach is to measure the amount of variance explained on various cognitiveand physical tests performed by the subject that are known to correlate withaging [20]. By evaluating how accurately the model explains deterioration incognitive and physical abilities, we avoid the problem of differences in chrono-logical and biological age. Two subjects that have the same chronological agemight perform differently in the cognitive and physical tests, due to the fact

Page 15: Multimodal Brain Age Estimation1470319/FULLTEXT01.pdf · Multimodal Brain Age Estimation OSCAR DANIELSSON KTH ROYAL INSTITUTE OF TECHNOLOGY SCHOOL OF ELECTRICAL ENGINEERING AND COMPUTER

CHAPTER 2. BACKGROUND 7

that they are in different stages of the aging process (different biological age).This, however, is only feasible to do if accurate and appropriate tests can beconstructed.
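For reference, the MAE metric used throughout this literature is straightforward to compute; a minimal sketch with toy ages (our own example values):

```python
import numpy as np

def mean_absolute_error(y_true, y_pred):
    """MAE in years between chronological and predicted ages."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean(np.abs(y_true - y_pred))

# Toy ages for illustration only: absolute errors are 2, 3 and 0 years.
mae = mean_absolute_error([70, 68, 75], [72, 65, 75])
print(mae)  # approximately 1.667
```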

2.2.3 Unimodal Methods

Previous work in the area of brain age estimation has primarily relied on unimodal data. The methods used can be split into two categories: deep learning-based and non-deep learning-based. Gaussian process regression (GPR) [9], support vector regression [21] and relevance vector machines [18] have been popular choices for non-deep learning-based methods. Features extracted include cortical thickness, surface area and subcortical volume [22]. For fMRI data, connectivity maps have also been used as features.

An advantage of deep learning methods is their decreased reliance on hand-crafted features, which might provide imperfect information. Multiple architectures and network types have been tried for T1-weighted data. Huang et al. [23] trained a 2D convolutional neural network (CNN) based on the VGGNet architecture. Cole et al. [9] also used the VGGNet architecture, but for a 3D CNN; using raw T1-weighted data, an MAE of 4.61 years was achieved, similar to the MAE of 4.65 years obtained using GPR with gray matter data. As noted above, however, when trained on raw data the network was less robust to scanner differences. In comparison, when using gray matter as input, the 3D CNN achieved an MAE of 4.1 years; adding white matter data did not improve accuracy significantly for either model. By using the newer ResNet architecture and combining the predictions of four 3D CNNs trained on gray matter, white matter, Jacobian and raw data, respectively, in an ensemble regressor, an MAE of 3.5 years was achieved, compared to 3.9 years for an ensemble model based only on ridge regression and GPR [13]. All three networks use a single linear unit in the last layer to output the predicted brain age.
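The common pattern described above, 3D convolutional blocks followed by a single linear output unit for the predicted age, can be sketched in PyTorch. This is an illustrative toy network, not any of the cited architectures; the layer sizes are our own:

```python
import torch
import torch.nn as nn

class Small3DCNN(nn.Module):
    """Toy 3D CNN regressor: conv/pool blocks followed by one linear
    output unit, the pattern shared by the networks cited above."""

    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(1, 8, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool3d(2),
            nn.Conv3d(8, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool3d(2),
        )
        # After two 2x poolings, a (44, 64, 64) volume becomes (11, 16, 16).
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(16 * 11 * 16 * 16, 1),  # single unit -> predicted age
        )

    def forward(self, x):
        return self.head(self.features(x))

# One single-channel volume of the down-sampled size used in this thesis.
x = torch.zeros(1, 1, 44, 64, 64)
age = Small3DCNN()(x)
print(age.shape)  # torch.Size([1, 1])
```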

The effectiveness of deep learning-based methods for brain age estimation has been debated, and some claim their performance is similar to that of, for example, GPR [9, 20]. Recent results [13], however, have been very promising. In Photon-AI 1, an annual online competition for estimating brain age, the winning entry for 2019 achieved an MAE of 2.9 years using an ensemble-based deep learning model [24].

1https://www.photon-ai.com/pac2019


2.2.4 Multimodal Methods

Combining complementary information from several data sources leads to a richer representation and could mean an overall more accurate estimate. Previous work has shown this potential: using multiple regression on T1-, T2- and diffusion-weighted imaging (DWI) data, age estimation was shown to be more accurate [25]. Similar results have been obtained using T1-weighted and DWI data alone [16], as well as by combining features from resting-state fMRI and T1-weighted data [15, 20]. Using six modalities was found to yield an MAE of 3.5 years, compared to 4.13 years for T1-weighted data only, using lasso regression [24]. In contrast with these results, multimodal data has in one case resulted in performance comparable to using unimodal data only [14].

A wide variety of machine learning methods have been employed for multimodal brain age estimation. A recurring theme is using software to extract high-level features from each modality and training a model on these; with the exception of [25], where multiple regression was used on raw data, the majority of work has followed this pattern. Using deep learning for feature extraction and prediction has been largely ignored for the task of multimodal brain age estimation. Other researchers have also identified this gap in the literature [20].

In related areas of research, several attempts have been made to bridge the gap between modalities. Most approaches fall into one of two categories: feature-level fusion or high-level fusion [26]. High-level fusion, in the context of CNNs, means combining predictions or classification-layer features from different models, each trained on one modality. Feature-level fusion is often achieved by extracting high-level features from each modality and merging these. Another way of fusing is to concatenate modalities; 2D input matrices of dimensions (h, w) would be combined into a 3D tensor of size (K, h, w), where K is the number of modalities [27]. However, this requires that the modalities have the same dimensions. While these methods are often simple to implement, problems arise when modalities are missing, as is often the case in clinical environments. Any successful multimodal model needs to deal effectively with missing modalities, since not all modalities are collected for every patient. Other methods have been suggested in the literature, such as merging multiple modalities by linear transformations [28], as well as probabilistic approaches [29].
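The concatenation scheme described above, stacking K equally-sized modalities into a (K, h, w) tensor, is a one-liner in practice. A minimal sketch with two random toy "modalities":

```python
import numpy as np

# Two co-registered toy modalities of the same spatial size (h, w).
t1 = np.random.rand(64, 64)
t2 = np.random.rand(64, 64)

# Concatenation fusion: stack the K = 2 modalities along a new leading
# axis, producing a (K, h, w) tensor as described in [27]. This only
# works because both arrays have identical dimensions.
fused = np.stack([t1, t2], axis=0)
print(fused.shape)  # (2, 64, 64)
```

If one modality were missing, this scheme has no natural fallback, which is exactly the limitation motivating the disentanglement approach discussed next.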

Recent work has successfully used a disentanglement-based approach for liver cancer segmentation [30] and brain tumor segmentation [31] using multimodal data. The method used by Chen et al. [31] aims to disentangle MRI content from modality style/appearance. Compared to previous methods, this means seamlessly dealing with missing modalities, while performing similarly to state-of-the-art methods when no modalities are missing.

Figure 2.2: Changing attributes of images using a Fader Network.

Multimodal brain age estimation using deep learning-based methods has, up until now, been largely ignored. This thesis aims to exploit T1- and T2-weighted imaging data using a disentanglement-based approach. The objective is twofold: (1) compare a multimodal disentanglement approach to unimodal and high-level fusion baselines; (2) evaluate performance under missing modalities.

2.3 Disentanglement

Disentanglement is the process of separating a set of attributes from some other content. The result of this procedure is a new representation of the input in which we can freely change the disentangled attribute. A well-known implementation of this procedure is presented in [32], where an encoder-decoder structure is combined with a discriminator to form the Fader Network. The general structure is the following: input x has an attribute a that we want to disentangle. x is encoded to z, which is independent of a. The independence between z and a is enforced by the discriminator, which imposes a penalty whenever it can tell what a is by looking at z. The decoder then uses a and z to generate x again. An example of using disentanglement is shown in Figure 2.2.
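The adversarial structure just described can be sketched in a few lines of PyTorch. This is a heavily simplified toy version under our own assumptions (tiny fully-connected networks, a binary attribute, an arbitrary adversarial weight of 0.1), not the architecture of [32]:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy Fader-Network-style setup: encoder E: x -> z, decoder D: (z, a) -> x_hat,
# discriminator C: z -> attribute logits. Sizes are illustrative only.
enc = nn.Linear(32, 8)
dec = nn.Linear(8 + 1, 32)
disc = nn.Linear(8, 2)

x = torch.randn(4, 32)            # batch of inputs
a = torch.randint(0, 2, (4,))     # binary attribute a to disentangle

z = enc(x)
x_hat = dec(torch.cat([z, a.float().unsqueeze(1)], dim=1))

# The encoder/decoder minimize reconstruction error while making the
# discriminator fail: one common formulation subtracts the discriminator's
# cross-entropy, so the encoder is rewarded when a cannot be read from z.
recon_loss = F.mse_loss(x_hat, x)
disc_ce = F.cross_entropy(disc(z), a)
encoder_loss = recon_loss - 0.1 * disc_ce
```

In the real training loop, the discriminator and the encoder-decoder are updated alternately, with the discriminator separately trained to predict a from z.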

Another approach to disentanglement is to instead separate an image into content and style [33, 34]. Combining the content with the style of another image allows effective translation of images between different styles: for example, summer landscape to winter landscape, painting to photograph, or cat to dog.


Chapter 3

Methods

3.1 Methodological Approach

The overarching goal is to analyze and evaluate the potential of utilizing multimodal data for the task of brain age estimation. Previous research in this area is limited, and direct comparisons are hard to make due to the use of private datasets. To address these challenges, we construct three baseline models. Two unimodal networks are trained on T1- and T2-weighted data, respectively. The last baseline model uses high-level fusion and is trained with both modalities. Using 10-fold cross-validation, high-level fusion is compared against the two unimodal models in order to analyze the effect of using multimodal data. These three models are then used as baselines against which a fourth model, based on multimodal disentanglement, is compared. Three different scenarios are created: missing T1-weighted data, missing T2-weighted data, and no missing data. Together, they give an estimate of how robust the model is to missing modalities. The multimodal disentanglement model is not included in the 10-fold cross-validation for computational reasons. Therefore, no claims of statistical significance can be made when comparing this model to the baselines.
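The 10-fold cross-validation used for the baselines can be sketched as a shuffled split into k folds, each fold serving once as validation data. A minimal stand-alone sketch (the thesis may well use library tooling instead; function and variable names are ours):

```python
import numpy as np

def kfold_indices(n_samples, k=10, seed=0):
    """Shuffle sample indices and split them into k roughly equal folds.
    Yields (train, val) index arrays; each fold is validation exactly once."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_samples)
    folds = np.array_split(idx, k)
    for i in range(k):
        val = folds[i]
        train = np.concatenate([f for j, f in enumerate(folds) if j != i])
        yield train, val

splits = list(kfold_indices(100, k=10))
print(len(splits))  # 10
```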

3.2 Dataset

The OASIS-3 1 dataset contains publicly available MRI and PET sessions of 1098 subjects over a time period of 30 years. Each subject took part in one or more imaging sessions, and in each session several different imaging modalities could be used. Since this thesis is limited to T1- and T2-weighted data, other modalities are filtered out and are not part of the discussion of the data below. We opt to use a publicly available dataset since this makes our results easier for other researchers to reproduce and compare against.

1https://www.oasis-brains.org

Figure 3.1: The dataset contains 7208 T1-weighted (blue) and T2-weighted (green) scans of 1098 subjects. Of these, 1599 T1-weighted and 1878 T2-weighted scans were taken of cognitively normal subjects.

3.2.1 Sessions

Data from several imaging sessions is available for most subjects. The majority of the imaging sessions collected both T1- and T2-weighted data. Each session contains metadata such as the subject's age at the time the data was collected and general information about the type of scanner and its settings. To simplify the preprocessing procedure, only data collected from MRI scanners with a magnetic field strength of 3 Tesla were used, which constituted the large majority of the MRI data.

Apart from imaging sessions, subjects also underwent cognitive tests to measure, for example, memorization ability. Around 60% of the subjects were deemed cognitively normal and 40% showed signs of cognitive decline, such as Alzheimer's disease or dementia. We consider a subject to be cognitively normal only if all cognitive tests performed by the subject were labeled as cognitively normal. Subjects who did not fulfill this were instead labeled as not cognitively normal. Figure 3.1 shows the proportion of scans associated with subjects deemed cognitively normal and not, for both modalities. Only data from cognitively normal subjects was used, in order to ensure normal brain aging across the training set.
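The all-tests-normal rule above amounts to a simple filter. A sketch with hypothetical subject IDs and label strings (the actual OASIS metadata fields differ):

```python
# Hypothetical mapping from subject ID to the labels of all cognitive
# tests that subject took. A subject is kept only if *every* test was
# labeled "normal", matching the rule described above.
subjects = {
    "sub-001": ["normal", "normal", "normal"],
    "sub-002": ["normal", "dementia"],
    "sub-003": ["normal"],
}

cognitively_normal = {
    sid for sid, labels in subjects.items()
    if all(label == "normal" for label in labels)
}
print(sorted(cognitively_normal))  # ['sub-001', 'sub-003']
```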


Figure 3.2: Distribution of age recorded at each T1- or T2-weighted imagingsession of a cognitively normal subject.

3.2.2 Subject Data

Figure 3.2 shows the distribution of ages recorded at each imaging session of cognitively normal subjects. In this dataset, subjects are between 42 and 95 years old, with a mean age of 70 years. Other subject data such as race, ethnicity, left/right-handedness and education level is also available.

3.2.3 Data DetailsVoxel size and field of view (FOV) varied over scanners for both modali-ties. T1-weighted data either had a voxel size of 1mm3 and scan size of(176, 256, 256) or a voxel size of 1.2x1.05x1.05mm3 and a data size of (176, 240, 256).T2-weighted data had a larger variability in terms of FOV. The dimensional-ity was (256, 256, 36) and (228, 256, 44) with a voxel size of 1x1x4mm3 or(176, 256, 256) with voxel size 1mm3. The number of scans with a certainsize, for the two different modalities, is shown in Figure 3.3. Data sizes ofwhich there were only one scan were filtered out. For computational reasons,data was down-sampled to a resolution of (44, 64, 64) with a voxel size of


Figure 3.3: Number of scans with a certain size per modality. The y-axis marks the size of the data. Scans include subjects of any cognitive state.

4mm3.

3.3 Data Processing

The data preprocessing pipeline consists of four steps: normalization, brain extraction, neck cropping and registration. Brain extraction, neck cropping and registration functions were imported from FSL2. See Figure 3.4 for a visualization of how imagery is changed by the preprocessing.

3.3.1 Normalization

Min-max normalization was used over z-score normalization due to the sparse nature of the MRI data; however, empirical testing showed no major difference between the two. Normalization was applied twice: once before brain extraction and once again after all non-brain tissue had been removed. Neck and skull tissue gives a high-intensity signal, causing some brains (scans with a large field of view) to appear darker unless the second normalization is performed.
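The min-max step can be sketched as follows. This is an illustrative helper, not the thesis's exact code; the function name and the synthetic volume are assumptions.

```python
import numpy as np

def min_max_normalize(volume: np.ndarray) -> np.ndarray:
    """Scale voxel intensities to [0, 1]; returns zeros for a constant volume."""
    vmin, vmax = volume.min(), volume.max()
    if vmax == vmin:
        return np.zeros_like(volume, dtype=np.float32)
    return ((volume - vmin) / (vmax - vmin)).astype(np.float32)

# Applied once on the raw scan and once more after brain extraction, so that
# bright skull/neck voxels no longer dominate the intensity range.
raw = np.random.default_rng(0).integers(0, 4096, size=(8, 8, 8)).astype(np.float32)
normalized = min_max_normalize(raw)
```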

2https://fsl.fmrib.ox.ac.uk/fsl/fslwiki FSL is a library of MRI processing functions developed by the Analysis Group, FMRIB, University of Oxford.


Figure 3.4: Illustration of preprocessing for two subjects. Three different views of the same brain are shown, before and after preprocessing. The right side is T1-weighted data and the left side is T2-weighted. The second row shows how the same brains look after normalization, rotation, neck cropping, brain extraction and registration have been performed.

3.3.2 Brain Extraction and Neck Cropping

Brain extraction is the procedure of removing non-brain tissue, and neck cropping removes neck tissue. These operations are performed since our interest lies in gaining a better understanding of the aging process of the brain; skull and neck tissue are not of interest and are hence removed. BET and RobustFOV of the FSL library were used for brain extraction and neck cropping, respectively. Default settings were used for RobustFOV. BET was run with the -R option for more accurate brain extraction.

3.3.3 Image Registration

One common problem associated with MRI data is differing spatial position and orientation of brains. These differences may be due to patients tilting their heads in different directions or due to different protocols or scanners. Image registration tries to align images to the same coordinate system so that analysis can be performed more easily. Figure 3.5 shows how registration can be used to align three randomly selected T1-weighted scans. The FLIRT function from the FSL library was used for registration of T1- and T2-weighted images; FLIRT also performs resampling to standardize voxel sizes. After registration, all data had a voxel size of 1mm3 and a shape of (256, 176, 256). Figure 3.6 illustrates how two T1-weighted images with different fields of view and voxel sizes are registered together.
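The FSL steps above (RobustFOV, BET with -R, FLIRT) are command-line tools, and a pipeline would typically shell out to them. The sketch below only builds the command lines; the file names and the MNI152 reference template are placeholders, and the exact FLIRT options used in the thesis are not specified in the text.

```python
import subprocess

def build_preprocessing_commands(in_file: str, out_prefix: str,
                                 template: str = "MNI152_T1_1mm.nii.gz") -> list[list[str]]:
    """Build the FSL command lines: neck cropping, robust brain extraction,
    and affine registration to a reference template."""
    cropped = f"{out_prefix}_crop.nii.gz"
    brain = f"{out_prefix}_brain.nii.gz"
    registered = f"{out_prefix}_reg.nii.gz"
    return [
        ["robustfov", "-i", in_file, "-r", cropped],                    # neck cropping
        ["bet", cropped, brain, "-R"],                                  # brain extraction, -R for robustness
        ["flirt", "-in", brain, "-ref", template, "-out", registered],  # registration + resampling
    ]

def run_pipeline(in_file: str, out_prefix: str) -> None:
    # Requires FSL on PATH; each step writes its output file for the next step.
    for cmd in build_preprocessing_commands(in_file, out_prefix):
        subprocess.run(cmd, check=True)

cmds = build_preprocessing_commands("sub-001_T1w.nii.gz", "sub-001")
```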


Figure 3.5: Visualization of how registration can be used to align brains. The first row is before registration and the second row is after. The colors red, blue and green represent different subjects.

3.3.4 Performance

Preprocessing one MRI scan takes 7 minutes on an Intel Core i7-6950X processor at 3.0 GHz. The process is trivially parallelizable across scans, meaning that the speed-up scales linearly with the number of processor cores. On a 10-core processor with simultaneous multithreading enabled, preprocessing all 3500 scans took about 24 hours.
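Because each scan is independent, the fan-out can be expressed with a worker pool. This is a hedged sketch: `preprocess_scan` is a placeholder, and a thread pool suffices here only because the real per-scan work consists of waiting on separate FSL subprocesses, which occupy their own cores.

```python
from concurrent.futures import ThreadPoolExecutor

def preprocess_scan(path: str) -> str:
    """Placeholder for the per-scan pipeline (normalize, crop, extract, register).
    In practice this would shell out to the FSL binaries."""
    return path.replace(".nii.gz", "_prep.nii.gz")

def preprocess_all(paths, workers: int = 10):
    # Scans are independent, so throughput scales roughly with the worker count.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(preprocess_scan, paths))

outputs = preprocess_all([f"scan_{i:04d}.nii.gz" for i in range(20)], workers=4)
```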

3.4 Models

Four different networks were trained, of which two were unimodal and two multimodal. All models were implemented using PyTorch.

3.4.1 Unimodal Models

The two networks trained on unimodal data were based on the ResNet18 architecture, extended to work on volumetric data (3D). This meant that input to convolutional layers and normalization layers was changed from 4D (B,C,H,W) to 5D (B,C,H,W,D), where B is batch size and C is


Figure 3.6: Two T1-weighted scans with different voxel size and field of view. The third column shows both scans plotted on top of each other. The two scans had a voxel size of 1mm3 and 1.2x1.05x1.05mm3, respectively. The second scan had a slightly larger field of view in the first dimension, so more of the throat is visible. The second row shows the same data after preprocessing.

channels. A 3D implementation of ResNet18 from torchvision3 was used as a base. Since the task was to estimate an age (regression), the number of output dimensions in the last fully connected layer was set to one.
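The 2D-to-3D change amounts to swapping each layer for its volumetric counterpart and ending in a single-output linear layer. The toy model below is a minimal sketch of that idea, not the full ResNet18; layer sizes and the tensor layout are illustrative.

```python
import torch
import torch.nn as nn

class BasicBlock3d(nn.Module):
    """ResNet basic block with Conv2d/BatchNorm2d swapped for their 3D
    counterparts, so inputs become 5D tensors instead of 4D."""
    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv3d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm3d(channels)
        self.conv2 = nn.Conv3d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm3d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + x)  # identity skip connection

# Toy stand-in for the full 3D ResNet18: stem, one residual block, pooling,
# and a single-output linear layer for age regression.
model = nn.Sequential(
    nn.Conv3d(1, 16, 3, padding=1),
    BasicBlock3d(16),
    nn.AdaptiveAvgPool3d(1),
    nn.Flatten(),
    nn.Linear(16, 1),  # one output dimension: predicted age
)
pred = model(torch.randn(2, 1, 8, 8, 8))  # 5D input: (batch, channel, three spatial dims)
```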

3.4.2 Fusion

The high-level fusion model used the two unimodal networks pretrained on T1- and T2-weighted data, respectively. The final fully connected layer of both networks was removed and the outputs of the preceding layers were instead concatenated. A new fully connected layer with one output dimension was placed after the concatenation. All layers except this last fully connected layer were frozen. Figure 3.7 shows a visualization of the fusion architecture.
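The freeze-and-concatenate construction can be sketched as below. The backbones here are trivial stand-ins (the thesis uses the trained 3D ResNet18s with their final layer removed), and `feat_dim = 512` assumes the ResNet18 feature size.

```python
import torch
import torch.nn as nn

def make_fusion_head(backbone_t1: nn.Module, backbone_t2: nn.Module,
                     feat_dim: int = 512) -> nn.Module:
    """Freeze both pretrained unimodal feature extractors and train only a
    new linear layer on their concatenated features."""
    for backbone in (backbone_t1, backbone_t2):
        for p in backbone.parameters():
            p.requires_grad = False  # frozen: only the new head is trained

    head = nn.Linear(2 * feat_dim, 1)  # one output: predicted age

    class Fusion(nn.Module):
        def forward(self, t1, t2):
            feats = torch.cat([backbone_t1(t1), backbone_t2(t2)], dim=1)
            return head(feats)

    fusion = Fusion()
    fusion.head = head  # register the trainable layer as a submodule
    return fusion

# Stand-in backbones mapping a 10-d input to 512-d features.
t1_net = nn.Linear(10, 512)
t2_net = nn.Linear(10, 512)
fusion_model = make_fusion_head(t1_net, t2_net)
out = fusion_model(torch.randn(4, 10), torch.randn(4, 10))
```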


Figure 3.7: Overview of the high-level fusion model. Modalities are concatenated across the channel dimension.

3.4.3 Multimodal Disentanglement

The multimodal disentanglement model uses the same architecture as in [31], but adapted for regression. The model uses four encoders, two decoders and the 3D version of ResNet18 outlined previously. As seen in Figure 3.8, the encoders are divided into content encoders, Ec, and style encoders, Es, for both modalities. The content encoders encode the information of the brain that is used for predicting age. Content is bound to be similar across modalities (both provide an anatomical overview of the brain), but appearance might differ. The goal of the style encoders is, therefore, to encode the style that the content is presented in. The overarching goal of the architecture is to disentangle the appearance of the scans from the content, and to use this to fuse the content of the two modalities together. In Figure 3.8, the T1- and T2-weighted images are first encoded into cT1w and cT2w, respectively, after which they are fused into a shared representation z of the same dimensionality as cT1w or cT2w. The contribution from each modality to the shared representation z is learned, meaning that informative modalities can contribute more, and this can vary per voxel [31]. See Appendix A for more details on the fusion process and the structure of Ec, Es and D.
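One simple way to realize a learned, per-voxel contribution is a gating network over the two content codes. The sketch below is an illustrative stand-in for the fusion in [31] (the exact mechanism is in Appendix A); the gate design and channel count are assumptions.

```python
import torch
import torch.nn as nn

class LearnedVoxelFusion(nn.Module):
    """Illustrative stand-in for content fusion: a 1x1x1 convolution predicts a
    per-voxel weight from both content codes, and z is their weighted sum."""
    def __init__(self, channels: int):
        super().__init__()
        self.gate = nn.Conv3d(2 * channels, channels, kernel_size=1)

    def forward(self, c_t1, c_t2):
        w = torch.sigmoid(self.gate(torch.cat([c_t1, c_t2], dim=1)))
        return w * c_t1 + (1 - w) * c_t2  # per-voxel convex combination

fuse = LearnedVoxelFusion(channels=8)
z = fuse(torch.randn(2, 8, 4, 4, 4), torch.randn(2, 8, 4, 4, 4))
```

Because the weight is a sigmoid, z stays a convex combination of the two content codes at every voxel, so a missing or uninformative modality can be down-weighted locally.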

Style encoders EsT1w and EsT2w are used to encode the data into appearance codes, both set to a size of 8 bits in accordance with [34, 31]. This code is then concatenated with the fused representation z, filling with ones so that dimensions match, before decoding (Figure 3.8). The decoders DT2w and

3https://pytorch.org/docs/stable/torchvision/index.html torchvision is a package for PyTorch that contains computer vision models, datasets and more.


Figure 3.8: Architecture of the multimodal disentanglement model. T1- and T2-weighted images are encoded into content (cT1w and cT2w) and style (appearance code T1w and appearance code T2w). The content is fused into a single representation (z) and ResNet18 is applied to predict the age. The two decoders attempt to reconstruct the original input using z and the corresponding appearance code.

DT1w then produce the reconstructed original inputs T̄1w and T̄2w. The overall loss is computed as Ltot = Lreg + λ1LreconT1w + λ1LreconT2w + λ2LKLT1w + λ2LKLT2w. The loss Lreg is computed as the mean absolute error between the predicted age and the ground-truth age. The reconstruction loss Lrecon_i is similarly computed as the mean absolute error between Tiw and the reconstructed input T̄iw. In line with previous research [34, 33, 31], we set a normal prior for the appearance code and use the Kullback-Leibler divergence to keep the probability distribution of appearance codes close to normal. The hyperparameters λ1 and λ2 weight the different losses; as in [31], we set λ1 = λ2 = 0.1, weighting the reconstruction and KL losses equally.

3.5 Training

Training was done on two Nvidia Titan XP GPUs, with 12GB memory each, using data-level parallelism. The Adam optimizer was used for all networks. For the unimodal networks the initial learning rate was chosen to be 5·10−4, based on a grid search over learning rates between 10−2 and 10−5. The same learning rate was used for the high-level fusion model. For the multimodal disentanglement model the learning rate was set to 10−4 based on previous


Lreg = ( Σ_{i=0}^{N} |y_i^pred − y_i| ) / N    (3.1)

Lrecon_i = ( Σ_V |Tiw − T̄iw| ) / V    (3.2)

LKLT1w = ( Σ (AcodeT1w)² ) / 8    (3.3)

LKLT2w = ( Σ (AcodeT2w)² ) / 8    (3.4)

Figure 3.9: Equations for the different losses of the multimodal disentanglement model. N is the number of subjects, V is the number of voxels and Acode is the appearance code.

research. A scheduler that reduced the learning rate by a factor of 0.25 after 15 epochs with no improvement in validation loss was used for the unimodal networks. For the high-level fusion network the same scheduler was used, but with a patience of 5 instead. A batch size of 16 was used for all networks.
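The optimizer and scheduler configuration described above maps directly onto PyTorch's `ReduceLROnPlateau`. A minimal sketch, with a trivial stand-in model and a constant (non-improving) validation loss to show the schedule triggering:

```python
import torch

model = torch.nn.Linear(8, 1)  # stand-in for the networks trained here
optimizer = torch.optim.Adam(model.parameters(), lr=5e-4)
# Reduce the learning rate by a factor of 0.25 after 15 epochs without
# improvement in validation loss (the fusion model uses patience=5 instead).
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="min", factor=0.25, patience=15)

for epoch in range(20):
    val_loss = 1.0  # placeholder: never improves, so the schedule fires once
    scheduler.step(val_loss)
```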

During training, two checkpoints were saved for each network: the last trained and the best performing on the validation set. When evaluating performance on the test set, weights were loaded from the checkpoint with the minimum validation set error.

All models were trained for 100 epochs, except for the high-level fusion model, which was trained for 30 epochs since only its last fully connected layer was trainable. Total training time varied: around 2 hours for the unimodal networks, 30 minutes for the high-level fusion model and 12 hours for the multimodal disentanglement network.

3.5.1 Transforms

Input data was downsampled by a factor of 4, using trilinear interpolation, in order to reduce computational complexity. This gave an effective voxel size of 4mm3 and a data size of (44, 64, 64). For the unimodal networks this reduced computation time from over an hour per epoch to 1 minute per epoch. A similar decrease was also seen for the multimodal disentanglement model.
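The downsampling step corresponds to `F.interpolate` in trilinear mode. A minimal sketch; the input shape below assumes one of the T1-weighted scan sizes from Section 3.2.3, and the axis order depends on how the volumes are loaded.

```python
import torch
import torch.nn.functional as F

def downsample(volume: torch.Tensor, factor: int = 4) -> torch.Tensor:
    """Trilinear downsampling of a 3D scan; expects a (D, H, W) tensor."""
    v = volume[None, None]  # add batch and channel dims -> (1, 1, D, H, W)
    out = F.interpolate(v, scale_factor=1 / factor, mode="trilinear",
                        align_corners=False)
    return out[0, 0]

scan = torch.randn(176, 256, 256)  # scan at 1mm voxels
small = downsample(scan)           # factor-4 reduction along each axis
```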


3.5.2 Splits

In total, 1599 T1-weighted images and 1878 T2-weighted images were used. For training the multimodal models, 1069 pairs were created from sessions (an examination of a subject) that had at least one T1- and one T2-weighted image. Splits were then created by subject: train (70%), validation (10%) and test (20%). This ensured that scans of the same subject were in the same split across all three datasets. Moreover, subjects were sorted by their average age and assigned to split i if index mod k == i, where index is the position of the subject in the sorted array and k is the number of folds. This ensured a similar distribution of ages across splits.
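The sort-then-round-robin assignment can be sketched as below. This is an illustrative reading of the description above (function names and the synthetic data are assumptions); folds would subsequently be grouped into the 70/10/20 train, validation, and test sets.

```python
def make_folds(subjects, mean_ages, k=10):
    """Sort subjects by average age, then send the subject at position `index`
    to fold `index mod k`. All scans of a subject share a fold, and each fold
    sees a similar age distribution."""
    order = sorted(range(len(subjects)), key=lambda i: mean_ages[i])
    folds = [[] for _ in range(k)]
    for index, i in enumerate(order):
        folds[index % k].append(subjects[i])
    return folds

subjects = [f"sub-{i:03d}" for i in range(100)]
ages = [42 + (i % 54) for i in range(100)]  # synthetic ages in the 42-95 range
folds = make_folds(subjects, ages, k=10)
```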


Chapter 4

Results

4.1 Mean Absolute Error: Unimodal

Two unimodal networks were trained on T1- and T2-weighted data, respectively. Figure 4.1 and Figure 4.2 show the mean absolute error over epochs and the predictions made on the test set for the model trained on T1-weighted data. Figure 4.3 and Figure 4.4 show the corresponding results for the model trained on T2-weighted data. The lowest mean absolute error achieved on the validation set was 4.43 and 4.15 years using T1- and T2-weighted data, respectively. Evaluated on the test set, this error was 5.20 years using T1-weighted data and 5.17 years using T2-weighted data.

Figure 4.1: Loss plot of ResNet18 trained on T1-weighted data.

Figure 4.2: Predictions on test set.


Figure 4.3: Loss plot of ResNet18 trained on T2-weighted data.

Figure 4.4: Predictions on test set.

4.2 Mean Absolute Error: High-Level Fusion

The multimodal model combining the previously trained unimodal models using high-level fusion gave a validation set error of 4.28 and a test set error of 4.87. Figure 4.5 and Figure 4.6 show the mean absolute error over epochs and the predictions made on the test set. Training converged quickly (within 1-10 steps) since only the last layer was trainable.

Figure 4.5: Loss plot of the high-level fusion model trained on T1- and T2-weighted data.

Figure 4.6: Predictions on the test set using the high-level fusion model.


4.3 Mean Absolute Error: Multimodal Disentanglement

Figure 4.7 and Figure 4.8 show the mean absolute error and the loss over epochs for the multimodal disentanglement model. Since the loss is dominated by the regression term, the two plots are similar. The minimum validation set mean absolute error was 4.27 years with a loss of 4.43. As seen in Figure 4.9, the mean absolute error measured on the test set was 4.93.

Figure 4.7: Plot of mean absolute error for the multimodal disentanglement model trained on T1- and T2-weighted data.

Figure 4.8: Loss plot for the multimodal disentanglement model.

Figure 4.9: Test set predictions using the multimodal disentanglement model.


4.4 Missing Modalities

Robustness under missing modalities was examined in three scenarios: using only T1-weighted data, using only T2-weighted data and using both modalities. Figure 4.10 illustrates how the multimodal disentanglement model (red) performs under the three scenarios compared to each baseline (blue). We see that multimodal disentanglement performs similarly to the baselines, meaning that it is robust to missing data.

Figure 4.10: The multimodal disentanglement model and baselines under three different scenarios. The two unimodal models were used as baselines for T1- and T2-weighted data. For multimodal data, high-level fusion was instead used as the baseline.

4.5 Effect of Using Multimodal Data on Accuracy

Figure 4.11 shows a boxplot of the estimated mean absolute error, using 10-fold cross-validation, for the three baselines. Averaged across all folds, the mean absolute error was 5.48 years using T1-weighted data and 5.08 years using T2-weighted data. For high-level fusion, the error was 4.91 years. A paired permutation test was performed to test the hypothesis that using multimodal data increases the accuracy of predictions compared to using unimodal data only. The null hypothesis could be rejected in favor of the alternative hypothesis at a significance level of 0.01 (p = 0.0021 for T1w and p = 0.00098 for T2w).
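A paired permutation test on per-fold errors can be sketched as follows: signs of the paired differences are flipped at random to build the null distribution. The per-fold MAEs in the example are hypothetical, not the thesis's actual fold scores.

```python
import numpy as np

def paired_permutation_test(errors_a, errors_b, n_perm=10_000, seed=0):
    """One-sided paired permutation test on per-fold errors: H1 is that
    model B has lower error than model A."""
    diffs = np.asarray(errors_a) - np.asarray(errors_b)
    observed = diffs.mean()
    rng = np.random.default_rng(seed)
    # Under H0 the sign of each paired difference is exchangeable.
    signs = rng.choice([-1.0, 1.0], size=(n_perm, diffs.size))
    null = (signs * diffs).mean(axis=1)
    # p-value: fraction of permutations at least as extreme as observed.
    return (np.sum(null >= observed) + 1) / (n_perm + 1)

# Hypothetical per-fold MAEs (years) for a unimodal baseline and fusion.
unimodal = [5.6, 5.4, 5.5, 5.3, 5.6, 5.5, 5.4, 5.6, 5.3, 5.6]
fusion = [4.9, 4.8, 5.0, 4.9, 4.8, 5.0, 4.9, 4.8, 5.0, 4.9]
p = paired_permutation_test(unimodal, fusion)
```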


Figure 4.11: Utilizing both T1- and T2-weighted images improves the accuracy of predicted brain age. Models were trained using 10-fold cross-validation and evaluated on the test set. The upper and lower boundaries of the box mark the third and first quartiles, and the horizontal line shows the median value. The whiskers mark the maximum and minimum values, respectively, excluding outliers. Black dots represent test set scores.

4.6 Findings

As depicted in Figure 4.12, when evaluated on the test set, high-level fusion performed better than multimodal disentanglement, which in turn performed better than using either modality alone. As seen in the test set prediction figures, the ages of younger subjects tended to be overestimated and the ages of older subjects underestimated. The overall residual, depicted in Figure 4.13, was, however, centered around zero. High-level fusion gave the lowest variance of the residuals.


Figure 4.12: Test set scores on the first fold for the four different models.

Figure 4.13: Residuals calculated on the test set for each model. The upper and lower boundaries of the box mark the third and first quartiles, and the horizontal line shows the median value. The whiskers mark the maximum and minimum values, respectively, excluding outliers.


Chapter 5

Discussion

5.1 General discussion

Datasets seldom contain imagery of all modalities for a given subject; hence, any successful multimodal model needs to be robust to missing data. Testing under three different scenarios (missing T1-weighted data, missing T2-weighted data and no missing data) against three different baselines allowed us to analyze the performance penalty incurred by a more complex model that is able to learn to deal with missing modalities. We found that the multimodal disentanglement model was robust to missing data. The results also showed that the multimodal disentanglement model performed on par with the baseline models, meaning that the performance loss was minor or non-existent.

For the case of two modalities, the advantage of the multimodal disentanglement model over the three baselines is mainly the reduced complexity of having one model as opposed to three. The multimodal disentanglement model, however, still takes longer overall to train. For more than two modalities, training a model for each combination of missing/present modalities becomes infeasible. The multimodal disentanglement model, on the other hand, is readily extensible to more modalities [31]: for each new modality, we add one content encoder, one style encoder and one decoder.

A permutation test was also performed and showed that using high-level fusion results in a statistically significant improvement in the accuracy of predicted brain age, compared to using either modality alone. This supports the hypothesis that the two modalities provide different, non-overlapping information on the aging process of the brain. While previous work has found that models trained on T1-weighted data perform better than those trained on T2-weighted data, our results showed the opposite; using T2-weighted data gave


better performance than using T1-weighted data. One factor that might explain this discrepancy is that feature extraction for non-deep learning-based models has been tuned primarily for T1-weighted data, it being the most commonly used modality. Another factor is demographic differences in the sample; our dataset, for example, had no subjects under the age of 42 and the mean age was around 70 years. Since the proportion of gray/white matter changes throughout our lives, this might also change the contribution of different modalities.

Using high-level fusion gave the lowest overall mean absolute error of 4.87 years (4.91 when averaged across all folds), while the multimodal disentanglement model gave a slightly larger error of 4.93 years. Compared to other research, these errors are quite large. For example, Cole et al. [9] achieved a mean absolute error of 4.1 years and Jonsson et al. [13] around 3.6 years. One explanation might be differences in sample demographics; in particular, it has previously been noted that predicting the brain age of older subjects is harder than that of younger subjects [15]. The Oasis3 dataset had significantly older subjects than the datasets used in the other two studies, which might partly explain the higher mean absolute errors. The downsampling performed, however, does not explain the increased mean absolute error [9].

Visualizing predictions made on the test set for each model showed that the ages of younger subjects tended to be overestimated and those of older subjects underestimated (Figures 4.2, 4.4, 4.6 and 4.9). The reason for this is the distribution of age. As depicted in Figure 3.2, the ages of subjects in this study are approximately normally distributed with a mean of 70 years. The training, validation and test sets were constructed so as to preserve this distribution. A reasonable criticism is that the ages of the subjects in the test set should instead be approximately uniformly distributed. This would make it easier to see how the models perform on a more general population. The reason for not doing this was the limited sample size; constructing a uniformly distributed test set would mean either limiting the ages of test set subjects from between 42 and 95 years to between 55 and 80 years, or drastically reducing the size of the test set.

Though several previous works have explored the potential of using multimodal data for brain age prediction, research exploring deep learning-based multimodal models has been limited. The main contribution of our work is showing that using T1- and T2-weighted data together improves prediction accuracy, as well as presenting a model that is robust to missing modalities. Taken together, this shows that brain age estimation can benefit from using multimodal data and that missing modalities can be dealt with in a way that is both


clinically viable and extendable to more modalities.

5.2 Societal Aspects and Sustainability

An aging population presents many challenges for society, one of them being the increasing prevalence of neurodegenerative diseases. Our work indicates that using multimodal data is promising for increasing the accuracy of brain age predictions, while also being applicable in clinical settings. Brain age prediction has the potential to give doctors a highly interpretable variable for detecting and monitoring the progression of brain disease, which could aid the development of new treatments. In terms of sustainability, accelerating the development of treatments both reduces social costs (which means resources can be allocated elsewhere) and increases healthspans. The environmental impact of this system is minimal; training is only done once and can be done in a country where the carbon intensity of electricity is low.

5.2.1 Ethics

The dataset used contains MRI scans along with subject data. While the subject data is anonymized, the MRI data is not trivial to anonymize. Without removing skull tissue, a subject's facial features can be reconstructed and linked to, for example, their profile on Facebook. Therefore, storing the data securely is needed in order not to reveal sensitive information about participants. Since this project used a publicly available dataset, these concerns are somewhat lessened (although trying to identify subjects is still strictly prohibited). Extending to more datasets in the future is a possibility, which means more consideration has to be put into the handling of sensitive data.

5.3 Limitations

5.3.1 Generalizability

A major challenge in research is to draw conclusions, from limited datasets and samples, that can be expected to generalize to the true population. In this thesis, the two main limiting factors with regard to generalizability are the dataset used and the dimensional reduction performed. A third limiting factor is the exclusion of the multimodal disentanglement model from the 10-fold cross-validation.


The subjects of the Oasis3 dataset are older than the general population and the ages are also distributed differently (normally distributed vs approximately uniformly distributed). With this in mind, general comparisons in terms of mean absolute error were avoided, since we expect this to vary greatly with changes in the underlying age distribution. The main results of this thesis, however, should be independent of the age distribution.

Each MRI scan was downsampled to a resolution of (64, 44, 64) in order to reduce computational complexity. It is likely that the robustness and fusion performance also generalize to higher resolutions, but this would need to be tested. To analyze the effect of using multimodal data, 10-fold cross-validation was used; given the current dimensional reduction, this took about 1 day to run. Halving the voxel size would roughly translate to a factor 10 increase in computation time.

Ideally, we would have liked to test the hypothesis that the multimodal disentanglement model is indistinguishable from the baselines, but this was not done due to computational constraints. If true, this would mean that performance under missing modalities is optimal or close to optimal, which would strongly support our work.

5.3.2 Limited Number of Modalities

One of the main results was the measured robustness of the multimodal disentanglement model. For two modalities, it was clear that the model successfully learned to deal with missing modalities. Adding more modalities could deteriorate performance, and the robustness to missing modalities might come at the cost of reduced accuracy. Investigating this trade-off in greater depth would be interesting, since extending to more modalities is a natural continuation of this work.

5.3.3 Preprocessing

The preprocessing steps present two limitations to our work. Firstly, the FSL library used for the majority of the preprocessing is known not to be optimal, and other alternatives exist that perform better at brain extraction and registration. The advantage of using the FSL library was that the same procedure could be used for both T1-weighted and T2-weighted data, reducing time spent on preprocessing. Secondly, insufficient preprocessing might have increased the gap between the multimodal models and the unimodal models. One observed occurrence was that brain extraction sometimes failed to remove all skull tissue. Since the multimodal models are trained on pairs of images, they are more robust to this, as it is less likely to happen to both images. More concretely, this means that the measured statistically significant improvement from using both T1- and T2-weighted data might partly be explained by insufficient preprocessing rather than by the two modalities providing different underlying information about brain structure. To address this concern, samples could be filtered out manually.

5.4 Future Work

5.4.1 Improving Preprocessing

Several alternatives to the FSL library exist, and exploring different options is likely to increase performance. Recently, deep learning-based models for performing both brain extraction and registration have shown great potential [35, 36]. Currently, registration is the most time-consuming step of the preprocessing pipeline; changing to a deep learning-based model could mean increased performance as well as significantly faster processing by utilizing the GPU.

5.4.2 Ablation Study

Ablation can be a powerful tool for understanding a complicated system through the removal of its different parts. The multimodal disentanglement model consists of two major parts: the fusion of content and the disentangling of content and style. Simplifying the fusion process by using a simple average instead would give us a better understanding of how the learned fusion performs. Analyzing the disentangling of modalities and how it affects performance could be done by removing the imposed reconstruction loss. Chen et al. [31] conducted a similar ablation study, but since the context and task are different, it would be interesting to perform the same analysis here as well.

5.4.3 Multiple Datasets

Combining more datasets with different age distributions would help mitigate the limitations of the current results. In terms of difficulty, the preprocessing pipeline should work for any other dataset with MRI imagery of the same format. A particularly well-suited dataset to extend the thesis with is the HCP dataset, which contains T1- and T2-weighted MRI data of 1200 young subjects aged between 22 and 35 years.


5.4.4 Multiple Modalities

Extending the current work beyond two modalities would likely further reduce the mean absolute error and, more importantly, would shed light on how useful those modalities are for brain age estimation. Potential MRI modalities to explore that are available in the Oasis3 dataset are FLAIR and T2star. Combining MRI with non-MRI modalities, such as PET or CT, would further help to strengthen the claims of robustness and flexibility of the multimodal disentanglement model.


Chapter 6

Conclusions

Brain age can be used to detect abnormal aging of the brain. We showed that utilizing T1- and T2-weighted data together can increase the accuracy of brain age predictions. This increase in accuracy was shown to be statistically significant at a level of 0.01 using a paired permutation test. We found that a model based on multimodal disentanglement is robust to missing data without any significant loss of performance. Clinical applications of brain age estimation have, to date, been lacking, partly because current accuracy is insufficient. With higher accuracy, brain age estimation might be used as one input for monitoring the progression of cognitive diseases. Our work used two modalities but is readily extendable to more. For future work, both using more modalities and bigger sample sizes would be interesting to explore and could help to further increase performance.



Appendix A

Network Details

Figure A.1: Visualisation of a residual block. The input (residual) is appended to the output. Conv3d 3 indicates a 3D convolution with a kernel of size 3.
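A residual block of this kind can be sketched as below, assuming PyTorch. The figure caption only fixes the 3D convolutions with kernel size 3 and the skip connection; the class name, channel counts, normalization, and activation here are illustrative (instance normalization is used elsewhere in the network per Figure A.2), and the skip is read in the standard way as adding the input to the output.

```python
# A minimal sketch of a 3D residual block; layer choices beyond the
# kernel-size-3 convolutions and the skip connection are assumptions.
import torch
import torch.nn as nn

class ResidualBlock3D(nn.Module):
    def __init__(self, channels):
        super().__init__()
        # padding=1 keeps spatial dimensions so the skip connection fits.
        self.conv1 = nn.Conv3d(channels, channels, kernel_size=3, padding=1)
        self.conv2 = nn.Conv3d(channels, channels, kernel_size=3, padding=1)
        self.norm1 = nn.InstanceNorm3d(channels)
        self.norm2 = nn.InstanceNorm3d(channels)
        self.act = nn.ReLU()

    def forward(self, x):
        out = self.act(self.norm1(self.conv1(x)))
        out = self.norm2(self.conv2(out))
        return self.act(out + x)  # skip connection: add the input back

block = ResidualBlock3D(channels=8)
x = torch.randn(1, 8, 16, 16, 16)  # (batch, channels, depth, height, width)
y = block(x)  # same shape as x
```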


Figure A.2: Detailed overview of the components of the multimodal disentanglement network. Convolution layers have their kernel size indicated next to them. The number of filters, stride, padding, normalization, and activation are written above, in that order. Instance normalization (in) and layer normalization (ln) were used.


Figure A.3: Learnable fusion used in the multimodal disentanglement model. Both scans are encoded into their content representations (c_T1w and c_T2w), which are concatenated. A convolutional layer (Conv1 in the figure) with kernel size 3 and 2 filters is applied with sigmoid activation. The resulting weight matrix G differs from c_T1w and c_T2w only in the channel dimension, given that appropriate padding is used. G is then split into g_0 and g_1 along the channel dimension, which are multiplied elementwise with c_T1w · δ_1 and c_T2w · δ_2, respectively. Each δ_i is either zero or one and is selected so that one modality is multiplied by zero with a probability of 30%, emulating a missing modality. The results are then concatenated again (Concat2), and a second convolutional layer (Conv2) with kernel size 1 and a number of filters equal to the number of channels of c_T1w and c_T2w is applied with Leaky ReLU activation to produce the final fused representation z. z has the same shape as c_T1w and c_T2w.
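The fusion described above can be sketched as follows, assuming PyTorch. The kernel sizes, filter counts, and the 30% drop probability follow the figure; the class name, the content channel count, and the exact way one modality is chosen for dropping are assumptions for illustration.

```python
# Illustrative sketch of the learnable fusion in Figure A.3.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LearnableFusion(nn.Module):
    def __init__(self, channels, p_drop=0.3):
        super().__init__()
        self.p_drop = p_drop
        # Conv1: kernel size 3, 2 filters; padding 1 preserves spatial shape.
        self.gate = nn.Conv3d(2 * channels, 2, kernel_size=3, padding=1)
        # Conv2: kernel size 1, with as many filters as the content channels.
        self.fuse = nn.Conv3d(2 * channels, channels, kernel_size=1)

    def forward(self, c_t1, c_t2):
        # G is computed from the concatenated contents and split into g0, g1,
        # which gate each modality by broadcasting over the channel dimension.
        g = torch.sigmoid(self.gate(torch.cat([c_t1, c_t2], dim=1)))
        g0, g1 = g[:, 0:1], g[:, 1:2]
        a, b = g0 * c_t1, g1 * c_t2
        # With probability p_drop, zero one modality after gating to emulate
        # a missing scan (the delta_i factors in the figure).
        if torch.rand(()) < self.p_drop:
            if torch.rand(()) < 0.5:
                a = torch.zeros_like(a)
            else:
                b = torch.zeros_like(b)
        # Concat2 followed by Conv2 with Leaky ReLU gives the fused z.
        return F.leaky_relu(self.fuse(torch.cat([a, b], dim=1)))

fusion = LearnableFusion(channels=4, p_drop=0.0)  # p_drop=0 for a shape check
c_t1 = torch.randn(2, 4, 8, 8, 8)  # (batch, channels, depth, height, width)
c_t2 = torch.randn(2, 4, 8, 8, 8)
z = fusion(c_t1, c_t2)  # z has the same shape as each content representation
```

Applying the gates before the random drop matches the figure: G is computed from both contents, and only the gated representations are zeroed when a modality is treated as missing.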


TRITA -EECS-EX-2020:560
