
Compositional Networks for Unsupervised Tumor

Localization in Medical Imaging Analysis

Matthew Lesko-Krleza, School of Computer Science

McGill University, Montreal

June, 2021

A thesis submitted to McGill University in partial fulfillment of the

requirements of the degree of Master of Computer Science

© Matthew Lesko-Krleza, 2021

Abstract

Deep learning models have achieved considerable success in computer vision tasks. However, applying deep learning to medical imaging analysis is still fraught with difficulty. State-of-the-art deep learning segmentation models train on common-object datasets comprising more than 100,000 images and over 800,000 instances, whereas medical image datasets contain only several hundred three-dimensional images. We therefore propose to evaluate the unsupervised occluder localization process of Compositional Networks (CompNets) for tumour localization on medical imaging data. In this work, we show the potential of Compositional Networks to perform tumour localization without the need for tumour instances in the training dataset. Although we do not conclusively determine whether CompNets can fully localize tumours, our results show promise, and we discuss future work that could address the concerns with our current results.


Abrégé

L’utilisation de modèles d’apprentissage profond pour les tâches de vision par ordinateur a connu un succès considérable. Cependant, l’application de l’apprentissage en profondeur à l’analyse d’imagerie médicale pose toujours de nombreuses difficultés. Les modèles de segmentation d’apprentissage s’entraînent sur des ensembles de données d’objets communs de l’ordre de plus de 100 000 images contenant plus de 800 000 instances, alors que les ensembles de données d’images médicales ne contiennent que plusieurs centaines d’images en trois dimensions. C’est pourquoi nous proposons d’évaluer un processus de localisation d’occluder non supervisé de Compositional Networks pour effectuer la localisation de tumeurs sur des données d’imagerie médicale. Dans ce travail, nous montrons le potentiel des réseaux de composition pour effectuer la localisation de tumeurs sans avoir besoin d’instances de tumeurs dans l’ensemble de données de formation. Bien que nous ne déterminions pas de manière concluante si les CompNets peuvent localiser complètement les tumeurs ou non, nous sommes prometteurs et discutons des travaux futurs qui pourraient atténuer les inquiétudes concernant nos résultats actuels.


Acknowledgements

I would like to thank my supervisor, Peter Savadjiev, and Adam Kortylewski for their guidance, feedback, and insightful conversations. They were both supportive of my work and helped me overcome the intellectual challenges that come with research. I would not have been able to create this work or produce my results without them. I would also like to acknowledge and thank Adam for his Compositional Networks code, which handles the training and testing of the model. Finally, I would like to thank Compute Canada for the use of their compute hardware; I would not have been able to run any experiments without it.


Contribution of Authors

Matthew Lesko-Krleza wrote this thesis, and wrote code for the experiments.


Table of Contents

Abstract
Abrégé
Acknowledgements
Contribution of Authors
List of Figures
List of Tables
1 Introduction
2 Literature Review
2.1 Machine Learning
2.2 Discriminative and Generative Models
2.3 Deep Learning
2.3.1 Artificial Neural Network Theory
2.3.2 Deep Convolutional Neural Networks
2.3.3 Common Training Problems and Their Solutions
2.3.4 Activation and Pooling Functions
2.3.5 Non-Medical Imaging Applications
2.3.6 Medical Imaging Applications
2.4 Dictionary Learning and Pattern Theory
2.5 Compositional Models
2.5.1 Hierarchical Compositional Models
2.5.2 Recursive Cortical Networks for Object Classification under Occlusion
2.5.3 Compositional Convolutional Neural Networks
3 Methodology
3.1 Datasets
3.2 Methods
3.2.1 Image Registration
3.2.2 Pre-Trained Feature Extractor
3.2.3 CompNet Training
3.2.4 CompNet Testing
4 Results
5 Discussion
5.1 Cluster Centre Patch Visualizations
5.2 Cluster Center Activation Visualizations
5.3 Synthetic Occlusions
5.4 LiTS Tumour Occlusions
5.5 MUHC PACS Tumour Occlusions
5.6 Future Work
5.7 Hypothesis Discussion and Conclusion
6 Appendix
6.1 Image Registration Script

List of Figures

2.1 Feedforward Neural Network Example

2.2 Convolving a 3 × 3 kernel over a 4 × 4 input using a stride of 1 and zero padding. Adapted from [14]

2.3 Encoder-Decoder CNN Architecture for Lung Segmentation [7]

2.4 Two-Pathway CNN Architecture (TwoPathCNN) for Brain Tumour Segmentation [21]

2.5 Cascaded TwoPathCNN Architectures

2.6 U-Net CNN Architecture for Electron Microscopy Cell Segmentation (example for 32x32 pixels in the lowest resolution) [6]

2.7 Hierarchical Compositional Model Representing a Horse [10]

2.8 Compositional Convolutional Neural Network Architecture [69]

2.9 Illustration of vMF kernels by visualizing the image patterns from the training set that activate a given vMF kernel the most. Note how image patterns that are of similar appearance and share semantic meaning are separated into different kernels [69].

2.10 Visualization of learned mixture models (4 mixtures). Each row is for a different object class (car, train, boat, or bus) and each column represents a different mixture for the object class. Note how different 3D viewpoints are approximately separated into different mixtures [69].

2.11 Occlusion localization results from [69]. Each result consists of three images: the input image, the occlusion scores of a dictionary-based compositional model from a prior work by Kortylewski et al. [72], and the occlusion scores of the proposed CompNet [69]. Note how the CompNet can localize occluders with high accuracy across different objects and occluder types, for real as well as artificial occlusions.

3.1 U-Net training loss and validation accuracy plots

4.1 First 6 CompNet cluster center patch visualizations. Every set of 16 patches represents one cluster center and shows the patches most representative of that cluster center.

4.2 Second 6 CompNet cluster center patch visualizations. Every set of 16 patches represents one cluster center and shows the patches most representative of that cluster center.

4.3 Third 6 CompNet cluster center patch visualizations. Every set of 16 patches represents one cluster center and shows the patches most representative of that cluster center.

4.4 vMF cluster activations. The image in the top left is the input image. Purple pixels represent no activation, blue pixels represent a low activation, green pixels represent moderate activation, and yellow pixels represent high activation.

4.5 Occlusion generation on synthetic occluders. The left column consists of input images with some synthetic occlusion and the right column consists of occlusion maps, where purple pixels represent no occlusion, orange pixels represent a low occlusion score, green pixels represent a medium occlusion score, and blue pixels represent a high occlusion score.

4.6 Occlusion generation on real tumours. In every sub-figure, the left column consists of input images with some real tumour, the middle column consists of occlusion maps, and the right column consists of the ground truth liver and tumour segmentation maps. In the occlusion maps, purple pixels represent no occlusion, blue pixels represent a low occlusion score, green pixels represent a medium occlusion score, and yellow pixels represent a high occlusion score. In the ground truth segmentation maps, black, grey, and white pixels represent the background, liver tissue, and tumour tissue classes respectively.

4.7 Occlusion generation on manually segmented real tumours from the MUHC PACS dataset. In every sub-figure, the left column consists of input images with some real tumours and the right column consists of occlusion maps, where purple pixels represent no occlusion, orange pixels represent a low occlusion score, green pixels represent a medium occlusion score, and blue pixels represent a high occlusion score.

6.1 Caption

Chapter 1

Introduction

Research in Artificial Intelligence (AI) has made significant progress in the last several years thanks to the growing amount of labelled data, increasingly abundant computational resources, and breakthroughs in computer vision algorithms, including artificial neural networks and Compositional Models [1], [2]. As a subfield of computer vision, medical imaging is rich with annotated data and challenges ripe for AI innovation. AI is unlikely to fully replace the need for human radiology expertise; however, it has enormous potential to enhance existing systems and to automate certain repetitive tasks. Examples of such tasks in medical imaging analysis include segmenting organs or diseases, classifying diseases, and enhancing image quality [3]. AI could revolutionize the medical field by helping radiology experts deliver better and faster care for patients, reducing overall medical costs, and enabling broader access to medical analysis, especially in impoverished societies. Given the immense popularity and promising success of artificial neural networks in common-object computer vision tasks such as object classification and segmentation [4], [5], it is only natural that research labs have begun evaluating their performance in medical imaging analysis. Applying neural networks to medical imaging analysis has been met with some success [6], [7] and a large number of challenges [8], [9]. Applying AI to medical imaging tasks may sound promising, but applying currently popular AI algorithms to medical imaging analysis is still fraught with difficulty.

There are many challenges in applying AI algorithms to medical imaging. First and foremost, there is a lack of data. Popular state-of-the-art deep learning segmentation models train on common-object datasets comprising more than 100,000 images and over 800,000 instances, whereas medical image datasets contain only several hundred volumetric data points [3]. Deep learning models require vast amounts of data to be successful, and medical image datasets do not normally provide that much. In addition, imaging modalities differ across medical datasets, and the machines and parameters used to acquire the data may differ from one dataset or clinic to the next. This poses a challenge for model generalizability, because a model may work well on one source of data but adapt poorly to another. One might suggest using data augmentation to increase sample variance in a given dataset, and such approaches have indeed improved predictive performance [6]. However, augmentation techniques are not standardized, and empirical evidence is required to determine which techniques work best for a given dataset, which demands more time and effort during training. Data augmentation therefore does not solve the main challenges of working with medical images; it only helps circumvent some of them. Finally, and most importantly for our work, diseases in medical images can be difficult for the human eye to locate, and their appearance can vary greatly from one patient to another. For example, when localizing liver tumours in computed tomography scans of the abdomen, tumours may look only slightly darker or lighter than the surrounding tissue, their texture can resemble that of liver tissue, and they can appear in an overwhelming variety of shapes and sizes. The lack of data and the variance in disease appearance are thus among the biggest challenges we face in applying AI to medical imaging analysis. To summarize, applying AI algorithms to medical imaging analysis is challenging because of the lack of data, the nature of the data, the uncertainty in augmenting the data, and the variance in disease appearance. We would therefore like to explore a more data-efficient method for medical imaging analysis.

In contrast to deep learning models, which have garnered an immense amount of attention, there exists a set of generative models that require far less training data to learn object representations yet still offer accurate representations. These models are called Compositional Models [10]. Compositional Models represent objects in a compositional manner by recursively composing elemental parts of objects, such as curves or patches of texture extracted from input images, into progressively more unified and holistic object parts until the entire target object can be reconstructed. They have gained some popularity in computer vision because of their data efficiency [2], but they usually underperform deep learning models on common tasks such as object classification. Nevertheless, they are an attractive approach to evaluate for medical imaging analysis because of their data efficiency in learning object representations and, more importantly, because they have not yet been applied to medical imaging analysis tasks. Evaluating Compositional Models for medical imaging analysis is a new area of research that we would like to explore.

We believe we need to explore a direction that uses Compositional Models, but we also do not want to disregard deep learning algorithms entirely. Artificial neural networks may require a large amount of data, but they have proven to be strong discriminative feature extractors, whereas generative models typically need less data but are usually outperformed by their deep learning counterparts in vision tasks. This is why we propose to evaluate the potential of a recently introduced image analysis algorithm that combines an artificial neural network with a generative model. This algorithm is called the Compositional Convolutional Neural Network [11], or Compositional Network (CompNet) for short, and we wish to evaluate its potential for disease localization. It consists of a pre-trained deep learning model for image feature extraction and a generative Compositional Model that learns from, and performs inference on, the feature maps extracted by the neural network. Not only has this algorithm never been evaluated for disease localization before, but an important novel feature of it is its ability to perform unsupervised object occlusion localization. In other words, it can localize object occlusions without having seen any form of occlusion during the training phase. In this work, we formulate tumour localization as an occlusion localization problem. We wish to use the Compositional Network’s occlusion localization ability to localize tumours in an unsupervised fashion. The benefit of this method is that the training phase of the Compositional Network does not require instances of the disease we wish to locate at test time. If successful, this could enable a promising method for localizing diseases in medical images without the need for an exhaustive amount of data that captures the full variance in disease appearance, and it could change the way we train machine learning models for disease localization. Given its success in occlusion localization, we experiment with the Compositional Network for tumour localization without any tumour data in the training dataset.

Our hypothesis is that CompNets can accurately localize tumours without tumour training data and with little liver training data. The rest of this work is organized as follows. We first review the literature on deep learning and its applications in medical imaging, as well as the literature on Compositional Models. We then describe our methodology and our results in evaluating the Compositional Network’s performance for tumour localization. Finally, we discuss the promise of Compositional Networks in medical imaging analysis, including the areas in which they currently succeed and the areas in which they need more work. Overall, we contribute to the AI and medical imaging analysis fields by showing the promising potential of Compositional Networks for disease localization, thanks to their data efficiency and their ability to localize parts of tumours in an unsupervised way.


Chapter 2

Literature Review

This work assumes that the reader has a computer science background and understands basic machine learning concepts such as supervised learning, unsupervised learning, features, and training data. We first describe the difference between discriminative and generative machine learning models, because this difference highlights some of the reasons why we chose to use Compositional Models for medical imaging analysis. We then review deep learning, including its fundamental theory, convolutional feature extraction, pooling and activation functions, non-medical imaging applications, and medical imaging applications. Next, we describe dictionary learning and pattern theory as the ideas underlying Compositional Models. Finally, we describe the motivation and theory behind Compositional Models and Compositional Networks.

2.1 Machine Learning

Machine Learning (ML) is a subset of Artificial Intelligence (AI). It is defined as a set of methods that automatically discover patterns in data and use the discovered patterns to make decisions under uncertainty or to predict patterns in new data. The main role of ML is to learn patterns on its own and then make decisions without being explicitly programmed to do so. ML algorithms vary in performance and complexity, from simple classifiers to deep Artificial Neural Networks, which we discuss in Section 2.3. Before discussing examples of ML algorithms, we discuss the difference between discriminative and generative models, since this distinction highlights the appeal of Compositional Models for medical imaging analysis, the domain to which we wish to apply them.

2.2 Discriminative and Generative Models

Here we discuss the difference between discriminative and generative models, as well as the theory of classical Compositional Models and Hierarchical Compositional Models, as primers for understanding Compositional Networks. In ML, our goal is to learn the relationships between the world and the data we sample from it. We can approximate these relationships by training a model with some set of parameters. The model can be either generative or discriminative. With generative models, we aim to model the distribution from which a sample $x$ labelled as $y$ is drawn, that is, the joint distribution:

$$P(x, y) = P(x \mid y)\,P(y)$$

where $P$ denotes a probability, $x$ is a sample from some dataset, such as an input feature vector, and $y$ is the label of that sample, such as a class label or a value sampled from a continuous distribution. This allows us to capture and represent the underlying distribution of the samples $x$, which we can then use to generate new data instances. This makes generative models useful for unsupervised ML approaches, because they allow us to understand the underlying distribution of the data. With discriminative models, by contrast, learning the distribution from which $x$ is sampled is irrelevant; the only goal is to classify, or discriminate between, samples:

$$P(y \mid x)$$

where $P$, $x$, and $y$ are defined as for generative models. Discriminative models learn the boundaries between classes within a given dataset, making them computationally cheaper and more robust to outliers than generative models, at the expense of not learning the underlying distribution of the data.
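To make the distinction concrete, the following minimal sketch uses a toy problem with two classes and one binary feature. The class names, feature values, and all probabilities are hypothetical and chosen only for illustration. It shows that the quantities a generative model stores, $P(x \mid y)$ and $P(y)$, are enough to recover the discriminative quantity $P(y \mid x)$ through Bayes’ rule, whereas the reverse is not possible.

```python
# Toy illustration (all numbers are assumed, not taken from any dataset).
# A generative model stores the class prior P(y) and the likelihood P(x|y).
prior = {"healthy": 0.8, "tumour": 0.2}                 # P(y)
likelihood = {                                          # P(x|y)
    "healthy": {"dark": 0.1, "bright": 0.9},
    "tumour": {"dark": 0.7, "bright": 0.3},
}


def joint(x, y):
    """Joint probability P(x, y) = P(x|y) * P(y)."""
    return likelihood[y][x] * prior[y]


def posterior(x):
    """Discriminative quantity P(y|x), obtained via Bayes' rule."""
    evidence = sum(joint(x, y) for y in prior)          # P(x)
    return {y: joint(x, y) / evidence for y in prior}


post_dark = posterior("dark")
print(post_dark)
```

A purely discriminative model would learn only the table returned by `posterior`, which is sufficient for classification but cannot be used to generate new samples $x$.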

Examples of generative models include mixture models and Hierarchical Compositional Models (HCMs). A mixture model is a probabilistic model for representing the presence of subpopulations within an overall population. It corresponds to a mixture, or weighted set, of underlying distributions, which allows for flexibility in data representation. Mixture models are explained in more detail in Section 2.5.3, where they form an important component of Compositional Networks. HCMs are the earliest form of Compositional Model, and Compositional Networks derive their compositionality from them; they are reviewed in Section 2.5.1.
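As a small illustration of the "weighted set of underlying distributions" idea, the sketch below evaluates a two-component mixture density and the posterior probability that each component generated a given point. It assumes univariate Gaussian components with made-up weights, means, and standard deviations; this is a generic example and not the specific mixture model used in Compositional Networks.

```python
import math


def gaussian_pdf(x, mean, std):
    """Density of a univariate normal distribution at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


# Hypothetical two-component mixture: (weight, mean, standard deviation).
components = [(0.3, -2.0, 0.5), (0.7, 1.0, 1.0)]


def mixture_pdf(x):
    """Weighted sum of component densities; the weights sum to 1."""
    return sum(w * gaussian_pdf(x, m, s) for w, m, s in components)


def responsibilities(x):
    """Posterior probability that each component generated x."""
    total = mixture_pdf(x)
    return [w * gaussian_pdf(x, m, s) / total for w, m, s in components]


resp = responsibilities(-2.0)
print(resp)
```

A point at the mean of the first component is assigned almost entirely to that component, even though the component has the smaller mixture weight.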

An example of a discriminative model is the Artificial Neural Network, which we

describe in Section 2.3.

Finally, a Compositional Network is a new ML model that combines the discriminative power of Artificial Neural Networks with the generative advantages of Compositional Models. The generative aspect of Compositional Networks enables ML practitioners to learn the parts, and the compositions of parts, that make up the training data. In this work, we experiment with the Compositional Network and its ability to learn parts and compositions of parts from medical images while using a deep neural network for feature extraction. Compositional Networks are described further in Section 2.5.3.

2.3 Deep Learning

2.3.1 Artificial Neural Network Theory

Deep learning is a subset of ML. It usually refers to the training of a special type of model, the Artificial Neural Network (ANN), which loosely mimics the multilayered human cognitive system. We discuss the fundamentals of deep learning and convolutional neural networks because they are used as feature extractors in Compositional Networks, and because their training algorithm, backpropagation [12], is also used to train Compositional Networks. Understanding these fundamentals will help the reader understand the theory behind Compositional Networks.

Artificial Neural Networks are sets of interconnected nodes, arranged in an acyclic graph, that are loosely inspired by the neurons in the human brain. The purpose of an ANN is to learn an internal representation of the dataset it trains on, which can then be used for a downstream task such as prediction or pattern recognition. The main advantage of an ANN is its ability to perform pattern recognition on raw signals, without the need for manual feature engineering or extensive preprocessing. ANNs were previously limited in their ability to solve real-world problems by the lack of sufficient data and of the computing power needed to train them. In recent years, however, thanks to larger datasets, the abundant compute available through Graphics Processing Units (GPUs) and cloud computing, and solutions to the vanishing gradient problem, which we describe in Section 2.3.3, ANNs have been widely used for predictive tasks, especially in general computer vision [4] and medical imaging analysis [13]. The quintessential deep learning model is the feedforward neural network, or multilayer perceptron (MLP). The goal of an MLP is to approximate some function $f^*$ that maps an input vector $x$ to an output $y$, where $y$ can be either a continuous or a discrete value (such as a class label). The MLP defines a mapping $y = f(x; \theta)$, where $\theta$ is the set of parameters (weights) learned to best approximate $f^*$. MLPs are called networks because they are a chained composition of many functions. For example, we might have three functions $f^{(1)}$, $f^{(2)}$, and $f^{(3)}$ connected in a chain to form $f(x) = f^{(3)}(f^{(2)}(f^{(1)}(x)))$. In this case, $f^{(1)}$ is called the first layer, $f^{(2)}$ the second layer, and so on. Graphically, the input vector is represented by the input nodes, the function layers by intermediary nodes $h^{(n)}_m$, and the output vector $y$ by the set of output nodes of the MLP computation graph. Figure 2.1 illustrates an example of a feedforward neural network with two hidden layers. The input and output of each node are defined by the direction of the edges that connect the nodes into an acyclic network. The MLP’s parameters $\theta$ are a set of weights that weigh the respective output of every layer. These are the parameters that are optimized during training via gradient-based learning.

Figure 2.1: Feedforward Neural Network Example (input layer $x_1$ to $x_3$; hidden layer 1 $h^{(1)}_1$ to $h^{(1)}_4$; hidden layer 2 $h^{(2)}_1$ to $h^{(2)}_3$; output layer $y_1$, $y_2$; weight sets $\theta^{(1)}$, $\theta^{(2)}$, $\theta^{(3)}$)
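The chained composition $f(x) = f^{(3)}(f^{(2)}(f^{(1)}(x)))$ can be sketched in a few lines of plain Python. The layer sizes follow Figure 2.1 (3 inputs, hidden layers of 4 and 3 units, 2 outputs). The random weights, the omission of bias terms, and the choice of ReLU as the hidden nonlinearity are assumptions made only for this illustration.

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible


def make_layer(n_in, n_out):
    """Random weight matrix theta with shape (n_out, n_in); biases omitted."""
    return [[random.uniform(-1.0, 1.0) for _ in range(n_in)] for _ in range(n_out)]


def relu(v):
    """Element-wise rectified linear unit (an assumed activation choice)."""
    return [max(0.0, a) for a in v]


def apply_layer(theta, v, activation=None):
    """One layer: weighted sum of the inputs, then an optional nonlinearity."""
    out = [sum(w * a for w, a in zip(row, v)) for row in theta]
    return activation(out) if activation else out


# Layer sizes follow Figure 2.1: 3 inputs -> 4 hidden -> 3 hidden -> 2 outputs.
theta1, theta2, theta3 = make_layer(3, 4), make_layer(4, 3), make_layer(3, 2)


def f(x):
    """f(x) = f3(f2(f1(x))): the layers are chained one after another."""
    h1 = apply_layer(theta1, x, relu)   # first layer, f1
    h2 = apply_layer(theta2, h1, relu)  # second layer, f2
    return apply_layer(theta3, h2)      # output layer, f3


y = f([0.5, -1.0, 2.0])
print(y)
```

With untrained random weights the output values are meaningless; the point is only that each layer consumes the previous layer’s output, which is what makes the network a composition of functions.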

Training a neural network is not much different from training any other ML model with gradient descent. A differentiable loss function (or cost function), defined as $\mathrm{Loss}(f(x), y)$, measures the neural network’s performance during training, and gradient descent is performed to optimize the network’s parameters. Every gradient step is a function of the derivative of the loss with respect to the weights, $\frac{\partial L}{\partial w}$, which is computed by applying the chain rule to all the stored local gradients. The partial derivative of every layer’s output is computed with respect to its input; these local partial derivatives are then multiplied together, following the chain rule, to obtain the partial derivative of the loss with respect to the weights. For example, consider three consecutive, connected hidden layers $h^{(1)}$, $h^{(2)}$, and $h^{(3)}$ followed by a loss function Loss, with respective outputs $z_1$, $z_2$, $z_3$, and $L$. The loss function measures the discrepancy between the estimates $\hat{y}$ and the ground truth $y$. The local partial derivatives are $\frac{\partial z_1}{\partial w}$, $\frac{\partial z_2}{\partial z_1}$, $\frac{\partial z_3}{\partial z_2}$, and $\frac{\partial L}{\partial z_3}$, where $w$ denotes the weights of the first layer. By the chain rule:

$$\frac{\partial L}{\partial w} = \frac{\partial z_1}{\partial w} \cdot \frac{\partial z_2}{\partial z_1} \cdot \frac{\partial z_3}{\partial z_2} \cdot \frac{\partial L}{\partial z_3} \tag{2.1}$$
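Equation 2.1 can be checked numerically on a scalar toy chain. The specific layer functions below (a multiplication by $w$, a tanh, a scaling by 3) and the squared-error loss are hypothetical choices for illustration only. The sketch multiplies the four local gradients and compares the product against a finite-difference estimate of the same derivative.

```python
import math


def forward(w, x, y):
    """Scalar three-layer chain followed by a squared-error loss."""
    z1 = w * x              # layer 1
    z2 = math.tanh(z1)      # layer 2
    z3 = 3.0 * z2           # layer 3
    loss = (z3 - y) ** 2    # L
    return z1, z2, z3, loss


def backward(w, x, y):
    """Multiply the local gradients, as in Equation 2.1."""
    z1, z2, z3, _ = forward(w, x, y)
    dz1_dw = x
    dz2_dz1 = 1.0 - math.tanh(z1) ** 2
    dz3_dz2 = 3.0
    dL_dz3 = 2.0 * (z3 - y)
    return dz1_dw * dz2_dz1 * dz3_dz2 * dL_dz3


w, x, y = 0.4, 1.5, 1.0
analytic = backward(w, x, y)

# Central finite difference as an independent check of the chain rule.
eps = 1e-6
numeric = (forward(w + eps, x, y)[3] - forward(w - eps, x, y)[3]) / (2 * eps)
print(analytic, numeric)
```

The two values agree to within numerical precision, which is the same kind of gradient check commonly used to debug hand-written backward passes.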


Figure 2.2: Convolving a 3 × 3 kernel over a 4 × 4 input using a stride of 1 and zero

padding. Adapted from [14]

In stochastic gradient descent, for example, the model’s weights $w$ are updated as follows:

$$w \leftarrow w - \alpha \frac{\partial L}{\partial w} \tag{2.2}$$

where $\alpha$ is the learning rate, $0 < \alpha \leq 1$. Various optimization algorithms exist, but for simplicity we limit ourselves to basic stochastic gradient descent. During training, the MLP stores local gradients, and backpropagation (the application of the chain rule followed by weight updates) is performed to optimize the model’s weights. This form of training generalizes to any kind of deep learning network. During testing, the MLP simply performs the forward pass without updating any weights. Performance is measured with some metric, such as cross-entropy (for classification) or mean squared error (for regression).
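The update rule in Equation 2.2 can be illustrated with a one-parameter model. The sketch below assumes a hypothetical noiseless dataset generated by $y = 2x$, a squared-error loss, and a learning rate of 0.1; one randomly chosen sample is used per step, which is what makes the procedure stochastic.

```python
import random

random.seed(0)

# Hypothetical noiseless data generated by y = 2 * x.
data = [(x, 2.0 * x) for x in [0.5, 1.0, 1.5, 2.0]]

w = 0.0        # initial weight
alpha = 0.1    # learning rate, 0 < alpha <= 1

for step in range(200):
    x, y = random.choice(data)        # one random sample per step
    grad = 2.0 * (w * x - y) * x      # dL/dw for L = (w * x - y) ** 2
    w = w - alpha * grad              # update rule from Equation 2.2

print(w)
```

For this toy problem every step shrinks the distance between $w$ and 2, so the weight converges to the slope of the generating line regardless of the order in which the samples are drawn.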

Ultimately, the MLP is an encoder that learns a representation of its training data by encoding raw data into a feature space. This representation can be thought of as a compressed version of the training data.

2.3.2 Deep Convolutional Neural Networks

Deep Convolutional Neural Networks (DCNNs or CNNs for short) are a class of deep learning models that use convolutional filters for feature extraction. Over recent years, they have gained immense use because of their promising results in computer vision tasks. Their convolution filters perform convolutional operations on multi-dimensional data. In the case of a 2-dimensional input, a kernel K passes over points (x, y) and performs a convolution operation on the values within the window centred at each point (x, y). The stride between points is a parameter determined by the model designer. The kernel K consists of weights w that are multiplied with the input to achieve a weighted convolution operation. During training, the weights are updated to optimize the loss function. Given that a window is used to select a set of pixels and aggregate their values together, convolutional filters will commonly reduce the spatial resolution of the input feature map. This resolution reduction effect is depicted in Figure 2.2. The output is a feature map in which each pixel is a weighted sum of its neighbours’ features or pixel values. This feature aggregation enables the neural network to extract a hierarchy of increasingly complex features, making CNNs very appealing for image analysis. However, if either padding is added to the input such that the output resolution matches the input resolution, or if the kernel size is 1 × 1 and the stride is 1, then the output feature map’s resolution matches that of the input. Therefore, convolutions don’t necessarily reduce the spatial resolution of the input. This feature map can either be fed to more functions within a neural network for further feature extraction or fed to a prediction layer to perform a predictive task such as image classification, object segmentation, or feature clustering.
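As a minimal sketch of the resolution behaviour described above, the following pure-Python snippet computes the output size of a convolution along one axis and applies a naive 2-dimensional filter with a given stride and zero padding. It is not taken from any implementation used in this work: the function names `conv_output_size` and `conv2d`, the single-channel input, and the square kernel are illustrative assumptions.

```python
def conv_output_size(n, kernel, stride=1, padding=0):
    """Output size along one axis for an input of size n (illustrative helper)."""
    return (n + 2 * padding - kernel) // stride + 1


def conv2d(image, kernel, stride=1, padding=0):
    """Naive single-channel 2D filter (cross-correlation form).

    Assumes `image` is a list of equal-length rows and `kernel` is square.
    """
    k = len(kernel)
    # Zero-pad the input on every side.
    width = len(image[0]) + 2 * padding
    padded = [[0.0] * width for _ in range(padding)]
    for row in image:
        padded.append([0.0] * padding + list(row) + [0.0] * padding)
    padded += [[0.0] * width for _ in range(padding)]

    out_h = conv_output_size(len(image), k, stride, padding)
    out_w = conv_output_size(len(image[0]), k, stride, padding)
    out = []
    for y in range(out_h):
        out_row = []
        for x in range(out_w):
            # Weighted sum of the window whose top-left corner is (y*stride, x*stride).
            acc = 0.0
            for m in range(k):
                for n in range(k):
                    acc += kernel[m][n] * padded[y * stride + m][x * stride + n]
            out_row.append(acc)
        out.append(out_row)
    return out
```

With a 4 × 4 input, a 3 × 3 kernel without padding gives a 2 × 2 output, whereas adding a padding of 1, or using a 1 × 1 kernel with a stride of 1, keeps the 4 × 4 resolution.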

Receptive Fields

Here we describe one of the basic concepts in CNNs, the Receptive Field, or Field of View, of a unit in a certain layer in the network. We describe its use and how to compute its size, because receptive fields become relevant when we discuss our results in Section 4. In fully connected networks, the value of each feature unit depends on the entire input to the network, whereas a feature unit within a CNN only depends on a specific region of the input. The receptive field is that region for the unit; intuitively, it is the region in the input space that a particular CNN feature is paying attention to. This concept helps in understanding and diagnosing how CNNs work. We can compute the receptive field size $r_0$, measured on the input image, of a unit after $L$ layers as follows:

$$r_0 = \sum_{l=1}^{L} \left( (k_l - 1) \prod_{i=1}^{l-1} s_i \right) + 1 \tag{2.3}$$

Where $k_l$ and $s_l$ are the respective kernel size and stride size used at layer $l$ within the CNN. If the stride is greater than 1 for a particular layer, the region increases proportionally for all layers after the given one. Receptive fields are important in visualizing and understanding the feature extraction process in CNNs. We use them to visualize the patches of images which are representative of meaningful features extracted during the training of a Compositional Network as seen in Section 5.1.
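Equation 2.3 can be evaluated with a short loop. The sketch below is only an illustration: the function name `receptive_field` and the list-based layer description are assumptions, and pooling layers are treated like any other layer with a kernel size and a stride.

```python
def receptive_field(kernel_sizes, strides):
    """Receptive field size on the input, following Equation 2.3.

    kernel_sizes[l] and strides[l] describe layer l + 1 of the network.
    """
    r = 1
    jump = 1  # product of the strides of all earlier layers
    for k, s in zip(kernel_sizes, strides):
        r += (k - 1) * jump
        jump *= s
    return r
```

For instance, three stacked 3 × 3 convolutions with a stride of 1 give `receptive_field([3, 3, 3], [1, 1, 1])`, which evaluates to 7, while a stride of 2 in the first of two 3 × 3 layers already reaches the same size with `receptive_field([3, 3], [2, 1])`.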

2.3.3 Common Training Problems and Their Solutions

As a disclaimer, this section is not directly relevant to our main contribution, but readers may find it useful since several of the techniques are referenced in the Medical Imaging Applications discussion in Section 2.3.6.

Vanishing and Exploding Gradients

During training, the local gradient of every function within the forward pass is computed with respect to its input. Then, during Backpropagation, the chain rule is applied as in Equation 2.1 to compute the gradient of the loss function with respect to the model’s weight values. However, if the ANN has many layers and the local gradients are consistently small or consistently large, their product across layers can either vanish to zero or explode in value respectively. This is also called gradient saturation and can significantly affect convergence and training [15]. This problem has been largely addressed by normalized initialization [16] and batch normalization layers [17], which normalize the feature maps and gradients across layers. Batch Normalization is a function applied to intermediate layers within the network that normalizes each input training batch:

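Mini-batch mean: $\mu_B = \frac{1}{m} \sum_{i=1}^{m} x_i$

Mini-batch variance: $\sigma_B^2 = \frac{1}{m} \sum_{i=1}^{m} (x_i - \mu_B)^2$

Normalization: $\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}$

Scale and Shift: $\text{output} = \gamma \hat{x}_i + \beta$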

where $x$ is some $d$-dimensional input and each dimension of the input is normalized independently, $B$ is a batch of $m$ inputs, $\gamma$ and $\beta$ are learned scaling and offset parameters respectively, and $\epsilon$ is a hyperparameter for numerical stability that avoids dividing by zero when normalizing. Not only does Batch Normalization mitigate the vanishing and exploding gradient problem by normalizing features, it has been shown that merely adding it to a state-of-the-art image classification model yields a substantial speedup in training and a significant increase in performance [17].
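The four Batch Normalization steps above can be sketched in a few lines of pure Python. This is an illustrative training-time forward pass only: the function name `batch_norm` and the list-of-lists batch layout are assumptions, and the running statistics used at inference time are left out.

```python
import math


def batch_norm(batch, gamma, beta, eps=1e-5):
    """Normalize each of the d input dimensions over a mini-batch of m samples."""
    m = len(batch)
    d = len(batch[0])
    out = [[0.0] * d for _ in range(m)]
    for j in range(d):
        mean = sum(x[j] for x in batch) / m                # mini-batch mean
        var = sum((x[j] - mean) ** 2 for x in batch) / m   # mini-batch variance
        for i in range(m):
            x_hat = (batch[i][j] - mean) / math.sqrt(var + eps)  # normalization
            out[i][j] = gamma[j] * x_hat + beta[j]               # scale and shift
    return out
```

Note that two dimensions with very different scales, such as `[[1.0, 10.0], [3.0, 30.0]]`, are both mapped to values close to -1 and 1 when `gamma` is 1 and `beta` is 0.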

Overfitting

Overfitting is the problem when a model fits too closely to a particular training dataset,

which can occur for various reasons. It can occur if the model to be trained is too compli-

cated, if the model’s training duration is too long, or most often if there is too little training

data for the given task. Intuitively, when the model is overfitting, it can be thought of as

the model attempting to memorize the dataset. This is undesirable because the model has

difficulty generalizing to new and unseen data. ANNs are prone to overfitting because of

their increased complexity, which introduces a significantly larger need for training data

as opposed to their classical ML model counterparts. Methods such as increasing the

amount of training data, transfer learning, early-stopping, data augmentation, weight

regularization, and dropout help reduce the possibility of overfitting in neural networks.

Medical imaging training data sets are usually significantly smaller than common computer vision datasets [3]. The required training sample size is an ongoing area of research in ML, but a rule of thumb suggests that the number of samples be at least 10 times the number of trainable parameters [3]. Unfortunately, most publicly available medical imaging datasets contain only hundreds or several thousand samples [3], while neural networks can contain millions of trainable parameters [4], [6], [7]. Increasing the amount of training data within the medical domain is costly, requiring medical equipment, patients, and medical expertise to properly label samples. Since small training datasets are common within the medical imaging context, it’s attractive to use pre-trained networks within a transfer learning paradigm. Many state-of-the-art neural network solutions in medical imaging analysis use some form of transfer learning [18].

Transfer learning involves using a model whose parameters have already been trained on a source task, and adapting the model to a target task [19]. Given a source domain $D_s$ with a corresponding source task $T_s$, and a target domain $D_t$ with a target task $T_t$, transfer learning is the process of improving the model’s target task function $f_t^*(\cdot)$ by using related information from the source domain and task, where $D_s \neq D_t$ or $T_s \neq T_t$. One will commonly find within the vision context that models are pre-trained on large image datasets such as ImageNet or CIFAR-100, then either all their parameters or exclusively the model’s prediction layer are updated for the target task by training on the target dataset [18].

Early-stopping is the act of stopping a model’s training before the intended stopping criterion, such as a number of iterations or some accuracy threshold, is reached. The exact criterion used for validation-based early stopping is either chosen in an ad-hoc fashion, applied interactively, or triggered when the validation set performance ceases to improve for some given duration [20]. When it is used, it’s commonly in conjunction with other overfitting solutions such as dropout and weight regularization, as seen in [21].
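One common form of the validation-based criterion is a patience rule, sketched below under stated assumptions: the function name `early_stopping_epoch`, the use of a validation loss (lower is better), and the default patience of 3 epochs are illustrative choices rather than the exact criterion of [20] or [21].

```python
def early_stopping_epoch(val_losses, patience=3):
    """Return (stop_epoch, best_epoch) for a sequence of per-epoch validation losses.

    Training stops once the loss has not improved for `patience` epochs.
    """
    best = float("inf")
    best_epoch = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            return epoch, best_epoch
    # The criterion never triggered: training ran to the last epoch.
    return len(val_losses) - 1, best_epoch
```

In practice the model weights from `best_epoch` would be the ones kept, since later epochs are assumed to have started overfitting.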

Data augmentation is commonly applied to image datasets when training convolutional deep neural networks [22]. The goal is to increase the variance of the training dataset’s distribution in the hope of learning a more generalizable representation of the data. Some data augmentation function $f(\cdot)$ is applied to an input training sample $x$ such that $f(x)$ yields a similar but new training sample. Examples of data augmentation functions for image data include flipping, colour space transformation, cropping, rotation, translation, noise injection, kernel filter application, image mixing, and random erasing [22]. In the majority of training cases, data augmentation is beneficial and helps gain several more points in task performance. However, if the task contains instances that fall outside the distribution of the training dataset, data augmentation can do little to help with those cases. In oncology, disease appearance can vary highly, which leads to issues when medical imaging datasets are small and the desired method for some computer vision task involves deep neural networks. In these situations, data augmentation can do very little. In recent years, Generative Adversarial Networks (GANs) have been introduced. They consist of a framework for estimating generative models by simultaneously training a generative model that captures the data distribution and a discriminative model that estimates the probability of a sample having come from the training data rather than the generative model [23]. Another data augmentation strategy consists of using GANs to generate new training data, for example new images of tumour instances. However, the issue is that we don’t know the relationship between tumour appearance and clinical outcome, so even if we generated new instances, we wouldn’t be sure how to classify them. Additionally, this adds an extra layer of complexity, and the method hasn’t been shown to dramatically increase performance. In [24], the authors used a GAN to generate new tumour segmentation instances for the training set, but this only helped achieve a 3.4% increase in segmentation performance. Applying data augmentation within the medical imaging world is not straightforward, nor are there any standards.
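To make the notion of an augmentation function $f(\cdot)$ concrete, the sketch below implements three of the listed transformations (flipping, noise injection, and random erasing) on a single-channel image stored as a list of rows. All function names, the noise level, and the erased patch size are illustrative assumptions and are not taken from [22].

```python
import random


def horizontal_flip(image):
    """Mirror every row of the image."""
    return [list(reversed(row)) for row in image]


def add_gaussian_noise(image, sigma, rng):
    """Noise injection: add zero-mean Gaussian noise to every pixel."""
    return [[p + rng.gauss(0.0, sigma) for p in row] for row in image]


def random_erase(image, size, rng, fill=0.0):
    """Random erasing: overwrite a randomly placed size x size patch with `fill`."""
    h, w = len(image), len(image[0])
    top = rng.randrange(h - size + 1)
    left = rng.randrange(w - size + 1)
    out = [list(row) for row in image]  # copy, so the input stays untouched
    for y in range(top, top + size):
        for x in range(left, left + size):
            out[y][x] = fill
    return out


def augment(image, rng):
    """One possible f(x): random flip, then light noise, then a 2 x 2 erasure."""
    if rng.random() < 0.5:
        image = horizontal_flip(image)
    image = add_gaussian_noise(image, 0.01, rng)
    return random_erase(image, 2, rng)
```

Each call such as `augment(image, random.Random(0))` yields a sample of the same size as the input, which is why such transformations cannot produce appearances that lie outside the original distribution.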

Weight regularization aims to keep a neural network from overfitting by penalizing large weight values within the network. Regularization terms are penalties or constraints added to the loss function. Some examples of common terms and their contribution to the cost are the following:

L1 Regularization Term $= \lambda \sum_{i=0}^{M} |W_i|$

L2 Regularization Term $= \lambda \sum_{i=0}^{M} W_i^2$

Cost $= \text{Loss}(y, \hat{y}) + \text{Regularization Term}$

Where $W_i$ is the value of a particular weight, $\hat{y}$ is the model’s prediction for the target $y$, and $\lambda$ is a hyperparameter which controls the regularization’s effect on the loss. By adding a weight penalty, the overall cost grows with the magnitude of the weights, so the optimizer, such as Stochastic Gradient Descent, is forced to keep the weights of the network small while minimizing the loss. The penalty also adds to the error gradient with respect to each weight, and this addition grows with the weight’s value in the L2 case, which results in a larger weight update. With this increase in the error gradient’s value, the larger weight values are decreased, thus stabilizing the network. Within the vision domain, the L1 regularization term produces features which are spatially localized and results in most weight values being near zero, whereas L2 regularization produces features with higher spatial variance and allows weight values to grow further away from zero [25]. Therefore, L1 regularization is preferred with datasets which have spatially local features, such as targeting objects within a large scene, whereas L2 regularization is preferred with datasets which have spatially global features, such as targeting objects or scenes that make up an entire image [25]. As an ML practitioner, one should experiment to decide which regularization term to use. Weight regularization is popularly used and can even be found in training Compositional Networks [11], [26].
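The L1 and L2 terms and their effect on the gradient can be sketched as follows. The function names and the flat list of weights are illustrative assumptions, and the sub-gradient of $|W_i|$ at zero is taken to be zero here.

```python
def l1_penalty(weights, lam):
    """L1 regularization term: lambda times the sum of absolute weight values."""
    return lam * sum(abs(w) for w in weights)


def l2_penalty(weights, lam):
    """L2 regularization term: lambda times the sum of squared weight values."""
    return lam * sum(w * w for w in weights)


def l1_gradient(weights, lam):
    """Gradient of the L1 term: constant magnitude, pointing along the sign of each weight."""
    return [lam * ((w > 0) - (w < 0)) for w in weights]


def l2_gradient(weights, lam):
    """Gradient of the L2 term: proportional to each weight, so large weights shrink fastest."""
    return [2.0 * lam * w for w in weights]


def regularized_cost(loss, weights, lam, kind="l2"):
    """Cost = Loss + Regularization Term."""
    penalty = l1_penalty(weights, lam) if kind == "l1" else l2_penalty(weights, lam)
    return loss + penalty
```

The constant-magnitude L1 gradient keeps pushing small weights all the way to zero, which is consistent with the sparsity noted above, whereas the L2 gradient fades as a weight approaches zero.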

Dropout is a regularization function employed within the neural network during training. It’s been shown that backpropagation builds up co-adaptations between neurons that work for the training data but do not generalize well to unseen data at test time [26]. Co-adaptation is the effect whereby some neurons learn to be highly dependent on others. So if the neurons being depended upon receive a "bad" input, the dependent neurons will be affected as well, which degrades the model’s performance. This is ultimately the behaviour seen in a model that has suffered from overfitting, and it is this phenomenon that we try to prevent. Dropout deactivates a random subset of the neurons at a given layer during training, preventing units from co-adapting too much [26]. Neurons that have dropout applied to them have some probability $p$ of being zeroed out during training, but during testing they always activate. Intuitively, applying dropout to a network results in sampling a thinned version of the network, which helps reduce co-adaptation during training. Given $n$ neurons, there exist $2^n$ possible thinned versions of the network, and for each presentation of a training sample a new thinned version is sampled and trained. Through this sampling, the $2^n$ networks with shared weights can be approximately combined into a single averaged neural network at test time. Results have shown that when training CNNs on several large-scale image classification tasks such as ImageNet, CIFAR-10, and CIFAR-100, state-of-the-art results used dropout [26].
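A minimal sketch of dropout on one layer’s activations is given below. It uses the so-called inverted variant, in which surviving activations are rescaled during training so that nothing needs to change at test time; this is an assumption made for simplicity and differs from scaling the weights at test time. The function name `dropout` is illustrative.

```python
import random


def dropout(activations, p, rng, training=True):
    """Zero each activation with probability p during training (inverted dropout).

    Kept activations are divided by (1 - p) so the expected value is unchanged.
    At test time every neuron stays active and the input is returned as is.
    """
    if not training or p == 0.0:
        return list(activations)
    keep = 1.0 - p
    return [a / keep if rng.random() < keep else 0.0 for a in activations]
```

Each call with `training=True` samples a different binary mask, which corresponds to sampling one of the thinned networks described above.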

2.3.4 Activation and Pooling Functions

Here we describe Activation and Pooling functions. As a disclaimer, this section is not directly relevant to our main contribution, but readers may find it useful since ANN architectures almost always contain these functions, and they can be found in all the neural network architectures discussed in our work.

Activation Functions (AFs) are applied in neural networks to the weighted sum of a neuron’s inputs and bias, and the result is used to decide whether a neuron fires or not, effectively controlling the neural network’s outputs [27]. These AFs are differentiable, hence allowing for Backpropagation learning, and can either be linear or non-linear. A linear mapping is given by an affine transformation, as in most cases [28]:

$$y = f(x) = w^{T}x + b$$

Where $x$ is the input feature vector, $w$ is the weight vector, and $b$ is the bias term. The output from each layer is fed into the next layer for multilayered networks until a final output is obtained. We’re more interested in non-linear AFs. The main advantage of using non-linear AFs over linear ones is that they enable estimating non-linear prediction functions, as opposed to only linear ones. This aids in the learning of high-order polynomial prediction functions by the neural networks, and drastically increases the hypothesis search space, allowing for greater generalization to more prediction tasks. Some of the most common examples of non-linear AFs include the Sigmoid, Hyperbolic Tangent (Tanh), Rectified Linear Unit (ReLU), and Softmax functions.

The Sigmoid AF, also referred to as the logistic or squashing function [29], is defined as:

$$f(x) = \frac{1}{1 + e^{-x}}$$

The Sigmoid AF is a simple AF which can be used for binary classification in an ANN’s prediction layer or for hidden layer neuron activation. However, it suffers from sharp gradients during backpropagation from deep hidden layers, gradient saturation, and slow convergence [27]. The Hyperbolic Tangent function was proposed to remedy some of the Sigmoid AF’s drawbacks.

The Hyperbolic Tangent (Tanh) is defined as:

$$f(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$$

The Tanh AF is smoother than the Sigmoid function, and gives better training performance for multi-layered networks than the Sigmoid function because it produces zero-centred outputs, thereby aiding backpropagation [30]. However, a notable drawback of the Tanh function is that it is expensive to compute because of its exponential and division terms [27]. This led to the development of the Rectified Linear Unit (ReLU).

The Rectified Linear Unit (ReLU) is proposed in [31] and defined as:

$$f(x_i) = \max(0, x_i) = \begin{cases} x_i, & x_i \geq 0 \\ 0, & x_i < 0 \end{cases}$$

The function rectifies any signal below a value of 0 by setting it to 0, whereas the signal maintains its value if it is equal to 0 or greater. ReLU offers faster convergence, and better performance and generalization compared to the Sigmoid and Tanh AFs [32]. It is nearly a linear function, therefore preserving the ease of optimization with gradient-descent learning algorithms [32]. The gradient computation is faster because it does not need to compute any exponential or division terms. Thanks to its advantages, the ReLU AF has been the most widely used AF, is found in many state-of-the-art results [32], and is found in the majority of neural network architectures discussed in our work. A noticeable drawback of the ReLU AF is that it can create dead neurons, that is, neurons that output zero and whose weights stop being updated, because its gradient is zero for negative inputs. This has led to the development of the leaky ReLU, an AF that we won’t go into detail on here because it’s out of the scope of our work. We encourage the curious reader to learn more about it.

The Softmax AF is used to compute a probability distribution from a real-numbered vector and is defined as:

$$f(x_i) = \frac{e^{x_i}}{\sum_{j=1}^{K} e^{x_j}}$$

Where $x$ is a vector of $K$ real numbers $x_1, \ldots, x_K$. Every value within the output vector is within the range of 0 to 1, with all values summing to 1. It’s popularly used in multi-class classification and segmentation models, with the final class for an entire image or a select pixel being the one with the highest probability. The Sigmoid and Softmax functions share similarities in their use: the Sigmoid function is used for binary classification whereas the Softmax function can be used for multiple classes.
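The four activation functions defined in this section translate directly into code. The sketch below follows the formulas as written, with one addition that is not part of the definitions: the Softmax subtracts the largest input before exponentiating, a standard trick that leaves the result unchanged while avoiding overflow. Function names are illustrative.

```python
import math


def sigmoid(x):
    """Logistic function, squashing x into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def tanh(x):
    """Hyperbolic tangent, squashing x into (-1, 1) with zero-centred output."""
    return (math.exp(x) - math.exp(-x)) / (math.exp(x) + math.exp(-x))


def relu(x):
    """Rectified Linear Unit: negative signals are set to 0."""
    return max(0.0, x)


def softmax(xs):
    """Map a vector of K real numbers to a probability distribution."""
    shift = max(xs)  # numerical stability only; cancels out in the ratio
    exps = [math.exp(x - shift) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]
```

For two classes, `softmax([x, 0.0])[0]` equals `sigmoid(x)`, which illustrates the similarity between the Sigmoid and Softmax functions mentioned above.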

Activation functions are important components of neural network architectures and

can be found in all the architectures we discuss in this work, especially those reviewed in

Section 2.3.6.

Pooling, also known as sub-sampling or down-sampling, is used to transform joint feature representations into spatially reduced ones that preserve important information and either disregard irrelevant details, in the case of min and max-pooling, or blend details with features of interest, in the case of average-pooling [33]. Its primary purpose is to reduce the number of features passed on to later layers, to prevent an exponential increase in the number of features in a deep network. Pooling can allow for spatial position invariance, lighting invariance, robustness to clutter, and compactness of representation [33]. It can also be used to increase the receptive field of a neural network for a given input image. In the case of CNNs, the feature maps resulting from convolutional operations can have high dimensionality, which may cause overfitting when a classifier is applied [34]. To reduce the overall size of the signal, a max, min, or average pooling function can be used. For example, in [35], a CNN architecture was used for object classification and a 2-dimensional average pooling layer was proposed:

$$\text{Output}(C_j, h, w) = \frac{1}{k_{height} \, k_{width}} \sum_{m=0}^{k_{height}-1} \sum_{n=0}^{k_{width}-1} \text{Input}(C_j,\ stride[0] \times h + m,\ stride[1] \times w + n) \tag{2.4}$$

Where Output and Input are the output and input feature maps respectively, $k_{height}, k_{width}$ are the kernel height and width, $C_j$ is a given channel within the feature maps, $stride[0], stride[1]$ are the strides of the window along the first and second spatial dimensions respectively, and $h, w$ are the indices of the output position along those two dimensions. $\text{Input}(C_j, stride[0] \times h + m, stride[1] \times w + n)$ selects the pixel value at channel $C_j$ and position $(stride[0] \times h + m,\ stride[1] \times w + n)$ of the input feature map. This layer performs a local averaging of neighbouring pixels within a given kernel, thereby reducing the feature’s spatial position precision, which helps enable spatial invariance, robustness against noise, and more compact feature representation. A 2-dimensional max-pooling operation, in contrast, replaces the averaging operation in Equation 2.4 with respective max operations:

$$\text{Output}(C_j, h, w) = \max_{m=0,\ldots,k_{height}-1} \ \max_{n=0,\ldots,k_{width}-1} \text{Input}(C_j,\ stride[0] \times h + m,\ stride[1] \times w + n) \tag{2.5}$$

Instead of computing the average value within the given window of pixels, the max-pooling function selects the maximum pixel value. A min-pooling operation would simply replace the max operators with min and select the minimum pixel value within its respective window. Within the computer vision context, the choice of pooling operation depends on the dataset at hand. Max-pooling selects the brightest pixel within each window, and is useful when the objects of interest are brighter than the background and the ML practitioner wishes to entirely disregard background information. Min-pooling is helpful for the opposite reason, and selects features that are darker than those around them. Average-pooling smooths out the background and object features, and is useful when we desire to keep information from the background. Pooling enables compact feature representation and higher-level feature extraction.
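The average, max, and min pooling operations of Equations 2.4 and 2.5 differ only in how each window is reduced, which the following single-channel sketch makes explicit. The function names `pool2d` and `mean` and the absence of padding are illustrative assumptions.

```python
def pool2d(feature_map, kernel, stride, reduce_fn):
    """Slide a (k_height, k_width) window over one channel and reduce each window.

    `reduce_fn` is max, min, or an averaging function.
    """
    k_h, k_w = kernel
    s_h, s_w = stride
    out_h = (len(feature_map) - k_h) // s_h + 1
    out_w = (len(feature_map[0]) - k_w) // s_w + 1
    out = []
    for h in range(out_h):
        row = []
        for w in range(out_w):
            # Same indexing as Input(C_j, stride[0]*h + m, stride[1]*w + n).
            window = [feature_map[s_h * h + m][s_w * w + n]
                      for m in range(k_h) for n in range(k_w)]
            row.append(reduce_fn(window))
        out.append(row)
    return out


def mean(values):
    """Average of a window, used for average-pooling."""
    return sum(values) / len(values)
```

Calling `pool2d(fm, (2, 2), (2, 2), max)` on a 4 × 4 map halves its resolution in both dimensions; passing `min` or `mean` instead gives min-pooling and average-pooling respectively.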

Pooling functions are ubiquitous in neural network architectures and are found in the ones that we describe later in Section 2.3.6.

Un-Pooling is the approximate inverse of the max-pooling operation. We discuss it here because it is seen in several neural network architectures. The goal of un-pooling is to map a feature map to a higher spatial resolution, which is useful when some encoded low-dimension feature map needs to be up-sampled to the same size as the input image, as seen in segmentation tasks [21]. Un-pooling selects an element in position $p$ of the input feature map, places it in a higher-resolution feature map $z$ at a position within the sub-region selected by the kernel, and sets every other element within the kernel’s sub-region to 0. There exists a non-uniqueness problem with this operation: the selected position in the $z$ feature map is arbitrary within the kernel sub-region. One solution is to pair every un-pooling operation with a prior respective pooling operation, in which the pooling operation saves the pooled values’ indices. These indices are called switches and are denoted as $s$. Then, during the un-pooling process, the switches $s$ specify the positions where the elements taken from $p$ will be placed in $z$. We will see the un-pooling operation used in some of the neural network architectures discussed in this work.
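The pairing of a max-pooling operation with its un-pooling counterpart through switches can be sketched as follows. The function names and the restriction to non-overlapping k × k windows on a single channel are assumptions made to keep the illustration short.

```python
def max_pool_with_switches(feature_map, k):
    """Non-overlapping k x k max-pooling that also records where each maximum came from."""
    out_h, out_w = len(feature_map) // k, len(feature_map[0]) // k
    pooled, switches = [], []
    for h in range(out_h):
        p_row, s_row = [], []
        for w in range(out_w):
            best = None
            for m in range(k):
                for n in range(k):
                    y, x = h * k + m, w * k + n
                    if best is None or feature_map[y][x] > feature_map[best[0]][best[1]]:
                        best = (y, x)
            p_row.append(feature_map[best[0]][best[1]])
            s_row.append(best)  # the switch: position of the maximum in the input
        pooled.append(p_row)
        switches.append(s_row)
    return pooled, switches


def max_unpool(pooled, switches, out_h, out_w):
    """Place every pooled value back at its switch position; all other entries are 0."""
    z = [[0.0] * out_w for _ in range(out_h)]
    for h, row in enumerate(pooled):
        for w, value in enumerate(row):
            y, x = switches[h][w]
            z[y][x] = value
    return z
```

Without the stored switches, `max_unpool` would have to choose a position inside each window arbitrarily, which is the non-uniqueness problem described above.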

2.3.5 Non-Medical Imaging Applications

The first pioneering work in CNNs was LeNet which was designed for handwritten digit

classification without needing image pre-processing [36]. However, due to the lack of

training data and computing power, this design failed to generalize to more complex

problems. With the introduction of large internet datasets and greater compute power, CNNs started gaining traction for more complex imaging applications. CNNs first gained popularity in image classification [4]. Since then, they have made their way to a multitude of other computer vision tasks such as pose estimation [37], object detection [38], object segmentation [39], visual saliency detection [40], action recognition [41], scene labelling [42], and more [43].

Thanks to recent innovations in the deep learning field, and in compute and data resources, deep learning has delivered promising results in medical imaging analysis tasks, examples of which we discuss more thoroughly in Section 2.3.6.

2.3.6 Medical Imaging Applications

Deep learning techniques have been introduced for medical imaging analysis with encouraging results in segmentation, registration, and image enhancement applications. As for segmentation, there have been successful approaches for lungs from chest X-Rays [7], [44], tumours and brain structures [21], [45], and biological cells and membranes [6], each discussed later in this section, as well as for knee cartilage [46], pancreas [47], bone tissue [48], and cell mitosis [49].

Figure 2.3: Encoder-Decoder CNN Architecture for Lung Segmentation [7]

Lung X-Ray Segmentation

The first exploratory stage of research and development on lung segmentation in chest X-Ray images using a purely deep learning method was presented by [7]. The purpose of their study was to examine the ability of deep learning methods and Encoder-Decoder CNNs to segment lung components in chest X-Ray images. Their CNN architecture consists of an encoder and a decoder component. The encoder component’s purpose is to encode the input images into low-resolution feature maps. The feature maps are fed to the decoder network, which maps the low-resolution feature maps to full input resolution feature maps for pixel-wise classification. Their architecture is visualized in Figure 2.3. It consists of Convolutional, Batch Normalization, ReLU, Pooling, UpSampling, and SoftMax layers, a common approach to Encoder-Decoder networks that we’ll see again when discussing U-Net, one of the most popular segmentation networks in medical imaging.

The non-uniqueness index problem arises in the unpooling layers of the decoder. To solve this problem, the authors stored and used the max-pooling indices from the corresponding encoder layer. Two pre-processing techniques are performed to reduce the influence of X-Ray intensity variation: intensity transformation using histogram equalization [50], and Local Contrast Normalization [51]. Their data set consists of 354 chest X-Ray images originating from two different database sources: an online tuberculosis portal [52] and the open Japanese JSRT database [53]. It was hoped that the use of inhomogeneous data sets acquired from different sources would help obtain more objective and conclusive testing results. Training is performed with Stochastic Gradient Descent. When testing, the average accuracy was estimated as a Dice score of 0.962, with minimum and maximum values being 0.926 and 0.974 respectively.

$$\text{Dice} = \frac{|S \cap T|}{|S \cup T|}$$

Where $T$ is the set of pixels within the ground-truth lung area from manual segmentation and $S$ is the set of pixels within the area obtained from automatic segmentation. Unfortunately, given that the authors used a dataset consisting of data from two different sources, there are no prior results to compare their methodology to for this particular mixed dataset. Comparisons to prior non-deep-learning results would have shown more clearly whether their methodology was as promising as they set it out to be. Fortunately, there have been non-deep-learning methods prior to this work that were evaluated on JSRT, one of the source datasets. In particular, [54] proposed a novel unsupervised method for lung segmentation which achieved a Dice score of 0.958. This comparison helps highlight that a deep learning method such as the Encoder-Decoder CNN could achieve promising results, but had not yet achieved state-of-the-art results at the time.

Figure 2.4: Two-Pathway CNN Architecture (TwoPathCNN) for Brain Tumour Segmentation [21]
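As a hedged illustration of the overlap score used in the lung segmentation evaluation above, the sketch below computes the overlap between a predicted and a ground-truth binary mask. Note that the ratio $|S \cap T| / |S \cup T|$ as written is commonly known as the Jaccard index, whereas the Dice coefficient is usually defined as $2|S \cap T| / (|S| + |T|)$. Since the text does not make clear which of the two was computed in [7], the sketch returns both. The function name `overlap_scores` and the nested-list mask format are assumptions.

```python
def overlap_scores(pred_mask, truth_mask):
    """Return (jaccard, dice) for two binary masks given as lists of rows."""
    s = {(y, x) for y, row in enumerate(pred_mask) for x, v in enumerate(row) if v}
    t = {(y, x) for y, row in enumerate(truth_mask) for x, v in enumerate(row) if v}
    intersection = len(s & t)
    union = len(s | t)
    if union == 0:
        # Both masks are empty: treat this as perfect agreement.
        return 1.0, 1.0
    jaccard = intersection / union
    dice = 2.0 * intersection / (len(s) + len(t))
    return jaccard, dice
```

The two scores rank segmentations in the same order, but the Dice value is never lower than the Jaccard value for the same pair of masks, so reported numbers are only comparable when the same definition is used.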

(a) Cascaded architecture, using input concatenation (InputCascadeCNN).

(b) Cascaded architecture, using local pathway concatenation (LocalCascadeCNN).

(c) Cascaded architecture, using pre-output concatenation, which is an architecture with proper-

ties similar to that of learning using a limited number of mean-field inference iterations in a CRF

(MFCascadeCNN).

Figure 2.5: Cascaded TwoPathCNN Architectures

Magnetic Resonance Brain Tumour Segmentation

In [21], a fully automatic, DNN-based brain tumour segmentation method is proposed that is tailored to glioblastomas, an aggressive type of brain tumour, pictured in Magnetic Resonance (MR) images. Glioblastomas can vary wildly in shape, size, contrast, and position in the brain. Their variance in appearance motivates the exploration of an efficient ML solution that exploits DNNs. At the time, CNNs had already been successfully applied to common vision segmentation problems and several showed promise in brain tumour segmentation. A prior promising CNN brain tumour segmentation method in [55] divides the 3-dimensional MR images into 2-dimensional slices and trains a CNN to predict the class of the centre pixel of each input. The methods proposed in [21] expand on and surpass the prior work by using a two-pathway CNN architecture referred to as TwoPathCNN and a framework for cascading CNNs. The TwoPathCNN architecture is made up of two parallel feature-extraction paths and is visualized in Figure 2.4. The pathways consist of one path with smaller 7 × 7 and 3 × 3 receptive fields, referred to as the local pathway, and another with larger 13 × 13 receptive fields, referred to as the global pathway. The authors motivate this design by wanting to take account of both the fine-grained visual details around a given pixel and its larger context. They believe this joint use of pathways would increase segmentation performance because of the additional information available for prediction. The outputs of the two pathways are concatenated with the help of the 3 × 3 receptive field in the local pathway. The resulting concatenated feature map is fed to a prediction layer consisting of a coupled Convolutional 21 × 21 filter and Softmax activation layer. The Convolutional filters in both pathways make use of the Maxout activation function proposed in [56]. Maxout differs from max-pooling in that maxout selects the maximum value at each position over multiple feature maps, as opposed to pooling, which selects the maximum value in a sub-window. Given a set of $K$ feature maps $O$:

$$Z_{s,i,j} = \max\{O_{s,i,j}, O_{s+1,i,j}, \ldots, O_{s+K-1,i,j}\}$$

The result of the maxout operation is the feature map $Z$ where, at each spatial position $i, j$, the value $Z_{s,i,j}$ is the maximum value across the given set of feature maps $O$ at that respective spatial position. It has been shown to be effective at modelling useful features from different feature maps [56], hence allowing the authors to experiment with concatenating or cascading different feature maps together. Training is performed using Stochastic Gradient Descent to maximize the probability of all labels in the training set or, equivalently, to minimize the negative log-probability $-\log p(Y|X) = \sum_{i,j} -\log p(Y_{i,j}|X)$, where $Y$ is the set of predicted pixel labels and $X$ is the input brain slice. To improve parameter optimization, the authors implemented a momentum strategy, a training strategy seen before in [4], which uses temporally averaged gradients to damp the optimization velocity:

$$v_{i+1} = \mu v_i - \alpha \nabla w_i$$

$$w_{i+1} = w_i + v_{i+1}$$

Where wi is the CNN’s parameters at the ith iteration, Owi is the gradient of the loss

function at wi, v is the integrated velocity initialized at zero, α is a learning rate, and

µ is a momentum coefficient. µ is gradually increased during training, starting at 0.5

and increasing to 0.9. Intuitively, momentum allows building velocity in a certain direc-

tion within the parameter search space which helps reduce gradient oscillations, thereby

ultimately improving convergence. L1 and L2 weight regularization, dropout, and early-

stopping are used to prevent overfitting since the authors believed their training dataset

did not contain enough training samples. The work proposes three different cascaded

CNN architectures visualized in Figure 2.5. The three architectures each use some varia-

tion of cascading a feature map extracted from one TwoPathCNN without its prediction

layer, to either the input image as seen in Figure 2.5a, the local pathway’s feature map

as seen in Figure 2.5b or the pre-output concatenation as seen in Figure 2.5c to the sec-

ond TwoPathCNN. The proposed architectures were tested on the BRATS 2013 challenge

dataset [57], comprising of 3 sub-datasets of real patient brain tumour data. The goal is

27

to correctly classify 5 different segmentation labels in each brain slice: non-tumor, necro-

sis, edema, non-enhancing tumour, and enhancing tumour. The training set consists of 30

patients all with pixel-wise accurate segmentations and the test set contains 10 other dif-

ferent patients. Each patient data instance is a 3D volume of MR slices, each available in

four different modalities. The authors decided to work with 2D slices because the MR

volumes do not possess an invariant resolution and the spacing in the third dimension

is inconsistent across volumes. The TwoPathCNN achieved a Dice score of 0.85, and

the cascaded architectures InputCascadeCNN, MFCascadeCNN, and LocalCascadeCNN

achieved scores of 0.88, 0.86, and 0.88 respectively, whereas the prior CNN method in [55]

achieved a score of 0.837. The proposed architectures show how cascading and context

information can help achieve higher segmentation accuracy. However, the added

complexity yields gains of only 1.3 to 4.3 percentage points in Dice score over [55], which leaves one to wonder whether a

different approach could do better.
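The momentum update given earlier in this section can be sketched in a few lines of plain Python. This is only an illustration on a toy quadratic objective, not the training code of [4]: the function names are hypothetical, and the linear ramp of the momentum coefficient from 0.5 to 0.9 is an assumption, since the text only states the start and end values.

```python
def momentum_step(w, v, grad, lr, mu):
    """One momentum update: v <- mu * v - lr * grad, then w <- w + v."""
    v_next = [mu * vi - lr * gi for vi, gi in zip(v, grad)]
    w_next = [wi + vi for wi, vi in zip(w, v_next)]
    return w_next, v_next


def momentum_schedule(step, total_steps, start=0.5, end=0.9):
    """Ramp the momentum coefficient from start to end (linear ramp is an assumption)."""
    frac = min(step / max(total_steps - 1, 1), 1.0)
    return start + frac * (end - start)


def minimise_quadratic(steps=200, lr=0.1):
    """Minimise the toy objective f(w) = 0.5 * sum(w_i^2), whose gradient is w itself."""
    w, v = [4.0, -2.0], [0.0, 0.0]  # velocity is initialised at zero
    for i in range(steps):
        mu = momentum_schedule(i, steps)
        w, v = momentum_step(w, v, grad=w, lr=lr, mu=mu)
    return w
```

Calling `minimise_quadratic()` drives both parameters towards zero, which is the minimum of the toy objective.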

Electron Microscopy Neural Structure Segmentation

One of the most influential models in medical imaging segmentation is the U-Net model

proposed in [6]. Segmentation of Electron Microscopy membranes has been done be-

fore using deep learning in [58]. The previous method in [58] is slow because it uses a

sliding-window to classify patches one-by-one over the input image. Additionally, it has

trouble in maintaining a balance between segmentation accuracy, where a smaller win-

dow allows for higher accuracy, and the amount of context information made available

to the network, where a larger window allows for more information. The U-Net model

makes it possible to achieve good segmentation accuracy, while also making use of as

much context information as it can, and making faster predictions. The U-Net model ranked

first in the Electron Microscopy ISBI 2012 segmentation challenge as of 2015.

Figure 2.6: U-Net CNN Architecture for Electron Microscopy Cell Segmentation (example for 32x32 pixels in the lowest resolution) [6]

The network architecture consists of an encoder component which makes up the

former half of the U shape, a bottleneck component which makes up the bottom of the

U shape, and a decoder component which makes up the latter half of the U shape. A visualization

of its architecture is available in Figure 2.6. The encoder component consists

of 3 × 3 Convolution filters paired with ReLU activation functions, 2 × 2 max-pooling

layers, and copy-and-crop functions. The encoder extracts a feature map from the input

image and down-samples the feature map, reducing its spatial resolution, in the process.

Additionally, after every two consecutive applications of the Convolutional filter

and the ReLU activation, a centre patch is copied and cropped from the current feature

map and fed to the decoder component’s feature map at the same height

in the U shape. The cropping is necessary because border pixels are lost in the encoder’s

Convolutional filters. The authors believed that the decoder would make more accurate

predictions if context information was preserved from the encoder through the use of

the copy-and-crop function. We’ve seen a similar idea in Section 2.3.6 where the authors


in [21] used feature map concatenation from one encoder to another as a means to provide

more context information for prediction. The bottleneck simply applies two more

Convolutional-ReLU operations on the input feature map and up-samples the resulting

feature map to a higher spatial resolution so that it can be fed to the decoder. The

up-sampling operation makes use of the Convolutional-Transpose (up-conv) filter to map its

input to a higher spatial resolution. The decoder component consists of 3 × 3

Convolutional filters paired with ReLU activation functions, 2 × 2 Convolutional-Transpose

(up-conv) filters, and a prediction layer consisting of a 1 × 1 Convolutional filter used to

map each feature vector to the desired number of classes. Training the network

consists of optimizing the Cross Entropy loss over the pixel-wise Softmax on the

final feature map:

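$$\text{Loss} = \sum_{x \in \Sigma} w(x) \log\left(p_{l(x)}(x)\right)$$

where

$$p_k(x) = \text{Softmax}(x)$$

is the pixel-wise Softmax output for class $k$ at pixel $x$, $l(x)$ is the true label of pixel $x$, and $w(x)$ is a per-pixel weight.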
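To make the weighted pixel-wise loss above concrete, here is a minimal sketch in plain Python. It assumes the image is flattened into a list of pixels, that each pixel carries one raw score per class, and that the weighted log-likelihood is negated so that lower values are better; the function names are hypothetical and this is not the U-Net authors' implementation.

```python
import math


def softmax(scores):
    """Numerically stable Softmax over the class scores of a single pixel."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_pixel_loss(score_map, label_map, weight_map):
    """Negated sum over pixels of w(x) * log(p_{l(x)}(x)).

    score_map:  list of per-pixel class scores (one list of floats per pixel)
    label_map:  list of true class indices l(x), one per pixel
    weight_map: list of per-pixel weights w(x)
    """
    total = 0.0
    for scores, label, weight in zip(score_map, label_map, weight_map):
        probabilities = softmax(scores)
        total += weight * math.log(probabilities[label])
    return -total
```

With uniform scores over two classes and unit weights, every pixel contributes `log(2)` to the loss, which gives a quick sanity check.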

The model was applied to three different segmentation tasks, each with minimal train-

ing datasets compared to common computer vision datasets such as ImageNet. Given the

tiny training sets, the authors make use of several data augmentation methods: smooth

deformations using random displacement, and dropout. Their method achieved the low-

est Warping Error of 0.000353 out of all methods in the first EM challenge in 2015, where

the second-best algorithm had a score of 0.000355. It was applied to a cell segmentation

task achieving a 92% Intersection over Union (IoU) score versus the second-best algo-

rithm with a score of 83%. Finally, it was applied to another cell segmentation task and

achieved an IoU score of 77.5% whereas the second-best algorithm scored 46%. We

see that U-Net significantly outperformed its competition in the cell segmentation tasks,

and achieved first place in the neural structure segmentation. This style of architecture

heavily influenced the deep learning and medical imaging fields. Many variations of


the model exist including an Attention U-Net on abdominal datasets for multi-class im-

age segmentation [59], a 3D U-Net for volumetric segmentation [60], a Variational U-Net

for conditional appearance and shape generation [61], and a Multi-Resolution-U-Net for

multimodal biomedical image segmentation [62].

2.4 Dictionary Learning and Pattern Theory

Here we discuss the main theory behind Compositional Models which is the idea of for-

mulating vision as pattern theory [10]. The premise is to describe the world’s visual sig-

nals as patterns that could be modelled, generated and used for inference [63]. To model

these signals, we need to store relevant features and patterns in dictionaries. This dic-

tionary is a key-value pair lookup data structure. Given some key, the dictionary would

look up a value by that key. However, due to the highly complex nature of vision, there

exists an astronomical number of images and objects, so defining a dictionary of all possible

patterns would be impossible. This motivates hierarchies and recursive compositions.

By recursively composing mid- and high-level patterns from lower-level ones,

the dictionary only needs to store elementary patterns which can be shared across

objects. This makes it feasible to learn the dictionary, because

it does not need to contain every possible image pattern, and keeps it compact

thanks to the ability to reuse the components contained within it. We care about

learning this dictionary because it can be used in downstream tasks such as image classi-

fication, and generation. Additionally, by developing a graphical probability structure of

these compositions, we can define a framework to impose geometric constraints so that

patterns make up meaningful object contours and textures. The dictionaries themselves

contain a concatenation of bases, also known as atoms. These bases are elementary image patterns

extracted from a training set of images that together describe the given

set of images. They can be lines, curves and texture patches. The learned dictionaries

could then be used for tasks such as face recognition [64], [65], image classification [66],


Figure 2.7: Hierarchical Compositional Model Representing a Horse [10]

and numerous other image processing applications. So, it’s enticing to learn how to cor-

rectly choose the dictionary to represent the data. There are numerous ways of doing so,

by building it with linear or locally linear structure, or by explicitly optimizing various

measures of how informative the dictionary is via some energy functions [67]. This idea

of dictionary building of bases and compositions of bases forms one of the fundamental

ideas to developing Compositional Models.

2.5 Compositional Models

2.5.1 Hierarchical Compositional Models

In this section, we describe the graph structure, learning algorithm and inference algo-

rithm of Hierarchical Compositional Models (HCMs) as presented by Zhu et al. [10].

Their composition of parts highlights the theme of compositionality that is prevalent in

Compositional Networks. An HCM is a tree data structure consisting of image features

represented within nodes and relationships between nodes modelled by edges. The goal

of an HCM is very similar to what was discussed for Dictionary Learning in Section 2.4. The


goal of an HCM is to capture elementary patterns that describe a set of images, and re-

cursively compose them into increasingly larger and more complex parts as they move

up the hierarchy of parts. These complex parts are representative of meaningful features

and objects that are contained within the set of training images. The trained HCM can

be used for downstream tasks such as object localization, object classification, and ob-

ject generation. It requires less training data than a CNN to be able to learn meaningful

features for object classification, thereby highlighting one of the advantages of generative

models over discriminative ones. We will describe the HCM’s graph structure, state

variables, and probability distributions for geometric constraints. Their mathematical formulations

form the basis for many other Compositional Models seen in the literature and

their idea of compositionality forms the fundamental idea for Compositional Networks.

One particular version of an HCM, which is discussed in this review, is defined by a 6-tuple

$(V, E, \psi, \phi, \lambda^P, \lambda^D)$, which specifies its graph structure’s vertices and edges $(V, E)$, its

feature functions $(\psi, \phi)$ and its parameters $(\lambda^P, \lambda^D)$.

Graph Structure

An HCM’s structure is strongly inspired by the Stochastic Context Free Grammar (SCFG)

structure as seen in the field of Natural Language Processing (NLP) [68]. In NLP, a sen-

tence can be decomposed into intermediate phrase parts and words that make up an

SCFG. The SCFG is a tree structure, where the root node is a sentence, intermediary nodes

are non-terminals (parts of sentence such as verb phrases, noun phrases, etc.), leaf nodes

are terminals (i.e. words extracted from the sentence), and edges between nodes are rules

set by the grammar. These rules are sampled from a probability distribution.

Similarly, an HCM’s structure is defined as a tree structure with a root node $R$, vertices

$V_R$, and edges $E_R$. Within the vision context, the root node represents an object,

intermediary nodes are compositions of parts and contours, and leaf nodes are elementary

tokens extracted from the image. Examples of elementary tokens include lines, corners,

and curves. Edges represent geometric constraints between parts. The edges include


parent-child relations; however, it’s also possible to have child-child constraints (i.e. closed-loop

constraints). An example of a geometric constraint is the distance between

two different tokens represented by the children nodes. Figure 2.7 illustrates an example

of an HCM structure representing a horse. In [10], each node is restricted to having at

most one parent. The children of a node $u$ are denoted $ch(u)$. Leaf nodes $V_R^{leaf}$ are nodes

without any children. The level $l$ of a given node in an HCM is defined as the number

of hops away from a leaf node. So, a leaf node is of level 0, its parent is of level 1, and

so on. This is the general structure; however, there is a notable limitation in the SCFG

structure: it ignores spatial positions and other attributes such as colour, size, etc. Hence,

state variables are introduced to represent these properties.
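The tree structure described above can be sketched as a small Python class. This is an illustrative data structure only, not the representation used in [10]: the class and node names are hypothetical, and for unbalanced trees the sketch assumes that a node's level is measured along its longest path down to a leaf.

```python
class Node:
    """A node of a tree-structured compositional model (illustrative only)."""

    def __init__(self, name, children=None):
        self.name = name
        self.children = list(children or [])
        self.parent = None
        for child in self.children:
            # Each node is restricted to at most one parent.
            if child.parent is not None:
                raise ValueError(f"{child.name} already has a parent")
            child.parent = self

    def is_leaf(self):
        """Leaf nodes are nodes without any children."""
        return not self.children

    def level(self):
        """Number of hops down to a leaf (longest path; an assumption for unbalanced trees)."""
        if self.is_leaf():
            return 0
        return 1 + max(child.level() for child in self.children)


# Hypothetical toy hierarchy: elementary tokens -> parts -> object.
line_1, line_2, curve_1 = Node("line_1"), Node("line_2"), Node("curve_1")
head = Node("head", [line_1, curve_1])
leg = Node("leg", [line_2])
horse = Node("horse", [head, leg])
```

Here the tokens are at level 0, the parts `head` and `leg` at level 1, and the root `horse` at level 2.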

State Variables

Each node $u \in V_R$ has a set of state variables $w_u$. These state variables correspond to

the pose (scale and orientation) or attributes (colour, size) of their parts. All nodes are

required to have the same state variable types. So if a parent’s state variable contains the

pose of a part, its children’s state variables contain more fine-grained pose information of

their parts.

Probability Distributions

The probability distribution over a graph’s state variables is a Gibbs distribution, whose

energy is the sum of the data and prior potential functions. They are used to model the

size and orientation of every node’s feature contents. During learning and inference, these

distributions are optimized so that the HCM best fits the input data. The data potentials,

of the form $\lambda^D_u \phi_u(w_u, I)$, relate the state of node $u$ to the image $I$. The prior potential functions,

of the form $\lambda^P_u \psi_{u,ch(u)}(w_u, w_{ch(u)})$ or $\lambda^P_u \psi_{u,v}(w_u, w_v)$ depending on whether children

are linked or not, impose statistical constraints on the states of nodes in a clique. For example,

two forms of the prior potential functions involve representing $\psi^{AND}_{u,ch(u)}(w_u, w_{ch(u)})$

as an AND-function between children and $\psi^{OR}_{u,ch(u)}(w_u, w_{ch(u)})$ as an OR-function between


children. These are used if the parent node selects different children (OR) or if it selects

all children (AND). Finally, the probability distribution described as a Gibbs distribution

is the following:

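$$P(W_R \mid I) = \frac{1}{Z} \exp\{-E(W_R, I)\} \quad (2.6)$$

$$\text{where } E(W_R, I) = \lambda \cdot \phi(W_R, I) \quad (2.7)$$

$$\text{with } \lambda \cdot \phi(W_R, I) = \sum_{u \in V_R \setminus V_R^{leaf}} \lambda^P_{u,ch(u)} \cdot \psi_{u,ch(u)}(w_u, w_{ch(u)}) + \sum_{v \in V_R} \lambda^D_v \cdot \phi_v(w_v, I) \quad (2.8)$$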

Modelling the features that make up the set of images by using probability distributions

is common in Compositional Models, and we’ll see it used again in Compositional Net-

works in Section 2.5.3.

Recursive Formulation of the Energy

The fundamental equation for HCMs combines recursion and composition in a single

equation and is the following:

$$E_u(W_u, I) = \lambda^P_{u,ch(u)} \cdot \psi_{u,ch(u)}(w_u, w_{ch(u)}) + \lambda^D_u \cdot \phi_u(w_u, I) + \sum_{p \in ch(u)} E_p(W_p, I) \quad (2.9)$$

It shows that the energy $E_u(W_u, I)$ for a subtree with root node $u$ can be computed

recursively in terms of the energies $E_p(W_p, I)$ of its child subtrees for $p \in ch(u)$.

This composition of parts is seen again in Compositional

Networks in Section 2.5.3.
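The recursion in Equation 2.9 and the Gibbs distribution in Equation 2.6 can be illustrated with a short sketch. The potentials below are toy stand-ins chosen for this example (a squared distance between a parent's state and the mean of its children, and a per-node cost); they are hypothetical and are not the feature functions of [10]. Nodes are plain dictionaries, and the prior term is only applied to non-leaf nodes.

```python
import math


def subtree_energy(node, prior, data):
    """Energy of the subtree rooted at node: prior + data + energies of child subtrees."""
    children = node.get("children", [])
    energy = data(node)
    if children:  # leaf nodes carry no prior (parent-child) term
        energy += prior(node, children)
    return energy + sum(subtree_energy(child, prior, data) for child in children)


def gibbs_probabilities(energies):
    """Normalise exp(-E) over a finite set of candidate configurations."""
    weights = [math.exp(-e) for e in energies]
    z = sum(weights)
    return [w / z for w in weights]


def toy_prior(node, children, lam_p=1.0):
    """Hypothetical prior: penalise a parent lying far from the mean of its children."""
    mean_child = sum(c["state"] for c in children) / len(children)
    return lam_p * (node["state"] - mean_child) ** 2


def toy_data(node, lam_d=2.0):
    """Hypothetical data term: a precomputed mismatch cost between node and image."""
    return lam_d * node["cost"]


leaf_a = {"state": 1.0, "cost": 0.5}
leaf_b = {"state": 3.0, "cost": 0.25}
root = {"state": 2.0, "cost": 0.1, "children": [leaf_a, leaf_b]}
```

Evaluating `subtree_energy(root, toy_prior, toy_data)` sums the root's own terms with the energies of its two leaves, and `gibbs_probabilities` assigns a higher probability to lower-energy configurations.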

We’ve discussed the building components of HCMs to highlight the use of composi-

tionality to model visual objects.


2.5.2 Recursive Cortical Networks for Object Classification under Occlusion

Here we describe the Recursive Cortical Network (RCN) from [2], which is

a form of HCM that mimics the human brain’s use of lateral connections

between neurons. The RCN’s application to partially-occluded object classification is a

motivation for developing Compositional Networks for occluded object classification. The

RCN builds upon the ideas of hierarchical composition from HCMs, lateral connections

for selectivity, contour-surface factorization and joint-explanation parsing. The model

separates the representation of contours from surfaces, which enables it to recognize ob-

jects with dramatically different appearances without needing to exhaustively train on

every possible shape and surface combination. The structure consists of nodes and OR

and POOL layers. Each node encodes an AND relation and each layer encodes an OR

function of its nodes’ features. Each pooling layer pools different deformations, transla-

tions and scales. Lateral connections are introduced between pools which provide a con-

straint. Laterally connected layers are affected by the choice of features of one-another.

This creates samples that vary more smoothly. During inference, an input image is passed

through the network: a forward pass performs hypothesis assignments, and a backward

pass selects a Maximum A-Posteriori configuration from the hypotheses. This recovers

the best joint configuration including object classifications and segmentations. As for

learning, contour connectivity features are learned at each level from input images. The

lateral connections are learned from the contour connectivity of input images. Features

at the topmost layer represent whole objects, similar to what we described in Section

2.5.1. The resulting model can beat CAPTCHA with an accuracy of 86.2% while requiring a significantly

smaller number of training images. Given the generative nature of the RCN,

it can achieve high performance for partially-occluded character classification but lacks

the ability to discriminate hundreds of possible classes, which forms the main motivation

towards developing the Compositional Convolutional Neural Network method.


Figure 2.8: Compositional Convolutional Neural Network Architecture [69]

2.5.3 Compositional Convolutional Neural Networks

Here we describe the Compositional Convolutional Neural Network (CompNet), also known

as Compositional Network, which is a novel approach in combining Compositional Models

and Convolutional Neural Networks into one integrated system [11], [69]. The Recursive

Cortical Network and other Compositional Models have been shown to be robust against occlusions

for object detection; however, they lack the discriminative power of neural

networks. Therefore, they lack the ability to classify hundreds or more

different object classes. Kortylewski et al. present a solution in which a CNN is used as a

feature extractor which feeds into a Compositional Model inference head. A visualization

of its architecture is available in Figure 2.8. The novelty in this architecture is the use of a

pre-trained CNN as a feature extractor, which feeds feature maps extracted from an input

image, to a Compositional Model for inference.

vMF Kernels and Class Mixtures

Kortylewski et al. propose a differentiable generative compositional model of the feature

activations $p(F \mid y)$ for some object class $y$, where $F$ is a CNN feature map extracted from

an input image using a pre-trained CNN. To cluster visual features, they need to first be

represented by some probability distribution. CNN feature map dimensionality can reach


a high order such as 512 or 1024, which makes it difficult to use common distributions

such as a normal distribution, which is otherwise commonly used for clustering [70]. In

this work, the von Mises-Fisher (vMF) distribution is used to describe the underlying

distribution of the high-dimensional CNN feature maps extracted from the input images

within our training set. The vMF distribution is suited for high-dimensionality vectors,

which makes it appropriate for representing CNN feature maps [70]. In our case, the CNN

feature maps represent features that are extracted from liver slices by the CompNet’s

backbone feature extractor. The probability density function of the vMF distribution for

some n-dimensional unit vector x is given by:

$$f_n(x; \mu, \sigma) = \frac{e^{\sigma \mu^T x}}{Z_n(\sigma)}$$

where $\sigma \geq 0$, $\|\mu\| = 1$ and the normalization constant $Z_n(\sigma)$ is defined as:

$$Z_n(\sigma) = \frac{(2\pi)^{n/2} I_{n/2-1}(\sigma)}{\sigma^{n/2-1}}$$

where $I_v$ denotes the modified Bessel function of the first kind at order $v$ [71]. $\mu$ and

$\sigma$ are called the mean direction and concentration parameter, respectively. Intuitively,

the greater the value of σ, the higher the concentration of the distribution around the

mean direction µ. The CompNet’s object representations are modelled as mixtures of

vMF distributions:

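$$p(F \mid \theta_y) = \prod_p p(f_p \mid A_{p,y}, \Lambda) \quad (2.10)$$

$$p(f_p \mid A_{p,y}, \Lambda) = \sum_k \alpha_{p,k,y} \, p(f_p \mid \lambda_k) \quad (2.11)$$

$\theta_y = \{A_y, \Lambda\}$ are model parameters where $A_y = \{A_{p,y}\}$ are mixture model parameters at

some position $p \in P$ on the feature map $F$ (purple tensor in Figure 2.8). The feature map

$F$ is extracted by the CompNet’s backbone-CNN. $A_{p,y} = \{\alpha_{p,0,y}, \ldots, \alpha_{p,K,y} \mid \sum_{k=0}^{K} \alpha_{p,k,y} = 1\}$

are the object mixture coefficients, where $K$ is the number of mixtures and $\Lambda = \{\lambda_k = \{\sigma_k, \mu_k\} \mid k = 1, \ldots, K\}$

are vMF distribution parameters:

$$p(f_p \mid \lambda_k) = \frac{e^{\sigma_k \mu_k^T f_p}}{Z(\sigma_k)}, \quad \|f_p\| = 1, \ \|\mu_k\| = 1 \quad (2.12)$$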
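As an illustration of Equations 2.11 and 2.12, the sketch below evaluates a vMF density and a mixture of vMF kernels at a single feature vector. It is a toy implementation in plain Python, not the CompNet code: the function names are hypothetical, the Bessel function is computed from its power series, and the approach is only numerically viable for low dimensions and a strictly positive concentration (512-dimensional features would require working in log space).

```python
import math


def bessel_i(order, x, terms=40):
    """Modified Bessel function of the first kind, from its power series."""
    return sum(
        (x / 2.0) ** (2 * m + order) / (math.factorial(m) * math.gamma(m + order + 1))
        for m in range(terms)
    )


def vmf_normaliser(n, sigma):
    """Z_n(sigma) = (2*pi)^(n/2) * I_{n/2-1}(sigma) / sigma^(n/2-1), for sigma > 0."""
    order = n / 2.0 - 1.0
    return (2 * math.pi) ** (n / 2.0) * bessel_i(order, sigma) / sigma ** order


def vmf_density(f, mu, sigma):
    """vMF density of unit vector f given mean direction mu and concentration sigma."""
    dot = sum(a * b for a, b in zip(f, mu))
    return math.exp(sigma * dot) / vmf_normaliser(len(f), sigma)


def mixture_likelihood(f, alphas, kernels):
    """Weighted sum of vMF densities; kernels is a list of (mu, sigma) pairs."""
    return sum(a * vmf_density(f, mu, sigma) for a, (mu, sigma) in zip(alphas, kernels))
```

In three dimensions the normalization constant reduces to `4 * pi * sinh(sigma) / sigma`, which offers a convenient check of the series-based implementation.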

Numerical estimation of the vMF distribution parameters is non-trivial and non-unique

in high dimensions, since it involves functional inversion of ratios of Bessel functions [70].

Therefore, an iterative Expectation-Maximization-like procedure is used to estimate them

by iterating between vMF clustering of the feature vectors of all training images and max-

imum likelihood parameter estimation until convergence. After cluster training, the vMF

cluster centres µk will represent feature activation patterns that frequently occur in the

training data. Figure 2.9 shows visualizations of vMF kernels representing common ob-

jects from the training dataset used in [69]. Note how features of similar appearance and

that share semantic meaning are separated into different clusters. We see further exam-

ples of vMF kernel visualizations within the medical imaging context later on in Section

4 in Figures 4.1 and 4.2. The mixture coefficients αp,k,y can also be learned with maximum

likelihood estimation. They describe the expected activation of a cluster centre $\mu_k$ at a position

$p$ within the 2D lattice in a feature map $F$ for an object class $y$. The object classes are

known a priori: they are supervised labels given by a human annotator. Given that an object

may have different poses, 3D objects are represented with a generalized model using

mixtures of compositional models:

$$p(F \mid \Theta_y) = \sum_m v_m \, p(F \mid \theta^m_y) \quad (2.13)$$

with $V = \{v_m \in \{0, 1\}, \sum_m v_m = 1\}$ and $\Theta_y = \{\theta^m_y, m = 1, \ldots, M\}$. $M$ is the number

of mixture models, and $v_m$ is a binary assignment variable indicating which mixture is active.

The mixture components are learned by iterating between estimating the assignment

variables V and maximum likelihood estimation of the components. Figure 2.10 shows


Figure 2.9: Illustration of vMF kernels by visualizing image patterns from the training set

that activate a given vMF kernel the most. Note how image patterns that are of similar

appearance and share semantic meaning are separated into different kernels [69].

visualizations of mixture models learned for different object classes and vMF kernels.

Note how different 3D viewpoints of a given object are separated into different mixture

models.


Figure 2.10: Visualization of learned mixture models (4 mixtures). Each row is for a dif-

ferent object class (car, train, boat, or bus) and each column represents a different mixture

for the object class. Note how different 3D viewpoints are approximately separated into

different mixtures. [69]

Occlusion Modeling

At each position $p$ in the image, either some object model $p(f_p \mid A^m_{p,y}, \Lambda)$ or an occluder

model $p(f_p \mid \beta, \Lambda)$ is active:

$$p(F \mid \theta^m_y, \beta) = \prod_p p(f_p, z^m_p = 0)^{1 - z^m_p} \, p(f_p, z^m_p = 1)^{z^m_p} \quad (2.14)$$

$$p(f_p, z^m_p = 1) = p(f_p \mid \beta, \Lambda) \, p(z^m_p = 1) \quad (2.15)$$

$$p(f_p, z^m_p = 0) = p(f_p \mid A^m_{p,y}, \Lambda) \, (1 - p(z^m_p = 1)) \quad (2.16)$$

The binary variables $\{z^m_p \in \{0, 1\} \mid p \in P\}$ indicate if the object is occluded at the position

p for some mixture component m. Figure 2.11 shows examples of occlusion localization

results. Note how different occluders, either real or artificial occlusions, are accurately

localized by the CompNet. The occlusion prior p(zmp = 1) is fixed a-priori. Multiple


occluder models are learned in an unsupervised manner:

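$$p(f_p \mid \beta, \Lambda) = \prod_n p(f_p \mid \beta_n, \Lambda)^{\tau_n} \quad (2.17)$$

$$= \prod_n \Big( \sum_k \beta_{n,k} \, p(f_p \mid \sigma_k, \mu_k) \Big)^{\tau_n} \quad (2.18)$$

where $\tau_n$ indicates which occluder model explains the data best. The parameters of the

occluder models $\beta_n$ are learned from clustered features of random natural images that do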

not contain any object of interest. In this work, we forgo the use of an occluder model

and use a thresholding solution instead. This solution applies a threshold

to the object estimation map, such that any pixel whose value falls below the threshold

is deemed occluded. We explain this solution in further detail in Chapter 3.
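The thresholding idea can be sketched as follows. This is a minimal illustration under the assumption that the object estimation map is a 2D grid of scores and that a single global threshold is used; the function names and the example values are hypothetical, and the actual procedure is the one described in the methodology chapter.

```python
def occlusion_mask(object_map, threshold):
    """Mark a pixel as occluded (True) when its object score falls below the threshold."""
    return [[score < threshold for score in row] for row in object_map]


def occluded_fraction(mask):
    """Fraction of pixels flagged as occluded; a simple summary of the mask."""
    flat = [flag for row in mask for flag in row]
    return sum(flat) / len(flat)


# Hypothetical 2 x 3 object estimation map with low scores in the right-hand column.
example_map = [
    [0.9, 0.8, 0.1],
    [0.7, 0.6, 0.2],
]
example_mask = occlusion_mask(example_map, threshold=0.5)
```

In this toy map only the right-hand column falls below the threshold, so one third of the pixels are flagged.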

Feed-Forward Inference

The backbone CNN is used as a feature extractor to extract a feature map

$F = \phi(I, \omega) \in \mathbb{R}^{H \times W \times D}$ from some input image $I$, where $\omega$ is the set of parameters of the feature extractor.

The vMF likelihood function $L = p(f_p \mid \lambda_k)$ (yellow tensor in Figure 2.8) is computed

from an inner product, equivalent to a 1 × 1 Convolution, between the feature map $F$ and

the cluster centres $\mu_k$:

$$L = \{N(F \ast \mu_k) \mid k = 1, \ldots, K\} \in \mathbb{R}^{H \times W \times K} \quad (2.19)$$

where $N(\cdot) = \exp(\sigma_k \mu_k^T f_p) / Z(\sigma_k)$ is a non-linear transformation that yields the vMF likelihood of Equation 2.12.

The likelihood map $L$ describes which cluster centres $\mu_k$ “activate” or “represent”

each 2D lattice patch $p$ within the input image’s extracted feature map. Every

channel, within the resulting feature map L, corresponds to an activation map for a re-

spective vMF kernel. A high activation within a set of pixels represents a part of an object

being identified. All the activation maps, or channels, which represent different object

parts are used together for the rest of the inference process. The idea of compositionality


Figure 2.11: Occlusion localization results from [69]. Each result consists of three images:

The input image, the occlusion scores of a dictionary-based compositional model from a

prior work by Kortylewski et al. [72] and the occlusion scores of the proposed CompNet

[69]. Note how the CompNet can localize occluders with high accuracy across different

objects and occluder types for real as well as for artificial occlusions.


is highlighted here: different parts of objects learned during vMF clustering are selected

and combined to describe data during feed-forward inference at test time. The

mixture likelihoods (blue planes in Figure 2.8) are computed at every position p as the

dot-product between the mixture coefficients and the corresponding vector $l_p \in \mathbb{R}^K$ from

the likelihood tensor $L$:

$$E^m_y = \{l_p^T A^m_{p,y} \mid \forall p \in P\} \in \mathbb{R}^{H \times W} \quad (2.20)$$

Similarly, the occlusion likelihood (red planes in Figure 2.8) is computed as:

$$O = \{\max_n l_p^T \beta_n \mid \forall p \in P\} \in \mathbb{R}^{H \times W} \quad (2.21)$$

The occlusion likelihood $O$ and mixture likelihoods $E^m_y$ are used together to estimate the

overall likelihood of the individual mixtures as $s^m_y = p(F \mid \theta^m_y, \beta) = \sum_p \max(E^m_{p,y}, O_p)$. The

final likelihood is computed as $s_y = \max_m s^m_y$. The resulting occlusion map is defined as

$Z_y = Z^{\bar{m}}_y \in \mathbb{R}^{H \times W}$, where $Z^{\bar{m}}_y$ marks the pixels at which an object of class $y$ is occluded

and $\bar{m} = \arg\max_m s^m_y$ is the index of the mixture with the highest score $s^m_y$. As

demonstrated in Figure 2.11, we can see how different occlusions are accurately localized.
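The feed-forward steps above (Equations 2.19 to 2.21 and the mixture scores) can be sketched in plain Python for a handful of positions. This is an illustration only, not the CompNet implementation: positions are given as a flat list rather than an H × W grid, each kernel is a hypothetical `(mu, sigma, z)` triple whose normalization constant `z` is supplied by the caller, and a position is flagged as occluded when the occluder likelihood strictly exceeds the object likelihood.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def vmf_activations(f, kernels):
    """l_p: one activation exp(sigma * mu^T f) / z per vMF kernel."""
    return [math.exp(sigma * dot(mu, f)) / z for mu, sigma, z in kernels]


def score_mixture(features, kernels, mixture_coeffs, occluder_coeffs):
    """Return (score, occlusion flags) for one mixture, with score = sum_p max(E_p, O_p)."""
    score, occluded = 0.0, []
    for f, alpha in zip(features, mixture_coeffs):
        l_p = vmf_activations(f, kernels)
        e_p = dot(l_p, alpha)                                 # object likelihood at p
        o_p = max(dot(l_p, beta) for beta in occluder_coeffs)  # best occluder at p
        score += max(e_p, o_p)
        occluded.append(o_p > e_p)
    return score, occluded


def best_mixture(features, kernels, mixtures, occluder_coeffs):
    """Select the mixture with the highest score and return its index, score and flags."""
    results = [score_mixture(features, kernels, m, occluder_coeffs) for m in mixtures]
    best = max(range(len(results)), key=lambda i: results[i][0])
    return best, results[best][0], results[best][1]
```

For example, with two orthogonal kernels, a mixture that expects the first kernel everywhere, and an occluder that expects the second kernel, a position whose feature matches the second kernel is flagged as occluded.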

End-to-End Parameter Optimization

The trainable parameters of the CompNet are $T = \{\omega, \Lambda, A_y\}$. Recall how $\omega$ is the set of

weight parameters of the feature extractor, $\Lambda$ is the set of vMF distribution parameters,

and $A_y$ is the set of mixture model parameters. The parameters are optimized via Stochastic

Gradient Descent and backpropagation. The loss function is defined as:

$$\mathcal{L}(y, y', F, T) = \mathcal{L}(y, y') + \gamma_1 \mathcal{L}_{weight}(\omega) + \gamma_2 \mathcal{L}_{vmf}(F, \Lambda) + \gamma_3 \mathcal{L}_{mix}(F, A_y) \quad (2.22)$$

$\mathcal{L}(y, y')$ is the cross-entropy loss between the CompNet’s estimates $y'$ and the ground truth

label $y$. $\mathcal{L}_{weight} = \|\omega\|_2^2$ is a weight regularization loss for the CNN backbone. $\mathcal{L}_{vmf}$ and

$\mathcal{L}_{mix}$ regularize the CompNet’s parameters to have maximal likelihood for the features in

$F$. More specifically:

$$\mathcal{L}_{vmf} = -\sum_p \max_k \log p(f_p \mid \mu_k) \quad (2.23)$$

$$\mathcal{L}_{mix} = -\sum_p (1 - z^{\uparrow}_p) \log\Big[\sum_k \alpha^{m^{\uparrow}}_{p,k,y} \, p(f_p \mid \lambda_k)\Big] \quad (2.24)$$

where $z^{\uparrow}_p$ and $m^{\uparrow}$ denote the respective occlusion patch and object mixture variables inferred

in the forward process. $\{\gamma_1, \gamma_2, \gamma_3\}$ control the trade-off between the losses.
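The regularization terms of the loss can be sketched as below. This is a toy illustration rather than the CompNet training code: the function names are hypothetical, likelihoods are passed in as precomputed per-position lists, and the mixture term is read as masking out occluded positions, i.e. each position contributes `(1 - z_p) * log(...)`, which is an interpretation of Equation 2.24.

```python
import math


def vmf_loss(log_likelihoods):
    """Negated sum over positions of the best kernel's log-likelihood.

    log_likelihoods: one list of K values log p(f_p | mu_k) per position p.
    """
    return -sum(max(row) for row in log_likelihoods)


def mix_loss(likelihoods, alphas, occluded):
    """Negated sum over non-occluded positions of log(sum_k alpha_{p,k} * p(f_p | lambda_k)).

    likelihoods: one list of K kernel likelihoods per position
    alphas:      one list of K mixture coefficients per position
    occluded:    one boolean per position (True means the position is skipped)
    """
    total = 0.0
    for row, alpha, z in zip(likelihoods, alphas, occluded):
        if not z:
            total += math.log(sum(a * l for a, l in zip(alpha, row)))
    return -total


def total_loss(class_loss, weight_norm_sq, l_vmf, l_mix, gammas):
    """Weighted sum of the classification loss and the three regularizers."""
    g1, g2, g3 = gammas
    return class_loss + g1 * weight_norm_sq + g2 * l_vmf + g3 * l_mix
```

The three coefficients in `gammas` play the role of the trade-off weights, so setting one of them to zero switches the corresponding regularizer off.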


Chapter 3

Methodology

3.1 Datasets

In this work, we rely on Computed Tomography (CT) scans of healthy abdominal organs

coupled with liver segmentation maps from the online CHAOS dataset’s training set [73]

to train and test our CompNet across experiments. The CHAOS dataset consists of 40

different patients. Each patient was injected with a contrast agent to help increase the

contrast of fluids and structures in the images captured from them. The CT images were

acquired from the upper abdomen area of every patient during the portal venous phase

of the contrast agent injection. This phase is obtained 70 to 80 seconds after the contrast

agent injection. During this phase, the liver tissue and its blood vessel characteristics are

enhanced maximally through blood supply of the portal vein. This phase is widely used

for liver and vessel segmentation in medical imaging analysis prior to surgery. Three different

CT scanners are used: Philips SecuraCT with 16 detectors, Philips Mx8000CT with

64 detectors and Toshiba AquilionOne with 320 detectors. The CHAOS dataset consists

of 20 CT abdominal volumes for training and 20 CT abdominal volumes for testing; each

volume represents a different patient. Each volume consists of 512 × 512 16-bit DICOM

images, with an x-y spacing between 0.7 and 0.8 mm and an inter-slice distance (ISD) of 3 to 3.2 mm.

There is an average of 90 slices per patient, with the minimum number of slices being 77


and the maximum being 105. In total, there are 1367 slices for training and 1408 slices for

testing. Every slice within the training set also comes with a corresponding 512 × 512 segmentation

map for the liver, whereas the test set’s slices do not. This isn’t a concern because we only

make use of the training set.

We also make use of the online Liver Tumour Segmentation (LiTS) challenge dataset

[74]. The LiTS dataset consists of CT volumes from 201 different patients, stored as 32-bit

NIfTI files. It is split into a training set of 131 CT abdominal scan

volumes with respective segmentation maps for liver and tumour tissue and a test set

of 70 CT abdominal scan volumes. The dataset contains instances of tumours

such as Hepatocellular Carcinoma (HCC). The tumours have varying contrast enhance-

ments, such as hyper or hypo-dense contrast. Some images also contain imaging artifacts,

such as metal artifacts, which are present in real life clinical data. The image data was ac-

quired with different CT scanners and acquisition protocols, and is diverse with respect

to resolution and image quality. Image resolution ranges from 0.56 mm to 1.0 mm in the axial

plane and from 0.45 mm to 6.0 mm in the z direction. The number of slices in the z direction

ranges from 42 to 1026. The LiTS dataset is used to train a U-Net for liver and tumour

segmentation and to test the CompNet’s ability in tumour localization. The trained U-

Net’s encoder network is then used as our CompNet’s backbone feature extractor. The

U-Net architecture used is identical to that in Section 2.3.6 as illustrated in Figure 2.6. We

also only make use of the LiTS training set and disregard its test set.

The final dataset we make use of is from the McGill University Health Centre (MUHC)

Picture Archiving and Communication System (PACS). For ease of reading, we call it the

MUHC PACS dataset. It consists of 25 different CT abdominal volumes stored as NIfTI

files, each from a different patient. Every patient has some form of a liver tumour.


3.2 Methods

We make use of the CompNet method for tumour localization. The implementation is

made available in the Python 3.6 programming language and uses the PyTorch Python

software package. Additionally, we make use of a pre-trained U-Net’s encoder compo-

nent as the CompNet’s backbone feature extractor. The U-Net is pre-trained to perform

liver and tumour segmentation on the LiTS dataset; we go into further detail about how

we trained the U-Net in Section 3.2.2.

For CompNet training and testing, we segment the livers in the CHAOS, LiTS and MUHC PACS

slices to remove the interference from other organs within

every CT slice. For the CHAOS and LiTS slices, we use the liver segmentation maps provided with the datasets. As

for the MUHC PACS liver slices, we manually segment the livers using the GIMP image

manipulation software. Additionally, we want to simplify our problem to localizing tumours

on livers of a similar scale, spatial position, and rotation. The livers in the CHAOS

dataset can present themselves in various sizes and positions. Constraining this

variance simplifies the CompNet’s training and testing phases, limits the difficulty

of learning a liver representation, and avoids

the need for a large amount of training data to capture a large variance in liver shape

and position. We therefore chose to perform linear affine registration on the CHAOS training

volumes, which we describe in further detail in Section 3.2.1.

3.2.1 Image Registration

For image registration, we chose to perform a rigid-body linear registration with 4
degrees of freedom, covering the translation, rotation, and uniform scaling
transformations. We chose rigid-body registration because we wanted the simplest
registration process that could align our liver slices to a similar position, scale and
orientation, so that the livers within our training set share a common size, position,
and rotation.

Linear registration is the process of applying a linear function to a source image to
map it to a target domain. This involves translation, rotation and uniform scaling
transformations, which can be expressed as a multiplication between the source image
coordinates and a matrix representing the amount of rotation, scaling and translation.
This matrix is called the registration matrix.

To perform linear affine registration, we first need to compute the nearest neighbour.
The nearest neighbour is the source volume that is the most similar to all other volumes
within a given training set. The notion of similarity is expressed as the amount of work
needed to map one image to another domain, so the nearest neighbour is the volume whose
average work to perform a linear mapping to every other volume is the lowest amongst
them all. It is chosen as the target volume because it offers the most similar spatial
coordinates, rotation, and scale for all the volumes in the dataset. Solving for the
nearest neighbour involves computing the linear affine registration matrix for every
volume pair, computing every registration matrix’s determinant, and selecting the volume
whose average determinant over all other volumes is the lowest.

Linear affine registration is then performed by registering every CT volume to the
spatial coordinates, rotation, and scale of that nearest neighbour. We used the FLIRT
software package for computing registration matrices and performing image registration
[75]. Computing the registration matrices involves running the fsl_reg script on the
training set of volumes. This script can be viewed in Appendix Section 6.1. Once the
nearest neighbour is solved, the command fsl_reg source target output -a -flirt "-out output"
is run for every source-to-nearest-neighbour pair, with target being the nearest
neighbour volume.
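
The nearest-neighbour selection described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the script from the appendix: `pairwise_cost` is a hypothetical table holding one scalar cost per ordered (source, target) pair, which in this work would be derived from the determinant of the pairwise registration matrix, and the volume names and values are invented.

```python
def select_nearest_neighbour(pairwise_cost):
    """Return the volume whose average cost from all other volumes is lowest.

    pairwise_cost: dict mapping (source, target) -> scalar cost
    (hypothetical; e.g. a determinant-based score per registration matrix).
    """
    volumes = sorted({name for pair in pairwise_cost for name in pair})
    average_cost = {}
    for candidate in volumes:
        # Cost of registering every other volume onto this candidate target.
        costs = [pairwise_cost[(other, candidate)]
                 for other in volumes if other != candidate]
        average_cost[candidate] = sum(costs) / len(costs)
    best = min(volumes, key=lambda name: average_cost[name])
    return best, average_cost


# Toy example with three invented volumes.
toy_costs = {
    ("vol_a", "vol_b"): 1.10, ("vol_b", "vol_a"): 1.30,
    ("vol_a", "vol_c"): 1.60, ("vol_c", "vol_a"): 1.40,
    ("vol_b", "vol_c"): 1.50, ("vol_c", "vol_b"): 1.05,
}
target, averages = select_nearest_neighbour(toy_costs)
print(target, averages)
```

The selected volume would then serve as the `target` argument of the registration command for every remaining volume.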


3.2.2 Pre-Trained Feature Extractor

The original CompNet’s feature extractor uses a pre-trained VGG16 model. The VGG16
feature extractor is pre-trained on ImageNet, a dataset consisting of thousands of
different instances of common objects [76]. This feature extractor isn’t well tailored to
extracting features from liver slices because its source task dataset differs greatly
from the target task. Therefore, we trained a U-Net model to segment livers and tumours
on the online Liver Tumour Segmentation (LiTS) challenge dataset [74]. The weights of the
U-Net’s encoder and decoder layers are randomly initialized. We use 80% of the LiTS
training set for training and the remaining 20% for validation. The training parameters
are a learning rate of 0.001, a batch size of 4, and 50 training epochs. The optimizer is
Adam, and a Dice loss is used to measure performance. These parameters empirically gave
the highest validation score, 0.9803. The training loss and validation accuracy across
epochs are plotted in Figure 3.1.

(a) U-Net training loss (b) U-Net validation accuracy

Figure 3.1: U-Net training loss and validation accuracy plots
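
For reference, a common form of the Dice coefficient and the corresponding loss is given below. This is the textbook definition, stated as an assumption: the exact variant used for training (for example, soft versus binarized predictions, or any smoothing term) isn't specified in this section. P denotes the predicted segmentation and G the ground truth.

```latex
\[
\mathrm{Dice}(P, G) = \frac{2\,|P \cap G|}{|P| + |G|},
\qquad
\mathcal{L}_{\mathrm{Dice}}(P, G) = 1 - \mathrm{Dice}(P, G)
\]
```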

For feature extraction, we use the pre-trained U-Net’s encoder component as the
CompNet’s backbone. The encoder consists of 5 levels, including the bottleneck region.
We sample a feature map from the 3rd level, counting from the top, which has 256 feature
channels. Up to and including the 3rd level of the U-Net, there are a total of six 3 × 3
convolutions, each followed by a ReLU activation function, and 2 max-pooling operations.
Together these reduce the input image to approximately 0.238× its original dimension.
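
The reduction factor can be checked with a short calculation. The sketch below assumes unpadded 3 × 3 convolutions, 2 × 2 max-pooling with stride 2, and a 572-pixel input side, as in the original U-Net design; none of these three values is stated in this section, but together they reproduce the reported ratio. The function name and layer encoding are illustrative only.

```python
def trace_encoder(input_size, layers):
    """Track spatial size and receptive field through a stack of layers.

    layers: list of (kernel, stride) tuples; no padding is applied.
    """
    size, receptive_field, jump = input_size, 1, 1
    for kernel, stride in layers:
        size = (size - kernel) // stride + 1
        receptive_field += (kernel - 1) * jump
        jump *= stride
    return size, receptive_field


CONV = (3, 1)   # 3 x 3 convolution, stride 1, no padding (assumption)
POOL = (2, 2)   # 2 x 2 max-pooling, stride 2 (assumption)
# Levels 1 to 3 of the encoder: two convolutions per level, pooling between levels.
first_three_levels = [CONV, CONV, POOL, CONV, CONV, POOL, CONV, CONV]

size, receptive_field = trace_encoder(572, first_three_levels)
ratio = size / 572
print(size, receptive_field, ratio)
```

Under these assumptions the same loop also gives the receptive field of one feature vector at the 3rd level, which is the quantity that sets the size of the cluster centre patches shown in Chapter 4.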

3.2.3 CompNet Training

Training the CompNet usually requires 3 different parts, but we only make use of one.
In this work, we don’t perform parameter optimization using Stochastic Gradient Descent
(SGD) because we aren’t interested in optimizing the CompNet’s classification
performance. We also don’t train an occlusion model, since we treat any patch p that’s
not recognized to be part of a liver as an occlusion. Instead, we simply threshold object
likelihoods, which we discuss in further detail in Chapter 4. Therefore, for training,
we’re only concerned with vMF kernel and object mixture model training.

We train the vMF kernels by performing vMF Expectation-Maximization clustering on
feature maps extracted from our training dataset. The feature cluster probabilities are
then used as weights within the Cluster Activation Convolutional 1 × 1 filter, as seen in
Equation 2.5.3, to produce the object likelihood tensor. The object mixture model
training is performed via Expectation-Maximization iteration, but we constrained our
experiments to using 1 mixture model. The intuition behind using more than 1 mixture
model in [11] is to capture the multiple viewpoints, angles, or spatial differences, such
as rotation and scale, under which an object can present itself within a given training
dataset. Given that the top-down viewpoint is the same for every liver instance in the
training set, and that every volume has been linearly registered to have a similar scale,
rotation and spatial coordinates, we only required 1 mixture model.

We used the following parameters for training the CompNet: 100 vMF cluster centres, a
kappa of 55 for vMF clustering, and a maximum of 2000 features sampled from every feature
map extracted from every image. We saw empirically that a kappa of 55 results in cluster
centres that allow for a slightly higher variance in patch appearance, and that 100 vMF
cluster centres allow for enough variance in cluster centre appearance. We determined
that these parameters yielded the highest-performing occlusion maps. Since we only want
to learn a representation for one object, the liver, we constrained the number of
classes trained to 1.
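
To make the clustering step more concrete, the following is a simplified sketch of one Expectation-Maximization round for vMF clustering with a fixed, shared kappa. It is an illustration under assumptions rather than the CompNet implementation: equal mixing priors are assumed so that the vMF normalisation constant cancels, the function names are hypothetical, and the two-dimensional vectors are invented.

```python
import math


def normalise(vector):
    """Scale a vector to unit length, as vMF clustering assumes."""
    norm = math.sqrt(sum(v * v for v in vector))
    return [v / norm for v in vector]


def vmf_responsibilities(feature, centres, kappa):
    """E-step: soft-assign one feature vector to the cluster centres.

    Each cluster scores exp(kappa * <centre, feature>); with a shared kappa
    and equal priors the normalisation constant cancels out.
    """
    f = normalise(feature)
    logits = [kappa * sum(c * x for c, x in zip(normalise(centre), f))
              for centre in centres]
    peak = max(logits)                      # subtract the max for stability
    weights = [math.exp(value - peak) for value in logits]
    total = sum(weights)
    return [w / total for w in weights]


def update_centre(features, responsibilities):
    """M-step for one cluster: normalised responsibility-weighted mean."""
    dims = len(features[0])
    unit = [normalise(f) for f in features]
    mean = [sum(r * u[d] for u, r in zip(unit, responsibilities))
            for d in range(dims)]
    return normalise(mean)


centres = [[1.0, 0.0], [0.0, 1.0]]          # two invented 2-D cluster centres
resp = vmf_responsibilities([0.9, 0.1], centres, kappa=55)
new_centre = update_centre([[1.0, 0.0], [0.0, 1.0]], [1.0, 0.0])
print(resp, new_centre)
```

With kappa fixed at 55, even a modest difference in cosine similarity produces a near one-hot assignment, which is why the choice of kappa interacts strongly with the number of cluster centres.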

3.2.4 CompNet Testing

Liver Representation

We assess the CompNet’s ability to learn an accurate representation of the liver images
in the training set by visually inspecting the cluster centre patches learned from vMF
clustering, and by visually inspecting the cluster centre activation maps that result
from the 1 × 1 convolution of the cluster centres with a given input image.

Tumour Occlusions

We test the CompNet’s ability to localize tumours by visually inspecting the occlusion
maps generated from the feed-forward pass of the test images from three different test
sets.

The first test set consists of segmented and registered liver slices held out from the
training set, each with some manually added occlusion such as a tumour from a sick liver,
a noise patch, or a common object of our choosing. The common object occlusions are
either a car or a monkey face, which were segmented from images found online. The noise
patches are of a colour similar to the surrounding liver tissue. We chose this because we
wanted to pose a challenge for our CompNet and see whether it can still extract occluders
that are of a very similar colour. Their size is selected to be large enough to cover a
significant portion of either the bottom-left or the top-right portions of the liver. We
chose these two areas because we believe they are the parts of the liver with the largest
amount of variance in shape amongst the training set. The manually added tumour occluders
are also from the MUHC PACS dataset, and their respective liver slices are from patients
001 and 439.

The second test set consists of segmented liver slices with real tumours sampled from
the LiTS training set.

The third test set consists of manually segmented liver slices with real tumours sampled
from the MUHC PACS dataset. We only select 1 or 2 slices from 3 different volumes whose
livers are most representative, in terms of shape and texture, of the livers from the
MUHC PACS dataset. More specifically, we select slices #14 and #50 from patient #0, #90
from patient #439, and #126 from patient #29. We chose those patients because their
livers were the most representative of those in the CHAOS dataset amongst all the MUHC
PACS patients, and we only chose 1 or 2 slices because we believed this was sufficient to
test for real tumour localization. We also use these test slices to extract real tumours,
which we manually add to segmented livers in the held-out test portion of the CHAOS
training set. That way, we can attempt to localize real tumours on livers that are
representative of the training set. We use the GNU Image Manipulation Program (GIMP) [77]
to manually segment the livers from the private dataset, and to extract the tumours and
add them to the held-out CHAOS livers.


Chapter 4

Results

We visualize several of the CompNet’s trained cluster centre patches in Figures 4.1,
4.2, and 4.3. The size of a cluster centre patch is determined by the receptive field
size of the layer from which the feature map was extracted. Each set of patches, such as
Figure 4.1a, represents one specific cluster, and the patches within each set are the
most representative patches for that cluster centre. Those patches map to feature vectors
that are sampled and clustered during vMF cluster initialization. Overall, these figures
highlight the CompNet’s ability to extract meaningful features from the liver, and to
successfully cluster the extracted features based on similarity.

We display the vMF cluster centre activations in Figure 4.4. These activations are
sampled from the feature map channels that result from the 1 × 1 vMF cluster centre
convolution. Each image represents the activation for a different cluster. The caption of
Figure 4.4 explains the intensity values; overall, we can see that different parts of the
liver are activated depending on the cluster. This highlights the CompNet’s ability to
learn different parts of the liver and to distinguish between them during inference.

We show the CompNet’s occlusion generation on synthetic occluders in Figure 4.5, on real
tumours from the LiTS training set in Figure 4.6, and on real tumours from the MUHC PACS
dataset in Figure 4.7. In the synthetic occluder figure, we can see that the CompNet can
localize common objects and synthetic patches with high accuracy. For the synthetically
added tumours, the CompNet can localize parts of the tumours but fails to localize their
entire content, hence a high number of false negatives. For the real tumour slices in
Figures 4.6 and 4.7, we see similar behaviour: the CompNet can localize some parts of the
tumour border, but fails to localize the entire contents of the tumours. We also see a
significant number of false positives in every occlusion map, where pixels are classified
as tumours when they actually aren’t. We empirically discovered that a threshold of value
21 (a non-unit value) over the occlusion scores yielded the best occlusion map results.

We discuss the significance of our results in greater detail in Chapter 5.


(a) Cluster Center 1 (b) Cluster Center 2 (c) Cluster Center 3

(d) Cluster Center 4 (e) Cluster Center 5 (f) Cluster Center 6

Figure 4.1: First 6 CompNet cluster center patch visualizations. Every set of 16 patches
represents one cluster center whose patches are most representative for that given
cluster center.




Figure 4.2: Second 6 CompNet cluster center patch visualizations. Every set of 16 patches
represents one cluster center whose patches are most representative for that given
cluster center.




Figure 4.3: Third 6 CompNet cluster center patch visualizations. Every set of 16 patches
represents one cluster center whose patches are most representative for that given
cluster center.


Figure 4.4: vMF cluster activations. The image in the top left is the input image. Purple

pixels represent no activation, blue pixels represent a low activation, green pixels repre-

sent moderate activation, and yellow pixels represent high activation.


(a) Test 1 (b) Test 2 (c) Test 3

(d) Test 4 (e) Test 5 (f) Test 6

(g) Test 7 (h) Test 8 (i) Test 9

Figure 4.5: Occlusion generation on synthetic occluders. The left column consists of input

images with some synthetic occlusion and the right column consists of occlusion maps

where purple pixels represent no occlusion, orange pixels represent a low scoring of oc-

clusion, green pixels represent a medium scoring of occlusion, and blue pixels represent

a high scoring of occlusion.



(d) Test 4 (e) Test 5 (f) Test 6

Figure 4.6: Occlusion generation on real tumors. Every sub-figure’s left column consists

of input images with some real tumour, the middle column consists of occlusion maps,

and the right column consists of the ground truth liver and tumor segmentation maps.

For the occlusion maps, the purple pixels represent no occlusion, blue pixels represent

a low scoring of occlusion, green pixels represent a medium scoring of occlusion, and

yellow pixels represent a high scoring of occlusion. For the ground truth segmentation

maps, black, grey and white pixels represent the background, liver tissue, and tumor

tissue classes respectively.



(d) Test 4 (e) Test 5

Figure 4.7: Occlusion generation on manually segmented real tumors from the MUHC

PACS dataset. Every sub-figure’s left column consists of input images with some real

tumours and every sub-figure’s right column consists of occlusion maps where purple

pixels represent no occlusion, orange pixels represent a low scoring of occlusion, green

pixels represent a medium scoring of occlusion, and blue pixels represent a high scoring

of occlusion.


Chapter 5

Discussion

5.1 Cluster Centre Patch Visualizations

The cluster centre patch visualizations represent patches of the liver extracted from the

training set that best explain that given cluster. The idea of compositionality in Compo-

sitional Networks is highlighted through the visualizations and use of the cluster centres

in Figures 4.1 and 4.2. We can see how different parts of the liver have been learned dur-

ing training, and how they are composed together via cluster activation during inference.

This highlights the idea of compositionality: making compositional use of parts of objects

to create a data-efficient and holistic representation of an object.

These patches are representative of feature vectors sampled from the feature maps of
input images within the training set. The feature vectors are clustered together by a
measure of similarity, as seen in Equation 2.5.3. During feed-forward inference, the
cluster centres µk are used as weights within a 1 × 1 convolution, as described in
Equation 2.5.3, which produces the cluster activation or likelihood map L. When a set of
pixels within a given likelihood map channel has a high activation, this signals that the
cluster centre for the respective channel is representative of that set of pixels, and so
are the patches related to that cluster centre. Therefore, it’s important to learn and
understand “good” cluster centres, because the cluster centres contain a representation
of what the liver should look like. If the cluster centres are “bad,” then inference
becomes inaccurate and leads to a loss in occlusion localization performance. Below, we
discuss what makes a cluster centre “good” and what makes it “bad.”

We can see that, for the most part, the patches are selected from features that
represent similar parts of the liver for the respective cluster. However, this isn’t
perfect: we can see discontinuity in several clusters, especially the ones in Figures
4.1f, 4.2c, 4.2d, and 4.2f. For example, in cluster centre #6, two patches are small
liver pieces, whereas the rest of the patches are selected from middle-to-lower parts of
the liver. In cluster #12, the first patch is the right portion of a large liver, several
of the middle patches are corners of other livers, and one of the middle patches and the
last two patches are the right portion of a larger liver of a different shape. There
always seems to be at least one patch that is different from the rest.

The discontinuity poses a problem, because the same cluster might activate multiple
significantly different parts of the liver. We want clusters to activate different parts
of the liver; however, the parts activated by any one cluster should ideally be similar
to one another. If they are grossly different, then that activation signifies a
misrepresentation of that part of the liver. If multiple clusters each activate
significantly different parts of the liver, the discontinuity between patches can result
in false positive occlusion predictions: the discontinuity leads to low object likelihood
scores, and once a threshold is applied on the object likelihood map to detect
occlusions, the patches with low scores are deemed occlusions, whereas they wouldn’t have
been if the object likelihood score was high enough. This discontinuity is what can make
a “bad” cluster centre.

This discontinuity occurs when feature vectors that represent different parts of the
liver get clustered together via the Expectation-Maximization equation seen in Section
2.5.3. We believe the root causes are that the backbone feature extractor doesn’t extract
features that are discriminative enough between different parts of the liver, that the
vMF kappa parameter, which allows for a larger variance in features for every cluster,
isn’t optimized (if it’s too high, greatly different patches get clustered to the same
cluster centre), or that there are simply not enough cluster centres to accommodate all
the different parts of the liver.

Additionally, we see that some patches select minuscule features, as seen in cluster
centre #8 in Figure 4.2b. The only liver tissue seen is in the upper-left corner of the
receptive field, and the rest of the patch consists of the background. Cluster centres
with almost entirely black patches are considered “bad,” because they treat the
background as an object. This can lead to false positive occlusion localization, because
the background can be considered an object or an occlusion if the object likelihood
threshold is too low or too high respectively. We see this effect in Figure 4.5, where
the background is considered a light occlusion (orange pixels). This occurs because some
of the features sampled from the extracted feature maps lie outside the liver. We could
rectify this issue by introducing a constraint to the sampling function so that only
features within the liver are sampled.
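
One possible form of such a constraint is sketched below. It is a hypothetical illustration, not part of the current implementation: `liver_mask` is assumed to be a binary liver mask already resampled to the feature map's resolution, and the cap of 2000 features mirrors the training parameter from Section 3.2.3.

```python
import random


def sample_liver_positions(liver_mask, max_features=2000, seed=0):
    """Sample up to max_features (row, col) positions that lie inside the liver.

    liver_mask: 2-D list of 0/1 values at feature-map resolution (hypothetical
    input; it would have to be downsampled from the liver segmentation).
    """
    inside = [(r, c)
              for r, row in enumerate(liver_mask)
              for c, value in enumerate(row) if value]
    rng = random.Random(seed)
    count = min(max_features, len(inside))
    # Sampling without replacement, restricted to liver positions only.
    return rng.sample(inside, count)


toy_mask = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
]
positions = sample_liver_positions(toy_mask, max_features=3)
print(positions)
```

Feature vectors would then be read from the feature map only at the returned positions, so that background-only patches never enter the clustering.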

There are also instances where no meaningful liver patch is selected, which appear as
entirely black patches at the bottom of Figures 4.1a and 4.1c, for example. This means
that there aren’t at least 16 liver patches that best represent the given cluster. It
highlights that there aren’t enough training patches for certain clusters, which can be
alleviated by using more training data. This may not necessarily lower performance, but
it shows that not all the cluster centres have learned from the same number of patches.
If one wants to learn a better representation of the liver object, then more training
data is needed.

Finally, there are clusters that represent extremely similar liver patches and are
therefore redundant. An example of this is seen when comparing clusters #17 and #18 in
Figures 4.3e and 4.3f. This poses a problem because it introduces confusion in the object
likelihood prediction function 2.5.3. Since the object likelihoods from Equation 2.5.3
are normalized, instead of one cluster centre generating a high likelihood of activation,
there can be multiple redundant centres that each generate a low-to-mid likelihood of
activation. If the object likelihoods are low enough that they don’t reach the object
threshold, those patches may be classified as occlusions. Therefore, redundant clusters
can cause false positive occlusion predictions, which is an effect we wish to minimize.
We believe redundant clusters occur when the backbone feature extractor fails to
discriminate between similar liver patches, when the vMF kappa parameter, which allows
for a larger variance in features for every cluster, isn’t optimized (if it’s too low,
similar patches are clustered in different cluster centres), or when there are too many
vMF cluster centres.
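
The effect of redundancy on normalized likelihoods can be illustrated with a toy calculation. The sketch assumes a softmax-style normalization across clusters and a threshold on the strongest activation; both are simplifications that may differ from the exact form of Equation 2.5.3, and all numbers are invented.

```python
import math


def normalised_activations(scores):
    """Normalise per-cluster scores so that they sum to one (softmax)."""
    peak = max(scores)
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]


distinct = normalised_activations([5.0, 1.0])        # one well-matched cluster
redundant = normalised_activations([5.0, 5.0, 1.0])  # the same cluster duplicated

threshold = 0.6  # invented value, only for this toy example
flagged_distinct = max(distinct) < threshold    # would this patch be an occlusion?
flagged_redundant = max(redundant) < threshold
print(max(distinct), max(redundant), flagged_distinct, flagged_redundant)
```

In this toy setting the duplicated cluster splits the probability mass in two, so the same liver patch falls below the threshold and would be flagged as an occlusion even though it is well represented.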

Ideally, every cluster should only contain similar patches, and every cluster should
represent a different set of patches. That way there is little redundancy across patches,
and every cluster represents a distinct and meaningful part of the liver. If this were
the case, it would become significantly easier to diagnose why false negatives or
positives occur in the occlusion maps: if all cluster centres were ideal, then false
negatives or positives in the occlusion map would be a result of not having enough
training data to learn a larger variance in liver shape, size, and texture. This
generative aspect of the CompNet makes it very attractive for medical imaging, because
the cluster visualizations make it easier to diagnose the cause of lower occlusion or
classification performance. They offer the human researcher an understanding of the
representation learned for the training objects.

5.2 Cluster Centre Activation Visualizations

The activation maps in Figure 4.4 are a result of the 1 × 1 convolution of the vMF
cluster centres, or kernels, with the input image. The activation visualizations provide
an understanding of how well or how poorly the cluster centres have been learned during
training. Each image is a different slice from the 256 channel feature map. For brevity,
we only included the first 35 maps, which are enough to support a meaningful discussion
about our cluster activations. Each channel within the feature map corresponds to a
different cluster centre. Intuitively, regions of an activation map with high activation
highlight portions of the input image that are representative of the cluster for that
channel. This gives the human researcher the ability to understand whether the object
parts learned during training offer a good representation of the object. Ideally, every
activation map should represent a different part of the liver getting activated. However,
this isn’t the case. We can see that there are certain activation maps in which the
entire liver is activated, as seen in the third map from the left in the top row, and the
fourth map from the left in the second row from the top.

This means that there exist clusters that activate the entire liver, as opposed to
smaller regions. This shows that the object likelihood map isn’t accurate enough to
select smaller regions of the liver, which suggests that the representation learned for
the liver object hasn’t captured a large variance in shape, size, or texture of the
liver. This poses a problem because it becomes too difficult to localize more granular
occlusions, such as smaller tumours, or tumours with a similar texture to that of the
liver. We see this issue occur in Figure 4.5, where the tumours’ contents aren’t
localized as occlusions.

We can also see that there are activation maps where the background is activated, or
where both the background and the liver are activated. This is a concern because the
activation of those clusters indicates that the background is treated as an object, which
explains why we see the background categorized as a light occlusion in Figures 4.5 and ??.

5.3 Synthetic Occlusions

We can see in Figure 4.5 that, given some manually added occlusion on a liver that is
held out from the training set, we’re able to localize that occlusion, whether it be a
tumour, a common object or a noise patch. In the first row of Figure 4.5, there are three
tumours, each of a different texture, shape, and size. In Figure 4.5a, we detect the
tumour’s boundary, but not its contents. Similarly, in Figure 4.5b we detect a portion of
the left border of the tumour and half of its contents, but its scores are low (orange
pixels). There are some high-scoring blue pixels on that border of the liver, but a
portion of those blue pixels can be concluded to be false positives when compared to the
other input images with the same liver but different occlusions. In Figure 4.5c we detect
the border of the tumour and some of its content. Finally, in Figure 4.5f, we again only
detect the border of the tumour.

This leads us to believe that the CompNet has difficulty discriminating between liver
and tumour tissue. The fact that we can detect the border of the tumour but have
difficulty capturing its contents shows that the CompNet can detect a significant
difference in contrast and texture when the different textures neighbour one another. We
believe it continues to classify tumour content as liver tissue because the CompNet
hasn’t been trained well enough to discriminate between tumour and liver tissue, since
their textures are similar. This suggests that the false negatives result from not having
learned a highly accurate representation of the liver. However, more work is needed to
prove that a highly accurate representation of the liver using a CompNet would be able to
correctly localize a large variety of tumours.

In contrast, in Figures 4.5d and 4.5e, the monkey face and the car occlusions are very
easily localized: their border and contents are categorized as having a high likelihood
of occlusion (blue pixels). Their textures and colours are very different from liver
tissue, hence they’re easily localized by the CompNet. This shows that even a somewhat
acceptable representation of the liver allows a CompNet to discover grossly different
objects as occluders on the object of interest.

We can also see that blood vessels, which appear as a bright white texture on the liver,
are flagged as occlusions (blue pixels); these are false positives. We believe this is a
result of not having enough training data, since the appearance of blood vessels can vary
greatly from one liver to another. Blood vessels have been captured within several
cluster centre patches, which means that the CompNet understands that a blood vessel may
be part of the liver, but it lacks an understanding of the larger variety of blood
vessels, hence the false positives.

Finally, the manually added grey patches that are of a similar colour to the liver
tissue are easily localized (blue pixels), as seen in Figures 4.5g, 4.5h, and 4.5i. The
fact that these patches, with an extremely similar colour to the liver tissue, are so
easily localized shows that the main bottleneck in the CompNet’s performance for
localizing tumours correctly lies in discriminating between the textures of liver and
tumour tissue. We see this again when discussing the results from localizing real tumours
in liver slices.

5.4 LiTS Tumour Occlusions

In Figures 4.6a through 4.6f, we can see that the bottom portion of the liver is
localized as an occlusion because a shape like that hasn’t been seen in the training set.
Additionally, the tumour tissue texture and contrast in the majority of these input
images is extremely similar to the liver tissue texture and contrast. This again
indicates that the CompNet’s bottleneck lies in discriminating between liver and tumour
texture, and that the success in localizing the tumour’s border in the synthetic
occlusion examples is due to its easily detectable difference in contrast. We can also
see that the left part of the liver is classified as an occlusion, showing that the
CompNet hasn’t learned a representation of a liver with a left side characteristic of
those in the LiTS dataset. We can also see in the cluster centre Figures 4.1, 4.2, and
4.3 that the upper-left and middle-left portions aren’t well captured by the vMF
clusters. This shows that the CompNet has difficulty learning the left portion of the
liver, as evidenced by the fact that the left-most part of the liver in the LiTS test
images is classified as an occlusion. This problem is seen again when analyzing the MUHC
PACS liver slices in Section 5.5.

5.5 MUHC PACS Tumour Occlusions

We can see in Figures 4.7a, 4.7b, and 4.7d that several portions of the liver are
classified as occlusions, such as the left and right portions of the Test 1 and 2 livers,
and the right portion of the Test 2 liver. This shows that those parts of the liver
aren’t very representative of the average liver in the training set. However, we can see
in Figures 4.7a, 4.7b, and 4.7e that portions of the tumours are successfully classified
as occlusions: part of the large left tumour in the Test 1 liver, the boundary of the
middle tumour in the Test 2 liver, and the middle and right tumours in the Test 5 liver.
This shows promise in our work, but evidently more work is needed, as some obvious
tumours in the Test 3 and Test 4 livers aren’t localized. None of the tumours in the Test
3 liver are localized, and neither are the two large bottom tumours in the Test 4 liver.

5.6 Future Work

Future work should include a way to measure the occlusion localization performance
quantitatively. As of now, there are problems with analyzing performance via
visualizations: it is slow, small changes may not be caught by the human eye, a visual
check of the cluster centre patches needs to be performed, and it takes more time and
memory to generate the visualizations. In future work, we could create expertly annotated
ground truth segmentation maps for the synthetic occlusions, or use the segmentation maps
from the LiTS dataset, and compute a Dice score between the ground truth and the
occlusion maps generated by the CompNet for each respective experiment. Such a
quantitative measure would make it possible to use an optimization function to pick
optimal values for the vMF kappa, the occlusion threshold, the number of vMF cluster
centres, and the number of sampled features. Therefore, further future work entails using
an optimization function to pick these parameters.
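
As a rough sketch of what such a quantitative evaluation could look like, the example below scores binarized occlusion maps against a ground truth mask with the Dice coefficient and picks the best threshold from a small candidate list. It is a hypothetical illustration: the convention that scores above the threshold count as occlusion is an assumption, the scores and mask are invented, and a full search would also cover the vMF kappa, the number of cluster centres, and the number of sampled features.

```python
def dice_score(predicted_mask, truth_mask):
    """Dice overlap between two flat binary masks (1 = occlusion/tumour)."""
    overlap = sum(p and t for p, t in zip(predicted_mask, truth_mask))
    size = sum(predicted_mask) + sum(truth_mask)
    return 1.0 if size == 0 else 2.0 * overlap / size


def best_threshold(occlusion_scores, truth_mask, candidates):
    """Pick the candidate threshold whose binarised map maximises Dice."""
    results = {}
    for threshold in candidates:
        # Assumed convention: a score above the threshold marks an occlusion.
        mask = [1 if score > threshold else 0 for score in occlusion_scores]
        results[threshold] = dice_score(mask, truth_mask)
    return max(results, key=results.get), results


scores = [2.0, 8.0, 25.0, 30.0, 19.0, 40.0]   # invented occlusion scores
truth = [0, 0, 1, 1, 0, 1]                    # invented ground-truth tumour mask
threshold, dice_by_threshold = best_threshold(scores, truth, [5, 15, 20, 35])
print(threshold, dice_by_threshold)
```

An exhaustive search like this is only practical for a handful of candidates; any other optimization routine could be substituted once the Dice score is available as an objective.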

5.7 Hypothesis Discussion and Conclusion

Our results show that CompNets are not currently able to localize the entirety of a
tumour. However, we have seen in our experiments that they can at least detect parts of
synthetically added liver tumours, even if the tumour appears in different shapes, sizes,
and textures. Most notably, we have shown that they can detect and localize the
boundaries of synthetically added tumours. Our results indicate that the implementation
of CompNets for tumour localization isn’t finished, but that it has promising potential
in unsupervised tumour localization. The fact that we can accurately localize entire
common objects that occlude livers shows that the CompNet can discriminate well between
grossly different anomalies and liver tissue. We have also shown through our
visualization of the cluster centre patches that the CompNet can learn a representation
of the liver from little training data. We have shown the potential of CompNets for
medical imaging analysis, but there are still several challenges that need to be
addressed. By potential, we mean that we have shown the CompNet can at least localize
parts of synthetically added tumours, without the need for tumour training data and with
little liver training data. We believe that the main challenge involves learning ideal
cluster centres so that tumour tissue is more accurately discriminated from liver tissue.
More experiments are needed on learning better parts, compositions, or representations of
the liver to determine more conclusively whether CompNets can successfully localize the
entirety of synthetically added and natural tumours.

This area of research is not finished, and more investigation is needed. We cannot
conclusively show whether CompNets can localize the entirety of tumours, but we have shown
that they can localize the borders and edges of synthetically added tumours. To conclude, we
cannot confirm our hypothesis, but we believe there is potential in exploring Compositional
Networks for tumour localization, and more work is needed to show that they can be used
reliably for this task. Our experiments and discussion show that Compositional Networks can
detect parts of synthetically added tumours in an unsupervised way, even with a very small
amount of training data for learning a liver representation.


Chapter 6

Appendix

6.1 Image Registration Script

#!/bin/sh
# affine_reg: queue one fsl_reg command per *.nii volume in the current
# directory into .commands, then submit the queue with fsl_sub.

Usage() {
    cat <<EOF

Usage: affine_reg [options]

Target-selection options - choose ONE of:
 -T            : use FMRIB58_FA_1mm as target
                 for nonlinear registrations (recommended)
 -t <target>   : use <target> image as target
                 for nonlinear registrations
 -n            : find best target from all images
                 in nii_volumes

EOF
    exit 1
}

# Queue a transform-only estimate (-e) of every other volume onto reference $1.
estimate_reg() {
    f=$1
    for g in *.nii ; do
        o=${g}_to_$f
        if [ "$f" != "$g" ] ; then
            # Usage: fsl_reg <input> <reference> <output>
            echo "$FSLDIR/bin/fsl_reg $g $f ${g}_to_${f} -e -a -flirt \"-omat ${o}.mat\"" >> .commands
        fi
    done
}

# Queue a registration of every volume onto target $1, writing into directory $2.
do_reg() {
    target=$1
    out_dir=$2
    for src in *.nii ; do
        o=${out_dir}/${src}_to_$target
        if [ "$target" != "$src" ] ; then
            echo "$FSLDIR/bin/fsl_reg $src $target ${o} -a -flirt \"-out ${o}\"" >> .commands
        fi
    done
}

[ "$1" = "" ] && Usage

# Log the invocation context.
echo [`date`] [`hostname`] [`uname -a`] [`pwd`] [$0 $@]

if [ "$1" = -n ] ; then
    for f in *.nii ; do
        estimate_reg "$f"
    done
else
    OUTDIR=out
    if [ "$1" = -T ] ; then
        TARGET=$FSLDIR/data/standard/nii_volumes
    elif [ "$1" = -t ] ; then
        TARGET=$2
    else
        Usage
    fi
    # imtest prints 0 when the image does not exist or cannot be read.
    if [ `${FSLDIR}/bin/imtest $TARGET` = 0 ] ; then
        echo ""
        echo "Error: target image $TARGET not valid"
        Usage
    fi
    mkdir -p $OUTDIR
    #$FSLDIR/bin/imcp $TARGET out/target
    do_reg $TARGET $OUTDIR
fi

echo "Running .commands"
${FSLDIR}/bin/fsl_sub -l logs -T 60 -N affine_reg \
    -t .commands
rm .commands

Figure 6.1: Caption


