Compositional Networks for Unsupervised Tumor
Localization in Medical Imaging Analysis
Matthew Lesko-Krleza, School of Computer Science
McGill University, Montreal
June, 2021
A thesis submitted to McGill University in partial fulfillment of the
requirements of the degree of Master of Computer Science
© Matthew Lesko-Krleza, 2021
Abstract
There has been a significant amount of success in using deep learning models for com-
puter vision tasks. However, applying deep learning to medical imaging analysis is still
fraught with difficulty. State-of-the-art deep learning segmentation models train on com-
mon object data sets on the order of over 100,000 images containing over 800,000 in-
stances, whereas medical image data sets contain only several hundred three-dimensional
images. This is why we propose to evaluate an unsupervised occluder localization pro-
cess from Compositional Networks to perform tumour localization on medical imaging
data. In this work we show the potential of Compositional Networks in performing tu-
mour localization without the need of tumour instances within the training dataset. Al-
though we don’t conclusively determine whether CompNets can fully localize tumours
or not, we show promise and discuss future work that could alleviate concerns with our
current results.
Abrégé
The use of deep learning models for computer vision tasks has seen considerable success.
However, applying deep learning to medical imaging analysis still poses many difficulties.
Deep learning segmentation models train on common object datasets on the order of over
100,000 images containing over 800,000 instances, whereas medical image datasets contain
only several hundred three-dimensional images. This is why we propose to evaluate an
unsupervised occluder localization process from Compositional Networks to perform tumour
localization on medical imaging data. In this work, we show the potential of Compositional
Networks for performing tumour localization without the need for tumour instances in the
training dataset. Although we do not conclusively determine whether CompNets can fully
localize tumours, we show promise and discuss future work that could alleviate concerns
about our current results.
Acknowledgements
I would like to thank my supervisor Peter Savadjiev, and Adam Kortylewski for their
guidance, feedback, and insightful conversations. They were both supportive of my
work, and helped me overcome the intellectual challenges that come with research. I
wouldn’t have been able to create this work, nor would I have been able to produce my
results without them. I would like to acknowledge and thank Adam for his code for the
Compositional Networks model, which handles training and testing the Compositional
Networks. I would also like to thank Compute Canada for allowing me the use of their
compute hardware; I wouldn’t have been able to run any experiments without them.
Contribution of Authors
Matthew Lesko-Krleza wrote this thesis, and wrote code for the experiments.
Table of Contents
Abstract
Abrégé
Acknowledgements
Contribution of Authors
List of Figures
List of Tables
1 Introduction
2 Literature Review
2.1 Machine Learning
2.2 Discriminative and Generative Models
2.3 Deep Learning
2.3.1 Artificial Neural Network Theory
2.3.2 Deep Convolutional Neural Networks
2.3.3 Common Training Problems and Their Solutions
2.3.4 Activation and Pooling Functions
2.3.5 Non-Medical Imaging Applications
2.3.6 Medical Imaging Applications
2.4 Dictionary Learning and Pattern Theory
2.5 Compositional Models
2.5.1 Hierarchical Compositional Models
2.5.2 Recursive Cortical Networks for Object Classification under Occlusion
2.5.3 Compositional Convolutional Neural Networks
3 Methodology
3.1 Datasets
3.2 Methods
3.2.1 Image Registration
3.2.2 Pre-Trained Feature Extractor
3.2.3 CompNet Training
3.2.4 CompNet Testing
4 Results
5 Discussion
5.1 Cluster Centre Patch Visualizations
5.2 Cluster Center Activation Visualizations
5.3 Synthetic Occlusions
5.4 LiTS Tumour Occlusions
5.5 MUHC PACS Tumour Occlusions
5.6 Future Work
5.7 Hypothesis Discussion and Conclusion
6 Appendix
6.1 Image Registration Script
List of Figures
2.1 Feedforward Neural Network Example
2.2 Convolving a 3 × 3 kernel over a 4 × 4 input using a stride of 1 and zero padding.
Adapted from [14]
2.3 Encoder-Decoder CNN Architecture for Lung Segmentation [7]
2.4 Two-Pathway CNN Architecture (TwoPathCNN) for Brain Tumour Segmentation [21]
2.5 Cascaded TwoPathCNN Architectures
2.6 U-Net CNN Architecture for Electron Microscopy Cell Segmentation (example for 32x32
pixels in the lowest resolution) [6]
2.7 Hierarchical Compositional Model Representing a Horse [10]
2.8 Compositional Convolutional Neural Network Architecture [69]
2.9 Illustration of vMF kernels by visualizing image patterns from the training set that
activate a given vMF kernel the most. Note how image patterns that are of similar
appearance and share semantic meaning are separated into different kernels [69].
2.10 Visualization of learned mixture models (4 mixtures). Each row is for a different
object class (car, train, boat, or bus) and each column represents a different mixture for
the object class. Note how different 3D viewpoints are approximately separated into
different mixtures. [69]
2.11 Occlusion localization results from [69]. Each result consists of three images: the
input image, the occlusion scores of a dictionary-based compositional model from a prior
work by Kortylewski et al. [72] and the occlusion scores of the proposed CompNet [69].
Note how the CompNet can localize occluders with high accuracy across different objects
and occluder types for real as well as for artificial occlusions.
3.1 U-Net training loss and validation accuracy plots
4.1 First 6 CompNet cluster center patch visualizations. Every set of 16 patches
represents one cluster center whose patches are most representative for that given
cluster center.
4.2 Second 6 CompNet cluster center patch visualizations. Every set of 16 patches
represents one cluster center whose patches are most representative for that given
cluster center.
4.3 Third 6 CompNet cluster center patch visualizations. Every set of 16 patches
represents one cluster center whose patches are most representative for that given
cluster center.
4.4 VMF cluster activations. The image in the top left is the input image. Purple pixels
represent no activation, blue pixels represent a low activation, green pixels represent
moderate activation, and yellow pixels represent high activation.
4.5 Occlusion generation on synthetic occluders. The left column consists of input images
with some synthetic occlusion and the right column consists of occlusion maps where
purple pixels represent no occlusion, orange pixels represent a low scoring of occlusion,
green pixels represent a medium scoring of occlusion, and blue pixels represent a high
scoring of occlusion.
4.6 Occlusion generation on real tumors. Every sub-figure’s left column consists of input
images with some real tumour, the middle column consists of occlusion maps, and the right
column consists of the ground truth liver and tumor segmentation maps. For the occlusion
maps, the purple pixels represent no occlusion, blue pixels represent a low scoring of
occlusion, green pixels represent a medium scoring of occlusion, and yellow pixels
represent a high scoring of occlusion. For the ground truth segmentation maps, black, grey
and white pixels represent the background, liver tissue, and tumor tissue classes
respectively.
4.7 Occlusion generation on manually segmented real tumors from the MUHC PACS dataset.
Every sub-figure’s left column consists of input images with some real tumours and every
sub-figure’s right column consists of occlusion maps where purple pixels represent no
occlusion, orange pixels represent a low scoring of occlusion, green pixels represent a
medium scoring of occlusion, and blue pixels represent a high scoring of occlusion.
6.1 Caption
Chapter 1
Introduction
Research in Artificial Intelligence (AI) has made a significant amount of progress in the
last several years thanks to the increasingly large amount of labelled data, the virtually
unlimited amount of computational resources and breakthroughs in computer vision al-
gorithms including artificial neural networks and Compositional Models [1], [2]. As a
subset of the computer vision field, the medical imaging sector is rich with annotated
data and challenges ripe for AI innovation. AI is unlikely to fully replace the need
for human radiology expertise; however, it has enormous potential to enhance existing
systems and to automate certain repetitive tasks. Examples of repetitive tasks in medical
imaging analysis that AI could help with include segmenting organs or diseases,
classifying diseases, and enhancing image quality [3]. AI could revolutionize the medical
field by helping radiology experts deliver better and faster care for patients, reducing
overall medical costs, and enabling broader medical analysis coverage, especially for
impoverished societies. Given the immense popularity and promising success of artificial
neural network algorithms in common object computer vision tasks such as object
classification and segmentation [4], [5], it is only natural that research labs have begun
evaluating their performance in medical imaging analysis. Applying neural networks to
medical imaging analysis has been met with some success [6], [7] and many challenges
[8], [9]. Applying AI to medical imaging tasks may sound promising, but applying currently
popular AI algorithms to medical imaging analysis is still fraught with difficulty.
There are many challenges in applying AI algorithms to medical imaging. First and
foremost, there is the lack of data. Popular state-of-the-art deep learning segmentation
models train on common object datasets on the order of over 100,000 images containing
over 800,000 instances, whereas medical image datasets contain only several hundred
volumetric data points [3]. Deep learning models require vast amounts of data to be
successful, and medical image datasets don’t normally provide such an amount.
Then there are the facts that modalities across medical datasets differ from one another,
and that the machines and parameters used for acquiring this data may differ from one
dataset or clinic to the next. This poses a challenge in model generalizability, because a
model may work well for one source of data, but not adapt well to a different one. One
might say that we could use forms of data augmentation to increase sample variance in
a given dataset to improve performance, and one would find that there have been such
approaches that have successfully improved predictive performance [6]. However, not
only are augmentation techniques not standardized, but empirical evidence is required
to determine which augmentation techniques work best with the given dataset, therefore
requiring more time and effort for training. Data augmentation still doesn’t solve the
main challenges in working with medical images; it only helps circumvent some issues.
Finally, and most importantly in the case of our work, diseases found within medical
images can be difficult for the human eye to locate, and their appearance can vary greatly
from one patient to another. In the example of localizing liver tumours within computed
tomography scans of the abdomen, these tumours can look only slightly darker or lighter
than the surrounding tissue, their texture can resemble that of liver tissue, and they
can appear in an overwhelming variety of shapes and sizes. All of this illustrates that
the lack of data and the variance of disease appearance are some of the biggest challenges
we face in applying AI to medical imaging analysis. To summarize, applying AI algorithms
to medical imaging analysis poses a challenge for AI researchers because of the lack of
data, the nature of the data, the uncertainty in enhancing the given data and the variance
in disease appearance. We would like to explore a more data-efficient method for medical
imaging analysis.
In contrast to deep learning models which have garnered an immense amount of at-
tention, there exists a set of generative models that require far less training data for object
representation training, but still offer accurate object representations. These models
are called Compositional Models [10]. Compositional Models represent objects of their
underlying data in a compositional manner by recursively composing elemental parts of
objects, such as curves or patches of textures that are extracted from input images, into
progressively more unified and holistic object parts until the entire target object can be
reconstructed. They have garnered some popularity in computer vision because of their
data efficiency [2], but usually under-perform against deep learning models for common
tasks such as object classification. Nevertheless, they’re an attractive approach to evaluate
for medical imaging analysis because of their data-efficiency in learning object represen-
tation and more importantly because they still haven’t been applied for medical imaging
analysis tasks. Evaluating Compositional Models for medical imaging analysis is a new
area of research which we would like to explore.
We believe we need to explore a direction that uses Compositional Models, but we
also don’t want to disregard deep learning algorithms entirely. Artificial neural networks
may require a large amount of data, but they have proven to provide strong abilities in
discriminative feature extraction, whereas generative models typically need less data but
are usually outperformed by their deep learning counterparts in vision tasks. This is why
we propose to evaluate the potential of a recently proposed image analysis algorithm that
combines an artificial neural network with a generative model. This algorithm is called
the Compositional Convolutional Neural Network [11], or Compositional Network for
short, and we wish to evaluate its potential in disease localization. It consists of a pre-
trained deep learning model for image feature extraction and a generative Compositional
model to learn from and perform inference on the feature maps extracted from the neural
network model. To our knowledge, this algorithm has not previously been applied to medical
imaging, and an important novel feature of it is its ability to perform unsupervised
object occlusion localization. In other words, it can localize object occlusions without
ever having seen any form of occlusion during the training phase. In this work, we formulate the
problem of tumour localization as an occlusion localization problem. We wish to use the
Compositional Network’s occlusion localization ability to localize tumours in an unsu-
pervised fashion. The benefit of this method is that the training phase of the Composi-
tional Network wouldn’t require instances of the disease we wish to locate at test time. If
successful, this could enable a promising method in localizing diseases in medical images
without the need of an exhaustive amount of data to capture the full variance in dis-
ease appearance. This could revolutionize the way we train machine learning models for
disease localization. Given its success in occlusion localization, we experiment with the
Compositional Network for tumour localization without the need of tumour data within
the training dataset.
Our hypothesis is that CompNets can accurately localize tumours without tumour training
data and with little liver training data. To outline the content of this work, we conduct
a literature review on deep learning and its applications in medical imaging, and a
literature review on Compositional Models. We describe our methodology and results in
evaluating the Compositional Network’s performance for tumour localization. Finally, we
discuss the promise Compositional Networks have in medical imaging analysis, which
includes an analysis of the areas they currently succeed in and an analysis of the areas
they currently need more work in. Overall, we contribute to the AI and medical
imaging analysis fields by showing promising potential in Compositional Networks for
disease localization thanks to their data-efficiency and ability to localize parts of tumours
in an unsupervised way.
Chapter 2
Literature Review
This work assumes that the reader has a computer science background and understands
basic machine learning concepts such as: supervised learning, unsupervised learning, features
and training data. We first describe the difference between discriminative and generative
machine learning models because their differences highlight some of the reasons why
we chose to use Compositional Models for medical imaging analysis. Then we give a
review of deep learning including fundamental theory, convolutional feature extraction,
pooling and activation functions, non-medical imaging applications and medical imaging
applications. Next, we describe dictionary learning and pattern theory as the fundamental
ideas behind compositional models. Finally, we describe the motivation and theory behind
compositional models and compositional neural networks.
2.1 Machine Learning
Machine Learning (ML) is a subset of Artificial Intelligence (AI). It is defined as a set of
methods that automatically discover patterns in data, and use the discovered patterns for
decision-making under uncertainty or to predict future patterns on new data. The main
role of ML is to learn patterns on its own and proceed to make decisions without being
explicitly programmed to do so. ML algorithms vary in performance and complexity, from
simple classifiers to Deep Artificial Neural Networks, which we discuss later in
Section 2.3. Before discussing some examples of ML algorithms, we discuss the difference
between discriminative and generative models, since this distinction highlights the
appeal of using Compositional Models in medical imaging analysis, the domain we wish to
apply them to.
2.2 Discriminative and Generative Models
Here we discuss the difference between discriminative and generative models
as well as the theory of classical Compositional Models and Hierarchical Compositional
Models as primers to understanding Compositional Networks. In ML, our goal is to
learn the relationships between the world and the data we sample from it. We
can approximate these relationships by training a model with some set of parameters.
The model can be either generative or discriminative. With generative models we aim
to understand the distribution from which a sample $x$ labelled as $y$ is drawn, that is,
the joint distribution:
$$P(x, y) = P(x \mid y)\,P(y)$$
where $P$ is a probability, $x$ is a sample from some dataset, such as an input feature
vector, and $y$ is a label for that sample, such as a classification label or a value from
a continuous distribution. This allows us to capture and represent the underlying
distribution of the samples $x$, which we can then use to generate new data instances.
This makes generative models useful for unsupervised ML approaches because the output
allows us to understand the underlying distribution of data. With discriminative models,
by contrast, learning the distribution from which $x$ is sampled is irrelevant; the only
goal is to classify, or discriminate between, samples:
$$P(y \mid x)$$
where $P$, $x$, and $y$ are the probability, sample, and label, as with generative models.
Discriminative models learn the boundaries between classes within a given dataset, mak-
ing them computationally cheaper and more robust to outliers than generative models at
the expense of not learning the underlying distribution of data.
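To make the distinction concrete, the following is a minimal sketch, not part of the thesis experiments, of a generative classifier on a toy one-dimensional problem. It assumes Gaussian class-conditional densities P(x|y); the class names, priors, means and standard deviations are hypothetical values chosen only for illustration. The posterior P(y|x), which a discriminative model would learn directly, is recovered here from the generative quantities through Bayes' rule.

```python
import math

# Hypothetical toy parameters (illustrative only, not from the thesis):
# class priors P(y) and Gaussian (mean, std) parameters of P(x | y).
PRIORS = {"class_a": 0.7, "class_b": 0.3}
GAUSSIANS = {"class_a": (0.0, 1.0), "class_b": (3.0, 1.0)}


def likelihood(x, label):
    """Gaussian class-conditional density P(x | y = label)."""
    mean, std = GAUSSIANS[label]
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def joint(x, label):
    """Generative quantity P(x, y) = P(x | y) P(y)."""
    return likelihood(x, label) * PRIORS[label]


def posterior(x):
    """Discriminative quantity P(y | x), obtained via Bayes' rule."""
    joints = {label: joint(x, label) for label in PRIORS}
    evidence = sum(joints.values())  # P(x) = sum over y of P(x, y)
    return {label: value / evidence for label, value in joints.items()}


post = posterior(1.0)
predicted = max(post, key=post.get)  # most probable label for x = 1.0
```

A generative model of this kind can also score how plausible a sample is under every class (through `joint`), which a purely discriminative model does not provide.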
Examples of generative models include Mixture Models, and Hierarchical Composi-
tional Models (HCMs). A mixture model is a probabilistic model for representing the
presence of subpopulations within an overall population. It corresponds to a weighted
combination of underlying distributions, which allows for flexibility in data
representation. Mixture models form an important component of Compositional Networks and
are explained in more detail in Section 2.5.3. HCMs are the earliest form of Compositional
Models and the source of the compositionality in Compositional Networks; they are
reviewed in Section 2.5.1.
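As a small illustration of the mixture model definition above, the sketch below evaluates a one-dimensional Gaussian mixture as a weighted sum of component densities and computes how strongly each component explains a given sample. The weights, means and standard deviations are hypothetical; the mixture models used inside Compositional Networks (Section 2.5.3) operate on feature maps and are not reproduced here.

```python
import math


def gaussian_pdf(x, mean, std):
    """Density of a univariate Gaussian."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


# Hypothetical components as (weight, mean, std); the weights sum to 1.
COMPONENTS = [(0.6, -1.0, 0.5), (0.4, 2.0, 1.0)]


def mixture_pdf(x, components=COMPONENTS):
    """Mixture density: weighted sum of the component densities."""
    return sum(w * gaussian_pdf(x, m, s) for w, m, s in components)


def responsibilities(x, components=COMPONENTS):
    """Posterior probability of each component (subpopulation) given x."""
    weighted = [w * gaussian_pdf(x, m, s) for w, m, s in components]
    total = sum(weighted)
    return [value / total for value in weighted]


resp = responsibilities(-1.0)  # dominated by the first component
```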
An example of a discriminative model is the Artificial Neural Network, which we
describe in Section 2.3.
Finally, a Compositional Network is a new ML model system which combines the dis-
criminative power of Artificial Neural Networks with the generative model advantages
of Compositional Models. The generative aspect of Compositional Networks enables ML
practitioners to learn the parts and compositions of parts that make up the training data.
In this work, we experiment with the use of this Compositional Network and its abil-
ity in learning parts and compositions of parts from medical images while using a deep
neural network for performing feature extraction. Compositional Networks are further
described in Section 2.5.3.
2.3 Deep Learning
2.3.1 Artificial Neural Network Theory
Deep learning is a subset of ML. It usually refers to the learning of a special type of model
named the Artificial Neural Network (ANN) which mimics the multilayered human cog-
nitive system. We want to discuss the fundamentals of deep learning and convolutional
neural networks, because they are used as feature extractors in compositional networks,
and their training algorithm (Backpropagation [12]) is used to train compositional networks
as well. Understanding the fundamentals of deep learning will help one understand the
theory in compositional networks.
Artificial Neural Networks are a set of acyclic and interconnected nodes loosely in-
spired by the neurons in the human brain. The purpose of an ANN is to learn an internal
representation of the dataset it trains on, which could then be used for a downstream task
such as prediction or pattern recognition. The main advantage of an ANN is its ability to
perform pattern recognition on raw signals. There is no need for any feature-engineering
nor any preprocessing. The ANN was previously limited in its ability to solve real-world
problems because of the lack of sufficient data and the lack of computing power to train
the system. However, in recent years, thanks to larger datasets, the virtually unlimited
amount of compute resources available through the use of Graphics Processing Units
(GPUs) and cloud computing, and solutions to the vanishing gradient problem which we
describe in Section 2.3.3, ANNs have been widely used for predictive tasks, especially in
general computer vision tasks [4] and medical imaging analysis [13]. The quintessential
deep learning model is the feedforward neural network or multilayer perceptron (MLP). The
goal of an MLP is to approximate some function $f^*$ that maps an input vector $x$ to an
output $y$. $y$ can either be a continuous or discrete value (such as a class label). The
MLP’s weights are defined as a set of parameters $\theta$ which are learned so that the
mapping $y = f(x; \theta)$ best approximates $f^*$. MLPs are characterized as networks
because they represent a chained composition of many functions. For example, we might have
three different functions $f^{(1)}$, $f^{(2)}$, and $f^{(3)}$ connected in a chain to form
$f(x) = f^{(3)}(f^{(2)}(f^{(1)}(x)))$. In this situation, $f^{(1)}$ is called the first
layer, $f^{(2)}$ is called the second layer and so on. Graphically, the input vector is
represented as the input nodes, the function layers are represented as intermediary nodes
$h^{(n)}_m$ and the output vector $y$ is the set of output nodes within the MLP
computation graph. Figure 2.1 illustrates an example of a feedforward neural network with
two hidden layers. The input and output of the nodes are defined by the direction of the
edges connecting the nodes together into an acyclic network. The MLP’s parameters $\theta$
are a set of weights that weigh the respective output of every hidden layer. These are the
parameters that are optimized during training via Gradient-Based Learning.
Figure 2.1: Feedforward Neural Network Example. The input layer ($x_1$, $x_2$, $x_3$)
feeds hidden layer 1 ($h^{(1)}_1$ to $h^{(1)}_4$, weights $\theta^{(1)}$), then hidden
layer 2 ($h^{(2)}_1$ to $h^{(2)}_3$, weights $\theta^{(2)}$), then the output layer
($y_1$, $y_2$, weights $\theta^{(3)}$).
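The chained composition described above can be sketched in a few lines. This is an illustrative forward pass only, with the layer sizes of Figure 2.1 (3 inputs, hidden layers of 4 and 3 units, 2 outputs). The tanh activation, the zero-initialized biases, the random weight range and the input values are assumptions made for the example and are not specified in the text.

```python
import math
import random


def make_layer(n_in, n_out, rng):
    """Return (weights, biases) for a dense layer with small random weights."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases


def dense(layer, x, activation):
    """Apply one layer: activation(W x + b), one output value per row of W."""
    weights, biases = layer
    return [
        activation(sum(w * xi for w, xi in zip(row, x)) + b)
        for row, b in zip(weights, biases)
    ]


def identity(value):
    return value


def forward(layers, x):
    """Chain the layers: f(x) = f3(f2(f1(x)))."""
    h1 = dense(layers[0], x, math.tanh)   # hidden layer 1
    h2 = dense(layers[1], h1, math.tanh)  # hidden layer 2
    return dense(layers[2], h2, identity)  # output layer


rng = random.Random(0)
sizes = [3, 4, 3, 2]  # layer widths taken from Figure 2.1
theta = [make_layer(n_in, n_out, rng) for n_in, n_out in zip(sizes, sizes[1:])]
y = forward(theta, [0.2, -0.1, 0.7])
```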
Training a neural network is not much different from training any other ML model which
uses gradient descent. A differentiable loss function (or cost function), defined as
$\mathrm{Loss}(f(x), y)$, is used to measure the neural network’s performance during
training. Gradient descent is performed to optimize the neural network’s parameters. Every
gradient step is a function of the derivative of the Loss with respect to the weights,
$\frac{\partial L}{\partial w}$, which is computed by applying the chain rule to all the
stored local gradients. The partial derivative of every layer is computed with respect to
its input; these local partial derivatives are then multiplied in chain rule fashion to
compute the partial derivative of the Loss with respect to the weights. For example, given
three consecutive and connected hidden layers $h^{(1)}$, $h^{(2)}$ and $h^{(3)}$ and a
loss function Loss, their respective outputs are $z_1$, $z_2$, $z_3$ and $L$. This loss
function is a metric function between the estimates $\hat{y}$ and the ground truth $y$.
The local partial derivatives are $\frac{\partial z_1}{\partial w}$,
$\frac{\partial z_2}{\partial z_1}$, $\frac{\partial z_3}{\partial z_2}$ and
$\frac{\partial L}{\partial z_3}$. By the chain rule:
$$\frac{\partial L}{\partial w} = \frac{\partial z_1}{\partial w} \cdot \frac{\partial z_2}{\partial z_1} \cdot \frac{\partial z_3}{\partial z_2} \cdot \frac{\partial L}{\partial z_3} \tag{2.1}$$
Figure 2.2: Convolving a 3 × 3 kernel over a 4 × 4 input using a stride of 1 and zero
padding. Adapted from [14]
For the stochastic gradient descent example, the model’s weights $w$ are updated as
follows:
$$w \leftarrow w - \alpha \frac{\partial L}{\partial w} \tag{2.2}$$
where $\alpha$ is a learning rate, $0 < \alpha \le 1$. There exist various optimization algorithms, but
for the sake of simplicity, we’ve limited ourselves to describing the basic stochastic gra-
dient descent. During training, the MLP stores local gradients and Backpropagation (the
application of the chain rule and weight updates) is performed to optimize the model’s
weights. This form of training is generalizable to any kind of deep learning network.
During testing, the MLP simply performs the forward pass without tuning any weights.
Performance is measured with some metric, such as Cross-Entropy (for classification) or
Mean-Squared Error (for regression).
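The chain rule of Equation 2.1 and the update of Equation 2.2 can be illustrated on a scalar chain of three layers. The sketch below is a toy example under stated assumptions: the layer functions (a multiplication by w, a tanh, and a multiplication by a fixed weight v), the squared-error loss, the single training sample and the learning rate are all hypothetical choices, not the setup used in this thesis.

```python
import math


def forward(w, x, y, v=2.0):
    """Forward pass that also stores the local gradients used in Eq. 2.1."""
    z1 = w * x            # layer 1 output
    z2 = math.tanh(z1)    # layer 2 output
    z3 = v * z2           # layer 3 output (v is a fixed, hypothetical weight)
    loss = (z3 - y) ** 2  # squared-error loss L
    local = {
        "dz1_dw": x,
        "dz2_dz1": 1.0 - z2 ** 2,
        "dz3_dz2": v,
        "dL_dz3": 2.0 * (z3 - y),
    }
    return loss, local


def gradient(w, x, y):
    """Chain rule of Eq. 2.1: product of the stored local gradients."""
    _, g = forward(w, x, y)
    return g["dz1_dw"] * g["dz2_dz1"] * g["dz3_dz2"] * g["dL_dz3"]


def sgd_step(w, x, y, alpha=0.1):
    """Update of Eq. 2.2: w <- w - alpha * dL/dw."""
    return w - alpha * gradient(w, x, y)


w, x, y = -1.0, 1.0, 1.0  # initial weight and one (input, target) pair
loss_before, _ = forward(w, x, y)
for _ in range(50):
    w = sgd_step(w, x, y)
loss_after, _ = forward(w, x, y)
```

Comparing `gradient` against a finite-difference estimate of the loss is a common sanity check that the stored local gradients were multiplied correctly.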
Ultimately, the MLP is an encoder that learns a representation of its training data by
encoding raw data into a feature space. This representation can be thought of as a com-
pressed version of the training data.
2.3.2 Deep Convolutional Neural Networks
Deep Convolutional Neural Networks (DCNNs or CNNs for short) are a class of deep
learning models that use convolutional filters for feature extraction. Over recent years,
they have gained immense use because of their promising results in computer vision
tasks. Their convolution filters perform convolutional operations on multi-dimensional
data. In the case of a 2-dimensional input, a kernel K passes over points (x, y) and per-
forms a convolution operation on the values within the window centred at a point (x, y).
The stride between points is a parameter determined by the model designer. The kernel
K consists of weights w that are multiplied with the input to achieve a weighted
convolution operation. During training, the weights are updated to optimize the loss
function. Because a window is used to select a set of pixels and aggregate their values,
convolutional filters commonly reduce the spatial resolution of their input. This
resolution reduction effect is depicted in Figure 2.2. The output is a feature map
in which each pixel is a weighted sum of its neighbours’ features or pixel values. This
feature aggregation enables the neural network to extract a hierarchy of increasingly
complex features, making CNNs very appealing for image analysis. However, if padding is
added to the input such that the output resolution matches the input resolution,
or if the kernel size is 1 × 1 and the stride is 1, then the output feature map’s
resolution matches that of the input. Therefore, convolutions don’t necessarily reduce the
spatial resolution of the input. The output feature map can either be fed to more
functions within a neural network for further feature extraction or fed to a prediction layer
to perform a predictive task such as image classification, object segmentation, or feature
clustering.
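The effect of kernel size, stride and padding on the output resolution can be checked with a naive implementation. The sketch below is illustrative only: it assumes a square kernel and a single-channel input, computes the cross-correlation form of the operation (no kernel flip), and uses a hypothetical 4 × 4 input and a box kernel of ones. It reproduces the three cases discussed above: a 3 × 3 kernel without padding shrinks the 4 × 4 input to 2 × 2, a padding of 1 preserves the resolution, and a 1 × 1 kernel with a stride of 1 also preserves it.

```python
def conv2d(image, kernel, stride=1, padding=0):
    """Naive single-channel 2D convolution with a square kernel."""
    if padding > 0:
        # Surround the input with `padding` rows and columns of zeros.
        width = len(image[0]) + 2 * padding
        image = (
            [[0.0] * width for _ in range(padding)]
            + [[0.0] * padding + list(row) + [0.0] * padding for row in image]
            + [[0.0] * width for _ in range(padding)]
        )
    k = len(kernel)
    out_h = (len(image) - k) // stride + 1
    out_w = (len(image[0]) - k) // stride + 1
    output = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            # Weighted sum of the k x k window whose top-left corner is
            # at (i * stride, j * stride).
            acc = 0.0
            for a in range(k):
                for b in range(k):
                    acc += kernel[a][b] * image[i * stride + a][j * stride + b]
            row.append(acc)
        output.append(row)
    return output


image = [[float(4 * r + c) for c in range(4)] for r in range(4)]  # toy 4x4 input
box = [[1.0] * 3 for _ in range(3)]                               # toy 3x3 kernel

valid = conv2d(image, box)             # no padding: 2x2 output
same = conv2d(image, box, padding=1)   # padding of 1: 4x4 output
pointwise = conv2d(image, [[1.0]])     # 1x1 kernel, stride 1: 4x4 output
```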
Receptive Fields
Here we describe one of the basic concepts in CNNs which is the Receptive Field, or Field
of View, of a unit in a certain layer in the network. We describe their use and how to
compute their size, because they become relevant when we discuss our results in Section
4. In fully connected networks, the value of each feature unit depends on the entire input
to the network, whereas a feature unit within a CNN only depends on a specific region
of the input. The receptive field is that region for the unit; intuitively, it is the region in
the input space that a particular CNN feature is paying attention to. This concept helps
in understanding and diagnosing how CNNs work. We can compute the size $r_0$ of the
receptive field in the input image as follows:
$$ r_0 = \sum_{l=1}^{L} \left( (k_l - 1) \prod_{i=1}^{l-1} s_i \right) + 1 \tag{2.3} $$
Where $k_l$ is the kernel size used at layer l, $s_i$ is the stride used at layer i, and L is
the number of layers between the input and the unit in question. If the stride is greater than 1
for a particular layer, the region increases proportionally for all subsequent layers.
Receptive fields are important in visualizing
and understanding the feature extraction process in CNNs. We use them to visualize the
patches of images which are representative of meaningful features extracted during the
training of a Compositional Network as seen in Section 5.1.
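Equation 2.3 can be evaluated with a short loop. The sketch below is a minimal
illustration; the function name `receptive_field` and the example layer configurations
are hypothetical and not taken from any architecture in this work.

```python
def receptive_field(kernel_sizes, strides):
    """Receptive field size in the input, following Equation 2.3.

    kernel_sizes[l] and strides[l] describe layer l, ordered from input to output.
    """
    r = 1
    jump = 1  # product of the strides of all earlier layers
    for k, s in zip(kernel_sizes, strides):
        r += (k - 1) * jump
        jump *= s
    return r

# Three stacked 3 x 3 convolutions with stride 1.
rf_convs = receptive_field([3, 3, 3], [1, 1, 1])

# 3 x 3 convolution, 2 x 2 pooling with stride 2, then another 3 x 3 convolution.
rf_with_pool = receptive_field([3, 2, 3], [1, 2, 1])

print(rf_convs, rf_with_pool)
```

The second configuration shows the effect of a stride greater than 1: the final 3 × 3
convolution contributes 4 input pixels instead of 2, because it operates on a feature map
that was down-sampled by the preceding layer.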
2.3.3 Common Training Problems and Their Solutions
As a disclaimer, this section is not directly relevant to our main contribution, but readers
may find it interesting since several of the techniques are referenced in the Medical
Imaging Applications section (Section 2.3.6).
Vanishing and Exploding Gradients
During training, the local gradient of every function within the forward pass is computed
with respect to its input. Then, during Backpropagation, the chain rule is applied
as in Equation 2.1 to compute the gradient of the loss function with respect to
the model’s weight values. However, if the ANN has many layers and the local gradients are
repeatedly small or repeatedly large, their product across layers can
either vanish to zero or explode in value respectively. This is also called gradient
saturation and can significantly affect convergence and training [15]. This problem has been
largely addressed by normalized initialization [16] and batch normalization layers [17],
which normalize the feature maps and gradients across layers. Batch Normalization is
a function applied to intermediate layers within the network that normalizes each input
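training batch:
$$ \text{Mini-batch mean: } \mu_B = \frac{1}{m} \sum_{i=1}^{m} x_i $$
$$ \text{Mini-batch variance: } \sigma_B^2 = \frac{1}{m} \sum_{i=1}^{m} (x_i - \mu_B)^2 $$
$$ \text{Normalization: } \hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}} $$
$$ \text{Scale and Shift: } \text{output}_i = \gamma \hat{x}_i + \beta $$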
where x is some d-dimensional input and each dimension of the input gets normalized.
B is a batch of m inputs, γ and β are scaling and offset parameters respectively,
and ε is a hyperparameter for numerical stability that avoids dividing by zero when
normalizing. Not only does Batch Normalization mitigate the vanishing and exploding
gradient problem by normalizing features, it has been shown that merely adding it to a
state-of-the-art image classification model yields a substantial speedup in training and
a significant increase in performance [17].
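The four Batch Normalization steps above can be sketched in a few lines of NumPy. This
is a training-time illustration only: the function name `batch_norm`, the toy batch and
the default value of `eps` are assumptions, and the running statistics that libraries
typically keep for inference are omitted.

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Normalize a batch x of shape (m, d) per dimension, then scale and shift."""
    mu = x.mean(axis=0)                    # mini-batch mean
    var = x.var(axis=0)                    # mini-batch variance (divides by m)
    x_hat = (x - mu) / np.sqrt(var + eps)  # normalization
    return gamma * x_hat + beta            # scale and shift

rng = np.random.default_rng(0)
batch = rng.normal(loc=5.0, scale=3.0, size=(64, 4))  # toy batch: m = 64, d = 4

normalized = batch_norm(batch, gamma=np.ones(4), beta=np.zeros(4))
print(normalized.mean(axis=0), normalized.std(axis=0))
```

With γ = 1 and β = 0, every dimension of the output has approximately zero mean and unit
standard deviation, regardless of the scale of the input batch.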
Overfitting
Overfitting occurs when a model fits too closely to a particular training dataset,
which can happen for various reasons. It can occur if the model to be trained is too
complicated, if the model’s training duration is too long, or most often if there is too
little training data for the given task. Intuitively, an overfitting model can be thought of as
attempting to memorize the dataset. This is undesirable because the model then has
difficulty generalizing to new and unseen data. ANNs are prone to overfitting because of
their increased complexity, which introduces a significantly larger need for training data
compared to their classical ML model counterparts. Methods such as increasing the
amount of training data, transfer learning, early-stopping, data augmentation, weight
regularization, and dropout help reduce the possibility of overfitting in neural networks.
Medical imaging training data sets are usually significantly smaller than common
computer vision datasets [3]. The required training sample size is an ongoing area of
research in ML, but a rule of thumb suggests that the number of samples be at least 10
times the number of trainable parameters [3]. Unfortunately, most publicly available
medical imaging datasets contain only hundreds or several thousand samples [3] and neural
networks can contain millions of trainable parameters [4], [6], [7]. Increasing the amount
of training data within the medical domain is costly, requiring medical equipment,
patients and medical expertise to properly label samples. Since small training datasets are
common within the medical imaging context, it’s attractive to use pre-trained networks
within a transfer learning paradigm. Many state-of-the-art neural network solutions in
medical imaging analysis use some form of transfer learning [18].
Transfer learning involves using a model whose parameters have already been trained
on a source task, and adapting the model to a target task [19]. Given a source domain $D_s$
with a corresponding source task $T_s$, and a target domain $D_t$ with a target task $T_t$,
transfer learning is the process of improving the model’s target task function $f_t^*(\cdot)$
by using related information from the source domain and task, where $D_s \neq D_t$ or
$T_s \neq T_t$. One will commonly find within the vision context that models are pre-trained
on large image datasets such as ImageNet or CIFAR-100, then either all their parameters or
exclusively the model’s prediction layer are updated for the target task by training on the
target dataset [18].
Early-stopping is the act of stopping a model’s training before the intended stopping
criterion, such as a number of iterations or some accuracy threshold, is reached. The exact
criterion used for validation-based early stopping is either chosen in an ad-hoc fashion,
applied interactively, or triggered when the validation set performance ceases to improve
for some given duration [20]. When used, it is commonly combined with other overfitting
solutions such as dropout and weight regularization, as seen in [21].
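The last of these criteria, stopping once validation performance has not improved for a
given duration, can be sketched as follows. The function name `early_stopping_epoch`, the
`patience` parameter and the list of validation losses are hypothetical; the sketch tracks
a validation loss (lower is better) rather than an accuracy, which is an assumption of
this example.

```python
def early_stopping_epoch(val_losses, patience=3):
    """Return (stop_epoch, best_epoch) for a sequence of per-epoch validation losses.

    Training stops once the loss has failed to improve for `patience` epochs in a row.
    """
    best_loss = float("inf")
    best_epoch = -1
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch, best_epoch
    return len(val_losses) - 1, best_epoch

# Hypothetical validation losses recorded after each training epoch.
losses = [1.0, 0.8, 0.7, 0.72, 0.71, 0.75, 0.6]
stop_epoch, best_epoch = early_stopping_epoch(losses, patience=3)
print(stop_epoch, best_epoch)
```

In practice the model parameters saved at `best_epoch` would be restored. Note that with
a patience of 3 the later improvement at the final epoch is never seen, which illustrates
why the choice of criterion is often made in an ad-hoc fashion.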
Data augmentation is commonly applied to image datasets when training convolutional
deep neural networks [22]. The goal is to increase the variance of the training
dataset’s distribution in the hope of learning a more generalizable representation of the data.
Some data augmentation function $f(\cdot)$ is applied on an input training sample x such that
f(x) yields a similar but new training sample. Examples of data augmentation functions
for image data include flipping, colour space transformation, cropping, rotation,
translation, noise injection, kernel filter application, image mixing, and random erasing [22]. In
the majority of training cases, it is beneficial to use data augmentation to gain several
more points in task performance. However, if the task to be performed contains instances
that are outside the distribution of the training dataset, data augmentation can do
little to help with those cases. In oncology, disease appearance can be highly
variable, which leads to issues when medical imaging datasets are small and the desired
method for some computer vision task involves deep neural networks. In these situations,
data augmentation can do very little. In recent years, there has been the introduction of
Generative Adversarial Networks (GANs) which consist of a framework for estimating
generative models by simultaneously training a generative model that captures the data
distribution and a discriminative model that estimates the probability of a sample having
come from the training data rather than the generative model [23]. Another data aug-
mentation strategy consists of using GANs as a way to generate new training data, and
one could generate new images of tumour instances. However, the issue is that we don’t
know the relationship between tumour appearance and their clinical outcome, so even
if we generated new instances, we wouldn’t be sure how to classify them. Additionally,
this adds an extra layer of complexity, and the method hasn’t been shown to dramatically
increase performance. In [24], the authors used a Generative Adversarial Network to
generate new tumour segmentation instances for the training set, but this only achieved
a 3.4% increase in segmentation performance. Applying data augmentation
within the medical imaging world is not straightforward, nor are there any established standards.
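For the simple, non-generative augmentation functions listed earlier, a minimal NumPy
sketch of an augmentation function f(x) is given below. The function name `augment`, the
flip probability and the noise level are illustrative assumptions, not settings taken from
any cited work.

```python
import numpy as np

def augment(image, rng, noise_std=0.01):
    """Return a randomly flipped, rotated and slightly noisy copy of a square 2D image."""
    out = image
    if rng.random() < 0.5:
        out = np.fliplr(out)                      # random horizontal flip
    quarter_turns = int(rng.integers(0, 4))       # 0, 1, 2 or 3 rotations by 90 degrees
    out = np.rot90(out, quarter_turns)
    out = out + rng.normal(0.0, noise_std, size=out.shape)  # noise injection
    return out

rng = np.random.default_rng(0)
image = np.arange(64, dtype=float).reshape(8, 8)  # toy 8 x 8 image
augmented = augment(image, rng)
print(augmented.shape)
```

Flips and rotations only rearrange existing pixels, which is why such functions cannot
produce samples that lie outside the distribution of the original dataset.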
Weight regularization aims to keep a neural network from overfitting by penalizing
large weight values within the network. Regularization terms are penalties or
constraints added to the loss function. Some examples of common terms and their
penalization of the cost are the following:
$$ \text{L1 Regularization Term} = \lambda \sum_{i=0}^{M} |W_i| $$
$$ \text{L2 Regularization Term} = \lambda \sum_{i=0}^{M} W_i^2 $$
$$ \text{Cost} = \text{Loss}(y, \hat{y}) + \text{Regularization Term} $$
Where $W_i$ is the value of a particular weight and λ is a hyperparameter which
controls the regularization’s effect on the loss. By adding a weight penalty, the overall cost
increases with the magnitude of the weights, so the optimizer, such as Stochastic Gradient
Descent, is pushed to shrink the weights of the network while minimizing the loss. The
penalty term contributes to the error gradient with respect to each weight, which results in
an additional change in every weight update. With this contribution to the error gradient,
the larger weight values are decreased, thus stabilizing the network. Within the vision
domain, the L1 regularization term produces features which are spatially localized and
results in most weight values being near zero, whereas L2 regularization produces fea-
tures with higher spatial variance and allows weight values to grow further away from
zero [25]. Therefore, L1 regularization is preferred with datasets which have spatially lo-
cal features, such as targeting objects within a large scene, whereas, L2 regularization is
preferred with datasets which have spatially global features, such as targeting objects or
scenes that make up an entire image [25]. As an ML practitioner, one should experiment
to decide which regularization term to use. Weight regularization is popularly used and
can even be found in training Compositional Networks [11], [26].
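The regularization terms and the resulting cost can be written out directly. The sketch
below is a minimal NumPy illustration; the function names, the toy weight vector and the
value of λ are assumptions made for the example.

```python
import numpy as np

def l1_penalty(weights, lam):
    """L1 regularization term: lambda times the sum of absolute weight values."""
    return lam * np.sum(np.abs(weights))

def l2_penalty(weights, lam):
    """L2 regularization term: lambda times the sum of squared weight values."""
    return lam * np.sum(weights ** 2)

def l1_gradient(weights, lam):
    """Gradient of the L1 term: constant magnitude, pointing away from zero."""
    return lam * np.sign(weights)

def l2_gradient(weights, lam):
    """Gradient of the L2 term: grows linearly with the weight value."""
    return 2.0 * lam * weights

weights = np.array([0.5, -2.0, 0.0, 1.5])  # toy weights
lam = 0.1                                  # toy regularization strength
data_loss = 1.0                            # stand-in for Loss(y, y_hat)

cost_l1 = data_loss + l1_penalty(weights, lam)
cost_l2 = data_loss + l2_penalty(weights, lam)
print(cost_l1, cost_l2)
```

The two gradient functions show where the penalties differ: the L2 term pushes large
weights towards zero more strongly than small ones, whereas the L1 term pushes every
non-zero weight with the same strength, which tends to drive many weights to zero.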
Dropout is a regularization function employed within the neural network during
training. It’s been shown that backpropagation builds up co-adaptations between neurons
that work for the training data but do not generalize well to unseen data at test time [26].
Co-adaptation occurs when some neurons learn to depend heavily on others.
If the neurons being depended upon receive a “bad” input, the dependent neurons will be
affected as well, which degrades the model’s performance. This is ultimately the behaviour
seen in a model that has suffered from overfitting, and it is this phenomenon that we try
to prevent. Dropout deactivates a random subset of neurons at a given layer during
training, preventing units from co-adapting too much [26]. Neurons that
have dropout applied to them have some probability p of being zeroed out
during training, but during testing they are always active. Intuitively, applying dropout
to a network results in sampling a thinned version of the network, which helps reduce
co-adaptation during training. Given n neurons, there exist $2^n$
thinned versions of the network, and for each presentation of a training sample, a new
thinned version of the network is sampled and trained. By doing this sampling, $2^n$
networks with shared weights can be combined into a single averaged neural network at test
time. Results have shown that when training CNNs on several large-scale image classi-
fication tasks such as ImageNet, CIFAR-10, and CIFAR-100, state-of-the-art results used
dropout [26].
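A minimal NumPy sketch of dropout is given below. It uses the commonly seen "inverted"
variant, which rescales the surviving activations during training so that nothing needs
to change at test time; this rescaling choice, the function name `dropout` and the toy
activations are assumptions of the example rather than a description of the method in [26].

```python
import numpy as np

def dropout(x, p, rng, training=True):
    """Zero each activation with probability p during training (inverted dropout)."""
    if not training or p == 0.0:
        return x                              # at test time every neuron stays active
    keep_mask = rng.random(x.shape) >= p      # True with probability 1 - p
    return x * keep_mask / (1.0 - p)          # rescale so the expected value is unchanged

rng = np.random.default_rng(0)
activations = np.ones((1000, 100))            # toy layer activations

train_out = dropout(activations, p=0.5, rng=rng, training=True)
test_out = dropout(activations, p=0.5, rng=rng, training=False)
print(train_out.mean(), test_out.mean())
```

Each call with `training=True` draws a new mask, which corresponds to sampling a new
thinned version of the network for each presentation of a training sample.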
2.3.4 Activation and Pooling Functions
Here we describe Activation and Pooling functions. As a disclaimer, this section is not
directly relevant to our main contribution, but readers may find it interesting since
Activation and Pooling functions are found in almost all artificial neural networks,
including all the neural network architectures discussed in our work.
Activation Functions (AFs) are used in neural networks to compute the weighted sum
of input and biases, which is then used to decide whether a neuron can be fired or not,
effectively controlling the neural network’s outputs [27]. These AFs are differentiable,
hence allowing for Backpropagation learning, and can either be linear or non-linear. A
linear mapping is given by an affine transformation, as in most cases [28]:
$$ y = f(x) = w^T x + b $$
Where x is the input feature vector, w are the AF’s weights, and b is a bias vector. The
output from each layer is fed into the next layer for multilayered networks until a final output
is obtained. We’re more interested in non-linear AFs. The main advantage of using
non-linear AFs over linear ones is that they enable estimating non-linear prediction functions,
as opposed to only linear ones. This aids in learning high-order polynomial prediction
functions for the neural networks. This drastically enlarges the hypothesis search space,
allowing the network to model a wider range of prediction tasks. Some of the most common
examples of non-linear AFs include the Sigmoid, Hyperbolic Tangent (Tanh), Rectified
Linear Unit (ReLU), and Softmax functions.
The Sigmoid AF, also referred to as the logistic or squashing function [29], is defined
as:
$$ f(x) = \frac{1}{1 + e^{-x}} $$
The Sigmoid AF is a simple AF which can be used for binary classification in an ANN’s
prediction layer or for hidden layer neuron activation. However, it suffers from sharp
gradients during backpropagation from deep hidden layers, gradient saturation, and slow
convergence [27]. The Hyperbolic Tangent function was proposed to remedy some of the
Sigmoid AF’s drawbacks.
The Hyperbolic Tangent (Tanh) is defined as:
$$ f(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}} $$
The Tanh AF is smoother than the Sigmoid function, and gives better training
performance for multi-layered networks than the Sigmoid function by producing
zero-centred outputs, thereby aiding backpropagation [30]. However, a notable drawback
of the Tanh function is that it is expensive to compute because of its exponential
and division terms [27]. This led to the development of the Rectified Linear Unit (ReLU).
The Rectified Linear Unit (ReLU) is proposed in [31] and defined as:
$$ f(x) = \max(0, x) = \begin{cases} x, & x \geq 0 \\ 0, & x < 0 \end{cases} $$
The function rectifies any signal below a value of 0 and sets it to 0, whereas the signal
maintains its value if its value is equal to 0 or greater. ReLU offers faster convergence,
and better performance and generalization compared to the Sigmoid and Tanh AFs [32].
It is nearly a linear function, therefore preserving the ease of optimization with
gradient-descent learning algorithms [32]. The gradient computation is faster because it
does not need to compute any exponential or division terms. Thanks to its advantages, the ReLU
AF has been the most widely-used AF, is found in many state-of-the-art results [32] and is
found in the majority of neural network architectures discussed in our work. A noticeable
drawback of the ReLU AF is that it can create dead neurons, units that never activate and
whose weights are no longer updated, because its gradient is zero for negative inputs. This
has led to the development of the leaky ReLU, an AF that we won’t detail here because it’s
out of the scope of our work. We encourage the curious reader to learn more about it.
The Softmax AF is used to compute a probability distribution from a real-numbered
vector and is defined as:
$$ f(x_i) = \frac{e^{x_i}}{\sum_{j=1}^{K} e^{x_j}} $$
Where x is a vector of K real numbers. Every value within the output vector is within the
range of 0 to 1, with all values summing to 1. It’s popularly used in multi-class
classification and segmentation models, with the final class for an entire image or a select
pixel being the one with the highest probability. The Sigmoid and Softmax share similarities
in their use: the Sigmoid function is used for binary classification, whereas the Softmax
function can be used for multiple classes.
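The four non-linear AFs defined above can be written in NumPy as follows. This is a
minimal sketch for illustration; subtracting the maximum inside `softmax` is a standard
numerical-stability trick added here and is not part of the definition above.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    return (np.exp(x) - np.exp(-x)) / (np.exp(x) + np.exp(-x))

def relu(x):
    return np.maximum(0.0, x)

def softmax(x):
    shifted = x - np.max(x)          # does not change the result, avoids overflow
    exps = np.exp(shifted)
    return exps / np.sum(exps)

scores = np.array([-2.0, 0.0, 3.0])  # toy pre-activation values

print(sigmoid(scores))   # values in (0, 1)
print(tanh(scores))      # values in (-1, 1), zero-centred
print(relu(scores))      # negative values are set to 0
print(softmax(scores))   # non-negative values that sum to 1
```

Taking `np.argmax(softmax(scores))` gives the predicted class in the multi-class setting
described above.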
Activation functions are important components of neural network architectures and
can be found in all the architectures we discuss in this work, especially those reviewed in
Section 2.3.6.
Pooling, also known as sub-sampling or down-sampling, is used to transform joint
feature representations into spatially reduced ones that preserve important information and
either disregard irrelevant details, in the case of min and max-pooling, or blend details
with features of interest, in the case of average-pooling [33]. Its primary purpose is to
reduce the number of features passed on to later layers, to prevent an exponential increase in
the number of features in a deep network. Pooling can allow for spatial position
invariance, lighting invariance, robustness to clutter and compactness of representation [33]. It
can also be used to increase the receptive field of a neural network for a given input
image. In the case of CNNs, the feature maps resulting from convolutional operations can
have high dimensionality, which may cause overfitting when a classifier is applied
[34]. To reduce the overall size of the signal, a max, min or average pooling function can
be used. For example, in [35], a CNN architecture was used for object classification and a
2-dimensional average pooling layer was proposed:
$$ \text{Output}(C_j, h, w) = \frac{1}{k_{height} \, k_{width}} \sum_{m=0}^{k_{height}-1} \sum_{n=0}^{k_{width}-1} \text{Input}(C_j, \text{stride}[0] \times h + m, \text{stride}[1] \times w + n) \tag{2.4} $$
Where Output and Input are the output and input feature maps respectively, $k_{height}$, $k_{width}$
are the kernel height and width, $C_j$ is a given channel within the input feature map,
stride[0], stride[1] are the strides of the window along the height and width dimensions
respectively, and h, w are the row and column indices of a position in the output feature map.
Input($C_j$, stride[0] × h + m, stride[1] × w + n) selects the pixel value at channel $C_j$ and
position (stride[0] × h + m, stride[1] × w + n). This layer performs a local averaging of
neighbouring pixels within a given kernel, thereby reducing the feature’s spatial position
precision, which helps enable spatial invariance, robustness against noise and more compact
feature representation. A 2-dimensional max-pooling operation, by contrast, would replace the
averaging operation in Equation 2.4 with respective max operations:
$$ \text{Output}(C_j, h, w) = \max_{m=0,\dots,k_{height}-1} \; \max_{n=0,\dots,k_{width}-1} \text{Input}(C_j, \text{stride}[0] \times h + m, \text{stride}[1] \times w + n) \tag{2.5} $$
Instead of computing the average value within the given window of pixels, the max-
pooling function selects the maximum pixel value. A min-pooling operation would sim-
ply replace the max operators with min and select the minimum pixel value within its
respective window. Within the computer vision context, the choice of pooling operation
depends on the dataset at hand. Max-pooling selects the brightest pixel within each window,
and is useful when the objects of interest are brighter than the background and the ML
practitioner wishes to entirely disregard background information. Min-pooling is helpful
for the opposite reason, and selects features that are darker than those around them.
Average-pooling smooths out the background and object features, and is useful when we desire
to keep information from the background. Pooling enables compact feature representation
and higher-level feature extraction.
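The average, max and min pooling operations can be sketched for a single channel as
follows. The function name `pool2d` and the 4 × 4 toy feature map are assumptions for this
example; the window indexing mirrors the stride × position + offset pattern of
Equations 2.4 and 2.5.

```python
import numpy as np

def pool2d(x, kernel=(2, 2), stride=(2, 2), mode="avg"):
    """Single-channel 2D pooling with an average, max or min reduction."""
    reducers = {"avg": np.mean, "max": np.max, "min": np.min}
    reduce_fn = reducers[mode]
    kh, kw = kernel
    out_h = (x.shape[0] - kh) // stride[0] + 1
    out_w = (x.shape[1] - kw) // stride[1] + 1
    out = np.empty((out_h, out_w))
    for h in range(out_h):
        for w in range(out_w):
            window = x[stride[0] * h:stride[0] * h + kh,
                       stride[1] * w:stride[1] * w + kw]
            out[h, w] = reduce_fn(window)
    return out

feature_map = np.array([[1.0, 2.0, 5.0, 6.0],
                        [3.0, 4.0, 7.0, 8.0],
                        [9.0, 10.0, 13.0, 14.0],
                        [11.0, 12.0, 15.0, 16.0]])

avg_pooled = pool2d(feature_map, mode="avg")  # [[2.5, 6.5], [10.5, 14.5]]
max_pooled = pool2d(feature_map, mode="max")  # [[4, 8], [12, 16]]
min_pooled = pool2d(feature_map, mode="min")  # [[1, 5], [9, 13]]
```

All three variants halve the spatial resolution here; they differ only in which value is
kept from each 2 × 2 window.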
Pooling functions are ubiquitous in neural network architectures and are found in the
ones that we describe later in Section 2.3.6.
Un-Pooling approximately reverses the max-pooling operation. We discuss it
here because it is seen in several neural network architectures. The goal of un-pooling
is to map a feature map to a higher spatial resolution, which is useful when some
encoded low-dimension feature map needs to be up-sampled to the same size as the input
image, as seen in segmentation tasks [21]. Un-pooling selects an element at position p of
the input feature map, places it at a position within the kernel’s sub-region of a
higher-resolution feature map z, and sets every other element within that sub-region to 0.
There exists a non-uniqueness problem with this operation: the selected position in the z
feature map is arbitrary within the kernel sub-region. One solution is to pair every
un-pooling operation with a prior respective pooling operation, which saves the indices of
the pooled values. These indices are called switches and are denoted as s. Then, during the
un-pooling process, the switches s specify the positions where the elements taken from p will
be placed in z. We will see the un-pooling operation used in some of the neural network
architectures discussed in this work.
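The pairing of max-pooling with un-pooling through switches can be sketched as follows.
The function names `max_pool_with_switches` and `unpool`, the non-overlapping k × k
windows and the toy feature map are assumptions made for this illustration.

```python
import numpy as np

def max_pool_with_switches(x, k=2):
    """k x k max-pooling with stride k that also records where each maximum came from."""
    out_h, out_w = x.shape[0] // k, x.shape[1] // k
    pooled = np.empty((out_h, out_w))
    switches = np.empty((out_h, out_w, 2), dtype=int)  # (row, col) in the input
    for h in range(out_h):
        for w in range(out_w):
            window = x[h * k:(h + 1) * k, w * k:(w + 1) * k]
            r, c = np.unravel_index(np.argmax(window), window.shape)
            pooled[h, w] = window[r, c]
            switches[h, w] = (h * k + r, w * k + c)
    return pooled, switches

def unpool(pooled, switches, out_shape):
    """Place each pooled value back at its recorded position; everything else is 0."""
    z = np.zeros(out_shape)
    for h in range(pooled.shape[0]):
        for w in range(pooled.shape[1]):
            r, c = switches[h, w]
            z[r, c] = pooled[h, w]
    return z

feature_map = np.array([[1.0, 2.0, 5.0, 6.0],
                        [3.0, 4.0, 7.0, 8.0],
                        [9.0, 10.0, 13.0, 14.0],
                        [11.0, 12.0, 15.0, 16.0]])

pooled, switches = max_pool_with_switches(feature_map)
restored = unpool(pooled, switches, feature_map.shape)
print(restored)
```

The restored map has the original resolution, with each maximum at its original position
and zeros elsewhere, so the switches resolve the non-uniqueness problem described above.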
2.3.5 Non-Medical Imaging Applications
The first pioneering work in CNNs was LeNet which was designed for handwritten digit
classification without needing image pre-processing [36]. However, due to the lack of
training data and computing power, this design failed to generalize to more complex
problems. With the introduction of large internet datasets and greater compute power, CNNs
started gaining traction for more complex imaging applications. CNNs first gained
popularity in image classification [4]. Since then, CNNs have made their way to
a multitude of other computer vision tasks such as pose estimation [37], object detection
[38], object segmentation [39], visual saliency detection [40], action recognition [41], scene
labelling [42], and more [43].
Thanks to recent innovations in the deep learning field, and in compute and data
resources, deep learning has delivered promising results in medical imaging analysis tasks,
examples of which we discuss more thoroughly in Section 2.3.6.
2.3.6 Medical Imaging Applications
Deep learning techniques have been introduced for medical imaging analysis with en-
couraging results in segmentation, registration and image enhancement applications. As
for segmentation, there have been successful segmentation approaches for lungs from
chest X-Rays [7], [44] discussed in Section 2.3.6, tumours and brain structures [21], [45]
discussed in Section 2.3.6, biological cells and membranes [6] discussed in Section 2.3.6,
knee cartilage [46], pancreas [47], bone tissue [48] and cell mitosis [49].
Figure 2.3: Encoder-Decoder CNN Architecture for Lung Segmentation [7]
Lung X-Ray Segmentation
Figure 2.4: Two-Pathway CNN Architecture (TwoPathCNN) for Brain Tumour Segmentation [21]
The first exploratory stage of research and development on lung segmentation in X-Ray
chest images using a purely deep learning method was presented by [7]. The purpose of
their study was to examine the ability of deep learning methods and Encoder-Decoder
CNNs to segment lung components in chest X-Ray images. Their CNN architecture
consists of an encoder and a decoder component. The encoder component’s purpose is to
encode the input images into low-resolution feature maps. The feature maps are fed to
the decoder network, which maps the low-resolution feature maps to full input resolution
feature maps for pixel-wise classification. Their architecture is visualized in Figure 2.3. It
consists of Convolutional, Batch Normalization, ReLU, Pooling, UpSampling, and SoftMax
layers, a common approach to Encoder-Decoder networks that we’ll see again when
discussing U-Net, one of the most popular segmentation networks in medical imaging.
The non-uniqueness problem arises in the un-pooling layers of the decoder.
To solve this problem, the authors stored and used the max-pooling indices from
the corresponding encoder layer. Two pre-processing techniques are performed to reduce
the influence of X-Ray intensity variation: intensity transformation using histogram
equalization [50], and a Local Contrast Normalization [51]. Their data set consists of 354 X-Ray
chest images originating from two different database sources: an online tuberculosis
portal [52] and the open Japanese JSRT database [53]. It was hoped that the use of
inhomogeneous data sets acquired from different sources would help obtain more
objective and conclusive testing results. Training is performed with Stochastic Gradient
Descent. When testing, the average accuracy was estimated as a Dice score of 0.962, with
minimum and maximum values being 0.926 and 0.974 respectively.
$$ \text{Dice} = \frac{|S \cap T|}{|S \cup T|} $$
Where T is the set of pixels within the ground-truth lung area from manual segmentation
and S is the set of pixels within the area obtained from automatic segmentation.
Unfortunately, given that the authors used a dataset consisting of data from two different sources,
there are no prior results to compare their methodology to on this particular mixed
dataset. Comparisons to prior non-deep-learning results would have shown more clearly
whether their methodology was as promising as they set it out to be. Fortunately, there
have been non-deep-learning methods prior to this work that were evaluated on JSRT,
one of the source datasets. In particular, [54] proposed a novel unsupervised method for
lung segmentation which achieved a Dice score of 0.958. This comparison helps highlight
that a deep learning method such as the Encoder-Decoder CNN could achieve promising
results, but had not yet achieved state-of-the-art results at the time.
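Overlap scores of this kind are computed directly from binary masks. The sketch below
computes the ratio exactly as written above (intersection over union) and, for
comparison, the form in which the Dice coefficient is commonly stated,
2|S ∩ T| / (|S| + |T|). The function names and the toy masks are assumptions of this
example, and which of the two forms was used in [7] is not something the sketch settles.

```python
import numpy as np

def overlap_ratio(s, t):
    """|S ∩ T| / |S ∪ T| for two binary masks, as in the formula above."""
    s, t = s.astype(bool), t.astype(bool)
    union = np.logical_or(s, t).sum()
    return np.logical_and(s, t).sum() / union if union > 0 else 1.0

def dice_coefficient(s, t):
    """2|S ∩ T| / (|S| + |T|), the commonly stated form of the Dice coefficient."""
    s, t = s.astype(bool), t.astype(bool)
    total = s.sum() + t.sum()
    return 2.0 * np.logical_and(s, t).sum() / total if total > 0 else 1.0

# Toy 4 x 4 masks: ground truth T and automatic segmentation S overlap in 2 pixels.
truth = np.zeros((4, 4), dtype=int)
truth[0:2, 0:2] = 1
predicted = np.zeros((4, 4), dtype=int)
predicted[0:2, 1:3] = 1

print(overlap_ratio(predicted, truth), dice_coefficient(predicted, truth))
```

Both scores are 1 for a perfect segmentation and 0 when the masks do not overlap, but
they take different values in between, so reported scores are only comparable when the
same form is used.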
(a) Cascaded architecture, using input concatenation (InputCascadeCNN).
(b) Cascaded architecture, using local pathway concatenation (LocalCascadeCNN).
(c) Cascaded architecture, using pre-output concatenation, which is an architecture with
properties similar to that of learning using a limited number of mean-field inference
iterations in a CRF (MFCascadeCNN).
Figure 2.5: Cascaded TwoPathCNN Architectures
Magnetic Resonance Brain Tumour Segmentation
In [21], a fully automatic, DNN-based brain tumour segmentation method is proposed. It is
tailored to glioblastomas, an aggressive type of brain tumour, as pictured in Magnetic
Resonance (MR) images. Glioblastomas can vary wildly in shape, size, contrast, and position
in the brain. Their variance in appearance motivates the exploration of an efficient ML solution
that exploits DNNs. At the time, CNNs had already been successfully applied to
common vision segmentation problems and several showed promise in brain tumour
segmentation. A prior promising CNN brain tumour segmentation method in [55] divides
the 3-dimensional MR images into 2-dimensional slices and trains a CNN to predict the
class of its centre pixel. The methods proposed in [21], in contrast, expand on and surpass the
prior work by using a two-pathway CNN architecture referred to as TwoPathCNN and a
framework for cascading CNNs. The TwoPathCNN architecture is made up of two paral-
lel feature-extraction paths and is visualized in Figure 2.4. The pathways consist of: one
path with smaller 7 × 7 and 3 × 3 receptive fields referred to as the local pathway and
another with larger 13× 13 receptive fields referred to as the global pathway. The authors
motivate this design by wanting to take account of the fine-grained visual details around
a given pixel and its larger context. They believe this cascaded use of pathways would
increase segmentation performance because of the additional information available for
prediction. The outputs of the two pathways are concatenated with the help of the 3 × 3
receptive field in the local pathway. The resulting concatenated feature map is fed to a
prediction layer consisting of a coupled Convolutional 21 × 21 filter and Softmax
activation layer. The Convolutional filters in both pathways make use of the Maxout
activation function proposed in [56]. Maxout differs from max-pooling in that it selects
the maximum value at each position over multiple feature maps, as opposed to pooling,
which selects the maximum value in a sub-window.
Given a set of K feature maps O:
$$ Z_{s,i,j} = \max\{O_{s,i,j}, O_{s+1,i,j}, \dots, O_{s+K-1,i,j}\} $$
The result of the maxout operation is the Zs,i,j feature map where, at each spatial posi-
tion i, j, the value within the Z feature map is the maximum value across the given set
of feature maps O at that respective spatial position. It has been shown to be effective at
modelling useful features from different feature maps [56], hence allowing the authors to
experiment with concatenating or cascading different feature maps together. Training is
performed using Stochastic Gradient Descent to maximize the probability of all labels in
the training set, or equivalently, to minimize the negative log-probability
$-\log p(Y|X) = \sum_{i,j} -\log p(Y_{i,j}|X)$,
where Y is the set of pixel labels and X is the input brain slice. To improve parameter
optimization, the authors implemented a momentum strategy, a training strategy seen
before in [4], which uses temporally averaged gradients to damp the optimization velocity:
$$ v_{i+1} = \mu v_i - \alpha \nabla w_i $$
$$ w_{i+1} = w_i + v_{i+1} $$
Where $w_i$ is the CNN’s parameters at the ith iteration, $\nabla w_i$ is the gradient of the loss
function at $w_i$, v is the integrated velocity initialized at zero, α is a learning rate, and
µ is a momentum coefficient. µ is gradually increased during training, starting at 0.5
and increasing to 0.9. Intuitively, momentum allows building velocity in a certain direc-
tion within the parameter search space which helps reduce gradient oscillations, thereby
ultimately improving convergence. L1 and L2 weight regularization, dropout, and early-
stopping are used to prevent overfitting since the authors believed their training dataset
did not contain enough training samples. The work proposes three different cascaded
CNN architectures visualized in Figure 2.5. Each of the three architectures feeds a feature
map extracted from one TwoPathCNN, without its prediction layer, into a second TwoPathCNN
at a different point: at the input image as seen in Figure 2.5a, at the local pathway’s
feature map as seen in Figure 2.5b, or at the pre-output concatenation as seen in Figure 2.5c.
The proposed architectures were tested on the BRATS 2013 challenge
dataset [57], comprising 3 sub-datasets of real patient brain tumour data. The goal is
to correctly classify 5 different segmentation labels in each brain slice: non-tumour,
necrosis, edema, non-enhancing tumour, and enhancing tumour. The training set consists of 30
patients, all with pixel-wise accurate segmentations, and the test set contains 10 other
patients. Each patient data instance is a 3D volume of MR slices, each available in
four different modalities. The authors decided to work with 2D slices because the MR
volumes do not possess an invariant resolution and the spacing in the third dimension
is inconsistent across volumes. The TwoPathCNN achieved a Dice score of 0.85, and
the cascaded architectures InputCascadeCNN, MFCascadeCNN, and LocalCascadeCNN
achieved scores of 0.88, 0.86, and 0.88 respectively, whereas the prior CNN method in [55]
achieved a score of 0.837. The proposed architectures show how cascading and context
information can help achieve higher segmentation accuracy. However, the added
complexity for increases of only 1.3 to 4.3 percentage points in Dice score leaves one to
wonder whether a different approach could do better.
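Two of the operations described in this section, the maxout activation and the momentum
update, can be sketched in NumPy as follows. The function names, the toy feature maps,
the quadratic toy loss and the linear schedule that raises µ from 0.5 to 0.9 are all
assumptions made for illustration; they are not the settings or the implementation used
in [21].

```python
import numpy as np

def maxout(feature_maps):
    """Element-wise maximum over K feature maps of shape (K, H, W)."""
    return np.max(feature_maps, axis=0)

def momentum_step(w, v, grad, lr, mu):
    """One momentum update: v <- mu * v - lr * grad, then w <- w + v."""
    v_next = mu * v - lr * grad
    w_next = w + v_next
    return w_next, v_next

# Maxout over K = 2 toy feature maps of size 2 x 2.
maps = np.array([[[1.0, 5.0], [3.0, 0.0]],
                 [[2.0, 4.0], [1.0, 7.0]]])
z = maxout(maps)  # [[2, 5], [3, 7]]

# Momentum on a toy loss f(w) = 0.5 * ||w||^2, whose gradient is simply w.
w = np.array([4.0, -2.0])
v = np.zeros_like(w)                      # velocity initialized at zero
for i in range(200):
    mu = 0.5 + 0.4 * min(i / 100.0, 1.0)  # hypothetical schedule from 0.5 to 0.9
    w, v = momentum_step(w, v, grad=w, lr=0.1, mu=mu)

print(z, np.linalg.norm(w))
```

Unlike pooling, `maxout` leaves the spatial resolution unchanged and reduces only the
number of feature maps, while the momentum loop drives the toy parameters towards the
minimum at zero.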
Electron Microscopy Neural Structure Segmentation
One of the most influential models in medical imaging segmentation is the U-Net model
proposed in [6]. Segmentation of Electron Microscopy membranes has been done be-
fore using deep learning in [58]. The previous method in [58] is slow because it uses a
sliding window to classify patches one-by-one over the input image. Additionally, it has
trouble balancing segmentation accuracy, where a smaller window allows for higher
accuracy, against the amount of context information made available
to the network, where a larger window allows for more information. The U-Net model
makes it possible to achieve good segmentation accuracy while also making use of as
much context information as it can and making faster predictions. The U-Net model won
the Electron Microscopy ISBI 2012 segmentation challenge in 2015 by a significant mar-
gin. The network architecture consists of an encoder component which makes up the
former half of the U shape, a bottleneck component which makes up the bottom of the
U shape, and a decoder component which makes up the latter half of the U shape. A vi-
28
Figure 2.6: U-Net CNN Architecture for Electron Microscopy Cell Segmentation (example
for 32x32 pixels in the lowest resolution) [6]
sualization of its architecture is available in Figure 2.6. The encoder component consists
of 3 × 3 Convolution filters paired with ReLU activation functions, 2 × 2 max-pooling
layers, and copy-and-crop functions. The encoder extracts a feature map from the input
image, and down-samples or reduces the spatial resolution of the feature map in the process.
Additionally, after every two consecutive applications of the Convolutional filter
and the ReLU activation, a centre-patch is copied-and-cropped from the current feature
map and fed to the decoder component's feature map at the same respective height
in the U shape. The cropping is necessary because border pixels are lost in the encoder’s
Convolutional filters. The authors believed that the decoder would make more accurate
predictions if context information was preserved from the encoder through the use of
the copy-and-crop function. We’ve seen a similar idea in Section 2.3.6 where the authors
in [21], used feature map concatenation from one encoder to another as a means to pro-
vide more context information for prediction. The bottleneck simply applies two more
Convolutional-ReLU operations on the input feature map and up-samples the resulting
feature map to a higher spatial resolution so that it can be fed to the decoder. The
up-sampling operation makes use of the Convolutional-Transpose (up-conv) filter to map its
given input to a higher spatial domain. The decoder component consists of 3 × 3 Con-
volutional filters paired with ReLU activation functions, 2 × 2 Convolutional-Transpose
(up-conv) filters, and a prediction layer consisting of a 1 × 1 Convolutional filter used to
map each component feature vector to the desired number of classes. Training the net-
work consists of optimizing the Cross Entropy loss over the pixel-wise Softmax on the
final feature map:
\[ \text{Loss} = \sum_{x \in \Sigma} w(x) \log\left(p_{l(x)}(x)\right) \]
where
\[ p_k(x) = \mathrm{Softmax}(x) \]
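As an illustration only, the loss above can be sketched in plain Python. We assume, following the U-Net paper [6], that w(x) is a per-pixel weight and l(x) is the true label of pixel x. The function names and the nested-list layout are hypothetical and are not taken from any U-Net implementation. The expression in the text is written without a leading minus sign, so the sketch exposes both the weighted log-likelihood as written and its negation, which is the quantity one would minimize.

```python
import math


def softmax(activations):
    """Softmax over the class activations of a single pixel."""
    peak = max(activations)  # subtract the maximum for numerical stability
    exps = [math.exp(a - peak) for a in activations]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_log_likelihood(activation_map, labels, weights):
    """Sum of w(x) * log p_{l(x)}(x) over every pixel x, as written in the text."""
    total = 0.0
    for row_a, row_l, row_w in zip(activation_map, labels, weights):
        for activations, label, weight in zip(row_a, row_l, row_w):
            total += weight * math.log(softmax(activations)[label])
    return total


def weighted_cross_entropy(activation_map, labels, weights):
    """Negated weighted log-likelihood, so that lower values are better."""
    return -weighted_log_likelihood(activation_map, labels, weights)


# Toy 1 x 2 image with two classes and uniform activations.
toy_activations = [[[0.0, 0.0], [0.0, 0.0]]]
toy_labels = [[0, 1]]
toy_weights = [[1.0, 2.0]]
toy_loss = weighted_cross_entropy(toy_activations, toy_labels, toy_weights)
```

With uniform activations every class has probability 0.5, so the toy loss is the sum of the weights times log 2.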
The model was applied to three different segmentation tasks, each with minimal train-
ing datasets compared to common computer vision datasets such as ImageNet. Given the
tiny training sets, the authors make use of several data augmentation methods: smooth
deformations using random displacement, and dropout. Their method achieved the low-
est Warping Error of 0.000353 out of all methods in the first EM challenge in 2015, where
the second-best algorithm had a score of 0.000355. It was applied to a cell segmentation
task, achieving a 92% Intersection over Union (IoU) score versus the second-best algorithm
with a score of 83%. Finally, it was applied to another cell segmentation task and
achieved an IoU score of 77.5%, whereas the second-best algorithm scored 46%. We
see that U-Net significantly outperformed its competition in the cell segmentation tasks,
and achieved first place in the neural structure segmentation. This style of architecture
heavily influenced the deep learning and medical imaging fields. Ample variations of
the model exist including an Attention U-Net on abdominal datasets for multi-class im-
age segmentation [59], a 3D U-Net for volumetric segmentation [60], a Variational U-Net
for conditional appearance and shape generation [61], and a Multi-Resolution-U-Net for
multimodal biomedical image segmentation [62].
2.4 Dictionary Learning and Pattern Theory
Here we discuss the main theory behind Compositional Models which is the idea of for-
mulating vision as pattern theory [10]. The premise is to describe the world’s visual sig-
nals as patterns that could be modelled, generated and used for inference [63]. To model
these signals, we need to store relevant features and patterns in dictionaries. This dic-
tionary is a key-value pair lookup data structure. Given some key, the dictionary would
look up a value by that key. However, due to the highly complex nature of vision, there
exists an astronomical number of images and objects, so defining a dictionary of all possible
patterns would be impossible. This motivates hierarchies and recursive compositions.
By recursively composing mid and high-level patterns from lower-level equivalents,
the dictionary would only need to store elementary patterns which can be shared across
objects. Therefore, it would be possible to learn this dictionary data structure because
it wouldn’t need to contain every possible image pattern, and it would be compact
thanks to the ability to reuse different components contained within it. We care about
learning this dictionary because it can be used in downstream tasks such as image classi-
fication, and generation. Additionally, by developing a graphical probability structure of
these compositions, we can define a framework to impose geometric constraints so that
patterns make up meaningful object contours and textures. The dictionaries themselves
contain a concatenation of bases, also known as atoms. These bases are elementary image
patterns, extracted from a training set of images, that describe the given set of images.
They can be lines, curves and texture patches. The learned dictionaries
could then be used for tasks such as face recognition [64], [65], image classification [66],
and numerous other image processing applications.
Figure 2.7: Hierarchical Compositional Model Representing a Horse [10]
So, it’s enticing to learn how to correctly choose the dictionary to represent the data.
There are numerous ways of doing so,
by building it with linear or locally linear structure, or by explicitly optimizing various
measures of how informative the dictionary is via some energy functions [67]. This idea
of dictionary building of bases and compositions of bases forms one of the fundamental
ideas behind developing Compositional Models.
2.5 Compositional Models
2.5.1 Hierarchical Compositional Models
In this section, we describe the graph structure, learning algorithm and inference algo-
rithm of Hierarchical Compositional Models (HCMs) as presented by Zhu et al. [10].
Their composition of parts highlights the theme of compositionality that is prevalent in
Compositional Networks. An HCM is a tree data structure consisting of image features
represented within nodes and relationships between nodes modelled by edges. The goal
of an HCM is very similar to that of Dictionary Learning discussed in Section 2.4: to
capture elementary patterns that describe a set of images, and recursively
compose them into increasingly larger and more complex parts as they move
up the hierarchy of parts. These complex parts are representative of meaningful features
and objects that are contained within the set of training images. The trained HCM can
be used for downstream tasks such as object localization, object classification, and ob-
ject generation. It requires less training data than a CNN to be able to learn meaningful
features for object classification, thereby highlighting one of the advantages of generative
models over discriminative ones. We will describe the HCM’s graph structure, state
variables, and probability distributions for geometric constraints. Their mathematical
formulations form the basis for many other Compositional Models seen in the literature and
their idea of compositionality forms the fundamental idea for Compositional Networks.
One particular version of an HCM, which is discussed in this review, is defined by a
sextuplet (V, E, ψ, φ, λP, λD) which specifies its graph structure’s vertices and edges (V, E), its
feature functions (ψ, φ) and its parameters (λP, λD).
Graph Structure
An HCM’s structure is strongly inspired by the Stochastic Context Free Grammar (SCFG)
structure as seen in the field of Natural Language Processing (NLP) [68]. In NLP, a sen-
tence can be decomposed into intermediate phrase parts and words that make up an
SCFG. The SCFG is a tree structure, where the root node is a sentence, intermediary nodes
are non-terminals (parts of sentence such as verb phrases, noun phrases, etc.), leaf nodes
are terminals (i.e. words extracted from the sentence), and edges between nodes are rules
set by the grammar. These rules are sampled from a probability distribution.
Similarly, an HCM’s structure is defined as a tree structure with a root node R, ver-
tices VR, and edges ER. Within the vision context, the root node represents an object, in-
termediary nodes are compositions of parts and contours, and leaf nodes are elementary
tokens extracted from the image. Examples of elementary tokens include lines, corners,
and curves. Edges represent geometric constraints between parts. The edges include
parent-child relations, however it’s possible to have child-child constraints (i.e. closed-
loop constraints). An example of a geometric constraint includes the distance between
two different tokens represented by the children nodes. Figure 2.7 illustrates an exam-
ple of an HCM structure representing a horse. In [10], each node is restricted to having at
most one parent. The children of a node u are defined as ch(u). Leaf nodes $V_R^{leaf}$ are nodes
without any children. The level l of a given node in an HCM is defined as the number
of hops away from a leaf node. So, a leaf node is of level 0, its parent is of level 1, and
so on. This is the general structure. However, there is a notable limitation in the SCFG
structure: it ignores spatial positions and other attributes such as colour, size, etc. Hence,
state variables are introduced to represent these properties.
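The tree structure and the notion of a node's level can be sketched as follows. This is a minimal illustration and not the data structure used in [10]: the Node class and the part names are hypothetical, and for trees whose leaves sit at different depths we assume that the level counts the hops to the farthest leaf below the node, which the definition above leaves open.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """A part in the hierarchy; leaves stand for elementary tokens."""
    name: str
    children: List["Node"] = field(default_factory=list)


def level(node):
    """Level 0 for a leaf, otherwise one more than the highest child level (assumption)."""
    if not node.children:
        return 0
    return 1 + max(level(child) for child in node.children)


# Hypothetical two-level composition: tokens -> head -> horse.
head = Node("head", [Node("curve"), Node("line")])
horse = Node("horse", [head, Node("back-contour")])
```

Here the tokens are at level 0, the head is at level 1, and the root is at level 2.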
State Variables
Each node u ∈ VR has a set of state variables wu. These state variables correspond to
the pose (scale and orientation) or attributes (colour, size) of their parts. All nodes are
required to have the same state variable types. So if a parent’s state variable contains the
pose of a part, its children’s state variables contain more fine-grained pose information of
their parts.
Probability Distributions
The probability distribution over a graph’s state variables is a Gibbs distribution, whose
energy is the sum of the data and prior potential functions. They are used to model the
size and orientation of every node’s feature contents. During learning and inference, these
distributions are optimized so that the HCM best fits the input data. The data potentials
of the form $\lambda^D_u \phi_u(w_u, I)$ relate the state of node u to the image I. The prior potential functions,
of the form $\lambda^P_u \psi_{u,ch(u)}(w_u, w_{ch(u)})$ or $\lambda^P_u \psi_{u,v}(w_u, w_v)$ depending on whether children
are linked or not, impose statistical constraints on the states of nodes in a clique. For example,
two forms of the prior potential functions involve representing $\psi^{AND}_{u,ch(u)}(w_u, w_{ch(u)})$
as an AND-function between children and $\psi^{OR}_{u,ch(u)}(w_u, w_{ch(u)})$ as an OR-function between
children. These are used if the parent node selects different children (OR) or if it selects
all children (AND). Finally, the probability distribution described as a Gibbs distribution
is the following:
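\[ P(W_R \mid I) = \frac{1}{Z} \exp\{-E(W_R, I)\} \tag{2.6} \]
\[ \text{where } E(W_R, I) = \lambda \cdot \phi(W_R, I) \tag{2.7} \]
\[ \text{with } \lambda \cdot \phi(W_R, I) = \sum_{u \in V_R \setminus V_R^{leaf}} \lambda^P_{u,ch(u)} \cdot \psi_{u,ch(u)}(w_u, w_{ch(u)}) + \sum_{v \in V_R} \lambda^D_v \cdot \phi_v(w_v, I) \tag{2.8} \]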
Modelling the features that make up the set of images by using probability distributions
is common in Compositional Models, and we’ll see it used again in Compositional Net-
works in Section 2.5.3.
Recursive Formulation of the Energy
The fundamental equation for HCMs combines recursion and composition in a single
equation and is the following:
\[ E_u(W_u, I) = \lambda^P_{u,ch(u)} \cdot \psi_{u,ch(u)}(w_u, w_{ch(u)}) + \lambda^D_u \cdot \phi_u(w_u, I) + \sum_{p \in ch(u)} E_p(W_p, I) \tag{2.9} \]
It shows that the energy $E_u(W_u, I)$ for a subtree with root node u can be computed
recursively in terms of the energies of its descendants $W_p$ for $p \in ch(u)$. This
composition of parts is seen again in Compositional Networks in Section 2.5.3.
We’ve discussed the building components of HCMs to highlight the use of composi-
tionality to model visual objects.
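The recursion in Equation 2.9 can be made concrete with a short sketch. The potentials below are toy stand-ins chosen only for illustration: the state of a node is a single number, the data potential is a squared distance to a scalar "image", and the prior potential is a squared distance between a parent and the mean of its children. None of these choices, nor the Node class or the default weights, come from [10].

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    state: float  # toy state variable w_u, e.g. a 1D position
    children: List["Node"] = field(default_factory=list)


def data_potential(state, image):
    """Toy phi_u(w_u, I): how far the node's state is from the observation."""
    return (state - image) ** 2


def prior_potential(state, child_states):
    """Toy psi_{u,ch(u)}: how far the parent is from the mean of its children."""
    mean = sum(child_states) / len(child_states)
    return (state - mean) ** 2


def energy(node, image, lam_p=1.0, lam_d=1.0):
    """E_u = lam_p * psi + lam_d * phi + sum of the children's energies."""
    total = lam_d * data_potential(node.state, image)
    if node.children:  # leaves have no prior term
        total += lam_p * prior_potential(node.state, [c.state for c in node.children])
    total += sum(energy(child, image, lam_p, lam_d) for child in node.children)
    return total


root = Node(2.0, [Node(1.0), Node(3.0)])
shifted_root = Node(3.0, [Node(1.0), Node(3.0)])
```

Moving the root away from both the observation and the mean of its children raises the energy through the data and prior terms, while the children's contributions stay unchanged.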
2.5.2 Recursive Cortical Networks for Object Classification under Occlusion
Here we describe the work on a Recursive Cortical Network (RCN) from [2], which is
a form of an HCM that mimics the brain’s use of lateral connections
between neurons. The RCN’s application to partially-occluded object classification is a
motivation for developing Compositional Networks for occluded object classification. The
RCN builds upon the ideas of hierarchical composition from HCMs, lateral connections
for selectivity, contour-surface factorization and joint-explanation parsing. The model
separates the representation of contours from surfaces, which enables it to recognize ob-
jects with dramatically different appearances without needing to exhaustively train on
every possible shape and surface combination. The structure consists of nodes and OR
and POOL layers. Each node encodes an AND relation and each layer encodes an OR
function of its nodes’ features. Each pooling layer pools different deformations, transla-
tions and scales. Lateral connections are introduced between pools which provide a con-
straint. Laterally connected layers are affected by the choice of features of one-another.
This creates samples that vary more smoothly. As for inference, an input image is passed
through the network: a forward pass performs hypothesis assignments, and a backward
pass computes a Maximum A-Posteriori configuration from the hypotheses. This
covers the best joint configuration including object classifications and segmentations. As for
learning, contour connectivity features are learned at each level from input images. The
lateral connections are learned from the contour connectivity of input images. Features
at the topmost layer represent whole objects, similar to what we described in Section
2.5.1. The resulting model can beat CAPTCHA with an accuracy of 86.2% with a sig-
nificantly smaller number of training images. Given the generative nature of the RCN,
it can achieve high performance for partially-occluded character classification but lacks
the ability to discriminate hundreds of possible classes, which forms the main motivation
towards developing the Compositional Convolutional Neural Network method.
Figure 2.8: Compositional Convolutional Neural Network Architecture [69]
2.5.3 Compositional Convolutional Neural Networks
Here we describe the Compositional Convolutional Neural Network (CompNet), also known
as Compositional Network, which is a novel approach in combining Compositional Models
and Convolutional Neural Networks into one integrated system [11], [69]. The Recursive
Cortical Network and other Compositional Models have been shown to be robust against
occlusions for object detection; however, they lack the same discriminative power as neural
networks. Therefore, they lack the ability to classify hundreds or more different
instances of objects. Kortylewski et al. present a solution in which a CNN is used as a
feature extractor which feeds into a Compositional Model inference head. A visualization
of its architecture is available in Figure 2.8. The novelty in this architecture is the use of a
pre-trained CNN as a feature extractor, which feeds feature maps extracted from an input
image, to a Compositional Model for inference.
vMF Kernels and Class Mixtures
Kortylewski et al. propose a differentiable generative compositional model of the feature
activations p(F |y) for some object class y, where F is a CNN feature map extracted from
an input image using a pre-trained CNN. To cluster visual features, they need to first be
represented by some probability distribution. CNN feature map dimensionality can reach
a high order such as 512 or 1024, which makes it difficult to use common distributions
such as a normal distribution, which is otherwise commonly used for clustering [70]. In
this work, the von Mises-Fisher (vMF) distribution is used to describe the underlying
distribution of the high-dimensional CNN feature maps extracted from the input images
within our training set. The vMF distribution is suited for high-dimensionality vectors,
which makes it appropriate for representing CNN feature maps [70]. In our case, the CNN
feature maps represent features that are extracted from liver slices by the CompNet’s
backbone feature extractor. The probability density function of the vMF distribution for
some n-dimensional unit vector x is given by:
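\[ f_n(x; \mu, \sigma) = \frac{e^{\sigma \mu^T x}}{Z_n(\sigma)} \]
where $\sigma \geq 0$, $\|\mu\| = 1$ and the normalization constant $Z_n(\sigma)$ is defined as:
\[ Z_n(\sigma) = \frac{(2\pi)^{n/2} I_{n/2-1}(\sigma)}{\sigma^{n/2-1}} \]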
where Iv denotes the modified Bessel function of the first kind at order v [71]. µ and
σ are called the mean direction and concentration parameter, respectively. Intuitively,
the greater the value of σ, the higher the concentration of the distribution around the
mean direction µ. The CompNet’s object representations are modelled as mixtures of
vMF distributions:
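\[ p(F \mid \theta_y) = \prod_p p(f_p \mid A_{p,y}, \Lambda) \tag{2.10} \]
\[ p(f_p \mid A_{p,y}, \Lambda) = \sum_k \alpha_{p,k,y} \, p(f_p \mid \lambda_k) \tag{2.11} \]
$\theta_y = \{A_y, \Lambda\}$ are model parameters where $A_y = \{A_{p,y}\}$ are mixture model parameters at
some position $p \in P$ on the feature map F (purple tensor in Figure 2.8). The feature map
F is extracted by the CompNet’s backbone-CNN. $A_{p,y} = \{\alpha_{p,0,y}, ..., \alpha_{p,K,y} \mid \sum_{k=0}^{K} \alpha_{p,k,y} = 1\}$
are the object mixture coefficients, where K is the number of mixtures and
$\Lambda = \{\lambda_k = \{\sigma_k, \mu_k\} \mid k = 1, ..., K\}$ are vMF distribution parameters:
\[ p(f_p \mid \lambda_k) = \frac{e^{\sigma_k \mu_k^T f_p}}{Z(\sigma_k)}, \quad \|f_p\| = 1, \; \|\mu_k\| = 1 \tag{2.12} \]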
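To make the vMF density and the per-position mixture of Equation 2.11 concrete, the following is a small sketch in plain Python. It is not the CompNet implementation: the function names are hypothetical, the modified Bessel function is evaluated with a truncated power series (an assumption that is only reasonable for small dimensions and moderate concentrations), and sigma is assumed to be strictly positive. For real 512- or 1024-dimensional feature vectors the computation would have to be carried out in log-space.

```python
import math


def bessel_i(order, x, terms=80):
    """Modified Bessel function of the first kind via its power series (x > 0)."""
    log_half = math.log(x / 2.0)
    return sum(
        math.exp((2 * m + order) * log_half
                 - math.lgamma(m + 1) - math.lgamma(m + order + 1))
        for m in range(terms)
    )


def vmf_normalizer(n, sigma):
    """Z_n(sigma) = (2 pi)^(n/2) * I_{n/2-1}(sigma) / sigma^(n/2-1)."""
    order = n / 2.0 - 1.0
    return (2.0 * math.pi) ** (n / 2.0) * bessel_i(order, sigma) / sigma ** order


def vmf_density(x, mu, sigma):
    """vMF density of a unit vector x with mean direction mu and concentration sigma."""
    dot = sum(a * b for a, b in zip(x, mu))
    return math.exp(sigma * dot) / vmf_normalizer(len(x), sigma)


def mixture_likelihood(f_p, alphas, kernels):
    """Equation 2.11 at one position: sum_k alpha_k * p(f_p | lambda_k).

    kernels is a list of (sigma_k, mu_k) pairs.
    """
    return sum(a * vmf_density(f_p, mu, s) for a, (s, mu) in zip(alphas, kernels))


north = [0.0, 0.0, 1.0]
east = [1.0, 0.0, 0.0]
toy_kernels = [(2.0, north), (5.0, east)]
```

In three dimensions the normalizer has the closed form 4π sinh(σ)/σ, which gives a convenient sanity check for the series.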
Numerical estimation of the vMF distribution parameters is non-trivial and non-unique
in high dimensions, since it involves functional inversion of ratios of Bessel functions [70].
Therefore, an iterative Expectation-Maximization-like procedure is used to estimate them
by iterating between vMF clustering of the feature vectors of all training images and max-
imum likelihood parameter estimation until convergence. After cluster training, the vMF
cluster centres µk will represent feature activation patterns that frequently occur in the
training data. Figure 2.9 shows visualizations of vMF kernels representing common ob-
jects from the training dataset used in [69]. Note how features of similar appearance and
that share semantic meaning are separated into different clusters. We see further exam-
ples of vMF kernel visualizations within the medical imaging context later on in Section
4 in Figures 4.1 and 4.2. The mixture coefficients αp,k,y can also be learned with maximum
likelihood estimation. They describe the expected activation of a cluster centre µk at a
position p within the 2D lattice in a feature map F for an object class y. The object classes are
known a priori: they are supervised labels given by a human annotator. Given that an
object may have different poses, 3D objects are represented with a generalized model using
mixtures of compositional models:
\[ p(F \mid \Theta_y) = \sum_m v_m \, p(F \mid \theta_y^m) \tag{2.13} \]
with $V = \{v_m \in \{0, 1\}, \sum_m v_m = 1\}$ and $\Theta_y = \{\theta_y^m, m = 1, ..., M\}$. M is the number
of mixture models, and $v_m$ is a binary assignment variable indicating which mixture is active.
The mixture components are learned by iterating between estimating the assignment
variables V and maximum likelihood estimation of the components. Figure 2.10 shows
visualizations of mixture models learned for different object classes and vMF kernels.
Note how different 3D viewpoints of a given object are separated into different mixture
models.
Figure 2.9: Illustration of vMF kernels by visualizing image patterns from the training set
that activate a given vMF kernel the most. Note how image patterns that are of similar
appearance and share semantic meaning are separated into different kernels [69].
Figure 2.10: Visualization of learned mixture models (4 mixtures). Each row is for a dif-
ferent object class (car, train, boat, or bus) and each column represents a different mixture
for the object class. Note how different 3D viewpoints are approximately separated into
different mixtures. [69]
Occlusion Modeling
At each position p in the image, either some object model $p(f_p \mid A^m_{p,y}, \Lambda)$ or an occluder
model $p(f_p \mid \beta, \Lambda)$ is active:
\[ p(F \mid \theta_y^m, \beta) = \prod_p p(f_p, z_p^m = 0)^{1 - z_p^m} \, p(f_p, z_p^m = 1)^{z_p^m} \tag{2.14} \]
\[ p(f_p, z_p^m = 1) = p(f_p \mid \beta, \Lambda) \, p(z_p^m = 1) \tag{2.15} \]
\[ p(f_p, z_p^m = 0) = p(f_p \mid A^m_{p,y}, \Lambda) \, (1 - p(z_p^m = 1)) \tag{2.16} \]
The binary variables $\{z_p^m \in \{0, 1\} \mid p \in P\}$ indicate if the object is occluded at the position
p for some mixture component m. Figure 2.11 shows examples of occlusion localization
results. Note how different occluders, either real or artificial occlusions, are accurately
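localized by the CompNet. The occlusion prior $p(z_p^m = 1)$ is fixed a priori. Multiple
occluder models are learned in an unsupervised manner:
\[ p(f_p \mid \beta, \Lambda) = \prod_n p(f_p \mid \beta_n, \Lambda)^{\tau_n} \tag{2.17} \]
\[ = \prod_n \Big(\sum_k \beta_{n,k} \, p(f_p \mid \sigma_k, \mu_k)\Big)^{\tau_n} \tag{2.18} \]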
where τn indicates which occluder model explains the data best. The parameters of the
occluder models βn are learned from clustered features of random natural images that do
not contain any object of interest. In this work, we forgo the use of an occluder model,
and use a thresholding solution instead. This solution applies a threshold
over the object estimation map, such that any set of pixels below our threshold
is deemed an occlusion. We explain this solution in further detail in Chapter 3.
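The thresholding idea can be sketched in a few lines. This is only an illustration of the rule stated above, not the procedure detailed in the methodology: the function name, the nested-list score map and the threshold value are hypothetical.

```python
def occlusion_mask(object_scores, threshold):
    """Mark every position whose object score falls below the threshold as occluded."""
    return [[score < threshold for score in row] for row in object_scores]


# Hypothetical 2 x 2 object estimation map and threshold.
toy_scores = [[0.9, 0.2],
              [0.4, 0.7]]
toy_mask = occlusion_mask(toy_scores, threshold=0.5)
```

Positions with low object evidence (0.2 and 0.4 in the toy map) are flagged, which in our setting would correspond to candidate tumour regions.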
Feed-Forward Inference
The backbone CNN is used as a feature extractor to extract a feature map $F = \phi(I, \omega) \in \mathbb{R}^{H \times W \times D}$
from some input image I, where ω is the set of parameters of the feature extractor.
The vMF likelihood function $L = p(f_p \mid \lambda_k)$ (yellow tensor in Figure 2.8) is computed
from an inner product, equivalent to a 1 × 1 Convolution, between the feature map F and
the cluster centres $\mu_k$:
\[ L = \{N(F * \mu_k) \mid k = 1, ..., K\} \in \mathbb{R}^{H \times W \times K} \tag{2.19} \]
where $N(\cdot) = \exp(\sigma_k \mu_k^T f_p) / Z(\sigma_k)$ is a non-linear transformation that yields the
vMF density of Equation 2.12. The likelihood map L describes which cluster centres $\mu_k$ ”activate” or ”represent”
each 2D lattice patch p within the input image’s extracted feature map. Every
channel, within the resulting feature map L, corresponds to an activation map for a re-
spective vMF kernel. A high activation within a set of pixels represents a part of an object
being identified. All the activation maps, or channels, which represent different object
parts are used together for the rest of the inference process.
Figure 2.11: Occlusion localization results from [69]. Each result consists of three images:
the input image, the occlusion scores of a dictionary-based compositional model from a
prior work by Kortylewski et al. [72] and the occlusion scores of the proposed CompNet
[69]. Note how the CompNet can localize occluders with high accuracy across different
objects and occluder types for real as well as for artificial occlusions.
The idea of compositionality is highlighted here: different parts of objects learned during
vMF clustering are selected
and combined to describe data during the feed-forward inference and test time. The
mixture likelihoods (blue planes in Figure 2.8) are computed at every position p as the
dot-product between the mixture coefficients and the corresponding vector $l_p \in \mathbb{R}^K$ from
the likelihood tensor L:
\[ E_y^m = \{l_p^T A^m_{p,y} \mid \forall p \in P\} \in \mathbb{R}^{H \times W} \tag{2.20} \]
Similarly, the occlusion likelihood (red planes in Figure 2.8) is computed as:
\[ O = \{\max_n l_p^T \beta_n \mid \forall p \in P\} \in \mathbb{R}^{H \times W} \tag{2.21} \]
The occlusion likelihood and mixture likelihoods $E_y^m$ are used together to estimate the
overall likelihood of the individual mixtures as $s_y^m = P(F \mid \theta_y^m, \beta) = \sum_p \max(E^m_{p,y}, O_p)$. The
final likelihood is computed as $s_y = \max_m s_y^m$. The resulting occlusion map is defined as
$Z_y = Z_y^{\hat{m}} \in \mathbb{R}^{H \times W}$, where $Z_y^m$ marks the pixels covering portions of an object of class y
under mixture m, and $\hat{m} = \arg\max_m s_y^m$ is the mixture with the highest score $s_y^m$ for object class y. As
demonstrated in Figure 2.11, we can see how different occlusions are accurately localized.
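The scoring step of the feed-forward pass (Equations 2.20 and 2.21, followed by the per-position maximum) can be sketched with nested lists. This is a toy illustration and not the CompNet code: the likelihood tensor is assumed to be given, the function names are hypothetical, and marking a position as occluded when its occlusion likelihood exceeds its mixture likelihood is our assumption about how the occlusion map is read off.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def mixture_score(likelihoods, coefficients, occluders):
    """Score one mixture: sum over positions of max(E_p, O_p).

    likelihoods:  H x W grid of K-vectors l_p
    coefficients: H x W grid of K-vectors A_p for this mixture
    occluders:    list of K-vectors beta_n
    """
    score = 0.0
    occluded = []
    for row_l, row_a in zip(likelihoods, coefficients):
        occluded_row = []
        for l_p, a_p in zip(row_l, row_a):
            e_p = dot(l_p, a_p)                               # Equation 2.20
            o_p = max(dot(l_p, beta) for beta in occluders)   # Equation 2.21
            score += max(e_p, o_p)
            occluded_row.append(o_p > e_p)                    # assumption: occluder wins
        occluded.append(occluded_row)
    return score, occluded


def infer(likelihoods, mixtures, occluders):
    """Return the index, score and occlusion map of the best-scoring mixture."""
    results = [mixture_score(likelihoods, a_m, occluders) for a_m in mixtures]
    best = max(range(len(results)), key=lambda m: results[m][0])
    return best, results[best][0], results[best][1]


# Toy 1 x 2 feature grid with K = 2 kernels, two mixtures and one occluder model.
toy_L = [[[0.9, 0.1], [0.1, 0.9]]]
toy_mixtures = [
    [[[1.0, 0.0], [1.0, 0.0]]],  # expects kernel 0 at both positions
    [[[1.0, 0.0], [0.0, 1.0]]],  # expects kernel 0, then kernel 1
]
toy_occluders = [[0.5, 0.5]]
best_m, best_score, best_map = infer(toy_L, toy_mixtures, toy_occluders)
```

In the toy example the second mixture explains both positions, so it wins with no occlusion, whereas the first mixture would have to explain its second position with the occluder model.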
End-to-End Parameter Optimization
The trainable parameters of the CompNet are T = {ω,Λ, Ay}. Recall how ω is the set of
weight parameters of the feature extractor, Λ is the set of vMF distribution parameters,
and Ay is the set of mixture model parameters. The parameters are optimized via Stochastic
Gradient Descent and backpropagation. The loss function is therefore defined as:
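\[ \mathcal{L}(y, y', F, T) = \mathcal{L}(y, y') + \gamma_1 \mathcal{L}_{weight}(\omega) + \gamma_2 \mathcal{L}_{vmf}(F, \Lambda) + \gamma_3 \mathcal{L}_{mix}(F, A_y) \tag{2.22} \]
$\mathcal{L}(y, y')$ is the cross-entropy loss between the CompNet’s estimates y′ and the ground truth
label y. $\mathcal{L}_{weight} = \|\omega\|_2^2$ is a weight regularization loss for the CNN backbone. $\mathcal{L}_{vmf}$ and
$\mathcal{L}_{mix}$ regularize the CompNet’s parameters to have maximal likelihood for the features in
F. More specifically:
\[ \mathcal{L}_{vmf} = -\sum_p \max_k \log p(f_p \mid \mu_k) \tag{2.23} \]
\[ \mathcal{L}_{mix} = -\sum_p (1 - z_p^{\uparrow}) \log\Big[\sum_k \alpha^{m^{\uparrow}}_{p,k,y} \, p(f_p \mid \lambda_k)\Big] \tag{2.24} \]
where $z_p^{\uparrow}$ and $m^{\uparrow}$ denote the respective occlusion patch and object mixture variables
inferred in the forward process. $\{\gamma_1, \gamma_2, \gamma_3\}$ control the trade-off between the losses.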
Chapter 3
Methodology
3.1 Datasets
In this work, we rely on Computed Tomography (CT) scans of healthy abdominal organs
coupled with liver segmentation maps from the online CHAOS dataset’s training set [73]
to train and test our CompNet across experiments. The CHAOS dataset consists of 40
different patients. Each patient was injected with a contrast agent to help increase the
contrast of fluids and structures in the images captured from them. The CT images were
acquired from the upper abdomen area of every patient during the portal venous phase
of the contrast agent injection. This phase is obtained 70 to 80 seconds after the contrast
agent injection. During this phase, the liver tissue and its blood vessel characteristics are
enhanced maximally through blood supply of the portal vein. This phase is widely used
for liver and vessel segmentation in medical imaging analysis prior to surgery. Three dif-
ferent modalities are used: Philips SecuraCT with 16 detectors, Philips Mx8000CT with
64 detectors and Toshiba AquilionOne with 320 detectors. The CHAOS dataset consists
of 20 CT abdominal volumes for training and 20 CT abdominal volumes for testing, where each
volume represents a different patient. Each volume consists of 512 × 512 16-bit DICOM
images, with an x-y spacing between 0.7 and 0.8 mm and an inter-slice distance (ISD) of 3 to 3.2 mm.
There is an average of 90 slices per patient, with the minimum number of slices being 77
and the maximum being 105. In total, there are 1367 slices for training and 1408 slices for
testing. Every slice within the training set is also accompanied by a respective 512 × 512
segmentation map for the liver, whereas the test set slices are not. This isn’t a concern because we only
make use of the training set.
We also make use of the online Liver Tumour Segmentation (LiTS) challenge dataset
[74]. The LiTS dataset consists of 201 different patients and CT volumes stored as 32-bit
NifTi files. It is split into a training set of 131 CT abdominal scan
volumes with respective segmentation maps for liver and tumour tissue and a test set
of 70 CT abdominal scan volumes. The dataset contains instances of tumours
such as Hepatocellular Carcinoma (HCC). The tumours have varying contrast enhance-
ments, such as hyper or hypo-dense contrast. Some images also contain imaging artifacts,
such as metal artifacts, which are present in real life clinical data. The image data was ac-
quired with different CT scanners and acquisition protocols, and is diverse with respect
to resolution and image quality. Image resolution ranges from 0.56 mm to 1.0 mm in the axial
plane and from 0.45 mm to 6.0 mm in the z direction. The number of slices in the z direction
ranges from 42 to 1026. The LiTS dataset is used to train a U-Net for liver and tumour
segmentation and to test the CompNet’s ability in tumour localization. The trained U-
Net’s encoder network is then used as our CompNet’s backbone feature extractor. The
U-Net architecture used is identical to that in Section 2.3.6 as illustrated in Figure 2.6. We
also only make use of the LiTS training set and disregard its test set.
The final dataset we make use of is from the McGill University Health Centre (MUHC)
Picture Archiving and Communication System (PACS). For ease of reading, we call it the
MUHC PACS dataset. It consists of 25 different CT abdominal volumes stored as NifTi
files, each from a different patient. Every patient has some form of a liver tumour.
3.2 Methods
We make use of the CompNet method for tumour localization. The implementation is
made available in the Python 3.6 programming language and uses the PyTorch Python
software package. Additionally, we make use of a pre-trained U-Net’s encoder compo-
nent as the CompNet’s backbone feature extractor. The U-Net is pre-trained to perform
liver and tumour segmentation on the LiTS dataset; we go into further detail about how
we trained the U-Net in Section 3.2.2.
For CompNet training and testing, we segment the livers out of the CHAOS, LiTS and MUHC PACS
slices because doing so removes the interference from other organs within every
CT slice. For the CHAOS and LiTS slices, we use the provided liver segmentation maps. As
for the MUHC PACS slices, we manually segment the livers using the GIMP image
manipulation software. Additionally, we want to simplify our problem to localizing tumours
on livers of a similar scale, spatial position, and rotation. The livers in the CHAOS
dataset can present themselves in various sizes and positions. Constraining this
variance limits the difficulty for our CompNet to learn a liver representation and
avoids needing a large amount of training data to capture a large variance in liver shape
and position. We therefore chose to perform linear affine registration on the CHAOS training
volumes, which we go into further detail in Section 3.2.1.
3.2.1 Image Registration
As for image registration, we chose to perform a rigid-body linear image registration
including the translation, rotation, and uniform scaling transformations with 4 degrees-
of-freedom. We chose rigid-body because we wanted to use the simplest registration
process that could align our liver slices into a similar position, scale and orientation. We
want to simplify our problem in a way such that the livers within our training set share a
common size, position, and rotation.
Linear registration is the process of applying a linear function on a source image to
map it to a target domain. This involves translation, rotation and uniform scaling trans-
formations. The rotation and uniform scaling transforms can be expressed by a multipli-
cation between the source image and a matrix representing the amount of rotation, scal-
ing and translation. This matrix is called the registration matrix. To perform linear affine
registration, we need to first compute the nearest neighbour. The nearest neighbour is the
source volume that is the most similar to all other volume instances within a given training
set. The notion of similarity is expressed as the amount of work needed to map one image
to another domain. So a nearest neighbour is a volume whose average work to perform a
linear mapping to every other volume is the lowest amongst them all. It is chosen as the
target volume because it offers the most similar spatial coordinates, rotation, and scale
for all the volumes in the dataset. Solving for the nearest neighbour involves computing
the linear affine registration matrix for every volume pair, computing every registration
matrix’s determinant and selecting the volume whose average determinant amongst all
other volumes is the lowest. Linear affine registration is performed by first solving for
the nearest neighbour amongst all CT volumes, then registering every CT volume to the
same spatial coordinates, rotation, and scale of that nearest neighbour. We made use of
the FLIRT software package for computing registration matrices and performing image
registration [75]. Computing the registration matrices involves running the fsl reg script
on the training set of volumes. This script can be viewed in the Appendix Section 6.1.
Once the nearest neighbour is solved, the FLIRT command for performing a linear reg-
istration is run. The fsl reg source target output −a −flirt\”−out output\” command
is performed for every source-to-nearest-neighbour-pair, with target being the nearest
neighbour volume.
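The nearest-neighbour selection described above can be sketched as follows. This is a minimal illustration rather than the implementation used in this work: it assumes the pairwise 4×4 registration matrices have already been loaded into a dictionary keyed by (source, target), and the names det, uniform_scale, and nearest_neighbour are hypothetical. It follows the selection rule as stated in the text, namely the lowest average determinant.

```python
def det(matrix):
    """Determinant of a square matrix (list of rows) via Gaussian elimination."""
    a = [row[:] for row in matrix]
    n = len(a)
    d = 1.0
    for c in range(n):
        # Partial pivoting: pick the row with the largest entry in column c.
        p = max(range(c, n), key=lambda r: abs(a[r][c]))
        if abs(a[p][c]) < 1e-12:
            return 0.0
        if p != c:
            a[c], a[p] = a[p], a[c]
            d = -d  # a row swap flips the sign of the determinant
        d *= a[c][c]
        for r in range(c + 1, n):
            f = a[r][c] / a[c][c]
            for k in range(c, n):
                a[r][k] -= f * a[c][k]
    return d


def uniform_scale(s):
    """Hypothetical helper: 4x4 homogeneous matrix for a uniform scaling by s."""
    return [[s, 0, 0, 0], [0, s, 0, 0], [0, 0, s, 0], [0, 0, 0, 1]]


def nearest_neighbour(pair_matrices):
    """Return the target whose registration matrices have the lowest mean determinant.

    pair_matrices maps (source, target) -> 4x4 registration matrix.
    """
    dets_per_target = {}
    for (_source, target), matrix in pair_matrices.items():
        dets_per_target.setdefault(target, []).append(det(matrix))
    return min(dets_per_target,
               key=lambda t: sum(dets_per_target[t]) / len(dets_per_target[t]))


# Toy example with three volumes; the matrices are made up for illustration.
pairs = {
    ("B", "A"): uniform_scale(1.0), ("C", "A"): uniform_scale(1.1),
    ("A", "B"): uniform_scale(1.0), ("C", "B"): uniform_scale(1.2),
    ("A", "C"): uniform_scale(0.9), ("B", "C"): uniform_scale(0.8),
}
target_volume = nearest_neighbour(pairs)
```

In practice the matrices would come from the .mat files written by the -omat option in the appendix script; parsing those files is left out of this sketch.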
3.2.2 Pre-Trained Feature Extractor
The original CompNet’s feature extractor uses a pre-trained VGG16 model. The VGG16
feature extractor is pre-trained on ImageNet, a large dataset of images of
common objects [76]. This feature extractor therefore isn’t well tailored to
extracting features from liver slices, because its source task dataset differs greatly from
the target task. Instead, we trained a U-Net model to segment livers and tumours from
the online Liver Tumour Segmentation (LiTS) challenge dataset [74]. The U-Net’s encoder
and decoder layers’ weights are randomly initialized. We use 80% of the training set
for training and the remaining 20% for validation. The
training parameters used are: a learning rate of 0.001, a batch size of 4, and
50 training epochs. The optimizer used is Adam, and a DICE loss is used to measure
the performance. These settings gave empirically the highest validation score of 0.9803 (98.03%). The
training loss and validation accuracy across the number of epochs are available as plots
in Figure 3.1.
(a) U-Net training loss (b) U-Net validation accuracy
Figure 3.1: U-Net training loss and validation accuracy plots
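For reference, a DICE loss of the kind mentioned above can be sketched as follows. The exact formulation used for the U-Net training is not given in this section, so this is a common soft variant shown under that assumption; the function name dice_loss and the smoothing term eps are illustrative choices.

```python
def dice_loss(pred, target, eps=1e-6):
    """Soft DICE loss between flattened prediction probabilities and a binary mask.

    Returns a value near 0 for perfect overlap and near 1 for no overlap.
    """
    intersection = sum(p * t for p, t in zip(pred, target))
    denominator = sum(pred) + sum(target)
    # eps keeps the ratio defined when both masks are empty.
    return 1.0 - (2.0 * intersection + eps) / (denominator + eps)


# Identical masks give a loss of ~0, disjoint masks a loss of ~1.
perfect = dice_loss([1.0, 0.0, 1.0, 1.0], [1, 0, 1, 1])
disjoint = dice_loss([1.0, 0.0], [0, 1])
```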
Afterwards, for feature extraction, we make use of the pre-trained U-Net’s encoder
component for the CompNet. The encoder component consists of 5 levels, including the
bottleneck region. We sample a feature map from the 3rd level, starting from the top,
with 256 feature channels. Up to and including the 3rd level of the U-Net, there is a
total of six 3 × 3 convolutional operations, each coupled with a ReLU activation function,
and 2 max-pooling operations, thereby reducing the input image to roughly 0.2378× its
original dimension.
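The reduction factor quoted above can be reproduced with a short calculation. The sketch below assumes unpadded 3 × 3 convolutions, 2 × 2 max-pooling, and a 572-pixel input side, as in the original U-Net architecture; these values are assumptions chosen because they reproduce the stated ratio, not settings confirmed by this section.

```python
def encoder_output_size(size, levels=3, convs_per_level=2, kernel=3, pool=2):
    """Spatial side length after `levels` encoder levels of unpadded convolutions.

    Each unpadded convolution trims (kernel - 1) pixels; a max-pooling
    step sits between consecutive levels and divides the size by `pool`.
    """
    for level in range(levels):
        for _ in range(convs_per_level):
            size -= kernel - 1
        if level < levels - 1:
            size //= pool
    return size


input_size = 572  # assumed input side length
output_size = encoder_output_size(input_size)
ratio = output_size / input_size
```

With these assumptions the third-level feature map has a side of 136 pixels, and 136 / 572 matches the factor of about 0.2378.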
3.2.3 CompNet Training
Training the CompNet usually requires 3 different parts, but we only make use of one.
In this work, we don’t require performing parameter optimization using Stochastic Gra-
dient Descent (SGD) because we aren’t interested in optimizing the CompNet’s classifi-
cation performance. We also don’t require training an occlusion model, since we treat
any patch p that’s not recognized to be part of a liver as an occlusion. Instead, we sim-
ply threshold object likelihoods, which we discuss in further detail in Section 4. There-
fore, for training, we’re only concerned with vMF kernel and object mixture model train-
ing. We do so by performing vMF Expectation-Maximization clustering on feature maps
extracted from our training dataset. The feature cluster probabilities are then used as
weights within the Cluster Activation Convolutional 1×1 filter, as seen in Equation 2.5.3,
to produce the object likelihood tensor. The object mixture model training is performed
via Expectation-Maximization iteration, but we constrained our experiments to using 1
mixture model. The intuition behind using more than 1 mixture model in [11] is to be
able to capture multiple viewpoints, angles, or spatial differences, such as rotation and
scale, under which an object can present itself within a given training dataset. Given that
the top-down viewpoint for every liver instance is the same within the training set and
that every volume has been linearly registered to have a similar scale, rotation and spatial
coordinates, we only required training 1 mixture model. We used the following parame-
ters for training the CompNet: 100 vMF cluster centres, a kappa of 55 for vMF clustering,
and a maximum of 2000 features sampled from every feature map extracted from every
image. We saw empirically that a kappa of 55 results in cluster centres that allowed for
a slightly higher variance in patch appearance, and that 100 vMF cluster centres allowed
for enough variance in cluster centre appearance. We determined that these parameters
yielded the highest-performing occlusion maps. Since we only desire learning a repre-
sentation for one object, being the liver, we constrained the number of classes trained to
1.
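To make the cluster activation step more concrete, the sketch below computes vMF activations for a single feature vector. Equation 2.5.3 is not reproduced in this section, so the sketch assumes the standard vMF form in which the activation of cluster k for a unit-norm feature f is proportional to exp(κ µ_kᵀ f). The function names are hypothetical, and the normalized values correspond to the soft assignments of an Expectation-Maximization E-step under that assumption.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit length, as required by the vMF distribution."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def vmf_activations(feature, centres, kappa=55.0):
    """Unnormalized vMF activations exp(kappa * mu_k . f) for one feature vector."""
    f = l2_normalize(feature)
    activations = []
    for centre in centres:
        mu = l2_normalize(centre)
        cosine = sum(a * b for a, b in zip(mu, f))
        activations.append(math.exp(kappa * cosine))
    return activations


def soft_assignments(activations):
    """Normalize activations across clusters so that they sum to 1."""
    total = sum(activations)
    return [a / total for a in activations]


# A feature aligned with the first of two orthogonal centres.
centres = [[1.0, 0.0], [0.0, 1.0]]
assignments = soft_assignments(vmf_activations([2.0, 0.0], centres))
```

Applying vmf_activations at every spatial position of a feature map is what the 1×1 convolution with the cluster centres as weights computes, up to the exponential.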
3.2.4 CompNet Testing
Liver Representation
Understanding the CompNet’s ability to learn an accurate representation of the liver im-
ages in the training set is done by visual inspection of the cluster centre patches learned
from vMF clustering, and by the visual inspection of the cluster centre activation maps
that are a result of the 1× 1 convolution of the cluster centres and a given input image.
Tumour Occlusions
Testing the CompNet’s ability to localize tumours is done by visual inspection of the
occlusion maps generated from the feed-forward pass of the test images from three
different test sets. The first test set consists of segmented and registered liver slices held
out from the training set, each with some manually added occlusion such as a tumour
from a sick liver, a noise patch, or a common object of our choosing. The common object
occlusions are either a car or a monkey face, which were segmented from images found
online. The noise patches are of a colour similar to the surrounding liver tissue. We chose
this to pose a challenge for our CompNet: we want to see if
it can still extract occluders that are of a very similar colour. Their size is selected to be
large enough to cover a significant portion of either the bottom-left or the top-right
portions of the liver. We chose these two areas because we believe they are the parts of
the liver with the largest amount of variance in shape amongst the training set.
The manually added tumour occluders are from the MUHC PACS dataset,
and their respective liver slices are from patients 001 and 439. The second test set consists of
segmented liver slices with real tumours sampled from the LiTS training set. The third
test set consists of manually segmented liver slices with real tumours sampled from the
MUHC PACS dataset. We only select 1 or 2 slices from 3 different volumes whose livers
are most representative, in terms of shape and texture, of the livers from the MUHC PACS
dataset. More specifically, we select slices #14 and #50 from patient #0, #90 from patient
#439, and #126 from patient #29. We chose those patients because their livers were the
most representative of those in the CHAOS dataset amongst all the other MUHC PACS
patients, and we only needed to choose 1 or 2 slices because we believed it was sufficient
to simply test for real tumour localization. We use these slices to test our trained Comp-
Net’s ability to localize real tumours. We also use these test slices to extract real tumours,
and manually add them to segmented livers in the held-out test portion of the CHAOS
training set. That way, we can attempt to localize real tumours on livers that are repre-
sentative of the training set. We use the GNU Image Manipulation Program (GIMP) [77]
to manually segment the livers from the private dataset, and to manually extract tumours
from the private dataset and add them to livers from the held-out portion of the CHAOS
training set.
Chapter 4
Results
Here we present our results, beginning with visualizations of several of the CompNet’s
trained cluster patches in Figures 4.1, 4.2, and 4.3. The size of a cluster center patch is
determined by the receptive field size of the layer from which the feature map
was extracted. Each set of patches, such as Figure 4.1a, represents one specific cluster.
The patches within each set are the most representative patches for that
cluster center. Those patches map to feature vectors that are sampled and clustered during
vMF cluster initialization. Overall, these figures highlight the CompNet’s ability to
extract meaningful features from the liver, and to successfully cluster the extracted
features based on similarity.
We display the VMF cluster center activations in Figure 4.4. These activations are
sampled from the resulting feature map channels from the 1× 1 VMF cluster center con-
volution. Each image represents the activation for a different cluster. The annotation in
Figure 4.4 explains the intensity values, but overall we can see that different parts of the
liver are activated depending on the cluster. This highlights the CompNet’s ability to
learn different parts of the liver and to distinguish between them during inference.
We show the CompNet’s occlusion generation on synthetic occluders in
Figure 4.5, on real tumors from the LiTS training set
in Figure 4.6, and on real tumors from the MUHC PACS dataset
in Figure 4.7. Within the synthetic occluder figure, we can see that the CompNet
localizes common objects and synthetic patches with high accuracy. With regards to tumors
that were synthetically added, we see that the CompNet can localize parts of the tumors,
but fails to localize the entire content of a tumor, hence a high amount of false negatives.
With regards to real tumor slices as in Figures 4.6 and 4.7, we see a similar behavior as
with the synthetically added tumors. The CompNet can localize some parts of the tumor
border, but fails to localize the entire contents of the tumors. We also see a significant
amount of false positives in every occlusion map: a significant amount of pixels within
the images are classified as tumors when they actually aren’t. We empirically discovered
that a threshold of value 21 (a non-unit value) over the occlusion scores yielded the best
occlusion map results.
We discuss the significance of our results in greater detail in Section 5.
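The thresholding step mentioned above amounts to a per-pixel comparison, as in the sketch below. The direction of the comparison (scores above the threshold count as occluded) and the function name occlusion_mask are assumptions made for illustration; the score values in the example are made up.

```python
def occlusion_mask(scores, threshold=21.0):
    """Binary occlusion map: 1 where the occlusion score exceeds the threshold.

    scores is a 2D list of per-pixel occlusion scores.
    """
    return [[1 if score > threshold else 0 for score in row] for row in scores]


# Toy 2x2 score map; only the right-hand column exceeds the threshold of 21.
example_scores = [[5.0, 30.0], [21.0, 22.0]]
example_mask = occlusion_mask(example_scores)
```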
(a) Cluster Center 1 (b) Cluster Center 2 (c) Cluster Center 3
(d) Cluster Center 4 (e) Cluster Center 5 (f) Cluster Center 6
Figure 4.1: First 6 CompNet cluster center patch visualizations. Every set of 16 patches
represents one cluster center and shows the patches most representative of that cluster
center.
(a) Cluster Center 7 (b) Cluster Center 8 (c) Cluster Center 9
(d) Cluster Center 10 (e) Cluster Center 11 (f) Cluster Center 12
Figure 4.2: Second 6 CompNet cluster center patch visualizations. Every set of 16 patches
represents one cluster center and shows the patches most representative of that cluster
center.
(a) Cluster Center 13 (b) Cluster Center 14 (c) Cluster Center 15
(d) Cluster Center 16 (e) Cluster Center 17 (f) Cluster Center 18
Figure 4.3: Third 6 CompNet cluster center patch visualizations. Every set of 16 patches
represents one cluster center and shows the patches most representative of that cluster
center.
Figure 4.4: VMF cluster activations. The image in the top left is the input image. Purple
pixels represent no activation, blue pixels represent a low activation, green pixels repre-
sent moderate activation, and yellow pixels represent high activation.
(a) Test 1 (b) Test 2 (c) Test 3
(d) Test 4 (e) Test 5 (f) Test 6
(g) Test 7 (h) Test 8 (i) Test 9
Figure 4.5: Occlusion generation on synthetic occluders. The left column consists of input
images with some synthetic occlusion and the right column consists of occlusion maps
where purple pixels represent no occlusion, orange pixels represent a low scoring of oc-
clusion, green pixels represent a medium scoring of occlusion, and blue pixels represent
a high scoring of occlusion.
(a) Test 1 (b) Test 2 (c) Test 3
(d) Test 4 (e) Test 5 (f) Test 6
Figure 4.6: Occlusion generation on real tumors. Every sub-figure’s left column consists
of input images with some real tumour, the middle column consists of occlusion maps,
and the right column consists of the ground truth liver and tumor segmentation maps.
For the occlusion maps, the purple pixels represent no occlusion, blue pixels represent
a low scoring of occlusion, green pixels represent a medium scoring of occlusion, and
yellow pixels represent a high scoring of occlusion. For the ground truth segmentation
maps, black, grey and white pixels represent the background, liver tissue, and tumor
tissue classes respectively.
(a) Test 1 (b) Test 2 (c) Test 3
(d) Test 4 (e) Test 5
Figure 4.7: Occlusion generation on manually segmented real tumors from the MUHC
PACS dataset. Every sub-figure’s left column consists of input images with some real
tumours and every sub-figure’s right column consists of occlusion maps where purple
pixels represent no occlusion, orange pixels represent a low scoring of occlusion, green
pixels represent a medium scoring of occlusion, and blue pixels represent a high scoring
of occlusion.
Chapter 5
Discussion
5.1 Cluster Centre Patch Visualizations
The cluster centre patch visualizations represent patches of the liver extracted from the
training set that best explain that given cluster. The idea of compositionality in Compo-
sitional Networks is highlighted through the visualizations and use of the cluster centres
in Figures 4.1 and 4.2. We can see how different parts of the liver have been learned dur-
ing training, and how they are composed together via cluster activation during inference.
This highlights the idea of compositionality: making compositional use of parts of objects
to create a data-efficient and holistic representation of an object.
These patches are representative of feature vectors sampled from the feature maps
from input images within the training set. The feature vectors are clustered together by
a measure of similarity, as seen in Equation 2.5.3. During feed-forward inference, the
cluster centres µk are used as weights within a 1×1 convolution, as described in Equation
2.5.3. The feed-forward inference produces the cluster activation or likelihood map L.
When a set of pixels within a given likelihood map channel have a high activation, this
signals that the cluster centre for the respective channel is representative of that set of
pixels. The patches related to that cluster centre are representative of the set of pixels
with a high activation. Therefore, it’s important to learn and understand “good” cluster
centres, because the cluster centres contain a representation of what the liver should look
like. If the cluster centres are “bad,” then inference becomes inaccurate and leads to a loss
in occlusion localization performance. We discuss what makes a cluster centre “good”
and what makes it “bad.”
We can see that for the most part, the patches are selected from features that represent
similar parts of the liver for the respective cluster. However, it’s not perfect: we can see
discontinuity in several clusters, especially the ones in Figures 4.1f, 4.2c, 4.2d, and 4.2f.
For example, in cluster centre #6, two patches are small liver pieces, whereas the rest of
the patches are selected from middle-to-lower parts of the liver. In cluster #12, the
first patch is the right portion of a large liver, several of the middle patches are corners
of other livers, and one of the middle patches and the last two patches are the right portion of
a larger liver of a different shape. There seems to always be at least one patch that
is different from the rest. The discontinuity poses a problem, because the same cluster
might activate multiple significantly different parts of the liver. We want different clusters
to activate different parts of the liver; however, the parts activated by any one cluster
should ideally be similar to one
another. If they are grossly different, then that activation signifies a misrepresentation of
that part of the liver. If multiple clusters each activate significantly different parts of the
liver, the discontinuity between patches can result in false positive occlusion predictions.
The discontinuity leads to low object likelihood scores, and once a threshold is applied
on the object likelihood map to detect occlusions, the patches with low scores are deemed
occlusions, whereas they wouldn’t have been classified as occlusions if the object
likelihood score had been high enough. This discontinuity is what can make a “bad” cluster
centre. It occurs when feature vectors that represent different parts of
the liver get clustered together via the Expectation-Maximization procedure seen in
Section 2.5.3. We believe the root causes of this are: the backbone feature extractor doesn’t
extract features that are discriminative enough between different parts of the liver, the
vMF kappa parameter which allows for a larger variance in features for every cluster
isn’t optimized (if it’s too high, greatly different patches get clustered to the same cluster
centre), or there are simply not enough cluster centres to accommodate all the different
parts of the liver.
Additionally, we see that some patches select minuscule features, as seen in cluster
centre #8 in Figure 4.2b. The only liver tissue seen is in the upper-left corner of the
receptive field, and the rest of the patch consists of the background. Cluster centres with
almost entirely black patches would be considered “bad”, because the
background is then treated as part of the object. This can lead to false positive occlusion
localization, because the background can be considered as an object or an occlusion if the
object likelihood threshold is too low or too high respectively. We see this effect occur in
Figure 4.5, where the background is considered as a light occlusion (orange pixels). This
occurs because some of the features sampled from the extracted feature maps lie outside
the liver. We could rectify this issue by introducing a constraint to the sampling function
so that only features within the liver are sampled.
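A sampling constraint of this kind could be sketched as follows. This is an illustration of the proposed fix, not something implemented in this work: it assumes a binary liver mask is available at the same resolution as the feature map, and the function name sample_liver_features is hypothetical. The cap of 2000 features per feature map follows the training parameters given in Section 3.2.3.

```python
import random


def sample_liver_features(feature_map, liver_mask, max_features=2000, seed=0):
    """Sample feature vectors only from positions that lie inside the liver mask.

    feature_map[r][c] is the feature vector at row r, column c;
    liver_mask[r][c] is truthy when that position belongs to the liver.
    """
    positions = [
        (r, c)
        for r, row in enumerate(liver_mask)
        for c, inside in enumerate(row)
        if inside
    ]
    if len(positions) > max_features:
        # Subsample without replacement when too many liver positions exist.
        positions = random.Random(seed).sample(positions, max_features)
    return [feature_map[r][c] for r, c in positions]


# Toy 2x2 feature map with one-dimensional features; two positions are liver.
toy_features = [[[1.0], [2.0]], [[3.0], [4.0]]]
toy_mask = [[1, 0], [0, 1]]
liver_only = sample_liver_features(toy_features, toy_mask)
```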
There are also instances where no meaningful liver patch is selected, seen as entirely
black patches at the bottom of Figures 4.1a and 4.1c, for example. This means
that there aren’t at least 16 liver patches that best represent the given cluster. This simply
highlights that there aren’t enough training patches for certain clusters, which
can be alleviated by using more training data. This may not necessarily cause lower
performance, but it shows that not all the cluster centres have learned from the same
number of patches. If one wants to learn a better representation of the liver object, then
more training data is needed.
Finally, there are clusters that represent extremely similar liver patches and are therefore
redundant. An example of this is seen when comparing clusters #17 and #18 in Figures
4.3e and 4.3f. This poses a problem because it introduces confusion in the object likelihood
prediction function of Equation 2.5.3. Since the object likelihoods from Equation 2.5.3 are
normalized, instead of one cluster centre generating a high likelihood of activation, there
can be multiple redundant centres that each generate low-to-mid likelihoods of activation.
This is a problem because if the object likelihoods are low enough, those patches
may be classified as occlusions if they don’t reach the object threshold. Therefore,
redundant clusters can cause false positive occlusion predictions, which is an effect we wish
to minimize. We believe redundant clusters occur when either: the backbone feature
extractor fails to discriminate between similar liver patches, the vMF kappa parameter
which allows for a larger variance in features for every cluster isn’t optimized (if it’s too
low, similar patches are clustered in different cluster centres), or there are too many vMF
cluster centres.
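The effect of redundant clusters on normalized likelihoods can be illustrated with a toy calculation. The sketch assumes that the normalization is taken across cluster centres and that each centre contributes exp(κ µ_kᵀ f) for unit-norm vectors; both are assumptions about Equation 2.5.3, which is not reproduced here, and the function name is hypothetical.

```python
import math


def normalized_likelihoods(feature, centres, kappa=55.0):
    """Per-cluster likelihoods normalized to sum to 1 (unit-norm inputs assumed)."""
    scores = [
        math.exp(kappa * sum(m * f for m, f in zip(centre, feature)))
        for centre in centres
    ]
    total = sum(scores)
    return [score / total for score in scores]


feature = [1.0, 0.0]

# Two distinct centres: the matching centre takes almost all of the mass.
distinct = normalized_likelihoods(feature, [[1.0, 0.0], [0.0, 1.0]])

# Duplicating the matching centre splits that mass between the two copies.
redundant = normalized_likelihoods(feature, [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
```

In this toy case the best-matching likelihood drops from nearly 1 to about one half once the centre is duplicated, which is the mechanism by which a patch could fall below the object threshold.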
Ideally, every cluster should only contain similar patches, and every cluster should
represent a different set of patches. That way there is little redundancy across patches,
and every cluster represents a distinct and meaningful part of the liver. If this were the
case, it would become significantly easier to diagnose why false negatives or positives
occur in the occlusion maps. If all cluster centres were ideal, then false negatives or
positives in the occlusion map would be a result of not having enough training data to learn
a larger variance in liver shape, size, and texture. This generative aspect of the CompNet makes
it very attractive to use for medical imaging, because it’s easier to diagnose the cause of
lower occlusion or classification performance through the use of cluster visualizations.
This offers the human researcher an understanding of the representation learned for the
training objects.
5.2 Cluster Center Activation Visualizations
The activation maps in Figure 4.4 are a result of the 1×1 convolution operation of the vMF
cluster centres, or kernels, with the input image. The activation visualizations provide an
understanding as to how well or how poorly the cluster centres have been learned during
training. Each image is a different slice from the 256 channel feature map. For brevity, we
only included the first 35 maps, which are enough to support a meaningful discussion about
our cluster activations. Each channel within the feature map corresponds to a different
cluster centre. Intuitively, subsets of the activation map with high activation highlight
portions of the input image that are representative of the cluster for that feature
map. This gives the human researcher the ability to understand whether
the object parts learned during training offer a good representation of the object. Ideally,
every activation map should represent a different part of the liver getting activated.
However, this isn’t the case. We can see that there are certain activation maps that signal that
the entire liver is being activated, as seen in the third map from the left in the top row,
and the fourth map from the left in the second row from the top.
This means that there exist clusters that activate the entire liver, as opposed to smaller
regions. This shows that the object likelihood map isn’t accurate enough to select smaller
regions of the liver, suggesting that the representation learned for the liver object hasn’t
captured a large variance in shape, size, or texture of the liver. This poses a problem because
it becomes too difficult to localize more granular occlusions, such as smaller tumours, or
tumours with a similar texture to that of the liver. We see this issue occur in Figure 4.5,
where the tumours’ contents aren’t being localized as occlusions.
We can see that there are activation maps where the background is activated, or
where both the background and the liver are activated. This is a concern because the
activation of those clusters indicates that the background is treated as an object, which
explains why we see the background get categorized as a light occlusion in Figures 4.5 and ??.
5.3 Synthetic Occlusions
We can see in Figure 4.5 that given some manually added occlusion to a liver that is held
out from the training set, we’re able to at least partially localize that occlusion, whether
it be a tumour, a common object or a noise patch. In the first row of Figure 4.5, there are
three tumours, each of different texture, shape, and size. In Figure 4.5a, we detect the
tumour’s boundary, but not its contents. Similarly, in Figure 4.5b we detect a portion of the
left border of the tumour and half of its contents, but its scores are low (orange pixels).
There are some high-scoring blue pixels on that border of the liver, but a portion of those
blue pixels can
be concluded to be false positives when compared to the other input images with the same
liver but different occlusions. In Figure 4.5c we detect the border of the tumour and some
of its content. Finally, in Figure 4.5f, we again only detect the border of the tumour. This
leads us to believe that the CompNet has difficulty in discriminating between liver and
tumour tissue. The fact that we can detect the border of the tumour but have difficulty
in capturing its contents shows that the CompNet can detect a significant difference in
contrast and texture when the different textures neighbour one another. We believe it
continues to classify tumour content as liver tissue because the CompNet hasn’t been
trained well enough to discriminate between tumour and liver tissue, since their textures
are similar. This suggests that the false negatives are a result of not having learned a highly
accurate representation of the liver. However, more work is needed to prove that a highly
accurate representation of the liver using a CompNet would be able to correctly localize
a large variety of tumours.
In contrast, in Figures 4.5d and 4.5e, the monkey face and the car occlusions are very
easily localized: their borders and contents are categorized as having a high likelihood of
occlusion (blue pixels). Their textures and colours are very different from liver tissue, hence
they’re easily localized by the CompNet. This shows that using a somewhat acceptable
representation of the liver with a CompNet allows discovering grossly different objects
as occluders on the object of interest.
We can also see that blood vessels, which have a bright white texture on the liver, are
detected as false positives (blue pixels). We believe this is a result of not enough training
data, since the appearance of blood vessels can vary greatly from one liver to another. We
also see that blood vessels have been captured within several cluster centre patches. This
means that the CompNet understands that a blood vessel may be part of the liver, but it
lacks an understanding of a larger variance and variety of blood vessels, hence the false
positives.
Finally, the manually added grey patches that are of a similar colour to the liver tissue
are easily localized (blue pixels), as seen in Figures 4.5g, 4.5h, and 4.5i. The fact that these
patches, with an extremely similar colour to the liver tissue, are so easily localized shows
that the main bottleneck in the CompNet’s performance for localizing tumours correctly
lies in discriminating the texture between liver and tumour tissue. We see this
again when discussing results from localizing real tumours in liver slices.
5.4 LiTS Tumour Occlusions
After analyzing Figures 4.6a through 4.6f, we can see that the bottom portion of the liver is
being localized as an occlusion because a shape like that hasn’t been seen in the training
set. Additionally, the tumour tissue texture and contrast in the majority of these input
images is extremely similar to the liver tissue texture and contrast. This again indicates that
the CompNet’s bottleneck lies in discriminating between liver and tumour texture, and that
the success in localizing the tumour’s border in the synthetic occlusion examples is due to
its easily detectable difference in contrast. We can also see that the left part of the
liver is being classified as an occlusion, showing that the CompNet hasn’t learned a
representation of the left side of the liver that is characteristic of the LiTS dataset.
We can also see in the cluster centre Figures 4.1, 4.2, and 4.3 that the upper left and middle
left portions aren’t well captured by the vMF clusters. This suggests that the CompNet has
difficulty in learning the left portion of the liver, which is consistent with the left-most part
of the liver in the LiTS test images being classified as an occlusion. This problem is seen again
when analyzing the MUHC PACS liver slices in Section 5.5.
5.5 MUHC PACS Tumour Occlusions
We can see in Figures 4.7a, 4.7b, and 4.7d that several portions of the liver are classified
as occlusions, such as the left and right portions of the Test 1 and 2 livers, and the right
portion of the Test 4 liver. This shows that those parts of the liver aren’t very
representative of the average liver in the training set. However, we can see in Figures 4.7a,
4.7b, and 4.7e that portions of the tumours are successfully classified as occlusions. We see
that part of the large left tumour in the Test 1 liver, the boundary of the middle tumour in the
Test 2 liver, and the middle and right tumours in the Test 5 liver are successfully classified
as occlusions. This shows promise in our work, but evidently more work is needed,
as we can see that some obvious tumours in the Test 3 and Test 4 livers aren’t localized.
None of the tumours in the Test 3 liver are localized, and neither are the two large bottom
tumours in the Test 4 liver.
5.6 Future Work
Future work should include an ability to measure the occlusion localization performance
quantitatively. As of now, there are problems with analyzing performance via
visualizations. Analyzing via visualization is slow, small changes may not be caught by the human
eye, a visual check of the cluster centre patches needs to be performed, and it takes more
time and memory to generate the visualizations. In future work, we could create
expertly annotated ground truth segmentation maps for the synthetic occlusions, or use the
segmentation maps from the LiTS dataset, and compute a DICE score between the ground
truth and the occlusion maps generated from the CompNet for each respective experiment.
The quantitative measure would enable the user to use an optimization function to
pick optimal values for the vMF kappa, the occlusion threshold, the number of vMF cluster
centres, and the number of sampled features. Other future work therefore
entails using an optimization function to pick these parameters.
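As an illustration of the proposed quantitative evaluation, the sketch below scores thresholded occlusion maps against a ground truth mask with a DICE score and picks the best of several candidate thresholds. It is a sketch of the proposal only: the exhaustive search stands in for whichever optimization function would be used, the function names are hypothetical, and the scores in the example are made up.

```python
def dice_score(pred, truth):
    """DICE overlap between two flattened binary masks (1.0 means identical)."""
    intersection = sum(p & t for p, t in zip(pred, truth))
    total = sum(pred) + sum(truth)
    return 1.0 if total == 0 else 2.0 * intersection / total


def best_threshold(scores, truth, candidates):
    """Return the candidate threshold whose occlusion mask best matches truth."""
    def score_at(threshold):
        pred = [1 if s > threshold else 0 for s in scores]
        return dice_score(pred, truth)

    return max(candidates, key=score_at)


# Flattened toy occlusion scores and the matching ground truth tumour mask.
toy_scores = [5.0, 18.0, 25.0, 40.0]
toy_truth = [0, 0, 1, 1]
chosen = best_threshold(toy_scores, toy_truth, candidates=[10.0, 21.0, 30.0])
```

The same loop could be extended to the vMF kappa, the number of cluster centres, and the number of sampled features, at the cost of retraining the clusters for each setting.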
5.7 Hypothesis Discussion and Conclusion
Our results show that CompNets are not currently able to localize the entirety of a
tumour. However, we have seen in our experiments that they can at least detect parts of
synthetically added liver tumours, even if the tumour appears in different shapes, sizes,
and textures. Most notably, we have shown that they can detect and localize the boundaries
of synthetically added tumours. Currently, our results indicate that the implementation
of CompNets for tumour localization isn’t finished, but that it has promising potential in
unsupervised tumour localization, since the CompNet is at least able to detect boundaries
of synthetically added tumours. The fact that we can accurately localize entire common
objects that occlude livers shows that the CompNet possesses an ability to discriminate
well between grossly different anomalies and liver tissue. We have also shown through
our visualization of the cluster centre patches that the CompNet can learn a
representation of the liver from little training data. We have shown the potential in CompNets for
medical imaging analysis, but there are still several challenges that need to be addressed.
By potential, we mean that we have shown the CompNet can at least localize parts of
synthetically added tumours, without the need for tumour training data and with little liver
training data. We believe that the main challenge to address involves learning ideal
cluster centres so that the tumour tissue is more accurately discriminated from liver tissue.
More experiments are needed for learning better parts, compositions, or representations
of the liver to prove more conclusively whether CompNets can successfully localize the
entirety of synthetically added and natural tumours.
This area of research remains open, and more investigation is needed. We cannot
conclusively determine whether CompNets can localize the entirety of tumours, but we have
shown that they can localize the borders and edges of synthetically added tumours. To
conclude, we cannot confirm our hypothesis, but we believe there is potential in
exploring Compositional Networks for tumour localization. More work is needed to show
that they can be used reliably for this task. Our experiments and discussion show that
Compositional Networks can detect parts of synthetically added tumours in an unsupervised
way, even with a very small amount of training data for learning a liver representation.
Chapter 6
Appendix
6.1 Image Registration Script
#!/bin/sh
# Affine registration of the .nii volumes in the current directory with FSL.
# Each registration is written as one command line to .commands, and the
# whole batch is then submitted with fsl_sub.

Usage() {
    cat <<EOF
Usage: affine_reg [options]

Target-selection options - choose ONE of:
 -T          : use FMRIB58_FA_1mm as target
               for nonlinear registrations (recommended)
 -t <target> : use <target> image as target
               for nonlinear registrations
 -n          : find best target from all images
               in nii_volumes
EOF
    exit 1
}

# Queue a registration of every other volume to the candidate target $1.
# -e estimates the transform without applying it, -a restricts fsl_reg to
# the affine stage, and the FLIRT matrix is saved as <output>.mat.
estimate_reg() {
    f=$1
    for g in *.nii ; do
        o=${g}_to_$f
        if [ "$f" != "$g" ] ; then
            # Usage: fsl_reg <input> <reference> <output>
            echo "$FSLDIR/bin/fsl_reg $g $f $o -e \
                -a -flirt \"-omat ${o}.mat\"" >> .commands
        fi
    done
}

# Queue an affine registration of every volume to the target $1,
# writing the registered images into the directory $2.
do_reg() {
    target=$1
    out_dir=$2
    for src in *.nii ; do
        o=${out_dir}/${src}_to_$target
        if [ "$target" != "$src" ] ; then
            echo "$FSLDIR/bin/fsl_reg $src $target $o \
                -a -flirt \"-out ${o}\"" >> .commands
        fi
    done
}

[ "$1" = "" ] && Usage

# Log the invocation context.
echo "[$(date)] [$(hostname)] [$(uname -a)] [$(pwd)] [$0 $*]"

if [ "$1" = -n ] ; then
    for f in *.nii ; do
        estimate_reg "$f"
    done
else
    OUTDIR=out
    if [ "$1" = -T ] ; then
        TARGET=$FSLDIR/data/standard/nii_volumes
    elif [ "$1" = -t ] ; then
        TARGET=$2
    else
        Usage
    fi
    # imtest prints 0 when the image cannot be found.
    if [ "$(${FSLDIR}/bin/imtest $TARGET)" = 0 ] ; then
        echo ""
        echo "Error: target image $TARGET not valid"
        Usage
    fi
    mkdir -p "$OUTDIR"
    #$FSLDIR/bin/imcp $TARGET out/target
    do_reg "$TARGET" "$OUTDIR"
fi

echo "Running .commands"
${FSLDIR}/bin/fsl_sub -l logs -T 60 -N affine_reg \
    -t .commands
rm .commands
Bibliography
[1] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521, no. 7553,
pp. 436–444, 2015.
[2] D. George, W. Lehrach, K. Kansky, M. Lazaro-Gredilla, C. Laan, B. Marthi, X. Lou,
Z. Meng, Y. Liu, H. Wang, A. Lavin, and D. S. Phoenix, “A generative vision model
that trains with high data efficiency and breaks text-based CAPTCHAs,” en, Science,
vol. 358, no. 6368, Dec. 2017, ISSN: 0036-8075, 1095-9203.
DOI: 10.1126/science.aag2612. [Online]. Available:
https://science.sciencemag.org/content/358/6368/eaag2612 (visited on 05/22/2020).
[3] M. D. Kohli, R. M. Summers, and J. R. Geis, “Medical Image Data and Datasets in
the Era of Machine Learning—Whitepaper from the 2016 C-MIMI Meeting Dataset
Session,” en, Journal of Digital Imaging, vol. 30, no. 4, pp. 392–399, Aug. 2017, ISSN:
1618-727X. DOI: 10.1007/s10278-017-9976-3. [Online]. Available: https:
//doi.org/10.1007/s10278-017-9976-3 (visited on 04/07/2021).
[4] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet Classification with Deep
Convolutional Neural Networks,” en, in Advances in Neural Information Processing
Systems 25, 2012, pp. 1097–1105.
[5] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille, “Deeplab: Se-
mantic image segmentation with deep convolutional nets, atrous convolution, and
fully connected crfs,” IEEE transactions on pattern analysis and machine intelligence,
vol. 40, no. 4, pp. 834–848, 2017.
[6] O. Ronneberger, P. Fischer, and T. Brox, “U-Net: Convolutional Networks for Biomed-
ical Image Segmentation,” en, in Medical Image Computing and Computer-Assisted In-
tervention – MICCAI 2015, N. Navab, J. Hornegger, W. M. Wells, and A. F. Frangi,
Eds., ser. Lecture Notes in Computer Science, Cham: Springer International Pub-
lishing, 2015, pp. 234–241, ISBN: 978-3-319-24574-4. DOI: 10.1007/978-3-319-
24574-4_28.
[7] A. Kalinovsky and V. Kovalev, “Lung image segmentation using deep learning
methods and convolutional neural networks,” XIII International Conference on Pat-
tern Recognition and Information Processing, 2016. [Online]. Available: https://
elib.bsu.by/bitstream/123456789/158557/1/Kallinovsky_Kovalev.
pdf.
[8] M. I. Razzak, S. Naz, and A. Zaib, “Deep learning for medical image processing:
Overview, challenges and the future,” Classification in BioApps, pp. 323–350, 2018.
[9] P. Savadjiev, J. Chong, A. Dohan, M. Vakalopoulou, C. Reinhold, N. Paragios, and
B. Gallix, “Demystification of ai-driven medical image interpretation: Past, present
and future,” European radiology, vol. 29, no. 3, pp. 1616–1624, 2019.
[10] L. Zhu, Y. Chen, and A. Yuille, “Recursive Compositional Models for Vision: De-
scription and Review of Recent Work,” en, Journal of Mathematical Imaging and Vi-
sion, vol. 41, no. 1-2, pp. 122–146, Sep. 2011, ISSN: 0924-9907, 1573-7683. DOI: 10.
1007/s10851-011-0282-2. [Online]. Available: http://link.springer.
com/10.1007/s10851-011-0282-2 (visited on 11/26/2020).
[11] A. Kortylewski, Q. Liu, H. Wang, Z. Zhang, and A. Yuille, “Combining Compo-
sitional Models and Deep Networks For Robust Object Classification under Occlu-
sion,” arXiv:1905.11826 [cs], Jan. 2020, arXiv: 1905.11826. [Online]. Available: http:
//arxiv.org/abs/1905.11826 (visited on 05/04/2020).
[12] P. J. Werbos, “Backpropagation through time: What it does and how to do it,” Pro-
ceedings of the IEEE, vol. 78, no. 10, pp. 1550–1560, 1990.
[13] J.-G. Lee, S. Jun, Y.-W. Cho, H. Lee, G. B. Kim, J. B. Seo, and N. Kim, “Deep learning
in medical imaging: General overview,” Korean journal of radiology, vol. 18, no. 4,
p. 570, 2017.
[14] V. Dumoulin and F. Visin, “A guide to convolution arithmetic for deep learning,”
en, arXiv:1603.07285 [cs, stat], Jan. 2018, arXiv: 1603.07285. [Online]. Available: http:
//arxiv.org/abs/1603.07285 (visited on 07/30/2020).
[15] Y. Bengio, P. Simard, and P. Frasconi, “Learning long-term dependencies with
gradient descent is difficult,” IEEE Transactions on Neural Networks, vol. 5, no. 2,
pp. 157–166, Mar. 1994, ISSN: 1941-0093. DOI: 10.1109/72.279181.
[16] Y. A. LeCun, L. Bottou, G. B. Orr, and K.-R. Muller, “Efficient backprop,” in
Neural Networks: Tricks of the Trade, 2012.
[17] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training
by reducing internal covariate shift,” in International conference on machine learning,
PMLR, 2015, pp. 448–456.
[18] H.-C. Shin, H. R. Roth, M. Gao, L. Lu, Z. Xu, I. Nogues, J. Yao, D. Mollura, and
R. M. Summers, “Deep convolutional neural networks for computer-aided detec-
tion: Cnn architectures, dataset characteristics and transfer learning,” IEEE transac-
tions on medical imaging, vol. 35, no. 5, pp. 1285–1298, 2016.
[19] K. Weiss, T. M. Khoshgoftaar, and D. Wang, “A survey of transfer learning,” Journal
of Big data, vol. 3, no. 1, pp. 1–40, 2016.
[20] L. Prechelt, “Early Stopping - But When?” en, in Neural Networks: Tricks of the Trade,
ser. Lecture Notes in Computer Science, G. B. Orr and K.-R. Muller, Eds., Berlin,
Heidelberg: Springer, 1998, pp. 55–69, ISBN: 978-3-540-49430-0. DOI: 10.1007/3-
540-49430-8_3. [Online]. Available: https://doi.org/10.1007/3-540-
49430-8_3 (visited on 04/06/2021).
[21] M. Havaei, A. Davy, D. Warde-Farley, A. Biard, A. Courville, Y. Bengio, C. Pal,
P.-M. Jodoin, and H. Larochelle, “Brain tumor segmentation with Deep Neural Net-
works,” en, Medical Image Analysis, vol. 35, pp. 18–31, Jan. 2017, ISSN: 1361-8415.
DOI: 10.1016/j.media.2016.05.004. [Online]. Available: https://www.
sciencedirect.com/science/article/pii/S1361841516300330 (vis-
ited on 03/08/2021).
[22] C. Shorten and T. M. Khoshgoftaar, “A survey on Image Data Augmentation for
Deep Learning,” en, Journal of Big Data, vol. 6, no. 1, p. 60, Jul. 2019, ISSN: 2196-1115.
DOI: 10.1186/s40537-019-0197-0. [Online]. Available: https://doi.org/
10.1186/s40537-019-0197-0 (visited on 04/24/2020).
[23] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A.
Courville, and Y. Bengio, “Generative Adversarial Nets,” en, NIPS’14: Proceedings
of the 27th International Conference on Neural Information Processing Systems, vol. 2,
pp. 2672–2680, 2014.
[24] T. C. W. Mok and A. C. S. Chung, “Learning Data Augmentation for Brain Tumor
Segmentation with Coarse-to-Fine Generative Adversarial Networks,” en, arXiv:1805.11291
[cs], vol. 11383, pp. 70–80, 2019, arXiv: 1805.11291. DOI: 10.1007/978-3-030-
11723-8_7. [Online]. Available: http://arxiv.org/abs/1805.11291 (vis-
ited on 12/07/2020).
[25] S. Jaiswal, A. Mehta, and G. C. Nandi, “Investigation on the effect of l1 an l2 regu-
larization on image features extracted using restricted boltzmann machine,” in 2018
Second International Conference on Intelligent Computing and Control Systems (ICICCS),
2018, pp. 1548–1553. DOI: 10.1109/ICCONS.2018.8663071.
[26] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout:
A simple way to prevent neural networks from overfitting,” The journal of machine
learning research, vol. 15, no. 1, pp. 1929–1958, 2014.
[27] C. Nwankpa, W. Ijomah, A. Gachagan, and S. Marshall, “Activation Functions:
Comparison of trends in Practice and Research for Deep Learning,” arXiv:1811.03378
[cs], Nov. 2018, arXiv: 1811.03378. [Online]. Available: http://arxiv.org/abs/
1811.03378 (visited on 04/07/2021).
[28] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. MIT Press, 2016, http:
//www.deeplearningbook.org.
[29] J. Turian, J. Bergstra, and Y. Bengio, “Quadratic features and deep architectures for
chunking,” in Proceedings of Human Language Technologies: The 2009 Annual Confer-
ence of the North American Chapter of the Association for Computational Linguistics, Com-
panion Volume: Short Papers, 2009, pp. 245–248.
[30] B. Karlik and A. V. Olgac, “Performance analysis of various activation functions in
generalized mlp architectures of neural networks,” International Journal of Artificial
Intelligence and Expert Systems, vol. 1, no. 4, pp. 111–122, 2011.
[31] V. Nair and G. E. Hinton, “Rectified linear units improve restricted boltzmann ma-
chines,” in Icml, 2010.
[32] P. Ramachandran, B. Zoph, and Q. V. Le, “Searching for activation functions,” arXiv
preprint arXiv:1710.05941, 2017.
[33] Y.-L. Boureau, J. Ponce, and Y. LeCun, “A Theoretical Analysis of Feature Pooling
in Visual Recognition,” en, Proceedings of the 27th International Conference on Machine
Learning (ICML-10), p. 8, 2010.
[34] W. Liu, Z. Wang, X. Liu, N. Zeng, Y. Liu, and F. E. Alsaadi, “A survey of deep
neural network architectures and their applications,” en, Neurocomputing, vol. 234,
pp. 11–26, Apr. 2017, ISSN: 0925-2312. DOI: 10.1016/j.neucom.2016.12.038.
[Online]. Available: https://www.sciencedirect.com/science/article/
pii/S0925231216315533 (visited on 04/12/2021).
[35] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to
document recognition,” Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.
[36] Y. Le Cun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and
L. D. Jackel, “Handwritten digit recognition with a back-propagation network,” in
Proceedings of the 2nd International Conference on Neural Information Processing Sys-
tems, 1989, pp. 396–404.
[37] A. Toshev and C. Szegedy, “Deeppose: Human pose estimation via deep neural net-
works,” in Proceedings of the IEEE conference on computer vision and pattern recognition,
2014, pp. 1653–1660.
[38] R. Girshick, “Fast r-cnn,” in Proceedings of the IEEE international conference on com-
puter vision, 2015, pp. 1440–1448.
[39] K. He, G. Gkioxari, P. Dollar, and R. Girshick, “Mask r-cnn,” in Proceedings of the
IEEE international conference on computer vision, 2017, pp. 2961–2969.
[40] R. Zhao, W. Ouyang, H. Li, and X. Wang, “Saliency detection by multi-context deep
learning,” in Proceedings of the IEEE conference on computer vision and pattern recogni-
tion, 2015, pp. 1265–1274.
[41] J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng, and T. Darrell, “De-
caf: A deep convolutional activation feature for generic visual recognition,” in In-
ternational conference on machine learning, PMLR, 2014, pp. 647–655.
[42] C. Farabet, C. Couprie, L. Najman, and Y. LeCun, “Learning hierarchical features for
scene labeling,” IEEE transactions on pattern analysis and machine intelligence, vol. 35,
no. 8, pp. 1915–1929, 2012.
[43] K. Zhang, W. Zuo, S. Gu, and L. Zhang, “Learning deep cnn denoiser prior for
image restoration,” in Proceedings of the IEEE conference on computer vision and pattern
recognition, 2017, pp. 3929–3938.
[44] Y. Gordienko, P. Gang, J. Hui, W. Zeng, Y. Kochura, O. Alienin, O. Rokovyi, and
S. Stirenko, “Deep Learning with Lung Segmentation and Bone Shadow Exclusion
Techniques for Chest X-Ray Analysis of Lung Cancer,” en, in Advances in Computer
Science for Engineering and Education, Z. Hu, S. Petoukhov, I. Dychka, and M. He,
Eds., vol. 754, ser. Advances in Intelligent Systems and Computing, Cham:
Springer International Publishing, 2019, pp. 638–647, ISBN: 978-3-319-91007-9 978-
3-319-91008-6. DOI: 10.1007/978-3-319-91008-6_63. [Online]. Available:
http://link.springer.com/10.1007/978-3-319-91008-6_63 (visited
on 07/15/2020).
[45] M. Havaei, A. Davy, D. Warde-Farley, A. Biard, A. Courville, Y. Bengio, C. Pal, P.-M.
Jodoin, and H. Larochelle, “Brain tumor segmentation with deep neural networks,”
Medical image analysis, vol. 35, pp. 18–31, 2017.
[46] A. Prasoon, K. Petersen, C. Igel, F. Lauze, E. Dam, and M. Nielsen, “Deep feature
learning for knee cartilage segmentation using a triplanar convolutional neural net-
work,” in International conference on medical image computing and computer-assisted
intervention, Springer, 2013, pp. 246–253.
[47] H. R. Roth, A. Farag, L. Lu, E. B. Turkbey, and R. M. Summers, “Deep convolutional
networks for pancreas segmentation in ct imaging,” in Medical Imaging 2015: Image
Processing, International Society for Optics and Photonics, vol. 9413, 2015, 94131G.
[48] C. Cernazanu-Glavan and S. Holban, “Segmentation of bone structure in x-ray im-
ages using convolutional neural network,” Adv. Electr. Comput. Eng, vol. 13, no. 1,
pp. 87–94, 2013.
[49] D. C. Ciresan, A. Giusti, L. M. Gambardella, and J. Schmidhuber, “Mitosis detection
in breast cancer histology images with deep neural networks,” in International con-
ference on medical image computing and computer-assisted intervention, Springer, 2013,
pp. 411–418.
[50] J. A. Stark, “Adaptive image contrast enhancement using generalizations of his-
togram equalization,” IEEE Transactions on image processing, vol. 9, no. 5, pp. 889–
896, 2000.
[51] S. Lyu and E. P. Simoncelli, “Nonlinear image representation using divisive normal-
ization,” in 2008 IEEE Conference on Computer Vision and Pattern Recognition, IEEE,
2008, pp. 1–8.
[52] [Online]. Available: http://tuberculosis.by/.
[53] J. Shiraishi, S. Katsuragawa, J. Ikezoe, T. Matsumoto, T. Kobayashi, K.-i. Komatsu,
M. Matsui, H. Fujita, Y. Kodera, and K. Doi, “Development of a digital image database
for chest radiographs with and without a lung nodule: Receiver operating charac-
teristic analysis of radiologists’ detection of pulmonary nodules,” American Journal
of Roentgenology, vol. 174, no. 1, pp. 71–74, 2000.
[54] W. S. H. M. Wan Ahmad, W. M. D. W Zaki, and M. F. Ahmad Fauzi, “Lung segmen-
tation on standard and mobile chest radiographs using oriented Gaussian deriva-
tives filter,” en, BioMedical Engineering OnLine, vol. 14, no. 1, p. 20, Mar. 2015, ISSN:
1475-925X. DOI: 10.1186/s12938-015-0014-8. [Online]. Available: https:
//doi.org/10.1186/s12938-015-0014-8 (visited on 04/07/2021).
[55] D. Zikic, Y. Ioannou, M. Brown, and A. Criminisi, “Segmentation of brain tumor
tissues with convolutional neural networks,” Proceedings MICCAI-BRATS, vol. 36,
pp. 36–39, 2014.
[56] I. Goodfellow, D. Warde-Farley, M. Mirza, A. Courville, and Y. Bengio, “Maxout
networks,” in International conference on machine learning, PMLR, 2013, pp. 1319–
1327.
[57] B. Menze, A. Jakab, S. Bauer, J. Kalpathy-Cramer, K. Farahani, J. Kirby, Y. Burren, N.
Porz, J. Slotboom, R. Wiest, L. Lanczi, E. Gerstner, M.-A. Weber, T. Arbel, B. Avants,
N. Ayache, P. Buendia, L. Collins, N. Cordier, J. Corso, A. Criminisi, T. Das, H.
Delingette, C. Demiralp, C. Durst, M. Dojat, S. Doyle, J. Festa, F. Forbes, E. Geremia,
B. Glocker, P. Golland, X. Guo, A. Hamamci, K. Iftekharuddin, R. Jena, N. John,
E. Konukoglu, D. Lashkari, J. Antonio Mariz, R. Meier, S. Pereira, D. Precup, S. J.
Price, T. Riklin-Raviv, S. Reza, M. Ryan, L. Schwartz, H.-C. Shin, J. Shotton, C. Silva,
N. Sousa, N. Subbanna, G. Szekely, T. Taylor, O. Thomas, N. Tustison, G. Unal, F.
Vasseur, M. Wintermark, D. Hye Ye, L. Zhao, B. Zhao, D. Zikic, M. Prastawa, M.
Reyes, and K. Van Leemput, “The Multimodal Brain Tumor Image Segmentation
Benchmark (BRATS),” IEEE Transactions on Medical Imaging, p. 33, 2014. DOI: 10.
1109/TMI.2014.2377694. [Online]. Available: https://hal.inria.fr/
hal-00935640.
[58] D. C. Ciresan, L. M. Gambardella, and A. Giusti, “Deep Neural Networks Segment
Neuronal Membranes in Electron Microscopy Images,” en, p. 9,
[59] O. Oktay, J. Schlemper, L. L. Folgoc, M. Lee, M. Heinrich, K. Misawa, K. Mori, S.
McDonagh, N. Y. Hammerla, B. Kainz, et al., “Attention u-net: Learning where to
look for the pancreas,” arXiv preprint arXiv:1804.03999, 2018.
[60] O. Cicek, A. Abdulkadir, S. S. Lienkamp, T. Brox, and O. Ronneberger, “3d u-net:
Learning dense volumetric segmentation from sparse annotation,” in International
conference on medical image computing and computer-assisted intervention, Springer,
2016, pp. 424–432.
[61] P. Esser, E. Sutter, and B. Ommer, “A variational u-net for conditional appearance
and shape generation,” in Proceedings of the IEEE Conference on Computer Vision and
Pattern Recognition, 2018, pp. 8857–8866.
[62] N. Ibtehaz and M. S. Rahman, “Multiresunet: Rethinking the u-net architecture for
multimodal biomedical image segmentation,” Neural Networks, vol. 121, pp. 74–87,
2020.
[63] D. Mumford, “Pattern Theory: A Unifying Perspective,” en, Pattern Theory, vol. 3,
p. 38, 1992. [Online]. Available: https://link.springer.com/chapter/10.
1007/978-3-0348-9110-3_6.
[64] X. Mei, H. Ling, and D. W. Jacobs, “Sparse representation of cast shadows via
ℓ1-regularized least squares,” in 2009 IEEE 12th International Conference on Computer
Vision, 2009, pp. 583–590. DOI: 10.1109/ICCV.2009.5459185.
[65] X. Li, T. Jia, and H. Zhang, “Expression-Insensitive 3D Face Recognition using
Sparse Representation,” en, 2009 IEEE Conference on Computer Vision and Pattern
Recognition, p. 8, 2009. DOI: 10.1109/CVPR.2009.5206613.
[66] K. Huang and S. Aviyente, “Sparse Representation for Signal Classification,” en,
Advances in Neural Information Processing Systems 19: Proceedings of the 2006 Confer-
ence, p. 8, 2006.
[67] J. Wright, Y. Ma, J. Mairal, G. Sapiro, T. S. Huang, and S. Yan, “Sparse Representation
for Computer Vision and Pattern Recognition,” Proceedings of the IEEE, vol. 98, no. 6,
pp. 1031–1044, Jun. 2010, ISSN: 1558-2256. DOI: 10.1109/JPROC.2010.2044470.
[68] Y.-Y. Wang, M. Mahajan, and X. Huang, “A unified context-free grammar and n-
gram model for spoken language processing,” in 2000 IEEE International Conference
on Acoustics, Speech, and Signal Processing. Proceedings (Cat. No. 00CH37100), IEEE,
vol. 3, 2000, pp. 1639–1642.
[69] A. Kortylewski, J. He, Q. Liu, and A. Yuille, “Compositional Convolutional Neu-
ral Networks: A Deep Architecture with Innate Robustness to Partial Occlusion,”
arXiv:2003.04490 [cs], Apr. 2020, arXiv: 2003.04490. [Online]. Available: http://
arxiv.org/abs/2003.04490 (visited on 05/04/2020).
[70] A. Banerjee, I. S. Dhillon, J. Ghosh, and S. Sra, “Clustering on the Unit Hypersphere
using von Mises-Fisher Distributions,” en, 20th International Conference on Pattern
Recognition, p. 38, 2010. DOI: 10.1109/ICPR.2010.522.
[71] G. N. Watson, A treatise on the theory of Bessel functions. Cambridge university press,
1995.
[72] A. Kortylewski, Q. Liu, H. Wang, Z. Zhang, and A. Yuille, “Localizing Occlud-
ers with Compositional Convolutional Networks,” en, in 2019 IEEE/CVF Interna-
tional Conference on Computer Vision Workshop (ICCVW), Seoul, Korea (South): IEEE,
Oct. 2019, pp. 2029–2032, ISBN: 978-1-72815-023-9. DOI: 10.1109/ICCVW.2019.
00253. [Online]. Available: https://ieeexplore.ieee.org/document/
9022239/ (visited on 06/28/2021).
[73] A. E. Kavur, M. A. Selver, O. Dicle, M. Barıs, and N. S. Gezer, CHAOS - Combined
(CT-MR) Healthy Abdominal Organ Segmentation Challenge Data, version v1.03, Zen-
odo, Apr. 2019. DOI: 10.5281/zenodo.3362844. [Online]. Available: https:
//doi.org/10.5281/zenodo.3362844.
[74] P. Bilic, P. F. Christ, E. Vorontsov, G. Chlebus, H. Chen, Q. Dou, C.-W. Fu, X. Han,
P.-A. Heng, J. Hesser, et al., “The liver tumor segmentation benchmark (lits),” arXiv
preprint arXiv:1901.04056, 2019.
[75] FLIRT - FslWiki. [Online]. Available: https://fsl.fmrib.ox.ac.uk/fsl/
fslwiki/FLIRT (visited on 05/17/2021).
[76] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “Imagenet: A large-scale
hierarchical image database,” in 2009 IEEE conference on computer vision and pattern
recognition, Ieee, 2009, pp. 248–255.
[77] [Online]. Available: https://www.gimp.org/.