MEDICAL IMAGE SEGMENTATION FOR EMBRYO IMAGE ANALYSIS
A THESIS SUBMITTED TO THE GRADUATE DIVISION OF THE UNIVERSITY OF HAWAI‘I IN PARTIAL FULFILLMENT OF THE
REQUIREMENTS FOR THE DEGREE OF
MASTER OF SCIENCE
IN
ELECTRICAL ENGINEERING
MAY 2020
By
Md Yousuf Harun
Thesis Committee:
Dr. Aaron Ohta, Chairperson
Dr. Il Yong Chun, Chairperson
Dr. Victor Lubecke
© Copyright 2020 by
Md Yousuf Harun
All Rights Reserved
To the sundial in the center of the courtyard
Acknowledgements
I would like to express my gratitude to a number of individuals who have been instrumental
to my education over the past two years at the University of Hawai‘i at Mānoa.
First of all, I would like to thank my advisors, Dr. Aaron Ohta and Dr. Il Yong Chun
for their enormous help in my MS research.
I am grateful to Dr. Aaron Ohta for involving me in the exciting interdisciplinary
embryo image segmentation project that brings together doctors and engineers. He always
supported me and gave me the freedom to pursue my research in directions that were of
special interest to me. I have learned many invaluable things from him that will pave the
way for my future research endeavors. Working with him has been a great opportunity,
rewarding in every aspect of my academic life.
I want to express my gratitude to Dr. Il Yong Chun for introducing me to the exciting
field of computational medical imaging. He always motivated me and guided me to perform
good research. I am thankful to him for helping me to improve my technical writing skills.
I would like to thank Dr. Victor Lubecke for taking time to be a part of my thesis
committee. I appreciate the input and support he has given in order to improve this thesis.
I want to thank Dr. Thomas Huang for his guidance and collaboration in the embryo
image segmentation project. No result in this thesis would have been possible without the
fruitful cooperation between doctors and engineers.
I am also thankful to M Arifur Rahman, Kareem Elassy, Mohsen Paryavi,
Meenakshi Vohra, Richie Chio, and the many laboratory colleagues I have had for their
support and guidance. The discussions I have had with them were instrumental to my
research and this manuscript. Special thanks to Arif for his help and suggestions in all
aspects of graduate life.
Most of all, I would like to thank my parents, family and friends for their continued support
throughout my life. Their encouragement has driven me to do my best as both a student
and a person. I dedicate this thesis to them.
Abstract
This thesis describes a project that applies electrical engineering to biomedical applications.
The project involves the development of a deep learning-based image segmentation method
to identify cellular regions in microscopic images of human embryos for their morphological
and morphokinetic analysis during in vitro fertilization (IVF) treatment. First, we aim
to segment inner cell mass (ICM) and trophectoderm epithelium (TE) in zona pellucida
(ZP)-intact embryos imaged by a microscope for morphological analysis. ICM and TE
segmentation in ZP-intact embryonic images is difficult due to the small number of training
images (211 ZP-intact embryonic images) and the similar textures among the ICM, TE, ZP,
and artifacts. We overcame these challenges by leveraging deep learning and
semantic segmentation techniques. In this work, we implemented a UNet variant named
Residual Dilated UNet (RD-UNet) to segment the ICM and TE in ZP-intact embryonic
images. We added residual convolutions to the encoding and decoding units and replaced
the conventional convolutional layers with multiple dilated convolutional layers at the central
bridge of RD-UNet.
bridge of RD-UNet. The experimental results with a testing set of 38 ZP-intact embryonic
images demonstrate that RD-UNet outperforms existing models. RD-UNet can identify
ICM with a Dice Coefficient of 94.3% and a Jaccard Index of 89.3%. The model can
segment TE with a Dice Coefficient of 92.5% and a Jaccard Index of 85.3%.
Second, we aim to segment inner cell regions in ZP-ablated embryonic images obtained
by time-lapse microscopic imaging for morphokinetic analysis. Segmenting inner cell
regions in ZP-ablated embryonic images presents the following challenges: irregular expansion of
the inner cell, surrounding fragmented cellular clusters and artifacts, and inner cell expansion
beyond the culture well. We propose a UNet-based architecture named Deep Dilated Residual
Recurrent UNet (D2R2-UNet) to segment inner cell regions in ZP-ablated embryonic
images. We incorporated residual recurrent convolution into the encoding and decoding
units, dilated convolution into the central bridge, and residual convolution into the
encoder-decoder skip-connections in order to maximize the segmentation performance. The
experimental results with a testing set of 342 ZP-ablated embryonic images demonstrate
that the proposed D2R2-UNet improves inner cell segmentation performance over existing
UNet variants. Our model obtains the best overall performance among the compared
models in inner cell segmentation, with a Jaccard Index of 95.65% and a Dice Coefficient
of 97.78%.
Table of Contents
Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iv
Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vi
List of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ix
List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . x
Chapter 1: Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 Motivation of Embryo Image Segmentation . . . . . . . . . . . . . . . . . . 1
1.2 Quantitative Evaluation of Embryo Viability . . . . . . . . . . . . . . . . . 3
1.3 ICM and TE Segmentation Challenges in Morphological Analysis . . . . . . 4
1.4 Inner Cell Segmentation Challenges in Morphokinetic Analysis . . . . . . . 5
1.5 Semantic Segmentation with Deep Learning . . . . . . . . . . . . . . . . . . 6
1.6 Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
Chapter 2: Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.1 Baseline UNet Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.2 Proposed D2R2-UNet Architecture . . . . . . . . . . . . . . . . . . . . . . . 13
2.2.1 Residual Convolutional Unit . . . . . . . . . . . . . . . . . . . . . . 14
2.2.2 Recurrent Convolutional Unit . . . . . . . . . . . . . . . . . . . . . . 14
2.2.3 Residual Recurrent (R2) Convolutional Unit . . . . . . . . . . . . . 15
2.2.4 Dilated Convolution in the Central Bridge . . . . . . . . . . . . . . . 16
2.2.5 Residual Convolutional Skip-Connections between Encoder and
Decoder . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.3 RD-UNet Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.4 Network Specifications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.5 Loss Function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.6 Implementation Details . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.7 Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.7.1 Dataset for ICM and TE Segmentation . . . . . . . . . . . . . . . . 21
2.7.2 Dataset for Inner Cell Segmentation . . . . . . . . . . . . . . . . . . 21
Chapter 3: Results and Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
3.1 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
3.2 ICM and TE Segmentation Results . . . . . . . . . . . . . . . . . . . . . . . 28
3.2.1 Quantitative Results . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
3.2.2 Qualitative Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
3.3 Inner Cell Segmentation Results . . . . . . . . . . . . . . . . . . . . . . . . 33
3.3.1 Quantitative Results . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.3.2 Qualitative Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
Chapter 4: Conclusion and Future Work . . . . . . . . . . . . . . . . . . . . . . . . . 40
4.1 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
4.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
List of Tables
3.1 Comparison of ICM results of our method with those of existing methods
based on the same data set. . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.2 Comparison of TE results of our method with those of existing methods based
on the same data set. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.3 Comparison among different UNet architectures based on their inner cell
segmentation performance, evaluated on the same testing set . . . . . . . . 34
List of Figures
1.1 (a) An image of an embryo and (b) its annotated regions. Here, ZP, ICM,
CM, and TE denote zona pellucida, inner cell mass, cavity mass, and
trophectoderm epithelium, respectively. . . . . . . . . . . . . . . . . . . . . 3
1.2 Expansion kinetics of (a) a genetically normal embryo and (b) a genetically
abnormal embryo. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.3 Examples of inner cell segmentation challenges in ZP-ablated embryo: (a)
ZP-ablated embryo, (b) artifacts, (c) inner cell beyond culture well. . . . . 5
1.4 Semantic segmentation: (a) an image of a street view and (b) its
pixel-annotated segmentation. . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.1 The baseline UNet architecture [2]. The height and width of each box
represent the image size and number of channels, respectively. The dotted
boxes denote copied feature maps. . . . . . . . . . . . . . . . . . . . . . . . 13
2.2 Different convolutional units we compared in UNet. (a) A convolutional
unit in the baseline UNet [2]. (b) A residual convolutional unit [4]. (c) A
recurrent convolutional unit [5]. (d) An R2 convolutional unit [11]. (e) A
recurrent convolutional layer [5] with the number of evolution steps S = 3.
For all UNet variations, we use ELU [10] instead of RELU [6] since ELU
slightly improved the embryo image segmentation performance. . . . . . . . 15
2.3 A residual convolutional encoder-decoder skip-connection consisting of four
residual convolutional layers, each of which applies 3×3 convolution followed
by ELU activation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.4 The D2R2-UNet architecture for inner cell segmentation: we modify
the UNet backbone in Fig. 2.1 by using R2 convolutional units, dilated
convolutional layers, and residual convolutional encoder-decoder skip-
connections. The height and width of each box represent the image size and
number of channels, respectively. The black and blue dotted boxes denote
central bridge and copied feature maps, respectively. . . . . . . . . . . . . . 17
2.5 The RD-UNet architecture for ICM and TE segmentation [1]: we modify the
UNet backbone in Fig. 2.1 by using residual convolutional units and dilated
convolutional layers. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
3.1 ICM segmentation results by RD-UNet. The background (non ICM) is
colored dark cyan, the annotated ground truth ICM is light green, the
network predicted ICM is yellow, and the contour of the ground truth ICM
is red. JI and DC stand for Jaccard Index and Dice Coefficient, respectively. 31
3.2 TE segmentation results by RD-UNet. The background (non TE) is colored
dark cyan, the annotated ground truth TE is light green, the network
predicted TE is yellow, and the contour of the ground truth TE is red. JI
and DC stand for Jaccard Index and Dice Coefficient, respectively. . . . . . 32
3.3 Comparisons of the joint loss (Equation 2.4) between different UNet variant
models for inner cell segmentation: (a) training loss and (b) testing loss. . . 34
3.4 Segmentation results. Light green in 2nd and 5th rows indicates segmented
inner cell by D2R2-UNet. Red and blue in 3rd and 6th rows indicate the
boundaries of ground truth and predicted inner cell, respectively. JI and DC
stand for Jaccard index and Dice coefficient, respectively. . . . . . . . . . . 36
Chapter 1
Introduction
1.1 Motivation of Embryo Image Segmentation
According to the Centers for Disease Control and Prevention, almost six million women in
the United States suffer from infertility [1]. The World Health Organization reports that the
total number of patients worldwide suffering from infertility is almost fifty million [1]. The
most effective treatment for infertility is in vitro fertilization (IVF), which is performed more
than one million times annually around the world [2]. However, IVF suffers from relatively
low birth rates, i.e., less than 30% in the US from 1995 to 2016 [1]. One of the reasons for
such low birth rates is the misidentification of embryo viability. During the IVF process, the
fertilized eggs (embryos) are cultured in controlled environmental conditions and imaged
digitally using microscopes or embryoscopes. When the embryos reach the blastocyst stage
(at least 32 cells on the fifth day of culture), the healthiest embryo is selected for implantation.
Morphology assessment is a standard approach for embryo grading in IVF. Several
studies have been conducted to find the most important features of embryo morphology
[3, 4, 5]. These studies suggest that morphological features such as the inner cell mass
(ICM), trophectoderm epithelium (TE), and degree of blastocoel cavity expansion relative
to the zona pellucida (ZP) are effective measures for determining embryo viability. The ICM
eventually develops into a fetus, which contains the major body organs [4]. Successful
hatching of an implanted embryo, i.e., live birth, correlates highly with a strong TE layer
[3]. Therefore, identification of the ICM and TE regions is important for evaluating embryo
implantation potential. In addition, [6] reports that the morphokinetics of an embryo
highly correlate with its genetic quality, i.e., euploid or aneuploid. Here, an embryo with
a higher expansion rate has higher reproductive potential. A related study demonstrates that
euploid (genetically normal) embryos expand more rapidly than aneuploid (genetically
abnormal) embryos [7].
The identification of inner cell expansion is crucial for morphokinetic analysis of embryos
toward genetic quality assessment in IVF. Traditionally, embryologists determine embryo
viability by manually evaluating the morphological features of embryos based on visual
inspection. This subjective and qualitative approach is prone to human bias and does not
consider the genetic quality of an embryo. In addition, it poses a high risk of misidentifying
embryo viability, abnormal pregnancies, and health risks; it is also a time-consuming task
for embryologists to manually analyze embryo morphology, which makes the approach
labor- and resource-inefficient. To increase the chance of successful pregnancy, multiple embryos
are often transferred to the mother's uterus, which oftentimes results in multiple pregnancies with
associated health complications. Thus, identification of the single embryo with the highest
potential for a live birth is critical to achieving sustained pregnancies and minimizing health risks.
Although preimplantation genetic screening (PGS) provides a good evaluation of embryo
genetics [8, 9], such genetic testing remains very expensive.
All the aforementioned issues necessitate a cost-effective, automated, quantitative
method for gauging embryo health. In this study, we developed a deep learning-based
segmentation method to precisely identify 1) ICM and TE regions in ZP-intact embryo
images, and 2) inner cell regions in ZP-ablated embryo images. We use deep neural networks
to recognize both local (texture) and contextual (spatial arrangement) representations of
different embryo regions and segment them in the noisy images.
Figure 1.1 (a) An image of an embryo and (b) its annotated regions. Here, ZP, ICM, CM, and TE denote zona pellucida, inner cell mass, cavity mass, and trophectoderm epithelium, respectively.
1.2 Quantitative Evaluation of Embryo Viability
There are two main approaches for quantitative evaluation of embryo viability:
1) morphological analysis: this is based on morphological attributes of embryo cellular
regions such as inner cell mass and trophectoderm epithelium, as illustrated in Fig. 1.1.
The size of these regions is a good indicator of embryo viability. Biological studies [3, 4, 5]
suggest that embryo morphology correlates with its health.
2) morphokinetic analysis: this is based on the morphokinetics of an embryo, i.e., how rapidly
the embryo grows in incubation. Studies [6, 7] suggest that the morphokinetics of an embryo
highly associate with its genetic quality. Genetically normal embryos expand more
rapidly than genetically abnormal embryos. Fig. 1.2 shows that a genetically normal embryo
has a higher expansion rate (steep slope) than a genetically abnormal embryo
(flat or negative slope). In a morphokinetic study, embryologists ablate the zona pellucida
(ZP) of an embryo to let the inner cell expand beyond the ZP region. They then apply time-
lapse microscopic imaging using embryoscopes to capture images of the ZP-ablated embryo
during a ten-hour observation period. Finally, they estimate the total area of the inner cell at
different time points and measure the expansion rate.
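To make this measurement concrete, the sketch below (not part of the thesis pipeline; the time points and areas are hypothetical) estimates the expansion rate as the slope of a least-squares line fit to the inner cell area over the observation period.

```python
import numpy as np

# Hypothetical measurement times (hours) and inner cell areas (pixels),
# e.g., from time-lapse frames captured at roughly 3 frames/hour.
times_h = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0])
areas_px = np.array([4.10e4, 4.25e4, 4.43e4, 4.60e4, 4.81e4, 5.02e4,
                     5.20e4, 5.44e4, 5.61e4, 5.85e4, 6.07e4])

# Expansion rate = slope of a least-squares line fit (pixels/hour);
# a steep positive slope is associated with euploid embryos.
slope, intercept = np.polyfit(times_h, areas_px, deg=1)
print(f"expansion rate: {slope:.1f} px/hour")
```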
Figure 1.2 Expansion kinetics of (a) a genetically normal embryo and (b) a genetically abnormal embryo.
1.3 ICM and TE Segmentation Challenges in Morphological
Analysis
ICM and TE analysis plays a crucial role in determining embryo viability for healthy
pregnancies in IVF. At the blastocyst stage, an embryo consists of three inner regions: 1)
the inner cell mass (ICM), 2) the trophectoderm epithelium (TE), and 3) the cavity mass (CM). These
inner regions are enclosed by an outer layer named the zona pellucida (ZP). For convenience,
we refer to such an embryo as a ZP-intact embryo. Fig. 1.1 illustrates a ZP-intact embryo and its
annotated inner (ICM, CM, TE) and outer (ZP) regions.
The ICM and TE have similar pixel intensity values, i.e., in general, it is hard to
distinguish them. They are also surrounded by two other embryo regions, the zona
pellucida (ZP) and cavity mass (CM), whose pixel intensity values are similar to those of the
ICM and TE. In addition, undesirable fragments and artifacts exist near the ICM and
TE regions. The similar pixel intensity values of the surrounding CM, ZP, artifacts, and fragments,
together with image contrast variations, make it challenging to differentiate between the ICM and TE
regions and precisely segment them. The number of training images in the dataset is also
small (211 images); this poses additional challenges for ICM and TE segmentation,
such as limited diversity in the training data and overfitting to the training data.
Figure 1.3 Examples of inner cell segmentation challenges in ZP-ablated embryos: (a) a ZP-ablated embryo, (b) artifacts, (c) inner cell expansion beyond the culture well.
1.4 Inner Cell Segmentation Challenges in Morphokinetic
Analysis
The inner cell segmentation is critical for the morphokinetic study using an embryoscope
[6]. The inner cell expansion rate is measured over a ten-hour observation period using
time-lapse microscopic imaging. In these embryos, the embryologists ablate the ZP to
perform preimplantation genetic screening. The goal of this project is to segment the inner cell
to facilitate the measurement of embryo morphokinetics, i.e., how rapidly the inner
cell expands, by estimating its total area. To estimate the total area of an embryo, [6]
segmented objects with circular shapes using the embryoscope software tool.
There are significant challenges in this segmentation task, because a) inner cells
expand at irregular rates, b) artifacts and fragmented cellular clusters can exist
close to inner cell outlines, and c) an expanded inner cell can have white bands and/or a dark
background due to its expansion beyond the culture well. Fig. 1.3 shows some examples of
such challenges.
Figure 1.4 Semantic segmentation: (a) an image of a street view and (b) its pixel-annotated segmentation.
1.5 Semantic Segmentation with Deep Learning
Semantic segmentation is a high-level task that facilitates complete scene understanding.
Semantic segmentation techniques are applied to a wide range of images and videos,
including still two-dimensional images, three-dimensional or volumetric images, and videos;
the techniques are used in various applications including autonomous driving [10], human-
machine interaction [11], computational photography [12], and image search engines [13].
Semantic segmentation corresponds to the pixel- or voxel-wise image classification task, where
each pixel or voxel is labeled according to the classes present in a two-dimensional or three-
dimensional image; see an example in Fig. 1.4.
Semantic segmentation has been addressed in the past using various computer vision
and machine learning techniques, such as active contour/snake models, clustering algorithms,
the watershed algorithm, graph-based region merging, random walks, and Markov random fields
[14]. Recent advancements in deep learning have shown potential to solve challenging
image segmentation problems [15]. The most popular convolutional neural network (CNN)
model is UNet, which shows strong performance in medical image segmentation tasks
[16]. The UNet architecture has been adapted for various medical applications such as
retina blood vessel segmentation [17], liver and tumor segmentation [18], skin lesion
segmentation [19], and surgical instrument segmentation [20].
To perform semantic segmentation, CNNs learn representative features of an image
and convert them into a pixel-wise categorization. In general, semantic segmentation CNN
models consist of an encoding network and a decoding network. The encoder converts an
input image into a set of representative feature maps. The role of the decoder is to convert
the encoded features, often at lower spatial resolution, back to the original high-resolution pixel
space and generate a pixel-wise classification map.
1.6 Outline
This thesis contributes to embryo image segmentation for IVF treatment. An outline of
the thesis follows.
Chapter 2 describes the methodology, the proposed and implemented neural network
architectures, network specifications, the loss function, implementation details, and the datasets.
Chapter 3 describes the evaluation metrics, performance comparisons, results, and
discussion.
Chapter 4 summarizes the performance of the developed methods and the contributions of
this work to related applications. The chapter also discusses future research directions.
References
[1] N. Gleicher, V. Kushnir, and D. Barad, “Worldwide decline of IVF birth rates and its
probable causes,” Human Reproduction Open, vol. 2019, no. 3, p. hoz017, 2019.
[2] E. Santos Filho, J. Noble, and D. Wells, “A review on automatic analysis of human
embryo microscope images,” The Open Biomedical Engineering Journal, vol. 4, p. 170,
2010.
[3] A. Ahlström, C. Westin, E. Reismer, M. Wikland, and T. Hardarson, “Trophectoderm
morphology: an important parameter for predicting live birth after single blastocyst
transfer,” Human Reproduction, vol. 26, no. 12, pp. 3289–3296, 2011.
[4] C. Lagalla, M. Barberi, G. Orlando, R. Sciajno, M. A. Bonu, and A. Borini, “A
quantitative approach to blastocyst quality evaluation: morphometric analysis and
related IVF outcomes,” Journal of Assisted Reproduction and Genetics, vol. 32, no. 5,
pp. 705–712, 2015.
[5] W. B. Schoolcraft, D. K. Gardner, M. Lane, T. Schlenker, F. Hamilton, and D. R.
Meldrum, “Blastocyst culture and transfer: analysis of results and parameters affecting
outcome in two in vitro fertilization programs,” Fertility and Sterility, vol. 72, no. 4,
pp. 604–609, 1999.
[6] T. T. Huang, D. H. Huang, H. J. Ahn, C. Arnett, and C. T. Huang, “Early blastocyst
expansion in euploid and aneuploid human embryos: evidence for a non-invasive and
quantitative marker for embryo selection,” Reproductive Biomedicine Online, vol. 39,
no. 1, pp. 27–39, 2019.
[7] T. T. Huang, B. C. Walker, M. Harun, A. T. Ohta, M. Rahman, J. Mellinger,
and W. Chang, “Automated computer analysis of human blastocyst expansion from
embryoscope time-lapse image files,” Fertility and Sterility, vol. 112, no. 3, pp. e292–
e293, 2019.
[8] R. T. Scott Jr, K. Ferry, J. Su, X. Tao, K. Scott, and N. R. Treff, “Comprehensive
chromosome screening is highly predictive of the reproductive potential of human
embryos: a prospective, blinded, nonselection study,” Fertility and Sterility, vol. 97,
no. 4, pp. 870–875, 2012.
[9] M. D. Werner, M. P. Leondires, W. B. Schoolcraft, B. T. Miller, A. B. Copperman,
E. D. Robins, F. Arredondo, T. N. Hickman, J. Gutmann, W. J. Schillings et al.,
“Clinically recognizable error rate after the transfer of comprehensive chromosomal
screened euploid embryos is low,” Fertility and Sterility, vol. 102, no. 6, pp. 1613–1618,
2014.
[10] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson,
U. Franke, S. Roth, and B. Schiele, “The cityscapes dataset for semantic urban scene
understanding,” in Proceedings of the IEEE conference on computer vision and pattern
recognition, 2016, pp. 3213–3223.
[11] M. Oberweger, P. Wohlhart, and V. Lepetit, “Hands deep in deep learning for hand
pose estimation,” arXiv preprint arXiv:1502.06807, 2015.
[12] Y. Yoon, H.-G. Jeon, D. Yoo, J.-Y. Lee, and I. So Kweon, “Learning a deep
convolutional network for light-field image super-resolution,” in Proceedings of the
IEEE international conference on computer vision workshops, 2015, pp. 24–32.
[13] J. Wan, D. Wang, S. C. H. Hoi, P. Wu, J. Zhu, Y. Zhang, and J. Li, “Deep learning
for content-based image retrieval: A comprehensive study,” in Proceedings of the 22nd
ACM international conference on Multimedia, 2014, pp. 157–166.
[14] H. Zhu, F. Meng, J. Cai, and S. Lu, “Beyond pixels: A comprehensive survey from
bottom-up to semantic image segmentation and cosegmentation,” Journal of Visual
Communication and Image Representation, vol. 34, pp. 12–27, 2016.
[15] G. Litjens, T. Kooi, B. E. Bejnordi, A. A. A. Setio, F. Ciompi, M. Ghafoorian, J. A.
Van Der Laak, B. Van Ginneken, and C. I. Sánchez, “A survey on deep learning in
medical image analysis,” Medical Image Analysis, vol. 42, pp. 60–88, 2017.
[16] O. Ronneberger, P. Fischer, and T. Brox, “U-Net: Convolutional networks for
biomedical image segmentation,” in International Conference on Medical Image
Computing and Computer-Assisted Intervention (MICCAI). Springer, 2015, pp. 234–
241.
[17] M. Z. Alom, M. Hasan, C. Yakopcic, T. M. Taha, and V. K. Asari, “Recurrent
residual convolutional neural network based on U-Net (R2U-Net) for medical image
segmentation,” arXiv preprint arXiv:1802.06955, 2018.
[18] X. Li, H. Chen, X. Qi, Q. Dou, C.-W. Fu, and P.-A. Heng, “H-DenseUNet: Hybrid
densely connected UNet for liver and tumor segmentation from CT volumes,” IEEE
Transactions on Medical Imaging, vol. 37, no. 12, pp. 2663–2674, 2018.
[19] N. Ibtehaz and M. S. Rahman, “MultiResUNet: Rethinking the U-Net architecture for
multimodal biomedical image segmentation,” Neural Networks, vol. 121, pp. 74–87,
2020.
[20] Z.-L. Ni, G.-B. Bian, X.-H. Zhou, Z.-G. Hou, X.-L. Xie, C. Wang, Y.-J. Zhou, R.-Q.
Li, and Z. Li, “RAUNet: Residual attention U-Net for semantic segmentation of cataract
surgical instruments,” in International Conference on Neural Information Processing.
Springer, 2019, pp. 139–149.
Chapter 2
Methodology
We implement the RD-UNet model [1] for segmenting ICM and TE regions in ZP-intact
embryo images obtained by a microscope. The model is based on the baseline UNet architecture
[2], residual convolutional units, and dilated convolutional layers. We will discuss each of
these components in the following sections.
For segmenting inner cell regions in ZP-ablated embryos, we propose a UNet-based
model. Here, we use embryonic images obtained by time-lapse imaging. However, we adopt
a static image segmentation approach, for two reasons: 1) the inner cell
varies dramatically across consecutive time frames, with nonperiodic time points across frames
and videos (in general, 3 frames/hour); 2) the collected video dataset is relatively small: it
consists of 45 videos with 30 or 31 frames each.
Inspired by successful applications of UNet to medical image segmentation [3], we
developed an improved convolutional neural network (CNN) architecture, called Deep
Dilated Residual Recurrent UNet (D2R2-UNet), for ZP-ablated embryo image segmentation.
Similar to the original UNet architecture [2], the proposed D2R2-UNet
consists of an encoder and a decoder whose last encoding and first decoding units
are connected by a central bridge. Inspired by the deep residual model [4] and recurrent CNNs
[5], we made the following three modifications to the baseline UNet architecture:
1) We replaced the convolutional units of the baseline UNet with residual units built
from two recurrent convolutional layers, called R2 convolutional units, in both the
encoder and decoder.
2) In the central bridge, we replaced the convolutional layers with dilated convolutional
layers.
3) We incorporated a series of residual convolutional layers into the baseline UNet
skip-connections between the encoder and decoder.
We will discuss the details of these modifications in the following sections.
2.1 Baseline UNet Architecture
To better understand our modifications, we first briefly review the baseline UNet
architecture. The baseline UNet is composed of symmetrical contracting (encoding)
and expansive (decoding) paths that are connected to each other via encoder-decoder skip-
connections. The contracting units capture context, whereas the expanding units enable
localization. The contracting units encode the input image into a set of feature maps using
convolutional layers with no skip connections. The expansive units decode the compact
feature maps into a pixel-wise representation, i.e., a semantic segmentation. This encoding-
decoding architecture is well suited to the semantic segmentation task. The encoder and
decoder are built on the conventional CNN architecture and consist of four down-sampling
and four up-sampling convolutional units, respectively. Each down-sampling convolutional unit
involves a sequence of two convolutional layers with a 3 × 3 kernel size, followed by a
rectified linear unit (RELU) activation [6] and a max-pooling with a 2 × 2 window size
and stride 2. Fig. 2.2(a) shows a convolutional unit of the baseline
UNet. The number of feature channels is doubled after down-sampling at each
encoder block. On the decoder side, each up-sampling convolutional unit involves a sequence
of two convolutional layers with 3 × 3 kernels, followed by a RELU activation and
upsampling with a 2 × 2 window size and stride 2. The number of feature channels is
reduced by half at each up-sampling step.
Figure 2.1 The baseline UNet architecture [2]. The height and width of each box represent the image size and number of channels, respectively. The dotted boxes denote copied feature maps.
A concatenation, i.e., skip connection, is then established between the down-sampled and
upsampled features. In the final layer of the decoder, a sigmoid activation is performed to
generate class-wise probabilities for each pixel. Both the encoder and decoder consist of four
convolutional units. See the baseline UNet architecture in Fig. 2.1.
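For concreteness, the following is a minimal Keras sketch of this baseline architecture, assuming a 256 × 256 grayscale input and 16 base channels; the helper names conv_unit and build_unet are our own, and it uses "same" padding for simplicity, whereas the original UNet [2] used unpadded convolutions.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

def conv_unit(x, channels):
    # Two 3x3 convolutions, each followed by RELU, as in the baseline UNet.
    x = layers.Conv2D(channels, 3, padding="same", activation="relu")(x)
    x = layers.Conv2D(channels, 3, padding="same", activation="relu")(x)
    return x

def build_unet(input_shape=(256, 256, 1), base_channels=16):
    inputs = layers.Input(input_shape)
    # Encoder: four down-sampling units; channels double at each step.
    skips, x = [], inputs
    for i in range(4):
        x = conv_unit(x, base_channels * 2 ** i)
        skips.append(x)                        # kept for the skip connections
        x = layers.MaxPooling2D(2, strides=2)(x)
    x = conv_unit(x, base_channels * 16)       # central bridge
    # Decoder: four up-sampling units; channels halve at each step.
    for i in reversed(range(4)):
        x = layers.UpSampling2D(2)(x)
        x = layers.concatenate([skips[i], x])  # encoder-decoder skip connection
        x = conv_unit(x, base_channels * 2 ** i)
    # Sigmoid gives a per-pixel foreground probability (binary segmentation).
    outputs = layers.Conv2D(1, 1, activation="sigmoid")(x)
    return Model(inputs, outputs)

model = build_unet()
model.summary()
```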
2.2 Proposed D2R2-UNet Architecture
To better capture context, particularly for small structures, i.e., to improve
context modulation [5], we use residual recurrent (R2) convolutional units instead of
the typical convolutional units in the baseline UNet. The R2 convolutional unit combines the
strengths of residual and recurrent learning, which helps the CNN increase segmentation
performance. To facilitate context extraction from high-level features at the central bridge
without increasing the number of CNN parameters, we use dilated convolutional layers [7] in
the central bridge, rather than the regular convolutional layers used in the baseline UNet. To
reduce the semantic disparity between low-level and high-level features [8] and better recover
information lost during pooling operations, we incorporate a series of residual convolutional layers
into the baseline UNet encoder-decoder skip-connections. We describe these modifications
in detail in the following subsections.
2.2.1 Residual Convolutional Unit
Skip connections [4] are incorporated into each convolutional unit of the baseline UNet
[2], based on the empirical results in [9] showing that skip connections produce a benign
optimization landscape in training. We hypothesize that residual convolutional units
improve the training/testing performance, considering that the baseline UNet is
sufficiently deep (23 convolutional layers; the further modifications in the following subsections
lead to 36 convolutional layers). In a residual unit, a residual skip connection exists
between the input to the first convolutional layer and the output of the second convolutional layer.
This residual skip connection is implemented by a 1 × 1 convolution. Fig. 2.2(b) depicts a
residual convolutional unit.
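A sketch of such a residual unit, reusing the Keras layers import from the baseline sketch above (the name residual_conv_unit is ours; ELU is used per the note in Fig. 2.2):

```python
def residual_conv_unit(x, channels):
    # Two 3x3 conv + ELU layers, with a 1x1-projected shortcut from the
    # unit input to the output of the second convolution.
    shortcut = layers.Conv2D(channels, 1, padding="same")(x)
    y = layers.Conv2D(channels, 3, padding="same", activation="elu")(x)
    y = layers.Conv2D(channels, 3, padding="same")(y)
    y = layers.add([shortcut, y])   # residual skip connection
    return layers.Activation("elu")(y)
```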
2.2.2 Recurrent Convolutional Unit
We replace the conventional convolutional units in the baseline UNet with recurrent
convolutional units, which help the CNN better understand context, especially for
small objects, while avoiding an increase in the number of CNN parameters [5]. At each step
s ≥ 1, we add a recurrent feature and a feed-forward feature, each computed by a shared
convolutional kernel. Specifically, we use recurrent convolutional units with the number of
evolution steps S = 3, where the first recurrent convolutional layer performs the following
evolution steps:
$$x_c^{(0)} = A(f_c \circledast x), \quad x_c^{(1)} = A(r_c \circledast x_c^{(0)} + f_c \circledast x), \quad x_c^{(2)} = A(r_c \circledast x_c^{(1)} + f_c \circledast x) \tag{2.1}$$
for c = 1, . . . , C, in which C is the number of channels. Here, A is some activation function,
e.g., RELU [6] or ELU [10]; the subscript index $(\cdot)_c$ denotes the c-th convolutional channel;
the superscript indices $(\cdot)^{(s)}$ denote the evolution steps, s = 0, . . . , S − 1; $\circledast$ denotes a
convolution operator; $f_c$ and $r_c$ are the feed-forward and recurrent convolutional kernels at the
c-th channel, respectively, ∀c; and x denotes the input. The second recurrent convolutional
layer does not expand the number of channels, i.e., it replaces x in Equation 2.1 with the
output of the first recurrent convolutional layer, $x_c^{(2)}$, ∀c. See graphical illustrations of a
recurrent convolutional unit consisting of these two recurrent convolutional layers in Fig. 2.2(c),
and of an S = 3 recurrent convolutional layer in Fig. 2.2(e).

Figure 2.2 Different convolutional units we compared in UNet. (a) A convolutional unit in the baseline UNet [2]. (b) A residual convolutional unit [4]. (c) A recurrent convolutional unit [5]. (d) An R2 convolutional unit [11]. (e) A recurrent convolutional layer [5] with the number of evolution steps S = 3. For all UNet variations, we use ELU [10] instead of RELU [6], since ELU slightly improved the embryo image segmentation performance.
Using recurrent convolutional units increases the UNet depth while avoiding an increase
in UNet complexity, thanks to the shared convolutional kernels. We expect this to help the
network better understand context while avoiding overfitting risks (in [5], using S = 4
recurrent convolutional units improved image recognition performance over a CNN with
the same depth and number of parameters obtained by simply increasing the depth).
We observed that in our application, using S = 3 recurrent convolutional units gives better
overall image segmentation performance than using S = 2 or S = 4 recurrent
convolutional units.
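The following sketch implements Equation 2.1 in Keras (function names are ours); instantiating each Conv2D once and calling it at every step is what shares $f_c$ and $r_c$ across the S = 3 evolution steps.

```python
def recurrent_conv_layer(x, channels, steps=3):
    # One recurrent convolutional layer (Equation 2.1) with S = 3:
    # a shared feed-forward kernel f and a shared recurrent kernel r.
    f = layers.Conv2D(channels, 3, padding="same")   # f_c, shared over steps
    r = layers.Conv2D(channels, 3, padding="same")   # r_c, shared over steps
    act = layers.Activation("elu")
    feed_forward = f(x)
    h = act(feed_forward)                            # x^(0) = A(f * x)
    for _ in range(steps - 1):
        # x^(s) = A(r * x^(s-1) + f * x)
        h = act(layers.add([r(h), feed_forward]))
    return h

def recurrent_conv_unit(x, channels):
    # Two stacked recurrent convolutional layers (Fig. 2.2(c)); the second
    # keeps the channel count and takes the first layer's output as input.
    x = recurrent_conv_layer(x, channels)
    return recurrent_conv_layer(x, channels)
```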
2.2.3 Residual Recurrent (R2) Convolutional Unit
To further improve the image segmentation performance, we fuse recurrent convolutional
layers with residual connectivity to form the R2 convolutional unit, similar to R2-UNet [11].
Fig. 2.2(d) depicts an R2 convolutional unit. Different from R2-UNet [11], which uses four
evolution steps (S = 4), we use three evolution steps (S = 3); four evolution steps did
not improve the performance. This formulation yields a more efficient CNN that improves
segmentation performance through a better understanding of context over multiple
evolution steps.
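Combining the two previous sketches gives an R2 unit (the name r2_conv_unit is ours): a recurrent convolutional unit wrapped with a 1 × 1-projected residual shortcut.

```python
def r2_conv_unit(x, channels):
    # R2 unit (Fig. 2.2(d)): two recurrent convolutional layers wrapped
    # with a residual shortcut, projected by a 1x1 convolution.
    shortcut = layers.Conv2D(channels, 1, padding="same")(x)
    y = recurrent_conv_unit(x, channels)
    return layers.Activation("elu")(layers.add([shortcut, y]))
```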
2.2.4 Dilated Convolution in the Central Bridge
The receptive field of a CNN plays a critical role in semantic image segmentation. A broader
receptive field helps extract information from a larger region of the image. Stacking
more convolutional layers increases the receptive field size linearly with kernel size, but also
increases the number of network parameters [12]. Adding more down-sampling layers expands
the receptive field size multiplicatively, but at the price of spatial information loss [12].
Alternatively, dilated convolution provides exponential expansion of the receptive field with
no increase in parameters and no loss of spatial information [7]. Unlike typical convolution
with no space between kernel weights, dilated convolution inserts zero(s) between kernel
weights according to the dilation rate and expands the receptive field size accordingly. For example,
a 3 × 3 kernel with dilation rate 2 increases the receptive field size from 3 × 3 to 5 × 5, while
keeping the number of kernel parameters at 9. After several downsampling steps, we add
multiple dilated convolutional layers in the central bridge, similar to [13], rather than stacking
additional pooling layers and/or typical convolutional layers. We thereby preserve
spatial information in the central bridge and expand the receptive field of the baseline
UNet from 140 × 140 to 198 × 198. Thus, adding multiple dilated convolutional layers to
the central bridge expands the network's receptive field, giving it larger access to the input.
This helps the CNN better capture context and improves the segmentation prediction.
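A sketch of such a central bridge (dilated_bridge is our own name), stacking 3 × 3 convolutions with the dilation rates used in D2R2-UNet:

```python
def dilated_bridge(x, channels, rates=(1, 2, 4, 8, 16)):
    # Central bridge: stacked 3x3 dilated convolutions with increasing
    # dilation rates, expanding the receptive field without additional
    # pooling and without losing spatial resolution.
    for rate in rates:
        x = layers.Conv2D(channels, 3, padding="same",
                          dilation_rate=rate, activation="elu")(x)
    return x
```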
2.2.5 Residual Convolutional Skip-Connections between Encoder and
Decoder
The conventional UNet encoder-decoder skip connections copy encoded features from the
encoder to the upsampled features in the decoder, the latter of which are supposed to be of
higher level because they are derived in the very deep UNet layers.
Figure 2.3 A residual convolutional encoder-decoder skip-connection consisting of four residual convolutional layers, each of which applies a 3 × 3 convolution followed by ELU activation.
Figure 2.4 The D2R2-UNet architecture for inner cell segmentation: we modify the UNet backbone in Fig. 2.1 by using R2 convolutional units, dilated convolutional layers, and residual convolutional encoder-decoder skip-connections. The height and width of each box represent the image size and number of channels, respectively. The black and blue dotted boxes denote the central bridge and copied feature maps, respectively.
Merging these two sets of features in the decoder facilitates spatial information propagation
and recovers information lost from the upsampled features during pooling and/or RELU
operations. However, a semantic gap potentially exists between the two sets of features, and
this discrepancy might affect the prediction accuracy [8]. To mitigate this potential issue,
we adapt the technique in [8], which incorporates residual convolutional layers into the
conventional encoder-decoder skip-connections. Fig. 2.3 shows a residual convolutional
encoder-decoder skip-connection.
2.3 RD-UNet Architecture
We implement a CNN architecture named Residual Dilated UNet (RD-UNet) [1] for
ICM and TE segmentation. The RD-UNet is a modified version of the baseline UNet [2].
Figure 2.5 The RD-UNet architecture for ICM and TE segmentation [1]: we modify the UNet backbone in Fig. 2.1 by using residual convolutional units and dilated convolutional layers.
We included residual convolutional units (see Section 2.2.1) in the encoder and decoder,
and dilated convolutional layers (see Section 2.2.4) in the central bridge, to improve the
segmentation performance. See Fig. 2.5 for the detailed architecture of RD-UNet.
2.4 Network Specifications
In D2R2-UNet, the encoder and decoder, each consisting of four R2 convolutional units,
are connected via residual convolutional encoder-decoder skip-connections, as shown in
Fig. 2.4. The number of channels is c = 16 in the first unit of the encoder (leftmost), and
we double the number in each successive unit (toward the central bridge). Accordingly, we set
the number of channels in the first unit of the decoder (next to the central bridge) to c = 8 · 16 and
halve the number in each successive unit (toward the final unit at the rightmost). We reduced the
size of the feature maps by half at each encoding step and doubled it at each decoding step. Then,
we added five dilated convolutional layers to the central bridge using dilation rates of 1,
2, 4, 8, and 16, successively. Since the semantic gap between encoder and decoder tends to
decrease from shallow layers (at the left) toward deep layers (at the center), we gradually reduce
the number of residual convolutional layers (4, 3, 2, and 1) in the skip-connections between the
encoder and decoder in the direction from shallow to deep layers. We also added
batch normalization to accelerate convergence [14], and included 5% dropout
to prevent overfitting [15].
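Putting the earlier sketches together, below is a compact sketch of this assembly (build_d2r2_unet is our own name; batch normalization is omitted for brevity, and the dropout placement is our assumption):

```python
def build_d2r2_unet(input_shape=(256, 256, 1), base_channels=16):
    # D2R2-UNet sketch (Fig. 2.4): R2 units in the encoder/decoder, a
    # dilated central bridge, and residual convolutional skip-connections
    # of decreasing length (4, 3, 2, 1) from shallow to deep levels.
    inputs = layers.Input(input_shape)
    skips, x = [], inputs
    for i in range(4):
        x = r2_conv_unit(x, base_channels * 2 ** i)
        skips.append(residual_skip_connection(x, num_layers=4 - i))
        x = layers.MaxPooling2D(2, strides=2)(x)
        x = layers.Dropout(0.05)(x)            # 5% dropout (placement assumed)
    x = dilated_bridge(x, base_channels * 16)  # dilation rates 1, 2, 4, 8, 16
    for i in reversed(range(4)):
        x = layers.UpSampling2D(2)(x)
        x = layers.concatenate([skips[i], x])
        x = r2_conv_unit(x, base_channels * 2 ** i)
    outputs = layers.Conv2D(1, 1, activation="sigmoid")(x)
    return Model(inputs, outputs)
```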
We compared the proposed D2R2-UNet with other candidate models such as baseline
UNet [2], Dilated UNet (D-UNet) [13], Residual UNet (Res-UNet) [16], Recurrent UNet
(Rec-UNet) [11], Recurrent Residual UNet (R2-UNet) [11], and Residual Dilated UNet (RD-
UNet) [1]. Baseline UNet consists of typical convolutional units. D-UNet employs a series of
five dilated convolutional layers with dilation rates 1, 2, 4, 8, and 16 in the central bridge.
Res-UNet includes residual convolutional units. Rec-UNet utilizes recurrent convolutional
units. R2-UNet has R2 convolutional units. Finally, RD-UNet consists of a central bridge
similar to that of D-UNet, together with residual convolutional units. For a fair comparison among
all the CNNs, we kept the basic architecture equivalent and the optimization hyperparameters
and dataset identical.
The RD-UNet for ICM and TE segmentation consists of four residual convolutional units
in the encoder and decoder. It includes a series of four dilated convolutional layers with dilation
rates 1, 2, 4, and 8 in the central bridge. Here, we used RELU activation [6] in each
encoding and decoding unit. Fig. 2.5 illustrates the detailed architecture. For ICM segmentation,
we compare the RD-UNet model with existing methods, i.e., a CNN with discrete cosine transform
(DCT) [17], coarse-to-fine texture analysis [18], texture analysis, clustering, and watershed
segmentation [19], VGG16 [20], and SD-UNet [13]. For TE segmentation, we
compare the RD-UNet model with existing methods, i.e., a level-set algorithm with Retinex
theory [21], a CNN with discrete cosine transform [17], and texture analysis, clustering, and
watershed segmentation [19].
2.5 Loss Function
We aim to classify each pixel into one of two classes, target (inner cell, corresponding to 1)
and background (non inner cell, corresponding to 0), so our image segmentation problem can
be viewed as a pixel-wise binary classification problem. A natural choice for the training loss
function is the binary cross-entropy loss E(S; x, y), which learns an image segmentation CNN
S using an input image x and a ground-truth annotation image y ∈ {0, 1} by minimizing the
averaged pixel-wise cross-entropy:
$$E(S; x, y) = -\frac{1}{N} \sum_{n=1}^{N} \Big[ y_n \log(S(x)_n) + (1 - y_n) \log(1 - S(x)_n) \Big] \tag{2.2}$$
where the sigmoid function in the final layer of S gives the probability prediction values,
i.e., $\{S(x)_n \in (0, 1) : \forall n\}$.
In our training images, the number of pixels classified as background often dominates
the number classified as inner cell, so the cross-entropy training loss in Equation 2.2 can potentially
underestimate the inner cell prediction. To overcome this class imbalance limitation, we
incorporate the Jaccard index J(S; x, y) [22], which quantifies the similarity between the ground-
truth annotation y and the probability prediction values S(x):
$$J(S; x, y) = \frac{1}{N} \sum_{n=1}^{N} \frac{y_n S(x)_n}{y_n + S(x)_n - y_n S(x)_n}, \tag{2.3}$$
where J(S; x, y) ∈ [0, 1]. Combining Equations 2.2 and 2.3 gives the following joint training
loss [22]:
$$L(S; x, y) = E(S; x, y) - \log(J(S; x, y)). \tag{2.4}$$
We do not include a regularization parameter in Equation 2.4, because we observed that
the binary cross-entropy and Jaccard losses, Equations 2.2 and 2.3, are in a similar range
[22]. The net effect is that, as the total loss is minimized, one simultaneously improves
the pixel classification accuracy and increases the intersection between the ground truth and
predicted segmentation.
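A Keras sketch of this joint loss (function names are ours; the small eps term guards against division by zero and is our addition):

```python
import tensorflow.keras.backend as K

def jaccard_index(y_true, y_pred, eps=1e-7):
    # Soft Jaccard index of Equation 2.3, averaged over all pixels.
    intersection = y_true * y_pred
    union = y_true + y_pred - intersection
    return K.mean(intersection / (union + eps))

def joint_loss(y_true, y_pred):
    # Equation 2.4: binary cross-entropy minus the log of the Jaccard index.
    bce = K.mean(K.binary_crossentropy(y_true, y_pred))
    return bce - K.log(jaccard_index(y_true, y_pred))
```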
2.6 Implementation Details
We implemented the training and testing of all CNNs using Keras with a TensorFlow backend.
We used the Nadam optimizer (Adam with Nesterov momentum) [23] to minimize the loss
function in Equation 2.4. Here, we set the initial learning rate to $10^{-3}$ and reduced it by a
factor of 0.05 every 5 epochs, with a minimum learning rate of $10^{-5}$. We trained
all CNNs on a GPU (NVIDIA GTX 1070 with 8 GB memory) with a mini-batch size of
4. Since the loss and Jaccard values stagnated near 100 epochs, we set the maximum number
of epochs to 100. We split the dataset of 1368 images into a training set (75% of the dataset)
and a testing set (25% of the dataset). We randomly sampled the training set in every epoch to
improve the learning. Given a small training set, we applied data augmentation such as horizontal
and vertical flips, rotation in a range up to 270°, horizontal and vertical shifts up to 10% of the
width or height, and zoom up to 10% in size. Finally, we used a 0.5 threshold on the final
semantic probability map.
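A sketch of the corresponding training setup (the stepwise learning-rate schedule is omitted; x_train and y_train are hypothetical image/mask arrays, and the paired generators with a shared seed keep image and mask augmentations in sync):

```python
from tensorflow.keras.optimizers import Nadam
from tensorflow.keras.preprocessing.image import ImageDataGenerator

model = build_d2r2_unet()
model.compile(optimizer=Nadam(learning_rate=1e-3), loss=joint_loss)

# Flips, rotation, shifts, and zoom, as described above.
aug_args = dict(horizontal_flip=True, vertical_flip=True,
                rotation_range=270, width_shift_range=0.1,
                height_shift_range=0.1, zoom_range=0.1)
image_gen = ImageDataGenerator(**aug_args)
mask_gen = ImageDataGenerator(**aug_args)

# The identical seed applies the same random transform to image and mask:
# train_iter = zip(image_gen.flow(x_train, batch_size=4, seed=1),
#                  mask_gen.flow(y_train, batch_size=4, seed=1))
# model.fit(train_iter, steps_per_epoch=len(x_train) // 4, epochs=100)
```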
2.7 Dataset
2.7.1 Dataset for ICM and TE Segmentation
We used the blastocyst dataset of [19]. The ICM and TE regions were manually segmented and
annotated by embryologists at the Pacific Centre for Reproductive Medicine, Canada. We used
the human-annotated images as ground truth to evaluate the segmentation performance.
The dataset has 249 images in total. We split it into two sets: a training set
of 211 images and a testing set of 38 images.
2.7.2 Dataset for Inner Cell Segmentation
We constructed a dataset of 1368 images in total, representing both genetic health conditions
(normal/abnormal). The embryologists at the Pacific IVF Institute in Hawai‘i cultured
and monitored the embryos over 6 days using embryoscopes (Vitrolife, USA). On day 5 of
culture, they ablated the ZPs using a Lykos laser (Hamilton-Thorne, USA). The embryoscopes
captured images of the ZP-ablated embryos for 10 hours using a time-lapse imaging technique
[24, 25]. The pixels corresponding to the inner cell were manually annotated by personnel
supervised by embryologists. We use the human-annotated images as ground truth to train and
test the CNNs.
References
[1] M. Y. Harun, T. Huang, and A. T. Ohta, “Inner cell mass and trophectoderm
segmentation in human blastocyst images using deep neural network,” in 13th IEEE
International Conference on Nano/Molecular Medicine and Engineering. IEEE, 2019,
pp. 214–219.
[2] O. Ronneberger, P. Fischer, and T. Brox, “U-Net: Convolutional networks for
biomedical image segmentation,” in International Conference on Medical Image
Computing and Computer-Assisted Intervention (MICCAI). Springer, 2015, pp. 234–
241.
[3] G. Litjens, T. Kooi, B. E. Bejnordi, A. A. A. Setio, F. Ciompi, M. Ghafoorian, J. A.
Van Der Laak, B. Van Ginneken, and C. I. Sánchez, “A survey on deep learning in
medical image analysis,” Medical Image Analysis, vol. 42, pp. 60–88, 2017.
[4] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,”
in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition,
2016, pp. 770–778.
[5] M. Liang and X. Hu, “Recurrent convolutional neural network for object recognition,”
in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition,
2015, pp. 3367–3375.
[6] V. Nair and G. E. Hinton, “Rectified linear units improve restricted boltzmann
machines,” in Proceedings of the 27th International Conference on Machine Learning
(ICML), 2010, pp. 807–814.
[7] F. Yu and V. Koltun, “Multi-scale context aggregation by dilated convolutions,” arXiv
preprint arXiv:1511.07122, 2015.
[8] N. Ibtehaz and M. S. Rahman, “MultiResUNet: Rethinking the U-Net architecture for
multimodal biomedical image segmentation,” Neural Networks, vol. 121, pp. 74–87,
2020.
[9] H. Li, Z. Xu, G. Taylor, C. Studer, and T. Goldstein, “Visualizing the loss landscape
of neural nets,” in Proc. NIPS 31, Montreal, Canada, Dec. 2018, pp. 6389–6399.
[10] D.-A. Clevert, T. Unterthiner, and S. Hochreiter, “Fast and accurate deep network
learning by exponential linear units (ELUs),” arXiv preprint arXiv:1511.07289, 2015.
[11] M. Z. Alom, M. Hasan, C. Yakopcic, T. M. Taha, and V. K. Asari, “Recurrent
residual convolutional neural network based on U-Net (R2U-Net) for medical image
segmentation,” arXiv preprint arXiv:1802.06955, 2018.
[12] W. Luo, Y. Li, R. Urtasun, and R. Zemel, “Understanding the effective receptive field
in deep convolutional neural networks,” in Advances in neural information processing
systems, 2016, pp. 4898–4906.
[13] R. M. Rad, P. Saeedi, J. Au, and J. Havelock, “Multi-resolutional ensemble of stacked
dilated U-Net for inner cell mass segmentation in human embryonic images,” in 25th
IEEE International Conference on Image Processing (ICIP). IEEE, 2018, pp. 3518–
3522.
[14] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by
reducing internal covariate shift,” arXiv preprint arXiv:1502.03167, 2015.
[15] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout:
a simple way to prevent neural networks from overfitting,” The Journal of Machine
Learning Research, vol. 15, no. 1, pp. 1929–1958, 2014.
[16] Z. Zhang, Q. Liu, and Y. Wang, “Road extraction by deep residual U-Net,” IEEE
Geoscience and Remote Sensing Letters, vol. 15, no. 5, pp. 749–753, 2018.
[17] S. Kheradmand, P. Saeedi, and I. Bajic, “Human blastocyst segmentation using neural
network,” in IEEE Canadian Conference on Electrical and Computer Engineering
(CCECE). IEEE, 2016, pp. 1–4.
[18] R. M. Rad, P. Saeedi, J. Au, and J. Havelock, “Coarse-to-fine texture analysis for
inner cell mass identification in human blastocyst microscopic images,” in Seventh
International Conference on Image Processing Theory, Tools and Applications (IPTA).
IEEE, 2017, pp. 1–5.
[19] P. Saeedi, D. Yee, J. Au, and J. Havelock, “Automatic identification of human
blastocyst components via texture,” IEEE Transactions on Biomedical Engineering,
vol. 64, no. 12, pp. 2968–2978, 2017.
[20] S. Kheradmand, A. Singh, P. Saeedi, J. Au, and J. Havelock, “Inner cell mass
segmentation in human HMC embryo images using fully convolutional network,” in
IEEE International Conference on Image Processing (ICIP). IEEE, 2017, pp. 1752–
1756.
[21] A. Singh, J. Au, P. Saeedi, and J. Havelock, “Automatic segmentation of trophectoderm
in microscopic images of human blastocysts,” IEEE Transactions on Biomedical
Engineering, vol. 62, no. 1, pp. 382–393, 2014.
[22] V. Iglovikov, S. Mushinskiy, and V. Osin, “Satellite imagery feature detection
using deep convolutional neural network: A Kaggle competition,” arXiv preprint
arXiv:1706.06169, 2017.
[23] T. Dozat, “Incorporating Nesterov momentum into Adam,” in ICLR Workshop, 2016.
[24] T. T. Huang, D. H. Huang, H. J. Ahn, C. Arnett, and C. T. Huang, “Early blastocyst
expansion in euploid and aneuploid human embryos: evidence for a non-invasive and
quantitative marker for embryo selection,” Reproductive Biomedicine Online, vol. 39,
no. 1, pp. 27–39, 2019.
[25] T. T. Huang, B. C. Walker, M. Harun, A. T. Ohta, M. Rahman, J. Mellinger,
and W. Chang, “Automated computer analysis of human blastocyst expansion from
embryoscope time-lapse image files,” Fertility and Sterility, vol. 112, no. 3, pp. e292–
e293, 2019.
Chapter 3
Results and Discussion
3.1 Evaluation
To validate the embryo image segmentation, we used several evaluation metrics: the Jaccard
index, Dice Coefficient, accuracy, precision, specificity, and recall. These metrics [1] are
calculated from four cardinalities, i.e., true positives (TP), false positives (FP), true
negatives (TN), and false negatives (FN). TP measures the number of pixels that are correctly
identified as the target class (ICM/TE/inner cell). Analogously, TN counts the number of pixels
that are correctly detected as background (non ICM/TE/inner cell). In contrast, FP
counts pixels that are incorrectly identified as the target class. Similarly, FN counts pixels
that are misclassified as background pixels.
• The Jaccard Index, also termed intersection over union, is a similarity measure
defined as the intersection between two sets A and B divided by their union,
that is:
$$\text{Jaccard} = \frac{|A \cap B|}{|A \cup B|} = \frac{TP}{TP + FP + FN} \tag{3.1}$$
• The Dice Coefficient, also known as the overlap index, is likewise a measure of overlap between
two sets, defined by:
$$\text{Dice} = \frac{2 \times |A \cap B|}{|A| + |B|} = \frac{2 \times TP}{2 \times TP + FP + FN} \tag{3.2}$$
Both Dice and Jaccard equal 1 when there is 100% overlap between the predicted
and ground truth segmentations.
• Accuracy is the ratio of correctly classified pixels, regardless of class, expressed as follows:
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{3.3}$$
• Specificity, also called the true negative rate, is the percentage of negative pixels
in the ground truth that are also detected as negative by the CNN. It is given by:
$$\text{Specificity} = \frac{TN}{TN + FP} \tag{3.4}$$
• Precision, also called the positive predictive value, measures the percentage of correctly
segmented pixels among all segmented pixels. In the ideal case, a precision of 1 means
there are no FPs in the segmentation. It is defined as follows:
$$\text{Precision} = \frac{TP}{TP + FP} \tag{3.5}$$
• Recall is the fraction of all labeled ICM/TE/inner cell pixels that are correctly
predicted, and can be expressed as follows:
$$\text{Recall} = \frac{TP}{TP + FN} \tag{3.6}$$
3.2 ICM and TE Segmentation Results
3.2.1 Quantitative Results
We compare the ICM segmentation performance of the RD-UNet model with that of existing
methods, i.e., a CNN with discrete cosine transform [2], coarse-to-fine texture analysis [3], texture
analysis, clustering, and watershed segmentation [4], VGG16 [5], and SD-UNet [6]; see Table 3.1.
Table 3.1 Comparison of ICM results of our method with those of existing methods based on the same data set.
Methods Jaccard (%) Dice (%) Precision (%) Recall (%) Accuracy (%)
CNN with DCT [2] 47.7 64.6 75.6 56.4 93.0
Coarse-to-fine texture analysis [3] 70.3 82.6 78.7 86.8 –
Texture, clustering, and watershed algorithm [4] 71.1 83.1 84.5 78.3 93.3
VGG16 [5] 76.5 86.7 – – 95.6
SD-UNet [6] 81.6 89.5 88.6 91.5 98.3
RD-UNet [8] 89.3 94.3 94.9 93.8 99.1
Table 3.2 Comparison of TE results of our method with those of existing methods based on the same data set.
Level-set algorithm and Retinex theory [7] 62.2 76.7 71.3 83.1 86.7
CNN with DCT [2] 58.9 74.2 69.1 80.0 90.0
Texture, clustering, and watershed algorithm [4] 63.0 77.3 69.0 89.0 86.6
RD-UNet [8] 85.3 92.5 91.8 93.2 98.3
The experimental results illustrate that the RD-UNet model achieves better performance
than the aforementioned existing models. It outperforms the SD-UNet model [6] by relative
margins of 0.8% in accuracy, 7.1% in precision, 2.5% in recall, 5.4% in the Dice Coefficient,
and 9.4% in the Jaccard Index.
Moreover, we compare the TE segmentation performance of the RD-UNet model with
that of existing methods, i.e., a level-set algorithm with Retinex theory [7], a CNN with discrete
cosine transform [2], and texture analysis, clustering, and watershed segmentation [4]; see Table
3.2. The TE segmentation results indicate that the RD-UNet model outperforms the existing
models, in particular the texture, clustering, and watershed model [4], by relative margins of
13.5% in accuracy, 33% in precision, 4.7% in recall, 19.7% in the Dice Coefficient, and 35.4%
in the Jaccard Index.
Compared to the existing models, we achieve highest Jaccard and Dice scores
e.g., maximum overlap between network’s predictions and corresponding ground truth
annotations. Furthermore, the RD-UNet model significantly reduces the false positive
(misclassified as ICM/TE) pixels and false negative (misidentified as background) pixels
throughout the dataset; see increased precision and recall scores in Tables 3.1 and 3.2.
The RD-UNet model can understand context better due to the residual convolutional units
in the encoder and decoder and the dilated convolutional layers in the central bridge.
Consequently, it improves both ICM and TE segmentation performance.
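To make these two ingredients concrete, the following is a minimal PyTorch sketch of a
residual convolutional unit and a dilated central bridge. The channel counts and the
dilation schedule (1, 2, 4, 8) are illustrative assumptions, not the exact RD-UNet
hyperparameters, which are described earlier in the thesis.

```python
import torch
import torch.nn as nn

class ResidualUnit(nn.Module):
    """Two 3x3 convolutions with an identity (or 1x1-projected) shortcut."""
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, padding=1),
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=1),
            nn.BatchNorm2d(out_ch),
        )
        # Project the shortcut only when the channel count changes.
        self.skip = (nn.Conv2d(in_ch, out_ch, 1)
                     if in_ch != out_ch else nn.Identity())
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.act(self.body(x) + self.skip(x))

def dilated_bridge(ch: int) -> nn.Sequential:
    """Stack of 3x3 convolutions with growing dilation, widening the
    receptive field at the bottleneck without further downsampling."""
    layers = []
    for d in (1, 2, 4, 8):  # assumed dilation schedule
        layers += [nn.Conv2d(ch, ch, 3, padding=d, dilation=d),
                   nn.ReLU(inplace=True)]
    return nn.Sequential(*layers)

# Shape check on a dummy bottleneck feature map.
x = torch.randn(1, 64, 32, 32)
print(dilated_bridge(64)(ResidualUnit(64, 64)(x)).shape)  # (1, 64, 32, 32)
```

With padding equal to the dilation rate, each dilated convolution preserves the spatial
size, so the bridge enlarges context without losing resolution.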
3.2.2 Qualitative Results
We compared the predicted ICM and TE segmentation results with the ground truth ICM and TE
annotations. The contours of the ground truth ICM and TE annotations are overlaid on those of
the predicted ICM and TE segments to visualize the differences. To better understand the ICM
segmentation quality, we categorize the results into best (Jaccard Index of more
than 97%), better (Jaccard Index from 92% to 97%), and fair (Jaccard Index from 77% to
92%) segmentation; see Fig. 3.1. In the ICM segmentation results, 36.8%, 50%, and 13.2% of
the test images fall into the best, better, and fair categories, respectively.
To better understand the TE segmentation quality, we categorize the results into
best (Jaccard Index of more than 94%), better (Jaccard Index from 87% to 94%), and
fair (Jaccard Index from 76% to 87%) segmentation; see Fig. 3.2. In the TE segmentation
results, 31.6%, 47.4%, and 21% of the test images fall into the best, better, and fair
categories, respectively. The results are categorized by the Jaccard Index because the
other performance metrics are uniformly high (a sketch of this binning follows).
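As a minimal sketch of this categorization, the function below bins per-image Jaccard
scores using the ICM thresholds quoted above; the boundary handling and the example score
list are illustrative assumptions.

```python
def categorize(jaccard_scores, best=0.97, better=0.92):
    """Bin per-image Jaccard scores into best/better/fair percentages."""
    bins = {"best": 0, "better": 0, "fair": 0}
    for j in jaccard_scores:
        if j > best:
            bins["best"] += 1
        elif j >= better:
            bins["better"] += 1
        else:
            bins["fair"] += 1
    total = len(jaccard_scores)
    return {k: 100 * v / total for k, v in bins.items()}

# Toy scores: {'best': 25.0, 'better': 50.0, 'fair': 25.0}
print(categorize([0.98, 0.95, 0.93, 0.85]))
```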
Next, we discuss the qualitative ICM segmentation performance of the RD-UNet model;
Fig. 3.1 shows some representative results, where the (i, j)th image denotes the image at the
ith row and jth column of Fig. 3.1. For all three segmentation categories, RD-UNet
successfully segments ICM regions even if they connect and/or overlap with the TE/CM; see
the (1, 4)th, (2, 4)th, (3, 4)th, and (4, 4)th images. In general, the contours of the segmented
ICM align well with those of the ground truth annotations; compare the red contours and yellow
regions in the 4th column. However, the RD-UNet model shows limited ICM segmentation
performance where indistinct features exist between the ICM and the TE/CM; see the fair
segmentation category in Fig. 3.1. For example, in the (5, 4)th image, mis-segmentation occurs where
Figure 3.1 ICM segmentation results by RD-UNet. The background (non-ICM) is colored dark cyan, the annotated ground truth ICM is light green, the network-predicted ICM is yellow, and the contour of the ground truth ICM is red. JI and DC stand for Jaccard Index and Dice Coefficient, respectively.
Figure 3.2 TE segmentation results by RD-UNet. The background (non-TE) is colored dark cyan, the annotated ground truth TE is light green, the network-predicted TE is yellow, and the contour of the ground truth TE is red. JI and DC stand for Jaccard Index and Dice Coefficient, respectively.
the ICM and TE have similar textures; in the (6, 4)th image, mis-segmentation occurs where
the ICM has a texture similar to that of the CM.
We next explain the qualitative TE segmentation performance of the RD-UNet model;
Fig. 3.2 shows some representative results, where the (i, j)th image denotes the image
at the ith row and jth column of Fig. 3.2. For all three segmentation categories, RD-UNet
successfully segments TE regions even if they connect and/or overlap with the ICM/CM;
see the (1, 4)th, (2, 4)th, (3, 4)th, and (4, 4)th images. Here, the contours of the segmented TE
align closely with those of the ground truth annotations; compare the red contours and yellow regions
in the 4th column. However, the RD-UNet model shows limited TE segmentation performance
where indistinct features exist between the TE and the ICM/CM; see the fair segmentation
category in Fig. 3.2. For example, in the (5, 4)th image, mis-segmentation occurs where it
is challenging to differentiate the edges of the TE and CM; in the (6, 4)th image, mis-segmentation
occurs where it is difficult to differentiate the edges of the TE and ICM.
3.3 Inner Cell Segmentation Results
3.3.1 Quantitative Results
We compared the D2R2-UNet with other comparable UNet variant models, i.e., UNet [9], D-UNet [6],
Res-UNet [10], Rec-UNet [11], R2-UNet [11], and RD-UNet [8], to gauge its
segmentation potential. Fig. 3.3(a) shows the joint loss (Equation 2.4) during
training, demonstrating that the D2R2-UNet model outperforms the other UNet variant models
with a 7.79% loss. The proposed model better captures inner cell features and more effectively
isolates artifacts and fragmented cellular clusters.
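Equation 2.4 is defined in Chapter 2 and is not restated here. Purely as an illustration
of a joint segmentation loss, the sketch below combines binary cross-entropy with a soft
Dice term; the equal weighting and the smoothing constant are assumptions, not the
thesis's exact formulation.

```python
import torch
import torch.nn.functional as F

def joint_loss(logits: torch.Tensor, target: torch.Tensor,
               bce_weight: float = 0.5, eps: float = 1e-6) -> torch.Tensor:
    """BCE + soft-Dice loss for a batch of single-channel masks."""
    bce = F.binary_cross_entropy_with_logits(logits, target)
    prob = torch.sigmoid(logits)
    inter = (prob * target).sum(dim=(1, 2, 3))
    denom = prob.sum(dim=(1, 2, 3)) + target.sum(dim=(1, 2, 3))
    dice = (2 * inter + eps) / (denom + eps)  # soft Dice per image
    return bce_weight * bce + (1 - bce_weight) * (1 - dice.mean())

# Dummy batch: 2 images, 1 channel, 64x64.
logits = torch.randn(2, 1, 64, 64)
target = torch.randint(0, 2, (2, 1, 64, 64)).float()
print(joint_loss(logits, target).item())
```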
Similarly, Fig. 3.3(b) shows the joint loss (Equation 2.4) during testing; again,
our model surpasses the other models with a 7.45% loss. Comparing Figs. 3.3(a) and 3.3(b), the
testing loss is lower than the training loss, which may imply two things. First, the network
has good generalization capability, i.e., it avoids overfitting. Second, the network underfits
slightly, which might be caused by over-regularization (i.e., the dropout rate); however, this did
[Two line plots of loss (%) from 7 to 12 versus epochs from 10 to 100 for UNet, Res-UNet, Rec-UNet, D-UNet, R2-UNet, RD-UNet, and D2R2-UNet: (a) training loss, (b) testing loss.]
Figure 3.3 Comparisons of the joint loss (Equation 2.4) between different UNet variant models for inner cell segmentation: (a) training loss and (b) testing loss.
Table 3.3 Comparison among different UNet architectures based on their inner cell segmentation performance, evaluated on the same testing set.
CNNs Jaccard (%) Dice (%) Precision (%) Accuracy (%) Specificity (%)
UNet [9] 94.04 96.93 96.74 98.78 99.19
D-UNet [6] 95.40 97.64 97.29 99.07 99.33
Res-UNet [10] 95.26 97.57 97.36 99.04 99.35
Rec-UNet [11] 95.28 97.58 97.11 99.04 99.28
R2-UNet [11] 95.53 97.72 97.41 99.09 99.36
RD-UNet [8] 95.55 97.72 97.53 99.10 99.39
Proposed D2R2-UNet 95.65 97.78 97.66 99.12 99.42
not significantly affect the segmentation performance. Here, we finely tuned the dropout rate
to prevent overfitting, which causes the slight underfitting. Moreover, comparing the
training loss with the testing loss, we observe that the minimum loss values are nearly
equal in both cases. This highlights that our network generalizes well and avoids overfitting,
like the baseline UNet, while outperforming the baseline UNet by a large margin. The
increased testing performance of D2R2-UNet reflects that it better recognizes the
features relevant to the varying inner cell. It also better combines low-level and high-level
features and understands context at deeper levels of the network. Finally, we summarize
the overall segmentation performance of all models on a testing set of 342 images in
Table 3.3.
Although UNet and its variants perform well, the D2R2-UNet
provides the best overall performance, with a Jaccard Index of 95.65% and a Dice Coefficient of
97.78%. The intuition behind this enhanced performance is that the proposed network forms
a robust architecture owing to three major modifications: 1) R2 convolutional units in the
encoder and decoder, 2) dilated convolutional layers in the central bridge, and 3) residual
convolutional layers in the encoder-decoder skip-connections, whereas the other UNet models
do not include all three (an illustrative sketch of an R2 unit follows).
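As a rough sketch of modification 1), the following PyTorch module implements a recurrent
residual (R2) convolutional unit in the style of R2U-Net [11]: the same convolution is
applied recurrently with the block input fed back at every step, wrapped in a residual
connection. The recurrence depth, channel counts, and normalization choices are
illustrative assumptions, not the exact D2R2-UNet configuration.

```python
import torch
import torch.nn as nn

class RecurrentConv(nn.Module):
    """One recurrent convolutional layer: the block input is fed back into
    the same convolution (shared weights) at every recurrence step."""
    def __init__(self, ch: int, steps: int = 2):
        super().__init__()
        self.steps = steps
        self.conv = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1),
            nn.BatchNorm2d(ch), nn.ReLU(inplace=True))

    def forward(self, x):
        out = self.conv(x)
        for _ in range(self.steps):
            out = self.conv(x + out)  # recurrence with shared weights
        return out

class R2Unit(nn.Module):
    """Recurrent residual (R2) unit: two recurrent convolutional layers
    wrapped by an identity shortcut, after a 1x1 channel projection."""
    def __init__(self, in_ch: int, out_ch: int, steps: int = 2):
        super().__init__()
        self.project = nn.Conv2d(in_ch, out_ch, 1)
        self.body = nn.Sequential(RecurrentConv(out_ch, steps),
                                  RecurrentConv(out_ch, steps))

    def forward(self, x):
        x = self.project(x)
        return x + self.body(x)  # residual connection

# Shape check on a dummy feature map.
print(R2Unit(32, 64)(torch.randn(1, 32, 64, 64)).shape)  # (1, 64, 64, 64)
```

The recurrence deepens the effective receptive field without adding parameters, which is
one plausible reason such units help with the textured inner cell regions.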
3.3.2 Qualitative Results
The network’s predictions were compared with the corresponding ground truths to evaluate
the segmentation performance. We organized the segmentation results predicted by D2R2-UNet
into three performance categories: 1) best prediction, 2) better prediction, and 3) fair
prediction. This gives a clear picture of the overall segmentation performance throughout
the testing dataset. Here, the best prediction is defined by a Jaccard Index of more than 96%.
Similarly, the better prediction is based on a Jaccard Index from 92% to 96%. Finally, the
fair prediction includes Jaccard Indices between 86% and 92%. Of the 342 testing
images, 167 fall into the best category, 163 into the better category,
and the remaining 12 into the fair category. Among all predictions,
the highest individual Jaccard and Dice scores are 98.55% and 99.27%, respectively; the lowest
are 86.06% and 92.51%, respectively.
We discuss the qualitative inner cell segmentation performance of the proposed D2R2-UNet
model; Fig. 3.4 shows some representative results, where the (i, j)th image denotes
the image at the ith row and jth column of Fig. 3.4. For all three prediction categories,
D2R2-UNet successfully segments inner cell regions even beyond the culture well, amid white
bands and/or dark backgrounds; see the (2, 1)th, (2, 2)th, (2, 6)th, (5, 3)th, (5, 4)th, and
(5, 5)th images. The proposed D2R2-UNet model effectively identifies the outline of inner
cells even when the outlines connect with artifacts, e.g., the (2, 4)th and (5, 2)th images, or fragmented
cellular clusters, e.g., the (2, 3)th and (5, 3)th images. In general, the contours of the segmented inner
cells align well with those of the ground truth annotations; compare the red and blue contours
in the 3rd and 6th rows (a sketch of this overlay follows).
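A minimal OpenCV sketch of how such an overlay can be produced, assuming single-channel
8-bit masks and a BGR image; the toy masks and the output file name are placeholders, and
the colors follow the figure convention (red for ground truth, blue for prediction).

```python
import cv2
import numpy as np

def overlay_contours(image: np.ndarray, truth: np.ndarray,
                     pred: np.ndarray) -> np.ndarray:
    """Draw ground-truth (red) and predicted (blue) mask boundaries."""
    canvas = image.copy()
    for mask, color in ((truth, (0, 0, 255)), (pred, (255, 0, 0))):  # BGR
        contours, _ = cv2.findContours(mask.astype(np.uint8),
                                       cv2.RETR_EXTERNAL,
                                       cv2.CHAIN_APPROX_SIMPLE)
        cv2.drawContours(canvas, contours, -1, color, thickness=2)
    return canvas

# Toy example: a gray 128x128 image with two overlapping square masks.
img = np.full((128, 128, 3), 128, dtype=np.uint8)
gt = np.zeros((128, 128), dtype=np.uint8); gt[30:90, 30:90] = 1
pr = np.zeros((128, 128), dtype=np.uint8); pr[35:95, 35:95] = 1
cv2.imwrite("overlay.png", overlay_contours(img, gt, pr))
```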
Figure 3.4 Segmentation results. Light green in the 2nd and 5th rows indicates the inner cell segmented by D2R2-UNet. Red and blue in the 3rd and 6th rows indicate the boundaries of the ground truth and predicted inner cell, respectively. JI and DC stand for Jaccard Index and Dice Coefficient, respectively.
D2R2-UNet shows limited segmentation performance where indistinct features
exist between the inner cell and the ZP. For example, in the (3, 4)th image, mis-segmentation
occurs where it is challenging to differentiate the edges of the ZP and the inner cell (the edge
around the ZP is stronger than usual); in the (6, 5)th image, mis-segmentation occurs
where the inner cell has a texture similar to that of the ZP; and in the (6, 6)th image,
mis-segmentation occurs where the edges between the inner cell and the ZP are indistinct.
References
[1] A. A. Taha and A. Hanbury, “Metrics for evaluating 3D medical image segmentation:
analysis, selection, and tool,” BMC Medical Imaging, vol. 15, no. 1, p. 29, 2015.
[2] S. Kheradmand, P. Saeedi, and I. Bajic, “Human blastocyst segmentation using neural
network,” in IEEE Canadian Conference on Electrical and Computer Engineering
(CCECE). IEEE, 2016, pp. 1–4.
[3] R. M. Rad, P. Saeedi, J. Au, and J. Havelock, “Coarse-to-fine texture analysis for
inner cell mass identification in human blastocyst microscopic images,” in Seventh
International Conference on Image Processing Theory, Tools and Applications (IPTA).
IEEE, 2017, pp. 1–5.
[4] P. Saeedi, D. Yee, J. Au, and J. Havelock, “Automatic identification of human
blastocyst components via texture,” IEEE Transactions on Biomedical Engineering,
vol. 64, no. 12, pp. 2968–2978, 2017.
[5] S. Kheradmand, A. Singh, P. Saeedi, J. Au, and J. Havelock, “Inner cell mass
segmentation in human HMC embryo images using fully convolutional network,” in
IEEE International Conference on Image Processing (ICIP). IEEE, 2017, pp. 1752–
1756.
[6] R. M. Rad, P. Saeedi, J. Au, and J. Havelock, “Multi-resolutional ensemble of stacked
dilated U-Net for inner cell mass segmentation in human embryonic images,” in 25th
IEEE International Conference on Image Processing (ICIP). IEEE, 2018, pp. 3518–
3522.
[7] A. Singh, J. Au, P. Saeedi, and J. Havelock, “Automatic segmentation of trophectoderm
in microscopic images of human blastocysts,” IEEE Transactions on Biomedical
Engineering, vol. 62, no. 1, pp. 382–393, 2014.
[8] M. Y. Harun, T. Huang, and A. T. Ohta, “Inner cell mass and trophectoderm
segmentation in human blastocyst images using deep neural network,” in 13th IEEE
International Conference on Nano/Molecular Medicine and Engineering. IEEE, 2019,
pp. 214–219.
[9] O. Ronneberger, P. Fischer, and T. Brox, “U-Net: Convolutional networks for
biomedical image segmentation,” in International Conference on Medical Image
Computing and Computer-Assisted Intervention (MICCAI). Springer, 2015, pp. 234–
241.
[10] Z. Zhang, Q. Liu, and Y. Wang, “Road extraction by deep residual U-Net,” IEEE
Geoscience and Remote Sensing Letters, vol. 15, no. 5, pp. 749–753, 2018.
[11] M. Z. Alom, M. Hasan, C. Yakopcic, T. M. Taha, and V. K. Asari, “Recurrent
residual convolutional neural network based on U-Net (R2U-Net) for medical image
segmentation,” arXiv preprint arXiv:1802.06955, 2018.
Chapter 4
Conclusion and Future Work
4.1 Conclusion
This thesis has presented work on a biomedical project aimed at improving existing
IVF treatment for infertility. Automating embryo image segmentation with high accuracy
is important for sustaining healthy pregnancies in IVF, since it is a basic element of the
morphological and morphokinetic analyses used to evaluate embryo viability. The project
demonstrated deep learning-based embryo image segmentation methods to 1) segment ICM
and TE in ZP-intact embryonic images for morphological analysis and 2) segment the inner cell
in ZP-ablated embryonic images for morphokinetic study. Improving the semantic segmentation
CNN is particularly useful for ICM and TE segmentation, since segmenting
ICM and TE regions is challenging due to the similar textures of the embryo regions (ICM/TE/ZP/CM) and
artifacts, and due to image contrast variations. The CNN can resolve these issues by improving
feature extraction and better understanding the context. Furthermore, it is important to
enhance the segmentation CNN for inner cell segmentation, because it is difficult to segment
the inner cell with the conventional inner cell segmentation method in an embryoscope due to
the irregular expansion of inner cells, artifacts and cellular clusters near inner cell outlines,
and potential white bands and/or dark backgrounds around the expanded inner cell. The CNN
can overcome these challenges by better capturing inner cell features.
We implemented the RD-UNet model and developed the D2R2-UNet model to overcome
the aforementioned segmentation challenges. The RD-UNet model [1] incorporates
residual convolutional units in the encoder and decoder and adds a series of
dilated convolutional layers to the central bridge. The RD-UNet model improves ICM
segmentation, outperforming the existing models, i.e., CNN with discrete cosine transform
[2], coarse-to-fine texture analysis [3], texture analysis with clustering and the watershed algorithm
[4], VGG16 [5], and SD-UNet [6], with a 94.3% Dice Coefficient and an 89.3% Jaccard Index.
It also achieves the best TE segmentation performance, with a 92.5% Dice Coefficient and
an 85.3% Jaccard Index, compared to the existing models, i.e., the level-set algorithm with Retinex
theory [7], CNN with discrete cosine transform [2], and texture analysis with clustering and the
watershed algorithm [4]. We believe this model can be used to precisely segment
ICM and TE for morphological analysis of embryos toward improved pregnancy outcomes
in IVF.
For inner cell segmentation, we proposed a UNet-based CNN architecture that replaces the
UNet encoding-decoding units, central bridge, and encoder-decoder skip-paths with R2
convolutional encoding-decoding units, a dilated convolutional central bridge, and residual
convolutional encoder-decoder skip-paths, respectively. The proposed D2R2-UNet model
improves inner cell segmentation performance, with a Jaccard Index of 95.65% and a Dice
Coefficient of 97.78%, compared to the existing UNet variants, i.e., UNet [8], D-UNet [6], Res-
UNet [9], Rec-UNet [10], R2-UNet [10], and RD-UNet [1]. The model better understands
context and reduces the semantic disparity between the encoder and decoder. We believe
that the proposed model can accurately segment the inner cell for morphokinetic analysis of
embryos and facilitate sustained pregnancies in IVF.
4.2 Future Work
Our future work will use temporal information between frames to improve inner cell
segmentation performance. The current dataset has a small number of frames per video
(30 or 31 frames) with a long interval (20 minutes per frame), so there are large and irregular
spatial variations between consecutive frames, and modeling temporal changes is challenging.
In addition, the dataset has a total of 45 videos, so it offers limited diversity for training video
segmentation CNNs. To overcome these limitations, future work will obtain
more frames per video (a regular and shorter interval in the time-lapse imaging setup) and more
videos.
References
[1] M. Y. Harun, T. Huang, and A. T. Ohta, “Inner cell mass and trophectoderm
segmentation in human blastocyst images using deep neural network,” in 13th IEEE
International Conference on Nano/Molecular Medicine and Engineering. IEEE, 2019,
pp. 214–219.
[2] S. Kheradmand, P. Saeedi, and I. Bajic, “Human blastocyst segmentation using neural
network,” in IEEE Canadian Conference on Electrical and Computer Engineering
(CCECE). IEEE, 2016, pp. 1–4.
[3] R. M. Rad, P. Saeedi, J. Au, and J. Havelock, “Coarse-to-fine texture analysis for
inner cell mass identification in human blastocyst microscopic images,” in Seventh
International Conference on Image Processing Theory, Tools and Applications (IPTA).
IEEE, 2017, pp. 1–5.
[4] P. Saeedi, D. Yee, J. Au, and J. Havelock, “Automatic identification of human
blastocyst components via texture,” IEEE Transactions on Biomedical Engineering,
vol. 64, no. 12, pp. 2968–2978, 2017.
[5] S. Kheradmand, A. Singh, P. Saeedi, J. Au, and J. Havelock, “Inner cell mass
segmentation in human HMC embryo images using fully convolutional network,” in
IEEE International Conference on Image Processing (ICIP). IEEE, 2017, pp. 1752–
1756.
43
-
[6] R. M. Rad, P. Saeedi, J. Au, and J. Havelock, “Multi-resolutional ensemble of stacked
dilated U-Net for inner cell mass segmentation in human embryonic images,” in 25th
IEEE International Conference on Image Processing (ICIP). IEEE, 2018, pp. 3518–
3522.
[7] A. Singh, J. Au, P. Saeedi, and J. Havelock, “Automatic segmentation of trophectoderm
in microscopic images of human blastocysts,” IEEE Transactions on Biomedical
Engineering, vol. 62, no. 1, pp. 382–393, 2014.
[8] O. Ronneberger, P. Fischer, and T. Brox, “U-Net: Convolutional networks for
biomedical image segmentation,” in International Conference on Medical Image
Computing and Computer-Assisted Intervention (MICCAI). Springer, 2015, pp. 234–
241.
[9] Z. Zhang, Q. Liu, and Y. Wang, “Road extraction by deep residual U-Net,” IEEE
Geoscience and Remote Sensing Letters, vol. 15, no. 5, pp. 749–753, 2018.
[10] M. Z. Alom, M. Hasan, C. Yakopcic, T. M. Taha, and V. K. Asari, “Recurrent
residual convolutional neural network based on U-Net (R2U-Net) for medical image
segmentation,” arXiv preprint arXiv:1802.06955, 2018.