MEDICAL IMAGE SEGMENTATION FOR EMBRYO IMAGE ANALYSIS
A THESIS SUBMITTED TO THE GRADUATE DIVISION OF THE UNIVERSITY OF HAWAI‘I IN PARTIAL FULFILLMENT OF THE
REQUIREMENTS FOR THE DEGREE OF
MASTER OF SCIENCE
IN
ELECTRICAL ENGINEERING
MAY 2020
By
Md Yousuf Harun
Thesis Committee:
Dr. Aaron Ohta, Chairperson
Dr. Il Yong Chun, Chairperson
Dr. Victor Lubecke
© Copyright 2020 by
Md Yousuf Harun
All Rights Reserved
To the sundial in the center of the courtyard
Acknowledgements
I would like to express my gratitude to a number of individuals who have been instrumental
to my education over the past two years at the University of Hawai‘i at Mānoa.
First of all, I would like to thank my advisors, Dr. Aaron Ohta and Dr. Il Yong Chun
for their enormous help in my MS research.
I am grateful to Dr. Aaron Ohta for involving me in the exciting interdisciplinary
embryo image segmentation project that brings together doctors and engineers. He always
supported me and gave me the freedom to pursue my research in directions that were of
special interest to me. I have learned many invaluable things from him that will pave the
way for my future research endeavors. Working with him has been a great opportunity,
rewarding in every aspect of my academic life.
I want to express my gratitude to Dr. Il Yong Chun for introducing me to the exciting
field of computational medical imaging. He always motivated me and guided me to perform
good research. I am thankful to him for helping me to improve my technical writing skills.
I would like to thank Dr. Victor Lubecke for taking time to be a part of my thesis
committee. I appreciate the input and support he has given in order to improve this thesis.
I want to thank Dr. Thomas Huang for his guidance and collaboration in the embryo
image segmentation project. No result in this thesis would have been possible without the
fruitful cooperation between doctors and engineers.
I am also thankful to M Arifur Rahman, Kareem Elassy, Mohsen Paryavi,
Meenakshi Vohra, Richie Chio, and the many laboratory colleagues I have had for their
support and guidance. The discussions I have had with them were instrumental to my
research and this manuscript. Special thanks to Arif for his help and suggestions in all
aspects of graduate life.
Most of all, I would like to thank my parents, family and friends for their continued support
throughout my life. Their encouragement has driven me to do my best as both a student
and a person. I dedicate this thesis to them.
Abstract
This thesis describes a project that applies electrical engineering to biomedical applications.
The project involves the development of a deep learning-based image segmentation method
to identify cellular regions in microscopic images of human embryos for their morphological
and morphokinetic analysis during in vitro fertilization (IVF) treatment. First, we aim
to segment inner cell mass (ICM) and trophectoderm epithelium (TE) in zona pellucida
(ZP)-intact embryos imaged by a microscope for morphological analysis. ICM and TE
segmentation in ZP-intact embryonic images is difficult due to the small number of training
images (211 ZP-intact embryonic images) and the similar textures among the ICM, TE, ZP,
and artifacts. We overcame these challenges by leveraging deep learning and
semantic segmentation techniques. In this work, we implemented a UNet variant named
Residual Dilated UNet (RD-UNet) to segment the ICM and TE in ZP-intact embryonic
images. We added residual convolutions to the encoding and decoding units and replaced
the conventional convolutional layers with multiple dilated convolutional layers at the central
bridge of RD-UNet.
bridge of RD-UNet. The experimental results with a testing set of 38 ZP-intact embryonic
images demonstrate that RD-UNet outperforms existing models. RD-UNet can identify
ICM with a Dice Coefficient of 94.3% and a Jaccard Index of 89.3%. The model can
segment TE with a Dice Coefficient of 92.5% and a Jaccard Index of 85.3%.
Second, we aim to segment inner cell regions in ZP-ablated embryonic images obtained
by time-lapse microscopic imaging for morphokinetic analysis. Segmenting inner cell
regions in ZP-ablated embryonic images presents the following challenges: irregular expansion of
the inner cell, surrounding fragmented cellular clusters and artifacts, and inner cell expansion
beyond the culture well. We propose a UNet-based architecture named Deep Dilated Residual
Recurrent UNet (D2R2-UNet) to segment inner cell regions in ZP-ablated embryonic
images. We incorporated residual recurrent convolution into the encoding and decoding
units, dilated convolution into the central bridge, and residual convolution into the
encoder-decoder skip-connections in order to maximize the segmentation performance. The
experimental results with a testing set of 342 ZP-ablated embryonic images demonstrate
that the proposed D2R2-UNet improves inner cell segmentation performance over existing
UNet variants. Our model obtains the best overall performance among the compared
models in inner cell segmentation, with a Jaccard Index of 95.65% and a Dice Coefficient
of 97.78%.
Table of Contents
Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iv
Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vi
List of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ix
List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . x
Chapter 1: Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 Motivation of Embryo Image Segmentation . . . . . . . . . . . . . . . . . . 1
1.2 Quantitative Evaluation of Embryo Viability . . . . . . . . . . . . . . . . . 3
1.3 ICM and TE Segmentation Challenges in Morphological Analysis . . . . . . 4
1.4 Inner Cell Segmentation Challenges in Morphokinetic Analysis . . . . . . . 5
1.5 Semantic Segmentation with Deep Learning . . . . . . . . . . . . . . . . . . 6
1.6 Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
Chapter 2: Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.1 Baseline UNet Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.2 Proposed D2R2-UNet Architecture . . . . . . . . . . . . . . . . . . . . . . . 13
2.2.1 Residual Convolutional Unit . . . . . . . . . . . . . . . . . . . . . . 14
2.2.2 Recurrent Convolutional Unit . . . . . . . . . . . . . . . . . . . . . . 14
2.2.3 Residual Recurrent (R2) Convolutional Unit . . . . . . . . . . . . . 15
2.2.4 Dilated Convolution in the Central Bridge . . . . . . . . . . . . . . . 16
2.2.5 Residual Convolutional Skip-Connections between Encoder and
Decoder . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.3 RD-UNet Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.4 Network Specifications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.5 Loss Function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.6 Implementation Details . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.7 Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.7.1 Dataset for ICM and TE Segmentation . . . . . . . . . . . . . . . . 21
2.7.2 Dataset for Inner Cell Segmentation . . . . . . . . . . . . . . . . . . 21
Chapter 3: Results and Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
3.1 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
3.2 ICM and TE Segmentation Results . . . . . . . . . . . . . . . . . . . . . . . 28
3.2.1 Quantitative Results . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
3.2.2 Qualitative Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
3.3 Inner Cell Segmentation Results . . . . . . . . . . . . . . . . . . . . . . . . 33
3.3.1 Quantitative Results . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.3.2 Qualitative Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
Chapter 4: Conclusion and Future Work . . . . . . . . . . . . . . . . . . . . . . . . . 40
4.1 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
4.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
List of Tables
3.1 Comparison of ICM results of our method with those of existing methods
based on the same data set. . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.2 Comparison of TE results of our method with those of existing methods based
on the same data set. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.3 Comparison among different UNet architectures based on their inner cell
segmentation performance, evaluated on the same testing set . . . . . . . . 34
List of Figures
1.1 (a) An image of an embryo and (b) its annotated regions. Here, ZP, ICM,
CM, and TE denote zona pellucida, inner cell mass, cavity mass, and
trophectoderm epithelium, respectively. . . . . . . . . . . . . . . . . . . . . 3
1.2 Expansion kinetics of (a) a genetically normal embryo and (b) a genetically
abnormal embryo. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.3 Examples of inner cell segmentation challenges in ZP-ablated embryo: (a)
ZP-ablated embryo, (b) artifacts, (c) inner cell beyond culture well. . . . . 5
1.4 Semantic segmentation: (a) an image of a street view and (b) its
pixel-annotated segmentation. . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.1 The baseline UNet architecture [2]. The height and width of each box
represent the image size and number of channels, respectively. The dotted
boxes denote copied feature maps. . . . . . . . . . . . . . . . . . . . . . . . 13
2.2 Different convolutional units we compared in UNet. (a) A convolutional
unit in the baseline UNet [2]. (b) A residual convolutional unit [4]. (c) A
recurrent convolutional unit [5]. (d) An R2 convolutional unit [11]. (e) A
recurrent convolutional layer [5] with the number of evolution steps S = 3.
For all UNet variations, we use ELU [10] instead of RELU [6] since ELU
slightly improved the embryo image segmentation performance. . . . . . . . 15
2.3 A residual convolutional encoder-decoder skip-connection consisting of four
residual convolutional layers, each of which applies 3×3 convolution followed
by ELU activation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.4 The D2R2-UNet architecture for inner cell segmentation: we modify
the UNet backbone in Fig. 2.1 by using R2 convolutional units, dilated
convolutional layers, and residual convolutional encoder-decoder skip-
connections. The height and width of each box represent the image size and
number of channels, respectively. The black and blue dotted boxes denote
central bridge and copied feature maps, respectively. . . . . . . . . . . . . . 17
2.5 The RD-UNet architecture for ICM and TE segmentation [1]: we modify the
UNet backbone in Fig. 2.1 by using residual convolutional units and dilated
convolutional layers. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
3.1 ICM segmentation results by RD-UNet. The background (non ICM) is
colored dark cyan, the annotated ground truth ICM is light green, the
network predicted ICM is yellow, and the contour of the ground truth ICM
is red. JI and DC stand for Jaccard Index and Dice Coefficient, respectively. 31
3.2 TE segmentation results by RD-UNet. The background (non TE) is colored
dark cyan, the annotated ground truth TE is light green, the network
predicted TE is yellow, and the contour of the ground truth TE is red. JI
and DC stand for Jaccard Index and Dice Coefficient, respectively. . . . . . 32
3.3 Comparisons of the joint loss (Equation 2.4) between different UNet variant
models for inner cell segmentation: (a) training loss and (b) testing loss. . . 34
3.4 Segmentation results. Light green in 2nd and 5th rows indicates segmented
inner cell by D2R2-UNet. Red and blue in 3rd and 6th rows indicate the
boundaries of ground truth and predicted inner cell, respectively. JI and DC
stand for Jaccard index and Dice coefficient, respectively. . . . . . . . . . . 36
Chapter 1
Introduction
1.1 Motivation of Embryo Image Segmentation
According to the Centers for Disease Control and Prevention, almost six million women in
the United States suffer from infertility [1]. The World Health Organization reports that the
total number of patients worldwide suffering from infertility is almost fifty million [1]. The
most effective treatment for infertility is in vitro fertilization (IVF), which is performed more
than one million times annually around the world [2]. However, IVF suffers from relatively
low birth rates, i.e., less than 30% in the US from 1995 to 2016 [1]. One of the reasons for
such low birth rates is the misidentification of embryo viability. During the IVF process, the
fertilized eggs (embryos) are cultured in controlled environmental conditions and imaged
digitally using microscopes or embryoscopes. When the embryos reach the blastocyst stage
(at least 32 cells on the fifth day of culture), the healthiest embryo is selected for implantation.
Morphology assessment is a standard approach for embryo grading in IVF. Several
studies have been conducted to find the most important features of embryo morphology
[3, 4, 5]. These studies suggest that morphological features such as the inner cell mass
(ICM), trophectoderm epithelium (TE), and degree of blastocoel cavity expansion relative
to the zona pellucida (ZP) are effective measures for determining embryo viability. The ICM
eventually develops into a fetus, which contains the major body organs [4]. Successful
hatching of an implanted embryo, i.e., live birth, correlates highly with a strong TE layer
[3]. Therefore, identification of the ICM and TE regions is important for evaluating embryo
implantation potential. In addition, [6] reports that the morphokinetics of an embryo
highly correlate with its genetic quality, i.e., euploid or aneuploid. Here, an embryo with
a higher expansion rate has higher reproductive potential. A related study demonstrates that
euploid (genetically normal) embryos expand more rapidly than aneuploid (genetically
abnormal) embryos [7].
The identification of inner cell expansion is crucial for morphokinetic analysis of embryos
toward genetic quality assessment in IVF. Traditionally, embryologists determine embryo
viability by manually evaluating the morphological features of embryos based on visual
inspection. This subjective and qualitative approach is prone to human bias and does not
consider the genetic quality of an embryo. In addition, it poses a high risk of misidentifying
embryo viability, abnormal pregnancies, and health risks; it is also a time-consuming task
for embryologists to manually analyze embryo morphology, which makes the approach
labor- and resource-inefficient. To increase the chance of successful pregnancy, multiple embryos
are often transferred to the mother's uterus, which oftentimes results in multiple pregnancies with
associated health complications. Thus, identification of the single embryo with the highest
potential for a live birth is critical to achieving sustained pregnancies and minimizing health risks.
Although preimplantation genetic screening (PGS) provides a good evaluation of embryo
genetics [8, 9], such genetic testing remains very expensive.
All the aforementioned issues necessitate a cost-effective, automated, quantitative
method for gauging embryo health. In this study, we developed a deep learning-based
segmentation method to precisely identify 1) ICM and TE regions in ZP-intact embryo
images, and 2) inner cell regions in ZP-ablated embryo images. We use deep neural networks
to recognize both local (texture) and contextual (spatial arrangement) representations of
different embryo regions and segment them in the noisy images.
Figure 1.1 (a) An image of an embryo and (b) its annotated regions. Here, ZP, ICM, CM, and TE denote zona pellucida, inner cell mass, cavity mass, and trophectoderm epithelium, respectively.
1.2 Quantitative Evaluation of Embryo Viability
There are two main approaches for quantitative evaluation of embryo viability:
1) morphological analysis: this is based on morphological attributes of embryo cellular
regions such as inner cell mass and trophectoderm epithelium, as illustrated in Fig. 1.1.
The size of these regions is a good indicator of embryo viability. Biological studies [3, 4, 5]
suggest that embryo morphology correlates with its health.
2) morphokinetic analysis: this is based on the morphokinetics of an embryo, i.e., how rapidly
the embryo grows in incubation. Studies [6, 7] suggest that the morphokinetics of an embryo
highly associate with its genetic quality. Genetically normal embryos expand more
rapidly than genetically abnormal embryos. Fig. 1.2 shows that a genetically normal embryo
has a higher expansion rate (steep slope) than a genetically abnormal embryo
(flat or negative slope). In a morphokinetic study, embryologists ablate the zona pellucida
(ZP) of an embryo to let the inner cell expand beyond the ZP region. They then apply time-
lapse microscopic imaging using embryoscopes to capture images of the ZP-ablated embryo
during a ten-hour observation period. Finally, they estimate the total area of the inner cell at
different time points and measure the expansion rate.
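To make this measurement concrete, the sketch below (not part of the thesis pipeline; the time points and areas are hypothetical) estimates the expansion rate as the slope of a least-squares line fit to the inner cell area over the observation period.

```python
import numpy as np

# Hypothetical measurement times (hours) and inner cell areas (pixels),
# e.g., from time-lapse frames captured at roughly 3 frames/hour.
times_h = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0])
areas_px = np.array([4.10e4, 4.25e4, 4.43e4, 4.60e4, 4.81e4, 5.02e4,
                     5.20e4, 5.44e4, 5.61e4, 5.85e4, 6.07e4])

# Expansion rate = slope of a least-squares line fit (pixels/hour);
# a steep positive slope is associated with euploid embryos.
slope, intercept = np.polyfit(times_h, areas_px, deg=1)
print(f"expansion rate: {slope:.1f} px/hour")
```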
Figure 1.2 Expansion kinetics of (a) a genetically normal embryo and (b) a genetically abnormal embryo.
1.3 ICM and TE Segmentation Challenges in Morphological
Analysis
ICM and TE analysis plays a crucial role in determining embryo viability for healthy
pregnancies in IVF. At the blastocyst stage, an embryo consists of three inner regions: 1)
the inner cell mass (ICM), 2) the trophectoderm epithelium (TE), and 3) the cavity mass (CM). These
inner regions are enclosed by an outer layer named the zona pellucida (ZP). For convenience,
we refer to such an embryo as a ZP-intact embryo. Fig. 1.1 illustrates a ZP-intact embryo and its
annotated inner (ICM, CM, TE) and outer (ZP) regions.
The ICM and TE have similar pixel intensity values, i.e., in general, it is hard to
distinguish them. They are also surrounded by two other embryo regions, the zona
pellucida (ZP) and cavity mass (CM), whose pixel intensity values are similar to those of the
ICM and TE. In addition, undesirable fragments and artifacts exist near the ICM and
TE regions. The similar pixel intensity values of the surrounding CM, ZP, artifacts, and fragments,
together with image contrast variations, make it challenging to differentiate between the ICM and TE
regions and precisely segment them. The number of training images in the dataset is also
small (211 images); this poses additional challenges for ICM and TE segmentation,
such as limited diversity in the training data and overfitting to the training data.
Figure 1.3 Examples of inner cell segmentation challenges in ZP-ablated embryos: (a) a ZP-ablated embryo, (b) artifacts, (c) inner cell expansion beyond the culture well.
1.4 Inner Cell Segmentation Challenges in Morphokinetic
Analysis
The inner cell segmentation is critical for the morphokinetic study using an embryoscope
[6]. The inner cell expansion rate is measured over a ten-hour observation period using
time-lapse microscopic imaging. In these embryos, the embryologists ablate the ZP to
perform preimplantation genetic screening. The goal of this project is to segment the inner cell
to facilitate the measurement of embryo morphokinetics, i.e., how rapidly the inner
cell expands, by estimating its total area. To estimate the total area of an embryo, [6]
segmented objects with circular shapes using the embryoscope software tool.
There are significant challenges in this segmentation task, because a) inner cells
expand at irregular rates, b) artifacts and fragmented cellular clusters can exist
close to inner cell outlines, and c) an expanded inner cell can have white bands and/or a dark
background due to its expansion beyond the culture well. Fig. 1.3 shows some examples of
such challenges.
Figure 1.4 Semantic segmentation: (a) an image of a street view and (b) its pixel-annotated segmentation.
1.5 Semantic Segmentation with Deep Learning
Semantic segmentation is a high-level task that facilitates complete scene understanding.
Semantic segmentation techniques are applied to a wide range of images and videos,
including still two-dimensional images, three-dimensional or volumetric images, and videos;
the techniques are used in various applications including autonomous driving [10], human-
machine interaction [11], computational photography [12], and image search engines [13].
Semantic segmentation corresponds to the pixel- or voxel-wise image classification task, where
each pixel or voxel is labeled according to the classes present in a two-dimensional or three-
dimensional image; see an example in Fig. 1.4.
Semantic segmentation has been addressed in the past using various computer vision
and machine learning techniques, such as active contour/snake models, clustering algorithms,
the watershed algorithm, graph-based region merging, random walks, and Markov random fields
[14]. Recent advancements in deep learning have shown potential to solve challenging
image segmentation problems [15]. The most popular convolutional neural network (CNN)
model is UNet, which shows strong performance in medical image segmentation tasks
[16]. The UNet architecture has been adapted for various medical applications such as
retina blood vessel segmentation [17], liver and tumor segmentation [18], skin lesion
segmentation [19], and surgical instrument segmentation [20].
To perform semantic segmentation, CNNs learn representative features of an image
and convert them into a pixel-wise categorization. In general, semantic segmentation CNN
models consist of an encoding network and a decoding network. The encoder converts an
input image into a set of representative feature maps. The role of the decoder is to convert
the encoded features, often at lower spatial resolution, back to the original high-resolution pixel
space and generate a pixel-wise classification map.
1.6 Outline
This thesis contributes to embryo image segmentation for IVF treatment. An outline of
the thesis follows.
Chapter 2 describes the methodology, the proposed and implemented neural network
architectures, network specifications, the loss function, implementation details, and the datasets.
Chapter 3 describes the evaluation metrics, performance comparisons, results, and
discussion.
Chapter 4 summarizes the performance of the developed methods and the contributions of
this work to related applications. The chapter also discusses future research directions.
References
[1] N. Gleicher, V. Kushnir, and D. Barad, “Worldwide decline of IVF birth rates and its
probable causes,” Human Reproduction Open, vol. 2019, no. 3, p. hoz017, 2019.
[2] E. Santos Filho, J. Noble, and D. Wells, “A review on automatic analysis of human
embryo microscope images,” The Open Biomedical Engineering Journal, vol. 4, p. 170,
2010.
[3] A. Ahlström, C. Westin, E. Reismer, M. Wikland, and T. Hardarson, “Trophectoderm
morphology: an important parameter for predicting live birth after single blastocyst
transfer,” Human Reproduction, vol. 26, no. 12, pp. 3289–3296, 2011.
[4] C. Lagalla, M. Barberi, G. Orlando, R. Sciajno, M. A. Bonu, and A. Borini, “A
quantitative approach to blastocyst quality evaluation: morphometric analysis and
related IVF outcomes,” Journal of Assisted Reproduction and Genetics, vol. 32, no. 5,
pp. 705–712, 2015.
[5] W. B. Schoolcraft, D. K. Gardner, M. Lane, T. Schlenker, F. Hamilton, and D. R.
Meldrum, “Blastocyst culture and transfer: analysis of results and parameters affecting
outcome in two in vitro fertilization programs,” Fertility and Sterility, vol. 72, no. 4,
pp. 604–609, 1999.
[6] T. T. Huang, D. H. Huang, H. J. Ahn, C. Arnett, and C. T. Huang, “Early blastocyst
expansion in euploid and aneuploid human embryos: evidence for a non-invasive and
quantitative marker for embryo selection,” Reproductive Biomedicine Online, vol. 39,
no. 1, pp. 27–39, 2019.
[7] T. T. Huang, B. C. Walker, M. Harun, A. T. Ohta, M. Rahman, J. Mellinger,
and W. Chang, “Automated computer analysis of human blastocyst expansion from
embryoscope time-lapse image files,” Fertility and Sterility, vol. 112, no. 3, pp. e292–
e293, 2019.
[8] R. T. Scott Jr, K. Ferry, J. Su, X. Tao, K. Scott, and N. R. Treff, “Comprehensive
chromosome screening is highly predictive of the reproductive potential of human
embryos: a prospective, blinded, nonselection study,” Fertility and Sterility, vol. 97,
no. 4, pp. 870–875, 2012.
[9] M. D. Werner, M. P. Leondires, W. B. Schoolcraft, B. T. Miller, A. B. Copperman,
E. D. Robins, F. Arredondo, T. N. Hickman, J. Gutmann, W. J. Schillings et al.,
“Clinically recognizable error rate after the transfer of comprehensive chromosomal
screened euploid embryos is low,” Fertility and Sterility, vol. 102, no. 6, pp. 1613–1618,
2014.
[10] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson,
U. Franke, S. Roth, and B. Schiele, “The cityscapes dataset for semantic urban scene
understanding,” in Proceedings of the IEEE conference on computer vision and pattern
recognition, 2016, pp. 3213–3223.
[11] M. Oberweger, P. Wohlhart, and V. Lepetit, “Hands deep in deep learning for hand
pose estimation,” arXiv preprint arXiv:1502.06807, 2015.
[12] Y. Yoon, H.-G. Jeon, D. Yoo, J.-Y. Lee, and I. So Kweon, “Learning a deep
convolutional network for light-field image super-resolution,” in Proceedings of the
IEEE international conference on computer vision workshops, 2015, pp. 24–32.
[13] J. Wan, D. Wang, S. C. H. Hoi, P. Wu, J. Zhu, Y. Zhang, and J. Li, “Deep learning
for content-based image retrieval: A comprehensive study,” in Proceedings of the 22nd
ACM international conference on Multimedia, 2014, pp. 157–166.
[14] H. Zhu, F. Meng, J. Cai, and S. Lu, “Beyond pixels: A comprehensive survey from
bottom-up to semantic image segmentation and cosegmentation,” Journal of Visual
Communication and Image Representation, vol. 34, pp. 12–27, 2016.
[15] G. Litjens, T. Kooi, B. E. Bejnordi, A. A. A. Setio, F. Ciompi, M. Ghafoorian, J. A.
Van Der Laak, B. Van Ginneken, and C. I. Sánchez, “A survey on deep learning in
medical image analysis,” Medical Image Analysis, vol. 42, pp. 60–88, 2017.
[16] O. Ronneberger, P. Fischer, and T. Brox, “U-Net: Convolutional networks for
biomedical image segmentation,” in International Conference on Medical Image
Computing and Computer-Assisted Intervention (MICCAI). Springer, 2015, pp. 234–
241.
[17] M. Z. Alom, M. Hasan, C. Yakopcic, T. M. Taha, and V. K. Asari, “Recurrent
residual convolutional neural network based on U-Net (R2U-Net) for medical image
segmentation,” arXiv preprint arXiv:1802.06955, 2018.
[18] X. Li, H. Chen, X. Qi, Q. Dou, C.-W. Fu, and P.-A. Heng, “H-DenseUNet: Hybrid
densely connected UNet for liver and tumor segmentation from CT volumes,” IEEE
Transactions on Medical Imaging, vol. 37, no. 12, pp. 2663–2674, 2018.
[19] N. Ibtehaz and M. S. Rahman, “MultiResUNet: Rethinking the U-Net architecture for
multimodal biomedical image segmentation,” Neural Networks, vol. 121, pp. 74–87,
2020.
[20] Z.-L. Ni, G.-B. Bian, X.-H. Zhou, Z.-G. Hou, X.-L. Xie, C. Wang, Y.-J. Zhou, R.-Q.
Li, and Z. Li, “RAUNet: Residual attention U-Net for semantic segmentation of cataract
surgical instruments,” in International Conference on Neural Information Processing.
Springer, 2019, pp. 139–149.
Chapter 2
Methodology
We implement the RD-UNet model [1] for segmenting ICM and TE regions in ZP-intact
embryo images obtained by a microscope. The model is based on the baseline UNet architecture
[2], residual convolutional units, and dilated convolutional layers. We will discuss each of
these components in the following sections.
For segmenting inner cell regions in ZP-ablated embryos, we propose a UNet-based
model. Here, we use embryonic images obtained by time-lapse imaging. However, we adopt
a static image segmentation approach, for two reasons: 1) the inner cell
varies dramatically across consecutive time frames, with nonperiodic time points across frames
and videos (in general, 3 frames/hour); 2) the collected video dataset is relatively small: it
consists of 45 videos with 30 or 31 frames each.
Inspired by successful applications of UNet to medical image segmentation [3], we
developed an improved convolutional neural network (CNN) architecture, called Deep
Dilated Residual Recurrent UNet (D2R2-UNet), for ZP-ablated embryo image segmentation.
Similar to the original UNet architecture [2], the proposed D2R2-UNet
consists of an encoder and a decoder whose last encoding and first decoding units
are connected by a central bridge. Inspired by the deep residual model [4] and recurrent CNNs
[5], we made the following three modifications to the baseline UNet architecture:
1) We replaced the convolutional units of the baseline UNet with residual units built
from two recurrent convolutional layers, called R2 convolutional units, in both the
encoder and decoder.
2) In the central bridge, we replaced the convolutional layers with dilated convolutional
layers.
3) We incorporated a series of residual convolutional layers into the baseline UNet
skip-connections between the encoder and decoder.
We will discuss the details of these modifications in the following sections.
2.1 Baseline UNet Architecture
To better understand our modifications, we first briefly review the baseline UNet
architecture. The baseline UNet is composed of symmetrical contracting (encoding)
and expansive (decoding) paths that are connected to each other via encoder-decoder skip-
connections. The contracting units capture context, whereas the expanding units enable
localization. The contracting units encode the input image into a set of feature maps using
convolutional layers with no skip connections. The expansive units decode the compact
feature maps into a pixel-wise representation, i.e., a semantic segmentation. This encoding-
decoding architecture is well suited to the semantic segmentation task. The encoder and
decoder are built on the conventional CNN architecture and consist of four down-sampling
and four up-sampling convolutional units, respectively. Each down-sampling convolutional unit
involves a sequence of two convolutional layers with a 3 × 3 kernel size, followed by a
rectified linear unit (RELU) activation [6] and a max-pooling with a 2 × 2 window size
and stride 2. Fig. 2.2(a) shows a convolutional unit of the baseline
UNet. The number of feature channels is doubled after down-sampling at each
encoder block. On the decoder side, each up-sampling convolutional unit involves a sequence
of two convolutional layers with 3 × 3 kernels, followed by a RELU activation and
upsampling with a 2 × 2 window size and stride 2. The number of feature channels is
reduced by half at each up-sampling step.
Figure 2.1 The baseline UNet architecture [2]. The height and width of each box represent the image size and number of channels, respectively. The dotted boxes denote copied feature maps.
A concatenation, i.e., skip connection, is then established between the down-sampled and
upsampled features. In the final layer of the decoder, a sigmoid activation is performed to
generate class-wise probabilities for each pixel. Both the encoder and decoder consist of four
convolutional units. See the baseline UNet architecture in Fig. 2.1.
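For concreteness, the following is a minimal Keras sketch of this baseline architecture, assuming a 256 × 256 grayscale input and 16 base channels; the helper names conv_unit and build_unet are our own, and it uses "same" padding for simplicity, whereas the original UNet [2] used unpadded convolutions.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

def conv_unit(x, channels):
    # Two 3x3 convolutions, each followed by RELU, as in the baseline UNet.
    x = layers.Conv2D(channels, 3, padding="same", activation="relu")(x)
    x = layers.Conv2D(channels, 3, padding="same", activation="relu")(x)
    return x

def build_unet(input_shape=(256, 256, 1), base_channels=16):
    inputs = layers.Input(input_shape)
    # Encoder: four down-sampling units; channels double at each step.
    skips, x = [], inputs
    for i in range(4):
        x = conv_unit(x, base_channels * 2 ** i)
        skips.append(x)                        # kept for the skip connections
        x = layers.MaxPooling2D(2, strides=2)(x)
    x = conv_unit(x, base_channels * 16)       # central bridge
    # Decoder: four up-sampling units; channels halve at each step.
    for i in reversed(range(4)):
        x = layers.UpSampling2D(2)(x)
        x = layers.concatenate([skips[i], x])  # encoder-decoder skip connection
        x = conv_unit(x, base_channels * 2 ** i)
    # Sigmoid gives a per-pixel foreground probability (binary segmentation).
    outputs = layers.Conv2D(1, 1, activation="sigmoid")(x)
    return Model(inputs, outputs)

model = build_unet()
model.summary()
```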
2.2 Proposed D2R2-UNet Architecture
To better capture context, particularly for small structures, i.e., to improve
context modulation [5], we use residual recurrent (R2) convolutional units instead of
the typical convolutional units in the baseline UNet. The R2 convolutional unit combines the
strengths of residual and recurrent learning, which helps the CNN increase segmentation
performance. To facilitate context extraction from high-level features at the central bridge
without increasing the number of CNN parameters, we use dilated convolutional layers [7] in
the central bridge, rather than the regular convolutional layers used in the baseline UNet. To
reduce the semantic disparity between low-level and high-level features [8] and better recover
information lost during pooling operations, we incorporate a series of residual convolutional layers
into the baseline UNet encoder-decoder skip-connections. We describe these modifications
in detail in the following subsections.
2.2.1 Residual Convolutional Unit
Skip connections [4] are incorporated into each convolutional unit of the baseline UNet
[2], based on the empirical results in [9] showing that skip connections produce a benign
optimization landscape in training. We hypothesize that residual convolutional units
improve the training/testing performance, considering that the baseline UNet is
sufficiently deep (23 convolutional layers; the further modifications in the following subsections
lead to 36 convolutional layers). In a residual unit, a residual skip connection exists
between the input to the first convolutional layer and the output of the second convolutional layer.
This residual skip connection is implemented by a 1 × 1 convolution. Fig. 2.2(b) depicts a
residual convolutional unit.
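A sketch of such a residual unit, reusing the Keras layers import from the baseline sketch above (the name residual_conv_unit is ours; ELU is used per the note in Fig. 2.2):

```python
def residual_conv_unit(x, channels):
    # Two 3x3 conv + ELU layers, with a 1x1-projected shortcut from the
    # unit input to the output of the second convolution.
    shortcut = layers.Conv2D(channels, 1, padding="same")(x)
    y = layers.Conv2D(channels, 3, padding="same", activation="elu")(x)
    y = layers.Conv2D(channels, 3, padding="same")(y)
    y = layers.add([shortcut, y])   # residual skip connection
    return layers.Activation("elu")(y)
```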
2.2.2 Recurrent Convolutional Unit
We replace the conventional convolutional units in the baseline UNet with recurrent
convolutional units, which help the CNN better understand context, especially for
small objects, while avoiding an increase in the number of CNN parameters [5]. At each step
s ≥ 1, we add a recurrent feature and a feed-forward feature, each computed by a shared
convolutional kernel. Specifically, we use recurrent convolutional units with the number of
evolution steps S = 3, where the first recurrent convolutional layer performs the following
evolution steps:
$$x_c^{(0)} = A(f_c \circledast x), \quad x_c^{(1)} = A(r_c \circledast x_c^{(0)} + f_c \circledast x), \quad x_c^{(2)} = A(r_c \circledast x_c^{(1)} + f_c \circledast x) \tag{2.1}$$
for c = 1, . . . , C, in which C is the number of channels. Here, A is some activation function,
e.g., RELU [6] or ELU [10]; the subscript index $(\cdot)_c$ denotes the c-th convolutional channel;
the superscript indices $(\cdot)^{(s)}$ denote the evolution steps, s = 0, . . . , S − 1; $\circledast$ denotes a
convolution operator; $f_c$ and $r_c$ are the feed-forward and recurrent convolutional kernels at the
c-th channel, respectively, ∀c; and x denotes the input. The second recurrent convolutional
layer does not expand the number of channels, i.e., it replaces x in Equation 2.1 with the
output of the first recurrent convolutional layer, $x_c^{(2)}$, ∀c. See graphical illustrations of a
recurrent convolutional unit consisting of these two recurrent convolutional layers in Fig. 2.2(c),
and of an S = 3 recurrent convolutional layer in Fig. 2.2(e).

Figure 2.2 Different convolutional units we compared in UNet. (a) A convolutional unit in the baseline UNet [2]. (b) A residual convolutional unit [4]. (c) A recurrent convolutional unit [5]. (d) An R2 convolutional unit [11]. (e) A recurrent convolutional layer [5] with the number of evolution steps S = 3. For all UNet variations, we use ELU [10] instead of RELU [6], since ELU slightly improved the embryo image segmentation performance.
Using recurrent convolutional units increases the UNet depth while avoiding an increase
in UNet complexity, thanks to the shared convolutional kernels. We expect this to help the
network better understand context while avoiding overfitting risks (in [5], using S = 4
recurrent convolutional units improved image recognition performance over a CNN with
the same depth and number of parameters obtained by simply increasing the depth).
We observed that in our application, using S = 3 recurrent convolutional units gives better
overall image segmentation performance than using S = 2 or S = 4 recurrent
convolutional units.
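The following sketch implements Equation 2.1 in Keras (function names are ours); instantiating each Conv2D once and calling it at every step is what shares $f_c$ and $r_c$ across the S = 3 evolution steps.

```python
def recurrent_conv_layer(x, channels, steps=3):
    # One recurrent convolutional layer (Equation 2.1) with S = 3:
    # a shared feed-forward kernel f and a shared recurrent kernel r.
    f = layers.Conv2D(channels, 3, padding="same")   # f_c, shared over steps
    r = layers.Conv2D(channels, 3, padding="same")   # r_c, shared over steps
    act = layers.Activation("elu")
    feed_forward = f(x)
    h = act(feed_forward)                            # x^(0) = A(f * x)
    for _ in range(steps - 1):
        # x^(s) = A(r * x^(s-1) + f * x)
        h = act(layers.add([r(h), feed_forward]))
    return h

def recurrent_conv_unit(x, channels):
    # Two stacked recurrent convolutional layers (Fig. 2.2(c)); the second
    # keeps the channel count and takes the first layer's output as input.
    x = recurrent_conv_layer(x, channels)
    return recurrent_conv_layer(x, channels)
```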
2.2.3 Residual Recurrent (R2) Convolutional Unit
To further improve the image segmentation performance, we fuse recurrent convolutional
layers with residual connectivity to form the R2 convolutional unit, similar to R2-UNet [11].
Fig. 2.2(d) depicts an R2 convolutional unit. Different from R2-UNet [11], which uses four
evolution steps (S = 4), we use three evolution steps (S = 3); four evolution steps did
not improve the performance. This formulation yields a more efficient CNN that improves
segmentation performance through a better understanding of context over multiple
evolution steps.
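Combining the two previous sketches gives an R2 unit (the name r2_conv_unit is ours): a recurrent convolutional unit wrapped with a 1 × 1-projected residual shortcut.

```python
def r2_conv_unit(x, channels):
    # R2 unit (Fig. 2.2(d)): two recurrent convolutional layers wrapped
    # with a residual shortcut, projected by a 1x1 convolution.
    shortcut = layers.Conv2D(channels, 1, padding="same")(x)
    y = recurrent_conv_unit(x, channels)
    return layers.Activation("elu")(layers.add([shortcut, y]))
```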
2.2.4 Dilated Convolution in the Central Bridge
The receptive field of a CNN plays a critical role in semantic image segmentation. A broader
receptive field helps extract information from a larger region of the image. Stacking
more convolutional layers increases the receptive field size linearly with kernel size, but also
increases the number of network parameters [12]. Adding more down-sampling layers expands
the receptive field size multiplicatively, but at the price of spatial information loss [12].
Alternatively, dilated convolution provides exponential expansion of the receptive field with
no increase in parameters and no loss of spatial information [7]. Unlike typical convolution
with no space between kernel weights, dilated convolution inserts zero(s) between kernel
weights according to the dilation rate and expands the receptive field size accordingly. For example,
a 3 × 3 kernel with dilation rate 2 increases the receptive field size from 3 × 3 to 5 × 5, while
keeping the number of kernel parameters at 9. After several downsampling steps, we add
multiple dilated convolutional layers in the central bridge, similar to [13], rather than stacking
additional pooling layers and/or typical convolutional layers. We thereby preserve
spatial information in the central bridge and expand the receptive field of the baseline
UNet from 140 × 140 to 198 × 198. Thus, adding multiple dilated convolutional layers to
the central bridge expands the network's receptive field, giving it larger access to the input.
This helps the CNN better capture context and improves the segmentation prediction.
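A sketch of such a central bridge (dilated_bridge is our own name), stacking 3 × 3 convolutions with the dilation rates used in D2R2-UNet:

```python
def dilated_bridge(x, channels, rates=(1, 2, 4, 8, 16)):
    # Central bridge: stacked 3x3 dilated convolutions with increasing
    # dilation rates, expanding the receptive field without additional
    # pooling and without losing spatial resolution.
    for rate in rates:
        x = layers.Conv2D(channels, 3, padding="same",
                          dilation_rate=rate, activation="elu")(x)
    return x
```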
2.2.5 Residual Convolutional Skip-Connections between Encoder and
Decoder
The conventional UNet encoder-decoder skip connections copy encoded features from the
encoder to the upsampled features in the decoder, the latter of which are supposed to be of
higher level because they are derived in the very deep UNet layers.
Figure 2.3 A residual convolutional encoder-decoder skip-connection consisting of four residual convolutional layers, each of which applies a 3 × 3 convolution followed by ELU activation.
Figure 2.4 The D2R2-UNet architecture for inner cell segmentation: we modify the UNet backbone in Fig. 2.1 by using R2 convolutional units, dilated convolutional layers, and residual convolutional encoder-decoder skip-connections. The height and width of each box represent the image size and number of channels, respectively. The black and blue dotted boxes denote the central bridge and copied feature maps, respectively.
Merging these two sets of features in the decoder facilitates spatial information propagation
and recovers information lost from the upsampled features during pooling and/or RELU
operations. However, a semantic gap potentially exists between the two sets of features, and
this discrepancy might affect the prediction accuracy [8]. To mitigate this potential issue,
we adapt the technique in [8], which incorporates residual convolutional layers into the
conventional encoder-decoder skip-connections. Fig. 2.3 shows a residual convolutional
encoder-decoder skip-connection.
2.3 RD-UNet Architecture
We implement a CNN architecture named Residual Dilated UNet (RD-UNet) [1] for
ICM and TE segmentation. The RD-UNet is a modified version of the baseline UNet [2].
Figure 2.5 The RD-UNet architecture for ICM and TE segmentation [1]: we modify the UNet backbone in Fig. 2.1 by using residual convolutional units and dilated convolutional layers.
We included residual convolutional units (see Section 2.2.1) in the encoder and decoder,
and dilated convolutional layers (see Section 2.2.4) in the central bridge, to improve the
segmentation performance. See Fig. 2.5 for the detailed architecture of RD-UNet.
2.4 Network Specifications
In D2R2-UNet, the encoder and decoder, each consisting of four R2 convolutional units,
are connected via residual convolutional encoder-decoder skip-connections, as shown in
Fig. 2.4. The number of channels is c = 16 in the first unit of the encoder (leftmost), and
we double the number in each successive unit (toward the central bridge). Accordingly, we set
the number of channels in the first unit of the decoder (next to the central bridge) to c = 8 · 16 and
halve the number in each successive unit (toward the final unit at the rightmost). We reduced the
size of the feature maps by half at each encoding step and doubled it at each decoding step. Then,
we added five dilated convolutional layers to the central bridge using dilation rates of 1,
2, 4, 8, and 16, successively. Since the semantic gap between encoder and decoder tends to
decrease from shallow layers (at the left) toward deep layers (at the center), we gradually reduce
the number of residual convolutional layers (4, 3, 2, and 1) in the skip-connections between the
encoder and decoder in the direction from shallow to deep layers. We also added
batch normalization to accelerate convergence [14], and included 5% dropout
to prevent overfitting [15].
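Putting the earlier sketches together, below is a compact sketch of this assembly (build_d2r2_unet is our own name; batch normalization is omitted for brevity, and the dropout placement is our assumption):

```python
def build_d2r2_unet(input_shape=(256, 256, 1), base_channels=16):
    # D2R2-UNet sketch (Fig. 2.4): R2 units in the encoder/decoder, a
    # dilated central bridge, and residual convolutional skip-connections
    # of decreasing length (4, 3, 2, 1) from shallow to deep levels.
    inputs = layers.Input(input_shape)
    skips, x = [], inputs
    for i in range(4):
        x = r2_conv_unit(x, base_channels * 2 ** i)
        skips.append(residual_skip_connection(x, num_layers=4 - i))
        x = layers.MaxPooling2D(2, strides=2)(x)
        x = layers.Dropout(0.05)(x)            # 5% dropout (placement assumed)
    x = dilated_bridge(x, base_channels * 16)  # dilation rates 1, 2, 4, 8, 16
    for i in reversed(range(4)):
        x = layers.UpSampling2D(2)(x)
        x = layers.concatenate([skips[i], x])
        x = r2_conv_unit(x, base_channels * 2 ** i)
    outputs = layers.Conv2D(1, 1, activation="sigmoid")(x)
    return Model(inputs, outputs)
```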
We compared the proposed D2R2-UNet with other candidate models such as baseline
UNet [2], Dilated UNet (D-UNet) [13], Residual UNet (Res-UNet) [16], Recurrent UNet
(Rec-UNet) [11], Recurrent Residual UNet (R2-UNet) [11], and Residual Dilated UNet (RD-
UNet) [1]. Baseline UNet consists of typical convolutional units. D-UNet employs a series of
five dilated convolutional layers with dilation rates 1, 2, 4, 8, and 16 in the central bridge.
Res-UNet includes residual convolutional units. Rec-UNet utilizes recurrent convolutional
units. R2-UNet has R2 convolutional units. Finally, RD-UNet consists of a central bridge
similar to that of D-UNet, together with residual convolutional units. For a fair comparison among
all the CNNs, we kept the basic architecture equivalent and the optimization hyperparameters
and dataset identical.
The RD-UNet for ICM and TE segmentation consists of four residual convolutional units
in the encoder and decoder. It includes a series of four dilated convolutional layers with dilation
rates 1, 2, 4, and 8 in the central bridge. Here, we used RELU activation [6] in each
encoding and decoding unit. Fig. 2.5 illustrates the detailed architecture. For ICM segmentation,
we compare the RD-UNet model with existing methods, i.e., a CNN with discrete cosine transform
(DCT) [17], coarse-to-fine texture analysis [18], texture analysis, clustering, and watershed
segmentation [19], VGG16 [20], and SD-UNet [13]. For TE segmentation, we
compare the RD-UNet model with existing methods, i.e., a level-set algorithm with Retinex
theory [21], a CNN with discrete cosine transform [17], and texture analysis, clustering, and
watershed segmentation [19].
2.5 Loss Function
We aim to classify each pixel into one of two classes, target (inner cell, corresponding to 1)
and background (non inner cell, corresponding to 0), so our image segmentation problem can
be viewed as a pixel-wise binary classification problem. A natural choice for the training loss
function is the binary cross-entropy loss E(S; x, y), which learns an image segmentation CNN
S using an input image x and a ground-truth annotation image y ∈ {0, 1} by minimizing the
averaged pixel-wise cross-entropy:
$$E(S; x, y) = -\frac{1}{N} \sum_{n=1}^{N} \Big[ y_n \log(S(x)_n) + (1 - y_n) \log(1 - S(x)_n) \Big] \tag{2.2}$$
where the sigmoid function in the final layer of S gives the probability prediction values,
i.e., $\{S(x)_n \in (0, 1) : \forall n\}$.
In our training images, the number of pixels classified as background often dominates
the number classified as inner cell, so the cross-entropy training loss in Equation 2.2 can potentially
underestimate the inner cell prediction. To overcome this class imbalance limitation, we
incorporate the Jaccard index J(S; x, y) [22], which quantifies the similarity between the ground-
truth annotation y and the probability prediction values S(x):
$$J(S; x, y) = \frac{1}{N} \sum_{n=1}^{N} \frac{y_n S(x)_n}{y_n + S(x)_n - y_n S(x)_n}, \tag{2.3}$$
where J(S; x, y) ∈ [0, 1]. Combining Equations 2.2 and 2.3 gives the following joint training
loss [22]:
$$L(S; x, y) = E(S; x, y) - \log(J(S; x, y)). \tag{2.4}$$
We do not include a regularization parameter in Equation 2.4, because we observed that
the binary cross-entropy and Jaccard losses, Equations 2.2 and 2.3, are in a similar range
[22]. The net effect is that, as the total loss is minimized, one simultaneously improves
the pixel classification accuracy and increases the intersection between the ground truth and
predicted segmentation.
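A Keras sketch of this joint loss (function names are ours; the small eps term guards against division by zero and is our addition):

```python
import tensorflow.keras.backend as K

def jaccard_index(y_true, y_pred, eps=1e-7):
    # Soft Jaccard index of Equation 2.3, averaged over all pixels.
    intersection = y_true * y_pred
    union = y_true + y_pred - intersection
    return K.mean(intersection / (union + eps))

def joint_loss(y_true, y_pred):
    # Equation 2.4: binary cross-entropy minus the log of the Jaccard index.
    bce = K.mean(K.binary_crossentropy(y_true, y_pred))
    return bce - K.log(jaccard_index(y_true, y_pred))
```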
2.6 Implementation Details
We implemented the training and testing of all CNNs using Keras with a TensorFlow backend.
We used the Nadam optimizer (Adam with Nesterov momentum) [23] to minimize the loss
function in Equation 2.4. Here, we set the initial learning rate to $10^{-3}$ and reduced it by a
factor of 0.05 every 5 epochs, with a minimum learning rate of $10^{-5}$. We trained
all CNNs on a GPU (NVIDIA GTX 1070 with 8 GB memory) with a mini-batch size of
4. Since the loss and Jaccard values stagnated near 100 epochs, we set the maximum number
of epochs to 100. We split the dataset of 1368 images into a training set (75% of the dataset)
and a testing set (25% of the dataset). We randomly sampled the training set in every epoch to
improve the learning. Given a small training set, we applied data augmentation such as horizontal
and vertical flips, rotation in a range up to 270°, horizontal and vertical shifts up to 10% of the
width or height, and zoom up to 10% in size. Finally, we used a 0.5 threshold on the final
semantic probability map.
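A sketch of the corresponding training setup (the stepwise learning-rate schedule is omitted; x_train and y_train are hypothetical image/mask arrays, and the paired generators with a shared seed keep image and mask augmentations in sync):

```python
from tensorflow.keras.optimizers import Nadam
from tensorflow.keras.preprocessing.image import ImageDataGenerator

model = build_d2r2_unet()
model.compile(optimizer=Nadam(learning_rate=1e-3), loss=joint_loss)

# Flips, rotation, shifts, and zoom, as described above.
aug_args = dict(horizontal_flip=True, vertical_flip=True,
                rotation_range=270, width_shift_range=0.1,
                height_shift_range=0.1, zoom_range=0.1)
image_gen = ImageDataGenerator(**aug_args)
mask_gen = ImageDataGenerator(**aug_args)

# The identical seed applies the same random transform to image and mask:
# train_iter = zip(image_gen.flow(x_train, batch_size=4, seed=1),
#                  mask_gen.flow(y_train, batch_size=4, seed=1))
# model.fit(train_iter, steps_per_epoch=len(x_train) // 4, epochs=100)
```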
2.7 Dataset
2.7.1 Dataset for ICM and TE Segmentation
We used the blastocyst dataset of [19]. The ICM and TE regions were manually segmented and
annotated by embryologists at the Pacific Centre for Reproductive Medicine, Canada. We used
the human-annotated images as ground truth to evaluate the segmentation performance.
The dataset has 249 images in total. We split it into two sets: a training set
of 211 images and a testing set of 38 images.
2.7.2 Dataset for Inner Cell Segmentation
We constructed a dataset of 1368 images in total, representing both genetic health conditions
(normal/abnormal). The embryologists at the Pacific IVF Institute in Hawai‘i cultured
and monitored the embryos over 6 days using embryoscopes (Vitrolife, USA). On day 5 of
culture, they ablated the ZPs using a Lykos laser (Hamilton-Thorne, USA). The embryoscopes
captured images of the ZP-ablated embryos for 10 hours using a time-lapse imaging technique
[24, 25]. The pixels corresponding to the inner cell were manually annotated by personnel
supervised by embryologists. We use the human-annotated images as ground truth to train and
test the CNNs.
References
[1] M. Y. Harun, T. Huang, and A. T. Ohta, “Inner cell mass and trophectoderm
segmentation in human blastocyst images using deep neural network,” in 13th IEEE
International Conference on Nano/Molecular Medicine and Engineering. IEEE, 2019,
pp. 214–219.
[2] O. Ronneberger, P. Fischer, and T. Brox, “U-Net: Convolutional networks for
biomedical image segmentation,” in International Conference on Medical Image
Computing and Computer-Assisted Intervention (MICCAI). Springer, 2015, pp. 234–
241.
[3] G. Litjens, T. Kooi, B. E. Bejnordi, A. A. A. Setio, F. Ciompi, M. Ghafoorian, J. A.
Van Der Laak, B. Van Ginneken, and C. I. Sánchez, “A survey on deep learning in
medical image analysis,” Medical Image Analysis, vol. 42, pp. 60–88, 2017.
[4] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,”
in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition,
2016, pp. 770–778.
[5] M. Liang and X. Hu, “Recurrent convolutional neural network for object recognition,”
in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition,
2015, pp. 3367–3375.
[6] V. Nair and G. E. Hinton, “Rectified linear units improve restricted boltzmann
machines,” in Proceedings of the 27th International Conference on Machine Learning
(ICML), 2010, pp. 807–814.
[7] F. Yu and V. Koltun, “Multi-scale context aggregation by dilated convolutions,” arXiv
preprint arXiv:1511.07122, 2015.
[8] N. Ibtehaz and M. S. Rahman, “MultiResUNet: Rethinking the U-Net architecture for
multimodal biomedical image segmentation,” Neural Networks, vol. 121, pp. 74–87,
2020.
[9] H. Li, Z. Xu, G. Taylor, C. Studer, and T. Goldstein, “Visualizing the loss landscape
of neural nets,” in Proc. NIPS 31, Montreal, Canada, Dec. 2018, pp. 6389–6399.
[10] D.-A. Clevert, T. Unterthiner, and S. Hochreiter, “Fast and accurate deep network
learning by exponential linear units (ELUs),” arXiv preprint arXiv:1511.07289, 2015.
[11] M. Z. Alom, M. Hasan, C. Yakopcic, T. M. Taha, and V. K. Asari, “Recurrent
residual convolutional neural network based on U-Net (R2U-Net) for medical image
segmentation,” arXiv preprint arXiv:1802.06955, 2018.
[12] W. Luo, Y. Li, R. Urtasun, and R. Zemel, “Understanding the effective receptive field
in deep convolutional neural networks,” in Advances in neural information processing
systems, 2016, pp. 4898–4906.
[13] R. M. Rad, P. Saeedi, J. Au, and J. Havelock, “Multi-resolutional ensemble of stacked
dilated U-Net for inner cell mass segmentation in human embryonic images,” in 25th
IEEE International Conference on Image Processing (ICIP). IEEE, 2018, pp. 3518–
3522.
[14] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by
reducing internal covariate shift,” arXiv preprint arXiv:1502.03167, 2015.
[15] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout:
a simple way to prevent neural networks from overfitting,” The Journal of Machine
Learning Research, vol. 15, no. 1, pp. 1929–1958, 2014.
[16] Z. Zhang, Q. Liu, and Y. Wang, “Road extraction by deep residual U-Net,” IEEE
Geoscience and Remote Sensing Letters, vol. 15, no. 5, pp. 749–753, 2018.
[17] S. Kheradmand, P. Saeedi, and I. Bajic, “Human blastocyst segmentation using neural
network,” in IEEE Canadian Conference on Electrical and Computer Engineering
(CCECE). IEEE, 2016, pp. 1–4.
[18] R. M. Rad, P. Saeedi, J. Au, and J. Havelock, “Coarse-to-fine texture analysis for
inner cell mass identification in human blastocyst microscopic images,” in Seventh
International Conference on Image Processing Theory, Tools and Applications (IPTA).
IEEE, 2017, pp. 1–5.
[19] P. Saeedi, D. Yee, J. Au, and J. Havelock, “Automatic identification of human
blastocyst components via texture,” IEEE Transactions on Biomedical Engineering,
vol. 64, no. 12, pp. 2968–2978, 2017.
[20] S. Kheradmand, A. Singh, P. Saeedi, J. Au, and J. Havelock, “Inner cell mass
segmentation in human HMC embryo images using fully convolutional network,” in
IEEE International Conference on Image Processing (ICIP). IEEE, 2017, pp. 1752–
1756.
[21] A. Singh, J. Au, P. Saeedi, and J. Havelock, “Automatic segmentation of trophectoderm
in microscopic images of human blastocysts,” IEEE Transactions on Biomedical
Engineering, vol. 62, no. 1, pp. 382–393, 2014.
[22] V. Iglovikov, S. Mushinskiy, and V. Osin, “Satellite imagery feature detection
using deep convolutional neural network: A Kaggle competition,” arXiv preprint
arXiv:1706.06169, 2017.
[23] T. Dozat, “Incorporating Nesterov momentum into Adam,” in ICLR Workshop, 2016.
[24] T. T. Huang, D. H. Huang, H. J. Ahn, C. Arnett, and C. T. Huang, “Early blastocyst
expansion in euploid and aneuploid human embryos: evidence for a non-invasive and
quantitative marker for embryo selection,” Reproductive Biomedicine Online, vol. 39,
no. 1, pp. 27–39, 2019.
[25] T. T. Huang, B. C. Walker, M. Harun, A. T. Ohta, M. Rahman, J. Mellinger,
and W. Chang, “Automated computer analysis of human blastocyst expansion from
embryoscope time-lapse image files,” Fertility and Sterility, vol. 112, no. 3, pp. e292–
e293, 2019.
Chapter 3
Results and Discussion
3.1 Evaluation
To validate the embryo image segmentation, we used several evaluation metrics: the Jaccard
index, Dice Coefficient, accuracy, precision, specificity, and recall. These metrics [1] are
calculated from four cardinalities, i.e., true positives (TP), false positives (FP), true
negatives (TN), and false negatives (FN). TP measures the number of pixels that are correctly
identified as the target class (ICM/TE/inner cell). Analogously, TN counts the number of pixels
that are correctly detected as background (non ICM/TE/inner cell). In contrast, FP
counts pixels that are incorrectly identified as the target class. Similarly, FN counts pixels
that are misclassified as background pixels.
• The Jaccard Index, also termed intersection over union, is a similarity measure
defined as the intersection between two sets A and B divided by their union,
that is:
$$\text{Jaccard} = \frac{|A \cap B|}{|A \cup B|} = \frac{TP}{TP + FP + FN} \tag{3.1}$$
• The Dice Coefficient, also known as the overlap index, is likewise a measure of overlap between
two sets, defined by:
$$\text{Dice} = \frac{2 \times |A \cap B|}{|A| + |B|} = \frac{2 \times TP}{2 \times TP + FP + FN} \tag{3.2}$$
Both Dice and Jaccard equal 1 when there is 100% overlap between the predicted
and ground truth segmentations.
• Accuracy is the ratio of correctly classified pixels, regardless of class, expressed as follows:
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{3.3}$$
• Specificity, also called the true negative rate, is the percentage of negative pixels
in the ground truth that are also detected as negative by the CNN. It is given by:
$$\text{Specificity} = \frac{TN}{TN + FP} \tag{3.4}$$
• Precision, also called the positive predictive value, measures the percentage of correctly
segmented pixels among all segmented pixels. In the ideal case, a precision of 1 means
there are no FPs in the segmentation. It is defined as follows:
$$\text{Precision} = \frac{TP}{TP + FP} \tag{3.5}$$
• Recall is the fraction of all labeled ICM/TE/inner cell pixels that are correctly
predicted, and can be expressed as follows:
$$\text{Recall} = \frac{TP}{TP + FN} \tag{3.6}$$
3.2 ICM and TE Segmentation Results
3.2.1 Quantitative Results
We compare the ICM segmentation performance of the RD-UNet model with that of existing
methods, i.e., a CNN with discrete cosine transform [2], coarse-to-fine texture analysis [3], texture
analysis, clustering, and watershed segmentation [4], VGG16 [5], and SD-UNet [6]; see Table 3.1.
Table 3.1 Comparison of ICM results of our method with those of existing methods based on the same data set.
Methods Jaccard (%) Dice (%) Precision (%) Recall (%) Accuracy (%)
CNN with DCT [2] 47.7 64.6 75.6 56.4 93.0
Coarse-to-fine texture analysis [3] 70.3 82.6 78.7 86.8 –
Texture, clustering, and watershed algorithm [4] 71.1 83.1 84.5 78.3 93.3
VGG16 [5] 76.5 86.7 – – 95.6
SD-UNet [6] 81.6 89.5 88.6 91.5 98.3
RD-UNet [8] 89.3 94.3 94.9 93.8 99.1
Table 3.2 Comparison of TE results of our method with those of existing methods based on the same data set.
Level-set algorithm and Retinex theory [7] 62.2 76.7 71.3 83.1 86.7
CNN with DCT [2] 58.9 74.2 69.1 80.0 90.0
Texture, clustering, and watershed algorithm [4] 63.0 77.3 69.0 89.0 86.6
RD-UNet [8] 85.3 92.5 91.8 93.2 98.3
The experimental results illustrate that the RD-UNet model achieves better performance
than the aforementioned existing models. It outperforms the SD-UNet model [6] by relative
margins of 0.8% in accuracy, 7.1% in precision, 2.5% in recall, 5.4% in the Dice Coefficient,
and 9.4% in the Jaccard Index.
Moreover, we compare the TE segmentation performance of the RD-UNet model with
that of existing methods, i.e., a level-set algorithm with Retinex theory [7], a CNN with discrete
cosine transform [2], and texture analysis, clustering, and watershed segmentation [4]; see Table
3.2. The TE segmentation results indicate that the RD-UNet model outperforms the existing
models, in particular the texture, clustering, and watershed model [4], by relative margins of
13.5% in accuracy, 33% in precision, 4.7% in recall, 19.7% in the Dice Coefficient, and 35.4%
in the Jaccard Index.
Compared to the existing models, we achieve highest Jaccard and Dice scores
e.g., maximum overlap between network’s predictions and corresponding ground truth
annotations. Furthermore, the RD-UNet model significantly reduces the false positive
(misclassified as ICM/TE) pixels and false negative (misidentified as background) pixels
throughout the dataset; see increased precision and recall scores in Tables 3.1 and 3.2.
The RD-UNet model can understand context better due to the residual convolutional units
in the encoder and decoder and the dilated convolutional layers in the central bridge.
Consequently, it improves both ICM and TE segmentation performance.
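To make these two ingredients concrete, the following is a minimal PyTorch sketch of a
residual convolutional unit and a dilated central bridge. The channel counts and the
dilation schedule (1, 2, 4, 8) are illustrative assumptions, not the exact RD-UNet
hyperparameters, which are described earlier in the thesis.

```python
import torch
import torch.nn as nn

class ResidualUnit(nn.Module):
    """Two 3x3 convolutions with an identity (or 1x1-projected) shortcut."""
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, padding=1),
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=1),
            nn.BatchNorm2d(out_ch),
        )
        # Project the shortcut only when the channel count changes.
        self.skip = (nn.Conv2d(in_ch, out_ch, 1)
                     if in_ch != out_ch else nn.Identity())
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.act(self.body(x) + self.skip(x))

def dilated_bridge(ch: int) -> nn.Sequential:
    """Stack of 3x3 convolutions with growing dilation, widening the
    receptive field at the bottleneck without further downsampling."""
    layers = []
    for d in (1, 2, 4, 8):  # assumed dilation schedule
        layers += [nn.Conv2d(ch, ch, 3, padding=d, dilation=d),
                   nn.ReLU(inplace=True)]
    return nn.Sequential(*layers)

# Shape check on a dummy bottleneck feature map.
x = torch.randn(1, 64, 32, 32)
print(dilated_bridge(64)(ResidualUnit(64, 64)(x)).shape)  # (1, 64, 32, 32)
```

With padding equal to the dilation rate, each dilated convolution preserves the spatial
size, so the bridge enlarges context without losing resolution.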
3.2.2 Qualitative Results
We compared the predicted ICM and TE segmentation results with the ground truth ICM and TE
annotations. The contours of the ground truth ICM and TE annotations are overlaid on those of
the predicted ICM and TE segments to visualize the differences. To better understand the ICM
segmentation quality, we categorize the results into best (Jaccard Index of more
than 97%), better (Jaccard Index from 92% to 97%), and fair (Jaccard Index from 77% to
92%) segmentation; see Fig. 3.1. In the ICM segmentation results, 36.8%, 50%, and 13.2% of
the test images fall into the best, better, and fair categories, respectively.
To better understand the TE segmentation quality, we categorize the results into
best (Jaccard Index of more than 94%), better (Jaccard Index from 87% to 94%), and
fair (Jaccard Index from 76% to 87%) segmentation; see Fig. 3.2. In the TE segmentation
results, 31.6%, 47.4%, and 21% of the test images fall into the best, better, and fair
categories, respectively. The results are categorized by the Jaccard Index because the
other performance metrics are uniformly high (a sketch of this binning follows).
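As a minimal sketch of this categorization, the function below bins per-image Jaccard
scores using the ICM thresholds quoted above; the boundary handling and the example score
list are illustrative assumptions.

```python
def categorize(jaccard_scores, best=0.97, better=0.92):
    """Bin per-image Jaccard scores into best/better/fair percentages."""
    bins = {"best": 0, "better": 0, "fair": 0}
    for j in jaccard_scores:
        if j > best:
            bins["best"] += 1
        elif j >= better:
            bins["better"] += 1
        else:
            bins["fair"] += 1
    total = len(jaccard_scores)
    return {k: 100 * v / total for k, v in bins.items()}

# Toy scores: {'best': 25.0, 'better': 50.0, 'fair': 25.0}
print(categorize([0.98, 0.95, 0.93, 0.85]))
```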
Next, we discuss the qualitative ICM segmentation performance of the RD-UNet model;
Fig. 3.1 shows some representative results, where the (i, j)th image denotes the image at the
ith row and jth column of Fig. 3.1. For all three segmentation categories, RD-UNet
successfully segments ICM regions even if they connect and/or overlap with the TE/CM; see
the (1, 4)th, (2, 4)th, (3, 4)th, and (4, 4)th images. In general, the contours of the segmented
ICM align well with those of the ground truth annotations; compare the red contours and yellow
regions in the 4th column. However, the RD-UNet model shows limited ICM segmentation
performance where indistinct features exist between the ICM and the TE/CM; see the fair
segmentation category in Fig. 3.1. For example, in the (5, 4)th image, mis-segmentation occurs where
Figure 3.1 ICM segmentation results by RD-UNet. The background (non-ICM) is colored dark cyan, the annotated ground truth ICM is light green, the network-predicted ICM is yellow, and the contour of the ground truth ICM is red. JI and DC stand for Jaccard Index and Dice Coefficient, respectively.
Figure 3.2 TE segmentation results by RD-UNet. The background (non-TE) is colored dark cyan, the annotated ground truth TE is light green, the network-predicted TE is yellow, and the contour of the ground truth TE is red. JI and DC stand for Jaccard Index and Dice Coefficient, respectively.
the ICM and TE have similar textures; in the (6, 4)th image, mis-segmentation occurs where
the ICM has a texture similar to that of the CM.
We next explain the qualitative TE segmentation performance of the RD-UNet model;
Fig. 3.2 shows some representative results, where the (i, j)th image denotes the image
at the ith row and jth column of Fig. 3.2. For all three segmentation categories, RD-UNet
successfully segments TE regions even if they connect and/or overlap with the ICM/CM;
see the (1, 4)th, (2, 4)th, (3, 4)th, and (4, 4)th images. Here, the contours of the segmented TE
align closely with those of the ground truth annotations; compare the red contours and yellow regions
in the 4th column. However, the RD-UNet model shows limited TE segmentation performance
where indistinct features exist between the TE and the ICM/CM; see the fair segmentation
category in Fig. 3.2. For example, in the (5, 4)th image, mis-segmentation occurs where it
is challenging to differentiate the edges of the TE and CM; in the (6, 4)th image, mis-segmentation
occurs where it is difficult to differentiate the edges of the TE and ICM.
3.3 Inner Cell Segmentation Results
3.3.1 Quantitative Results
We compared the D2R2-UNet with other comparable UNet variant models, i.e., UNet [9], D-UNet [6],
Res-UNet [10], Rec-UNet [11], R2-UNet [11], and RD-UNet [8], to gauge its
segmentation potential. Fig. 3.3(a) shows the joint loss (Equation 2.4) during
training, demonstrating that the D2R2-UNet model outperforms the other UNet variant models
with a 7.79% loss. The proposed model better captures inner cell features and more effectively
isolates artifacts and fragmented cellular clusters.
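Equation 2.4 is defined in Chapter 2 and is not restated here. Purely as an illustration
of a joint segmentation loss, the sketch below combines binary cross-entropy with a soft
Dice term; the equal weighting and the smoothing constant are assumptions, not the
thesis's exact formulation.

```python
import torch
import torch.nn.functional as F

def joint_loss(logits: torch.Tensor, target: torch.Tensor,
               bce_weight: float = 0.5, eps: float = 1e-6) -> torch.Tensor:
    """BCE + soft-Dice loss for a batch of single-channel masks."""
    bce = F.binary_cross_entropy_with_logits(logits, target)
    prob = torch.sigmoid(logits)
    inter = (prob * target).sum(dim=(1, 2, 3))
    denom = prob.sum(dim=(1, 2, 3)) + target.sum(dim=(1, 2, 3))
    dice = (2 * inter + eps) / (denom + eps)  # soft Dice per image
    return bce_weight * bce + (1 - bce_weight) * (1 - dice.mean())

# Dummy batch: 2 images, 1 channel, 64x64.
logits = torch.randn(2, 1, 64, 64)
target = torch.randint(0, 2, (2, 1, 64, 64)).float()
print(joint_loss(logits, target).item())
```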
Similarly, Fig. 3.3(b) shows the joint loss (Equation 2.4) during testing; again,
our model surpasses the other models with a 7.45% loss. Comparing Figs. 3.3(a) and 3.3(b), the
testing loss is lower than the training loss, which may imply two things. First, the network
has good generalization capability, i.e., it avoids overfitting. Second, the network underfits
slightly, which might be caused by over-regularization (i.e., the dropout rate); however, this did
[Two line plots of loss (%) from 7 to 12 versus epochs from 10 to 100 for UNet, Res-UNet, Rec-UNet, D-UNet, R2-UNet, RD-UNet, and D2R2-UNet: (a) training loss, (b) testing loss.]
Figure 3.3 Comparisons of the joint loss (Equation 2.4) between different UNet variant models for inner cell segmentation: (a) training loss and (b) testing loss.
Table 3.3 Comparison among different UNet architectures based on their inner cell segmentation performance, evaluated on the same testing set.
CNNs Jaccard (%) Dice (%) Precision (%) Accuracy (%) Specificity (%)
UNet [9] 94.04 96.93 96.74 98.78 99.19
D-UNet [6] 95.40 97.64 97.29 99.07 99.33
Res-UNet [10] 95.26 97.57 97.36 99.04 99.35
Rec-UNet [11] 95.28 97.58 97.11 99.04 99.28
R2-UNet [11] 95.53 97.72 97.41 99.09 99.36
RD-UNet [8] 95.55 97.72 97.53 99.10 99.39
Proposed D2R2-UNet 95.65 97.78 97.66 99.12 99.42
not significantly affect the segmentation performance. Here, we finely tuned the dropout rate
to prevent overfitting, which causes the slight underfitting. Moreover, comparing the
training loss with the testing loss, we observe that the minimum loss values are nearly
equal in both cases. This highlights that our network generalizes well and avoids overfitting,
like the baseline UNet, while outperforming the baseline UNet by a large margin. The
increased testing performance of D2R2-UNet reflects that it better recognizes the
features relevant to the varying inner cell. It also better combines low-level and high-level
features and understands context at deeper levels of the network. Finally, we summarize
the overall segmentation performance of all models on a testing set of 342 images in
Table 3.3.
Although UNet and its variants perform well, the D2R2-UNet
provides the best overall performance, with a Jaccard Index of 95.65% and a Dice Coefficient of
97.78%. The intuition behind this enhanced performance is that the proposed network forms
a robust architecture owing to three major modifications: 1) R2 convolutional units in the
encoder and decoder, 2) dilated convolutional layers in the central bridge, and 3) residual
convolutional layers in the encoder-decoder skip-connections, whereas the other UNet models
do not include all three (an illustrative sketch of an R2 unit follows).
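As a rough sketch of modification 1), the following PyTorch module implements a recurrent
residual (R2) convolutional unit in the style of R2U-Net [11]: the same convolution is
applied recurrently with the block input fed back at every step, wrapped in a residual
connection. The recurrence depth, channel counts, and normalization choices are
illustrative assumptions, not the exact D2R2-UNet configuration.

```python
import torch
import torch.nn as nn

class RecurrentConv(nn.Module):
    """One recurrent convolutional layer: the block input is fed back into
    the same convolution (shared weights) at every recurrence step."""
    def __init__(self, ch: int, steps: int = 2):
        super().__init__()
        self.steps = steps
        self.conv = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1),
            nn.BatchNorm2d(ch), nn.ReLU(inplace=True))

    def forward(self, x):
        out = self.conv(x)
        for _ in range(self.steps):
            out = self.conv(x + out)  # recurrence with shared weights
        return out

class R2Unit(nn.Module):
    """Recurrent residual (R2) unit: two recurrent convolutional layers
    wrapped by an identity shortcut, after a 1x1 channel projection."""
    def __init__(self, in_ch: int, out_ch: int, steps: int = 2):
        super().__init__()
        self.project = nn.Conv2d(in_ch, out_ch, 1)
        self.body = nn.Sequential(RecurrentConv(out_ch, steps),
                                  RecurrentConv(out_ch, steps))

    def forward(self, x):
        x = self.project(x)
        return x + self.body(x)  # residual connection

# Shape check on a dummy feature map.
print(R2Unit(32, 64)(torch.randn(1, 32, 64, 64)).shape)  # (1, 64, 64, 64)
```

The recurrence deepens the effective receptive field without adding parameters, which is
one plausible reason such units help with the textured inner cell regions.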
3.3.2 Qualitative Results
The network’s predictions were compared with the corresponding ground truths to evaluate
the segmentation performance. We organized the segmentation results predicted by D2R2-UNet
into three performance categories: 1) best prediction, 2) better prediction, and 3) fair
prediction. This gives a clear picture of the overall segmentation performance throughout
the testing dataset. Here, the best prediction is defined by a Jaccard Index of more than 96%.
Similarly, the better prediction is based on a Jaccard Index from 92% to 96%. Finally, the
fair prediction includes Jaccard Indices between 86% and 92%. Of the 342 testing
images, 167 fall into the best category, 163 into the better category,
and the remaining 12 into the fair category. Among all predictions,
the highest individual Jaccard and Dice scores are 98.55% and 99.27%, respectively; the lowest
are 86.06% and 92.51%, respectively.
We discuss the qualitative inner cell segmentation performance of the proposed D2R2-UNet
model; Fig. 3.4 shows some representative results, where the (i, j)th image denotes
the image at the ith row and jth column of Fig. 3.4. For all three prediction categories,
D2R2-UNet successfully segments inner cell regions even beyond the culture well, amid white
bands and/or dark backgrounds; see the (2, 1)th, (2, 2)th, (2, 6)th, (5, 3)th, (5, 4)th, and
(5, 5)th images. The proposed D2R2-UNet model effectively identifies the outline of inner
cells even when the outlines connect with artifacts, e.g., the (2, 4)th and (5, 2)th images, or fragmented
cellular clusters, e.g., the (2, 3)th and (5, 3)th images. In general, the contours of the segmented inner
cells align well with those of the ground truth annotations; compare the red and blue contours
in the 3rd and 6th rows (a sketch of this overlay follows).
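A minimal OpenCV sketch of how such an overlay can be produced, assuming single-channel
8-bit masks and a BGR image; the toy masks and the output file name are placeholders, and
the colors follow the figure convention (red for ground truth, blue for prediction).

```python
import cv2
import numpy as np

def overlay_contours(image: np.ndarray, truth: np.ndarray,
                     pred: np.ndarray) -> np.ndarray:
    """Draw ground-truth (red) and predicted (blue) mask boundaries."""
    canvas = image.copy()
    for mask, color in ((truth, (0, 0, 255)), (pred, (255, 0, 0))):  # BGR
        contours, _ = cv2.findContours(mask.astype(np.uint8),
                                       cv2.RETR_EXTERNAL,
                                       cv2.CHAIN_APPROX_SIMPLE)
        cv2.drawContours(canvas, contours, -1, color, thickness=2)
    return canvas

# Toy example: a gray 128x128 image with two overlapping square masks.
img = np.full((128, 128, 3), 128, dtype=np.uint8)
gt = np.zeros((128, 128), dtype=np.uint8); gt[30:90, 30:90] = 1
pr = np.zeros((128, 128), dtype=np.uint8); pr[35:95, 35:95] = 1
cv2.imwrite("overlay.png", overlay_contours(img, gt, pr))
```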
Figure 3.4 Segmentation results. Light green in the 2nd and 5th rows indicates the inner cell segmented by D2R2-UNet. Red and blue in the 3rd and 6th rows indicate the boundaries of the ground truth and predicted inner cell, respectively. JI and DC stand for Jaccard Index and Dice Coefficient, respectively.
D2R2-UNet shows limited segmentation performance where indistinct features
exist between the inner cell and the ZP. For example, in the (3, 4)th image, mis-segmentation
occurs where it is challenging to differentiate the edges of the ZP and the inner cell (the edge
around the ZP is stronger than usual); in the (6, 5)th image, mis-segmentation occurs
where the inner cell has a texture similar to that of the ZP; and in the (6, 6)th image,
mis-segmentation occurs where the edges between the inner cell and the ZP are indistinct.
References
[1] A. A. Taha and A. Hanbury, “Metrics for evaluating 3D medical image segmentation:
analysis, selection, and tool,” BMC Medical Imaging, vol. 15, no. 1, p. 29, 2015.
[2] S. Kheradmand, P. Saeedi, and I. Bajic, “Human blastocyst segmentation using neural
network,” in IEEE Canadian Conference on Electrical and Computer Engineering
(CCECE). IEEE, 2016, pp. 1–4.
[3] R. M. Rad, P. Saeedi, J. Au, and J. Havelock, “Coarse-to-fine texture analysis for
inner cell mass identification in human blastocyst microscopic images,” in Seventh
International Conference on Image Processing Theory, Tools and Applications (IPTA).
IEEE, 2017, pp. 1–5.
[4] P. Saeedi, D. Yee, J. Au, and J. Havelock, “Automatic identification of human
blastocyst components via texture,” IEEE Transactions on Biomedical Engineering,
vol. 64, no. 12, pp. 2968–2978, 2017.
[5] S. Kheradmand, A. Singh, P. Saeedi, J. Au, and J. Havelock, “Inner cell mass
segmentation in human HMC embryo images using fully convolutional network,” in
IEEE International Conference on Image Processing (ICIP). IEEE, 2017, pp. 1752–
1756.
[6] R. M. Rad, P. Saeedi, J. Au, and J. Havelock, “Multi-resolutional ensemble of stacked
dilated U-Net for inner cell mass segmentation in human embryonic images,” in 25th
IEEE International Conference on Image Processing (ICIP). IEEE, 2018, pp. 3518–
3522.
[7] A. Singh, J. Au, P. Saeedi, and J. Havelock, “Automatic segmentation of trophectoderm
in microscopic images of human blastocysts,” IEEE Transactions on Biomedical
Engineering, vol. 62, no. 1, pp. 382–393, 2014.
[8] M. Y. Harun, T. Huang, and A. T. Ohta, “Inner cell mass and trophectoderm
segmentation in human blastocyst images using deep neural network,” in 13th IEEE
International Conference on Nano/Molecular Medicine and Engineering. IEEE, 2019,
pp. 214–219.
[9] O. Ronneberger, P. Fischer, and T. Brox, “U-Net: Convolutional networks for
biomedical image segmentation,” in International Conference on Medical Image
Computing and Computer-Assisted Intervention (MICCAI). Springer, 2015, pp. 234–
241.
[10] Z. Zhang, Q. Liu, and Y. Wang, “Road extraction by deep residual U-Net,” IEEE
Geoscience and Remote Sensing Letters, vol. 15, no. 5, pp. 749–753, 2018.
[11] M. Z. Alom, M. Hasan, C. Yakopcic, T. M. Taha, and V. K. Asari, “Recurrent
residual convolutional neural network based on U-Net (R2U-Net) for medical image
segmentation,” arXiv preprint arXiv:1802.06955, 2018.
Chapter 4
Conclusion and Future Work
4.1 Conclusion
This thesis has presented work on a biomedical project aimed at improving existing
IVF treatment for infertility. Automating embryo image segmentation with high accuracy
is important for sustaining healthy pregnancies in IVF, since it is a basic element of the
morphological and morphokinetic analyses used to evaluate embryo viability. The project
demonstrated deep learning-based embryo image segmentation methods to 1) segment ICM
and TE in ZP-intact embryonic images for morphological analysis and 2) segment the inner cell
in ZP-ablated embryonic images for morphokinetic study. Improving the semantic segmentation
CNN is particularly useful for ICM and TE segmentation, since segmenting
ICM and TE regions is challenging due to the similar textures of the embryo regions (ICM/TE/ZP/CM) and
artifacts, and due to image contrast variations. The CNN can resolve these issues by improving
feature extraction and better understanding the context. Furthermore, it is important to
enhance the segmentation CNN for inner cell segmentation, because it is difficult to segment
the inner cell with the conventional inner cell segmentation method in an embryoscope due to
the irregular expansion of inner cells, artifacts and cellular clusters near inner cell outlines,
and potential white bands and/or dark backgrounds around the expanded inner cell. The CNN
can overcome these challenges by better capturing inner cell features.
We implemented the RD-UNet model and developed the D2R2-UNet model to overcome
the aforementioned segmentation challenges. The RD-UNet model [1] incorporates
residual convolutional units in the encoder and decoder and adds a series of
dilated convolutional layers to the central bridge. The RD-UNet model improves ICM
segmentation, outperforming the existing models, i.e., CNN with discrete cosine transform
[2], coarse-to-fine texture analysis [3], texture analysis with clustering and the watershed algorithm
[4], VGG16 [5], and SD-UNet [6], with a 94.3% Dice Coefficient and an 89.3% Jaccard Index.
It also achieves the best TE segmentation performance, with a 92.5% Dice Coefficient and
an 85.3% Jaccard Index, compared to the existing models, i.e., the level-set algorithm with Retinex
theory [7], CNN with discrete cosine transform [2], and texture analysis with clustering and the
watershed algorithm [4]. We believe this model can be used to precisely segment
ICM and TE for morphological analysis of embryos toward improved pregnancy outcomes
in IVF.
For inner cell segmentation, we proposed a UNet-based CNN architecture that replaces the
UNet encoding-decoding units, central bridge, and encoder-decoder skip-paths with R2
convolutional encoding-decoding units, a dilated convolutional central bridge, and residual
convolutional encoder-decoder skip-paths, respectively. The proposed D2R2-UNet model
improves inner cell segmentation performance, with a Jaccard Index of 95.65% and a Dice
Coefficient of 97.78%, compared to the existing UNet variants, i.e., UNet [8], D-UNet [6], Res-
UNet [9], Rec-UNet [10], R2-UNet [10], and RD-UNet [1]. The model better understands
context and reduces the semantic disparity between the encoder and decoder. We believe
that the proposed model can accurately segment the inner cell for morphokinetic analysis of
embryos and facilitate sustained pregnancies in IVF.
4.2 Future Work
Our future work will use temporal information between frames to improve inner cell
segmentation performance. The current dataset has a small number of frames per video
(30 or 31 frames) with a long interval (20 minutes per frame), so there are large and irregular
spatial variations between consecutive frames, and modeling temporal changes is challenging.
In addition, the dataset has a total of 45 videos, so it offers limited diversity for training video
segmentation CNNs. To overcome these limitations, future work will obtain
more frames per video (a regular and shorter interval in the time-lapse imaging setup) and more
videos.
References
[1] M. Y. Harun, T. Huang, and A. T. Ohta, “Inner cell mass and trophectoderm
segmentation in human blastocyst images using deep neural network,” in 13th IEEE
International Conference on Nano/Molecular Medicine and Engineering. IEEE, 2019,
pp. 214–219.
[2] S. Kheradmand, P. Saeedi, and I. Bajic, “Human blastocyst segmentation using neural
network,” in IEEE Canadian Conference on Electrical and Computer Engineering
(CCECE). IEEE, 2016, pp. 1–4.
[3] R. M. Rad, P. Saeedi, J. Au, and J. Havelock, “Coarse-to-fine texture analysis for
inner cell mass identification in human blastocyst microscopic images,” in Seventh
International Conference on Image Processing Theory, Tools and Applications (IPTA).
IEEE, 2017, pp. 1–5.
[4] P. Saeedi, D. Yee, J. Au, and J. Havelock, “Automatic identification of human
blastocyst components via texture,” IEEE Transactions on Biomedical Engineering,
vol. 64, no. 12, pp. 2968–2978, 2017.
[5] S. Kheradmand, A. Singh, P. Saeedi, J. Au, and J. Havelock, “Inner cell mass
segmentation in human HMC embryo images using fully convolutional network,” in
IEEE International Conference on Image Processing (ICIP). IEEE, 2017, pp. 1752–
1756.
43
-
[6] R. M. Rad, P. Saeedi, J. Au, and J. Havelock, “Multi-resolutional ensemble of stacked
dilated U-Net for inner cell mass segmentation in human embryonic images,” in 25th
IEEE International Conference on Image Processing (ICIP). IEEE, 2018, pp. 3518–
3522.
[7] A. Singh, J. Au, P. Saeedi, and J. Havelock, “Automatic segmentation of trophectoderm
in microscopic images of human blastocysts,” IEEE Transactions on Biomedical
Engineering, vol. 62, no. 1, pp. 382–393, 2014.
[8] O. Ronneberger, P. Fischer, and T. Brox, “U-Net: Convolutional networks for
biomedical image segmentation,” in International Conference on Medical Image
Computing and Computer-Assisted Intervention (MICCAI). Springer, 2015, pp. 234–
241.
[9] Z. Zhang, Q. Liu, and Y. Wang, “Road extraction by deep residual U-Net,” IEEE
Geoscience and Remote Sensing Letters, vol. 15, no. 5, pp. 749–753, 2018.
[10] M. Z. Alom, M. Hasan, C. Yakopcic, T. M. Taha, and V. K. Asari, “Recurrent
residual convolutional neural network based on U-Net (R2U-Net) for medical image
segmentation,” arXiv preprint arXiv:1802.06955, 2018.