

WestminsterResearch
http://www.westminster.ac.uk/westminsterresearch

An End-to-End Deep Learning Histochemical Scoring System for Breast Cancer TMA

Liu, J., Xu, B., Zheng, C., Gong, Y., Garibaldi, J.M., Soria, D., Green, A., Ellis, I.O., Zou, W. and Qiu, G.

This is a copy of the author's accepted version of a paper subsequently published in IEEE Transactions on Medical Imaging, doi: 10.1109/TMI.2018.2868333.

It is available online at: https://dx.doi.org/10.1109/TMI.2018.2868333

© 2018 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.

The WestminsterResearch online digital archive at the University of Westminster aims to make the research output of the University available to a wider audience. Copyright and Moral Rights remain with the authors and/or copyright owners.

Whilst further distribution of specific materials from within this archive is forbidden, you may freely distribute the URL of WestminsterResearch: (http://westminsterresearch.wmin.ac.uk/).

In case of abuse or copyright appearing without permission e-mail [email protected]


An End-to-End Deep Learning Histochemical Scoring System for Breast Cancer TMA

Jingxin Liu, Bolei Xu, Chi Zheng, Yuanhao Gong, Jon Garibaldi, Daniele Soria, Andrew Green, Ian O. Ellis, Wenbin Zou, Guoping Qiu

Abstract—One of the methods for stratifying different molecular classes of breast cancer is the Nottingham Prognostic Index Plus (NPI+), which uses breast cancer relevant biomarkers to stain tumour tissues prepared on tissue microarray (TMA). To determine the molecular class of the tumour, pathologists have to manually mark the nuclei activity biomarkers through a microscope and use a semi-quantitative assessment method to assign a histochemical score (H-Score) to each TMA core. Manually marking positively stained nuclei is a time consuming, imprecise and subjective process which will lead to inter-observer and intra-observer discrepancies. In this paper, we present an end-to-end deep learning system which directly predicts the H-Score automatically. Our system imitates the pathologists' decision process and uses one fully convolutional network (FCN) to extract the regions of all nuclei (tumour and non-tumour), a second FCN to extract the tumour nuclei region, and a multi-column convolutional neural network which takes the outputs of the first two FCNs and the stain intensity description image as input and acts as the high-level decision making mechanism to directly output the H-Score of the input TMA image. To the best of our knowledge, this is the first end-to-end system that takes a TMA image as input and directly outputs a clinical score. We will present experimental results which demonstrate that the H-Scores predicted by our model have a very high and statistically significant correlation with experienced pathologists' scores and that the H-Score discrepancy between our algorithm and the pathologists is on par with the inter-subject discrepancy between the pathologists.

Index Terms—H-Score, Immunohistochemistry, Diaminobenzidine, Convolutional Neural Network, Breast Cancer


1 INTRODUCTION

Breast cancer (BC) is a heterogeneous group of tumours with varied genotype and phenotype features [1]. Recent research of Gene Expression Profiling (GEP) suggests that BC can be divided into distinct molecular tumour groups [2]. Personalised BC management often utilizes robust commonplace technology such as immunohistochemistry (IHC) for tumour molecular profiling [3].

Diaminobenzidine (DAB) based IHC techniques stain the target antigens (detected by biomarkers) with brown colouration (positive) against a blue colouration (negative) counter-stained by Hematoxylin (see Fig. 1 for some example images). To determine the biological class of the tumour, pathologists will mark the nuclei activity biomarkers through a microscope and give a score based on a semi-quantitative assessment method called the modified histochemical scoring (H-Score) [4]. The H-Scores of tissue samples stained with different biomarkers are combined together to determine the biological class of a case. Clinical decision making is to choose an appropriate treatment from a number of available treatment options according to the biological class of the tumour. For instance, one of

• J. Liu, B. Xu, Y. Gong, W. Zou, and G. Qiu are with the College of Information Engineering, Shenzhen University and Guangdong Key Laboratory of Intelligent Information Processing, China; G. Qiu is also with the School of Computer Science, The University of Nottingham, UK.
• C. Zheng is with Ningbo Yongxin Optics Co., LTD, Zhejiang, China.
• J. Garibaldi is with the School of Computer Science, The University of Nottingham, UK.
• D. Soria is with the Department of Computer Science, The University of Westminster, UK.
• A. Green and I. O. Ellis are with the Faculty of Medicine & Health Sciences, The University of Nottingham, United Kingdom.

the methods for stratifying different molecular classes is the Nottingham Prognostic Index Plus (NPI+) [1], which uses 10 breast cancer relevant biomarkers to stain tumour tissues prepared on tissue microarray (TMA). Tissue samples stained by each of these 10 biomarkers are given a histochemical score (H-Score) and these 10 scores together determine the biological class of the case.

Therefore, the H-Score is one of the most important pieces of information for molecular tumour classification. When the tumour region occupies more than 15% of the TMA section, an H-Score is calculated based on a linear combination of the percentage of strongly stained nuclei (SSN), the percentage of moderately stained nuclei (MSN) and the percentage of weakly stained nuclei (WSN) according to:

$$\text{H-Score} = 1 \times \text{WSN} + 2 \times \text{MSN} + 3 \times \text{SSN} \tag{1}$$

The final score has a numerical value ranging from 0 to 300. Thus, the histochemical assessment of the TMAs is based on the following semi-quantitative information: the total number of cells, the number of tumour cells and the stain intensity distributions within the tumour cells.
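For illustration, the following minimal Python sketch evaluates Eq. 1 from the three stained-nucleus percentages; the function name and example values are purely illustrative and not part of the original method.

```python
def h_score(wsn_pct, msn_pct, ssn_pct):
    """Modified histochemical score of Eq. 1.

    wsn_pct, msn_pct, ssn_pct: percentages (0-100) of weakly, moderately
    and strongly stained tumour nuclei; the result lies in [0, 300].
    """
    return 1 * wsn_pct + 2 * msn_pct + 3 * ssn_pct

# Example: 20% weak, 30% moderate and 10% strong staining gives an H-Score of 110.
print(h_score(20, 30, 10))
```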

In clinical practice, diagnosis requires averaging two experienced pathologists' assessments. Manually marking the positively stained nuclei is obviously a time consuming process. As visual assessment of the TMAs is subjective, there is the problem of inter-observer discrepancy and the issue of repeatability. The semi-quantitative nature of the method (strongly stained, moderately stained and weakly stained, where the definitions of strong, moderate and weak cannot be made precise and are subjective) makes it even more difficult to ensure inter-subject as well as intra-subject consistency.


Fig. 1: Top: Example images extracted from digital TMA slides. Each red circle contains one TMA core stained by Diaminobenzidine-Hematoxylin (DAB-H). The brown colours indicate positive and the blue colours indicate negative. Bottom: A schematic illustration of the traditional manual H-Scoring procedure: it needs to first count the total number of nuclei, then the number of strongly stained, moderately stained and weakly stained tumour nuclei, respectively. The final H-Score is then calculated according to Eq. 1.

With the increasing application of clinicopathologic prognosis, Computer Aided Diagnosis (CAD) systems have been proposed to support the pathologists' decision making. The key parameters in tissue image assessment include the number of tumour cells, the positive staining intensities within these cells and the total number of all cells in the image. To classify the positively stained pixels and their stain intensity, methods such as colour deconvolution that perform a mathematical transformation of the RGB image [5], [6] are widely used to separate positive stains from negative stains. Numerous computer-assisted approaches have been proposed for cell or nuclei detection and segmentation [7]. Most literature [8], [9] on histopathology image analysis performs various low-level quantification steps; there is still little attempt to perform end-to-end assessment of the image directly.

Traditional assessment methods have at least three unsolved issues for both the pathologists and the CAD systems. Firstly, the positive staining intensity needs to be categorized into four classes: unstained, weak, moderate, and strong. However, there is no standard quantitative criterion for classifying the DAB stain intensity. Thus, two pathologists often classify the same staining intensity into two different categories, or two different intensities into the same category. Furthermore, the human visual system may pay more attention to strongly stained regions, but they are often surrounded by a variety of staining intensities [9], which may also affect the assessment results. Secondly, cell/nuclei instance counting is a very important parameter in the assessment. Nevertheless, both humans and computers still cannot deal with the difficulty of counting overlapping cells very well. Moreover, variability in the appearance of different types of nucleus, heterogeneous staining, and the complex tissue architectures make individually segmenting cells/nuclei a very challenging problem. Thirdly, the apparent size differences between tumour nuclei and normal nuclei will affect the quantitative judgement of tumour nuclei assessment. Examples of these challenging cases are illustrated in Fig. 2.

Fig. 2: Examples of challenging cases of quantitative measurement of biomarkers based on visual assessment. (a) A variety of stain intensities; (b) unclear staining and overlapping nuclei; (c) size differences between different types of nucleus.

In this paper, we ask this question: is it possible to develop a CAD model that would directly give a high-level assessment of a digital pathological image, just like an experienced pathologist would, for example, to give out an H-Score directly? In an attempt to answer this question, we propose an end-to-end deep learning system for directly predicting the H-Scores of breast cancer TMA images, see Fig. 3. Instead of pushing the raw digital images into the neural network directly, we follow a process similar to that pathologists use for H-Score estimation. We first construct a stain intensity nuclei image (SINI) and a stain intensity tumour image (SITI), which contain all general nuclei regions (tumour and non-tumour) and the tumour region respectively, together with their corresponding stain intensity information. The SINI and SITI block irrelevant background pixels while only retaining useful information for calculating the H-Score. These two H-Score relevant images are then fed into a dual-channel convolutional neural network with two input pipelines, which are finally merged into one pipeline to give an output (H-Score). The innovative characteristic of our proposed method is that it mimics the H-Scoring process of the pathologists, but without counting the total number of nuclei, the number of tumour nuclei, or categorising the tumour nuclei based on the intensity of their positive stain. To the best of our knowledge, this is the first work that attempts to develop a deep learning based end-to-end TMA processing model that directly outputs the histochemical scores.


Fig. 3: The overview of our proposed H-Score prediction framework. An input TMA image is first processed by two FCNs to extract tumour cells and all cells (tumour and non-tumour) to produce two mask images. The input image is also processed by colour deconvolution and positive stain classification to output a stain intensity description image. The two mask images are used to filter out irrelevant information in the stain intensity description image and only the useful information is fed to a deep convolutional neural network for the prediction of the H-Score of the input TMA.

We will present experimental results which demonstrate that the H-Scores predicted by our model have a high and statistically significant correlation with experienced pathologists' scores and that the H-Scoring discrepancy between our algorithm and the pathologists is on par with that between the pathologists. Although it is still perhaps a long way from clinical use, this work nevertheless demonstrates the possibility of automatically scoring cancer TMAs based on deep learning.

2 RELATED WORKS

Researchers have proposed various computer-assisted analysis methods related to histochemical score assessment. For pixel-level positive stain separation, Pham [10] adapted the Yellow channel in the CMYK model, which is believed to have a strong correlation with the DAB stain. Ruifrok [5] presented the brown image calculated based on a mathematical transformation of the RGB image. His later work utilized a predefined stain matrix for stain colour decomposition [6]. CAD systems have also been developed for cell nucleus segmentation. Early works employed different filters to segment cells [11], [12], while morphology-based methods are applied in recently proposed conventional cell segmentation systems. Arslan et al. [13] constructed an attributed relational graph on four predefined types of cell nucleus boundaries for fluorescence images; Shu et al. [14] proposed utilizing morphological filtering and seeded watershed for overlapping nuclei segmentation in IHC stained histopathology images.

With the development of deep learning techniques, various deep neural network based CAD models have been published. Ronneberger et al. [15] proposed U-Net with a symmetric U-shaped architecture, which has been proved to have strong universality in biomedical segmentation applications [16], [17]. Deep convolutional networks with deeper architectures can be used to build more complex models which will result in more powerful solutions. Li [18] used an 88-layer residual network for simultaneous human epithelial type 2 (HEp-2) cell segmentation and classification.

AggNet, with a novel aggregation layer, was proposed for mitosis detection in breast cancer histology images [19]. Google Brain presented a multi-scale convolutional neural network (CNN) model to aid breast cancer metastasis detection in lymph nodes [20]. A deep learning-based system was proposed for the detection of metastatic cancer from whole slide images, which won the Camelyon Grand Challenge 2016 [21]. Shah et al. [22] presented the first completely data-driven model integrating numerous biologically salient classifiers for invasive breast cancer prognosis.

Most existing high-level CAD frameworks directly follow the assessment criteria by extracting quantitative information from the digital images. Masmoudi et al. [8] proposed an automatic Human Epidermal Growth Factor Receptor 2 (HER2) assessment method, which assembles colour pixel classification, nuclei segmentation and cell membrane modelling. A Gaussian-based bar filter was used for membrane isolation after colour decomposition in [23]. Trahearn et al. [9] established a two-stage registration process for IHC stained WSI scoring: thresholds were defined for DAB stain intensity groups, and tumour regions and nuclei were detected by two different detectors. Recently, Zhu [24] proposed to train an aggregation model based on a deep convolutional network for patient survival status prediction.

3 METHODOLOGY

In this paper, we propose to develop a CNN based CAD framework for biomarker assessment of TMA images. Instead of using a CNN as a feature extractor or for low level processing such as cell segmentation and counting, we have developed an end-to-end system which directly predicts the biomarker score (H-Score).

The overview of the H-Score prediction framework is illustrated in Fig. 3. It consists of three stages: 1) general nuclei and tumour region segmentation; 2) stain intensity description; 3) constructing the SINI and SITI, and predicting the final histochemical score (H-Score) with the Region Attention Multi-column Convolutional Neural Network (RAM-CNN).

The rationale of this architecture is as follows: as the number of nuclei, the number of tumour nuclei and the stain intensity of the tumour nuclei are the useful information for predicting the H-Score, we therefore extract two region masks that contain this information. Rather than setting artificial boundaries for the categories of stain intensity, we retain a continuous description of the stain intensity. Only the information useful for predicting the H-Score is presented to a deep CNN to estimate the H-Score of the input image. This is in contrast to many works in the literature where the whole image is fed to the CNN regardless of whether a region is useful for the purpose or not.

3.1 Stain Intensity Description

Although various DAB stain separation methods have been proposed [6], [25], few works have studied stain intensity description and grouping. Since there are no formal definitions for the boundaries between stain intensity groups (e.g., strong, moderate, weak), previous works used manually defined thresholds for pixel-wise classification to segment positive stains into each stain group [9].

In this work, we propose to directly use the luminance values of the image to describe the staining intensity instead of setting artificial intensity category boundaries. The original RGB image I is first transformed into a three-channel stain component image (I_DAB-H = [I_DAB, I_H, I_Other]) using colour deconvolution [6]:

$$I_{DAB-H} = M^{-1} I_{OD}, \tag{2}$$

where M is the stain matrix composed of staining colours, equal to

$$M = \begin{bmatrix} 0.268 & 0.570 & 0.776 \\ 0.650 & 0.704 & 0.286 \\ 0.0 & 0.0 & 0.0 \end{bmatrix} \tag{3}$$

for DAB-H stained images, and I_OD is the Optical Density converted image calculated according to the Lambert-Beer law:

$$I_{OD} = -\log\left(\frac{I}{I_0}\right), \tag{4}$$

where I_0 = [255, 255, 255] is the spectral radiation intensity for a typical 8-bit RGB camera [26]. Only the DAB channel image I_DAB from the three colour deconvolution output channels is used, which describes the DAB stain according to the chroma difference.
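A minimal NumPy sketch of Eqs. 2-4 is given below. The stain matrix values come from Eq. 3; since its third row is all zeros, a pseudo-inverse is used here in place of the exact inverse, which is an assumption about the implementation rather than the authors' code, as is the channel ordering of the output.

```python
import numpy as np

# Stain matrix M of Eq. 3 (DAB and Hematoxylin stain vectors plus an all-zero "Other" row).
M = np.array([[0.268, 0.570, 0.776],
              [0.650, 0.704, 0.286],
              [0.0,   0.0,   0.0]])

def colour_deconvolution(rgb):
    """Sketch of Eqs. 2-4: 8-bit RGB image (H, W, 3) -> stain component image."""
    i0 = 255.0                                          # spectral radiation intensity
    od = -np.log((rgb.astype(np.float64) + 1.0) / i0)   # Eq. 4; +1 avoids log(0)
    h, w, _ = od.shape
    pixels = od.reshape(-1, 3).T                        # 3 x N optical density vectors
    stains = np.linalg.pinv(M) @ pixels                 # Eq. 2: M^{-1} I_OD (pseudo-inverse)
    return stains.T.reshape(h, w, 3)                    # channel 0 taken as I_DAB (assumed)

# Example on a dummy mid-grey image; the DAB channel is read from the first output plane.
i_dab = colour_deconvolution(np.full((4, 4, 3), 200, dtype=np.uint8))[:, :, 0]
```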

Fig. 4: A comparison of different images generated during the process of stain intensity description: from left to right, the original image, the DAB channel image, and the stain intensity description image. The highlighted subimage contains strongly stained nuclei.

Most previous works set a single threshold on I_DAB to separate positively stained tissues. However, as shown in Fig. 4, deeply stained positive nuclei can have both dark and light pixel values in the DAB channel image, since the strongly stained pixels have a significantly broader hue spectrum. The same DAB channel value can correspond to different stain intensities or even different colours. In this paper, we use the Luminance Adaptive Multi-Thresholding (LAMT) method developed by the authors [27] to classify positively stained pixels. Specifically, the transformed pixels I_DAB(m,n) are divided into K equal intervals according to the luminance:

$$I^{k}_{DAB}(m,n) = \{I^{k}_{DAB}(m,n) \in I_{DAB} \mid \xi_k < I_l(m,n) \leq \zeta_k\} \tag{5}$$

where k = 1, 2, ..., K; ξ_k and ζ_k are the lower and upper bounds, respectively, of the k-th luminance interval. I_l is the luminance image of the original RGB image calculated according to Rec. 601 [28]:

$$I_l = 0.299 \times I_R + 0.587 \times I_G + 0.114 \times I_B. \tag{6}$$

The transformed pixels are thresholded with different values according to their luminance instead of a single threshold; the threshold t_k is assigned as follows:

$$t_k = \arg\max_{c \in C} P(c \mid I^{k}_{DAB}(m,n)) \tag{7}$$

where C = {c_DAB, c_H} is the stain label.
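The mechanism of Eqs. 5-7 can be sketched as follows: pixels are binned into K equal luminance intervals and each interval gets its own DAB threshold. In the paper the thresholds come from a MAP classifier [27]; here they are passed in as placeholder values, so the snippet illustrates the mechanism rather than the trained classifier.

```python
import numpy as np

def rec601_luminance(rgb):
    """Rec. 601 luminance of Eq. 6 for an 8-bit RGB image."""
    r, g, b = rgb[..., 0].astype(float), rgb[..., 1].astype(float), rgb[..., 2].astype(float)
    return 0.299 * r + 0.587 * g + 0.114 * b

def lamt_positive_mask(i_dab, i_lum, thresholds):
    """Luminance Adaptive Multi-Thresholding sketch (Eqs. 5-7).

    i_dab:      DAB channel image from colour deconvolution.
    i_lum:      luminance image in [0, 255] (Eq. 6).
    thresholds: length-K sequence of per-interval DAB thresholds t_k;
                placeholder values here, MAP-derived in the paper [27].
    """
    k = len(thresholds)
    edges = np.linspace(0.0, 255.0, k + 1)         # K equal luminance intervals (Eq. 5)
    positive = np.zeros(i_dab.shape, dtype=bool)
    for i, t_k in enumerate(thresholds):
        lo = -1.0 if i == 0 else edges[i]          # include luminance 0 in the first interval
        in_interval = (i_lum > lo) & (i_lum <= edges[i + 1])
        positive |= in_interval & (i_dab > t_k)    # per-interval decision (Eq. 7)
    return positive
```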

Once we have separated the positive stain from the negative stain, we need to find a way to describe the stain intensity. As we have already seen in Fig. 4, the pixel values of I_DAB cannot describe the biomarker stain intensity. We propose to use the scheme described in Eq. 8 to assign stain intensity values to pixels:

$$I_{la}(m,n) = \begin{cases} I_l(m,n) & \text{if } I_{DAB}(m,n) \text{ is positive} \\ 255 + 255 - I_l(m,n) & \text{if } I_{DAB}(m,n) \text{ is negative} \end{cases} \tag{8}$$

where I_la(m,n) is the stain intensity description image. The idea is that for the positive stain pixels, I_la(m,n) is the same as the luminance component of the original image in order to preserve the morphology of the positive nuclei; for the negative stain pixels, I_la(m,n) will have a higher value for strongly stained pixels (darker blue colour) and a lower value for weakly stained pixels (lighter blue colour). In order to separate the positive and negative pixel values clearly, we add an offset of 255 to the negatively stained pixels (most negative stain pixels have a high I_l(m,n) and positive stain pixels have a low I_l(m,n), so the values of positive and negative pixels are clearly separated in I_la(m,n)). Therefore, the larger I_la(m,n) is, the weaker the stain is; the smaller I_la(m,n) is, the stronger the stain is. When I_la(m,n) is below or equal to 255, it is a positive stain pixel. In this way, we have obtained an image which gives a continuous description of the stain intensity of the image. Instead of setting artificial boundaries to separate the different degrees of stain intensity, we now have a continuous description of the stain intensity (see Fig. 5). Note that the pixel values of the final image are normalized to the range from 0 to 1.
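A short sketch of Eq. 8 follows, using the Rec. 601 luminance and a positive-pixel mask (e.g. from LAMT); the final division by 510 is one possible way to normalize to [0, 1] and is an assumption, not a detail given in the paper.

```python
import numpy as np

def stain_intensity_image(i_lum, positive_mask):
    """Stain intensity description image I_la of Eq. 8.

    i_lum:         luminance image in [0, 255].
    positive_mask: boolean mask of DAB-positive pixels.
    Positive pixels keep their luminance; negative pixels are offset by 255
    so the two populations never overlap (values <= 255 are positive).
    """
    i_la = np.where(positive_mask, i_lum, 255.0 + 255.0 - i_lum)
    return i_la / 510.0   # assumed normalization; 510 is the largest possible value
```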


3.2 Nuclei and Tumour Maps

As discussed above, the important information pathologists use to come up with the H-Score is the number of nuclei and the number of tumour nuclei in the TMA image. We therefore need to extract these two pieces of information, and we use two separate FCNs, one for segmenting the foreground region which contains all nuclei and the other for segmenting the region of tumour nuclei only.

Recently proposed fully convolutional networks can take an input image of arbitrary size and produce a correspondingly-sized output using an encoder-decoder architecture [15], [29]. We add skip connections so that the high resolution features from the contracting path are combined with the output from the upsampling path, which allows the network to learn high resolution contextual information. Specifically, skip connections are added between the last convolutional layer of each block in the encoder and the first layer of the corresponding block in the decoder. The loss function is designed according to the Dice coefficient as:

$$L_{mask} = -\frac{2\sum_{m,n} \omega\tau}{\sum_{m,n} \omega + \sum_{m,n} \tau}, \tag{9}$$

where ω is the predicted pixel and τ is the ground truth.

To segment the tumour regions, we use our own manually pixel-wise labelled tumour TMA images to train the FCN, while a transfer learning strategy is utilized for general nuclei detection, with the training data obtained from three different datasets. Detailed dataset information will be presented in Section 4.1. Training on a mixed image set could help to reduce overfitting on a limited medical dataset and further boost the performance and robustness [30].
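A PyTorch sketch of the segmentation loss of Eq. 9 is shown below; the small epsilon term and the batch averaging are added assumptions for numerical stability and reduction, not details given in the paper.

```python
import torch

def dice_loss(pred, target, eps=1e-7):
    """Negative soft Dice loss following Eq. 9.

    pred:   predicted mask probabilities, shape (N, 1, H, W).
    target: binary ground truth masks of the same shape.
    """
    intersection = (pred * target).sum(dim=(1, 2, 3))             # sum over m, n of w * t
    denom = pred.sum(dim=(1, 2, 3)) + target.sum(dim=(1, 2, 3))   # sum of w plus sum of t
    return (-2.0 * intersection / (denom + eps)).mean()
```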

3.3 The H-Score Prediction Framework

The details of the first two stages have been described in Sections 3.1 and 3.2. As illustrated in Fig. 3, an input TMA image I(m,n) is processed by the tumour detection network, which outputs a binary image mask T(m,n) marking all the tumour nuclei, where T(m,n) = 1 if I(m,n) is part of a tumour nucleus and T(m,n) = 0 otherwise; by the general nuclei detection network, which outputs another binary image mask G(m,n) marking all tumour and non-tumour nuclei, where G(m,n) = 1 if I(m,n) is part of a nucleus and G(m,n) = 0 otherwise; and by the colour deconvolution and stain intensity labelling operation of Eq. 8 to produce the stain intensity description image I_la(m,n).

In the third stage, we first construct the SINI and SITI by multiplying the nuclei mask image G(m,n) and the tumour mask image T(m,n) with the stain intensity description image I_la(m,n), i.e. SINI = I_la(m,n) × G(m,n) and SITI = I_la(m,n) × T(m,n). Hence, all background pixels are zero, while only the regions of interest (ROI) are retained in the SINI and SITI. All necessary information is preserved for histochemical assessment. Removing the background and only retaining the ROI enables the RAM-CNN convolutional layers to focus on foreground objects [31], which significantly reduces computational cost and improves performance.
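The masking step itself is a simple element-wise product, as in the sketch below; function and variable names are illustrative only.

```python
import numpy as np

def build_sini_siti(i_la, nuclei_mask, tumour_mask):
    """Construct the SINI and SITI from the stain intensity image I_la.

    nuclei_mask is G(m, n) from the general nuclei FCN and tumour_mask is
    T(m, n) from the tumour FCN; background pixels become zero.
    """
    sini = i_la * nuclei_mask   # SINI = I_la(m, n) x G(m, n)
    siti = i_la * tumour_mask   # SITI = I_la(m, n) x T(m, n)
    return sini, siti
```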

Layer name           | Output size    | Patch size, stride
Input (SINI / SITI)  | 512 × 512 × 1  | -
Conv1 (× 2 branches) | 512 × 512 × 8  | 7 × 7 × 8
MaxPool (× 2 branches) | 128 × 128 × 8 | 4 × 4, stride 4
Conv2 (× 2 branches) | 128 × 128 × 16 | 5 × 5 × 16
MaxPool (× 2 branches) | 64 × 64 × 16 | 3 × 3, stride 2
Concat               | 64 × 64 × 32   | -
Conv3                | 64 × 64 × 64   | 3 × 3 × 64
MaxPool              | 32 × 32 × 64   | 3 × 3, stride 2
Conv4                | 32 × 32 × 64   | 3 × 3 × 64
MaxPool              | 16 × 16 × 64   | 3 × 3, stride 2
FC1                  | 2048-d         | 2048
Dropout              | -              | 0.3
FC2                  | 1024-d         | 1024
Dropout              | -              | 0.3
Output               | 1-d            | 1

TABLE 1: The architecture of the Region Attention Multi-column Convolutional Neural Network (RAM-CNN). The pipeline consists of convolutional layers (Conv), maxpooling layers (MaxPool), a concatenation (Concat), and fully connected layers (FC).

The proposed RAM-CNN is a deep regression model with dual input channels. The architecture of RAM-CNN is shown in Table 1. The two inputs correspond to the SINI and SITI respectively, and the input size is 512 × 512. The parameters of the two individual branches are updated independently for extracting cell and tumour features respectively, without interfering with each other. The two pipelines are concatenated into one after the second convolutional-pooling block. Since the NPI+ dataset is small and the network requires a large input image size, using a very deep network such as ResNet [32] would lead to overfitting and huge computational cost. Rectified linear unit (ReLU) activation functions are utilized for the convolutional layers, and a linear activation function is applied for the H-Score prediction. The loss function for H-Score prediction is defined as:

$$L_{score} = \frac{1}{N} \sum_{i=1}^{N} \left\| F_{RAM}(SINI_i, SITI_i) - l_i \right\|^2, \tag{10}$$

where F_RAM(SINI_i, SITI_i) is the estimated score generated by RAM-CNN and l_i is the ground truth H-Score.
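A PyTorch sketch of the RAM-CNN regressor following Table 1 is given below. Padding choices, activation placement and initialisation are assumptions made so that the feature map sizes match the table; Eq. 10 reduces to a mean squared error loss on the scalar predictions.

```python
import torch
import torch.nn as nn

class RAMCNN(nn.Module):
    """Sketch of the dual-column RAM-CNN regressor following Table 1.

    Two 512 x 512 x 1 inputs (SINI and SITI) pass through independent branches,
    are concatenated at 64 x 64 x 32, and feed a shared regression head.
    """
    def __init__(self):
        super().__init__()
        def branch():
            return nn.Sequential(
                nn.Conv2d(1, 8, kernel_size=7, padding=3), nn.ReLU(),
                nn.MaxPool2d(kernel_size=4, stride=4),             # 512 -> 128
                nn.Conv2d(8, 16, kernel_size=5, padding=2), nn.ReLU(),
                nn.MaxPool2d(kernel_size=3, stride=2, padding=1))  # 128 -> 64
        self.sini_branch = branch()
        self.siti_branch = branch()
        self.head = nn.Sequential(
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(kernel_size=3, stride=2, padding=1),      # 64 -> 32
            nn.Conv2d(64, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(kernel_size=3, stride=2, padding=1),      # 32 -> 16
            nn.Flatten(),
            nn.Linear(16 * 16 * 64, 2048), nn.ReLU(), nn.Dropout(0.3),
            nn.Linear(2048, 1024), nn.ReLU(), nn.Dropout(0.3),
            nn.Linear(1024, 1))                                    # linear H-Score output

    def forward(self, sini, siti):
        features = torch.cat([self.sini_branch(sini), self.siti_branch(siti)], dim=1)
        return self.head(features).squeeze(1)

# Eq. 10 corresponds to a mean squared error between predicted and ground-truth scores.
criterion = nn.MSELoss()
```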

4 EXPERIMENTS AND RESULTS

4.1 Datasets

The H-Score dataset used in our experiment contains 105 TMA images of breast adenocarcinomas from the NPI+ set [1].

Fig. 5: An illustration of the value of I_la and its corresponding stain intensity. The red dotted lines are the thresholds of the stain intensity groups [9].


Fig. 6: The top graph is the original dataset label histogram; the middle is the augmented label histogram; the bottom is the augmented label number (S) for each image according to its corresponding H-Score.

Each image contains one whole TMA core. The tissues are cropped from a sample of one patient and are stained with three different nuclei activity biomarkers: ER, p53, and PgR. The original images are captured at a high resolution of 40× optical magnification, and then resized to 1024 × 1024 × 3 pixels. The dataset is manually marked by two experienced pathologists with H-Scores based on common practice. For each TMA core, the pathologists give the percentage of nuclei at different stain intensity levels, and then calculate the H-Score using Eq. 1. The final label (H-Score) is determined by averaging the two pathologists' scores. The dataset is available from the authors on request.

To train the general nuclei detection network, we constructed a mixed dataset consisting of 3997 images by randomly cropping 224 × 224 image patches without overlapping from the immunofluorescence (IIF) stained HEp-2 cell dataset (1932 images) [33], the Warwick hematoxylin and eosin (H&E) stained colon cancer dataset (1840 images) [34], and our own DAB-H stained NPI+ images (115 images). Since these three image sets are stained with different types of biomarker, we transform the colour images into grayscale for training. As the HEp-2 cell images are IIF stained, their gray values are inverted. We labelled 33 DAB-H stained TMA core images (outside of the H-Score dataset), and we also randomly cropped image patches from them for training the tumour detection network.

4.2 Data and Label Augmentation

As in typical medical imaging applications, the dataset sizes are relatively small. In developing deep learning based solutions, it is common practice to augment the training dataset. The training images for the general nuclei detection network and the tumour detection network are augmented by randomly cropping sub-images as input samples. For the H-Score dataset, rotation by random angles and random horizontal and vertical shifts within 5% of the image height and width are performed to augment the training set.

As shown in the top row of Fig. 6, the distribution of the label (H-Score) in the original dataset is unbalanced: some labels (H-Scores) have far more samples than others.

Furthermore, one of the biggest problems is that, because we have only a limited number of samples, the H-Score values are discrete and discontinuous. There are many gaps between H-Scores that have no data. Also, the values of the TMA image scores given by the pathologists have a quantitative step-size of 5. Therefore, if an image has a score of 125, it means it has a value of around 125; values in the vicinity of 125, e.g., 124 or 126, should also be suitable for labelling that image. In order to solve this ambiguity issue, we introduce Distributed Label Augmentation (DLA), which was inspired by the work of [35], [36].

In the traditional regression method, a given dataset {(I_1, l_1), (I_2, l_2), ..., (I_D, l_D)} pairs the instance I_d for 1 ≤ d ≤ D with one single l_d from the finite class label space L = {l_0, l_1, ..., l_C}, where C is the label size (e.g., C = 301 for H-Score). In this paper, the label is augmented so that one instance is associated with a number of labels. Formally, the dataset can be described as {(I_1, Y_1), (I_2, Y_2), ..., (I_D, Y_D)}, where Y_d ⊆ Y is a set of labels {y_d^(1), y_d^(2), ..., y_d^(S)} and S is the augmented label number for I_d. Each y_d^(s) is sampled repeatedly from L based on a probability density function following a Gaussian distribution:

$$p(y_d^{(s)} = l_c) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(l_c - \mu)^2}{2\sigma^2}\right) \tag{11}$$

where μ is the mean, which is equal to l_d, and σ is the standard deviation. Thus, $\sum_{s=1}^{S} p(y_d^{(s)}) = 1$ for each original TMA image. Consequently, for an image x_i from the augmented training set, its ground truth labels are assigned by repeatedly sampling from L according to Eq. 11. The augmented label histogram is shown in the middle row of Fig. 6.
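The sampling scheme of Eq. 11 can be sketched as follows, with the Gaussian restricted to the finite label space {0, ..., 300}; the normalisation over the label space and the use of numpy.random.choice are implementation assumptions rather than details from the paper.

```python
import numpy as np

def augment_labels(h_score, s, sigma=0.9, label_space=np.arange(0, 301)):
    """Distributed Label Augmentation sketch (Eq. 11).

    Draws S labels for one image from a Gaussian centred on its original
    H-Score, restricted to the finite label space {0, ..., 300}.
    """
    probs = np.exp(-(label_space - h_score) ** 2 / (2.0 * sigma ** 2))
    probs /= probs.sum()                               # normalize over the label space
    return np.random.choice(label_space, size=s, p=probs)

# Example: 8 augmented labels for an image with ground-truth H-Score 125.
print(augment_labels(125, s=8))
```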

4.3 Implementation Details

We employ the U-Net [15] architecture for both the tumour nuclei detection and general nuclei detection models in a straightforward fashion. We add dropout layers after the first convolutional layer in each encoding and decoding block with a rate of 0.1. The filter size of the tumour detection net is half that of the general cell detection net. The final cell and tumour region maps are predicted using a sliding window.

A leave-5-out cross validation strategy is used for RAM-CNN model training, which means that in each round of testing, we randomly sample 5 TMAs as testing images and the other 100 TMAs as training images. As explained previously, the training set is augmented via rotation and shift. Specifically, the data is first balanced with random horizontal and vertical shifts within 10% of the image edge length, and the dataset is then augmented to 5314 images. The images are resized to 512 × 512 before being fed into the RAM-CNN. We set σ = 0.9 to generate the H-Score distribution for ground truth label augmentation, so as to ensure the augmented labels are within the range of plus or minus 2 around the ground truth. The regression network is optimized by Adam [37] with a constant learning rate of 0.001.

4.4 Results and Discussions

4.4.1 Intermediate results

Fig. 7 shows some examples of the intermediate images in the automatic H-Score prediction pipeline.


Fig. 7: Examples of intermediate images in the automatic H-Score prediction pipeline. From left to right: the original RGB image, the luminance labelled stain intensity image, the nuclei mask image, and the tumour mask image, respectively.

Fig. 8: Results of segmentation obtained with FCN-16s, SegNet, U-Net, and Res18-UNet. The top row shows the general nuclei segmentation results, while the bottom row shows the tumour segmentation results.

It is seen that the luminance labelled stain intensity image marks a sharp distinction between positive and negative stains. This shows that our maximum a posteriori (MAP) classifier based Luminance Adaptive Multi-Thresholding (LAMT) method [27] can reliably separate positive DAB stains for a variety of images. It also shows that our stain intensity labelling strategy can preserve the morphology of the nuclei and separate positive and negative stains while retaining a continuous description of the positive stain intensity.

We evaluated the segmentation performance in terms of pixel-wise distance metrics, e.g. the Dice coefficient (DC) and segmentation accuracy (SA) [38]:

$$DC(PM, GTM) = \frac{2\,|PM \cap GTM|}{|PM| + |GTM|} \tag{12}$$

$$SA = 1 - \frac{\text{Total no. of misclassified pixels}}{\text{Total no. of pixels in the image}} \tag{13}$$

where PM indicates a predicted mask and GTM the ground truth mask.
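Both metrics are straightforward to compute from binary masks, as in this minimal sketch (names are illustrative):

```python
import numpy as np

def dice_coefficient(pred_mask, gt_mask):
    """Dice coefficient of Eq. 12 between two binary masks."""
    intersection = np.logical_and(pred_mask, gt_mask).sum()
    return 2.0 * intersection / (pred_mask.sum() + gt_mask.sum())

def segmentation_accuracy(pred_mask, gt_mask):
    """Segmentation accuracy of Eq. 13: one minus the misclassified-pixel fraction."""
    return 1.0 - np.mean(pred_mask != gt_mask)
```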

Model          | General Nuclei DC | General Nuclei SA | Tumour DC | Tumour SA
ARGraph        | 0.4535            | 0.5063            | -         | -
ImageJ IHC IAT | 0.4732            | 0.5278            | -         | -
FCN-16s        | 0.5018            | 0.5788            | 0.6832    | 0.7142
SegNet         | 0.7961            | 0.8360            | 0.8460    | 0.8730
U-Net          | 0.8384            | 0.8816            | 0.8782    | 0.9219
Res18-UNet     | 0.8860            | 0.9716            | 0.8704    | 0.9799

TABLE 2: Quantitative comparison of different methods for general nuclei segmentation and tumour nuclei segmentation.

The U-Net is compared with the widely used segmentation networks FCN-16s [39], SegNet [40], and Res18-UNet (a U-Net with a ResNet18 residual encoder) [32], as well as two conventional unsupervised cell nuclei segmentation methods, ARGraph [13] and the NIH ImageJ [41] IHC Image Analysis Toolbox plugin [14], which are unable to distinguish tumour nuclei from normal nuclei. Deep learning based models are superior to conventional methods. The U-Net achieves the highest Dice coefficient on the tumour segmentation task, while Res18-UNet provides the best performance on general nuclei segmentation (see Table 2). Example segmentation results on the NPI+ dataset are shown in Fig. 8. It can be seen that all four models can correctly segment the general nuclei and tumour regions. U-Net and Res18-UNet produced relatively more accurate segmentation results compared to FCN-16s and SegNet. The impact of segmentation accuracy on H-Score prediction is evaluated in Section 4.4.3.

More segmentation results generated by U-Net are illustrated in Fig. 7. The nuclei mask images show that the deep convolutional network trained with mixed datasets using transfer learning can successfully detect the nuclei in our H-Score dataset. The tumour segmentation network is able to identify tumour regions among normal tissues. It is worth noting that the ground truth masks for the two detection networks are different. All nuclei in the Warwick colon cancer images are labelled with circular masks of a uniform size, while the tumour region masks are pixel-level labelled. Therefore, the final predicted maps generated by the two networks for the same nucleus are different. In addition, it is found that the mask dilation becomes evident with increasing DAB stain intensity. One possible reason is that the strong homogeneous stain makes the nuclei texture and edge features difficult to extract.

4.4.2 H-Score prediction results

We compared the proposed model, RAM-CNN, with two baseline single input pipeline CNNs, RGB-CNN and RA-CNN (i.e., region attention CNN), and two well-established models, AlexNet [42] and ResNet-18 [32]. The RGB-CNN takes the original RGB TMA image with the shape of 512 × 512 × 3 as input, and outputs the H-Score prediction. To investigate the effect of the multi-column architecture, we combine the SINI and SITI as a two-channel image of 512 × 512 × 2 for the input of RA-CNN. The architectures of RGB-CNN and RA-CNN are the same as a single pipeline of RAM-CNN (see Table 1). We also calculate the H-Score using Eq. 1 based on the nuclei area percentage (NAP) and the nuclei number percentage (NNP). Specifically, the luminance labelled stain intensity description image I_la is first calculated according to the description in Section 3.1. The pre-defined thresholds [9] are utilized for categorizing the pixels into different DAB-H stain intensity groups. For the NAP method, the predicted H-Score is calculated according to the percentages of area in the different stain intensity groups. NNP employs the NIH ImageJ tool [41] with a multi-stage cell segmentation technique [14] for cell detection. The detected cells are classified into unstained, weak, moderate, and strong groups using the pre-defined thresholds [9] (see Fig. 5) for H-Score calculation.

Model     | MAE   | SD    | CC    | P value
NAP       | 47.09 | 46.03 | 0.868 | < 0.001
NNP       | 46.48 | 55.18 | 0.824 | < 0.001
AlexNet   | 29.47 | 37.78 | 0.913 | < 0.001
ResNet-18 | 40.03 | 52.54 | 0.835 | < 0.001
RGB-CNN   | 32.01 | 44.46 | 0.873 | < 0.001
RA-CNN    | 27.22 | 35.72 | 0.922 | < 0.001
RAM-CNN   | 21.33 | 29.14 | 0.949 | < 0.001
Human     | 20.63 | 30.55 | 0.954 | < 0.001

TABLE 3: Performance comparison of different regression models. The last line (Human) is the difference between the H-Scores given by the two pathologists.

In this paper, the Mean Absolute Error (MAE), Standard Deviation (SD) and correlation coefficient (CC) between the predicted H-Score and the average H-Score of the two pathologists are used as the evaluation metrics. As a reference, we also calculate the MAE, SD and CC between the H-Scores given by the two pathologists for all original diagnosis data.

As can be seen from Table 3, the NAP based prediction gives the highest MAE with large deviations in the cross validation, followed by NNP. Our RAM-CNN framework achieves the lowest prediction error (21.33); a traditional CNN setting with the proposed SINI and SITI as input gives the second lowest prediction error (27.22). This verifies the effectiveness of our proposed approach of filtering out irrelevant pixels and only retaining H-Score relevant information in the SINI and SITI. All deep learning based methods outperform NAP and NNP by a large margin, except ResNet-18, which has an MAE of 40.03. One reason is that the very deep model easily leads to overfitting. AlexNet with large convolutional kernels achieves better results than RGB-CNN and ResNet-18. To investigate the statistical significance of the automatically predicted H-Scores, the correlation between the predicted scores and the pathologists' scores and its P value are also calculated.


Fig. 9: (a)-(g) are the scatter plots of the predicted scores of different models vs. the pathologists' manual scores: (a) NAP, (b) NNP, (c) AlexNet, (d) ResNet-18, (e) RGB-CNN, (f) RA-CNN, (g) RAM-CNN; (h) is the scatter plot of the scores assessed by the two different pathologists.

The correlation between the pathologists' scores and those predicted by RAM-CNN is 0.949 with a P value of < 0.001, which means the correlation is considered significant and there is strong evidence against the null hypothesis that there is no relationship between the pathologists' scores and the predicted scores [43].

It is interesting to observe that the difference between our RAM-CNN predicted H-Scores and the average of the two pathologists' H-Scores (MAE = 21.33, SD = 29.14, CC = 0.949) is on par with the difference between the two pathologists (MAE = 20.63, SD = 30.55, CC = 0.954). While the MAE between RAM-CNN and humans is slightly higher than that between humans, the SD between humans is higher than that between RAM-CNN and humans. The CC between humans and machine and that between humans are almost the same.

Fig. 9 illustrates the scatter plots between the model predicted scores and the pathologists' scores. Most of the predicted scores of NAP are lower than the ground truth. At the lower end, NNP predicted scores are lower than the ground truth, while at the higher end the predicted scores are higher than the ground truth. These two methods are affected by several low-level processing components, including the pre-defined stain intensity thresholds and the nuclei segmentation accuracy. Our proposed framework gives more accurate prediction results compared to the traditional baseline CNNs, further demonstrating that imitating the pathologists' H-Scoring process by only keeping useful information is an effective approach.

4.4.3 Segmentation accuracy and label augmentation evaluation

The objective of this section is to evaluate the impact of segmentation accuracy on H-Score prediction and the usefulness of label augmentation.

Model      | General Nuclei DC | Tumour DC | H-Score MAE | H-Score SD
FCN        | 0.5018            | 0.6832    | 28.41       | 35.79
SegNet     | 0.7961            | 0.8460    | 25.74       | 33.78
U-Net      | 0.8384            | 0.8782    | 21.33       | 29.14
Res18-UNet | 0.8860            | 0.8704    | 22.64       | 31.59

TABLE 4: Performance comparison of H-Score prediction with different general nuclei and tumour nuclei segmentation models.

Table 4 compares the H-Score prediction results of RAM-CNN with different general nuclei and tumour nuclei segmentation models. It can be observed that the performance of RAM-CNN improves with segmentation accuracy. However, Res18-UNet, with the best general nuclei segmentation performance, has a slightly higher MAE than that using U-Net. One possible reason is that the training set for general nuclei segmentation consists of three different types of histopathology images with different label masks, and the well-trained model may be less compatible with the NPI+ dataset.

To investigate the impact of label augmentation, we compare the MAE of the different H-Score prediction models with and without label augmentation. Fig. 10 clearly shows that label augmentation improves the prediction performance of our proposed models.


Fig. 10: Performance comparison (MAE) of RA-CNN and RAM-CNN with and without label augmentation.

4.4.4 Discussions

In this paper, we introduced an end-to-end system to predict the H-Score. To gain an insight into the scoring discrepancy between the proposed algorithm and the pathologists, we first compare the H-Score prediction results for different biomarkers, as shown in Table 5. The proposed framework gives the best accuracy for all three biomarker images. The performances are slightly different for different biomarkers. This is to be expected because different markers will stain the tissues differently. Although the difference is not large, whether it would be useful to train a separate network for different biomarkers is something worth investigating in the future.

Biomarker  | ER    | P53   | PgR
No. of TMA | 32    | 33    | 40
NAP        | 42.02 | 50.68 | 48.17
NNP        | 43.53 | 46.72 | 48.92
AlexNet    | 30.65 | 28.16 | 29.62
ResNet-18  | 34.91 | 39.35 | 44.67
RGB-CNN    | 24.90 | 31.57 | 38.19
RA-CNN     | 25.43 | 23.39 | 31.82
RAM-CNN    | 21.01 | 16.66 | 25.44

TABLE 5: Comparing MAE of different methods on three different biomarkers.

To see how the algorithms perform differently across the dataset, we divide the TMA images into 6 groups according to their pathologists' scores. For each group, we count the number of TMAs with absolute error (AE) smaller than 10, between 10 and 30, and larger than 30, respectively. The results of the different methods are shown in Fig. 11. It is seen that in the low H-Score group of 0-49, the traditional methods of NAP and NNP give more accurate predicted scores than the CNN based methods. It is found that most low score TMAs are unstained or weakly stained. The accurate predictions from NAP and NNP indicate that the predefined threshold for separating unstained and weak (see Fig. 5) is compatible with the pathologists' criteria. The deep learning based methods do not set stain intensity thresholds explicitly and their performances across the six groups are relatively even.

The accuracies of NAP and NNP decrease rapidly with the increase of the H-Score. The stain intensity and image complexity increase with the H-Score, which directly affects the performance of the traditional methods.

Fig. 11: Comparison of the performances of different methods (NAP, NNP, AlexNet, ResNet-18, RGB-CNN, RA-CNN and RAM-CNN) in different H-Score groups (0-49, 50-99, 100-149, 150-199, 200-249, 250-300), showing for each group the number of TMAs with AE < 10, 10 <= AE < 30, and AE >= 30.

The result also indicates that the pre-defined stain intensity thresholds for the moderate and strong classes (see Fig. 5) are less compatible with the pathologists' criteria. Furthermore, the large coefficients of moderate and strong stain (see Eq. 1) would magnify the errors of area and nuclei segmentation in NAP and NNP respectively. The deep learning based methods give worse results on the groups with fewer images (i.e., groups 50-99 and 250-300), which indicates the importance of a large training data size. It is also found that AlexNet produces more prediction results with AE larger than 30 than RGB-CNN, despite its overall MAE being lower.

We further analyse the TMAs individually to investigate the effect of image quality on the proposed algorithm. We found that for those TMAs where the tissues are clearly stained and the cellular structure is clear without severe overlap, our algorithm can give very accurate predictions (see Fig. 13). On the other hand, poor image quality causes errors. In the images that are most easily mis-scored by our algorithm, we found three significant characteristics, as shown in Fig. 12.

The TMA core in Fig. 12(a) contains large out-of-focus regions, which happens more commonly on strongly stained tissues. The blurred regions directly affect the performance of nuclei segmentation, as well as the nuclei and tumour detection accuracy. They also hinder the final regression network from extracting topological and morphological information.

Tissue folds (see Fig. 12(b)) occur when a thin tissue slice folds on itself, which can easily happen during slide preparation, especially in TMA slides.


Fig. 12: Examples of sources of big scoring discrepancies between the algorithm and the pathologists. (a) Out of focus; (b) tissue folds; (c) heterogeneity and overlapping.

Fig. 13: Examples of TMAs accurately scored by the proposed algorithm. The absolute errors generated by RAM-CNN for both TMAs are smaller than 2.

Tissue folds also cause out-of-focus regions during slide scanning. Furthermore, a tissue fold in a lightly stained image can be similar in appearance to a tumour region in a darkly stained image [44]. Hence, the segmentation accuracy of colour deconvolution would be greatly affected in tissue-fold regions.

Heterogeneity and overlapping, as shown in Fig. 12(c), also affect the automatic scoring performance. The stain heterogeneity gives rise to a large discrepancy of stain intensity in a single nucleus, and nuclei overlapping adds to the difficulty.

These three difficulties directly affect the predicted results of the proposed method, and we found that most of the badly mis-scored TMAs contain one or more of these characteristics. We found that there were 9 low image quality TMAs in our dataset; if we exclude these 9 lowest-quality TMA images, the average MAE of our RAM-CNN is 18.79. Therefore, future works need to overcome these issues in order to achieve a high prediction performance. To address out-of-focus regions, heterogeneity and overlapping, adding corresponding images to the training set to promote robustness is one potential quality assurance method. Image pre-processing using variational models with certain prior knowledge is also recommended [45], [46]. In addition, the deep learning based scoring system can be extended with a nuclei number estimation function for more accurate assessment. It is also necessary to add the function of automated detection and elimination of tissue-fold regions before H-Score assessment.

5 CONCLUDING REMARKS

In this paper, we have developed a deep learning framework for automatic end-to-end H-Score assessment of breast cancer TMAs. Experimental results show that automatic assessment of the TMA H-Score is feasible. The H-Scores predicted by our model have a high correlation with the H-Scores given by experienced pathologists. We show that the discrepancies between our deep learning model and the pathologists are on par with those between the pathologists. We have identified image out of focus, tissue folds and overlapping nuclei as the three major sources of error. We also found that the major discrepancies between pathologists and machine predictions occurred in images that have a high H-Score value. These findings have suggested future research directions for improving accuracy.

REFERENCES

[1] E. Rakha, D. Soria, A. R. Green, C. Lemetre, D. G. Powe, C. C. Nolan, J. M. Garibaldi, G. Ball, and I. O. Ellis, "Nottingham prognostic index plus (NPI+): a modern clinical decision making tool in breast cancer," British Journal of Cancer, vol. 110, no. 7, pp. 1688–1697, 2014.

[2] T. O. Nielsen, J. S. Parker, S. Leung, D. Voduc, M. Ebbert, T. Vickery, S. R. Davies, J. Snider, I. J. Stijleman, J. Reed et al., "A comparison of PAM50 intrinsic subtyping with immunohistochemistry and clinical prognostic factors in tamoxifen-treated estrogen receptor-positive breast cancer," Clinical Cancer Research, pp. 1078–0432, 2010.

[3] A. Green, D. Powe, E. Rakha, D. Soria, C. Lemetre, C. Nolan, F. Barros, R. Macmillan, J. Garibaldi, G. Ball et al., "Identification of key clinical phenotypes of breast cancer using a reduced panel of protein biomarkers," British Journal of Cancer, vol. 109, no. 7, p. 1886, 2013.

[4] H. Goulding, S. Pinder, P. Cannon, D. Pearson, R. Nicholson, D. Snead, J. Bell, C. Elston, J. Robertson, R. Blamey et al., "A new immunohistochemical antibody for the assessment of estrogen receptor status on routine formalin-fixed tissue samples," Human Pathology, vol. 26, no. 3, pp. 291–294, 1995.


[5] A. C. Ruifrok, "Quantification of immunohistochemical staining by color translation and automated thresholding," Analytical and Quantitative Cytology and Histology, vol. 19, no. 2, pp. 107–113, 1997.

[6] A. C. Ruifrok, D. A. Johnston et al., "Quantification of histochemical staining by color deconvolution," Analytical and Quantitative Cytology and Histology, vol. 23, no. 4, pp. 291–299, 2001.

[7] H. Irshad, A. Veillard, L. Roux, and D. Racoceanu, "Methods for nuclei detection, segmentation, and classification in digital histopathology: a review of current status and future potential," IEEE Reviews in Biomedical Engineering, vol. 7, pp. 97–114, 2014.

[8] H. Masmoudi, S. M. Hewitt, N. Petrick, K. J. Myers, and M. A. Gavrielides, "Automated quantitative assessment of HER-2/neu immunohistochemical expression in breast cancer," IEEE Transactions on Medical Imaging, vol. 28, no. 6, pp. 916–925, 2009.

[9] N. Trahearn, Y. W. Tsang, I. A. Cree, D. Snead, D. Epstein, and N. Rajpoot, "Simultaneous automatic scoring and co-registration of hormone receptors in tumor areas in whole slide images of breast cancer tissue slides," Cytometry Part A, 2016.

[10] N.-A. Pham, A. Morrison, J. Schwock, S. Aviel-Ronen, V. Iakovlev, M.-S. Tsao, J. Ho, and D. W. Hedley, "Quantitative image analysis of immunohistochemical stains using a CMYK color model," Diagnostic Pathology, vol. 2, no. 1, p. 8, 2007.

[11] D. Anoraganingrum, "Cell segmentation with median filter and mathematical morphology operation," in Image Analysis and Processing, 1999. Proceedings. International Conference on. IEEE, 1999, pp. 1043–1046.

[12] K. Jiang, Q.-M. Liao, S.-Y. Dai et al., "A novel white blood cell segmentation scheme using scale-space filtering and watershed clustering," in Proceedings of the 2003 International Conference on Machine Learning and Cybernetics. IEEE, 2003, pp. 2820–2825.

[13] S. Arslan, T. Ersahin, R. Cetin-Atalay, and C. Gunduz-Demir, "Attributed relational graphs for cell nucleus segmentation in fluorescence microscopy images," IEEE Transactions on Medical Imaging, vol. 32, no. 6, pp. 1121–1131, 2013.

[14] J. Shu, H. Fu, G. Qiu, P. Kaye, and M. Ilyas, "Segmenting overlapping cell nuclei in digital histopathology images," in Engineering in Medicine and Biology Society (EMBC), 2013 35th Annual International Conference of the IEEE. IEEE, 2013, pp. 5445–5448.

[15] O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutionalnetworks for biomedical image segmentation,” in InternationalConference on Medical Image Computing and Computer-Assisted In-tervention. Springer, 2015, pp. 234–241.

[16] Y. Han and J. C. Ye, “Framing u-net via deep convolutionalframelets: Application to sparse-view ct,” IEEE transactions onmedical imaging, vol. 37, no. 6, pp. 1418–1429, 2018.

[17] A. P. Harrison, Z. Xu, K. George, L. Lu, R. M. Summers, and D. J.Mollura, “Progressive and multi-path holistically nested neuralnetworks for pathological lung segmentation from ct images,” inInternational Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 2017, pp. 621–629.

[18] Y. Li, L. Shen, and S. Yu, “Hep-2 specimen image segmentationand classification using very deep fully convolutional network,”IEEE Transactions on Medical Imaging, 2017.

[19] S. Albarqouni, C. Baur, F. Achilles, V. Belagiannis, S. Demirci,and N. Navab, “Aggnet: deep learning from crowds for mitosisdetection in breast cancer histology images,” IEEE transactions onmedical imaging, vol. 35, no. 5, pp. 1313–1321, 2016.

[20] Y. Liu, K. Gadepalli, M. Norouzi, G. E. Dahl, T. Kohlberger,A. Boyko, S. Venugopalan, A. Timofeev, P. Q. Nelson, G. S. Cor-rado et al., “Detecting cancer metastases on gigapixel pathologyimages,” arXiv preprint arXiv:1703.02442, 2017.

[21] D. Wang, A. Khosla, R. Gargeya, H. Irshad, and A. H. Beck, “Deeplearning for identifying metastatic breast cancer,” arXiv preprintarXiv:1606.05718, 2016.

[22] M. Shah, C. Rubadue, D. Suster, and D. Wang, “Deep learningassessment of tumor proliferation in breast cancer histologicalimages,” arXiv preprint arXiv:1610.03467, 2016.

[23] B. H. Hall, M. Ianosi-Irimie, P. Javidian, W. Chen, S. Ganesan,and D. J. Foran, “Computer-assisted assessment of the humanepidermal growth factor receptor 2 immunohistochemical assayin imaged histologic sections using a membrane isolation algo-rithm and quantitative analysis of positive controls,” BMC MedicalImaging, vol. 8, no. 1, p. 11, 2008.

[24] Z. Xinliang, Y. Jiawen, Z. Feiyun, and H. Junzhou, “Wsisa: Making

survival prediction from whole slide pathology images,” in CVPR,2017.

[25] E. M. Brey, Z. Lalani, C. Johnston, M. Wong, L. V. McIntire,P. J. Duke, and C. W. Patrick Jr, “Automated selection of dab-labeled tissue for immunohistochemical quantification,” Journal ofHistochemistry & Cytochemistry, vol. 51, no. 5, pp. 575–584, 2003.

[26] P. Haub and T. Meckel, “A model based survey of colour deconvo-lution in diagnostic brightfield microscopy: Error estimation andspectral consideration,” Scientific reports, vol. 5, 2015.

[27] J. Liu, G. Qiu, and L. Shen, “Luminance adaptive biomarkerdetection in digital pathology images,” Procedia Computer Science,vol. 90, pp. 113–118, 2016.

[28] I. REC, “Bt. 601-5: Studio encoding parameters of digital televisionfor standard 4: 3 and wide-screen 16: 9 aspect ratios,” 1995.

[29] W. Xie, J. A. Noble, and A. Zisserman, “Microscopy cell countingand detection with fully convolutional regression networks,” Com-puter methods in biomechanics and biomedical engineering: Imaging &Visualization, vol. 6, no. 3, pp. 283–292, 2018.

[30] H. Chen, X. Qi, L. Yu, and P.-A. Heng, “Dcan: Deep contour-awarenetworks for accurate gland segmentation,” in Proceedings of theIEEE conference on Computer Vision and Pattern Recognition, 2016,pp. 2487–2496.

[31] X. Li, Z. Liu, P. Luo, C. C. Loy, and X. Tang, “Not all pixelsare equal: Difficulty-aware semantic segmentation via deep layercascade,” 2017.

[32] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning forimage recognition,” in Proceedings of the IEEE conference on computervision and pattern recognition, 2016, pp. 770–778.

[33] P. Hobson, B. C. Lovell, G. Percannella, A. Saggese, M. Vento, andA. Wiliem, “Hep-2 staining pattern recognition at cell and speci-men levels: datasets, algorithms and results,” Pattern RecognitionLetters, vol. 82, pp. 12–22, 2016.

[34] K. Sirinukunwattana, S. E. A. Raza, Y.-W. Tsang, D. R. Snead,I. A. Cree, and N. M. Rajpoot, “Locality sensitive deep learningfor detection and classification of nuclei in routine colon cancerhistology images,” IEEE transactions on medical imaging, vol. 35,no. 5, pp. 1196–1206, 2016.

[35] B.-B. Gao, C. Xing, C.-W. Xie, J. Wu, and X. Geng, “Deep labeldistribution learning with label ambiguity,” IEEE Transactions onImage Processing, 2017.

[36] Z.-H. Zhou, M.-L. Zhang, S.-J. Huang, and Y.-F. Li, “Multi-instancemulti-label learning,” Artificial Intelligence, vol. 176, no. 1, pp. 2291–2320, 2012.

[37] D. Kingma and J. Ba, “Adam: A method for stochastic optimiza-tion,” arXiv preprint arXiv:1412.6980, 2014.

[38] M. Saha and C. Chakraborty, “Her2net: A deep framework forsemantic segmentation and classification of cell membranes andnuclei in breast cancer evaluation,” IEEE Transactions on ImageProcessing, vol. 27, no. 5, pp. 2189–2200, 2018.

[39] J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional net-works for semantic segmentation,” in Proceedings of the IEEEconference on computer vision and pattern recognition, 2015, pp. 3431–3440.

[40] V. Badrinarayanan, A. Kendall, and R. Cipolla, “Segnet: A deepconvolutional encoder-decoder architecture for image segmenta-tion,” IEEE transactions on pattern analysis and machine intelligence,vol. 39, no. 12, pp. 2481–2495, 2017.

[41] C. A. Schneider, W. S. Rasband, and K. W. Eliceiri, “Nih image toimagej: 25 years of image analysis,” Nature methods, vol. 9, no. 7,pp. 671–675, 2012.

[42] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classifi-cation with deep convolutional neural networks,” in Advances inneural information processing systems, 2012, pp. 1097–1105.

[43] R. L. Wasserstein and N. A. Lazar, “The asa’s statement on p-values: context, process, and purpose,” 2016.

[44] S. Kothari, J. H. Phan, and M. D. Wang, “Eliminating tissue-foldartifacts in histopathological whole-slide images for improvedimage-based prediction of cancer grade,” Journal of pathology in-formatics, vol. 4, 2013.

[45] Y. Gong and I. F. Sbalzarini, “A natural-scene gradient distributionprior and its application in light-microscopy image processing,”IEEE Journal of Selected Topics in Signal Processing, vol. 10, no. 1, pp.99–114, 2016.

[46] ——, “Curvature filters efficiently reduce certain variational en-ergies,” IEEE Transactions on Image Processing, vol. 26, no. 4, pp.1786–1798, 2017.