Received April 17, 2019, accepted May 3, 2019, date of publication May 8, 2019, date of current version May 24, 2019.

Digital Object Identifier 10.1109/ACCESS.2019.2915513

GISCA: Gradient-Inductive Segmentation Network With Contextual Attention for Scene Text Detection

MENG CAO 1, YUEXIAN ZOU 1,2, DONGMING YANG 1, AND CHAO LIU 1
1 ADSPLAB, School of Electronic and Computer Engineering, Peking University, Shenzhen 518055, China
2 Peng Cheng Laboratory, Shenzhen 518055, China

Corresponding author: Yuexian Zou ([email protected])

This work was supported in part by the Shenzhen Science and Technology Fundamental Research Program under Grant JCYJ20160330095814461, and in part by the National Engineering Laboratory for Video Technology–Shenzhen Division and the Shenzhen Key Laboratory for Intelligent Multimedia and Virtual Reality under Grant ZDSYS201703031405467.

ABSTRACT Scene text detection (STD) is an irreplaceable step in a scene text reading system. It remains a more challenging task than general object detection since text objects are of arbitrary orientations and varying sizes. Generally, segmentation methods that use U-Net or hourglass-like networks are the mainstream approaches in multi-oriented text detection tasks. However, experience has shown that text-like objects in complex backgrounds have high response values on the output feature map of U-Net, which leads to a severe false positive detection rate and degrades STD performance. To tackle this issue, an adaptive soft attention mechanism called the contextual attention module (CAM) is devised and integrated into U-Net to highlight salient areas while retaining more detailed information. Besides, the gradient vanishing and exploding problems make U-Net more difficult to train because of the nonlinear deconvolution layers used in the up-sampling process. To facilitate the training process, a gradient-inductive module (GIM) is carefully designed to provide a linear bypass that makes the gradient back-propagation process more stable. Accordingly, an end-to-end trainable Gradient-Inductive Segmentation network with Contextual Attention (GISCA) is proposed. The experimental results on three public benchmarks demonstrate that the proposed GISCA achieves state-of-the-art results in terms of f-measure: 92.1%, 87.3%, and 81.4% for ICDAR 2013, ICDAR 2015, and MSRA TD500, respectively.

INDEX TERMS Scene text detection, multi-oriented text, segmentation network, contextual attention, gradient vanishing/exploding problems.

I. INTRODUCTION
Scene text is one of the most common objects in natural scenes; it frequently appears in many practical settings and contains important information for many applications [1]–[4], such as blind navigation, scene understanding, and autonomous driving. Scene text reading is thus of great importance in computer vision and has made tremendous progress benefiting from the development of Deep Convolutional Neural Networks (DCNNs) [5]. Generally, scene text reading can be divided into two main sub-tasks: scene text detection (STD) and scene text recognition. We focus on the task of STD in this paper since it is the prerequisite and crucial step of scene text recognition.

The associate editor coordinating the review of this manuscript and approving it for publication was Jafar A. Alzubi.

Despite the similarity to traditional OCR, STD is still challenging because text instances exhibit vast diversity in fonts, scales, and arbitrary orientations, as well as uncontrollable illumination effects [6]. Thanks to the recent development of general object detection [7], [8] and instance segmentation methods [9], STD has witnessed great progress. Deep learning based methods for STD directly learn hierarchical features from input images, which demonstrates more accurate and efficient performance on various STD public benchmarks. However, due to the characteristics of scene text, there still remain many problems to address in practice. Firstly, false positives (FP) are a common mistake in STD, which means detectors tend to mistake non-text areas for text areas. Some examples are given in Figure 1.



FIGURE 1. Examples of false positive results. The false positive detection results are illustrated in red while the correct ones are in green.

Thus, the FP problem may significantly affect the precision value in text detection. Secondly, although it has been proved that network depth is of crucial importance in various computer vision tasks, the network still cannot be too deep because of the vanishing or exploding gradients and the consequent degradation problem [10]. This has become a bottleneck to improving STD performance. Therefore, designing a suitable backbone network is crucial to boost STD performance.

To solve these two problems, we propose a Gradient-Inductive Segmentation network with Contextual Attention, named GISCA, for multi-oriented scene text detection. The backbone network of GISCA incorporates two innovative modules, namely the Contextual Attention Module (CAM) and the Gradient-inductive Module (GIM), to solve the two problems mentioned above respectively. Concretely, CAMs adaptively make the network focus on the salient areas and suppress background information, thus alleviating the FP problem. As for GIM, this well-designed module has a linear bypass which makes the gradient back-propagation process more stable. To the best of our knowledge, this is the first time such a linear/non-linear decoupling mechanism has been proposed for the STD task. Following the convention of text segmentation methods, this novel STD detector generates the detection results directly from the instance segmentation map. It is worth mentioning that the segmentation map is generated through two kinds of pixel-wise prediction: text/non-text prediction and link prediction. Similar to PixelLink [11], the link prediction between two pixels is labeled as positive if they are within the same text instance; otherwise it is labeled as negative. Therefore, the text score map and the eight link maps (every pixel has 8 neighbors) form the final segmentation map by using positive linkage to connect the positive pixels. The final bounding boxes are generated by applying the minAreaRect method in OpenCV to the text instances. With these delicate designs, GISCA shows competitive performance in STD. We will show by experiments in Section IV that our proposed detector outperforms other state-of-the-art methods.

The baseline model of this study, called PixelLink, was proposed in [11]. Our proposed method extends PixelLink with two delicately designed modules, CAM and GIM. The attention mechanism is widely adopted in various computer vision tasks. Compared to previous methods, CAM not only suppresses background clutter but also enhances the detail expression of multi-layer feature maps. As for the gradient back-propagation problems, this paper extensively studies related publications and proposes the Gradient-inductive Module, which provides an effective linear gradient back-propagation bypass during the up-sampling process.

The main contributions of this paper are summarized as follows: 1) This paper proposes an adaptive soft Contextual Attention Module (CAM) which serves as a probability heat map and efficiently highlights the salient areas. Compared to other attention mechanisms, CAM avoids additional input supervisory constraints and skillful loss function designs. Experiments show that CAM efficiently extracts meaningful information from the input feature maps and enhances the detail characteristics in salient areas, which makes the network more suitable for pixel-level and region-level tasks. 2) A novel up-sampling pattern named the Gradient-inductive Module (GIM) is proposed to provide both sufficient feature expression and convenient gradient back-propagation. GIM facilitates the training process and improves the detection results at the same time. 3) The proposed GISCA outperforms state-of-the-art algorithms on three widely adopted multi-oriented text datasets.

The rest of the paper is organized as follows. Section II reviews some related works of our study. Section III presents the pipeline and algorithms of our proposed GISCA. Section IV presents the extensive experimental results. Finally, Section V draws the conclusion.

II. RELATED WORK
As a special case of general object detection, scene text detection (STD) basically follows the same framework as object detection. In Section A, we first review the mainstream approaches in the development of scene text detection. Then issues concerning the application of attention mechanisms are analyzed in Section B. Finally, Section C describes the methods proposed recently for alleviating the gradient vanishing/exploding problems.

A. SCENE TEXT DETECTION
STD aims to localize text instances in images, commonly using word bounding boxes. Primarily, most recent methods may roughly be classified into regression-based and segmentation-based methods.

1) REGRESSION-BASED METHODS
More specifically, regression-based methods can be further divided into two categories according to the regression target: proposal-based and part-based methods.

Proposal-based methods follow the routine of one-stage or two-stage detectors such as SSD [8] and Faster R-CNN [7]. TextBoxes [13] modifies SSD [8] for scene text detection with long default boxes and a vertical offset strategy to deal with extreme aspect ratio texts. TextBoxes++ [14] extends TextBoxes by using quadrilateral bounding boxes to detect multi-oriented texts.


RRPN [15] transforms the RPN in Faster R-CNN into a Rotated Region Proposal Network, which facilitates the detection of multi-oriented texts. WordSup [16] notes that there are few character-level annotations in STD and introduces a weak supervision strategy to adjust character coordinates. EAST [17] eliminates the steps of candidate aggregation and word partition; instead, it directly predicts words or text lines of arbitrary orientations and quadrilateral shapes. Liao et al. propose to separate text/non-text classification and regression tasks using rotation-invariant and rotation-sensitive features [18]. Wang et al. come up with an Instance Transformation Network (ITN) [19] to learn geometry-aware representations tailored for scene text in an end-to-end network.

Part-based methods tend to regress text parts and predict the linking relationships between them. CTPN [20] proposes a vertical anchor mechanism to predict the location and text/non-text scores for each proposal. Then a recurrent neural network (RNN) is utilized to connect the sequential proposals. SegLink [21] decomposes text into two elements: segments, which are bounding boxes covering part of a word, and links, which connect two adjacent segments. MCN [22] converts an image into a Stochastic Flow Graph (SFG) and then conducts Markov Clustering on this graph to form the final bounding boxes.

2) SEGMENTATION-BASED METHODS
Segmentation-based methods regard the detection task as an instance segmentation problem. Multi-oriented text detection is especially suitable for the segmentation approach although it invokes complex post-processing. Zhang et al. apply an FCN to predict the salient map of text regions followed by character component combination [23]. HMCP [24] utilizes the modified HED to obtain the text region map, character map and linking orientation map; the text line results are then achieved through a problem-specific graph model. He et al. employ a novel text detection algorithm using two cascaded FCNs, one to extract text block regions and the other to remove false positives [25]. Wu et al. develop a lightweight FCN to effectively learn the borders in text images and introduce the border class to text detection [26].

Considering the superiority of segmentation-based methods in detecting multi-oriented scene texts, this paper uses segmentation as the baseline route.

B. ATTENTION MECHANISM
An important feature of human perception is that a person does not process the entire scene at once when parsing a picture. Instead, humans selectively focus on a certain part of the visual space to get the salient information which presents an overall understanding. This well-known attention mechanism is widely studied in various computer vision tasks. Generally, the attention mechanism falls under two main categories: hard attention [27] and soft attention [28], [29]. Hard attention, e.g. iterative region proposal and cropping, is often a fixed correction mechanism and tends to be non-differentiable, which makes model training more difficult.

In contrast, the trainable soft attention mechanism is a simpler and more efficient way to highlight important areas without introducing additional inputs. More importantly, trainable soft attention places only a small computational burden on the model.

Attention mechanisms are widely adopted in scene text detection (STD) to improve detection accuracy. Text-attention CNN [30] integrates a mask branch in the network and thus provides rich supervision information. This mask branch enhances the capability to discriminate ambiguous text instances. SSTD [31] takes the pixel-wise binary mask of text images as an additional input, which is used to refine the original feature map. Huang et al. incorporate the proposed Pyramid Attention Network (PAN) into the Mask R-CNN framework to enhance text detection performance [32]. R2CNN++ [33] adopts a multi-dimensional attention mechanism which contains both pixel-wise and channel-wise attention to reduce the adverse impact of noise.

Most of the methods mentioned above either use additional inputs or require a problem-specific loss function to constrain the optimization process. In this paper, the proposed CAM only uses the contextual information of the network and does not require explicit supervisory constraints. More importantly, CAM not only suppresses background clutter, but also enriches the detail features of salient regions.

C. GRADIENT VANISHING PROBLEMS
DCNNs tend to be substantially deep to achieve accurate results, which has been proved in [10]. However, many recent publications [10], [34]–[37] address the problem that the gradient update information may vanish or explode during the back-propagation process. Almost all papers that deal with gradient vanishing/exploding problems design a shortcut connection between layers close to the input and those close to the output. More specifically, the shortcut connections are implemented either by residual blocks or by concatenation operations.

For residual block based models, they utilize the residual learning framework proposed in [10] to ease the network training process and the consequent degradation problem [10]. ResNet [10] hypothesizes that the residual functions are easier to optimize than the original unknown mapping functions. Therefore, ResNet proposes identity shortcuts and reformulates the layers to learn the residual functions. Inspired by Long Short-Term Memory recurrent networks (LSTM), Highway Network [34] designs bypassing paths and gating units which provide a method to effectively train networks with hundreds of layers in an end-to-end manner. Stochastic Depth [35] randomly drops residual blocks during the training process to further mitigate the gradient vanishing problem.

The concatenation operation has also been extensively used as another way of alleviating gradient vanishing/exploding problems. DenseNet [36] connects each layer to every other layer in the proposed Dense Block.


FIGURE 2. Architecture of GISCA. First, an image is fed to the down-sampling VGG16 process, namely the D1∼D5 layers. D5 is generated by converting fc6/fc7 to convolution layers using 1 × 1 filters. The annotations in parentheses are the same as those defined in [12]. Note that #c stands for the number of channels. #c: 2/16 stands for feature maps with 2 or 16 channels, for text/non-text prediction or link prediction respectively. Then D5 is input to the deconvolution layer and U4 is the output. The red dotted box contains two separate modules, CAM and GIM. The attention signal is refined through CAM (the grey block depicted in the picture). After the bilinear interpolation, the previous up-sampling layer and the corresponding down-sampling layer are input into GIM to generate the output of the next layer. The final predictions (text/non-text prediction and link prediction) are conducted on U1, which maintains the same resolution as the input image. The eight feature maps in the blue dashed box stand for the link predictions in eight directions. The post-processing procedure is adopted by connecting positive pixels with positive links.

Every two dense blocks are connected by transition layers to change feature map sizes via convolution and pooling. The Hypercolumn network [37] defines the pixel-level hypercolumn as the vector of activation values of all the shallow layers in a CNN, which proves to perform well in fine-grained localization tasks. For the segmentation task, U-Net [38] designs a skip connection to concatenate the down-sampling and up-sampling feature maps.

However, when carefully evaluating these structures, we find that they cannot achieve direct gradient back-propagation throughout the whole training process. For example, residual block based methods have to utilize 1 × 1 convolution layers in the bypass branches to deal with the channel mismatch between input and output layers, and this nonlinear bypass is unfavorable for gradient back-propagation. For DenseNet, the dense concatenation is limited to the dense block because only feature maps in the same dense block preserve the same resolution, while the resizing of feature maps is implemented through convolution and pooling operations. Therefore, only the gradient within a Dense Block is preserved and the gradient back-propagation between Dense Blocks is still hindered. U-Net does connect the feature maps between the up-sampling and down-sampling processes, but the up-sampling process is still implemented through nonlinear deconvolution operations. To achieve direct gradient back-propagation, this paper delicately designs a novel Gradient-inductive Module which contains a linear bypass in the up-sampling process. With our design, GIM enables the gradient from very deep layers to be directly propagated to shallow layers.

III. METHODOLOGY
A. OVERVIEW
The proposed GISCA is an end-to-end trainable, fully convolutional network to detect multi-oriented texts. The overall architecture is presented in Figure 2.


Algorithm 1 Algorithm for the Proposed Method
Input: RGB image D0
Output: detection results
(1) Down-sampling process
1: for i ∈ [1, 5] do
2:   Di = pooling(ReLU(Conv(D_{i-1})))
3: end for
(2) Up-sampling process
4: if i == 4 then
5:   Ui = deconv(D5)
6: end if
7: if i ∈ [1, 3] then
8:   D^t_i = CAM(Di, U'_{i+1});  Ui = GIM(D^t_i, U'_{i+1})
9: end if
(3) Pixel-level prediction process
10: The pixel-level binary classification is conducted on U1.
11: The link prediction (8 values for each pixel) is applied on U1.
(4) Post-processing process
12: The positive pixels are grouped with the positive linkage.

The framework can be divided into four parts: a modified U-Net [38] with the Gradient-inductive Module (GIM) serving as the backbone network, the Contextual Attention Module (CAM) for feature refinement, the pixel-level prediction module that outputs the instance segmentation maps, and the post-processing module that generates the bounding boxes.

In our design, the well-known VGG16 network [12] is adopted as the feature extractor. By convention, the fully connected layers fc6 and fc7 are converted into convolutional layers using 1 × 1 filters. Inspired by [38], we design the U-shaped feature fusion network for scene text instance segmentation. The convolution layers conv2_2, conv3_3, conv4_3, conv5_3 and fc_7 are utilized for the feature fusion implementation. Different from U-Net, our modified U-Net is up-sampled through two modules, namely the Contextual Attention Module (CAM) and the Gradient-inductive Module (GIM). CAM, the delicately designed attention module, aims to suppress irrelevant background and highlight the salient text areas. The feature layers are then generated through GIM sequentially, which enables the gradient update information to be directly propagated to shallow layers. Following PixelLink [11], two sets of predictions are conducted on the final feature maps, one branch for text/non-text classification and the other for linkage prediction. Positive pixels are then grouped together using positive links, which generates a collection of connected components (CCs), each representing a detected text instance. A minimal sketch of this up-sampling loop is given below.
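To make the up-sampling flow described above (and summarized in Algorithm 1) concrete, the following is a minimal PyTorch-style sketch of the decoder loop. It is an illustration under stated assumptions, not the authors' released code: the module lists `cam_modules`/`gim_modules`, the channel counts, and the deconvolution configuration are placeholders.

```python
import torch.nn as nn
import torch.nn.functional as F

class GISCADecoder(nn.Module):
    """Sketch of the up-sampling path of Algorithm 1: D5 -> U4 -> U3 -> U2 -> U1."""

    def __init__(self, cam_modules, gim_modules, d5_channels, out_channels):
        super().__init__()
        self.cams = nn.ModuleList(cam_modules)  # CAM_i for i = 1, 2, 3
        self.gims = nn.ModuleList(gim_modules)  # GIM_i for i = 1, 2, 3
        # U4 is obtained from D5 with a deconvolution layer (assumed 2x up-sampling)
        self.deconv = nn.ConvTranspose2d(d5_channels, out_channels,
                                         kernel_size=4, stride=2, padding=1)

    def forward(self, feats):
        # feats = [D1, D2, D3, D4, D5] from the VGG16 down-sampling path
        d1, d2, d3, d4, d5 = feats
        u = self.deconv(d5)                       # U4 = deconv(D5)
        for i in (2, 1, 0):                       # i = 3, 2, 1 in the paper's indexing
            d = (d1, d2, d3)[i]
            # U'_{i+1}: bilinear interpolation of the previous up-sampled map
            u_up = F.interpolate(u, size=d.shape[-2:], mode='bilinear',
                                 align_corners=False)
            d_t = self.cams[i](d, u_up)           # D^t_i = CAM(D_i, U'_{i+1})
            u = self.gims[i](d_t, u_up)           # U_i  = GIM(D^t_i, U'_{i+1})
        return u                                   # U1, same resolution as the input image
```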

The pseudo code of the overall pipeline is given in Algorithm 1. The design of CAM is detailed in Section B. Section C presents the structure of GIM, including a mathematical justification of its effectiveness. The pixel-level predictions are illustrated in Section D. Finally, the overall optimization procedure, including the specific loss function and post-processing strategies, is given in Section E.

B. CONTEXTUAL ATTENTION MODULE (CAM)
The false positive (FP) problem remains a big challenge for scene text detection [39]. Because of complex background interference and the large aspect ratio of text shapes, current text detection algorithms tend to mistake similar objects such as fences or certain disc-shaped objects for text areas. The cause of misclassification is probably the lack of local detail information in the deep feature maps. Generally, the detail information is filtered out during the convolutional down-sampling process. One typical down-sampling process is depicted in Figure 3, from which we can see that low-level feature maps capture more details but cannot clearly indicate where the text areas are. On the contrary, the high-level feature maps explicitly highlight the text areas at the cost of detail information loss.

Although the up-sampling process of U-Net outputs a feature map with the same resolution as the input image, the detail information is still missing after the repeated convolution operations. As a result, these feature maps are not suitable for pixel-level or region-level tasks owing to the contradiction between resolution and semantic information. Therefore, it is important for the trainable model to focus on the salient areas on one hand and retain more detail information on the other hand.

Based on the soft attention mechanism, we propose the Contextual Attention Module (CAM). The architecture of CAM is illustrated in Figure 4, from which we can see that the output feature map of CAM retains both detail characteristics and high-level semantic information (highlighted areas indicating where texts are). The common skip connections in U-Net are replaced by CAMs during the up-sampling process. Compared to the traditional U-Net models, CAM adaptively emphasizes the salient areas and suppresses irrelevant background information without adding a large number of model parameters. Figure 5 illustrates the input feature maps and the output refinement results of CAM. We can clearly see that the glass curtain walls in the input feature maps have high response values, while the CAM output feature maps are more focused on the text area.

The attention map has shape W_in × H_in × C_in with each attention coefficient ranging from 0 to 1. The output of CAM is the element-wise multiplication of the original skip connection feature map (from low-level layers) and the attention coefficient map. As depicted in Figure 4, CAM takes the feature maps D_i and U_{i+1}, i = 1, 2, 3, as inputs. The adaptive attention map is generated through two 3 × 3 convolutional layers and one 1 × 1 convolutional layer. The generated attention map has two channels representing text/non-text scores respectively. Therefore, the designed generation process of CAM is as follows:

$$m = \mathrm{ReLU}\big(\mathrm{Conv}_{3\times3}(D_i) + \mathrm{Conv}_{3\times3}(U_{i+1})\big) \tag{1}$$


FIGURE 3. Illustration of the down-sampling process results. (a) The input image. (b)–(f) The feature maps D1∼D5. Lower-level feature maps capture more local details while higher-level feature maps contain more semantic information.

FIGURE 4. The architecture of CAM. D_i, i = 1, 2, 3, represents the feature map extracted from the down-sampling process. U'_{i+1} denotes the bilinear interpolation of U_{i+1}. The output result D^t_i contains both the details in D_i and the semantic information in U'_{i+1}.

FIGURE 5. Comparison of the (a) input and (b) output of CAM. The symbol in the upper right corner is the name of the feature map.

$$s = e^{\mathrm{Softmax}\left(\mathrm{ReLU}\left(\mathrm{Conv}_{1\times1}(m)\right)\right)} \tag{2}$$

where Conv_{3×3} denotes a convolution layer with kernel size 3 × 3 and Conv_{1×1} represents a 1 × 1 convolution. The Softmax function is used to normalize the attention coefficients. Through the exponential function, the salient map is enhanced because the gap between text and non-text areas becomes larger.

In both Eq. (1) and Eq. (2), the rectified linear unit (ReLU) is utilized as the activation function. The final attention result is given as follows:

$$D_i^t = s^* \odot D_i \tag{3}$$

where s^* is generated by broadcasting s to the same number of channels as D_i, namely C_in, and ⊙ represents the pixel-to-pixel multiplication of s^* and D_i.
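As a reading aid, Eqs. (1)–(3) can be sketched as a small PyTorch module as below. The channel configuration and, in particular, how the two-channel map s is broadcast to C_in channels are assumptions (the text leaves this step open); taking the text-score channel and expanding it is one plausible reading rather than the authors' exact implementation.

```python
import torch
import torch.nn as nn

class CAM(nn.Module):
    """Contextual Attention Module (Eqs. (1)-(3)): refines the skip feature D_i."""

    def __init__(self, channels):
        super().__init__()
        self.conv_d = nn.Conv2d(channels, channels, kernel_size=3, padding=1)  # Conv3x3 on D_i
        self.conv_u = nn.Conv2d(channels, channels, kernel_size=3, padding=1)  # Conv3x3 on U'_{i+1}
        self.conv_s = nn.Conv2d(channels, 2, kernel_size=1)                    # 1x1 conv -> 2 channels
        self.relu = nn.ReLU(inplace=True)

    def forward(self, d_i, u_up):
        m = self.relu(self.conv_d(d_i) + self.conv_u(u_up))             # Eq. (1)
        s = torch.exp(torch.softmax(self.relu(self.conv_s(m)), dim=1))  # Eq. (2): e^{Softmax(.)}
        # The paper broadcasts the two-channel map to C_in channels; expanding the
        # text-score channel here is an assumption made for this sketch.
        s_star = s[:, 1:2].expand_as(d_i)
        return s_star * d_i                                              # Eq. (3)
```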

C. GRADIENT-INDUCTIVE MODULE (GIM)
Generally, for the segmentation task, well-known networks with hourglass-like structures such as U-Net [38], FPN [40] and the stacked hourglass network [41] are widely used to obtain deep feature maps with high resolutions. Unfortunately, the training process of networks which consist of both down-sampling and up-sampling processes becomes more difficult as DCNNs become increasingly deeper. This phenomenon can be attributed to the gradient vanishing [42] or gradient exploding problems [43] when the gradients propagate from deep feature maps to shallow ones. For instance, the parameter update may get stuck in a local optimum or the gradient value may become infinite.

To address this problem, we carefully design a novel Gradient-inductive Module (GIM), which enables the gradient update information to be directly propagated to shallow layers.


FIGURE 6. The flowchart of GIM. D^t_i is the output result of CAM while U'_{i+1} denotes the bilinear interpolation result of U_{i+1}. The symbol ⓒ represents the concatenation operation and ⊕ represents element-wise addition.

Considering that the nonlinear structure is adept at formulating the mapping between input and output while the linear structure can easily propagate gradients, GIM is designed to decouple the linear and nonlinear expressions into two branches. The architecture is depicted in Figure 6. The concatenation layer followed by convolution and ReLU activation forms the nonlinear branch. Through this branch, the neural network can approximate the mapping function well during the training process. The element-wise addition layer is the linear branch of GIM. In a certain sense, this design embeds a ''constant'' path into the output. Therefore, the existence of the linear branch guarantees that the gradient update of each layer stays close to a constant value. Compared with the conventional tandem structure connecting linear and nonlinear layers [5], [44], our decoupled module shows better gradient stability during the training process.

Here, we denote the feature maps as D_i (i = 1, 2, 3, 4, 5) and U_j (j = 1, 2, 3, 4) for the down-sampling and up-sampling processes respectively. Then GIM can be expressed as follows:

$$
U_i =
\begin{cases}
D_i^t + r\big(\mathrm{concat}(D_i^t, U'_{i+1})\big), & i = 1, 2, 3 \\
\mathrm{deconv}(D_5), & i = 4
\end{cases}
\tag{4}
$$

where D^t_i denotes the feature map transferred from D_i, namely the output of CAM, concat(·) represents feature concatenation, and r represents the Conv + ReLU operations. For the sake of clarity, we expand the formula step by step as follows:

$$M_i = \mathrm{concat}(D_i^t, U'_{i+1}) \tag{5}$$

$$\tilde{M}_i = \mathrm{ReLU}\big(\mathrm{conv}(M_i)\big) \tag{6}$$

$$U_i = D_i^t + \tilde{M}_i \tag{7}$$

where conv in Eq. (6) denotes the 1 × 1 × C_in convolution operation.

To generate U_i (i = 1, 2, 3), GIM takes the CAM output D^t_i with shape W_in × H_in × C_in and the previous up-sampling result U'_{i+1}, which has the same shape, as inputs. It should be noted that U'_{i+1} is the bilinear interpolation result of U_{i+1}, which avoids introducing nonlinear deconvolution layers.

The concatenation result of D^t_i and U'_{i+1} is defined as M_i in Eq. (5). The convolution operation is utilized to extract information from M_i and adjust the number of channels, as described in Eq. (6). With our design, the nonlinear branch of GIM outputs a feature map with the same shape as the input. As shown in Eq. (7), the linear branch of GIM takes the element-wise addition of D^t_i and M̃_i as the final output. In this manner, the feature maps U3, U2 and U1 are generated sequentially.
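A minimal PyTorch-style sketch of Eqs. (5)–(7) is given below; the 1 × 1 convolution channel handling follows the description above, while everything else (names, initialization) is an illustrative assumption.

```python
import torch
import torch.nn as nn

class GIM(nn.Module):
    """Gradient-inductive Module (Eqs. (5)-(7)): nonlinear concat branch + linear bypass."""

    def __init__(self, channels):
        super().__init__()
        # 1x1 convolution restores the channel count after concatenation (Eq. (6))
        self.conv = nn.Conv2d(2 * channels, channels, kernel_size=1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, d_t, u_up):
        m = torch.cat([d_t, u_up], dim=1)   # Eq. (5): M_i = concat(D_i^t, U'_{i+1})
        m = self.relu(self.conv(m))         # Eq. (6): nonlinear branch
        return d_t + m                      # Eq. (7): the linear bypass carries gradients directly
```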

The principle of GIM may be explained from the perspective of mathematical derivation. Ignoring the influence of CAM, we only integrate GIM into U-Net. According to the architecture of the proposed backbone network shown in Figure 2, we have the following expression:

$$U_i = D_i + \mathcal{H}_{\phi_i}(D_i, U_{i+1}), \quad i = 1, 2, 3 \tag{8}$$

where H_{φ_i} is the asymptotically approximating function in the up-sampling process. Applying Eq. (8) sequentially, we get the following equation:

$$U_1 = D_1 + D_2 + D_3 + \sum_{i=1}^{3} \mathcal{H}_{\phi}(D_i, U_{i+1}) \tag{9}$$

For convenience, H_φ is adopted here to represent the simplification of the nested iterations of H_{φ_i}, i = 1, 2, 3. Using the chain rule, we can derive:

$$\frac{\partial L}{\partial D_1} = \frac{\partial L}{\partial U_1} \cdot \frac{\partial U_1}{\partial D_1} = \frac{\partial L}{\partial U_1}\left(1 + \frac{\partial}{\partial U_1}\sum_{i=1}^{3}\mathcal{F}(D_i, U_{i+1})\right) \tag{10}$$

where L is the loss function. The additive term ∂L/∂U_1 ensures that the gradient of the deep feature map U_1 can be directly propagated to the shallow one D_1. Therefore, the Gradient-inductive Module (GIM) provides a bypass which ''induces'' the deep gradient to the shallow feature map. It turns out to be effective in alleviating the gradient vanishing problem, and that is why GIM is named the Gradient-inductive Module. Similarly, we find that the gradients of U_2 and U_3 can be propagated to D_2 and D_3 respectively:

$$\frac{\partial L}{\partial D_2} = \frac{\partial L}{\partial U_2}\left(1 + \frac{\partial}{\partial U_2}\sum_{i=2}^{3}\mathcal{G}(D_i, U_{i+1})\right) \tag{11}$$

$$\frac{\partial L}{\partial D_3} = \frac{\partial L}{\partial U_3}\left(1 + \frac{\partial}{\partial U_3}\sum_{i=2}^{3}\mathcal{M}(D_i, U_{i+1})\right) \tag{12}$$

where G and M are defined as nonlinear mapping functions. It is noted that the existence of ∂L/∂U_2 and ∂L/∂U_3 makes the gradient updates in layers D_2 and D_3 more stable.

D. PIXEL-LEVEL PREDICTION
The segmentation method is widely used in STD by casting the detection task as a text/non-text classification task, especially when dealing with multi-oriented or arbitrary-oriented text detection.


This paper follows the methodology of PixelLink [11], which predicts the linkage between each pixel and its eight adjacent pixels. For the text/non-text prediction, we utilize the softmax function to generate the final feature map with 1 × 2 = 2 channels, representing text/non-text scores respectively. For the pixel-to-pixel linkage prediction, we output a feature map with 8 × 2 = 16 channels representing positive/negative scores for each of the eight adjacent pixels. Then two separate thresholds are applied to generate the positive pixels and the positive linkage. The output segmentation map is obtained by linking the positive pixels using positive linkages. Finally, a Connected Components (CC) algorithm is employed to find the individual text areas and minAreaRect in OpenCV [45] is applied to obtain the bounding box for every CC.
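The post-processing just described can be sketched as follows, assuming `text_prob` is the per-pixel text score and `link_prob` holds the eight directional link scores; the neighbor ordering, thresholds and the simple union-find grouping are illustrative assumptions rather than PixelLink's exact procedure.

```python
import cv2
import numpy as np

def decode_segmentation(text_prob, link_prob, pixel_thr=0.7, link_thr=0.5):
    """text_prob: (H, W) text score; link_prob: (8, H, W) link scores to the 8 neighbours."""
    h, w = text_prob.shape
    pos = text_prob >= pixel_thr
    parent = np.arange(h * w)

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    # 8 neighbour offsets, assumed to match the order of the 8 link maps
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, -1), (0, 1), (1, -1), (1, 0), (1, 1)]
    ys, xs = np.nonzero(pos)
    for y, x in zip(ys, xs):
        for k, (dy, dx) in enumerate(offsets):
            ny, nx = y + dy, x + dx
            if 0 <= ny < h and 0 <= nx < w and pos[ny, nx] and link_prob[k, y, x] >= link_thr:
                union(y * w + x, ny * w + nx)

    # collect connected components and fit minimum-area rectangles
    comps = {}
    for y, x in zip(ys, xs):
        comps.setdefault(find(y * w + x), []).append((x, y))
    boxes = []
    for pts in comps.values():
        rect = cv2.minAreaRect(np.array(pts, dtype=np.float32))
        boxes.append(cv2.boxPoints(rect))   # 4 corner points of the oriented box
    return boxes
```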

E. OPTIMIZATION
The overall loss function is given as follows:

$$L = \lambda L_{pixel} + L_{link}$$

where L_pixel and L_link denote the pixel loss and the link loss respectively. λ is set to 2.0 because the pixel classification task is more important than the link prediction task.

Generally, we follow the same methodology as PixelLink [11]. For the pixel loss, an instance-balanced cross-entropy loss is adopted to deal with the large variation of text sizes, which takes the text instance areas as the modulation factor. The link loss is a class-balanced cross-entropy loss which is calculated on positive pixels only. Specific details can be found in PixelLink [11].
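For illustration only, a simplified version of the overall loss L = λ·L_pixel + L_link might look like the sketch below; the instance-balanced weights are assumed to be provided as a tensor (`pixel_weight`), and the positive/negative class balancing of the link loss used in PixelLink is omitted for brevity.

```python
import torch.nn.functional as F

def gisca_loss(pixel_logits, link_logits, pixel_gt, link_gt, pixel_weight, lam=2.0):
    """Sketch of L = lam * L_pixel + L_link.
    pixel_logits: (N, 2, H, W); link_logits: (N, 16, H, W) = 8 directions x 2 classes;
    pixel_gt, link_gt: integer labels; pixel_weight: (N, H, W) instance-balanced weights."""
    # pixel loss: weighted cross-entropy over text/non-text
    pixel_ce = F.cross_entropy(pixel_logits, pixel_gt, reduction='none')      # (N, H, W)
    l_pixel = (pixel_ce * pixel_weight).sum() / pixel_weight.sum().clamp(min=1.0)

    # link loss: cross-entropy per direction, averaged over positive pixels only
    n, _, h, w = link_logits.shape
    link_logits = link_logits.view(n, 8, 2, h, w).permute(0, 2, 1, 3, 4)      # (N, 2, 8, H, W)
    link_ce = F.cross_entropy(link_logits, link_gt, reduction='none')         # (N, 8, H, W)
    pos = (pixel_gt == 1).unsqueeze(1).float()                                # (N, 1, H, W)
    l_link = (link_ce * pos).sum() / pos.sum().clamp(min=1.0) / 8.0

    return lam * l_pixel + l_link
```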

IV. EXPERIMENTS
To evaluate the proposed method, we conduct experiments on three commonly used public datasets: ICDAR 2013 (IC13) [46], ICDAR 2015 (IC15) [47], and MSRA-TD500 (TD500) [48]. The experiment settings and the comparison results are presented as follows. A concise description of these three datasets and the accepted evaluation protocols are given in Section A. The implementation details are illustrated in Section B. Experimental results for ICDAR 2013, ICDAR 2015 and MSRA TD500 are given in Section C, Section D and Section E respectively. To reveal the effectiveness of the two proposed modules, sufficient ablation studies are given in Section F.

A. DATASETS AND EVALUATION PROTOCOL
1) SynthText in the Wild (SynthText) [49]: The SynthText dataset contains 800k synthesized text images which are rendered on natural images. With a proper learning algorithm to choose and transform the text location, the images tend to have a realistic look. Annotations are given at the character, word and text line level. This dataset is used to pre-train the model.

2) ICDAR 2015 (IC15) [47]: IC15 is the most commonly used benchmark for detecting multi-oriented text with word-level quadrilaterals. It is composed of 1000 training images and 500 testing images. This dataset is challenging because of the arbitrary orientations, motion blur and low resolutions.

3) ICDAR 2013 (IC13) [46]: IC13 consists of 229 training images and 233 testing images with word-level annotations provided. It is the standard benchmark for evaluating near-horizontal text detection.

4) MSRA TD500 (TD500) [48]: TD500 contains 500 images (300 for training, 200 for testing) containing both Chinese and English texts. Different from IC15 and IC13, annotations of TD500 are at the line level and are represented by aligned horizontal rectangles.

5) Evaluation protocol: In our experiments, we utilize the standard evaluation protocols for text detection, namely precision (P), recall (R) and f-measure (F). Precisely, they are defined as follows:

$$P = \frac{TP}{TP + FP} \tag{13}$$

$$R = \frac{TP}{TP + FN} \tag{14}$$

$$F = \frac{2 \times P \times R}{P + R} \tag{15}$$

where TP, FP, and FN denote the correctly classified text instances, the incorrectly spotted instances and the missing instances respectively. Given a detected text instance b, it is counted as a positive detection if the IoU between b and a ground truth is larger than a given threshold (normally set to 0.5). Since there is a trade-off between recall and precision, the f-measure is a more objective measurement for performance assessment.
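Eqs. (13)–(15) reduce to a few lines of code; the example counts below are hypothetical numbers chosen only to show that they reproduce roughly the IC15 figures reported later, not the actual evaluation counts.

```python
def prf(tp, fp, fn):
    """Precision, recall and f-measure from Eqs. (13)-(15)."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

# Hypothetical counts: 850 correct detections, 104 false alarms, 144 missed instances
# -> precision ~ 0.891, recall ~ 0.855, f-measure ~ 0.873
print(prf(850, 104, 144))
```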

B. IMPLEMENTATION DETAILS
The experiments are conducted on a single NVIDIA Titan X GPU with 12 GB memory. The whole model is optimized by SGD with momentum = 0.9 and weight decay = 5 × 10⁻⁴. The model is pre-trained on SynthText for 500K iterations, which takes about 0.9 s per iteration. The learning rate is set to 10⁻³ in the pre-training process. Then we fine-tune the model on IC15, IC13 and TD500 respectively. We set the learning rate to 1 × 10⁻³ for the first 100K iterations and 1 × 10⁻⁴ for the remaining 80K iterations. The input images are resized to 1280 × 768, 768 × 768 and 512 × 512 for IC15, IC13 and TD500 respectively. We adopt the data augmentation strategies of SSD and PixelLink. Input images are first rotated with a probability of 0.2 by a random angle of 0, π/2, π or 3π/2. Then we randomly crop them with areas ranging from 0.1 to 1 and aspect ratios ranging from 0.5 to 2. Text instances in the processed images whose shorter side is less than 10 pixels are ignored.
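A minimal sketch of the optimizer and learning-rate schedule described above, assuming a PyTorch implementation; `model` is a placeholder and the training loop body is omitted.

```python
import torch
import torch.nn as nn

model = nn.Conv2d(3, 2, kernel_size=1)  # placeholder for the GISCA network

# SGD with momentum 0.9 and weight decay 5e-4, as described above
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3,
                            momentum=0.9, weight_decay=5e-4)

def learning_rate(iteration, pretraining=False):
    """1e-3 during SynthText pre-training and the first 100K fine-tuning
    iterations, then 1e-4 for the remaining 80K iterations."""
    if pretraining or iteration < 100_000:
        return 1e-3
    return 1e-4

for it in range(180_000):
    for group in optimizer.param_groups:
        group['lr'] = learning_rate(it)
    # forward pass, loss = 2.0 * L_pixel + L_link, loss.backward(), optimizer.step() ...
```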

C. HORIZONTAL TEXT DETECTION PERFORMANCE: IC13
The final models are fine-tuned for about 180K iterations based on the SynthText pre-trained model.


TABLE 1. Results of ICDAR 2013 Dataset.

The thresholds on the pixel and link predictions to determine positive or negative are set to 0.7 and 0.5 respectively. Some detection results on IC13 are given in Figure 7(a).

We have evaluated the proposed GISCA for text detection on ICDAR 2013, a popular horizontal text dataset. The numerical comparison results are listed in Table 1. In general, our proposed STD model outperforms all the existing state-of-the-art methods in terms of recall and f-measure. Specifically, it achieves the best performance among all the regression-based methods [13], [14], [16]–[18], [20]–[22], [31] on all three evaluation measurements. Besides, GISCA outperforms all the segmentation based text detection algorithms [11], [50] in terms of recall and f-measure. Taking our baseline model PixelLink as an example, GISCA exceeds PixelLink by 3.9%, 4.2% and 4.0% in recall, precision and f-measure respectively. The only exception is that GISCA has a lower precision value than Mask TextSpotter. Considering that Mask TextSpotter uses the results of text recognition to refine the detection results, it is reasonable for it to achieve such a high precision value. On the whole, the experimental results show the superiority of the proposed GISCA.

D. MULTI-ORIENTED TEXT DETECTION PERFORMANCE: IC15
We follow the experiment settings presented in Section B. The thresholds for pixel and link prediction are both set to 0.8. The results are evaluated by the official submission server for a fair comparison. Some qualitative illustrations are shown in Figure 7(b).

The IC15 text detection task is more challenging than IC13 owing to its multi-orientation characteristics in complex contexts. Besides, images in IC15 are of low resolution and contain many small text instances. Table 2 shows the results of GISCA compared with previously published state-of-the-art methods. The experimental results reveal the proposed method's advantages under complex background and low resolution conditions. More specifically, the proposed method outperforms all the existing algorithms in terms of recall (85.5%), precision (89.1%) and f-measure (87.3%).

TABLE 2. Results of ICDAR 2015 incidental Dataset.

TABLE 3. Results of MSRA TD500 Dataset.

In particular, GISCA leads the second-place FTSN by 3.2% in f-measure. Therefore, GISCA has a clear advantage in detecting multi-oriented text.

E. ORIENTED MULTI-LINGUAL TEXT LINE DETECTION PERFORMANCE: TD500
MSRA TD500 is a dataset which contains both English and Chinese texts. The annotations are given in the form of text lines. The thresholds of pixel and link are set to 0.8 and 0.7 respectively.

Some qualitative detection results are illustrated in Figure 7(c) and comparison results are listed in Table 3. We can conclude that GISCA can successfully detect long text lines of arbitrary orientations and shapes. More specifically, GISCA surpasses the existing state-of-the-art results by more than 2% in terms of f-measure. Although RRD slightly surpasses the proposed method in precision, the proposed method is still competitive considering that RRD is a detector specifically designed for long oriented texts.

F. ABLATION EXPERIMENTS
Our proposed multi-oriented text detection model involves two innovative components, namely the Contextual Attention Module (CAM) and the Gradient-inductive Module (GIM). To verify the validity of these two proposed modules, we conduct ablation studies on ICDAR 2013, ICDAR 2015 and MSRA TD500 respectively. Four sets of comparison results are listed in Tables 4, 5 and 6. For convenience, we use Baseline to denote the original PixelLink model which takes the commonly used U-Net as the backbone network. The second experiment, Baseline+CAM, replaces the skip connections in U-Net with the proposed CAM. The third is Baseline+GIM, which follows the architecture of U-Net but uses the proposed GIM in the up-sampling process. The last is CAM+GIM, which incorporates both CAM and GIM in the backbone network.


FIGURE 7. Some qualitative detection results on (a) ICDAR 2013, (b) ICDAR 2015, and (c) MSRA-TD500.

FIGURE 8. The up-sampling process results. The symbol in the upper right corner is the name of the feature map. (a) With CAM. (b) Without CAM. It is obvious that when CAM is integrated into the network, the feature maps filter out the background interference and focus on the text area.

TABLE 4. Ablation experiments of IC13.

1) WITH OR WITHOUT CAM
As shown in Tables 4, 5 and 6, Baseline+CAM outperforms Baseline in terms of recall, precision and f-measure.

TABLE 5. Ablation experiments of IC15.

For instance, considering f-measure, Baseline+CAM surpasses Baseline by 3.8%, 2.5% and 2.5% on IC13, IC15 and TD500 respectively.


TABLE 6. Ablation experiments of TD500.

Similarly, CAM+GIM also exceeds Baseline+GIM on all three measurements. These two sets of comparison experiments show that the Contextual Attention Module is an efficient structure for the text detection task. The improvement in precision indicates that CAM effectively reduces the false detection rate and thus alleviates the FP problem. The up-sampling processes with and without CAM are shown in Figure 8, which intuitively displays the effect of CAM.

2) WITH OR WITHOUT GIM
To validate the potency of GIM, we compare Baseline with Baseline+GIM and Baseline+CAM with CAM+GIM. In the comparison between Baseline and Baseline+GIM, the latter outperforms the former on all three datasets and under almost all three measurement criteria. The only exception is that Baseline surpasses Baseline+GIM in recall on TD500, which could be attributed to the large text size variation. The other set of comparison results, between Baseline+CAM and CAM+GIM, also indicates that GIM is an efficient module in most cases.

V. CONCLUSION
In this work, we have presented an end-to-end trainable scene text detector named GISCA, which is built on the U-Net framework. More specifically, a novel Contextual Attention Module (CAM) is carefully designed to boost the salient text areas, which leads to a reduction of false positive results. Besides, the proposed Gradient-inductive Module (GIM) is designed to provide a linear bypass which aims at alleviating the gradient vanishing/exploding problems. Experimental results on three widely used datasets (ICDAR 2013, ICDAR 2015 and MSRA TD500) demonstrate the superiority of the proposed method, which outperforms all state-of-the-art methods in terms of f-measure. In the future, we plan to investigate the universality of CAM and GIM in deep neural networks and probe the curved text detection problem using the proposed method.

ACKNOWLEDGMENT
Special acknowledgements are given to the Aoto-PKUSZ Joint Research Center of Artificial Intelligence on Scene Cognition and Technology Innovation for its support.

REFERENCES
[1] C. Kang, G. Kim, and S. I. Yoo, "Detection and recognition of text embedded in online images via neural context models," in Proc. 31st AAAI Conf. Artif. Intell., Feb. 2017, pp. 4103–4110.
[2] X. Rong, C. Yi, and Y. Tian, "Recognizing text-based traffic guide panels with cascaded localization network," in Proc. Eur. Conf. Comput. Vis. Cham, Switzerland: Springer, Sep. 2016, pp. 109–121.
[3] B. Xiong and K. Grauman, "Text detection in stores using a repetition prior," in Proc. IEEE Winter Conf. Appl. Comput. Vis. (WACV), Mar. 2016, pp. 1–9.
[4] C. Yi and Y. Tian, "Scene text recognition in mobile applications by character descriptor and structure configuration," IEEE Trans. Image Process., vol. 23, no. 7, pp. 2972–2982, Jul. 2014.
[5] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Proc. Adv. Neural Inf. Process. Syst., 2012, pp. 1097–1105.
[6] Q. Ye and D. Doermann, "Text detection and recognition in imagery: A survey," IEEE Trans. Pattern Anal. Mach. Intell., vol. 37, no. 7, pp. 1480–1500, Jul. 2015.
[7] S. Ren, K. He, R. Girshick, and J. Sun, "Faster R-CNN: Towards real-time object detection with region proposal networks," in Proc. Adv. Neural Inf. Process. Syst., 2015, pp. 91–99.
[8] W. Liu et al., "SSD: Single shot multibox detector," in Proc. Eur. Conf. Comput. Vis. Cham, Switzerland: Springer, Sep. 2016, pp. 21–37.
[9] J. Long, E. Shelhamer, and T. Darrell, "Fully convolutional networks for semantic segmentation," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2015, pp. 3431–3440.
[10] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2016, pp. 770–778.
[11] D. Deng, H. Liu, X. Li, and D. Cai, "PixelLink: Detecting scene text via instance segmentation," in Proc. 32nd AAAI Conf. Artif. Intell., Apr. 2018, pp. 6773–6780.
[12] K. Simonyan and A. Zisserman. (2014). "Very deep convolutional networks for large-scale image recognition." [Online]. Available: https://arxiv.org/abs/1409.1556
[13] M. Liao, B. Shi, X. Bai, X. Wang, and W. Liu, "TextBoxes: A fast text detector with a single deep neural network," in Proc. 31st AAAI Conf. Artif. Intell., Feb. 2017, pp. 4161–4167.
[14] M. Liao, B. Shi, and X. Bai, "TextBoxes++: A single-shot oriented scene text detector," IEEE Trans. Image Process., vol. 27, no. 8, pp. 3676–3690, Aug. 2018.
[15] J. Huang, V. Sivakumar, M. Mnatsakanyan, and G. Pang. (2018). "Improving rotated text detection with rotation region proposal networks." [Online]. Available: https://arxiv.org/abs/1811.07031
[16] H. Hu, C. Zhang, Y. Luo, Y. Wang, J. Han, and E. Ding, "WordSup: Exploiting word annotations for character based text detection," in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Oct. 2017, pp. 4940–4949.
[17] X. Zhou et al., "EAST: An efficient and accurate scene text detector," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 5551–5560.
[18] M. Liao, Z. Zhu, B. Shi, G.-S. Xia, and X. Bai, "Rotation-sensitive regression for oriented scene text detection," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2018, pp. 5909–5918.
[19] F. Wang, L. Zhao, X. Li, X. Wang, and D. Tao, "Geometry-aware scene text detection with instance transformation network," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2018, pp. 1381–1389.
[20] Z. Tian, W. Huang, T. He, P. He, and Y. Qiao, "Detecting text in natural image with connectionist text proposal network," in Proc. Eur. Conf. Comput. Vis. Cham, Switzerland: Springer, Sep. 2016, pp. 56–72.
[21] B. Shi, X. Bai, and S. Belongie, "Detecting oriented text in natural images by linking segments," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jul. 2017, pp. 2550–2558.
[22] Z. Liu, G. Lin, S. Yang, J. Feng, W. Lin, and W. L. Goh. (2018). "Learning Markov clustering networks for scene text detection." [Online]. Available: https://arxiv.org/abs/1805.08365
[23] Z. Zhang, C. Zhang, W. Shen, C. Yao, W. Liu, and X. Bai, "Multi-oriented text detection with fully convolutional networks," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 4159–4167.
[24] C. Yao, X. Bai, N. Sang, X. Zhou, S. Zhou, and Z. Cao. (2016). "Scene text detection via holistic, multi-channel prediction." [Online]. Available: https://arxiv.org/abs/1606.09002
[25] D. He et al., "Multi-scale FCN with cascaded instance aware segmentation for arbitrary oriented word spotting in the wild," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 3519–3528.
[26] Y. Wu and P. Natarajan, "Self-organized text detection with minimal post-processing via border learning," in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Oct. 2017, pp. 5000–5009.


[27] V. Mnih, N. Heess, A. Graves, and K. Kavukcuoglu, "Recurrent models of visual attention," in Proc. Adv. Neural Inf. Process. Syst., 2014, pp. 2204–2212.
[28] S. Jetley, N. A. Lord, N. Lee, and P. H. S. Torr. (2018). "Learn to pay attention." [Online]. Available: https://arxiv.org/abs/1804.02391
[29] F. Wang et al., "Residual attention network for image classification," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jul. 2017, pp. 3156–3164.
[30] T. He, W. Huang, Y. Qiao, and J. Yao, "Text-attentional convolutional neural network for scene text detection," IEEE Trans. Image Process., vol. 25, no. 6, pp. 2529–2541, Jun. 2016.
[31] P. He, W. Huang, T. He, Q. Zhu, Y. Qiao, and X. Li, "Single shot text detector with regional attention," in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Oct. 2017, pp. 3047–3055.
[32] Z. Huang, Z. Zhong, L. Sun, and Q. Huo, "Mask R-CNN with pyramid attention network for scene text detection," in Proc. IEEE Winter Conf. Appl. Comput. Vis. (WACV), Jan. 2019, pp. 764–772.
[33] X. Yang et al. (2018). "R2CNN++: Multi-dimensional attention based rotation invariant detector with robust anchor strategy." [Online]. Available: https://arxiv.org/abs/1811.07126
[34] R. K. Srivastava, K. Greff, and J. Schmidhuber. (2015). "Highway networks." [Online]. Available: https://arxiv.org/abs/1505.00387
[35] G. Huang, Y. Sun, Z. Liu, D. Sedra, and K. Q. Weinberger, "Deep networks with stochastic depth," in Proc. Eur. Conf. Comput. Vis. Cham, Switzerland: Springer, Sep. 2016, pp. 646–661.
[36] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger, "Densely connected convolutional networks," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 4700–4708.
[37] B. Hariharan, P. Arbeláez, R. Girshick, and J. Malik, "Hypercolumns for object segmentation and fine-grained localization," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2015, pp. 447–456.
[38] O. Ronneberger, P. Fischer, and T. Brox, "U-Net: Convolutional networks for biomedical image segmentation," in Proc. Int. Conf. Med. Image Comput. Comput.-Assist. Intervent. Cham, Switzerland: Springer, 2015, pp. 234–241.
[39] E. Xie, Y. Zang, S. Shao, G. Yu, C. Yao, and G. Li. (2018). "Scene text detection with supervised pyramid context network." [Online]. Available: https://arxiv.org/abs/1811.08605
[40] T.-Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie, "Feature pyramid networks for object detection," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jul. 2017, pp. 2117–2125.
[41] A. Newell, K. Yang, and J. Deng, "Stacked hourglass networks for human pose estimation," in Proc. Eur. Conf. Comput. Vis. Cham, Switzerland: Springer, Sep. 2016, pp. 483–499.
[42] S. Hochreiter, "The vanishing gradient problem during learning recurrent neural nets and problem solutions," Int. J. Uncertain. Fuzziness Knowl.-Based Syst., vol. 6, no. 2, pp. 107–116, 1998.
[43] R. Pascanu, T. Mikolov, and Y. Bengio, "Understanding the exploding gradient problem," CoRR, vol. abs/1211.5063, 2012.
[44] C. Szegedy et al., "Going deeper with convolutions," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2015, pp. 1–9.
[45] G. Bradski and A. Kaehler, "OpenCV," Dr. Dobb's J. Softw. Tools, vol. 3, 2000.
[46] D. Karatzas et al., "ICDAR 2013 robust reading competition," in Proc. 12th Int. Conf. Document Anal. Recognit., Aug. 2013, pp. 1484–1493.
[47] D. Karatzas et al., "ICDAR 2015 competition on robust reading," in Proc. 13th Int. Conf. Document Anal. Recognit. (ICDAR), Aug. 2015, pp. 1156–1160.
[48] C. Yao, X. Bai, W. Liu, Y. Ma, and Z. Tu, "Detecting texts of arbitrary orientations in natural images," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2012, pp. 1083–1090.
[49] A. Gupta, A. Vedaldi, and A. Zisserman, "Synthetic data for text localisation in natural images," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 2315–2324.
[50] P. Lyu, M. Liao, C. Yao, W. Wu, and X. Bai, "Mask TextSpotter: An end-to-end trainable neural network for spotting text with arbitrary shapes," in Proc. Eur. Conf. Comput. Vis. (ECCV), Sep. 2018, pp. 67–83.
[51] W. He, X.-Y. Zhang, F. Yin, and C.-L. Liu, "Deep direct regression for multi-oriented scene text detection," in Proc. IEEE Int. Conf. Comput. Vis., Oct. 2017, pp. 745–753.
[52] S. Long, J. Ruan, W. Zhang, X. He, W. Wu, and C. Yao, "TextSnake: A flexible representation for detecting text of arbitrary shapes," in Proc. Eur. Conf. Comput. Vis. (ECCV), Sep. 2018, pp. 20–36.
[53] Y. Dai et al., "Fused text segmentation networks for multi-oriented scene text detection," in Proc. 24th Int. Conf. Pattern Recognit. (ICPR), Aug. 2018, pp. 3604–3609.

MENG CAO received the B.E. degree from the Huazhong University of Science and Technology (HUST), in 2018. He is currently pursuing the M.Sc. degree in computer science and engineering with Peking University. His current research interests include scene text detection and computer vision.

YUEXIAN ZOU received the M.Sc. degree from the University of Electronic Science and Technology of China, in 1991, and the Ph.D. degree from The University of Hong Kong, in 2000. She is currently a Full Professor with Peking University, where she is also the Director of the Advanced Data and Signal Processing Laboratory, Shenzhen Graduate School. Her current research interests include machine learning for signal processing and deep learning and its applications. She was a recipient of the Leading Figure for Science and Technology Award from the Shenzhen Municipal Government, in 2009.

DONGMING YANG received the B.E. degree from Shanxi University, in 2015, and the M.Sc. degree from the University of Chinese Academy of Sciences, in 2018. He is currently pursuing the Ph.D. degree in computer science and engineering with Peking University. His current research interests include computer vision and pattern recognition.

CHAO LIU received the B.E. degree from the Institute of Disaster Prevention, in 2016. He is currently pursuing the M.Sc. degree with the School of Electronic and Computer Engineering, Peking University. His current research interest includes scene text detection.
