

Date of publication xxxx 00, 0000, date of current version xxxx 00, 0000.

Digital Object Identifier 10.1109/ACCESS.2017.DOI

GISCA: Gradient-inductive Segmentation Network with Contextual Attention for Scene Text Detection

MENG CAO¹, YUEXIAN ZOU¹,², DONGMING YANG¹, CHAO LIU¹
¹ADSPLAB, School of ECE, Peking University, Shenzhen, China
²Peng Cheng Laboratory, Shenzhen, China

Corresponding author: Yuexian Zou (e-mail: [email protected])

This paper was partially supported by the Shenzhen Science & Technology Fundamental Research Program (JCYJ20160330095814461), National Engineering Laboratory for Video Technology - Shenzhen Division, Shenzhen Key Lab for IMVR (ZDSYS201703031405467) and Shenzhen Municipal Development and Reform Commission (Disciplinary Development Program for Data Science and Intelligent Computing). Special acknowledgements are given to Aoto-PKUSZ Joint Lab (AISR) for its support.

ABSTRACT Scene text detection (STD) is an indispensable step in a scene text reading system. It remains a more challenging task than general object detection since text objects are of arbitrary orientations and varying sizes. Generally, segmentation methods which use U-Net or hourglass-like networks are the mainstream approaches for multi-oriented text detection. However, experience has shown that text-like objects in complex backgrounds produce high response values on the output feature map of U-Net, which leads to a severe false positive detection rate and degrades STD performance. To tackle this issue, an adaptive soft attention mechanism called the Contextual Attention Module (CAM) is devised and integrated into U-Net to highlight salient areas while retaining more detailed information. Besides, the gradient vanishing and exploding problems make U-Net more difficult to train because of the nonlinear deconvolution layers used in the up-sampling process. To facilitate the training process, a Gradient-inductive Module (GIM) is carefully designed to provide a linear bypass that makes gradient back-propagation more stable. Accordingly, an end-to-end trainable Gradient-inductive Segmentation network with Contextual Attention (GISCA for short) is proposed. Experimental results on three public benchmarks demonstrate that the proposed GISCA achieves state-of-the-art results in terms of f-measure: 92.1%, 87.3% and 81.4% on ICDAR 2013, ICDAR 2015 and MSRA TD500 respectively.

INDEX TERMS Scene text detection, multi-oriented text, segmentation network, contextual attention, gradient vanishing/exploding problems

I. INTRODUCTION

SCENE text is one of the most common objects in natural scenes; it frequently appears in many practical settings and carries important information for many applications [1] [2] [3] [4], such as blind navigation, scene understanding, autonomous driving, etc. Scene text reading is thus of great importance in computer vision and has made tremendous progress benefiting from the development of Deep Convolutional Neural Networks (DCNNs) [5]. Generally, scene text reading can be divided into two main sub-tasks: scene text detection (STD) and scene text recognition. We focus on the task of STD in this paper since it is the prerequisite and crucial step of scene text recognition. Despite the similarity to traditional OCR, STD is still challenging because text instances exhibit vast diversity in fonts, scales and arbitrary orientations, as well as uncontrollable illumination effects [6], etc.

Thanks to the recent development of general object detection [7] [8] and instance segmentation methods [9], STD has witnessed tremendous progress. Deep learning based methods for STD directly learn hierarchical features from input images, which demonstrate more accurate and efficient performance on various STD public benchmarks. However, due to the characteristics of scene text, there still remain many problems to address in practice. Firstly, false positives (FP) are a common mistake in STD, which means detectors tend to mistake non-text areas for text areas. Some examples are given in Figure 1.




FIGURE 1: Examples of false positive results. The false positive detection results are illustrated in red while the correct ones are in green.

Thus, the FP problem may significantly affect the precision value in text detection. Secondly, although it has been proved that network depth is of crucial importance in various computer vision tasks, the network still cannot be too deep because of vanishing or exploding gradients and the consequent degradation problem [10]. This has become a bottleneck for improving STD performance. Therefore, designing a suitable backbone network is crucial to boost STD performance.

To solve these two problems, we propose a Gradient-inductive Segmentation network with Contextual Attention, named GISCA, for multi-oriented scene text detection. The backbone network of GISCA incorporates two innovative modules, namely the Contextual Attention Module (CAM) and the Gradient-inductive Module (GIM), to address the two problems mentioned above respectively. Concretely, CAMs adaptively make the network focus on salient areas and suppress background information, thus alleviating the FP problem. As for GIM, this well-designed module has a linear bypass which makes the gradient back-propagation process more stable. To the best of our knowledge, this is the first time such a linear/non-linear decoupling mechanism has been proposed for the STD task. Following the convention of text segmentation methods, this novel STD detector generates the detection results directly from the instance segmentation map. Besides, it is worth mentioning that the segmentation map is generated through two kinds of pixel-wise prediction: text/non-text prediction and link prediction. Similar to PixelLink [11], the link prediction between two pixels is labeled as positive if they are within the same text instance, otherwise it is labeled as negative. Therefore, the text score map and the eight link maps (every pixel has 8 neighbours) form the final segmentation map by using the positive links to connect positive pixels. The final bounding boxes are generated by applying the minAreaRect method in OpenCV to the text instances. With these designs, GISCA achieves competitive performance in STD. We will show by experiments in Section IV that our proposed detector outperforms other state-of-the-art methods.

The baseline model of this study, PixelLink, was proposed in [11]. Our proposed method extends PixelLink with two carefully designed modules, CAM and GIM. The attention mechanism is widely adopted in various computer vision tasks. Compared to previous methods, CAM not only suppresses background clutter but also enhances the detail expression of multi-layer feature maps. As for the gradient back-propagation problems, this paper extensively studies related publications and proposes the Gradient-inductive Module, which provides an effective linear gradient back-propagation bypass during the up-sampling process.

The main contributions of this paper are summarized as follows:
1) This paper proposes an adaptive soft Contextual Attention Module (CAM) which serves as a probability heat map and efficiently highlights the salient areas. Compared to other attention mechanisms, CAM avoids additional input supervisory constraints and elaborate loss function designs. Experiments show that CAM efficiently extracts meaningful information from the input feature maps and enhances the detail characteristics in salient areas, which makes the network more suitable for pixel-level and region-level tasks.
2) A novel up-sampling pattern named the Gradient-inductive Module (GIM) is proposed to provide both sufficient feature expression and convenient gradient back-propagation. GIM facilitates the training process and improves the detection results at the same time.
3) The proposed GISCA outperforms state-of-the-art algorithms on three widely adopted multi-oriented text datasets.

The rest of the paper is organized as follows. Section II reviews related works. Section III presents the pipeline and algorithms of the proposed GISCA. Section IV presents extensive experimental results. Finally, Section V draws the conclusion.

II. RELATED WORK
As a special case of general object detection, scene text detection (STD) basically follows the same framework as object detection. In Section A, we first review the mainstream approaches in the development of scene text detection. Then issues about the application of attention mechanisms are analyzed in Section B. Finally, Section C describes methods recently proposed for alleviating the gradient vanishing/exploding problems.

A. SCENE TEXT DETECTION
STD aims to localize text instances in images, commonly using word bounding boxes. Primarily, most recent methods may roughly be classified into regression-based and segmentation-based methods.

Regression-based methods: More specifically, regression-based methods can be further divided into two categories according to the regression target: proposal-based and part-based methods.

Proposal-based methods follow the routine of one-stage or two-stage detectors such as SSD [8] and Faster-RCNN [7]. TextBoxes [13] modifies SSD [8] for scene text detection with long default boxes and a vertical offset strategy to deal with extreme aspect ratio texts.





FIGURE 2: Architecture of GISCA. First, an image is fed into the down-sampling VGG16 stages, namely the D1∼D5 layers. D5 is generated by converting fc6/fc7 to convolutional layers using 1×1 filters. The annotations in parentheses are the same as those defined in [12]. Note that #c stands for the number of channels; #c: 2/16 stands for feature maps with 2 or 16 channels, for text/non-text prediction or link prediction respectively. D5 is then input to the deconvolution layer and U4 is the output. The red dotted box contains two separate modules, CAM and GIM. The attention signal is refined through CAM (the grey block in the figure). After bilinear interpolation, the previous up-sampling layer and the corresponding down-sampling layer are input into GIM to generate the output of the next layer. The final predictions (text/non-text prediction and link prediction) are conducted on U1, which maintains the same resolution as the input image. The eight feature maps in the blue dashed box stand for the link predictions in eight directions. The post-processing procedure connects positive pixels with positive links.

TextBoxes++ [14] extends TextBoxes by using quadrilateral bounding boxes to detect multi-oriented texts. RRPN [15] transforms the RPN in Faster RCNN into a Rotated Region Proposal Network, which facilitates the detection of multi-oriented texts. WordSup [16] notes that there are few character-level annotations in STD and introduces a weak supervision strategy to adjust character coordinates. EAST [17] eliminates the steps of candidate aggregation and word partition. Instead, it predicts words or text lines of arbitrary orientations and quadrilateral shapes directly. Liao et al. propose to separate the text/non-text classification and regression tasks using rotation-invariant and rotation-sensitive features [18]. Wang et al. come up with an Instance Transformation Network (ITN) [19] to learn geometry-aware representation tailored for scene text in an end-to-end network.

Part-based methods tend to regress text parts and predict the linking relationships between them. CTPN [20] proposes a vertical anchor mechanism to predict the location and text/non-text scores for each proposal. Then a recurrent neural network (RNN) is utilized to connect the sequential proposals. SegLink [21] decomposes text into two elements: segments, which are bounding boxes covering part of a word, and links, which connect two adjacent segments. MCN [22] converts an image into a Stochastic Flow Graph (SFG) and then conducts Markov Clustering on this graph to form the final bounding boxes.

Segmentation-based methods: Segmentation-based methods regard the detection task as an instance segmentation problem. Multi-oriented text detection is especially suitable for the segmentation approach, although it involves complex post-processing.




Zhang et al. apply an FCN to predict the salient map of text regions, followed by character component combination [23]. HMCP [24] utilizes a modified HED to obtain the text region map, character map and linking orientation map; the text line results are then obtained through a problem-specific graph model. He et al. employ a novel text detection algorithm using two cascaded FCNs, one to extract text block regions and the other to remove false positives [25]. Wu et al. develop a lightweight FCN to effectively learn the borders in text images and introduce the border class to text detection [26].

Considering the superiority of segmentation-based methods in detecting multi-oriented scene texts, this paper uses segmentation as the baseline route.

B. ATTENTION MECHANISM
An important feature of human perception is that a person does not process the entire scene at once when parsing a picture. Instead, humans selectively focus on a certain part of the visual space to obtain the salient information which forms an overall understanding. This well-known attention mechanism is widely studied in various computer vision tasks. Generally, attention mechanisms fall into two main categories: hard attention [27] and soft attention [28] [29]. Hard attention, e.g. iterative region proposal and cropping, is often a fixed correction mechanism and tends to be non-differentiable, which makes model training more difficult. In contrast, trainable soft attention is a simpler and more efficient way to highlight important areas without introducing additional inputs. More importantly, trainable soft attention places only a small computational burden on the model.

Attention mechanisms are widely adopted in scene text detection (STD) to improve detection accuracy. Text-attention CNN [30] integrates a mask branch in the network and thus provides rich supervision information. This mask branch enhances the capability to discriminate ambiguous text instances. SSTD [31] takes the pixel-wise binary mask of text images as an additional input, which is used to refine the original feature map. Huang et al. incorporate the proposed Pyramid Attention Network (PAN) into the Mask R-CNN framework to enhance text detection performance [32]. R2CNN++ [33] adopts a multi-dimensional attention mechanism which contains both pixel-wise and channel-wise attention to reduce the adverse impact of noise.

Most of the methods mentioned above either use additional inputs or require a problem-specific loss function to constrain the optimization process. In this paper, the proposed CAM only uses the contextual information of the network and does not require explicit supervisory constraints. More importantly, CAM not only suppresses background clutter, but also enriches the detail features of salient regions.

C. GRADIENT VANISHING PROBLEMS
DCNNs tend to be substantially deep to achieve accurate results, as proved in [10]. However, many recent publications [10] [34] [35] [36] [37] address the problem that the gradient update information may vanish or explode during the back-propagation process. Almost all papers that deal with gradient vanishing/exploding problems design a shortcut connection between layers close to the input and those close to the output. More specifically, the shortcut connections are implemented either by residual blocks or by concatenation operations.

Residual block based models utilize the residual learning framework proposed in [10] to ease the network training process and the consequent degradation problem [10]. ResNet [10] hypothesizes that residual functions are easier to optimize than the original unknown mapping functions. Therefore, ResNet proposes identity shortcuts and reformulates the layers to learn residual functions. Inspired by Long Short-Term Memory recurrent networks (LSTM), Highway Networks [34] design bypassing paths and gating units which provide a method to effectively train networks with hundreds of layers in an end-to-end manner. Stochastic Depth [35] randomly drops residual blocks during the training process to further mitigate the gradient vanishing problem.

The concatenation operation has also been extensively used as another way of alleviating gradient vanishing/exploding problems. DenseNet [36] connects each layer to every other layer within the proposed dense block. Adjacent dense blocks are connected by transition layers which change feature map sizes via convolution and pooling. The hyper-column network [37] defines the pixel-level hyper-column as the vector of activation values of all the shallow layers in a CNN, which proves to perform well in fine-grained localization tasks. For the segmentation task, U-Net [38] designs skip connections to concatenate the down-sampling and up-sampling feature maps.

However, when carefully evaluating these structures, we find that they cannot achieve direct gradient back-propagation throughout the whole training process. For example, residual block based methods have to utilize 1 × 1 convolution layers in the bypass branches to deal with the channel mismatch between input and output layers, and this nonlinear bypass is unfavorable for gradient back-propagation. For DenseNet, the dense concatenation is limited to the dense block because only feature maps within the same dense block preserve the same resolution, while feature map resizing is implemented through convolution and pooling operations. Therefore, only the gradient within a dense block is preserved and the gradient back-propagation between dense blocks is still blocked. U-Net does connect the feature maps between the up-sampling and down-sampling processes, but the up-sampling process is still implemented through the non-linear deconvolution operation. To achieve direct gradient back-propagation, this paper carefully designs a novel Gradient-inductive Module which contains a linear bypass in the up-sampling process. With our design, GIM enables the gradient from very deep layers to be directly propagated to shallow layers.




III. METHODOLOGY
A. OVERVIEW
The proposed GISCA is an end-to-end trainable, fully convolutional network for detecting multi-oriented texts. The overall architecture is presented in Figure 2. The framework can be divided into four parts: a modified U-Net [38] with the Gradient-inductive Module (GIM) serving as the backbone network, the Contextual Attention Module (CAM) for feature refinement, the pixel-level prediction module that outputs the instance segmentation maps, and the post-processing module that generates the bounding boxes.

In our design, the well-known VGG16 network [12] is adopted as the feature extractor. By convention, the fully connected layers fc6 and fc7 are converted into convolutional layers using 1 × 1 filters. Inspired by [38], we design a U-shape feature fusion network for scene text instance segmentation. The convolutional layers conv2_2, conv3_3, conv4_3, conv5_3 and fc_7 are utilized for the feature fusion implementation. Different from U-Net, our modified U-Net is up-sampled through two modules, namely the Contextual Attention Module (CAM) and the Gradient-inductive Module (GIM). CAM, the carefully designed attention module, aims to suppress irrelevant background and highlight the salient text areas. The feature layers are then generated sequentially through GIM, which enables the gradient update information to be directly propagated to shallow layers. Following PixelLink [11], two sets of predictions are conducted on the final feature maps, one branch for text/non-text classification and the other for linkage prediction. Positive pixels are then grouped together using positive links, which generates a collection of connected components (CCs), each representing a detected text instance.

The pseudo code of the overall pipeline is given in Algorithm 1.

Algorithm 1 Algorithm for the proposed method
Input: RGB image D0
Output: detection results
① Down-sampling process
 1: for i ∈ [1, 5] do
 2:    D_i = pooling(ReLU(Conv(D_{i-1})))
 3: end for
② Up-sampling process
 4: if i == 4 then
 5:    U_i = deconv(D_5)
 6: end if
 7: if i ∈ [1, 3] then
 8:    D^t_i = CAM(D_i, U'_{i+1});  U_i = GIM(D^t_i, U'_{i+1})
 9: end if
③ Pixel-level prediction process
10: The pixel-level binary classification is conducted on U_1.
11: The link prediction (8 values for each pixel) is applied on U_1.
④ Post-processing process
12: The positive pixels are grouped with the positive linkage.
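To make the data flow of Algorithm 1 concrete, a minimal PyTorch-style sketch of one plausible forward pass is shown below. The `GISCABackbone` class, its constructor arguments and the channel handling are assumptions for illustration only and do not correspond to a released implementation; the CAM and GIM placeholders are sketched in Sections B and C.

```python
import torch.nn as nn
import torch.nn.functional as F

class GISCABackbone(nn.Module):
    """Sketch of the Algorithm 1 pipeline (hypothetical module wiring)."""
    def __init__(self, vgg_stages, cams, gims, deconv):
        super().__init__()
        self.stages = nn.ModuleList(vgg_stages)  # blocks producing D1..D5
        self.cams = nn.ModuleList(cams)          # CAM for i = 1, 2, 3
        self.gims = nn.ModuleList(gims)          # GIM for i = 1, 2, 3
        self.deconv = deconv                     # deconvolution producing U4 from D5

    def forward(self, x):
        # ① Down-sampling: collect D1..D5
        feats = []
        for stage in self.stages:
            x = stage(x)
            feats.append(x)

        # ② Up-sampling: U4 = deconv(D5), then U3, U2, U1 via CAM + GIM
        u = self.deconv(feats[-1])
        for i in reversed(range(3)):             # i = 2, 1, 0 -> U3, U2, U1
            d_i = feats[i]
            u_up = F.interpolate(u, size=d_i.shape[2:], mode='bilinear',
                                 align_corners=False)   # U'_{i+1}
            d_t = self.cams[i](d_i, u_up)        # attention-refined D_i^t
            u = self.gims[i](d_t, u_up)          # next up-sampled feature map

        # ③ Pixel-level text/non-text and link predictions are conducted on U1
        return u
```

Channel alignment between the down-sampling and up-sampling branches (the #c: 2/16 conversion in Figure 2) is omitted here for brevity.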

The design of CAM is detailed in Section B. Section C presents the structure of GIM, including the mathematical proof of its effectiveness. The pixel-level predictions are illustrated in Section D. Finally, the overall optimization procedure, including the specific loss function and post-processing strategies, is given in Section E.

B. CONTEXTUAL ATTENTION MODULE (CAM)
The false positive (FP) problem remains a big challenge for scene text detection [39]. Because of complex background interference and the large aspect ratio of text shapes, current text detection algorithms tend to mistake similar objects, such as a row of fence bars or certain disc-shaped objects, for text areas. The cause of misclassification is probably the lack of local detail information in the deep feature maps. Generally, the detail information is filtered out during the convolutional down-sampling process. One typical down-sampling process is depicted in Figure 3, from which we can see that low-level feature maps capture more details but cannot clearly indicate where the text areas are. In contrast, the high-level feature maps explicitly indicate the text areas at the cost of losing detail information.

Although the up-sampling process of U-Net outputs a feature map with the same resolution as the input image, detail information is still missing after the repeated convolution operations. As a result, these feature maps are not suitable for pixel-level or region-level tasks owing to the contradiction between resolution and semantic information. Therefore, it is important for the trainable model to focus on the salient areas on the one hand and retain more detail information on the other hand.

Based on the soft attention mechanism, we propose the Contextual Attention Module (CAM). The architecture of CAM is illustrated in Figure 4, from which we can see that the output feature map of CAM retains both detail characteristics and high-level semantic information (highlighted areas indicating where texts are). The common skip connections in U-Net are replaced by CAMs during the up-sampling process. Compared to the traditional U-Net models, CAM adaptively emphasizes the salient areas and suppresses irrelevant background information without adding a large number of model parameters. Figure 5 illustrates the input feature maps and the output refinement results of CAM. We can clearly see that the glass curtain walls in the input feature maps have high response values, while the CAM output feature maps are more focused on the text area.

The attention map has shape Win × Hin × Cin with each attention coefficient ranging from 0 to 1. The output of CAM is the element-wise multiplication of the original skip connection feature map (from low-level layers) and the attention coefficient map. As depicted in Figure 4, CAM takes the feature maps D_i and U_{i+1}, i = 1, 2, 3 as inputs. The adaptive attention map is generated through two 3 × 3 convolutional layers and one 1 × 1 convolutional layer. The generated attention map has two channels representing text/non-text scores




FIGURE 3: Illustration of the down-sampling process results. (a) The input image. (b)∼(f): the feature maps D1∼D5. Lower-level feature maps capture more local details while higher-level feature maps contain more semantic information.

FIGURE 4: The architecture of CAM. D_i, i = 1, 2, 3, represents the feature map extracted from the down-sampling process. U'_{i+1} denotes the bilinear interpolation of U_{i+1}. The output D^t_i contains both the details in D_i and the semantic information in U'_{i+1}.

FIGURE 5: Comparisons of the inputs (a) and outputs (b) of CAM. The symbol in the upper right corner of each panel is the name of the feature map (D1∼D3 for the inputs and D^t_1∼D^t_3 for the outputs).

respectively. Therefore, the designed generation process of CAM is as follows:

$$m = \mathrm{ReLU}(\mathrm{Conv}_{3\times3}(D_i) + \mathrm{Conv}_{3\times3}(U_{i+1})) \tag{1}$$

$$s = e^{\mathrm{Softmax}(\mathrm{ReLU}(\mathrm{Conv}_{1\times1}(m)))} \tag{2}$$

where Conv3×3 denotes a convolution layer with kernel size 3 × 3 and Conv1×1 represents a 1 × 1 convolution. The Softmax function is used to normalise the attention coefficients. Through the exponential function, the salient map is enhanced because the gap between text and non-text areas becomes larger. In both Eq. (1) and Eq. (2), the rectified linear unit ReLU is utilized as the activation function. The final attention result is given as follows:

$$D^t_i = s^{*} \odot D_i \tag{3}$$

where s* is generated by broadcasting s to the same number of channels as D_i, namely Cin, and ⊙ represents the pixel-to-pixel multiplication of s* and D_i.
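A minimal PyTorch-style sketch of CAM following Eqs. (1)-(3) is given below. The channel sizes, the assumption that both inputs carry the same number of channels, and the choice of which attention channel is broadcast are illustrative assumptions; the paper only fixes the two-channel text/non-text attention map.

```python
import torch
import torch.nn as nn

class CAM(nn.Module):
    """Contextual Attention Module: a sketch of Eqs. (1)-(3)."""
    def __init__(self, channels):
        super().__init__()
        self.conv_d = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.conv_u = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.conv_m = nn.Conv2d(channels, 2, kernel_size=1)  # two-channel text/non-text map
        self.relu = nn.ReLU(inplace=True)

    def forward(self, d_i, u_next):
        # Eq. (1): m = ReLU(Conv3x3(D_i) + Conv3x3(U_{i+1}))
        m = self.relu(self.conv_d(d_i) + self.conv_u(u_next))
        # Eq. (2): s = exp(Softmax(ReLU(Conv1x1(m)))), softmax over the two channels
        s = torch.exp(torch.softmax(self.relu(self.conv_m(m)), dim=1))
        # Eq. (3): broadcast one attention channel to C_in channels and multiply
        s_star = s[:, 1:2, :, :].expand_as(d_i)  # assumed: channel 1 holds the text score
        return d_i * s_star
```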

C. GRADIENT-INDUCTIVE MODULE (GIM)
Generally, for the segmentation task, well-known networks with an hourglass-like structure such as U-Net [38], FPN [40] and stacked hourglass [41] are widely used to obtain deep feature maps with high resolutions. Unfortunately, training networks that consist of both down-sampling and up-sampling processes becomes more difficult




when the DCNNs become increasingly deep. This phenomenon can be attributed to the gradient vanishing [42] or gradient exploding problems [43] when the gradients propagate from deep feature maps to shallow ones. For instance, the parameter update may get stuck in a locally optimal region or the gradient value may become infinite.

To address this problem, we carefully design a novel Gradient-inductive Module (GIM) which enables the gradient update information to be directly propagated to shallow layers. Considering that a nonlinear structure is adept at formulating the mapping between input and output while a linear structure easily propagates gradients, GIM is designed to decouple the linear and nonlinear expressions into two branches. The architecture is depicted in Figure 6. The concatenation layer followed by convolution and ReLU activation functions forms the non-linear branch. Through this branch, the neural network can approximate the mapping function well during training. The element-wise addition layer is the linear branch of GIM. In a certain sense, this design embeds a "constant" term in the output. Therefore, the existence of the linear branch guarantees that the gradient update of each layer stays close to a constant value. Compared with the conventional tandem structure connecting linear and nonlinear layers [5] [44], our decoupled module shows better gradient stability during the training process.

Here, we denote the feature maps as D_i (i = 1, 2, 3, 4, 5) and U_j (j = 1, 2, 3, 4) for the down-sampling and up-sampling processes respectively. Then GIM can be expressed as follows:

$$U_i = \begin{cases} D^t_i + r(\mathrm{concat}(D^t_i, U'_{i+1})) & i = 1, 2, 3 \\ \mathrm{deconv}(D_5) & i = 4 \end{cases} \tag{4}$$

where D^t_i denotes the feature map transferred from D_i, namely the output of CAM, concat(·) represents feature concatenation and r represents Conv + ReLU operations. For the sake of clarity, we expand the formula gradually as follows:

$$M_i = \mathrm{concat}(D^t_i, U'_{i+1}) \tag{5}$$

$$M_i = \mathrm{ReLU}(\mathrm{conv}(M_i)) \tag{6}$$

$$U_i = D^t_i + M_i \tag{7}$$

where conv in Eq. (6) denotes the 1 × 1 × Cin convolution operation.

To generate U_i (i = 1, 2, 3), GIM takes the CAM output D^t_i with shape Win × Hin × Cin and the previous up-sampling result U'_{i+1} with the same shape as inputs. It should be noted that U'_{i+1} is the bilinear interpolation result of U_{i+1}, which avoids introducing nonlinear deconvolution layers. The concatenation result of D^t_i and U'_{i+1} is defined as M_i in Eq. (5). Convolution operations are utilized to extract the messages from M_i and adjust the number of channels, as described in Eq. (6). With our design, the nonlinear branch of GIM outputs a feature map with the same shape as the input. As shown in Eq. (7), the linear branch of GIM takes the element-wise addition of D^t_i and M_i as the final output. In this manner, the feature maps U3, U2, U1 are generated sequentially.
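The branch structure of Eqs. (5)-(7) can be sketched as follows; this is an illustrative reading of the module, with the concatenation width and the 1 × 1 channel reduction treated as assumptions.

```python
import torch
import torch.nn as nn

class GIM(nn.Module):
    """Gradient-inductive Module: a sketch of Eqs. (5)-(7)."""
    def __init__(self, channels):
        super().__init__()
        # 1x1 convolution maps the 2*C concatenation back to C channels (Eq. (6))
        self.conv = nn.Conv2d(2 * channels, channels, kernel_size=1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, d_t, u_next):
        m = torch.cat([d_t, u_next], dim=1)  # Eq. (5): concat of D_i^t and U'_{i+1}
        m = self.relu(self.conv(m))          # Eq. (6): non-linear branch
        return d_t + m                       # Eq. (7): linear bypass (element-wise addition)
```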

The principle of GIM can be explained from the perspective of mathematical derivation. Ignoring the influence of CAM, we only integrate GIM into U-Net. According to the architecture of the proposed backbone network shown in Figure 2, we have the following expression:

$$U_i = D_i + H_{\phi_i}(D_i, U_{i+1}), \quad i = 1, 2, 3 \tag{8}$$

where H_{\phi_i} is the asymptotic approximation function in the up-sampling process. Unrolling Eq. (8), we get the following equation:

$$U_1 = D_1 + D_2 + D_3 + \sum_{i=1}^{3} H_{\phi}(D_i, U_{i+1}) \tag{9}$$

For convenience, H_φ is adopted here to represent the simplified result of the nested iterations of H_{φ_i}, i = 1, 2, 3. Using the chain rule, we can derive:

$$\frac{\partial L}{\partial D_1} = \frac{\partial L}{\partial U_1} \cdot \frac{\partial U_1}{\partial D_1} = \frac{\partial L}{\partial U_1}\Big(1 + \frac{\partial}{\partial U_1}\sum_{i=1}^{3} F(D_i, U_{i+1})\Big), \quad i = 1, 2, 3 \tag{10}$$

where L is the loss function. The additive term ∂L/∂U1 ensures that the gradient of the deep feature map U1 can be directly propagated to the shallow one D1. Therefore, the Gradient-inductive Module (GIM) provides a bypass which "induces" the deep gradient to the shallow feature map. This proves effective in alleviating the gradient vanishing problem, which is why GIM is named the Gradient-inductive Module. Similarly, the gradients of U2 and U3 can be propagated to D2 and D3 respectively:

$$\frac{\partial L}{\partial D_2} = \frac{\partial L}{\partial U_2}\Big(1 + \sum_{i=2}^{3}\frac{\partial}{\partial U_2} G(D_i, U_{i+1})\Big) \tag{11}$$

$$\frac{\partial L}{\partial D_3} = \frac{\partial L}{\partial U_3}\Big(1 + \sum_{i=2}^{3}\frac{\partial}{\partial U_3} M(D_i, U_{i+1})\Big) \tag{12}$$

where G and M are defined as nonlinear mapping functions. It is noted that the existence of ∂L/∂U2 and ∂L/∂U3 makes the gradient updates in layers D2 and D3 more stable.
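As a quick sanity check of the identity term in Eq. (10), the effect of the linear bypass can be observed with autograd on a toy example; the shapes and the stand-in non-linear branch are assumptions for illustration only.

```python
import torch

# Toy check: with U = D + H(D), dU/dD contains an identity term, so the
# gradient reaching D stays near 1 even when dH/dD is tiny (no vanishing).
d = torch.randn(4, requires_grad=True)
h = 1e-6 * torch.tanh(d)   # stand-in for a weak non-linear branch H(D)
u = d + h                   # linear bypass as in GIM
u.sum().backward()
print(d.grad)               # ~ 1 + 1e-6 * (1 - tanh(d)^2), i.e. close to 1 everywhere
```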

D. PIXEL-LEVEL PREDICTION
The segmentation method is widely used in STD by casting the detection task as a text/non-text classification task, especially when dealing with multi-oriented or arbitrarily-oriented text detection. This paper follows the methodology of PixelLink [11], which predicts the linkage between each pixel and its eight adjacent pixels. For the text/non-text prediction, we utilize the softmax function to generate the final feature map with 1 × 2 = 2 channels, representing text/non-text scores respectively. For the pixel-to-pixel linkage prediction, we output a feature map with 8 × 2 = 16 channels, representing positive/negative scores for the eight adjacent pixels.




FIGURE 6: The flowchart of GIM. D^t_i is the output of CAM while U'_{i+1} denotes the bilinear interpolation result of U_{i+1}. The symbol ⓒ represents the concatenation operation and ⊕ represents element-wise addition.

Then two separate thresholds are applied to determine the positive pixels and the positive links. The output segmentation map is obtained by linking the positive pixels with positive links. Finally, the Connected Components (CC) algorithm is employed to find the individual text areas and minAreaRect in OpenCV [45] is applied to obtain the bounding box for every CC.
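The post-processing described above can be sketched with OpenCV as below. The threshold values follow Sections C-E, while the score-map layout and the simplified grouping rule are assumptions; the full PixelLink-style decoding performs union-find over linked pixel pairs.

```python
import cv2
import numpy as np

def decode_boxes(text_score, link_scores, pixel_thr=0.7, link_thr=0.5):
    """Sketch of the decoding step: threshold pixels/links, group connected
    components, and fit minimum-area rectangles.

    text_score: HxW map of text probabilities (assumed layout).
    link_scores: 8xHxW maps of positive-link probabilities to the 8 neighbours.
    """
    pos_pixel = text_score > pixel_thr
    pos_link = link_scores > link_thr
    # Simplification: keep a pixel if it is positive and has at least one positive link.
    keep = pos_pixel & pos_link.any(axis=0)
    num, labels = cv2.connectedComponents(keep.astype(np.uint8), connectivity=8)
    boxes = []
    for cc in range(1, num):
        ys, xs = np.nonzero(labels == cc)
        pts = np.stack([xs, ys], axis=1).astype(np.float32)
        rect = cv2.minAreaRect(pts)          # ((cx, cy), (w, h), angle)
        boxes.append(cv2.boxPoints(rect))    # 4 corner points of the rotated box
    return boxes
```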

E. OPTIMIZATION
The overall loss function is given as follows:

L = λ L_pixel + L_link

where L_pixel and L_link denote the pixel loss and the link loss respectively. λ is set to 2.0 because the pixel classification task is more important than the link prediction task.

Generally, we follow the same methodology as PixelLink [11]. For pixel loss computation, a novel instance-balanced cross-entropy loss is adopted to deal with the large variation of text sizes, which takes the text instance areas as the modulation factor. The link loss is a class-balanced cross-entropy loss which is calculated on positive pixels only. Specific details can be found in PixelLink [11].
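A hedged sketch of how the two terms might be combined with λ = 2.0 is shown below; the pixel and link losses themselves follow PixelLink [11] and are only passed in as precomputed values here.

```python
def total_loss(pixel_loss, link_loss, lam=2.0):
    """Overall objective L = lam * L_pixel + L_link, with lam = 2.0 because
    pixel classification is weighted above link prediction."""
    return lam * pixel_loss + link_loss
```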

IV. EXPERIMENTS
To evaluate the proposed method, we conduct experiments on three commonly used public datasets: ICDAR 2013 (IC13) [46], ICDAR 2015 (IC15) [47] and MSRA-TD500 (TD500) [48]. The experiment settings and the comparison results are presented as follows. A concise description of these three datasets and the accepted evaluation protocols are given in Section A. The implementation details are illustrated in Section B. Experimental results for ICDAR 2013, ICDAR 2015 and MSRA TD500 are given in Section C, Section D and Section E respectively. To reveal the effectiveness of the two proposed modules, ablation studies are given in Section F.

A. DATASETS AND EVALUATION PROTOCOL
1) SynthText in the Wild (SynthText) [49]: The SynthText dataset contains 800k synthesized text images which are rendered on natural images. With a proper learning algorithm to choose and transform the text location, the images tend to have a realistic look. Annotations are given at character, word and text line level. This dataset is used to pre-train the model.
2) ICDAR 2015 (IC15) [47]: IC15 is the most commonly used benchmark for detecting multi-oriented text with word-level quadrilaterals. It is composed of 1000 training images and 500 testing images. This dataset is challenging because of arbitrary orientations, motion blur and low resolutions.
3) ICDAR 2013 (IC13) [46]: IC13 consists of 229 training images and 233 testing images with word-level annotations provided. It is the standard benchmark for evaluating near-horizontal text detection.
4) MSRA TD500 (TD500) [48]: TD500 contains 500 images (300 for training, 200 for testing) containing both Chinese and English texts. Different from IC15 and IC13, annotations of TD500 are at the line level and are represented by aligned horizontal rectangles.

5) Evaluation protocol: In our experiments, we utilize the standard evaluation protocol for text detection, namely precision (P), recall (R) and f-measure (F). Precisely, they are defined as follows:

$$P = \frac{TP}{TP + FP} \tag{13}$$

$$R = \frac{TP}{TP + FN} \tag{14}$$

$$F = \frac{2 \times P \times R}{P + R} \tag{15}$$

where TP, FP and FN denote the correctly classified text instances, the incorrectly spotted instances and the missing instances respectively. A detected text instance b is counted as a positive detection if the IoU between b and a ground truth is larger than a given threshold (normally set to 0.5). Since there is a trade-off between recall and precision, f-measure is a more objective measurement for performance assessment.
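For completeness, a small helper computing Eqs. (13)-(15) from the match counts:

```python
def prf(tp, fp, fn):
    """Precision, recall and f-measure from true/false positives and false negatives."""
    p = tp / (tp + fp) if tp + fp > 0 else 0.0
    r = tp / (tp + fn) if tp + fn > 0 else 0.0
    f = 2 * p * r / (p + r) if p + r > 0 else 0.0
    return p, r, f

# Example: 90 correct detections, 10 false alarms, 20 missed instances
print(prf(90, 10, 20))   # (0.9, 0.8181..., 0.8571...)
```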

B. IMPLEMENTATION DETAILS
The experiments are conducted on a single NVIDIA Titan X GPU with 12 GB memory. The whole model is optimized by SGD with momentum = 0.9 and weight decay = 5 × 10−4. The model is pre-trained on SynthText for 500K iterations, which takes about 0.9 s per iteration. The learning rate is set to 10−3 in the pre-training process. Then we fine-tune the model on IC15, IC13 and TD500 respectively. We set the learning rate to 1 × 10−3 for the first 100K iterations and 1 × 10−4 for the remaining 80K iterations.




TABLE 1: Results on the ICDAR 2013 dataset

Algorithm               Recall(%)   Precision(%)   f-measure(%)
CTPN [20]               83.0        93.0           87.0
PixelLink [11]          87.5        88.6           88.1
TextBoxes [13]          83.0        89.0           86.0
TextBoxes++ [14]        84.0        91.0           88.0
MCN [22]                87.0        88.0           81.0
EAST [17]               82.7        92.6           87.4
SegLink [21]            83.0        87.7           85.3
Mask TextSpotter [50]   88.6        95.0           91.7
SSTD [31]               86.0        88.0           87.0
WordSup [16]            88.0        93.0           90.0
RRD [18]                75.0        88.0           81.0
GISCA (ours)            91.4        92.8           92.1

The input images are resized to 1280 × 768, 768 × 768 and 512 × 512 for IC15, IC13 and TD500 respectively. We adopt the data augmentation strategies of SSD and PixelLink. Input images are first rotated with a probability of 0.2 by a random angle of 0, π/2, π or 3π/2. Then we randomly crop them with areas ranging from 0.1 to 1, while the aspect ratios of the images range from 0.5 to 2. Text instances in the processed images whose shorter side is less than 10 pixels are ignored.
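The optimization schedule described above might be set up as in the following PyTorch-style sketch; the model object and the iteration-based scheduling loop are placeholders rather than the authors' released code.

```python
import torch

# SGD with momentum 0.9 and weight decay 5e-4; learning rate 1e-3 for the
# first 100K fine-tuning iterations and 1e-4 for the remaining 80K.
def make_optimizer(model):
    return torch.optim.SGD(model.parameters(), lr=1e-3,
                           momentum=0.9, weight_decay=5e-4)

def learning_rate(iteration):
    return 1e-3 if iteration < 100_000 else 1e-4

def set_lr(optimizer, iteration):
    for group in optimizer.param_groups:
        group['lr'] = learning_rate(iteration)
```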

C. HORIZONTAL TEXT DETECTION PERFORMANCE: IC13
The final models are fine-tuned for about 180K iterations based on the SynthText pre-trained model. The thresholds on the pixel and link scores used to determine positive or negative are set to 0.7 and 0.5 respectively. Some detection results on IC13 are given in Figure 7(a).

We have evaluated the proposed GISCA on ICDAR 2013, a popular horizontal text dataset. The numerical comparison results are listed in Table 1. In general, our proposed STD model outperforms all existing state-of-the-art methods in terms of recall and f-measure. Specifically, it achieves the best performance among all the regression-based methods [20] [13] [14] [22] [17] [21] [31] [16] [18] on all three evaluation measurements. Besides, GISCA outperforms all the segmentation based text detection algorithms [11] [50] in terms of recall and f-measure. Taking our baseline model PixelLink as an example, GISCA exceeds PixelLink by 3.9%, 4.2% and 4.0% in recall, precision and f-measure respectively. The only exception is that GISCA has a lower precision value than Mask TextSpotter. Considering that Mask TextSpotter uses the results of text recognition to refine its detections, such a high precision value is reasonable. On the whole, the experimental results show the superiority of the proposed GISCA.

D. MULTI-ORIENTED TEXT DETECTION PERFORMANCE: IC15
We follow the experiment settings presented in Section B. The thresholds for pixel and link prediction are both set to 0.8. The results are evaluated by the official submission server for fair comparison. Some qualitative illustrations are shown in Figure 7(b).

TABLE 2: Results on the ICDAR 2015 incidental dataset

Algorithm          Recall(%)   Precision(%)   f-measure(%)
CTPN [20]          51.6        74.2           60.9
EAST [17]          78.3        83.3           80.7
SegLink [21]       76.8        73.1           75.0
MCN [22]           72.0        80.0           76.0
SSTD [31]          73.0        80.0           77.0
WordSup [16]       77.0        79.3           78.2
DeepReg [51]       80.0        82.0           81.0
TextSnake [52]     80.4        84.9           82.6
PixelLink [11]     82.0        85.5           83.7
FTSN [53]          80.0        88.6           84.1
GISCA (ours)       85.5        89.1           87.3

TABLE 3: Results on the MSRA TD500 dataset

Algorithm          Recall(%)   Precision(%)   f-measure(%)
PixelLink [11]     73.2        83.0           77.8
EAST [17]          61.6        81.7           70.2
ITN [19]           65.6        80.3           72.2
RRPN [15]          68.0        82.0           74.0
SegLink [21]       70.0        86.0           77.0
RRD [18]           73.0        87.0           79.0
GISCA (ours)       77.1        86.3           81.4

The IC15 text detection task is more challenging than IC13 owing to its multi-orientation characteristics in complex contexts. Besides, images in IC15 are of low resolution and contain many small text instances. Table 2 compares GISCA with previously published state-of-the-art methods. The experimental results reveal the proposed method's advantages under complex background and low resolution conditions. More specifically, the proposed method outperforms all existing algorithms in terms of recall (85.5%), precision (89.1%) and f-measure (87.3%). In particular, in f-measure GISCA leads the second-place FTSN by 3.2%. Therefore, GISCA has a clear advantage in detecting multi-oriented text.

E. ORIENTED MULTI-LINGUAL TEXT LINE DETECTION PERFORMANCE: TD500
MSRA TD500 is a dataset which contains both English and Chinese texts. The annotations are given in the form of text lines. The thresholds for pixel and link are set to 0.8 and 0.7 respectively.

Some qualitative detection results are illustrated in Figure 7(c) and the comparison results are listed in Table 3. We can conclude that GISCA can successfully detect long text lines of arbitrary orientations and shapes. More specifically, GISCA surpasses the existing state-of-the-art results by more than 2% in terms of f-measure. Although RRD slightly surpasses the proposed method in precision, the proposed method is still competitive considering that RRD is a detector specifically designed for long oriented texts.




FIGURE 7: Some qualitative detection results on (a) ICDAR 2013, (b) ICDAR 2015, and (c) MSRA-TD500.

TABLE 4: Ablation experiments on IC13

Algorithm        Recall(%)   Precision(%)   f-measure(%)
Baseline         87.5        88.6           88.1
Baseline+CAM     91.5        92.4           91.9
Baseline+GIM     91.4        92.2           91.8
CAM+GIM          91.4        92.8           92.1

F. ABLATION EXPERIMENTS

Our proposed multi-oriented text detection model involves two innovative components, namely the Contextual Attention Module (CAM) and the Gradient-inductive Module (GIM). To verify the validity of these two modules, we conduct ablation studies on ICDAR 2013, ICDAR 2015 and MSRA TD500 respectively. Four sets of comparison results are listed in Tables 4, 5 and 6. For convenience, we use Baseline to denote the original PixelLink model, which takes the commonly used U-Net as the backbone network. The second experiment, Baseline+CAM, replaces the skip connections in U-Net with the proposed CAM. The third is Baseline+GIM, which follows the architecture of U-Net but uses the proposed GIM in the up-sampling process. The last is CAM+GIM, which incorporates both CAM and GIM in the backbone network.

With or without CAM: As shown in Tables 4, 5 and 6, Baseline+CAM outperforms Baseline in terms of recall, precision and f-measure.

TABLE 5: Ablation experiments on IC15

Algorithm        Recall(%)   Precision(%)   f-measure(%)
Baseline         82.0        85.5           83.7
Baseline+CAM     84.5        87.9           86.2
Baseline+GIM     83.5        85.6           84.5
CAM+GIM          85.5        89.1           87.3

TABLE 6: Ablation experiments on TD500

Algorithm        Recall(%)   Precision(%)   f-measure(%)
Baseline         73.2        83.0           77.8
Baseline+CAM     75.9        85.3           80.3
Baseline+GIM     72.5        84.6           78.1
CAM+GIM          77.1        86.3           81.4

For instance, considering f-measure, Baseline+CAM surpasses Baseline by 3.8%, 2.5% and 2.5% on IC13, IC15 and TD500 respectively. Similarly, CAM+GIM also exceeds Baseline+GIM on all three measurements. These two sets of comparison experiments show that the Contextual Attention Module is an efficient structure for the text detection task. The improvement of the precision value indicates that CAM effectively reduces the false detection rate and thus alleviates the FP problem. The up-sampling process with and without CAM is shown in Figure 8, which intuitively displays the effects of CAM.

With or without GIM: To validate the potency of GIM, we compare Baseline with Baseline+GIM, and Baseline+CAM with CAM+GIM.




FIGURE 8: The up-sampling process results (feature maps U4∼U1), (a) with CAM and (b) without CAM. The symbol in the upper right corner is the name of the feature map. It is obvious that when CAM is integrated into the network, the feature maps filter out the background interference and focus on the text area.

Taking the comparison between Baseline and Baseline+GIM into account, the latter outperforms the former on all three datasets and under almost all three measurement criteria. The only exception is that Baseline surpasses Baseline+GIM in recall on TD500, which could be attributable to the large variation in text size. The other set of comparisons, between Baseline+CAM and CAM+GIM, also indicates that GIM is an efficient module in most cases.

V. CONCLUSION
In this work, we have presented an end-to-end trainable scene text detector named GISCA, which is built on the U-Net framework. More specifically, a novel Contextual Attention Module (CAM) is carefully designed to highlight the salient text areas, which leads to a reduction of false positive results. Besides, the proposed Gradient-inductive Module (GIM) is designed to provide a linear bypass which aims at alleviating the gradient vanishing/exploding problems. Experimental results on three widely-used datasets (ICDAR 2013, ICDAR 2015 and MSRA TD500) demonstrate the superiority of the proposed method, which outperforms all state-of-the-art methods in terms of f-measure. In the future, we plan to investigate the universality of CAM and GIM in deep neural networks and to explore curved text detection using the proposed method.

ACKNOWLEDGMENT
This paper was partially supported by the Shenzhen Science & Technology Fundamental Research Program (JCYJ20160330095814461), National Engineering Laboratory for Video Technology - Shenzhen Division, Shenzhen Key Lab for IMVR (ZDSYS201703031405467) and Shenzhen Municipal Development and Reform Commission (Disciplinary Development Program for Data Science and Intelligent Computing). Special acknowledgements are given to Aoto-PKUSZ Joint Lab (AISR) for its support.

REFERENCES
[1] Chulmoo Kang, Gunhee Kim, and Suk I Yoo. Detection and recognition of text embedded in online images via neural context models. In Thirty-First AAAI Conference on Artificial Intelligence, 2017.
[2] Xuejian Rong, Chucai Yi, and Yingli Tian. Recognizing text-based traffic guide panels with cascaded localization network. In European Conference on Computer Vision, pages 109-121. Springer, 2016.
[3] Bo Xiong and Kristen Grauman. Text detection in stores using a repetition prior. In 2016 IEEE Winter Conference on Applications of Computer Vision (WACV), pages 1-9. IEEE, 2016.
[4] Chucai Yi and Yingli Tian. Scene text recognition in mobile applications by character descriptor and structure configuration. IEEE Transactions on Image Processing, 23(7):2972-2982, 2014.
[5] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097-1105, 2012.
[6] Qixiang Ye and David Doermann. Text detection and recognition in imagery: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(7):1480-1500, 2015.
[7] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems, pages 91-99, 2015.
[8] Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, and Alexander C Berg. SSD: Single shot multibox detector. In European Conference on Computer Vision, pages 21-37. Springer, 2016.
[9] Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3431-3440, 2015.
[10] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770-778, 2016.
[11] Dan Deng, Haifeng Liu, Xuelong Li, and Deng Cai. PixelLink: Detecting scene text via instance segmentation. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.




[12] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[13] Minghui Liao, Baoguang Shi, Xiang Bai, Xinggang Wang, and Wenyu Liu. TextBoxes: A fast text detector with a single deep neural network. In Thirty-First AAAI Conference on Artificial Intelligence, 2017.
[14] Minghui Liao, Baoguang Shi, and Xiang Bai. TextBoxes++: A single-shot oriented scene text detector. IEEE Transactions on Image Processing, 27(8):3676-3690, 2018.
[15] Jing Huang, Viswanath Sivakumar, Mher Mnatsakanyan, and Guan Pang. Improving rotated text detection with rotation region proposal networks. arXiv preprint arXiv:1811.07031, 2018.
[16] Han Hu, Chengquan Zhang, Yuxuan Luo, Yuzhuo Wang, Junyu Han, and Errui Ding. WordSup: Exploiting word annotations for character based text detection. In Proceedings of the IEEE International Conference on Computer Vision, pages 4940-4949, 2017.
[17] Xinyu Zhou, Cong Yao, He Wen, Yuzhi Wang, Shuchang Zhou, Weiran He, and Jiajun Liang. EAST: An efficient and accurate scene text detector. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5551-5560, 2017.
[18] Minghui Liao, Zhen Zhu, Baoguang Shi, Gui-song Xia, and Xiang Bai. Rotation-sensitive regression for oriented scene text detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5909-5918, 2018.
[19] Fangfang Wang, Liming Zhao, Xi Li, Xinchao Wang, and Dacheng Tao. Geometry-aware scene text detection with instance transformation network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1381-1389, 2018.
[20] Zhi Tian, Weilin Huang, Tong He, Pan He, and Yu Qiao. Detecting text in natural image with connectionist text proposal network. In European Conference on Computer Vision, pages 56-72. Springer, 2016.
[21] Baoguang Shi, Xiang Bai, and Serge Belongie. Detecting oriented text in natural images by linking segments. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2550-2558, 2017.
[22] Zichuan Liu, Guosheng Lin, Sheng Yang, Jiashi Feng, Weisi Lin, and Wang Ling Goh. Learning Markov clustering networks for scene text detection. arXiv preprint arXiv:1805.08365, 2018.
[23] Zheng Zhang, Chengquan Zhang, Wei Shen, Cong Yao, Wenyu Liu, and Xiang Bai. Multi-oriented text detection with fully convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4159-4167, 2016.
[24] Cong Yao, Xiang Bai, Nong Sang, Xinyu Zhou, Shuchang Zhou, and Zhimin Cao. Scene text detection via holistic, multi-channel prediction. arXiv preprint arXiv:1606.09002, 2016.
[25] Dafang He, Xiao Yang, Chen Liang, Zihan Zhou, Alexander G Ororbia, Daniel Kifer, and C Lee Giles. Multi-scale FCN with cascaded instance aware segmentation for arbitrary oriented word spotting in the wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3519-3528, 2017.
[26] Yue Wu and Prem Natarajan. Self-organized text detection with minimal post-processing via border learning. In Proceedings of the IEEE International Conference on Computer Vision, pages 5000-5009, 2017.
[27] Volodymyr Mnih, Nicolas Heess, Alex Graves, et al. Recurrent models of visual attention. In Advances in Neural Information Processing Systems, pages 2204-2212, 2014.
[28] Saumya Jetley, Nicholas A Lord, Namhoon Lee, and Philip HS Torr. Learn to pay attention. arXiv preprint arXiv:1804.02391, 2018.
[29] Fei Wang, Mengqing Jiang, Chen Qian, Shuo Yang, Cheng Li, Honggang Zhang, Xiaogang Wang, and Xiaoou Tang. Residual attention network for image classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3156-3164, 2017.
[30] Tong He, Weilin Huang, Yu Qiao, and Jian Yao. Text-attentional convolutional neural network for scene text detection. IEEE Transactions on Image Processing, 25(6):2529-2541, 2016.
[31] Pan He, Weilin Huang, Tong He, Qile Zhu, Yu Qiao, and Xiaolin Li. Single shot text detector with regional attention. In Proceedings of the IEEE International Conference on Computer Vision, pages 3047-3055, 2017.
[32] Zhida Huang, Zhuoyao Zhong, Lei Sun, and Qiang Huo. Mask R-CNN with pyramid attention network for scene text detection. In 2019 IEEE Winter Conference on Applications of Computer Vision (WACV), pages 764-772. IEEE, 2019.
[33] Xue Yang, Kun Fu, Hao Sun, Jirui Yang, Zhi Guo, Menglong Yan, Tengfei Zhan, and Sun Xian. R2CNN++: Multi-dimensional attention based rotation invariant detector with robust anchor strategy. arXiv preprint arXiv:1811.07126, 2018.
[34] Rupesh Kumar Srivastava, Klaus Greff, and Jürgen Schmidhuber. Highway networks. arXiv preprint arXiv:1505.00387, 2015.
[35] Gao Huang, Yu Sun, Zhuang Liu, Daniel Sedra, and Kilian Q Weinberger. Deep networks with stochastic depth. In European Conference on Computer Vision, pages 646-661. Springer, 2016.
[36] Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kilian Q Weinberger. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4700-4708, 2017.
[37] Bharath Hariharan, Pablo Arbeláez, Ross Girshick, and Jitendra Malik. Hypercolumns for object segmentation and fine-grained localization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 447-456, 2015.
[38] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-Net: Convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 234-241. Springer, 2015.
[39] Enze Xie, Yuhang Zang, Shuai Shao, Gang Yu, Cong Yao, and Guangyao Li. Scene text detection with supervised pyramid context network. arXiv preprint arXiv:1811.08605, 2018.
[40] Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2117-2125, 2017.
[41] Alejandro Newell, Kaiyu Yang, and Jia Deng. Stacked hourglass networks for human pose estimation. In European Conference on Computer Vision, pages 483-499. Springer, 2016.
[42] Sepp Hochreiter. The vanishing gradient problem during learning recurrent neural nets and problem solutions. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 6(02):107-116, 1998.
[43] Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio. Understanding the exploding gradient problem. CoRR, abs/1211.5063, 2012.
[44] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1-9, 2015.
[45] Gary Bradski and Adrian Kaehler. OpenCV. Dr. Dobb's Journal of Software Tools, 3, 2000.
[46] Dimosthenis Karatzas, Faisal Shafait, Seiichi Uchida, Masakazu Iwamura, Lluis Gomez i Bigorda, Sergi Robles Mestre, Joan Mas, David Fernandez Mota, Jon Almazan Almazan, and Lluis Pere De Las Heras. ICDAR 2013 robust reading competition. In 2013 12th International Conference on Document Analysis and Recognition, pages 1484-1493. IEEE, 2013.
[47] Dimosthenis Karatzas, Lluis Gomez-Bigorda, Anguelos Nicolaou, Suman Ghosh, Andrew Bagdanov, Masakazu Iwamura, Jiri Matas, Lukas Neumann, Vijay Ramaseshan Chandrasekhar, Shijian Lu, et al. ICDAR 2015 competition on robust reading. In 2015 13th International Conference on Document Analysis and Recognition (ICDAR), pages 1156-1160. IEEE, 2015.
[48] Cong Yao, Xiang Bai, Wenyu Liu, Yi Ma, and Zhuowen Tu. Detecting texts of arbitrary orientations in natural images. In 2012 IEEE Conference on Computer Vision and Pattern Recognition, pages 1083-1090. IEEE, 2012.
[49] Ankush Gupta, Andrea Vedaldi, and Andrew Zisserman. Synthetic data for text localisation in natural images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2315-2324, 2016.
[50] Pengyuan Lyu, Minghui Liao, Cong Yao, Wenhao Wu, and Xiang Bai. Mask TextSpotter: An end-to-end trainable neural network for spotting text with arbitrary shapes. In Proceedings of the European Conference on Computer Vision (ECCV), pages 67-83, 2018.
[51] Wenhao He, Xu-Yao Zhang, Fei Yin, and Cheng-Lin Liu. Deep direct regression for multi-oriented scene text detection. In Proceedings of the IEEE International Conference on Computer Vision, pages 745-753, 2017.
[52] Shangbang Long, Jiaqiang Ruan, Wenjie Zhang, Xin He, Wenhao Wu, and Cong Yao. TextSnake: A flexible representation for detecting text of arbitrary shapes. In Proceedings of the European Conference on Computer Vision (ECCV), pages 20-36, 2018.
[53] Yuchen Dai, Zheng Huang, Yuting Gao, Youxuan Xu, Kai Chen, Jie Guo, and Weidong Qiu. Fused text segmentation networks for multi-oriented scene text detection. In 2018 24th International Conference on Pattern Recognition (ICPR), pages 3604-3609. IEEE, 2018.

