
Image Retrieval and Pattern Spotting using Siamese Neural Network

Kelly L. Wiggers∗, Alceu S. Britto Jr.∗¶, Laurent Heutte†, Alessandro L. Koerich‡ and Luiz S. Oliveira§

∗Pontifical Catholic University of Parana (PUCPR), Curitiba (PR), Brazil. Email: {wiggers, alceu}@ppgia.pucpr.br

†Normandie Univ, UNIROUEN, UNIHAVRE, INSA Rouen, LITIS, Rouen, France. Email: [email protected]

‡École de Technologie Supérieure (ÉTS), Université du Québec, Montréal (QC), Canada. Email: [email protected]

§Federal University of Paraná, Curitiba (PR), Brazil. Email: [email protected]

¶State University of Ponta Grossa (UEPG), Ponta Grossa (PR), Brazil

Abstract—This paper presents a novel approach for image retrieval and pattern spotting in document image collections. Manual feature engineering is avoided by learning a similarity-based representation using a Siamese Neural Network trained on a previously prepared subset of image pairs from the ImageNet dataset. The learned representation is used to provide the similarity-based feature maps used to find relevant image candidates in the data collection given an image query. A robust experimental protocol based on the public Tobacco800 document image collection shows that the proposed method compares favorably against state-of-the-art document image retrieval methods, reaching 0.94 and 0.83 mean average precision (mAP) for retrieval and pattern spotting (IoU=0.7), respectively. Besides, we have evaluated the proposed method considering feature maps of different sizes, showing the impact of reducing the number of features on retrieval performance and processing time.

Index Terms—Siamese network, image retrieval, pattern spotting

I. INTRODUCTION

The large volume of digital information stored in the last two decades by modern society has been the primary motivation for researchers in the Pattern Recognition area to investigate new methods for automatic indexing and retrieval of digital material (images, videos, and audio) based on their contents. Significant progress may be observed in the field of Content-Based Image Retrieval (CBIR), in which relevant image candidates must be found in large digital collections based on a given query, represented either by a whole image or just a pattern available in it [1]–[5]. A more recent and exciting challenge in the CBIR field has been to perform the retrieval process without any contextual information, i.e. without previous knowledge about the patterns to be detected. This lack of information makes it unfeasible to use strategies where specific models are trained on the patterns (objects) to be retrieved. Such a challenge is common in CBIR solutions devoted to vast libraries of document images, where the main tasks are usually image retrieval and pattern spotting. In the former, the objective is to find every document image that contains the given query, while in the latter an additional difficulty is to provide the query location in the retrieved images.

The lack of contextual information mentioned above, combined with the wide variability in terms of scale, color and texture of the patterns present in document collections, which may include seals, logos, faces, and initial letters, makes the definition of a robust representation scheme (feature extraction) a truly challenging problem. In the literature, one may find some promising results of traditional representation methods based on local descriptors such as bag-of-visual-words (BoW) [6]. However, significant contributions have recently been achieved through the use of deep learning based methods, mainly by performing feature extraction with Convolutional Neural Networks (CNN) [7] [8] [9]. Recently, Luo et al. [10] and Wiggers et al. [11] proposed the use of pre-trained CNNs to provide the image representation (feature extraction), achieving very promising results. In contrast to these works, in the present work two deep (CNN-based) models are organized in a Siamese architecture. Such a deep architecture has shown successful results in different applications, such as face verification [12] [13] and gesture recognition, to predict whether an input pair of images is similar or not.

In this paper, we address both image retrieval and pattern spotting tasks by using the feature map of a Siamese Neural Network (SNN) trained on the ImageNet dataset to learn how to represent the similarity between two images. With the support of two important concepts, representation and transfer learning, we use the feature map of the trained SNN in our solution to provide the similarity between a given image query and each object candidate (sub-image) detected in the document images. The main contributions of our work are threefold: a) evaluation of the SNN similarity representation (feature maps) to perform the retrieval and spotting tasks considering the difficulties inherent to document images; b) evaluation of the transfer learning scheme in which the model trained on regular images of the ImageNet is used without any tuning to perform the retrieval and spotting tasks in the context of document images; c) evaluation of the impact of the SNN feature map size in terms of processing time and performance in both tasks, image retrieval and pattern spotting.

A robust experimental protocol using the public Tobacco800 document image collection shows that the proposed method compares favorably against state-of-the-art document image retrieval methods, reaching 0.94 and 0.83 mean average precision (mAP) for retrieval and pattern spotting, respectively, when considering the retrieval of the Top-5 relevant candidates and IoU=0.7. In summary, the proposed approach succeeds in retrieving relevant image candidates and precisely localizing the searched patterns within them. Additional experiments show that the SNN representation generalizes well, improving the matching performance when compared to deep features of a CNN trained on an image classification task.

The remainder of this paper is organized as follows: Section 2 presents the state-of-the-art techniques in image retrieval and Siamese architectures. The proposed method is introduced in Section 3: the representation learning based on the SNN, and each step of the image retrieval and pattern spotting tasks. Section 4 presents the Tobacco800 database used for the benchmark, followed by the experimental protocol and results. Finally, Section 5 presents our conclusions and future work.

II. RELATED WORK

Different techniques can be used to retrieve information from a collection of documents, but they are usually organized in a similar two-step process [14], as follows: a) in an offline step, image candidates are extracted from the document images and indexed using a suitable representation scheme (a feature vector); b) in an online phase, given an input query image, a measure of similarity is used to compare it with the image candidates extracted from the stored documents, returning a ranked list.
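As an illustration of this two-step organization, the minimal sketch below separates the offline indexing from the online search. It is only an illustration, not the implementation of any of the cited systems: extract_features stands in for whatever representation scheme is chosen (e.g. the SNN feature maps used later in this paper).

```python
import numpy as np

def build_index(candidate_images, extract_features):
    """Offline step: represent each candidate once and store the feature vectors."""
    return np.stack([extract_features(img) for img in candidate_images])

def search(query_image, index, extract_features, top_k=5):
    """Online step: compare the query against the index and return a ranked list."""
    q = extract_features(query_image)
    distances = np.linalg.norm(index - q, axis=1)  # similarity via Euclidean distance
    return np.argsort(distances)[:top_k]           # indices of the Top-k candidates
```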

As mentioned before, the main challenges are the lack of previous knowledge about the images to be retrieved and the need for a representation invariant to changes in terms of scale, color, texture, and localization of the query in the image collection. In addition, the query can be presented as a whole document image or a small sub-image (a pattern to be detected). Recently, many papers have been dedicated to image retrieval [15]–[17], while Alaei et al. [18] presented an interesting survey of systems for recovering logos and seals on administrative documents.

Going further into a CBIR system, we can identify three common steps, as follows: a) pre-processing, b) representation (feature extraction), and c) retrieval (matching). In the pre-processing step, the data collection is basically prepared for indexing, which is crucial for the success of the image retrieval process. An object detection scheme, usually based on a sliding window mechanism or on alternative techniques that divide the document image into small regions, results in a set of indexed candidates without the need to know their size [19]. One may find in the literature alternative approaches to generate the candidate regions, such as Selective Search [20], Edge Boxes [21] or BING [22]. An interesting comparison of several algorithms to generate image candidates is presented by Zitnick and Dollár [21]. In their experiments, the Edge Boxes [21] and Selective Search [20] techniques showed the most promising results in terms of relevant object detection, while BING proved to be a faster algorithm but with lower precision.
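For concreteness, candidate generation with Selective Search can be sketched as below using the implementation shipped with opencv-contrib-python; this is only an assumed setup for illustration, not the exact configuration used in the works cited above, and the input file name is hypothetical.

```python
import cv2

# Load one document page and generate region proposals with Selective Search [20].
img = cv2.imread("document_page.png")  # hypothetical input file

ss = cv2.ximgproc.segmentation.createSelectiveSearchSegmentation()
ss.setBaseImage(img)
ss.switchToSelectiveSearchFast()  # the "quality" mode yields more boxes but is slower

rects = ss.process()  # array of (x, y, w, h) proposals
candidates = [img[y:y + h, x:x + w] for (x, y, w, h) in rects[:2000]]
```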

In the second step (representation), the detected candidates are represented through the use of an appropriate feature extraction method. Here, as in most pattern recognition fields, deep features based on CNNs are an interesting new alternative. In the last few years, deep architectures have been used in image retrieval as feature learning methods [23] [24]. Recently, Gordo et al. [9] evaluated the use of a CNN as a feature extractor on several public datasets such as ImageNet, CIFAR-10, and MNIST.

Babenko and Lempitsky [7] explored the use of pre-trained models (transfer learning), observing an improvement in retrieval performance. In [25], the authors suggest that feature vectors can be obtained by training Siamese neural networks using pairs of images. The Siamese model has been successfully used for face recognition [26] [13], and for signature or symbol identification [12]. In an SNN, after the convolutional layers, a distance measure between the output vectors is computed. The name "Siamese" historically comes from the need to collect the state of a single network for two different activation samples during training. It can be seen as using two identical, parallel neural networks sharing the same set of weights [27]. The first idea of a Siamese network was published by Bromley et al. [28] for signature verification problems.
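The weight-sharing idea can be made concrete with a short sketch. The PyTorch model below is our own minimal stand-in (the encoder is a toy network, not the AlexNet-based branches used in this paper): a single encoder processes both inputs, and a distance between the two feature vectors is returned.

```python
import torch
import torch.nn as nn

class SiameseNet(nn.Module):
    def __init__(self, dim=128):
        super().__init__()
        # One encoder, used for both branches: this weight sharing is
        # what makes the network "Siamese".
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(4), nn.Flatten(),
            nn.Linear(32 * 4 * 4, dim),
        )

    def forward(self, x1, x2):
        f1, f2 = self.encoder(x1), self.encoder(x2)
        return nn.functional.pairwise_distance(f1, f2)  # distance between feature vectors
```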

Qi et al. [29] also proposed a sketch-based image retrieval approach that pulls output feature vectors closer for input pairs labeled as similar and pushes them apart if irrelevant. Although this approach is related to our work, the proposed architecture and image search are different. In our method, we decided to evaluate the generalization of a model trained on ImageNet, avoiding fine-tuning on the target document images.

A recent work by Chung et al. [30] focused on a deep Siamese CNN pre-trained on the ImageNet dataset. The proposed method uses a contrastive loss function and showed good performance on a diabetic retinopathy fundus image dataset. However, the authors use only binary image pair information. In contrast, in document images the features can be more complex, with variations in shape, color, and texture. Lin et al. [31] propose a Siamese model to learn a feature representation and find matches between street-view and aerial-view imagery. However, they use their own dataset to learn the representation, which is not feasible in our case.
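For reference, the contrastive loss commonly used to train such Siamese models (the standard formulation of Hadsell, Chopra and LeCun, 2006; the variant in [30] may differ in details) can be sketched as:

```python
import torch

def contrastive_loss(distance, label, margin=1.0):
    """label = 1 for similar pairs, 0 for dissimilar pairs.
    Similar pairs are pulled together; dissimilar pairs are pushed
    apart until their distance exceeds the margin."""
    positive = label * distance.pow(2)
    negative = (1.0 - label) * torch.clamp(margin - distance, min=0.0).pow(2)
    return 0.5 * (positive + negative).mean()
```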

The third step of a CBIR system (retrieval) usually requires a distance metric to compute the similarity between the feature vectors.


The overlap estimation considers the position of the query (x, y) and its corresponding area q1, and the position of the candidate (x1, y1) and its corresponding area o1, as denoted in Equation 2 [37].

IoU(x, y) = \frac{q_1 \cap o_1}{q_1 \cup o_1}    (2)

The relevance of a candidate is related to its overlap with the query. For this purpose, the Intersection over Union (IoU) is computed. IoU=0.7 is considered a reasonably good result, since IoU=0.5 is considered too small and IoU=0.9 very restrictive [21]. However, we consider the range 0.1 ≤ IoU ≤ 0.7 to determine whether a positive candidate has been retrieved, and at the end the precision and the recall are calculated. Finally, the mAP is calculated to evaluate the results over all the queries.
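For axis-aligned bounding boxes given as (x, y, w, h), Equation 2 reduces to the following computation (a straightforward sketch, using our own box convention):

```python
def iou(query_box, cand_box):
    """Intersection over Union of two (x, y, w, h) boxes, as in Equation 2."""
    qx, qy, qw, qh = query_box
    cx, cy, cw, ch = cand_box
    # Width and height of the intersection rectangle (zero if disjoint).
    iw = max(0, min(qx + qw, cx + cw) - max(qx, cx))
    ih = max(0, min(qy + qh, cy + ch) - max(qy, cy))
    inter = iw * ih
    union = qw * qh + cw * ch - inter
    return inter / union if union > 0 else 0.0
```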

IV. EXPERIMENTS

This section describes our experiments on the Tobacco800 dataset. This public subset of the Complex Document Image Processing (CDIP) collection was created by the Illinois Institute of Technology [38]. The 1,290 document images were labeled by the Language and Media Processing Laboratory (University of Maryland). It contains 412 document images with logos and 878 without logos. In our experiments, we have considered the 21 categories presenting two or more occurrences, making 418 queries for the search process.

A. Experimental Protocol

In the pre-processing stage, the improved Selective Search algorithm produces 1.2M regions of interest from the 1,290 documents in the database. All these images were considered in our experiments. The improved version of SS increased by 13× the number of generated objects that overlap the query by more than 90%, compared to our previous work [11]. In fact, we have considered the aspect ratio of the query image to guide the SS algorithm. A candidate c is only considered for retrieval if its aspect ratio height_c/width_c differs by at most 25% from that of the query q. For example, if a query has height_q/width_q = 0.5, the candidate similarity calculation will only be performed if the candidate has height_c/width_c between 0.375 and 0.625. For the Tobacco800 dataset, applying this query-based contextual information reduced the initial 1.2M candidates to 873,876, improving the quality of the pre-processing stage (candidate generation).
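The aspect-ratio filter described above amounts to a one-line test per candidate; a minimal sketch (with hypothetical helper names) follows.

```python
def keep_candidate(query_wh, cand_wh, tolerance=0.25):
    """Keep candidate c only if height_c/width_c is within 25% of the
    query's aspect ratio, as described above."""
    (qw, qh), (cw, ch) = query_wh, cand_wh
    q_ratio, c_ratio = qh / qw, ch / cw
    return abs(c_ratio - q_ratio) <= tolerance * q_ratio

# Example from the text: a 0.5 query ratio accepts ratios in [0.375, 0.625].
assert keep_candidate((200, 100), (160, 90))      # ratio 0.5625: kept
assert not keep_candidate((200, 100), (100, 70))  # ratio 0.7: rejected
```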

The SNN architecture trained on the ImageNet dataset provides 4096-dimensional feature maps corresponding to its original fc7 layer. We have also investigated the addition of a new layer (fcnew) to reduce the feature maps to 512, 256 and 128 dimensions. These values were used in [11] to reduce the dimensionality of the feature map of a CNN model. In addition, we decided to use the SNN as a feature extractor to avoid processing each candidate image more than once. Table I compares the time spent in the retrieval task for some queries when using the distance layer implemented inside the SNN versus the SNN used as a feature extractor. In the latter, the feature map is extracted using the fully-connected layer, and then an external Euclidean distance measure is computed. We observed that the computational time to retrieve a single query using the SNN as a feature extractor is about 11 times lower.

TABLE I: Computational time for the retrieval task (in seconds): distance layer of the Siamese model vs. the Siamese model used as a feature extractor, with the 256-dimensional feature map.

# of Candidates   Siamese Distance Layer   Siamese as Feature Extractor
210,148           571.43                   52.08
124,060           427.56                   35.29
167,571           492.16                   54.02
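The speed-up comes from processing each candidate exactly once: with the distance layer, every query requires one Siamese forward pass per (query, candidate) pair, whereas the extractor variant indexes the candidates offline and reduces the online step to a single forward pass plus a vectorized distance. A sketch under assumed names (branch standing in for one trained SNN branch, with dummy data) is shown below.

```python
import numpy as np
import torch

@torch.no_grad()
def extract_features(branch, images, batch_size=256):
    """Run a single SNN branch as a feature extractor (fc7-like output),
    so every image is processed exactly once."""
    feats = []
    for i in range(0, len(images), batch_size):
        batch = torch.stack(images[i:i + batch_size])
        feats.append(branch(batch).cpu().numpy())
    return np.concatenate(feats)

# Dummy stand-ins for the trained branch and the candidate crops.
branch = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 64 * 64, 256))
candidates = [torch.randn(3, 64, 64) for _ in range(1000)]
query = torch.randn(3, 64, 64)

index = extract_features(branch, candidates)             # offline, done once
q = extract_features(branch, [query])[0]                 # online, one forward pass
ranking = np.argsort(np.linalg.norm(index - q, axis=1))  # external Euclidean distance
```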

The performance of both tasks, image retrieval and pattern spotting, is evaluated using the mean Average Precision (mAP) and the Recall over all queries. In the experiments, we evaluated the effect of the choices made for the three method stages mentioned above, and we compared the final results with our previous method [11] and the current state-of-the-art.
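For clarity, a common way of computing the mAP over ranked lists is sketched below; the exact relevance criterion used in this paper (Top-k truncation and, for spotting, the IoU threshold) is what determines the 0/1 relevance flags.

```python
import numpy as np

def average_precision(relevance):
    """AP for one query: mean of precision@i taken at each relevant hit,
    where `relevance` is a 0/1 list over the ranked candidates."""
    rel = np.asarray(relevance, dtype=float)
    if rel.sum() == 0:
        return 0.0
    precision_at_i = np.cumsum(rel) / (np.arange(len(rel)) + 1)
    return float((precision_at_i * rel).sum() / rel.sum())

def mean_average_precision(ranked_relevance_lists):
    """mAP: average of the per-query APs over all queries."""
    return float(np.mean([average_precision(r) for r in ranked_relevance_lists]))

# Toy example with two queries judged over their Top-5 candidates.
print(mean_average_precision([[1, 1, 0, 1, 0], [0, 1, 0, 0, 1]]))  # ~0.683
```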

TABLE II: mAP with Euclidean distance and Top-k ranking, for Siamese feature map sizes n = {4096, 512, 256, 128}.

Feature Map   Top-5   Top-10   Top-25   Top-50   Top-100
4096          0.944   0.872    0.735    0.546    0.306
512           0.744   0.640    0.486    0.358    0.220
256           0.746   0.654    0.503    0.368    0.227
128           0.738   0.655    0.506    0.373    0.229

B. Results and Discussion

Table II shows the experimental results of the proposed image retrieval method, comparing the SNN with a 4096-dimensional feature map against versions with n entries, where n = {512, 256, 128}. The similarity between the network feature maps was calculated using the Euclidean distance. Afterwards, the Top-k candidates were chosen to generate a ranked list of relevant image candidates according to the mAP, where k = {5, 10, 25, 50, 100}. As one may see in Table II, the best results were achieved by using AlexNet with 4096 features, resulting in 0.94 and 0.87 mAP in the Top-5 and Top-10 rankings, respectively. The recall is higher than 90% in the Top-5 and Top-10. With the additional layer, the best results correspond to 256 features, with a mAP of 0.74 for the Top-5 ranking. The performance is significantly better using the 4096-dimensional feature map than any version with the additional layer. However, the computational cost to index all the vectors of the dataset is much higher.

Table III shows some qualitative results of the logos retrieved using the SNN with the full feature map (4096-dimensional). These results are very promising, since many correct logos were retrieved in the Top-5 ranking. We can observe


TABLE VIII: Qualitative spotting results for some queries: the query logo and the first logo retrieved by similarity using the 4096-dimensional feature map, with its location (red square).

TABLE IX: Comparison with the state-of-the-art for pattern spotting (Top-5, IoU ≥ 0.6, classes with at least three samples).

Approach               mAP     Recall (%)
Le et al. [41]         0.970   88.42
Le et al. [42]         0.910   88.78
Siamese Model (4096)   0.922   92.72
Siamese Model (256)    0.718   73.19

With the 256-dimensional and 128-dimensional feature maps, the results were 9.20% and 5.07% worse, respectively.

Finally, the computational time of the proposed method was evaluated. The computational resources consist of one AMD Ryzen 5 1600 CPU, 32GB of RAM, and one NVIDIA GTX 1080 GPU with 8GB of memory and 2560 CUDA cores. Table VI compares the number of candidates processed per second by each reduced approach (512, 256 and 128 dimensions) and by the original feature map (4096 dimensions). As one may see, the number of candidates processed per second increased by approximately 6.10%, 9.20% and 8.40% when considering the feature maps with 512, 256 and 128 dimensions, while in the retrieval process it increased by almost 2, 3.5 and 5 times, respectively.

Table VII shows the experimental results of the pattern spotting task. Here, a candidate is taken as relevant if it overlaps enough with the image query. We can observe that our best results were achieved using the whole feature map (4096), but the results for the reduced maps are quite competitive. Despite the reduction of about 0.12 in mAP, they have shown a significant reduction in processing time, as shown in Table VI. Still in Table VII, we can see a small gap in localization performance between IoU=0.1 and IoU=0.7. This means that our system succeeded not only in retrieving the relevant image candidates for each query, but also in finding the query position precisely. A qualitative analysis can be seen in Table VIII, which shows some queries and the first candidate retrieved with its respective location.

Table IX shows a comparison with the current state-of-the-art for pattern spotting on the Tobacco800 dataset. For this comparison we have used the same experimental parameters as [41], [42], which consider the Top-5, IoU ≥ 0.6, and classes with at least three samples per category. The mAP achieved by the proposed Siamese model with 4096 features is 1.31% better than the mAP achieved by Le et al. [42], but it did not outperform the mAP presented in [41]. On the other hand, the recall is 4.44% and 4.86% better than [42] and [41], respectively. However, it is important to highlight that both [41] and [42] require previous knowledge of the logo gallery, which is not necessary for the proposed method.

V. CONCLUSION

This paper presented a novel approach for image retrieval and pattern spotting in document images, where the images are represented using a Siamese model trained on the ImageNet dataset. We evaluated the results of the conventional AlexNet architecture and of a modified version with an additional layer aimed at reducing the dimension of the feature map. Our experimental results were very promising. It was possible to observe an increase in the mAP, since the features generalize well and improve the matching performance compared to features obtained with networks trained for image classification.

Further work must be carried out to improve the SNN model by using other architectures that avoid exhaustive comparisons of feature maps, allowing the use of the model's distance layer during the online phase. Besides, we plan to evaluate the proposed SNN model on different datasets, such as Paris, INRIA Holidays and DocExplore.


ACKNOWLEDGMENT

CAPES (Coordination for the Improvement of Higher Education Personnel) and CNPq (National Council for Scientific and Technological Development), grant 306684/2018-2.

REFERENCES

[1] S. Wu, A. Oerlemans, E. M. Bakker, and M. S. Lew, "Deep binary codes for large scale image retrieval," Neurocomputing, 2017.
[2] Y. Xu, F. Shen, X. Xu, L. Gao, Y. Wang, and X. Tan, "Large-scale image retrieval with supervised sparse hashing," Neurocomputing, vol. 229, pp. 45–53, 2017.
[3] S. En, C. Petitjean, S. Nicolas, and L. Heutte, "A scalable pattern spotting system for historical documents," Pattern Recognition, vol. 54, pp. 149–161, 2016.
[4] P. Yarlagadda, A. Monroy, B. Carque, and B. Ommer, "Recognition and analysis of objects in medieval images," in ACCV 2010 International Workshops, R. Koch and F. Huang, Eds. Berlin, Heidelberg: Springer Berlin Heidelberg, 2011, pp. 296–305.
[5] G. Zhu and D. Doermann, "Logo matching for document image retrieval," in 2009 10th International Conference on Document Analysis and Recognition, 2009, pp. 606–610.
[6] J. Sivic and A. Zisserman, "Video Google: a text retrieval approach to object matching in videos," in Proceedings Ninth IEEE International Conference on Computer Vision, Oct 2003, pp. 1470–1477, vol. 2.
[7] A. Babenko and V. Lempitsky, "Aggregating local deep features for image retrieval," in The IEEE International Conference on Computer Vision (ICCV), December 2015.
[8] J. Yue-Hei Ng, F. Yang, and L. S. Davis, "Exploiting local features from deep networks for image retrieval," in 2015 IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 53–61.
[9] A. Gordo, J. Almazán, J. Revaud, and D. Larlus, "Deep image retrieval: Learning global representations for image search," in Computer Vision – ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11–14, 2016, Proceedings, Part VI, B. Leibe, J. Matas, N. Sebe, and M. Welling, Eds. Cham: Springer International Publishing, 2016, pp. 241–257.
[10] Y. W. Luo, Y. Li, F. J. Han, and S. B. Huang, "Grading image retrieval based on CNN deep features," in 2018 20th International Conference on Advanced Communication Technology (ICACT), Feb 2018, pp. 148–152.
[11] K. L. Wiggers, A. S. Britto Jr., A. L. Koerich, L. Heutte, and L. E. S. Oliveira, "Document image retrieval using deep features," in International Joint Conference on Neural Networks (IJCNN), vol. 1, Rio de Janeiro, 2018, pp. 3185–3192.
[12] G. Koch, R. Zemel, and R. Salakhutdinov, "Siamese neural networks for one-shot image recognition," in ICML 2015 Deep Learning Workshop, 2015.
[13] H. Wu, Z. Xu, J. Zhang, W. Yan, and X. Ma, "Face recognition based on convolution siamese networks," in 2017 10th International Congress on Image and Signal Processing, BioMedical Engineering and Informatics (CISP-BMEI), Oct 2017, pp. 1–5.
[14] S. Marinai, B. Miotti, and G. Soda, "Digital libraries and document image retrieval techniques: A survey," in Learning Structure and Schemas from Documents, M. Biba and F. Xhafa, Eds. Berlin, Heidelberg: Springer Berlin Heidelberg, 2011, pp. 181–204.
[15] J. Chen, Y. Wang, L. Luo, J.-G. Yu, and J. Ma, "Image retrieval based on image-to-class similarity," Pattern Recognition Letters, vol. 83, Part 3, pp. 379–387, 2016.
[16] Q. D. T. Thuy, Q. N. Huu, C. P. Van, and T. N. Quoc, "An efficient semantic-related image retrieval method," Expert Systems with Applications, vol. 72, pp. 30–41, 2017.
[17] S. He, P. Samara, J. Burgers, and L. Schomaker, "Historical manuscript dating based on temporal pattern codebook," Computer Vision and Image Understanding, vol. 152, pp. 167–175, 2016.
[18] A. Alaei, M. Delalandre, and N. Girard, "Logo detection using painting based representation and probability features," in 12th International Conference on Document Analysis and Recognition, 2013, pp. 1236–1239.
[19] S. En, C. Petitjean, S. Nicolas, L. Heutte, and F. Jurie, "Region proposal for pattern spotting in historical document images," in 2016 15th International Conference on Frontiers in Handwriting Recognition (ICFHR), Oct 2016, pp. 367–372.
[20] J. R. R. Uijlings, K. E. A. van de Sande, T. Gevers, and A. W. M. Smeulders, "Selective search for object recognition," International Journal of Computer Vision, vol. 104, no. 2, pp. 154–171, 2013.
[21] C. L. Zitnick and P. Dollár, "Edge boxes: Locating object proposals from edges," in ECCV, 2014.
[22] M.-M. Cheng, Z. Zhang, W.-Y. Lin, and P. H. S. Torr, "BING: Binarized normed gradients for objectness estimation at 300fps," in IEEE CVPR, 2014.
[23] A. Krizhevsky and G. E. Hinton, "Using very deep autoencoders for content-based image retrieval," in ESANN, 2011.
[24] R. Xia, Y. Pan, H. Lai, C. Liu, and S. Yan, "Supervised hashing for image retrieval via image representation learning," in Proceedings of the Twenty-Eighth AAAI Conference on Artificial Intelligence. AAAI Press, 2014, pp. 2156–2162.
[25] A. Babenko, A. Slesarev, A. Chigorin, and V. Lempitsky, "Neural codes for image retrieval," in Computer Vision – ECCV 2014, D. Fleet, T. Pajdla, B. Schiele, and T. Tuytelaars, Eds. Cham: Springer International Publishing, 2014, pp. 584–599.
[26] F. Schroff, D. Kalenichenko, and J. Philbin, "FaceNet: A unified embedding for face recognition and clustering," in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 815–823.
[27] S. Berlemont, G. Lefebvre, S. Duffner, and C. Garcia, "Class-balanced siamese neural networks," Neurocomputing, vol. 273, pp. 47–56, 2018.
[28] J. Bromley, I. Guyon, Y. LeCun, E. Säckinger, and R. Shah, "Signature verification using a "siamese" time delay neural network," in Proceedings of the 6th International Conference on Neural Information Processing Systems. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 1993, pp. 737–744.
[29] Y. Qi, Y. Song, H. Zhang, and J. Liu, "Sketch-based image retrieval via siamese convolutional neural network," in 2016 IEEE International Conference on Image Processing (ICIP), Sept 2016, pp. 2460–2464.
[30] Y.-A. Chung and W.-H. Weng, "Learning deep representations of medical images using siamese CNNs with application to content-based image retrieval," in Proceedings of the 31st Conference on Neural Information Processing Systems – NIPS 2017, Nov 2017.
[31] T. Lin, Y. Cui, S. Belongie, and J. Hays, "Learning deep representations for ground-to-aerial geolocalization," in 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2015, pp. 5007–5015.
[32] I. Melekhov, J. Kannala, and E. Rahtu, "Siamese network features for image matching," in 2016 23rd International Conference on Pattern Recognition (ICPR), Dec 2016, pp. 378–383.
[33] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Advances in Neural Information Processing Systems, 2012.
[34] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell, "Caffe: Convolutional architecture for fast feature embedding," arXiv preprint arXiv:1408.5093, 2014.
[35] B. Recht, C. Re, S. Wright, and F. Niu, "Hogwild: A lock-free approach to parallelizing stochastic gradient descent," in Advances in Neural Information Processing Systems 24, J. Shawe-Taylor, R. Zemel, P. Bartlett, F. Pereira, and K. Weinberger, Eds., 2011, pp. 693–701.
[36] L. Liu and H. Qi, "Learning effective binary descriptors via cross entropy," in 2017 IEEE Winter Conference on Applications of Computer Vision (WACV), March 2017, pp. 1251–1258.
[37] S. Nowozin, "Optimal decisions from probabilistic models: the intersection-over-union case," in Computer Vision and Pattern Recognition (CVPR 2014). IEEE Computer Society, June 2014.
[38] D. Lewis, G. Agam, S. Argamon, O. Frieder, D. Grossman, and J. Heard, "Building a test collection for complex document information processing," in Proc. 29th Annual Int. ACM SIGIR Conference (SIGIR 2006), 2006, pp. 665–666.
[39] R. Jain and D. Doermann, "Logo retrieval in document images," in 2012 10th IAPR International Workshop on Document Analysis Systems, 2012, pp. 135–139.
[40] M. Rusiñol and J. Lladós, "Efficient logo retrieval through hashing shape context descriptors," in Proceedings of the 9th IAPR International Workshop on Document Analysis Systems, 2010, pp. 215–222.
[41] V. P. Le, M. Visani, C. D. Tran, and J. M. Ogier, "Improving logo spotting and matching for document categorization by a post-filter based on homography," in 2013 12th International Conference on Document Analysis and Recognition, 2013, pp. 270–274.
[42] V. P. Le, N. Nayef, M. Visani, J.-M. Ogier, and C. D. Tran, "Document retrieval based on logo spotting using key-point matching," in 2014 22nd International Conference on Pattern Recognition, 2014, pp. 3056–3061.