
  • Sequential Cross-Modal Hashing Learning via Multi-scale Correlation Mining

    ZHAODA YE, YUXIN PENG∗, Peking University, China.

Cross-modal hashing aims to map heterogeneous multimedia data into a common Hamming space through hash functions, achieving fast and flexible cross-modal retrieval. Most existing cross-modal hashing methods learn hash functions by mining the correlation among multimedia data, but ignore an important property of multimedia data: each modality has features of different scales, such as texture, object, and scene features in an image, which can provide complementary information for boosting the retrieval task. The correlations among the multi-scale features are more abundant than the correlations between single features of multimedia data; they reveal a finer underlying structure of the multimedia data and can be used for effective hash function learning. Therefore, we propose the Multi-scale Correlation Sequential Cross-modal Hashing (MCSCH) approach, whose main contributions can be summarized as follows: 1) A multi-scale feature guided sequential hashing learning method is proposed to share the information from features of different scales through an RNN-based network and generate the hash codes sequentially. The features of different scales guide the hash code generation, which can enhance the diversity of the hash codes and weaken the influence of errors in specific features, such as false object features caused by occlusion. 2) A multi-scale correlation mining strategy is proposed to align the features of different scales in different modalities and mine the correlations among the aligned features. These correlations reveal a finer underlying structure of multimedia data and help to boost the hash function learning. 3) A correlation evaluation network evaluates the importance of the correlations to select the worthwhile ones and increases their impact on hash function learning. Experiments on two widely-used 2-media datasets and a 5-media dataset demonstrate the effectiveness of our proposed MCSCH approach.

    Additional Key Words and Phrases: Cross-modal Hashing, Correlation mining, Multi-scale, Sequential hash learning

ACM Reference Format: Zhaoda Ye, Yuxin Peng. 2019. Sequential Cross-Modal Hashing Learning via Multi-scale Correlation Mining. ACM Trans. Multimedia Comput. Commun. Appl. 1, 1 (August 2019), 21 pages. https://doi.org/10.1145/nnnnnnn.nnnnnnn

1 INTRODUCTION

Cross-modal retrieval aims to retrieve multimedia content in different modalities using a query from any single modality that users are interested in. With the explosive growth of multimedia data, cross-modal hashing has received wide attention for its high retrieval efficiency. A cross-modal hashing method maps the original multimedia data into the Hamming space through the hash function.

    ∗Corresponding author.

Author’s address: Zhaoda Ye, Yuxin Peng, Peking University, Institute of Computer Science and Technology, Beijing, 100871, China. pengyuxin@pku.edu.cn.

    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org. © 2019 Copyright held by the owner/author(s). Publication rights licensed to ACM. 1551-6857/2019/8-ART $15.00 https://doi.org/10.1145/nnnnnnn.nnnnnnn

    ACM Trans. Multimedia Comput. Commun. Appl., Vol. 1, No. 1, Article . Publication date: August 2019.



[Figure 1: an image of children playing soccer at the seaside, paired with the text "It's a sunny morning. A group of children are playing soccer at the seaside.", with correlation links between scales labeled Word, Object, Sentence, and Scene.]

Fig. 1. The correlation between different modalities and scales

The short binary codes in the Hamming space can accelerate retrieval with bit operations and hash tables, as well as reduce storage space compared with the original high-dimensional features.
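The speed advantage of Hamming-space retrieval comes from the fact that the distance between two binary codes reduces to an XOR followed by a popcount. A minimal illustrative sketch (the 8-bit codes here are made up for the example):

```python
# Hamming distance between two binary hash codes packed into integers:
# XOR marks the differing bit positions, popcount counts them.
def hamming_distance(a: int, b: int) -> int:
    return bin(a ^ b).count("1")

# Two hypothetical 8-bit hash codes.
query = 0b10110010
item = 0b10011010
print(hamming_distance(query, item))  # → 2
```

Modern CPUs expose popcount as a single instruction, which is what makes linear scans over millions of short codes feasible.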

The hashing methods of single-modality retrieval have been widely studied over the past years, such as image retrieval [1–7], text retrieval [8–11] and video retrieval [12, 13]. However, the requirements of users are highly flexible, such as retrieving relevant audio clips with an image query, which leads to the issue of cross-modal retrieval. Unfortunately, the "heterogeneous gap" makes it impossible to apply these methods to the cross-modal retrieval task directly. This gap means that the data of different modalities is inconsistent and lies in different feature spaces, where the similarities between the data of different modalities cannot be measured directly.

Most of the existing works address the problem of the "heterogeneous gap" by mining the correlation among different modalities [14–17], such as Inter-Media Hashing (IMH) [15], Cross-View Hashing (CVH) [16] and Composite Correlation Quantization (CCQ) [17]. These methods do not use any supervised information for correlation mining and achieve promising performance. However, they only learn the hash function from the distribution of the data, which limits their performance on semantically similar data retrieval. To improve retrieval accuracy, some works consider using supervised information for better correlation mining, such as Cross-Modality Similarity Sensitive Hashing (CMSSH) [18] and Semantic Correlation Maximization (SCM) [19], and achieve better results than the unsupervised methods. Recently, inspired by the success of deep learning in many visual tasks, some works adopt the deep learning framework for cross-modal hashing, such as Cross-Media Neural Network Hashing (CMNNH) [20] and Cross Autoencoder Hashing (CAH) [21].

However, most of these works ignore an important property of multimedia data: each modality has features at multiple scales, such as texture, object, and scene features in an image, which together characterize the data comprehensively and provide complementary information for the retrieval task. This complementary information can enhance the diversity of the hash codes and weaken the influence of errors in specific features, such as false object features caused by occlusion. Some recent works, such as [22], have verified the effectiveness of using multi-level information. Moreover, the correlations among the multi-scale features reveal a finer underlying structure of multimedia data and can be used for effective hash function learning. As shown in Figure 1, we illustrate some correlations between different scales in two modalities. For example, the words "children" and "soccer" in the text correlate with objects in the image, and the word "seaside" correlates with the scene in the image, while the scene of the blue sky and sun in the image correlates with a descriptive sentence in the text.

Therefore, we propose the Multi-scale Correlation Sequential Cross-modal Hashing (MCSCH) approach, where the multi-scale feature guided sequential hashing learning method fully utilizes the multi-scale features, and the multi-scale correlation mining strategy mines the correlations between the multi-scale features. Furthermore, a correlation evaluation network evaluates the importance of the correlations to select the worthwhile correlations for hash function learning. The main contributions of this paper can be summarized as follows:

1) Multi-scale feature guided sequential hashing learning method adopts an RNN structure to share the complementary information in multi-scale features; it takes the multi-scale features as inputs and generates the hash codes sequentially. The generation of the hash codes is guided by two types of information: history information and guide information. The history information comes from the previous generation steps and serves as auxiliary information to reduce erroneous hash code generation in the current step. The guide information is the multi-scale features, which provide the complementary information of the multimedia data for hash code generation.
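The sequential generation described above can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: the weight matrices are random stand-ins for learned parameters, a plain RNN cell stands in for the recurrent unit, and sign-based binarization is applied directly (training would use a continuous relaxation).

```python
import numpy as np

rng = np.random.default_rng(0)

D, H, B = 16, 32, 8  # per-scale feature dim, hidden size, bits per step (illustrative)

# Hypothetical parameters; in practice these are learned.
W_in = rng.standard_normal((H, D)) * 0.1   # projects guide information (scale feature)
W_h = rng.standard_normal((H, H)) * 0.1    # projects history information (hidden state)
W_out = rng.standard_normal((B, H)) * 0.1  # projects hidden state to hash bits

def sequential_hash(scale_features):
    """Generate hash bits step by step, one segment per scale feature.

    Each step is guided by the current scale's feature (guide information)
    and by the recurrent hidden state (history information)."""
    h = np.zeros(H)
    segments = []
    for x in scale_features:             # e.g. texture, object, scene features
        h = np.tanh(W_in @ x + W_h @ h)  # plain RNN cell
        bits = (W_out @ h > 0).astype(int)  # binarize the projection
        segments.append(bits)
    return np.concatenate(segments)

features = [rng.standard_normal(D) for _ in range(3)]  # 3 scales
code = sequential_hash(features)
print(code.shape)  # → (24,) : 3 scales x 8 bits
```

Because the hidden state carries over between steps, a noisy feature at one scale (e.g. a false object feature caused by occlusion) only partially influences later segments of the code.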

2) Multi-scale correlation mining strategy mines the correlations among the multi-scale features, which helps the model learn a more robust hash function. Concretely, we align the multi-scale features in different modalities by adjusting the input order of the guide information, so that the hash codes guided by the aligned features occupy the same position and length in the final hash codes. Building on this alignment of the hash codes, we use correlation constraint functions to mine the correlations from the aligned multi-scale features, which can boost the hash function learning.
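As an illustration of the alignment idea, the sketch below compares aligned code segments of a matched image/text pair using a simple squared-distance constraint; the paper's actual correlation constraint functions may differ, and all names here are hypothetical.

```python
import numpy as np

def aligned_correlation_loss(img_segments, txt_segments):
    """Mean squared distance between aligned hash-code segments of a
    matched image/text pair. Segment k of each modality is guided by
    features of the same (aligned) scale, so segments are compared
    position by position."""
    assert len(img_segments) == len(txt_segments)
    loss = 0.0
    for u, v in zip(img_segments, txt_segments):  # aligned scale k
        loss += np.mean((u - v) ** 2)
    return loss / len(img_segments)

# Relaxed (real-valued) code segments for 3 aligned scales, 8 bits each.
rng = np.random.default_rng(1)
img = [np.tanh(rng.standard_normal(8)) for _ in range(3)]
txt = [np.tanh(rng.standard_normal(8)) for _ in range(3)]

print(aligned_correlation_loss(img, img))  # → 0.0 (identical segments)
```

Minimizing such a constraint for matched pairs pushes the aligned segments of the two modalities toward the same binary pattern, which is what makes cross-modal comparison of the final codes meaningful.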

3) Correlation evaluation network evaluates the importance of the correlations to select the worthwhile correlations and increases their impact in the training stage. Specifically, the network takes the last output features of the RNN structure as inputs to evaluate the importance of the correlations, which improves the impact of the worthwhile correlations in the training stage.

In this paper, we extend our previous conference work [23] to fit multimedia data that
