

Page 1: High-order Deep Neural Networks for Learning Multi-Modal …scai/Publications/Conferences/... · 2016. 7. 27. · High-order Deep Neural Networks for Learning Multi-Modal Representations

High-order Deep Neural Networks for Learning Multi-Modal Representations

Kyoung-Woon On KWON@BI.SNU.AC.KR, Eun-Sol Kim ESKIM@BI.SNU.AC.KR, Byoung-Tak Zhang BTZHANG@BI.SNU.AC.KR

Department of Computer Science and Engineering, Seoul National University, Seoul 151-744, Korea

In multi-modal learning, data consists of multiple modalities, which need to be represented jointly to capture the real-world 'concept' that the data corresponds to (Srivastava & Salakhutdinov, 2012). However, it is not easy to obtain joint representations that reflect the structure of multi-modal data with machine learning algorithms, especially with conventional neural networks. This is because information composed of multiple modalities has distinct statistical properties, and each modality has a different kind of representation and correlational structure (Srivastava & Salakhutdinov, 2012). Also, noise exists in information from multi-modal input, which makes the information unreliable and inaccurate (Ernst & Di Luca, 2011).

In this paper, we develop high-order deep neural networks (HODNN) to learn representations of multi-modal data. The HODNN connects abstract information of multiple modalities with high-order edges, which lead to a multiplicative interaction rather than the purely additive interaction used in conventional deep neural networks. This high-order interaction not only captures highly non-linear relationships across the modalities but also suppresses uncorrelated noise efficiently. In addition, we apply a general deep structure to each modality so as to obtain balanced abstract information from each of them (Ngiam et al., 2011).

Thus, the HODNN consists of two parts: modal-specific learning layers and joint representation learning layers. The modal-specific learning layers have connections only within each modality, so the highest hidden layer of each modality represents the abstract information of that modality. The joint representation learning layers are composed of higher-order interactions between the joint hidden units and multiple groups of modal-specific hidden units. The joint hidden units can learn non-linear correlations among the modal-specific hidden units. The overall architecture of the HODNN is illustrated in Figure 1. In detail, the modal-specific learning layers follow a general neural network framework. The hidden representation $h^1_j$ of a specific modality is obtained by Equation 1.

$$h^1_j = \sigma\Big(\sum_i W^{v^1}_{ij} v^1_i + \mathrm{bias}\Big) \qquad (1)$$
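The paper does not provide an implementation, but the modal-specific layer of Equation 1 can be sketched in numpy as follows. The layer sizes (784 visible units, 256 hidden units) and the random initialization are illustrative assumptions, not values from the paper.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def modal_specific_layer(v, W, bias):
    """Equation 1: h^1_j = sigma(sum_i W_ij v_i + bias_j).
    v: (n_visible,) input of one modality; W: (n_visible, n_hidden)."""
    return sigmoid(v @ W + bias)

# Hypothetical sizes: e.g. a flattened 28x28 image modality.
rng = np.random.default_rng(0)
v1 = rng.random(784)
W = rng.normal(scale=0.01, size=(784, 256))
b = np.zeros(256)
h1 = modal_specific_layer(v1, W, b)
print(h1.shape)  # (256,)
```

Each modality gets its own stack of such layers, so the highest hidden layer of each stack summarizes only that modality.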

Figure 1. Architecture of High-order deep neural networks

where $\sigma(\cdot)$ is an activation function, which is a sigmoid function in our work. In the joint representation learning layers, the weight is an $(n+1)$-way interaction tensor which connects $n$ groups of modal-specific hidden units and a group of joint hidden units. For simplicity, we show only the case of a 2-modality environment as an example, but it is easily extended to the $n$-modality case. The joint hidden representation $h^{Joint}_k$ is obtained by Equation 2.

$$h^{Joint}_k = \sigma\Big(\sum_{i,j} W^{Joint}_{ijk} h^1_i h^2_j + \mathrm{bias}\Big) \qquad (2)$$
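The full 3-way tensor interaction of Equation 2 can be written as a single `einsum`. This is a minimal sketch with made-up layer sizes; the paper does not specify them.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def joint_layer_full(h1, h2, W, bias):
    """Equation 2: h^Joint_k = sigma(sum_{i,j} W_ijk h1_i h2_j + bias_k).
    W is a full 3-way interaction tensor of shape (I, J, K)."""
    return sigmoid(np.einsum('i,j,ijk->k', h1, h2, W) + bias)

rng = np.random.default_rng(1)
I, J, K = 64, 32, 16                      # hypothetical layer sizes
h1, h2 = rng.random(I), rng.random(J)     # modal-specific hidden units
W = rng.normal(scale=0.01, size=(I, J, K))
hj = joint_layer_full(h1, h2, W, np.zeros(K))
print(hj.shape)  # (16,)
```

Note that the parameter count of the full tensor grows as I·J·K, which is what motivates the factoring in Equation 3.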

Motivated by (Memisevic & Hinton, 2010), a multi-way factoring method is employed to keep the model complexity manageable without severely reducing the model's capacity. As a consequence, we can obtain the joint hidden representation $h^{Joint}_k$ as follows:

$$h^{Joint}_k = \sigma\Big(\sum_f W^{h^{Joint}}_{kf} \Big(\sum_i W^{h^1}_{if} h^1_i\Big) \Big(\sum_j W^{h^2}_{jf} h^2_j\Big) + \mathrm{bias}\Big) \qquad (3)$$
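The factoring replaces the I·J·K-parameter tensor with three matrices of F factors each, i.e. $W_{ijk} \approx \sum_f W^{h^1}_{if} W^{h^2}_{jf} W^{h^{Joint}}_{kf}$, for (I+J+K)·F parameters. A sketch, with the sanity check that the factored form reproduces the full tensor it implicitly defines (all sizes are illustrative):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def joint_layer_factored(h1, h2, Wf1, Wf2, Wfk, bias):
    """Equation 3: project each modality onto F shared factors,
    multiply factor-wise, then pool the factors into the joint units."""
    f1 = h1 @ Wf1                 # (F,) factor projection of modality 1
    f2 = h2 @ Wf2                 # (F,) factor projection of modality 2
    return sigmoid((f1 * f2) @ Wfk.T + bias)  # (K,) joint hidden units

rng = np.random.default_rng(2)
I, J, K, F = 8, 6, 5, 4           # hypothetical sizes
h1, h2 = rng.random(I), rng.random(J)
Wf1 = rng.normal(size=(I, F))
Wf2 = rng.normal(size=(J, F))
Wfk = rng.normal(size=(K, F))

# Sanity check: the factored layer equals Equation 2 applied to the
# tensor W_ijk = sum_f Wf1_if Wf2_jf Wfk_kf.
W_full = np.einsum('if,jf,kf->ijk', Wf1, Wf2, Wfk)
full = sigmoid(np.einsum('i,j,ijk->k', h1, h2, W_full))
fact = joint_layer_factored(h1, h2, Wf1, Wf2, Wfk, 0.0)
print(np.allclose(full, fact))  # True
```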

As a preliminary experiment, we focus on the joint representation learning layers to show the effect of the high-order interaction. For the joint representation learning layers, the factorized version of the high-order Boltzmann machine (Figure 2a) is used as a building block (Sejnowski, 1986).


Figure 2. Comparative models for demonstrating the effect of high-order interaction

Also, the MNIST dataset is utilized, which consists of handwritten digit images and the corresponding labels. While the label vectors of MNIST are used as targets in a discriminative task, in our experiments the image and the label vectors are used as two different modalities: the former provides visual information and the latter provides textual information, both indicating the concept 'number'. To show the competence of our model, a shallow bi-modal RBM (Figure 2b) is used as a comparative model. The shallow bi-modal RBM also functions as a module combining different modalities in conventional multi-modal deep networks (Ngiam et al., 2011; Srivastava & Salakhutdinov, 2012).
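In this setup the label is not a target but a second input modality. A minimal sketch of that encoding (the one-hot representation is the standard choice; the paper does not detail its exact encoding):

```python
import numpy as np

def label_to_modality(label, n_classes=10):
    """Treat a class label as a second, 'textual' modality:
    a one-hot vector fed into the model alongside the image."""
    v = np.zeros(n_classes)
    v[label] = 1.0
    return v

text_modality = label_to_modality(3)  # one-hot with a 1 at index 3
print(text_modality.argmax(), text_modality.sum())  # 3 1.0
```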

In order to see how the joint hidden units represent abstract information across the modalities, first the 2D t-SNE embedding algorithm is applied to the hidden units of both the factored high-order Boltzmann machine and the shallow bi-modal RBM (Figure 4). The result of the factored high-order Boltzmann machine is shown to be more visually discriminative than that of the shallow bi-modal RBM. Also, it is interesting to notice that in the embedding of the joint hidden representations of the factored high-order Boltzmann machine, the representations of each number appear to form their own sub-manifold structure. These observations imply that the representation power of the suggested model is greater than that of the comparative model.

Second, to demonstrate how the high-order interaction efficiently cancels out noise in one modality that is uncorrelated with the other modality, it is appropriate to compare the joint hidden representations of both models when noisy input is fed in. For this experiment, we first generated a corrupted dataset which consists of 100 image inputs of the number '1' corrupted by other numbers, paired with the corresponding clean text inputs of the number '1' (Figure 3). The joint hidden representations of the corrupted dataset are shown together with those of the normal dataset using 2D t-SNE visualization (Figure 4). As expected, in the factored high-order Boltzmann machine, the representations of the corrupted dataset are located near the representations of the number '1'. However, in the shallow bi-modal RBM, the representations of the corrupted dataset lie scattered across the numbers. This reveals the property of the high-order interaction that it removes uncorrelated noise information in either modality based on the other modality.
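The corruption procedure can be sketched as image mixing while the label modality stays clean. The paper only says the images are "generated by mixing"; the mixing weight `alpha` and the random stand-in arrays below are assumptions for illustration.

```python
import numpy as np

def corrupt_image(img_one, img_other, alpha=0.5):
    """Mix an image of the digit '1' with an image of another digit.
    alpha is a hypothetical mixing weight; the paired label modality
    remains the clean one-hot for '1' (as in Figure 3)."""
    return np.clip(alpha * img_one + (1.0 - alpha) * img_other, 0.0, 1.0)

rng = np.random.default_rng(3)
img_one = rng.random(784)     # stand-in for an MNIST '1' image
img_other = rng.random(784)   # stand-in for a randomly selected other digit
noisy = corrupt_image(img_one, img_other)
clean_label = np.eye(10)[1]   # the text modality stays uncorrupted
print(noisy.shape, clean_label.argmax())  # (784,) 1
```

Feeding `(noisy, clean_label)` pairs to both models is what produces the contrast shown in Figure 4.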

In future work, we aim to apply our model to the study of event cognition, which has emerged over the last years as a vibrant topic of scientific study. It is an important research topic because much of our behavior is guided by our understanding of events, which are what happens to us, what we do, what we anticipate, and what we remember in our daily life (Radvansky & Zacks, 2014). To ground this study, we employ multiple wearable sensors to record the daily life of a person, because data collected from the first-person viewpoint plays a significant role in the learning of human behavior (Zhang, 2013; Kim et al., 2016; Lee et al., 2016). Through this study, we hope to propose an event cognition model which can perceive real-time events in real life by using multiple wearable sensors.

Figure 3. Some examples of the 100 corrupted images. The corrupted images are generated by mixing a randomly selected image of the number 1 with images of other numbers.

Figure 4. Experimental results comparing learned representations between the shallow bi-modal RBM (upper row) and the factored high-order Boltzmann machine (lower row) using 2D t-SNE visualization.


Acknowledgement

This work was partly supported by the Korea government (IITP-R0126-16-1072-SW.StarLab, KEIT-10044009-HRI.MESSI, KEIT-10060086-RISF).

References

Ernst, Marc O and Di Luca, Massimiliano. Multisensory perception: from integration to remapping. In Sensory Cue Integration (Trommershauser J, Kording KP, Landy MS, eds), pp. 224–250. Oxford University Press, Oxford, 2011.

Kim, E.-S., On, K.-W., and Zhang, B.-T. Deepschema: Automatic schema acquisition from wearable sensor data in restaurant situations. In Twenty-Fifth International Joint Conference on Artificial Intelligence, 2016.

Lee, S.-W., Lee, C.-Y., Kwak, D.-H., Kim, J., and Zhang, B.-T. Dual-memory deep learning architectures for lifelong learning of everyday human behaviors. In Twenty-Fifth International Joint Conference on Artificial Intelligence, 2016.

Memisevic, Roland and Hinton, Geoffrey E. Learning to represent spatial transformations with factored higher-order Boltzmann machines. Neural Computation, 22(6):1473–1492, 2010.

Ngiam, Jiquan, Khosla, Aditya, Kim, Mingyu, Nam, Juhan, Lee, Honglak, and Ng, Andrew Y. Multimodal deep learning. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pp. 689–696, 2011.

Radvansky, Gabriel A and Zacks, Jeffrey M. Event Cognition. Oxford University Press, 2014.

Sejnowski, Terrence J. Higher-order Boltzmann machines. In AIP Conference Proceedings, volume 151, pp. 398–403, 1986.

Srivastava, Nitish and Salakhutdinov, Ruslan R. Multimodal learning with deep Boltzmann machines. In Advances in Neural Information Processing Systems, pp. 2222–2230, 2012.

Zhang, B.-T. Information-theoretic objective functions for lifelong learning. In AAAI Spring Symposium: Lifelong Machine Learning, pp. 62–69. Citeseer, 2013.