DEGREE PROJECT IN COMPUTER SCIENCE AND ENGINEERING, SECOND CYCLE, 30 CREDITS
STOCKHOLM, SWEDEN 2017
Semi-supervised Learning for
Real-world Object Recognition
using Adversarial Autoencoders
SUDHANSHU MITTAL
KTH ROYAL INSTITUTE OF TECHNOLOGY
SCHOOL OF COMPUTER SCIENCE AND COMMUNICATION
Master in Computer Science
Date: December 22, 2017
Supervisor: Prof. Thomas Brox (University of Freiburg), Prof.
Wolfram Burgard (University of Freiburg), Prof. Atsuto Maki (KTH)
Examiner: Prof. Danica Kragic
School of Computer Science and Communication
Abstract
For many real-world applications, labeled data can be costly to obtain.
Semi-supervised learning methods make use of abundantly available
unlabeled data along with a few labeled samples. Most of the latest work
on semi-supervised learning for image classification shows performance
on standard machine learning datasets like MNIST, SVHN, etc. In this
work, we propose a convolutional adversarial autoencoder architecture
for real-world data. We demonstrate the application of this architecture
for semi-supervised object recognition. We show that our approach
can learn from limited labeled data and outperform a fully-supervised
CNN baseline by about 4% on real-world datasets. We also
achieve competitive performance on the MNIST dataset compared to
state-of-the-art semi-supervised learning techniques. To spur research
in this direction, we compiled two real-world datasets: the Internet (WIS)
dataset and the Real-world (RW) dataset, each consisting of more than
20K labeled samples of small household objects belonging
to ten classes. We also show a possible application of this method for
online learning in robotics.
Sammanfattning
In most real-world applications, it can be costly to obtain labeled
data. Semi-supervised learning methods typically make extensive use
of unlabeled data supported by a small amount of labeled data. Much
of the recent work on semi-supervised learning methods for image
classification demonstrates performance on standard machine learning
datasets such as MNIST, SVHN, and so on. In this work, we propose a
convolutional adversarial autoencoder architecture for real-world data.
We demonstrate the application of this architecture for semi-supervised
object recognition and show that our approach can learn from a limited
amount of labeled data. We thereby outperform the fully-supervised
CNN baseline method by about 4% on real-world datasets. We also
achieve competitive performance on the MNIST dataset compared to
state-of-the-art semi-supervised learning methods. To stimulate research
in this direction, we compiled two real-world datasets: the Internet (WIS)
and Real-world (RW) datasets, which consist of more than 20,000 labeled
samples each, comprising small household objects belonging to ten
classes. We also show a possible application of this method for online
learning in robotics.
Acknowledgement
I would like to thank my supervisors at the University of Freiburg,
Prof. Thomas Brox and Prof. Wolfram Burgard for giving me this op-
portunity to pursue my master thesis at their lab. I greatly appreciate
their constant support, feedback and guidance throughout the thesis
work. I would like to thank my supervisor at KTH, Prof. Atsuto Maki
for supporting this collaboration in all respects and for his meticulous
feedback on scientific writing. I would like to thank Prof. Danica
Kragic Jensfelt for examining the thesis and organizing the public pre-
sentation at KTH. I owe a great debt of gratitude to Andreas Eitel and
Maxim Tatarchenko for being great mentors, for countless discussions,
motivation and guidance.
I had the privilege of discussing and learning from many excep-
tional researchers at AIS. Special thanks to Gabriel Oliveira, Ayush
Dewan, Tayyab Naseer, Marcel Binz and Noha Radwan for numerous
interesting discussions. Many thanks to Andreas Eitel, Michael Keser
and Philipp Jund for their technical support. I would like to thank An-
dreas Eitel and Prof. Wolfram Burgard for offering me a student job at
AIS which supported me financially throughout my stay in Germany. I
thank Anna Hellberg Gustafsson from KTH for providing me with the Erasmus+
scholarship for my stay in Germany.
I thank Andreas Eitel, Maxim Tatarchenko and Florian Kraemer for
proofreading the thesis report. This work would not have been possible
without the support of everyone at the AIS group. Special thanks to
Marcus Lundin, Gabriela Zarzar Gandler and Sebastian Zarzar Gandler
for helping me write the Swedish version of the abstract. I thank every-
one who helped me to collect the dataset: Tobias Paxian, Andreas Eitel,
V.K.Mittal, Shashi Kabdal, Himanshu Mittal, Shruti Kabdal, Shuchi Kab-
dal, Hannah Rosa Nesswetter, David Czudnochowski, Anand Narayan,
Sophie Ninnemann, Gabriela Zarzar Gandler, Jingwei Zhang, Oier
Mees, Rendani Mbuvha, Ronak Shah, Vishakha Patel, Andy Wachaja
and Federico Boniardi.
Contents
1 Introduction
  1.1 Motivation
    1.1.1 Ethics, Societal Aspects and Sustainability
  1.2 Contributions
  1.3 Overview of the Thesis
2 Background
  2.1 Artificial Neural Networks
    2.1.1 Convolutional Neural Networks
  2.2 Deep Generative Models
    2.2.1 Autoencoders
    2.2.2 Generative Adversarial Network
3 Related Work
  3.1 Deep Generative Models
    3.1.1 VAE-based Methods
    3.1.2 GAN-based Methods
    3.1.3 Hybrid Methods
    3.1.4 Real-world Applications
4 Methodology
  4.1 Adversarial Autoencoders
    4.1.1 Motivation
    4.1.2 Basic AAE Architecture
    4.1.3 Learning Latent Distributions
    4.1.4 Semi-supervised AAE
    4.1.5 Convolutional Semi-supervised AAE Architecture
5 Experiments and Results
  5.1 Datasets
    5.1.1 MNIST Dataset
    5.1.2 Internet Dataset
    5.1.3 Real-world Dataset
    5.1.4 Preprocessing
  5.2 Learning of the Latent Distribution
  5.3 Semi-supervised Classification
    5.3.1 Implementation Details
    5.3.2 Object Recognition Results
  5.4 Online Learning with AAE
  5.5 Discussions
6 Conclusion and Future Work
  6.1 Conclusion
  6.2 Future Work
A Datasets
  A.1 Dataset Filtering
  A.2 Real-world Dataset: Video Streams
B Architecture Details
  B.1 Semi-supervised Convolutional AAE
    B.1.1 Adversarial Network: Discriminator
    B.1.2 Autoencoder Network
    B.1.3 Classification/Adversarial Network: Generator
Chapter 1
Introduction
1.1 Motivation
The idea behind semi-supervised learning for object recognition comes
from the learning ability of human beings. A human child can learn
about objects like animals, toys, etc. from only a few examples. For
example, once a child is shown what a cat looks like, it can thereafter
recognize new kinds of cats in the world. Human beings do not require
thousands of labeled examples to learn the visual appearance of an
object, and they become better at recognition with subsequent exposure
to other variants of that object.
Image classification is one of the important tasks in the field of
computer vision. This task is highly relevant for various applications
like autonomous driving, service robotics, remote sensing and medical
diagnosis. Most of the latest image classification methods like Deep
Residual Networks [16] require a large collection of manually labeled
images to perform well. Collecting labeled samples can be difficult and
very expensive for specific real-world applications.
One way to tackle this challenge is by leveraging information from
unlabeled data in an unsupervised or semi-supervised manner. Al-
though image classification in a completely unsupervised manner is
not yet practical for complex distributions like natural images, recent
methods based on neural networks have shown promising results for
semi-supervised learning. In semi-supervised learning methods, we
make use of unlabeled data for training: typically, a small amount
of labeled data is combined with a large amount of unlabeled data. Semi-supervised
methods make use of unlabeled data to better capture the shape of
underlying data distribution and generalize better to new samples.
In fields like medical science and robotics, it is much easier to obtain
unlabeled data as compared to obtaining labeled data. For example,
in robotics, a mobile robot can autonomously interact with the envi-
ronment and collect unlabeled data in abundance without any human
supervision. Therefore, semi-supervised learning is very well suited to
fields like robotics.
Several methods have been studied in the literature for semi-su-
pervised learning. In this work, we plan to focus on techniques based
on generative models. Building scalable generative models to capture
rich distributions such as audio, images or video is one of the impor-
tant challenges in machine learning. Until recently, deep generative
models, such as Restricted Boltzmann Machines, Deep Belief Networks
and Deep Boltzmann Machines were trained primarily with sampling
algorithms. These sampling-based approaches become more imprecise
as training progresses because the Markov chains they sample from
are unable to mix between modes fast enough.
In recent years, several deep generative models, namely, Variational
Autoencoder (VAE) and Generative Adversarial Network (GAN), have
been developed that can be trained via direct back-propagation and
avoid the difficulties that come with sampling-based training.
Figure 1.1: Examples for each class from the Real-world (RW) dataset:
banana, bottle, bowl, calculator, can, cup, orange, scissors, soccer-ball
and watering-can.
In this work, we explore how well the latest methods based on
deep generative models can be used to recognize objects using semi-
supervised learning methods. We scale one such method, called
Adversarial Autoencoders (AAE), for object recognition on real-world image
datasets. Figure 1.1 gives a glimpse of our real-world object dataset.
AAE is a hybrid approach which uses ideas from Variational Autoen-
coder (VAE) and Generative Adversarial Network (GAN). AAE is a
probabilistic autoencoder that uses an adversarial framework for varia-
tional inference. In a probabilistic autoencoder, the encoder approxi-
mates a posterior distribution, and the decoder is used to stochastically
reconstruct the input data from the latent variables; the resulting model
captures the distribution over images. Latent variables are variables
that are not directly observed but rather inferred from other observed
variables using a mathematical model.
Online learning is a related task which is highly relevant for robotics.
For example in service robotics, every time a new mobile robot is set
up in a new environment, it needs to adapt to the environment and
learn the objects in that environment for an interactive application. The
traditional way is to annotate all the objects manually to recognize and
interact with them. Additionally, the variety of objects also changes
dynamically in any given environment. To reduce these expenses, we
can deploy a robot with a semi-supervised learning approach. The
robot’s learning model can be initially trained with only a few labeled
instances of the objects, and then the robot can adapt its model to increase
the classification performance over time by collecting more unlabeled
data. In this work, we also show how this semi-supervised learning
method may be used for online learning on real-world data. Since
our real-world data is similar to the data captured by the robots, this
method can be readily applied to robotics.
1.1.1 Ethics, Societal Aspects and Sustainability
The contributions of this thesis work are very technical concerning the
usage of deep generative models for semi-supervised object recognition,
although there are many possible applications of object recognition in
general for example autonomous driving, medical diagnosis, service
robotics, etc.
Some applications of semi-supervised classification can be highly
relevant for the society, for example, cancer tumor detection in magnetic
resonance spectroscopic images. Cancer is a fatal disease; with more
than 10 million people diagnosed with it every year worldwide, it is
one of the main challenges that our society
faces. Detection of a cancer tumor in early stages can help in curing
it before it becomes fatal. In oncology, medical diagnosis of cancer
involves differentiating between tumor types and grades. The classifi-
cation requires the availability of accurate diagnosis of past cases such
that they can be used as training samples. Such labeled data is scarce
in most areas of medical science while unlabeled data can be acquired
in abundance while keeping the identity of the person undisclosed.
Therefore, semi-supervised recognition can be a sensible choice for such
applications. In our opinion, semi-supervised classification methods
can have a positive societal impact. They can help us learn a model for
applications where supervision is scarce, and anonymity of the data is
the foremost priority.
One of the major ethical challenges in most computer vision
applications is the privacy of individuals' image data. This work,
however, mainly concerns common household objects and thus poses a
weaker threat to individual privacy. The method discussed in this work
makes it easier to apply object recognition to different fields, as fewer
labeled images are needed to accomplish the same task. It can also serve
ecological sustainability, for example by supporting better planning of
land and forest usage: it can help classify different species and landmarks
in aerial images. Similarly, such methods can be applied in spatial
informatics, which in general lacks labeled data. In conclusion, sustainable
usage of technologies based on semi-supervised learning can help improve
the performance of systems in areas that are crucial for social and
ecological preservation.
1.2 Contributions
We apply one of the latest semi-supervised learning methods for object
recognition on real-world image data. In summary, our key contribu-
tions are:
• Real-world Datasets: Compilation of two different real-world
datasets. The first dataset is collected using a web-based image
search engine and the second dataset is collected using a hand-
held camera. Both datasets are automatically filtered using a recent
image retrieval method [41].
• Convolutional AAE for Semi-supervised Object Recognition:
A Convolutional Adversarial Autoencoder architecture for semi-
supervised end-to-end training. Extension of the fully-connected
AAE architecture, proposed in [32], to a convolutional AAE ar-
chitecture. An open-source Tensorflow implementation of our
method that can be applied to different datasets. Competitive re-
sults on the standard MNIST dataset and our real-world datasets.
• Online learning: Experiments demonstrating the possible appli-
cation of our semi-supervised method for online learning, espe-
cially in robotics.
1.3 Overview of the Thesis
In Chapter 2, we describe the theoretical concepts behind all the build-
ing blocks of an adversarial autoencoder, concepts of artificial neural
networks, convolutional neural networks and deep generative models
including both variational autoencoder and adversarial networks. In
Chapter 3, we review the most relevant related work on deep generative
models for semi-supervised learning and other semi-supervised
methods for real-world applications. In Chapter 4, we first discuss
the motivation and theory behind the basic adversarial autoencoder
model in detail. Later in this chapter, we describe our new proposed
architecture for convolutional adversarial autoencoders.
Finally, in Chapter 5, we present our experiments and evaluation
results of the method discussed in Chapter 4. In this chapter, we present
one of the potential applications of the adversarial autoencoder model
in online learning and in Chapter 6, we summarize our results and
discuss the future work.
Chapter 2
Background
In this chapter, we will briefly discuss various building blocks impor-
tant to this work. Most of these components are based on deep neural
networks.
2.1 Artificial Neural Networks
The feed-forward neural network is a machine learning model that
learns to approximate some function. It can be used for both classifi-
cation and regression problems. For a classification problem, it learns
the function y = f(x; θ) that maps an input x to the class-label y. The
feed-forward neural network learns the value of the parameter θ that
results in the best function approximation.
Figure 2.1: The general feed-forward neural network architecture [2].
The network can be described as a series of functional transformations.
For example, for input variables x_1, ..., x_D, one set of functions (also
called a layer) is defined as:
z_j = h\left( \sum_{i=1}^{D} w_{ji}^{(1)} x_i + w_{j0}^{(1)} \right), \qquad (2.1)
where j = 1, ..., M, the w_{ji}^{(1)} are the weights (corresponding to θ in
the explanation above), the w_{j0}^{(1)} are the bias parameters, and the
superscript (1) indicates the index of the layer in the network. Each linear
combination is transformed using a non-linear activation function h(·), as
shown in Eq. 2.1. Each of the outputs z_j is called a hidden unit. These
hidden units can be further combined to form an overall network function:
y_k(\mathbf{x}, \mathbf{w}) = \sigma\left( \sum_{j=1}^{M} w_{kj}^{(2)} z_j + w_{k0}^{(2)} \right), \qquad (2.2)
where k = 1, ..., K and K is the total number of outputs, the w_{k0}^{(2)}
are again the bias parameters, and σ is the final activation function,
analogous to h(·) in Eq. 2.1. For classification problems, σ is often chosen
to be the softmax activation function. Similarly, we can build a hierarchical
chain of such non-linear functions to form a neural network suited
to our application. Figure 2.1 shows the general artificial neural network
architecture.
The feed-forward network can be trained with the standard back-
propagation algorithm. For multi-class classification, the cross-entropy
loss function is used with a feed-forward neural network, which is
defined as
H(y, y') = -\sum_{i} y'_i \log y_i, \qquad (2.3)
where y_i and y'_i denote the predicted class-label and the true class-label
of class i, respectively. Neural networks with more than three hidden
layers are sometimes called deep neural networks.
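To make Eqs. 2.1 to 2.3 concrete, the following is a minimal NumPy sketch of a one-hidden-layer network with a softmax output and the cross-entropy loss. It assumes a ReLU for the hidden activation h(·); all sizes and names are illustrative and not the architecture used later in this thesis.

import numpy as np

def relu(a):
    return np.maximum(0.0, a)

def softmax(a):
    e = np.exp(a - a.max(axis=1, keepdims=True))  # shift for numerical stability
    return e / e.sum(axis=1, keepdims=True)

def forward(x, W1, b1, W2, b2):
    # Eq. 2.1: hidden units z_j = h(sum_i w_ji x_i + w_j0), with h = ReLU.
    z = relu(x @ W1 + b1)
    # Eq. 2.2: outputs y_k with the softmax as the final activation sigma.
    return softmax(z @ W2 + b2)

def cross_entropy(y_pred, y_true):
    # Eq. 2.3, averaged over the mini-batch; y_true is one-hot encoded.
    return -np.mean(np.sum(y_true * np.log(y_pred + 1e-12), axis=1))

# Illustrative sizes: D = 4 inputs, M = 8 hidden units, K = 3 classes.
rng = np.random.default_rng(0)
W1, b1 = 0.1 * rng.normal(size=(4, 8)), np.zeros(8)
W2, b2 = 0.1 * rng.normal(size=(8, 3)), np.zeros(3)
x = rng.normal(size=(5, 4))                     # a mini-batch of 5 samples
y_true = np.eye(3)[rng.integers(0, 3, size=5)]  # random one-hot labels
print(cross_entropy(forward(x, W1, b1, W2, b2), y_true))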
2.1.1 Convolutional Neural Networks
Convolutional Neural Networks (ConvNets/CNN) are a special type
of neural networks for processing data that has a grid-like topology;
for example, image data can be thought of as a 2-D grid of pixels.
Convolution is a linear operation on such a grid of numbers. These
networks use the convolution operation in place of the general matrix
multiplication of regular neural networks. Convolutional neural networks have
been tremendously successful in applications like image recognition
and classification.
A ConvNet is a sequence of layers similar to the hidden layers in regular
neural networks and every layer of a ConvNet transforms one volume
of activations to another through a differentiable function. There are
several types of layers in a ConvNet architecture: Convolutional layer,
Pooling layer, Dropout layer, Normalization layer and Fully-connected
layer. We now briefly discuss these building-block layers. The
explanation of ConvNets is inspired by a recent survey of CNN methods
[14] and a course [27] from Stanford University:
Convolutional Layer
Convolutional layers apply a convolution operation to the input, pass-
ing the output to the next layer. The primary purpose of convolution in
case of a ConvNet is to extract features from the input image. Convolu-
tion preserves the spatial relationship between pixels by learning image
features using small squares of input data. The convolution operation
allows the network to learn spatial features at hierarchical levels with
fewer parameters as compared to a regular neural network.
Pooling Layer
A pooling layer is used between successive convolutional layers in a
ConvNet architecture. Its function is to progressively reduce the spatial
size of the representation, which reduces the number of parameters in the
network and hence also helps control overfitting. The pooling layer
operates independently on every depth slice of the input and resizes
it spatially, using the MAX operation; this is commonly called the max-
pooling operation. We use a similar configuration of pooling layers
in our semi-supervised learning architecture.
Dropout Layer
This layer “drops out” a random set of activations in a layer by setting
them to zero. It makes sure that the network is not getting too “fitted”
to the training data and thus helps alleviate the overfitting problem.
An important note is that this layer is only used during training, and
not during test time.
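As an illustration, the following is a minimal sketch of the "inverted" dropout variant commonly used in practice; the rescaling by the keep probability is an implementation convention, not part of the description above.

import numpy as np

def dropout(activations, keep_prob=0.5, training=True):
    # During training, zero a random subset of activations and rescale
    # the survivors; at test time the layer is the identity.
    if not training:
        return activations
    mask = np.random.random(activations.shape) < keep_prob
    return activations * mask / keep_prob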
Batch-Normalization Layer
Batch normalization is a technique [18] proposed for accelerating the
learning process of deep neural networks. The authors argue that the
learning process is slowed down by the change in the distribution of
each layer's inputs during training. They call this internal covariate
shift and address it by normalizing the layer inputs. Normalization is
carried out for each training mini-batch.
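The following is a minimal sketch of the training-time batch-normalization transformation; the running statistics used at test time and the per-channel treatment of convolutional feature maps are omitted.

import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    # Normalize each feature over the mini-batch (axis 0), then apply
    # the learned scale (gamma) and shift (beta) parameters.
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    return gamma * (x - mu) / np.sqrt(var + eps) + beta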
Fully-Connected Layer
Each unit in a fully-connected layer has full connections to all activa-
tions in the previous layer similar to conventional neural networks.
Their activations can hence be computed with a matrix multiplication
followed by a bias offset.
2.2 Deep Generative Models
Generative models can be trained with missing data, and semi-super-
vised learning is one of the interesting cases of missing data where
labels for most of the training data are missing. In deep generative
models, the generative model is implicitly or explicitly learned using
deep neural networks. Deep generative models are one of the success-
ful techniques that attempt to solve the problem of unsupervised and
semi-supervised learning. They have widespread applications besides
semi-supervised learning, like density estimation, image denoising and
representation learning.
2.2.1 Autoencoders
Variational Autoencoder (VAE) [20, 35] is a deep generative modeling
technique that uses neural networks to parameterize the posterior dis-
tribution of the latent variables along with a generative network. VAE
is based on an autoencoder architecture. We first briefly describe the
autoencoder model before explaining the VAE algorithm.
Vanilla Autoencoder
Figure 2.2: The general Autoencoder architecture where the yellow
module is the encoder network and red module is the decoder network.
An autoencoder is a feed-forward neural network that tries to recon-
struct its input after passing it through a lower dimensional space.
Autoencoders are unsupervised learning models. An autoencoder
contains two connected neural networks: an encoder network and a
decoder network, shown in Figure 2.2. The encoder compresses the
input data to a lower dimensional space also called the latent space or
hidden representation, and the decoder takes this hidden representa-
tion as input with the goal to reconstruct the input to the encoder. In
other words, the encoder can be defined as a function h = f(x) and the
decoder as a function t = g(h), and the autoencoder effectively learns an
approximate identity function g(f(x)) ≈ x. An autoencoder is trained with the reconstruction loss
between the input and its reconstructed version. It is most simply and
effectively defined using the mean-squared (L2) loss:
\mathcal{L} = \frac{1}{2M} \sum_{i=1}^{M} \lVert \mathbf{x}_i - \hat{\mathbf{x}}_i \rVert_2^2, \qquad (2.4)

where M is the total number of input samples, \mathbf{x}_i is the original data
and \hat{\mathbf{x}}_i is its reconstruction. If the input data is normalized between [0, 1],
the cross-entropy loss can also be used as reconstruction loss.
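As a minimal sketch, the reconstruction loss of Eq. 2.4 can be computed as follows, assuming the inputs are stacked into an M × d matrix:

import numpy as np

def reconstruction_loss(x, x_hat):
    # Eq. 2.4: mean-squared (L2) reconstruction loss over M samples.
    M = x.shape[0]
    return np.sum((x - x_hat) ** 2) / (2 * M)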
Variational Autoencoder
Figure 2.3: The general Variational Autoencoder (VAE) architecture,
where the yellow module is the encoder network and red module is the
decoder network.
A plain autoencoder is capable of reproducing only images similar to
those shown during training; we cannot generate new images using a
simple autoencoder. To build an explicit generative model, the variational
autoencoder imposes a probability distribution over the latent space. In a
VAE, this is done using a Kullback-Leibler (KL) divergence [22] loss term,
which measures the distance between the distribution of the latent
variables (the output of the encoder) and a standard Gaussian distribution
(the prior distribution), along with a mean-squared error term that
encourages accurate reconstruction of the input images. The objective of this latent variable model
is to calculate the posterior p(z|x). According to Bayes' rule:

p(z|x) = \frac{p(x|z)\, p(z)}{p(x)} \qquad (2.5)
Since evaluating the evidence term p(x) is intractable, we need to
approximate this posterior distribution. To overcome this challenge, the
VAE introduces an inference machine q_\phi(z|x) that learns to approximate
the posterior p_\theta(z|x). Hence, the objective of this latent variable model
becomes to minimize the following KL-divergence (D_{KL}) term:

D_{KL}\left[ q_\phi(z|x) \,\|\, p_\theta(z|x) \right], \qquad (2.6)
but computing the posterior p_\theta(z|x) is still intractable due to the
presence of the evidence term p_\theta(x). To make this variational inference tractable,
the VAE combines the KL-divergence term with the Evidence Lower
Bound (ELBO) and tries to maximize a lower bound on the data
log-likelihood instead: \log p_\theta(\mathbf{x}) \geq \mathcal{L}(\theta, \phi, \mathbf{x}). The lower bound is written as:

\mathcal{L}(\theta, \phi, \mathbf{x}) = \underbrace{\mathbb{E}_{q_\phi(z|x)}\left[ \log p_\theta(x|z) \right]}_{\text{reconstruction term}} - \underbrace{D_{KL}\left( q_\phi(z|x) \,\|\, p_\theta(z) \right)}_{\text{regularization term}}. \qquad (2.7)
The overall loss term in the case of VAEs is a sum of the reconstruc-
tion term and the KL divergence regularization term as shown in Eq.
2.7. Figure 2.3 shows the general VAE architecture with appropriate
notations used in Eq. 2.7.
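For the common special case of a diagonal Gaussian posterior q_\phi(z|x) = N(\mu, \sigma^2 I), a standard normal prior and a Gaussian decoder (so that the reconstruction term reduces to a mean-squared error), the KL term of Eq. 2.7 has a well-known closed form, and the negative lower bound can be sketched as follows; the log-variance parameterization is an implementation choice.

import numpy as np

def vae_loss(x, x_hat, mu, log_var):
    # Negative of the lower bound in Eq. 2.7 for q(z|x) = N(mu, exp(log_var))
    # and p(z) = N(0, I).
    M = x.shape[0]
    recon = np.sum((x - x_hat) ** 2) / (2 * M)  # reconstruction term
    # Closed-form KL divergence between the diagonal Gaussian posterior
    # and the standard normal prior (regularization term).
    kl = -0.5 * np.sum(1 + log_var - mu ** 2 - np.exp(log_var)) / M
    return recon + kl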
2.2.2 Generative Adversarial Network
Figure 2.4: The general Generative Adversarial Network (GAN) archi-
tecture where the green module is the discriminator (D) network, and
the yellow module is the generator (G) network.
The basic idea of Generative Adversarial Network (GAN) [12] is to set
up a game between two players. One of them is called the generator
G(z). The generator creates samples that are intended to come from the
same distribution as the training data. The other player is called the
discriminator D(x). The discriminator determines whether the samples
are generated (fake) by the generator or taken from the training data
(real). The discriminator is similar to a supervised model classifying
samples into two classes, which are real or fake. The generator learns
to fool the discriminator by producing fake samples similar to the true
training data, and the discriminator learns to catch the counterfeiting
process of the generator. Figure 2.4 shows the basic GAN architecture
as explained above.
The adversarial game between the generator and discriminator can
be formalized as:
\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{data}(x)}\left[ \log D(x) \right] + \mathbb{E}_{z \sim p(z)}\left[ \log\left(1 - D(G(z))\right) \right]. \qquad (2.8)
In Eq. 2.8, p_{data}(x) denotes the data distribution and p(z) denotes the
latent distribution for sampling the noise vector, which is given as input
to the generator. Generally, the noise distribution is assumed to be a
standard normal distribution. V(D, G) is the overall GAN objective
function. For implementation purposes, Eq. 2.8 can be broken into two
separate objectives for the discriminator and generator networks:
Discriminator Network
\max_D V(D, G) = \underbrace{\mathbb{E}_{x \sim p_{data}(x)}\left[ \log D(x) \right]}_{\text{maximize prob. of } D(\text{real})} + \underbrace{\mathbb{E}_{z \sim p(z)}\left[ \log\left(1 - D(G(z))\right) \right]}_{\text{minimize prob. of } D(\text{fake})}. \qquad (2.9)
In Eq. 2.9, D(x) is trained with the sigmoid cross-entropy loss func-
tion with label 1 for a real sample and 0 for a fake sample.
Generator Network
\min_G V(D, G) = \underbrace{\mathbb{E}_{z \sim p(z)}\left[ \log\left(1 - D(G(z))\right) \right]}_{\text{maximize prob. of } D(\text{fake})}. \qquad (2.10)
For the generator update in Eq. 2.10, the sigmoid cross-entropy loss
function is used with the labels flipped, i.e., with label 1 for a fake
sample. The discriminator is trained on two mini-batches of data: one
coming from the dataset and the other coming from the generator.
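For implementation, both objectives are usually written as minimization problems. The sketch below assumes raw discriminator logits and uses the non-saturating generator loss described above, with label 1 for fake samples:

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def discriminator_loss(logits_real, logits_fake, eps=1e-12):
    # Eq. 2.9 as a minimization: push D(real) toward 1 and D(fake) toward 0.
    return -np.mean(np.log(sigmoid(logits_real) + eps)
                    + np.log(1.0 - sigmoid(logits_fake) + eps))

def generator_loss(logits_fake, eps=1e-12):
    # Non-saturating form of Eq. 2.10: push D(fake) toward 1.
    return -np.mean(np.log(sigmoid(logits_fake) + eps))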
As a result of this learning procedure, the generator learns to create
samples that are drawn from the same distribution as the training data.
A GAN [13] can be considered an implicit generative model. In a
GAN, the model does not explicitly represent the probability distribution
over the data, as VAEs do; instead, the model interacts with the
probability distribution indirectly. We can sample directly from the
distribution represented by the model itself.
GANs are known to be hard to train due to several reasons. First, the
formulation from Equation 2.8 can become unstable if the discriminator
learns too quickly. In this case, the loss of the generator saturates before
reaching an equilibrium. Second, GANs suffer from ‘mode collapse’
[38], where the generator can get stuck in a parameter setting in which
it generates only one mode of the data.
Chapter 3
Related Work
Semi-supervised learning is a well-studied topic in the literature [44].
Some of the well-studied semi-supervised learning methods include
self-training [36], generative models [11], graph-based methods [43],
co-training [3] and multi-view training [4]. In self-training algorithms,
the model is bootstrapped with additional labeled data, which is obtained
from highly confident predictions on the unlabeled data. In graph-based
methods, the model tries to propagate label information by connecting
similar observations from labeled
and unlabeled samples. Graph-based approaches are computationally
expensive and limited to small scale problems [10]. The co-training
method is based on the assumption that features can be split into two
conditionally independent sub-feature sets. In co-training, two sep-
arate classifiers are learned based on the sub-feature sets, and these
classifiers are trained to agree upon the labels from unlabeled data as
well as labeled data. In this work, we only focus on deep generative
model-based techniques for semi-supervised learning.
3.1 Deep Generative Models
Recent deep generative models like [38, 32, 6, 8] have emerged as strong
candidates for unsupervised and semi-supervised learning of compli-
cated distributions like images. The learning model needs to discover
the abstract structures hidden within the unlabeled image data. Two
main classes of successful generative models for semi-supervised learn-
ing are Variational Autoencoder (VAE) and Generative Adversarial
Network (GAN).
3.1.1 VAE-based Methods
VAEs [20, 35] are a class of deep generative models that allow us to
learn latent variable generative models for the input data. Kingma
et al. [19] introduced the first successful deep generative model for
semi-supervised learning, but it needs to be coupled with a pretrained
feature extractor to perform well. Recently, several other competitive
VAE based methods [29, 42] have been proposed in the literature for
semi-supervised learning. Most of the VAE-based semi-supervised
methods are limited to simpler datasets like MNIST, SVHN and NORB,
but in this work, we successfully show semi-supervised classification
on real-world data. While the VAE-based semi-supervised learning
methods require pre-training of the autoencoder, our method can be
trained in an end-to-end manner without any preprocessing step on
the network.
3.1.2 GAN-based Methods
Most of the state-of-the-art semi-supervised methods are based on
GANs [23, 24, 40, 38]. In GANs, two neural networks namely, Generator
and Discriminator play a zero-sum game. The learned discriminator
module can be used for the application of semi-supervised learning
[8, 38]. For semi-supervised learning, a slight modification is made to
the output layer of the discriminator to accommodate the extra fake
class. Therefore, the dimension of the classifier output is increased
from K to K + 1, where K is the number of classes. While typical
GAN-based methods try to match the data distribution directly, our
approach aims to match the latent distribution of the autoencoder to a
prior distribution using a GAN.
3.1.3 Hybrid Methods
Recently, numerous hybrid works [1, 7, 8, 25, 32] of GAN and VAE have
been proposed in the literature for generative modeling. They try to
establish a connection between VAE and GAN to simultaneously learn
a good generative model while learning an efficient inference network.
In other words, the hybrid model can perform well on both the tasks of
image generation and latent space modeling. Adversarial Variational
Bayes [34] uses a more general GAN inference framework within a max-
imum likelihood setting. Adversarially Learned Inference [8] is another
framework which uses the GAN framework to approximate maximum
likelihood. One such hybrid approach is Adversarial Autoencoders
[32] (AAE). The AAE model replaces the KL-divergence term in VAEs
[20] with an adversarial training method. In adversarial training, a
discriminator is jointly trained to distinguish between posterior and
prior samples. This method provides a better approach to matching
the latent representation with the prior distribution as compared to
VAE. In our work, we use the learning principles of AAE and extend
it for higher-resolution real-world image data. Several other newer
techniques besides AAE [32] use implicit distributions to learn posterior
approximations. One of the latest hybrid methods, PixelGAN
Autoencoders [31], proposed as an improvement to AAE, captures the data
distribution jointly with the latent code and an autoregressive decoder.
3.1.4 Real-world Applications
There is relatively little work on real-world applications of semi-su-
pervised learning. Recent work on semi-supervised haptic material
recognition [9] is one of the few successful works in robotics using
deep generative models. They used a GAN based approach for semi-
supervised learning from tactile sensory data. In another exception
[28], the authors use semi-supervised learning for object recognition with an
ensemble manifold regularization method. Both methods mentioned
above use low dimensional sensory input data, thus making it easier to
learn a semi-supervised model.
Developing online learning systems is another new emerging area of
research especially in robotics. There are a few related works on online
learning [26] which try to learn multiple tasks by sharing knowledge
among associated tasks. Other application-oriented online learning
works include [5, 37] where the system mines the web to learn visual
concepts and text-based relationships. [5] uses image search engines
to get weak labels for the images. In our work, we treat all the data
fetched from the Internet as strongly labeled.
Our work is the first of its kind to the best of our knowledge, where
a generative modeling based semi-supervised learning method has
been used for a real-world application like object recognition. In this
work, along with semi-supervised object recognition on real-world
image data, we also show a potential application of this method for con-
tinuous learning. Our method is highly relevant for mobile robotics
where robots can independently interact with the objects and improve
their performance without supervision. Our continuous learning setup
is devoid of any interactive web services, in contrast to previous
approaches, and it focuses only on improving the performance of the
same task through unsupervised interactions.
Chapter 4
Methodology
4.1 Adversarial Autoencoders
Adversarial Autoencoder (AAE) is a method for regularizing an au-
toencoder. It imposes a prior distribution on the latent code of the
autoencoder using GANs. The Adversarial Autoencoder also converts
the autoencoder into a probabilistic generative model that allows
sampling. In this chapter, we first discuss the motivation behind using an
Adversarial Autoencoder for our application. In Section 4.1.2, we dis-
cuss the structure of the basic AAE architecture where we explain how
the adversarial network is used to regularize the latent distribution. In
Section 4.1.3, we discuss another variant of the AAE architecture which
is essential to demonstrate its ability to learn the desired latent distribu-
tion in a semi-supervised setting. In the next section, we describe our
approach to scale the semi-supervised AAE method for classification
with high-resolution real-world images.
4.1.1 Motivation
One of the main drawbacks of VAEs is that the KL divergence
term (the regularization term in Eq. 2.7) does not have a closed-form
solution except for a few basic distributions, because it requires access
to the exact functional form of the prior distribution. Training such a
model can also be difficult because backpropagation through the stochastic
hidden units is not directly possible and requires a reparameterization trick
to make the network differentiable.
The Adversarial Autoencoder model drops the KL divergence term
completely by making use of adversarial learning, and the model can
be learned in an end-to-end manner. Since AAE just needs to be able
to sample from the prior distribution, it allows imposing any arbitrary
prior distribution on the output of the neural network by regularizing it
using a GAN network. The AAE architecture can also be used to disentangle
distinct aspects of the data into separate latent variables. This feature
of AAE is further utilized for semi-supervised learning. In the recent
literature, researchers have proposed a lot of different semi-supervised
methods for image classification on various low-resolution datasets like
MNIST and SVHN. Scaling these methods to high-resolution images
is hard due to various challenges involved with GAN networks like
mode-collapsing, also discussed in Section 2.2.2. Although GANs can
accurately model complex distributions, they are known to be challeng-
ing to train due to instabilities caused by the difficult minimax optimization
problem, whereas AAE does not suffer from such challenges because it
uses the GAN to learn a simple distribution on the latent code.
Figure 4.1: Architecture of a basic Adversarial Autoencoder. The ‘+’
and ‘−’ signs show the positive and negative inputs to the adversarial
discriminator network, respectively.
4.1.2 Basic AAE Architecture
Let x be the input and z be the latent code of the autoencoder. Let
p(z) be the prior distribution of the latent code, q(z|x) be the encoding
distribution and p(x|z) be the decoding distribution. Let pdata(x) be the
data distribution and p(x) be the model distribution to be learned. The
aggregated posterior distribution q(z) on the hidden code is defined
by the encoding network q(z|x) as follows:
q(z) = \int_{x} q(z|x)\, p_{data}(x)\, dx \qquad (4.1)
AAE is a modified autoencoder, where the latent code is regularized
by matching the above (Eq. 4.1) aggregated posterior distribution to the
prior distribution p(z) using the adversarial network.
Figure 4.1 schematically shows how the AAE works with a Gaussian
prior on the latent code. The top structure in the network is a standard
autoencoder that reconstructs the image x from the latent code z. The
autoencoder is trained using the standard reconstruction loss function:

\mathcal{L} = \frac{1}{2M} \sum_{i=1}^{M} \lVert \mathbf{x}_i - \hat{\mathbf{x}}_i \rVert_2^2, \qquad (4.2)

where M is the total number of input samples, \mathbf{x}_i is the original data
and \hat{\mathbf{x}}_i is its reconstruction. The bottom structure is another network that
discriminates whether a sample is taken from the latent code of the
autoencoder or if it is sampled from the distribution p(z) specified by
the user. The discriminator receives z from the encoder q(z|x) and z′
sampled from the true prior distribution. The discriminator is trained
to distinguish between generated z and sampled z′ using the following
loss function:
\mathcal{L}_{D_z} = -\frac{1}{m} \sum_{k=1}^{m} \left[ \log D(z'_k) + \log\left(1 - D(z_k)\right) \right], \qquad (4.3)
where m is the minibatch size, D is the discriminator network. In Eq.
4.3, D(·) is trained with the sigmoid cross-entropy loss function with
label 1 for the true sample and 0 for the generated sample.
Then, the generator (encoder network) is updated using the follow-
ing loss function:
\mathcal{L}_G = -\frac{1}{m} \sum_{k=1}^{m} \log D(z_k) \qquad (4.4)
The two loss functions above counteract
each other. As a result of careful training, the discriminator learns to
recognize fake (generated) samples and the generator learns to fool the
discriminator. Thus, we learn a good model distribution p(x) and ag-
gregated posterior distribution q(z). Optimization of this AAE network
involves three objective functions as described above. We do not train
with all the objective functions simultaneously, but rather alternate
between them for each mini-batch training process.
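The following TensorFlow sketch illustrates this alternating scheme with small fully-connected stand-in networks; the layer sizes, optimizer settings and latent dimension are illustrative assumptions, not the settings of this thesis (the convolutional networks actually used are described in Section 4.1.5).

import tensorflow as tf

latent_dim = 8  # illustrative latent size

# Fully-connected stand-ins for the encoder, decoder and discriminator.
encoder = tf.keras.Sequential([tf.keras.layers.Dense(256, activation="relu"),
                               tf.keras.layers.Dense(latent_dim)])
decoder = tf.keras.Sequential([tf.keras.layers.Dense(256, activation="relu"),
                               tf.keras.layers.Dense(784, activation="sigmoid")])
disc = tf.keras.Sequential([tf.keras.layers.Dense(256, activation="relu"),
                            tf.keras.layers.Dense(1)])  # outputs a logit

opt_ae = tf.keras.optimizers.Adam(1e-4)
opt_d = tf.keras.optimizers.Adam(1e-4)
opt_g = tf.keras.optimizers.Adam(1e-4)
bce = tf.keras.losses.BinaryCrossentropy(from_logits=True)

def train_step(x):
    n = tf.shape(x)[0]
    # Phase 1, reconstruction (Eq. 4.2, here in rescaled mean form):
    # update encoder and decoder.
    with tf.GradientTape() as tape:
        loss_rec = 0.5 * tf.reduce_mean(tf.square(x - decoder(encoder(x))))
    v = encoder.trainable_variables + decoder.trainable_variables
    opt_ae.apply_gradients(zip(tape.gradient(loss_rec, v), v))

    # Phase 2, regularization, discriminator step (Eq. 4.3): a sample z'
    # from the prior is "real" (label 1), the encoder output is "fake" (0).
    z_prior = tf.random.normal((n, latent_dim))
    with tf.GradientTape() as tape:
        loss_d = (bce(tf.ones((n, 1)), disc(z_prior)) +
                  bce(tf.zeros((n, 1)), disc(encoder(x))))
    v = disc.trainable_variables
    opt_d.apply_gradients(zip(tape.gradient(loss_d, v), v))

    # Phase 3, regularization, generator step (Eq. 4.4): update the
    # encoder so that its latent codes fool the discriminator.
    with tf.GradientTape() as tape:
        loss_g = bce(tf.ones((n, 1)), disc(encoder(x)))
    v = encoder.trainable_variables
    opt_g.apply_gradients(zip(tape.gradient(loss_g, v), v))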
4.1.3 Learning Latent Distributions
Figure 4.2: Adversarial Autoencoder architecture with extra regulariza-
tion using label information.
Although with the basic AAE architecture it is possible to impose any
distribution on the latent code, we require an extra regularization to get
a class separation in the latent distribution. This extra regularization
requires label information. With the AAE approach this regularization
is realizable even when only a few labeled samples are available. Fig-
ure 4.2 shows how the label information can be leveraged to strictly
regularize the distribution of the latent code. In this network, the one-
hot vector is given as input to the discriminator network to select the
mode of the corresponding class from the prior distribution. For la-
beled samples, the true associated one-hot vector is given as input to
the discriminator network; for unlabeled samples, an extra one-hot
vector in which the 11th category is switched on is given as input
instead. Thus, the discriminator can infer whether the
input comes from the labeled sample or from the unlabeled sample. As
a result of this model, the network learns to map the unlabeled samples
to the mode corresponding to the true class in the latent distribution.
This experiment helps us evaluate the complexity of the dataset in a
semi-supervised setting. Since this is a semi-supervised approach to
learn a latent distribution, we consider this model as a preliminary step
for the success of the semi-supervised classification.
We performed various experiments with this model on several
datasets to evaluate the proportion of labeled samples that is suffi-
cient for semi-supervised learning. The exact experimental setup and
results are discussed in Section 5.2. In the results section, we also visu-
alize the latent distribution in 2-dimensions. This model is also trained
with three objective functions as described in Section 4.1.2 using the
same alternating procedure.
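As a minimal sketch, the label conditioning described above can be realized by appending a one-hot vector to the latent code before it enters the discriminator; the helper name below is hypothetical, and concatenation is one possible way of providing the label input:

import numpy as np

def discriminator_input(z, label=None, num_classes=10):
    # Append a one-hot vector of length num_classes + 1 to the latent code;
    # the extra (11th) entry is switched on for unlabeled samples.
    one_hot = np.zeros(num_classes + 1)
    one_hot[num_classes if label is None else label] = 1.0
    return np.concatenate([z, one_hot])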
4.1.4 Semi-supervised AAE
In Section 4.1.3, we studied how the latent distribution can be imposed
in a semi-supervised manner. In this section, we show how these
architectures can be combined to formulate a semi-supervised AAE
architecture for classification.
We assume that the information in each image can be decomposed into,
and reconstructed from, two sets of independent components, namely, style
and class-label information. A continuous latent distribution can cap-
ture the style information, and a categorical latent distribution can cap-
ture class-label details. The network is designed such that the encoding
network can predict the class variable and continuous style variable
using labeled as well as unlabeled data, and the decoder network can
reconstruct the input image using both latent variables. Two separate
adversarial networks regularize the style and class-label hidden repre-
sentations, thus ensuring that the two codes carry independent information,
as shown in Figure 4.3.
In this architecture, let p(z) and p(y) be the continuous prior dis-
tribution on the style part of the latent code and the categorical prior
distribution on the class-label part of the latent code, respectively. This
semi-supervised AAE architecture comprises three different types of
modules that work in conjunction with each other, namely, the autoencoder,
adversarial and classification modules:
Autoencoder Module
In AAE, q(z|x) is a probabilistic encoder approximating the true poste-
rior distribution p(z|x), and p(x|z) is a generative decoder. We use the
standard reconstruction loss as written in Eq. 4.2. The autoencoder module
is an important part of the semi-supervised architecture for learning
the data representation from unlabeled samples.
Figure 4.3: Semi-supervised Adversarial Autoencoder architecture for
object recognition. The blue-highlighted box indicates the autoencoder
module, the red box corresponds to the adversarial module comprising
two adversarial networks with a common generator, and the green-
highlighted box corresponds to the classification module.
Adversarial Module
The adversarial module comprises two adversarial networks with a
common generator network, one capturing style and the other class-label
information. This module is trained with unlabeled data. As discussed
in Section 2.2.2, the generator network of a GAN tries to mimic the
examples from the training data. Typically in GANs, the generator does
this by transforming a random noise sample, which is given as input,
into a synthetic or fake sample. In most image-based GAN applications,
the generator network is an expansion-type network, and the generated
synthetic samples are images. In contrast to typical image-based GANs,
the generator network in AAE is a compression-type network that
produces the synthetic sample by transforming the image input into
a low-dimensional latent vector. In AAE, the encoder network of the
autoencoder acts as the generator network during adversarial training,
and a separate discriminator network distinguishes between real and
fake latent codes.
The aggregated posterior distribution of the latent code, q(z), is
matched to an arbitrary prior, p(z), using an adversarial network as
illustrated in Figure 4.1. We use the term ‘posterior’ synonymously with
‘aggregated posterior’, defined in Eq. 4.1, for the rest of the report for
convenience. In the semi-supervised AAE architecture, two separate
adversarial networks regularize style and class-label information of the
latent code. The first adversarial network ensures that the class-label
part (y) of the latent code does not carry any style information and the
aggregated posterior distribution of y matches the Categorical distri-
bution Cat(y). We assume that once the label information is removed,
the remaining information can be captured using a continuous distribu-
tion. Therefore, the second adversarial network imposes a continuous
distribution p(z) on the style part, (z), of the latent code.
For learning the adversarial discriminator network for the continuous
distribution, the loss function of Eq. 4.3 is used. Similarly, for learning
the second adversarial discriminator for the categorical distribution, we
use the following loss function:
\mathcal{L}_{D_y} = -\frac{1}{m} \sum_{k=1}^{m} \left[ \log D(y'_k) + \log\left(1 - D(y_k)\right) \right] \qquad (4.5)
The generator loss can be combined for both adversarial networks as:

\mathcal{L}_G = -\frac{1}{m} \left( \sum_{k=1}^{m} \log D(z_k) + \sum_{k=1}^{m} \log D(y_k) \right), \qquad (4.6)
where m is the size of the mini-batch training data and D(·) is the
discriminator network.
Classification Module
The classification module is the same as a classical convolutional neural
network as discussed in Section 2.1.1. With labeled data, the encoder
part of the autoencoder is used as the classification network with the
categorical output. We use the softmax cross-entropy loss for training
the classification network.
All four sub-networks of the AAE model, namely the two adversarial
networks, the autoencoder network and the classification network, are
trained synchronously in an end-to-end manner in three phases: the
reconstruction phase, the regularization phase and the semi-supervised
classification phase. We further discuss the training procedure in detail
in Section 5.3.1.
4.1.5 Convolutional Semi-supervised AAE Architecture
Figure 4.4: Semi-supervised Adversarial Autoencoder architecture for
object recognition. The ‘+’ and ‘−’ signs show the positive and negative
inputs to the adversarial discriminator network, respectively. The corre-
sponding color coding for different layers is shown in the right column.
The size of the filter and number of filters are mentioned on the top and
bottom of the layer, respectively. Usage of special activation functions
(if used) is mentioned in between two layers. The connection between a
convolution and fully-connected layer includes reshaping of the input
layer (not shown in the diagram).
The convolutional AAE architecture is obtained after an extensive
search over several hyperparameters: type of architecture, number
of layers, type of convolutional and upconvolutional layers, normal-
ization techniques, activation functions and loss functions. Figure 4.4
shows the resulting architecture. Since the AAE is a modular
approach, we divide and explain
the architecture for each module separately. Appendix Tables B.1, B.2
and B.3 show the convolutional semi-supervised AAE architecture with
details for different sub-networks separately. The training procedure
is discussed in detail in Section 5.3.1, where we discuss how these dif-
ferent sub-networks are optimized in an alternating manner to perform
semi-supervised object recognition.
Autoencoder Network
The autoencoder network, shown in Table B.2, is a standard convolutional
autoencoder. The encoder network of the autoencoder is the
same as the classification network shown in Table B.3, with an additional
output for the style-latent code from FC-1, as shown in Figure 4.3. The
decoder network gets a concatenated input from both latent parts.
In the decoder, we use the transposed convolution operation, upscaling
the feature-map size by a factor of two at each step. The output
layer has the sigmoid activation function to match the scale of the input
sample. The dimensionality of the label representation is 10, and for the
style representation, we use 30 dimensions for the real-world datasets
and 10 dimensions for the MNIST dataset.
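The following Keras-style sketch illustrates such a decoder for the real-world datasets; the filter counts and kernel sizes are illustrative assumptions (the exact values are listed in Appendix Table B.2), but the structure follows the description above: a concatenated 10 + 30 dimensional latent input, transposed convolutions doubling the spatial size from 4 × 4 up to 64 × 64, and a sigmoid output.

import tensorflow as tf
from tensorflow.keras import layers

decoder = tf.keras.Sequential([
    tf.keras.Input(shape=(40,)),  # concatenated 10-D label + 30-D style code
    layers.Dense(4 * 4 * 256),
    layers.Reshape((4, 4, 256)),
    # Each transposed convolution doubles the size: 4 -> 8 -> 16 -> 32 -> 64.
    layers.Conv2DTranspose(128, 5, strides=2, padding="same", activation="relu"),
    layers.Conv2DTranspose(64, 5, strides=2, padding="same", activation="relu"),
    layers.Conv2DTranspose(32, 5, strides=2, padding="same", activation="relu"),
    # Sigmoid output matches the [0, 1] scale of the input images.
    layers.Conv2DTranspose(3, 5, strides=2, padding="same", activation="sigmoid"),
])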
Classification Network/Adversarial Network: Generator
We implemented various competitive convolutional neural network
architectures for the encoding network, such as DenseNet [17], the VGG
network [39] and AlexNet [21]. We found the VGG-type network to be the most
effective for our application. Table B.3 shows the architecture of the
encoding network which is also used for semi-supervised classification
on the test set. This network is also used as a generator network during
adversarial training. The architecture is designed for an input image
of 64 × 64 × 3. It contains exactly seven convolutional layers followed
by two fully-connected layers for classification, with the softmax
activation function as the final operator. Each convolutional layer is
followed by batch normalization over the mini-batch and a Leaky-ReLU
[30] non-linear activation function. The max-pooling operations reduce
the spatial size from 64 × 64 to 4 × 4. The
network is trained with different learning rates at different phases of
training, which is discussed in detail in Section 5.3.1.
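The following Keras-style sketch illustrates a VGG-type encoder of this shape; the number of blocks and the filter widths are illustrative assumptions, not the exact configuration of Appendix Table B.3.

import tensorflow as tf
from tensorflow.keras import layers

def conv_block(filters):
    # Convolution followed by batch normalization and Leaky-ReLU.
    return [layers.Conv2D(filters, 3, padding="same"),
            layers.BatchNormalization(),
            layers.LeakyReLU()]

encoder = tf.keras.Sequential(
    [tf.keras.Input(shape=(64, 64, 3))]
    + conv_block(64) + [layers.MaxPool2D()]    # 64 x 64 -> 32 x 32
    + conv_block(128) + [layers.MaxPool2D()]   # 32 x 32 -> 16 x 16
    + conv_block(256) + [layers.MaxPool2D()]   # 16 x 16 -> 8 x 8
    + conv_block(256) + [layers.MaxPool2D()]   # 8 x 8 -> 4 x 4
    + [layers.Flatten(),
       layers.Dense(1024), layers.LeakyReLU(),
       layers.Dense(10, activation="softmax")])  # class-label output y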
Adversarial Network: Discriminator
The discriminator sub-network of the adversarial network, shown
in Table B.1, is a regular neural network with two fully connected
layers of 1024 hidden units each. This discriminator distinguishes the
low-dimensional latent code from a vector sampled from the low-
dimensional (≤ 30) prior distribution. The last layer has a sigmoid
activation function such that we can use the sigmoid cross-entropy loss
function.
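A minimal Keras-style sketch of this discriminator follows; the hidden-layer activation is an assumption (Leaky-ReLU, matching the rest of the architecture), since only the layer sizes and the sigmoid output are specified above.

import tensorflow as tf
from tensorflow.keras import layers

discriminator = tf.keras.Sequential([
    layers.Dense(1024), layers.LeakyReLU(),
    layers.Dense(1024), layers.LeakyReLU(),
    # Sigmoid output: prior sample (real) vs. latent code (fake).
    layers.Dense(1, activation="sigmoid"),
])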
Chapter 5
Experiments and Results
In this chapter, we first describe our datasets and their compilation
procedure in Section 5.1. Subsequently, we discuss the results of all the
experiments performed using different AAE architectures for learning
latent distributions and semi-supervised classification in Section 5.2
and Section 5.3 respectively.
5.1 Datasets
We tested our convolutional adversarial autoencoder model on three
different datasets, namely the MNIST, Internet (WIS) and Real-world
(RW) datasets, for semi-supervised learning.
5.1.1 MNIST Dataset
MNIST is a database of handwritten digits. It contains 50,000 training
samples and 10,000 testing samples. It includes grayscale images of
resolution 28 × 28, where each digit is centered. It is a well-known
dataset for experimenting with and analyzing new learning techniques
in the machine learning community.
5.1.2 Internet Dataset
As we discussed in Chapter 1, collecting and annotating a large dataset
is an expensive and time-consuming procedure. Therefore, we use a
web-based image search engine to collect this real-world image dataset.
We call this internet dataset the ‘WIS’ dataset since it is fetched using the
web-based image search engine. Using the Internet, it is possible to
collect a very diverse and abundant amount of data from different
sources, scales, domains, etc. Figure 5.1 gives a glimpse of the WIS
dataset.
Figure 5.1: Visualization of a few samples from the Internet Dataset. This
dataset is fetched using a reverse image search engine.
We collect images belonging to 10 categories of objects. We only
consider small household objects for this dataset, namely: banana,
bottle, bowl, calculator, can, cup/mug, orange, scissors, soccer ball,
watering-can. To reduce the human effort for collecting and annotating
the images, we fetch images from the web using reverse image search
engine. Using this technique, we can obtain around 100 good quality
images for each image-query. For collecting this dataset, we select 40
exemplar images for each object category and query the web using
these selected images. This downloading step is followed by a filtration
process (discussed in Section 5.1.4) to remove unnecessary images from
the raw dataset. After filtering, this dataset contains approximately
24K labeled images. We split the dataset 80:20 for training and testing
respectively. Figure 5.2a shows more statistics about the WIS dataset.
Figure 5.2: (a) shows the number of images per class in the WIS dataset.
(b) shows the number of videos collected per class for the RW dataset.
5.1.3 Real-world Dataset
The Real-world (RW) dataset is another object recognition dataset com-
prising everyday objects. The RW object dataset consists of more
than 200 daily household objects. The objects are categorized into ten
classes (same as WIS dataset): banana, bottle, bowl, calculator, can,
cup/mug, orange-fruit, scissors, soccer-ball, watering-can. This dataset
is collected from video streams using hand-held cellphone cameras and
then later sampled to fetch image frames. The dataset contains 836
video streams of these 200 object instances. The data was captured by
different persons with some minimal instructions and no prior knowl-
edge about our work. Therefore, the resolution and quality of the video
data vary across different video streams, but they are all captured
at a common frequency of 30 Hz. In each video, the camera operator
moves around the object in a slow and random motion to capture the
object from different angles, at various scales, and with different back-
grounds. The data is captured in a very natural setting and is highly
diverse with respect to illumination, clutter around the object, and the
distance of the object from the camera. Figure 5.3 gives a glimpse of the RW
dataset.
Figure 5.3: Visualization of a few samples from the Real-world Dataset,
which is captured using hand-held cameras.
This dataset contains at least 36 videos for each object category. The
average duration of the videos is approximately 8 seconds. We sample
frames from each video at a rate of 6 Hz for our application. Before
using the data for learning the task, each sampled frame passes through
a filter that removes noisy frames not containing any of the 10
given object categories. We obtain approximately 22K image frames
equally distributed over all the classes. Figure 5.2b shows the dataset
statistics: number of videos captured per class.
5.1.4 Preprocessing
Both the WIS and RW datasets contain many noisy images, in which none
of the 10 categories is present in the image frame. We propose an
automatic filtering approach to remove these noisy images from the
dataset. This filtering process is discussed in detail below.
Filtering the Dataset
Web images and their labels are easier to obtain, but directly
training on them can result in underperformance due to the presence of
noisy web-query results and noisy labels. This can adversely affect the
precision of the manifolds learned from the unlabeled data and also the
semi-supervised classification performance. Therefore, we need to filter
the web query results before training any model using them. We use an
image retrieval technique [41], which builds a precise image descriptor
for object retrieval. This method encodes several image regions into a
single feature vector without feeding multiple inputs to the network.
We use this feature vector to filter the query results by matching the
cosine distance between the corresponding feature vectors. We use the
cosine distance d, averaged over all the exemplar images
from that class:
\[
d = \frac{1}{N} \sum_{i=1}^{N} \cos(x_i, x_j), \tag{5.1}
\]
where x_i is the feature vector of the i-th exemplar query image and x_j is
the feature vector of the j-th web-query result image.
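As a concrete illustration, the following sketch implements Equation 5.1 on precomputed descriptors. It assumes the [41]-style feature vectors are already available as NumPy arrays; the rejection threshold tau is a hypothetical parameter, not a value reported in this work.

```python
import numpy as np

def filter_query_results(exemplar_feats, result_feats, tau):
    """Keep web-query results whose averaged cosine similarity to the
    N exemplar descriptors of a class exceeds a threshold tau.

    exemplar_feats: (N, D) descriptors of the exemplar images.
    result_feats:   (M, D) descriptors of the web-query results.
    Returns the indices of the accepted results.
    """
    # L2-normalize so that dot products equal cosine similarities.
    ex = exemplar_feats / np.linalg.norm(exemplar_feats, axis=1, keepdims=True)
    res = result_feats / np.linalg.norm(result_feats, axis=1, keepdims=True)
    # d_j = (1/N) * sum_i cos(x_i, x_j), as in Equation 5.1.
    d = (res @ ex.T).mean(axis=1)   # average over the N exemplars
    return np.where(d > tau)[0]
```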
This filtering technique is used for cleaning both WIS and RW
datasets. As a result of this cleaning procedure, approximately 10%
of the query results are rejected. Figure A.1 shows some qualitative
results of this filtering process on the WIS data distribution. Since these
datasets are not annotated for each sample individually, and our pro-
posed method does not guarantee 100% correct filtering, they contain a
small amount of label impurity.
5.2 Learning of the Latent Distribution
Figure 5.4: Visualization of the latent space learned using the unsupervised
basic AAE architecture. (a) A 2-D Gaussian distribution is used as
the prior distribution on the MNIST dataset. (b) A mixture of ten 2-D
Gaussians is used as the prior distribution on the MNIST dataset.
(c) A 2-D Gaussian distribution is used as the prior distribution
on the WIS dataset. (d) A mixture of ten 2-D Gaussians is used
as the prior distribution on the WIS dataset.
In this section, we test the ability of the AAEs to learn different la-
tent distributions using unsupervised, supervised and semi-supervised
learning. We first show results obtained with the basic AAE architec-
ture in an unsupervised setting for the MNIST dataset. In the next
experiments, we demonstrate the ability of the convolutional AAE
method to learn an arbitrary latent distribution using supervised and
semi-supervised learning for real images.
Experiment-1: Unsupervised Learning
It is possible to learn a visually comprehensible latent space for the
MNIST dataset without using any label information with the basic AAE
architecture (Figure 4.1). We can see that the samples of different
classes are mapped to individual clusters in the latent space, as shown in
Figures 5.4a and 5.4b. However, this unsupervised AAE model fails to
produce such a 2-D visualization for real-image datasets because of their
complex data distributions, as shown in Figures 5.4c and 5.4d. In these
figures (5.4c and 5.4d), all the class mappings overlap with each other.
This shows that the method can learn simple distributions like the MNIST
data distribution in an unsupervised manner, but it fails in the case of
complex distributions like the real-world WIS dataset. In the next
experiments, we try to learn the latent distributions using the semi-
supervised learning model for such complex data distributions.
Figure 5.5: Visualization of the latent space on the WIS dataset. In this
experiment, we leverage the label information to better regularize the
latent space. This model is trained using all labeled samples. (a) The
mixture of ten 2-D Gaussians imposed as the prior distribution on the
latent code. (b) The posterior distribution of the latent space on the
training data. (c) The posterior distribution of the latent space on the
testing data.
Experiment-2: Supervised Learning
This experiment uses the AAE architecture shown in Figure 4.2, where
label information can be utilized in a supervised learning setup to regu-
larize the latent distribution. We assume a mixture of ten 2-D Gaussians
as the prior distribution, where each Gaussian represents the
distribution of a separate class. Figure 5.5 shows the prior and posterior
distributions of the latent code when 100% of the label information is used.
These results indicate that it is possible to shape the latent distribution
using the proposed convolutional encoder with an adversarial training
procedure.
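For reference, a sampler for such a prior could look as follows. Following the original AAE work [32], we assume the ten components are placed uniformly on a circle; the radius and standard deviation used here are illustrative values, not the settings of this thesis.

```python
import numpy as np

def sample_mixture_prior(labels, radius=4.0, std=0.5, n_classes=10):
    """Sample 2-D latent codes from a mixture of ten Gaussians,
    one component per class, arranged uniformly on a circle.

    labels: (batch,) integer class labels selecting the component.
    """
    angles = 2 * np.pi * labels / n_classes
    # Component means lie on a circle of the given radius.
    means = radius * np.stack([np.cos(angles), np.sin(angles)], axis=1)
    # Isotropic Gaussian noise around the selected component mean.
    return means + std * np.random.randn(len(labels), 2)
```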
Figure 5.6: Experiment-2. Visualization of the posterior distribution of
the latent space on the WIS dataset. In this experiment, we leverage the
label information for better regularization, as shown in Figure 4.2. Figures
(a) and (d) show the learned latent distribution on the training and test set,
respectively, when the model is trained with 50% labeled samples and
50% unlabeled samples. (b) and (e) show the latent distribution learned
on the training and testing set, respectively, when the model is
trained with 20% labeled samples and 80% unlabeled samples. Simi-
larly, (c) and (f) show the latent space distribution when trained with only
10% labeled samples and 90% unlabeled samples.
Experiment-3: Semi-supervised Learning
In this experiment, we test the same architecture in a semi-supervised
setting where only a proportion of labels is available along with unla-
beled samples. The only difference is that there are 11 categories in the
one-hot input vector, where the 11th category is switched on when the
label of the input is unknown (see the encoding sketch at the end of this
section). Figure 5.6 shows the posterior distribu-
tion of the 2-D latent representation code for different ratios of labeled
to unlabeled data. We perform this experiment for three different ratios,
where 50%, 20%, and 10% of the samples from the whole dataset are labeled.
The qualitative result (Figure 5.6f) shows that it is possible to achieve
a visually discernible distribution even with only 10% labeled data.
The architecture is able to correctly map the unlabeled samples to the
right mode of the class distribution using limited label information.
This experiment verifies the feasibility of this AAE method for semi-
supervised learning on real-world data. As expected, the quality of the
posterior latent distribution degrades with lower ratios of labeled to
unlabeled samples, as shown in Figures 5.6d to 5.6f. We consider
this experiment to be a preliminary step before using this method for
semi-supervised classification on a new dataset.
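For clarity, the 11-way one-hot encoding mentioned above can be sketched as follows; the helper name and the convention of passing None for unlabeled samples are ours.

```python
import numpy as np

UNKNOWN = 10  # the 11th category, switched on for unlabeled samples

def encode_label(label, n_classes=10):
    """One-hot encode a class label, or the 'unknown' flag if label is None."""
    one_hot = np.zeros(n_classes + 1, dtype=np.float32)
    one_hot[UNKNOWN if label is None else label] = 1.0
    return one_hot
```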
5.3 Semi-supervised Classification
In this section, we demonstrate the performance of the semi-supervised
adversarial autoencoder architecture, described in Section 4.1.4, for
object recognition. First, we discuss the implementation details of
the semi-supervised classification experiment, mainly concerning the
non-trivial training procedure. Then, we analyze the dynamics of the
learning procedure with the help of all the loss curves. Finally, we
show the object recognition results on all the datasets. We compare
the results of our semi-supervised method with a competitive CNN
baseline method.
5.3.1 Implementation Details
The exact details of the neural network architecture used for semi-
supervised classification are shown in Tables B.1, B.2 and B.3. As
mentioned in Section 4.1.4, the semi-supervised AAE classification
model is concurrently learned in three phases: the reconstruction phase,
the regularization phase, and the semi-supervised classification phase.
Figure 5.7: Training procedure of the semi-supervised AAE divided into
three phases: the reconstruction phase (I), the regularization phase (II),
and the semi-supervised classification phase (III). The yellow box indicates
the encoder (generator) network, the red box indicates the decoder network,
and the green boxes indicate the discriminator network for adversarial
training.
Training Procedure
Training this combination of networks is a little tricky and requires care-
ful calibration of learning rates for optimal results. The semi-supervised
AAE model is trained in three phases with different loss functions as
discussed in Section 4.1.3. A phase comprises a single mini-batch train-
ing step. Figure 5.7 shows the training process in its different phases;
the active modules in the figure are highlighted in red for each
training phase.
The training procedure runs for 500 epochs with diminishing learn-
ing rates. The objective functions are trained using the Adam optimizer
with a batch size of 50. The initial learning rates are different for dif-
ferent phases of training as mentioned in the architecture tables. The
learning rate is reduced by a factor of 10 at 100 epochs and then by a
further factor of 10 at 300 epochs for all training phases. In the implementation,
this three-phase model is trained jointly in two iteration steps: the first two
phases (Phase-I and Phase-II) are trained together in a single iteration
step, and the third phase is trained in the next iteration step. These two
training iteration steps are repeated alternately, as sketched below.
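A schematic of this alternating procedure is given below. The loss and optimizer objects are assumed to be constructed elsewhere (the names and the dictionary-based interface are ours); only the phase ordering and the step-decay schedule follow the description above.

```python
def train_semi_supervised_aae(loader, losses, optimizers, n_epochs=500):
    """Alternating two-step training of the three-phase AAE model.

    losses:     dict of callables 'recon', 'adv', 'cls'; each maps a
                mini-batch to a scalar loss tensor (assumed given).
    optimizers: dict of torch.optim.Adam instances 'ae', 'adv', 'cls'.
    """
    for epoch in range(n_epochs):
        # Step decay: divide all learning rates by 10 at epochs 100 and 300.
        if epoch in (100, 300):
            for opt in optimizers.values():
                for group in opt.param_groups:
                    group["lr"] /= 10.0
        for batch in loader:
            # Iteration step 1: Phase-I (reconstruction) and Phase-II
            # (adversarial regularization) trained together.
            for phase, opt_name in (("recon", "ae"), ("adv", "adv")):
                optimizers[opt_name].zero_grad()
                losses[phase](batch).backward()
                optimizers[opt_name].step()
            # Iteration step 2: Phase-III, semi-supervised classification.
            optimizers["cls"].zero_grad()
            losses["cls"](batch).backward()
            optimizers["cls"].step()
```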
The data samples are scaled to the range 0 to 1. We use dropout
at the fully-connected layers with a dropout rate of 50%. No other
dropout or Gaussian-noise regularization is used in any other layer.
The labeled examples are chosen at random, but it is ensured that they
are evenly distributed across all the classes (see the sketch below). The
unlabeled examples belong to one of the ten classes. Batch normalization
is used in all the convolutional layers.
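The class-balanced selection of labeled examples can be implemented as a simple stratified draw; a sketch (names are ours) follows:

```python
import numpy as np

def sample_balanced_labeled_set(labels, n_labeled, n_classes=10, seed=0):
    """Randomly pick n_labeled sample indices, evenly split across classes."""
    rng = np.random.RandomState(seed)
    per_class = n_labeled // n_classes
    chosen = []
    for c in range(n_classes):
        idx = np.where(labels == c)[0]                      # samples of class c
        chosen.extend(rng.choice(idx, size=per_class, replace=False))
    return np.array(chosen)
```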
Learning Curves
The training and testing classification-loss curves, shown in Figure
5.8b, converge well, similar to a fully supervised learning procedure, and
do not suffer from overfitting even with only 500 labeled samples. The
model learns to perform as well as the supervised model (with the
same number of labeled samples) within the first 20 epochs. Since this ar-
chitecture trains with four different loss functions concurrently, some
care is required while training the model. Learning rates are kept
very low to allow convergence of each training loss. We can observe
in Figure 5.8a that the accuracy increases slowly over 500 epochs. We
can also verify, in Figure 5.8e, that the adversarial training is conducted
successfully. The blue and green curves in Figure 5.8e denote the
adversarial-discriminator loss functions. Their average values are con-
stant around 2 ln 2, which means the discriminator is fairly confused
between samples generated by the encoder and samples picked from
the prior distributions. In Figure 5.8f, we observe that the number of
samples produced for each class is approximately the same; this
verifies that the generator of the categorical adversarial network learns
an unbiased model. It also shows that this adversarial training pro-
cedure does not suffer from challenges like mode collapse. The
reconstruction loss (Figure 5.8c) decreases until 150 epochs and then
remains constant for the rest of the training process.
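The value 2 ln 2 follows directly from the discriminator's cross-entropy objective: writing D for the discriminator and q(z|x) for the encoder's output code (notation introduced here for this derivation only), a maximally confused discriminator assigns probability 1/2 to both prior samples and generated codes, so

\[
\mathcal{L}_D = -\ln D(z_{\text{prior}}) - \ln\bigl(1 - D(q(z \mid x))\bigr)
= -\ln\tfrac{1}{2} - \ln\tfrac{1}{2} = 2\ln 2 \approx 1.386.
\]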
To achieve the best performance for this semi-supervised model, the
model should be trained over 1000 epochs with diminishing learning
rates. In this work, all our experiments are limited to only 500 epochs.
Figure 5.8: These figures show different loss values and their behavior
as learning progresses. These figures are obtained from the model
trained with 1000 labeled samples and 15000 unlabeled samples on
the WIS dataset. (a) The semi-supervised classification accuracy, (b)
the cross-entropy classification loss for training (magenta) and testing
data (blue), (c) the autoencoder reconstruction loss, (d) the adversarial
generator loss curve, (e) the adversarial discriminator loss for class-
label (green) and style (blue) latent codes, (f) the output frequency of
the categorical generator for each class.
Importance of Latent Distributions
Imposing a categorical distribution on the output of the encoder helps
in making confident decisions about the class label of the inputs. This
ensures that the latent code y does not carry any continuous style
information; such information is captured only by the second part of the
latent code. The adversarial regularization using the categorical distribution
also ensures that the output y of the encoder follows a uniform distribution
over all the labels. This result can be observed in Figure 5.8f, where each
colored curve corresponds to the number of samples generated for each
class in one epoch.
Imposing a continuous distribution on the output of the encoder
captures the remaining non-class-label information, which is termed
style information. This disentanglement of style information can be easily
observed on the MNIST dataset [32], but it is difficult to comprehend
visually for high-resolution real-world images. From our exper-
iments, we found that, in the case of the MNIST dataset, it is not possible
to learn anything without a continuous prior on the style distribution.
According to our investigation, for real image data, this regularization
on the latent code helps in stabilizing the semi-supervised training
procedure. It also improves the training speed and alleviates overfitting
of the AAE network.
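As an illustration of this two-part latent code, a minimal PyTorch sketch of the encoder's output heads is given below, matching the dimensions in Table B.2 (10-way softmax for y, 30-D linear code for z). The class and variable names are ours, and for simplicity both heads branch from the same feature vector, whereas in Table B.2 they branch at different depths.

```python
import torch
import torch.nn as nn

class LatentHeads(nn.Module):
    """Maps encoder features to the categorical code y and style code z."""
    def __init__(self, in_features=512, n_classes=10, style_dim=30):
        super().__init__()
        self.to_y = nn.Linear(in_features, n_classes)  # class-label code
        self.to_z = nn.Linear(in_features, style_dim)  # continuous style code

    def forward(self, h):
        y = torch.softmax(self.to_y(h), dim=1)  # categorical distribution
        z = self.to_z(h)                        # linear mapping (default)
        return y, z
```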
5.3.2 Object Recognition Results
Here, we show the performance of our convolutional semi-supervised
AAE method on all three datasets. We compare the semi-supervised
classification results of our method with a competitive supervised CNN
baseline method.
CNN Baseline Method
In the following section, we compare our semi-supervised classifica-
tion results with a Convolutional Neural Network (CNN) baseline
method. Our CNN baseline network is a VGG-type 9-layer network
with 7 convolutional layers and 2 fully-connected layers. The complete
architecture of the CNN baseline network is shown in Table B.3. The
baseline CNN method is also trained for 500 epochs, with the learning
rate reduced after 100 and 300 epochs. The learning rates are
mentioned in the hyperparameter section of Table B.3.
Method                  MNIST (100)   MNIST (1000)   MNIST (All)
NN Baseline             33.21         8.60           1.70
VAE [20]                3.33          2.40           0.96
AAE-NN [32]             2.92          2.40           1.50
Conv-AAE [ours]         2.81          1.75           0.53
State-of-the-art [15]   0.89          0.74           0.36
Table 5.2: Semi-supervised classification performance (error rate in %)
on the MNIST dataset. The numbers in (·) indicate the number of labeled
samples used in each case, with 50,000 unlabeled samples. The category
‘All’ denotes that all 50,000 labeled samples were used to train the AAE
model. The ‘NN Baseline’ is a regular neural network architecture made
up of two hidden layers with 1024 hidden units each.
We also tested other standard CNN architectures, such as AlexNet [21]
and DenseNet [17], as the CNN baseline, but we found the VGG-type
network to be the most effective. Table 5.1 shows the comparison
between the different networks.
Method      Depth       CNN Accuracy
VGG-type    9 layers    79.19
AlexNet     7 layers    66.89
DenseNet    10 layers   65.26
Table 5.1: CNN baseline performance with different standard network
architectures. This performance is based on only 1000 labeled samples
from the WIS dataset.
MNIST Dataset
We performed our first experiments on the standard MNIST dataset
to verify the correctness of our implementation of the adversarial
autoencoder. The implementation described in [32] is based on fully-
connected networks. We first implemented the same architecture and
obtained similar results on semi-supervised classification. We then
implemented our convolutional AAE architecture on the MNIST dataset
and obtained better performance than the fully-connected AAE network.
The semi-supervised classification results are shown in Table 5.2.
Internet Dataset
Method         WIS (500)   WIS (1000)   WIS (4000)   WIS (All)
CNN Baseline   73.12       79.19        87.53        91.67
AAE            76.92       81.98        88.45        92.68
Increase       3.80        2.79         0.92         1.01
Table 5.3: Semi-supervised classification performance (accuracy) on the
Internet (WIS) dataset. The category ‘All’ denotes that all 15,000 labeled
samples were used to train the AAE model.
We consider the WIS dataset to be complete in terms of volume and
variety, which makes it very well suited for semi-supervised learning
methods. Our semi-supervised learning experiments on WIS validate
the workings of the convolutional AAE method. We obtain a perfor-
mance gain of 3.8% using unlabeled samples along with 3% labeled
samples, as compared to the CNN baseline method. Table 5.3 shows the
performance increase of the AAE method over the CNN baseline for
varying proportions of labeled samples. In this experiment, the number
of unlabeled samples is kept constant (15,000).
Real-world Dataset
Method         WIS (1000)    WIS (4000)    WIS (10000)
               RW (15000)    RW (15000)    RW (15000)
CNN Baseline   52.57         54.50         57.41
AAE            56.54         59.41         61.05
Increase       3.97          4.91          3.64
Table 5.4: Semi-supervised classification performance (accuracy) on the
Real-world (RW) dataset. In this experiment, we use labeled samples
from the WIS dataset along with unlabeled samples from the RW dataset.
The two rows in the table heading show the number of labeled samples
from the WIS dataset and the number of unlabeled samples from the
RW dataset.
The RW dataset is a challenging dataset because it also captures the
natural artifacts that are present in the real world. Although this dataset
does not contain many variants of object instances, we can achieve
competitive performance for semi-supervised classification as compared
to the CNN baseline method. In this semi-supervised learning technique,
we believe that the unlabeled samples are useful for identifying all the
underlying manifolds of the dataset, while the labeled samples reinforce
the manifolds that are useful for the classification task. It is important for
semi-supervised learning that the labeled samples strongly represent the
true class. Thus, we expect the labeled samples to cover as much variety
as possible and to be of high quality.
Since the RW dataset lacks variety and quality, it is difficult to
achieve a consistent classification performance. Therefore, for this
experiment, we use the labeled samples from the WIS dataset and
unlabeled samples from the RW dataset for training. We randomly
select the labeled samples from the WIS dataset because we need cleaner
and more diverse labeled data. The training and test splits of the RW
dataset contain different object instances. The semi-supervised
classification performance on the RW dataset is measured on 2000
samples (200 samples for each class) belonging to unseen object
instances. In this experiment, we achieve a performance gain of about
4%, as shown in Table 5.4. The lack of variation in the dataset explains
the inconsistency in the performance gain.
Hyperbolic Mapping of Latent (style) Code
Mapping   WIS (500-4.0)   WIS (500-2.0)
Linear    76.92           76.81
Tanh      75.73           77.45
Table 5.5: Results of semi-supervised classification (accuracy) with
hyperbolic mapping of the style distribution on the WIS dataset. The
values in (·) indicate the number of labeled samples along with the
variance of the prior distribution. The values are averaged over five
runs.
The default semi-supervised AAE architecture contains a linear map-
ping from the last hidden layer to the latent (style) representation. We
experimented with different non-linear mappings instead of the linear
operation and found that mapping the style distribution using the
hyperbolic tangent (tanh) function results in a further performance gain
of approximately 0.5%. We speculate that using the tanh mapping on the
style part of the latent representation reduces the magnitude of the
encoder parameters used to regularize the style distribution and helps
to capture more robust class-label information in the categorical
distribution.
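In terms of the LatentHeads sketch from Section 5.3.1, this variant only replaces the linear style mapping with a bounded one; how the prior is matched to the (-1, 1) range of tanh is our assumption, not a detail given in this work.

```python
import torch

class TanhLatentHeads(LatentHeads):
    """Same two heads as before, but with a tanh-mapped style code."""
    def forward(self, h):
        y = torch.softmax(self.to_y(h), dim=1)  # class-label code, unchanged
        z = torch.tanh(self.to_z(h))            # style code bounded to (-1, 1)
        return y, z
```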
Scalability
Another advantage of this method is that it works irrespective of the
scale of the input image. Therefore, higher performance can be obtained
if input images of higher resolution are used. In GAN-based methods,
by contrast, it becomes more difficult to learn a stable model for higher-
resolution images. This experiment shows that the performance of
semi-supervised AAE classification improves along with the baseline
accuracy as the resolution of the input image increases. The results at
different scales are shown in Table 5.6.
Scale       CNN (%)   AAE (%)   Increase
32 × 32     76.28     78.27     1.99
64 × 64     79.19     81.98     2.79
96 × 96     80.10     83.04     2.94
128 × 128   81.89     84.47     2.58
Table 5.6: The semi-supervised classification performance (accuracy)
for input images at different scales, increasing from 32 × 32 to 128 × 128.
This experiment is conducted on the WIS dataset, trained with 1000
labeled samples and 15000 unlabeled samples. The network architecture
is adjusted by adding or removing a (convolutional + pooling) layer for
different resolutions of the input image.
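The architecture adjustment mentioned in Table 5.6 amounts to choosing how many (convolution + pooling) stages to stack so that the spatial map reaching the fully-connected layers stays roughly fixed. A sketch of that bookkeeping follows; the target spatial size of 4 matches the 64 × 64 network in Table B.2, but extending it to the other resolutions this way is our reading.

```python
import math

def n_pool_stages(input_size, target_size=4):
    """Number of stride-2 pooling stages that reduce a square input of
    side input_size down to (roughly) target_size."""
    return int(math.log2(input_size // target_size))

# e.g. 32 -> 3 stages, 64 -> 4 stages, 128 -> 5 stages
```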
5.4 Online Learning with AAE
In this section, we demonstrate that this semi-supervised AAE method
can also be used for online learning. In an online learning setup, the
system is expected to learn autonomously over time with minimal or
no supervision. Using a naive approach, we show that this method may
be extended to continuous online learning, although this is beyond
the scope of this work. In our naive setting, the model is trained with
only a few labeled samples, and then it learns independently from
the unlabeled samples in a continuous manner without any further
supervision.
Phase     Labeled   Unlabeled   Accuracy
Phase-1   1000      0           79.19%
Phase-2   1000      3000        79.74%
Phase-3   1000      6000        80.23%
Phase-4   1000      9000        81.11%
Phase-5   1000      12000       81.78%
Phase-6   1000      15000       81.98%
Phase-7   1000      18000       82.12%
Table 5.7: The table shows that the performance of the semi-supervised
AAE improves as the number of unlabeled samples is increased. This
verifies that the system can be used in an online learning setup. The
values in the table are averaged over 3 runs each. The performance of
this system is shown on the WIS dataset.
In this experiment, we randomly select and label only 100 samples
per class from the dataset. Then, we increase the number of unlabeled
samples by 3000 for different phases of learning. These 3000 samples
are randomly picked from the dataset; therefore, the number of samples
per class may be slightly unbalanced. This experiment is performed on
the WIS dataset. We observe that the semi-supervised classification
accuracy consistently improves with a larger number of unlabeled
samples. The results are shown in Table 5.7, and a schematic of the
protocol follows below.
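A schematic of this phased protocol, with the training routine abstracted as a callable (the function names and the fixed random order are ours):

```python
import numpy as np

def run_online_phases(labeled_idx, unlabeled_pool, train_fn,
                      step=3000, n_phases=7):
    """Grow the unlabeled set by `step` randomly drawn samples per phase
    and retrain; the labeled set (100 samples per class) stays fixed.

    train_fn: callable taking (labeled_idx, unlabeled_idx) and returning
              the test accuracy of the retrained model (assumed given).
    """
    rng = np.random.RandomState(0)
    order = rng.permutation(len(unlabeled_pool))
    accuracies = []
    for phase in range(n_phases):
        unlabeled_idx = order[: phase * step]   # 0, 3000, 6000, ...
        accuracies.append(train_fn(labeled_idx, unlabeled_idx))
    return accuracies
```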
We also conducted an online learning experiment where samples are
continuously added in a single run to demonstrate its feasibility. In
this run, the number of unlabeled samples is increased by 1000 every
100 epochs until epoch 1900. We observe an increase in performance
of 2.64% from epoch 100 to epoch 1900, as shown in Figure 5.9. This
run is conducted with unlimited memory, where all the labeled and
unlabeled samples are stored and iterated over during training.
Figure 5.9: Performance of the AAE method for online learning. The
red curve corresponds to the case where all the samples are stored;
it shows a consistent performance gain with a larger number of unlabeled
samples. The blue curve, corresponding to the limited-memory case,
also shows a performance gain, but with a few fluctuations.
In another run, we performed the same experiment with limited
memory. In this run, the system is allowed to store only 1000 labeled
and 1000 unlabeled images at a time. In this experiment, new unla-
beled data is given in batches of 1000 images, whereas the labeled data
is always available during the training process. We observe sudden
drops in performance when new data is exposed to the learning
system. Overall, we observe an increase in classification accuracy as
compared to the CNN baseline, but the model underperforms compared
to the unlimited-memory case. This drop in performance is due to a
well-studied phenomenon in the literature known as ‘catastrophic
forgetting’ or ‘catastrophic interference’ [33]. When we remove the old
unlabeled data and add new unlabeled data, we believe that the learned
manifolds are disturbed, resulting in a sudden drop in performance.
The performance of both cases is shown in Figure 5.9.
5.5 Discussions
While studying different AAE architectures in this work, we observed
numerous interesting features and properties of the convolutional ad-
versarial autoencoder model:
• This network can be trained using an end-to-end training proce-
dure, and the same architecture can be easily scaled for different
datasets with consistent performance.
• The network trains without any overfitting, even with only 3% labeled
samples on the WIS dataset.
• Although the original work on AAE [32] suggests the addition of
Gaussian noise to the input layer, this convolutional version of the
AAE does not require the addition of noise to supplement learning.
• Finding the optimal learning rates is challenging since four differ-
ent objective functions work collectively on different parts of the
network.
• Due to the intricate training procedure, the model needs to be
trained at very low learning rates. Therefore, it requires a long training
time for optimal results.
• The reconstructed image from the decoder is blurry, but a better
reconstruction does not guarantee a better classification perfor-
mance.
Chapter 6
Conclusion and Future Work
6.1 Conclusion
In this thesis, we investigated whether semi-supervised learning
approaches based on deep generative models can be used for real-
world object recognition. We provided an introduction
to the latest deep generative models which use deep neural networks
for inference. Further, we summarized the existing literature on semi-
supervised learning based on different generative models and real-
world applications of semi-supervised learning.
We proposed convolutional adversarial autoencoder architectures
for learning on real-world data. Deep neural network models
often lack interpretability, but the modularity of this approach helped us
understand the learning dynamics to a good extent. We evaluated the
presented approach at increasing levels of model complexity. This
allowed us to investigate all the components of the semi-supervised
AAE classification architecture, leading to the final network architecture.
In this work, we show that our proposed convolutional AAE archi-
tecture can be successfully used for semi-supervised object recognition
on real-world data. We achieve a performance gain of approximately
4% for semi-supervised object recognition as compared to the fully
supervised method on real-world datasets. We obtain competitive
semi-supervised classification performance on the MNIST dataset com-
pared to state-of-the-art semi-supervised learning techniques, and we
also outperform the fully-connected AAE model for the MNIST dataset
proposed in [32]. The method performs consistently well over different
datasets without any major change in the network architecture or train-
ing procedure.
We also compiled two new real-world datasets for object recognition,
which are highly diverse and can also be used in tandem. Using our
dataset compilation approach, the internet-based dataset can be easily
expanded since we do not need to annotate the samples individually.
We also realized that the training dataset must contain a minimum
level of variation in terms of object variety. Finally, through some simple
experiments, we also found that our semi-supervised AAE approach
can be applied to lifelong learning.
6.2 Future Work
Most of the current methods in machine learning work under the closed-
world assumption. They assume that the world comprises only a certain
number of classes, predetermined before learning begins. However, our
world is changing rapidly: new categories appear and old ones disappear.
Therefore, the system should learn to adapt to these changes.
In this work, we notice that the performance gain saturates once the
number of unlabeled samples increases beyond a certain limit.
To further improve the performance of the system, we need to label
a few more samples from the unlabeled dataset. This gap between
the human and the autonomous system can be efficiently bridged with
methods like active learning and novelty detection, although it is quite
challenging to develop such methods for complex data distributions.
Using these techniques, the autonomous system can smartly query
the human to annotate the most valuable samples, and also notify the
user if it detects samples from a new category. This way, we can further improve
lifelong learning performance for visual recognition tasks.
Bibliography
[1] Anders Boesen Lindbo Larsen, Søren Kaae Sønderby, Hugo
Larochelle, and Ole Winther. “Autoencoding beyond pixels using a
learned similarity metric”. In: CoRR (2015).
[2] Christopher M. Bishop. Pattern Recognition and Machine Learning
(Information Science and Statistics). Secaucus, NJ, USA: Springer-
Verlag New York, Inc., 2006.
[3] Avrim Blum and Tom Mitchell. “Combining Labeled and Un-
labeled Data with Co-training”. In: Proceedings of the Eleventh
Annual Conference on Computational Learning Theory. 1998.
[4] Ulf Brefeld, Christoph Büscher, and Tobias Scheffer. “Multi-view
Discriminative Sequential Learning”. In: Machine Learning: ECML
2005: 16th European Conference on Machine Learning, Porto, Portugal,
October 3-7, 2005. Proceedings. 2005.
[5] Xinlei Chen, Abhinav Shrivastava, and Abhinav Gupta. “NEIL:
Extracting Visual Knowledge from Web Data”. In: International
Conference on Computer Vision (ICCV). CMU-RI-TR-. Pittsburgh,
PA, 2013.
[6] Emily L. Denton, Soumith Chintala, Arthur Szlam, and Robert
Fergus. “Deep Generative Image Models using a Laplacian Pyra-
mid of Adversarial Networks”. In: CoRR (2015).
[7] Alexey Dosovitskiy and Thomas Brox. “Generating Images with
Perceptual Similarity Metrics based on Deep Networks”. In: CoRR
(2016).
[8] Vincent Dumoulin, Ishmael Belghazi, Ben Poole, Alex Lamb, Mar-
tin Arjovsky, Olivier Mastropietro, and Aaron Courville. “Adver-
sarially Learned Inference.” In: CoRR (2016).
[9] Zackory M. Erickson, Sonia Chernova, and Charles C. Kemp.
“Semi-Supervised Haptic Material Recognition for Robots using
Generative Adversarial Networks”. In: CoRR (2017).
[10] Rob Fergus, Yair Weiss, and Antonio Torralba. “Semi-Supervised
Learning in Gigantic Image Collections”. In: Advances in Neural
Information Processing Systems 22. 2009.
[11] Akinori Fujino, Naonori Ueda, and Kazumi Saito. “A Hybrid
Generative/Discriminative Approach to Semi-supervised Clas-
sifier Design”. In: Proceedings of the 20th National Conference on
Artificial Intelligence - Volume 2. 2005.
[12] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu,
David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua
Bengio. “Generative Adversarial Nets”. In: Advances in Neural In-
formation Processing Systems 27. Ed. by Z. Ghahramani, M. Welling,
C. Cortes, N. D. Lawrence, and K. Q. Weinberger. 2014.
[13] Ian J. Goodfellow. “NIPS 2016 Tutorial: Generative Adversarial
Networks”. In: CoRR (2017).
[14] Jiuxiang Gu, Zhenhua Wang, Jason Kuen, Lianyang Ma, Amir
Shahroudy, Bing Shuai, Ting Liu, Xingxing Wang, and Gang
Wang. “Recent Advances in Convolutional Neural Networks”.
In: CoRR (2015).
[15] Philip Häusser, Alexander Mordvintsev, and Daniel Cremers.
“Learning by Association - A versatile semi-supervised training
method for neural networks”. In: CoRR (2017).
[16] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. “Deep
Residual Learning for Image Recognition”. In: CoRR (2015).
[17] Gao Huang, Zhuang Liu, Laurens van der Maaten, and Kilian Q
Weinberger. “Densely connected convolutional networks”. In:
Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition. 2017.
[18] Sergey Ioffe and Christian Szegedy. “Batch Normalization: Accel-
erating Deep Network Training by Reducing Internal Covariate
Shift”. In: CoRR (2015).
[19] Diederik P. Kingma, Danilo Jimenez Rezende, Shakir Mohamed,
and Max Welling. “Semi-Supervised Learning with Deep Genera-
tive Models”. In: CoRR (2014).
[20] Diederik P. Kingma and Max Welling. “Auto-Encoding Varia-
tional Bayes.” In: CoRR (2013).
[21] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. “Ima-
geNet Classification with Deep Convolutional Neural Networks”.
In: Proceedings of the 25th International Conference on Neural Infor-
mation Processing Systems. USA, 2012.
[22] S. Kullback and R. A. Leibler. “On Information and Sufficiency”.
In: Ann. Math. Statist. 22.1 (Mar. 1951).
[23] Abhishek Kumar, Prasanna Sattigeri, and P. Thomas Fletcher.
“Improved Semi-supervised Learning with GANs using Manifold
Invariances”. In: CoRR (2017).
[24] Samuli Laine and Timo Aila. “Temporal Ensembling for Semi-
Supervised Learning”. In: CoRR (2016).
[25] Alex Lamb, Vincent Dumoulin, and Aaron C. Courville. “Dis-
criminative Regularization for Generative Models”. In: CoRR
(2016).
[26] Jeongtae Lee, Jaehong Yun, Sungju Hwang, and Eunho Yang.
“Lifelong Learning with Dynamically Expandable Networks”. In:
CoRR (2017).
[27] Fei-Fei Li, Andrej Karpathy, and Justin Johnson. Stanford Lecture
CS231n: Convolutional Neural Networks for Visual Recognition. 2016.
URL: http://cs231n.stanford.edu/.
[28] Shan Luo, Xiaozhou Liu, Kaspar Althoefer, and Hongbin Liu.
“Tactile Object Recognition with Semi-Supervised Learning”. In:
(Aug. 2015).
[29] Lars Maaløe, Casper Kaae Sønderby, Søren Kaae Sønderby, and
Ole Winther. “Auxiliary Deep Generative Models”. In: Proceedings
of the 33rd International Conference on International Conference on
Machine Learning - Volume 48. 2016.
[30] Andrew L. Maas, Awni Y. Hannun, and Andrew Y. Ng. “Rectifier
nonlinearities improve neural network acoustic models”. In: in
ICML Workshop on Deep Learning for Audio, Speech and Language
Processing. 2013.
[31] Alireza Makhzani and Brendan J. Frey. “PixelGAN Autoencoders”.
In: CoRR (2017).
[32] Alireza Makhzani, Jonathon Shlens, Navdeep Jaitly, and Ian J.
Goodfellow. “Adversarial Autoencoders”. In: CoRR (2015).
[33] Michael McCloskey and Neal J. Cohen. “Catastrophic Interfer-
ence in Connectionist Networks: The Sequential Learning Prob-
lem”. In: Psychology of Learning and Motivation - Advances in Re-
search and Theory (1989).
[34] Lars M. Mescheder, Sebastian Nowozin, and Andreas Geiger.
“Adversarial Variational Bayes: Unifying Variational Autoencoders
and Generative Adversarial Networks”. In: CoRR (2017).
[35] Danilo J. Rezende, Shakir Mohamed, and Daan Wierstra. “Stochas-
tic Backpropagation and Approximate Inference in Deep Gener-
ative Models”. In: Proceedings of the 31st International Conference
on Machine Learning (ICML-14). JMLR Workshop and Conference
Proceedings, 2014.
[36] Chuck Rosenberg, Martial Hebert, and Henry Schneiderman.
“Semi-Supervised Self-Training of Object Detection Models”. In:
Proceedings of the Seventh IEEE Workshops on Application of Com-
puter Vision (WACV/MOTION’05) - Volume 1 - Volume 01. 2005.
[37] Fereshteh Sadeghi, Santosh K Divvala, and Ali Farhadi. “VisKE:
Visual Knowledge Extraction and Question Answering by Vi-
sual Verification of Relation Phrases”. In: Proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition. 2015.
[38] Tim Salimans, Ian J. Goodfellow, Wojciech Zaremba, Vicki Che-
ung, Alec Radford, and Xi Chen. “Improved Techniques for Train-
ing GANs”. In: CoRR (2016).
[39] Karen Simonyan and Andrew Zisserman. “Very Deep Convolu-
tional Networks for Large-Scale Image Recognition”. In: CoRR
(2014).
[40] Jost Tobias Springenberg. “Unsupervised and Semi-supervised
Learning with Categorical Generative Adversarial Networks”. In:
CoRR (2015).
[41] Giorgos Tolias, Ronan Sicre, and Hervé Jégou. “Particular ob-
ject retrieval with integral max-pooling of CNN activations”. In:
CoRR (2015).
[42] Junbo Jake Zhao, Michaël Mathieu, Ross Goroshin, and Yann
LeCun. “Stacked What-Where Auto-encoders.” In: CoRR (2015).
[43] Denny Zhou, Jiayuan Huang, and Bernhard Schölkopf. “Learning
from Labeled and Unlabeled Data on a Directed Graph”. In: Pro-
ceedings of the 22nd International Conference on Machine Learning.
ACM Press, 2005.
[44] Xiaojin Zhu. Semi-Supervised Learning Literature Survey. 2006.
Appendix A
Datasets
A.1 Dataset Filtering
Figure A.1: This figure shows qualitative results of the filtering process
discussed in Section 5.1.4. The filtering approach helps in removing
most of the false positives obtained from the image search engine.
A.2 Real-world Dataset: Video Streams
Here, we show two video streams captured using a hand-held camera.
We sample image frames from such video streams to create our RW
dataset.
Figure A.2: Samples from one of the video streams used for collecting
the RW dataset.
Figure A.3: Samples from another video stream used for collecting the
RW dataset.
Appendix B
Architecture Details
B.1 Semi-supervised Convolutional AAE
B.1.1 Adversarial Network: Discriminator
Operation   Hidden Units   BN?   Dropout   Non-lin
Input: 30-D
FC-1        1024           —     0.0       ReLU
FC-2        1024           —     0.0       ReLU
Output      2              —     0.0       Sigmoid
Hyperparameters
Learning rate (epochs)   0–100   100–300   300–500
Opt:Adam-Adv (α)         1e-5    1e-6      1e-7
Opt:Adam-Adv (β)         β1 = 0.1, β2 = 0.999
Epochs = 500, Batch Size = 50
Weight initialization    Isotropic Gaussian (µ = 0, σ = 0.02)
Bias initialization      Constant (0.1)
Table B.1: Adversarial module: Discriminator sub-network of the ad-
versarial network for both categorical and continuous distribution. BN
in the column heading stands for ‘Batch-Normalization’. FC denotes
fully-connected layer. ‘Non-lin’ stands for non-linearity type of the
activation function used for the corresponding layer.
B.1.2 Autoencoder Network
Operation Filters Kernel Strides BN? Dropout Non-lin
Input-64× 64× 3 3 — — — — —
Convolution-1 64 3× 3 1× 1 0.0 LReLU
Convolution-2 64 3× 3 1× 1 0.0 LReLU
Max-Pooling-1 64 2× 2 2× 2 0.0 LReLU
Convolution-3 128 3× 3 1× 1 0.0 LReLU
Convolution-4 128 3× 3 1× 1 0.0 LReLU
Max-Pooling-2 128 2× 2 2× 2 0.0 LReLU
Convolution-5 128 3× 3 1× 1 0.0 LReLU
Convolution-6 256 3× 3 1× 1 0.0 LReLU
Max-Pooling-3 256 2× 2 2× 2 0.0 LReLU
Convolution-7 256 3× 3 1× 1 0.0 LReLU
Max-Pooling-4 256 2× 2 2× 2 0.0 LReLU
FC-1y,z 1024 — — — 0.5 LReLU
FC-2y 512 — — — 0.5 LReLU
Latent-Code (lc-y) 10 — — — 0.0 Softmax
Latent-Code (lc-z) 30 — — — 0.0 Linear
Concat (lc-y + lc-z) 40 — — — — —
FC-3y+z 16384 — — — 0.0 Linear
Reshape-8×8×256 256 — — — 0.0 Linear
Up-convolution-1 128 3× 3 2× 2 0.0 LReLU
Up-convolution-2 64 3× 3 2× 2 0.0 LReLU
Up-convolution-3 3 3× 3 2× 2 0.0 Sigmoid
Output 64× 64× 3
Hyperparameters
Learning rate (epochs)   0–100   100–300   300–500
Opt:Adam-AE (α)          5e-7    5e-8      5e-9
Opt:Adam-AE (β)          β1 = 0.9, β2 = 0.999
Epochs = 500, Batch Size = 50, Leaky ReLU slope = 0.01
Weight initialization: Isotropic Gaussian (µ = 0, σ = 0.02)
Bias initialization: Constant (0.1)
Table B.2: ‘AE’ stands for autoencoder. The superscripts ‘y’ and ‘z’
represent the class-label and style latent variables, respectively. lc-y and
lc-z represent the latent codes for the class-label and style, respectively.
For FC layers, ‘Filters’ corresponds to the number of hidden units.
B.1.3 Classification/Adversarial Network: Generator
Operation Filters Kernel Strides BN? Dropout Non-lin
Input-64×64×3 3 — — — — —
Convolution-1 64 3× 3 1× 1 0.0 LReLU
Convolution-2 64 3× 3 1× 1 0.0 LReLU
Max-Pooling-1 64 2× 2 2× 2 0.0 LReLU
Convolution-3 128 3× 3 1× 1 0.0 LReLU
Convolution-4 128 3× 3 1× 1 0.0 LReLU
Max-Pooling-2 128 2× 2 2× 2 0.0 LReLU
Convolution-5 128 3× 3 1× 1 0.0 LReLU
Convolution-6 256 3× 3 1× 1 0.0 LReLU
Max-Pooling-3 256 2× 2 2× 2 0.0 LReLU
Convolution-7 256 3× 3 1× 1 0.0 LReLU
Max-Pooling-4 256 2× 2 2× 2 0.0 LReLU
FC-1 1024 — — 0.5 LReLU
FC-2 512 — — 0.5 LReLU
Output-y 10 — — — 0.0 Softmax
Hyperparameters
Learning rate (epochs)   0–100   100–300   300–500
Opt:Adam-CNN (α)         1e-5    1e-6      1e-7
Opt:Adam-CNN (β)         β1 = 0.9, β2 = 0.999
Opt:Adam-Gen (α)         1e-4    1e-5      1e-6
Opt:Adam-Gen (β)         β1 = 0.1, β2 = 0.999
Epochs = 500, Batch Size = 50, Leaky ReLU slope = 0.01
Weight initialization: Isotropic Gaussian (µ = 0, σ = 0.02)
Bias initialization: Constant (0.1)
Table B.3: Semi-supervised classification module: This architecture
is also used as the CNN Baseline architecture. ’Gen’ indicates Gen-
erator network, which is active during adversarial training. ’CNN’
indicates ’Convolutional Neural Network’ which is active during the
semi-supervised classification phase.