DEGREE PROJECT IN COMPUTER SCIENCE AND ENGINEERING, SECOND CYCLE, 30 CREDITS
STOCKHOLM, SWEDEN 2017
Semi-supervised Learning for
Real-world Object Recognition
using Adversarial Autoencoders
SUDHANSHU MITTAL
KTH ROYAL INSTITUTE OF TECHNOLOGY
SCHOOL OF COMPUTER SCIENCE AND COMMUNICATION
Master in Computer Science
Date: December 22, 2017
Supervisor: Prof. Thomas Brox (University of Freiburg), Prof.
Wolfram Burgard (University of Freiburg), Prof. Atsuto Maki (KTH)
Examiner: Prof. Danica Kragic
School of Computer Science and Communication
Abstract
For many real-world applications, labeled data can be costly to obtain.
Semi-supervised learning methods make use of abundantly available
unlabeled data along with a few labeled samples. Most of the latest work
on semi-supervised learning for image classification shows performance
on standard machine learning datasets like MNIST, SVHN, etc. In this
work, we propose a convolutional adversarial autoencoder architecture
for real-world data. We demonstrate the application of this architecture
for semi-supervised object recognition. We show that our approach
can learn from limited labeled data and outperform a fully-supervised
CNN baseline by about 4% on real-world datasets. We also
achieve competitive performance on the MNIST dataset compared to
state-of-the-art semi-supervised learning techniques. To spur research
in this direction, we compiled two real-world datasets: the Internet (WIS)
dataset and the Real-world (RW) dataset, each consisting of more than
20K labeled samples of small household objects belonging
to ten classes. We also show a possible application of this method for
online learning in robotics.
Sammanfattning
In most real-world applications, it can be costly to obtain labeled
data. Semi-supervised learning methods typically make extensive use
of unlabeled data supported by a small amount of labeled data. Much
of the recent work on semi-supervised learning methods for image
classification demonstrates performance on standard machine learning
datasets such as MNIST, SVHN, and so on. In this work, we propose a
convolutional adversarial autoencoder architecture for real-world data.
We demonstrate the application of this architecture for semi-supervised
object recognition and show that our approach can learn from a limited
amount of labeled data. We thereby outperform the fully-supervised
CNN baseline method by about 4% on real-world datasets. We also
achieve competitive performance on the MNIST dataset compared to
state-of-the-art semi-supervised learning methods. To stimulate research
in this direction, we compiled two real-world datasets: the Internet (WIS)
and Real-world (RW) datasets, which consist of more than 20,000 labeled
samples each, comprising small household objects belonging to ten
classes. We also show a possible application of this method for online
learning in robotics.
Acknowledgement
I would like to thank my supervisors at the University of Freiburg,
Prof. Thomas Brox and Prof. Wolfram Burgard for giving me this op-
portunity to pursue my master thesis at their lab. I greatly appreciate
their constant support, feedback and guidance throughout the thesis
work. I would like to thank my supervisor at KTH, Prof. Atsuto Maki
for supporting this collaboration in all respects and for his meticulous
feedback on scientific writing. I would like to thank Prof. Danica
Kragic Jensfelt for examining the thesis and organizing the public pre-
sentation at KTH. I owe a great debt of gratitude to Andreas Eitel and
Maxim Tatarchenko for being great mentors, for countless discussions,
motivation and guidance.
I had the privilege of discussing and learning from many excep-
tional researchers at AIS. Special thanks to Gabriel Oliveira, Ayush
Dewan, Tayyab Naseer, Marcel Binz and Noha Radwan for numerous
interesting discussions. Many thanks to Andreas Eitel, Michael Keser
and Philipp Jund for their technical support. I would like to thank An-
dreas Eitel and Prof. Wolfram Burgard for offering me a student job at
AIS which supported me financially throughout my stay in Germany. I
thank Anna Hellberg Gustafsson from KTH for providing me with the Erasmus+
scholarship for my stay in Germany.
I thank Andreas Eitel, Maxim Tatarchenko and Florian Kraemer for
proofreading the thesis report. This work would not have been possible
without the support of everyone at the AIS group. Special thanks to
Marcus Lundin, Gabriela Zarzar Gandler and Sebastian Zarzar Gandler
for helping me write the Swedish version of the abstract. I thank every-
one who helped me to collect the dataset: Tobias Paxian, Andreas Eitel,
V.K.Mittal, Shashi Kabdal, Himanshu Mittal, Shruti Kabdal, Shuchi Kab-
dal, Hannah Rosa Nesswetter, David Czudnochowski, Anand Narayan,
Sophie Ninnemann, Gabriela Zarzar Gandler, Jingwei Zhang, Oier
Mees, Rendani Mbuvha, Ronak Shah, Vishakha Patel, Andy Wachaja
and Federico Boniardi.
Contents
1 Introduction
  1.1 Motivation
    1.1.1 Ethics, Societal Aspects and Sustainability
  1.2 Contributions
  1.3 Overview of the Thesis
2 Background
  2.1 Artificial Neural Networks
    2.1.1 Convolutional Neural Networks
  2.2 Deep Generative Models
    2.2.1 Autoencoders
    2.2.2 Generative Adversarial Network
3 Related Work
  3.1 Deep Generative Models
    3.1.1 VAE-based Methods
    3.1.2 GAN-based Methods
    3.1.3 Hybrid Methods
    3.1.4 Real-world Applications
4 Methodology
  4.1 Adversarial Autoencoders
    4.1.1 Motivation
    4.1.2 Basic AAE Architecture
    4.1.3 Learning Latent Distributions
    4.1.4 Semi-supervised AAE
    4.1.5 Convolutional Semi-supervised AAE Architecture
5 Experiments and Results
  5.1 Datasets
    5.1.1 MNIST Dataset
    5.1.2 Internet Dataset
    5.1.3 Real-world Dataset
    5.1.4 Preprocessing
  5.2 Learning of the Latent Distribution
  5.3 Semi-supervised Classification
    5.3.1 Implementation Details
    5.3.2 Object Recognition Results
  5.4 Online Learning with AAE
  5.5 Discussions
6 Conclusion and Future Work
  6.1 Conclusion
  6.2 Future Work
A Datasets
  A.1 Dataset Filtering
  A.2 Real-world Dataset: Video Streams
B Architecture Details
  B.1 Semi-supervised Convolutional AAE
    B.1.1 Adversarial Network: Discriminator
    B.1.2 Autoencoder Network
    B.1.3 Classification/Adversarial Network: Generator
Chapter 1
Introduction
1.1 Motivation
The idea behind semi-supervised learning for object recognition comes
from the learning ability of human beings. A human child can learn
about objects like animals, toys, etc. from only a few examples. For
example, once a child is shown what a cat looks like, it can thereafter
recognize new kinds of cats in the world. Human beings do not require
thousands of labeled examples to learn the visual appearance of an
object, and they become better at recognition with subsequent exposure
to other variants of that object.
Image classification is one of the important tasks in the field of
computer vision. This task is highly relevant for various applications
like autonomous driving, service robotics, remote sensing and medical
diagnosis. Most of the latest image classification methods like Deep
Residual Networks [16] require a large collection of manually labeled
images to perform well. Collecting labeled samples can be difficult and
very expensive for specific real-world applications.
One way to tackle this challenge is by leveraging information from
unlabeled data in an unsupervised or semi-supervised manner. Al-
though image classification in a completely unsupervised manner is
not yet practical for complex distributions like natural images, recent
methods based on neural networks have shown promising results for
semi-supervised learning. In semi-supervised learning methods, we
make use of unlabeled data for training: typically, a small amount
of labeled data is combined with a large amount of unlabeled data. Semi-supervised
methods make use of unlabeled data to better capture the shape of
underlying data distribution and generalize better to new samples.
In fields like medical science and robotics, it is much easier to obtain
unlabeled data as compared to obtaining labeled data. For example,
in robotics, a mobile robot can autonomously interact with the envi-
ronment and collect unlabeled data in abundance without any human
supervision. Therefore, semi-supervised learning is very well suited to
fields like robotics.
Several methods have been studied in the literature for semi-su-
pervised learning. In this work, we plan to focus on techniques based
on generative models. Building scalable generative models to capture
rich distributions such as audio, images or video is one of the impor-
tant challenges in machine learning. Until recently, deep generative
models, such as Restricted Boltzmann Machines, Deep Belief Networks
and Deep Boltzmann Machines were trained primarily with sampling
algorithms. These sampling-based approaches become more imprecise
as training progresses because the Markov chains they sample from
are unable to mix between modes fast enough.
In recent years, several deep generative models, namely, Variational
Autoencoder (VAE) and Generative Adversarial Network (GAN), have
been developed that can be trained via direct back-propagation and
avoid the difficulties that come with sampling-based training.
Figure 1.1: Examples for each class from the Real-world (RW) dataset:
banana, bottle, bowl, calculator, can, cup, orange, scissors, soccer-ball
and watering-can.
In this work, we explore how well the latest methods based on
deep generative models can be used to recognize objects using semi-
supervised learning methods. We scale one such method, called
Adversarial Autoencoders (AAE), for object recognition on real-world image
datasets. Figure 1.1 gives a glimpse of our real-world object dataset.
AAE is a hybrid approach which uses ideas from Variational Autoen-
coder (VAE) and Generative Adversarial Network (GAN). AAE is a
probabilistic autoencoder that uses an adversarial framework for varia-
tional inference. In a probabilistic autoencoder, the encoder approxi-
mates a posterior distribution, and the decoder is used to stochastically
reconstruct the input data from the latent variables; the resulting model
captures the distribution over images. Latent variables are variables
that are not directly observed but rather inferred from other observed
variables using a mathematical model.
Online learning is a related task which is highly relevant for robotics.
For example in service robotics, every time a new mobile robot is set
up in a new environment, it needs to adapt to the environment and
learn the objects in that environment for an interactive application. The
traditional way is to annotate all the objects manually to recognize and
interact with them. Additionally, the variety of objects also changes
dynamically in any given environment. To reduce these expenses, we
can deploy a robot with a semi-supervised learning approach. The
robot’s learning model can be initially trained with only a few labeled
instances of the objects, and then the robot can adapt its model to increase
the classification performance over time by collecting more unlabeled
data. In this work, we also show how this semi-supervised learning
method may be used for online learning on real-world data. Since
our real-world data is similar to the data captured by the robots, this
method can be readily applied to robotics.
1.1.1 Ethics, Societal Aspects and Sustainability
The contributions of this thesis work are very technical concerning the
usage of deep generative models for semi-supervised object recognition,
although there are many possible applications of object recognition in
general for example autonomous driving, medical diagnosis, service
robotics, etc.
Some applications of semi-supervised classification can be highly
relevant for the society, for example, cancer tumor detection in magnetic
resonance spectroscopic images. Cancer is a fatal disease; with more
than 10 million people diagnosed with it every year worldwide, it is
one of the main challenges that our society
faces. Detection of a cancer tumor in early stages can help in curing
it before it becomes fatal. In oncology, medical diagnosis of cancer
involves differentiating between tumor types and grades. The classifi-
cation requires the availability of accurate diagnosis of past cases such
that they can be used as training samples. Such labeled data is scarce
in most areas of medical science while unlabeled data can be acquired
in abundance while keeping the identity of the person undisclosed.
Therefore, semi-supervised recognition can be a sensible choice for such
applications. In our opinion, semi-supervised classification methods
can have a positive societal impact. They can help us learn a model for
applications where supervision is scarce, and anonymity of the data is
the foremost priority.
One of the major ethical challenges in most computer vision
applications is the privacy of individuals' image data. This work,
however, mainly concerns common household objects and thus poses a
weaker threat to individual privacy. The method discussed in this work
makes it easier to apply object recognition to different fields, as fewer
labeled images are needed to accomplish the same task. It can also serve
ecological sustainability, for example by supporting better planning of
land and forest usage: it can help classify different species and landmarks
in aerial images. Similarly, such methods can be applied in spatial
informatics, which in general lacks labeled data. In conclusion, sustainable
usage of technologies based on semi-supervised learning can help improve
the performance of systems in areas that are crucial for social and
ecological preservation.
1.2 Contributions
We apply one of the latest semi-supervised learning methods for object
recognition on real-world image data. In summary, our key contribu-
tions are:
• Real-world Datasets: Compilation of two different real-world
datasets. The first dataset is collected using a web-based image
search engine and the second dataset is collected using a hand-
held camera. Both datasets are automatically filtered using a recent
image retrieval method [41].
• Convolutional AAE for Semi-supervised Object Recognition:
A Convolutional Adversarial Autoencoder architecture for semi-
supervised end-to-end training. Extension of the fully-connected
AAE architecture, proposed in [32], to a convolutional AAE ar-
chitecture. An open-source Tensorflow implementation of our
method that can be applied to different datasets. Competitive re-
sults on the standard MNIST dataset and our real-world datasets.
• Online learning: Experiments demonstrating the possible appli-
cation of our semi-supervised method for online learning, espe-
cially in robotics.
1.3 Overview of the Thesis
In Chapter 2, we describe the theoretical concepts behind all the build-
ing blocks of an adversarial autoencoder, concepts of artificial neural
networks, convolutional neural networks and deep generative models
including both variational autoencoder and adversarial networks. In
Chapter 3, we review the most relevant related work on deep generative
models for semi-supervised learning and other semi-supervised
methods for real-world applications. In Chapter 4, we first discuss
the motivation and theory behind the basic adversarial autoencoder
model in detail. Later in this chapter, we describe our new proposed
architecture for convolutional adversarial autoencoders.
Finally, in Chapter 5, we present our experiments and evaluation
results of the method discussed in Chapter 4. In this chapter, we present
one of the potential applications of the adversarial autoencoder model
in online learning and in Chapter 6, we summarize our results and
discuss the future work.
Chapter 2
Background
In this chapter, we will briefly discuss various building blocks impor-
tant to this work. Most of these components are based on deep neural
networks.
2.1 Artificial Neural Networks
The feed-forward neural network is a machine learning model that
learns to approximate some function. It can be used for both classifi-
cation and regression problems. For a classification problem, it learns
the function y = f(x; θ) that maps an input x to the class-label y. The
feed-forward neural network learns the value of the parameter θ that
results in the best function approximation.
Figure 2.1: The general feed-forward neural network architecture [2].
The network can be described as a series of functional transformations.
For example, for input variables x_1, ..., x_D, one set of functions (also
called a layer) is defined as:
z_j = h\left( \sum_{i=1}^{D} w_{ji}^{(1)} x_i + w_{j0}^{(1)} \right), \qquad (2.1)
where j = 1, ..., M, the w_{ji}^{(1)} are the weights (corresponding to θ in
the explanation above), the w_{j0}^{(1)} are the bias parameters, and the
superscript (1) indicates the index of the layer in the network. Each linear
combination is transformed using a non-linear activation function h(·), as
shown in Eq. 2.1. Each of the outputs z_j is called a hidden unit. These
hidden units can be further combined to form an overall network function:
y_k(\mathbf{x}, \mathbf{w}) = \sigma\left( \sum_{j=1}^{M} w_{kj}^{(2)} z_j + w_{k0}^{(2)} \right), \qquad (2.2)
where k = 1, ..., K and K is the total number of outputs, the w_{k0}^{(2)}
are again the bias parameters, and σ is the final activation function,
analogous to h(·) in Eq. 2.1. For classification problems, σ is often chosen
to be the softmax activation function. Similarly, we can build a hierarchical
chain of such non-linear functions to form a neural network suited
to our application. Figure 2.1 shows the general artificial neural network
architecture.
The feed-forward network can be trained with the standard back-
propagation algorithm. For multi-class classification, the cross-entropy
loss function is used with a feed-forward neural network, which is
defined as
H(y, y') = -\sum_{i} y'_i \log y_i, \qquad (2.3)
where y_i and y'_i denote the predicted class-label and the true class-label
of class i, respectively. Neural networks with more than three hidden
layers are sometimes called deep neural networks.
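To make Eqs. 2.1 to 2.3 concrete, the following is a minimal NumPy sketch of a one-hidden-layer network with a softmax output and the cross-entropy loss. It assumes a ReLU for the hidden activation h(·); all sizes and names are illustrative and not the architecture used later in this thesis.

import numpy as np

def relu(a):
    return np.maximum(0.0, a)

def softmax(a):
    e = np.exp(a - a.max(axis=1, keepdims=True))  # shift for numerical stability
    return e / e.sum(axis=1, keepdims=True)

def forward(x, W1, b1, W2, b2):
    # Eq. 2.1: hidden units z_j = h(sum_i w_ji x_i + w_j0), with h = ReLU.
    z = relu(x @ W1 + b1)
    # Eq. 2.2: outputs y_k with the softmax as the final activation sigma.
    return softmax(z @ W2 + b2)

def cross_entropy(y_pred, y_true):
    # Eq. 2.3, averaged over the mini-batch; y_true is one-hot encoded.
    return -np.mean(np.sum(y_true * np.log(y_pred + 1e-12), axis=1))

# Illustrative sizes: D = 4 inputs, M = 8 hidden units, K = 3 classes.
rng = np.random.default_rng(0)
W1, b1 = 0.1 * rng.normal(size=(4, 8)), np.zeros(8)
W2, b2 = 0.1 * rng.normal(size=(8, 3)), np.zeros(3)
x = rng.normal(size=(5, 4))                     # a mini-batch of 5 samples
y_true = np.eye(3)[rng.integers(0, 3, size=5)]  # random one-hot labels
print(cross_entropy(forward(x, W1, b1, W2, b2), y_true))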
2.1.1 Convolutional Neural Networks
Convolutional Neural Networks (ConvNets/CNN) are a special type
of neural networks for processing data that has a grid-like topology;
for example, image data can be thought of as a 2-D grid of pixels.
Convolution is a linear operation on such a grid of numbers. These
networks use the convolution operation in place of the general matrix
multiplication of regular neural networks. Convolutional neural networks have
been tremendously successful in applications like image recognition
and classification.
A ConvNet is a sequence of layers similar to the hidden layers in regular
neural networks and every layer of a ConvNet transforms one volume
of activations to another through a differentiable function. There are
several types of layers in a ConvNet architecture: Convolutional layer,
Pooling layer, Dropout layer, Normalization layer and Fully-connected
layer. We now briefly discuss these building-block layers. The
explanation of ConvNets is inspired by a recent survey of CNN methods
[14] and a course [27] from Stanford University:
Convolutional Layer
Convolutional layers apply a convolution operation to the input, pass-
ing the output to the next layer. The primary purpose of convolution in
case of a ConvNet is to extract features from the input image. Convolu-
tion preserves the spatial relationship between pixels by learning image
features using small squares of input data. The convolution operation
allows the network to learn spatial features at hierarchical levels with
fewer parameters as compared to a regular neural network.
Pooling Layer
A pooling layer is used between successive convolutional layers in a
ConvNet architecture. Its function is to progressively reduce the spatial
size of the representation, which reduces the number of parameters in the
network and hence also helps control overfitting. The pooling layer
operates independently on every depth slice of the input and resizes
it spatially, using the MAX operation; this is commonly called the max-
pooling operation. We use a similar configuration of pooling layers
in our semi-supervised learning architecture.
Dropout Layer
This layer “drops out” a random set of activations in a layer by setting
them to zero. It makes sure that the network is not getting too “fitted”
to the training data and thus helps alleviate the overfitting problem.
An important note is that this layer is only used during training, and
not during test time.
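As an illustration, the following is a minimal sketch of the "inverted" dropout variant commonly used in practice; the rescaling by the keep probability is an implementation convention, not part of the description above.

import numpy as np

def dropout(activations, keep_prob=0.5, training=True):
    # During training, zero a random subset of activations and rescale
    # the survivors; at test time the layer is the identity.
    if not training:
        return activations
    mask = np.random.random(activations.shape) < keep_prob
    return activations * mask / keep_prob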
Batch-Normalization Layer
Batch normalization is a technique [18] proposed for accelerating the
learning process of deep neural networks. The authors argue that the
learning process is slowed down by the change in the distribution of
each layer's inputs during training. They call this internal covariate
shift and address it by normalizing the layer inputs. Normalization is
carried out for each training mini-batch.
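The following is a minimal sketch of the training-time batch-normalization transformation; the running statistics used at test time and the per-channel treatment of convolutional feature maps are omitted.

import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    # Normalize each feature over the mini-batch (axis 0), then apply
    # the learned scale (gamma) and shift (beta) parameters.
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    return gamma * (x - mu) / np.sqrt(var + eps) + beta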
Fully-Connected Layer
Each unit in a fully-connected layer has full connections to all activa-
tions in the previous layer similar to conventional neural networks.
Their activations can hence be computed with a matrix multiplication
followed by a bias offset.
2.2 Deep Generative Models
Generative models can be trained with missing data, and semi-super-
vised learning is one of the interesting cases of missing data where
labels for most of the training data are missing. In deep generative
models, the generative model is implicitly or explicitly learned using
deep neural networks. Deep generative models are one of the success-
ful techniques that attempt to solve the problem of unsupervised and
semi-supervised learning. They have widespread applications besides
semi-supervised learning, like density estimation, image denoising and
representation learning.
2.2.1 Autoencoders
Variational Autoencoder (VAE) [20, 35] is a deep generative modeling
technique that uses neural networks to parameterize the posterior dis-
tribution of the latent variables along with a generative network. VAE
is based on an autoencoder architecture. We first briefly describe the
autoencoder model before explaining the VAE algorithm.
Vanilla Autoencoder
Figure 2.2: The general Autoencoder architecture where the yellow
module is the encoder network and red module is the decoder network.
An autoencoder is a feed-forward neural network that tries to recon-
struct its input after passing it through a lower dimensional space.
Autoencoders are unsupervised learning models. An autoencoder
contains two connected neural networks: an encoder network and a
decoder network, shown in Figure 2.2. The encoder compresses the
input data to a lower dimensional space also called the latent space or
hidden representation, and the decoder takes this hidden representa-
tion as input with the goal to reconstruct the input to the encoder. In
other words, the encoder can be defined as a function h = f(x) and the
decoder as a function t = g(h), and the autoencoder effectively learns an
approximate identity function g(f(x)) ≈ x. An autoencoder is trained with the reconstruction loss
between the input and its reconstructed version. It is most simply and
effectively defined using the mean-squared (L2) loss:
\mathcal{L} = \frac{1}{2M} \sum_{i=1}^{M} \lVert \mathbf{x}_i - \hat{\mathbf{x}}_i \rVert_2^2, \qquad (2.4)

where M is the total number of input samples, \mathbf{x}_i is the original data
and \hat{\mathbf{x}}_i is its reconstruction. If the input data is normalized between [0, 1],
the cross-entropy loss can also be used as reconstruction loss.
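As a minimal sketch, the reconstruction loss of Eq. 2.4 can be computed as follows, assuming the inputs are stacked into an M × d matrix:

import numpy as np

def reconstruction_loss(x, x_hat):
    # Eq. 2.4: mean-squared (L2) reconstruction loss over M samples.
    M = x.shape[0]
    return np.sum((x - x_hat) ** 2) / (2 * M)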
Variational Autoencoder
Figure 2.3: The general Variational Autoencoder (VAE) architecture,
where the yellow module is the encoder network and red module is the
decoder network.
A plain autoencoder is capable of reproducing only images similar to
those shown during training; we cannot generate new images using a
simple autoencoder. To build an explicit generative model, the variational
autoencoder imposes a probability distribution over the latent space. In a
VAE, this is done using a Kullback-Leibler (KL) divergence [22] loss term,
which measures the distance between the distribution of the latent
variables (the output of the encoder) and a standard Gaussian distribution
(the prior distribution), along with a mean-squared error term that
encourages accurate reconstruction of the input images. The objective of this latent variable model
is to calculate the posterior p(z|x). According to Bayes' rule:

p(z|x) = \frac{p(x|z)\, p(z)}{p(x)} \qquad (2.5)
Since evaluating the evidence term p(x) is intractable, we need to
approximate this posterior distribution. To overcome this challenge, the
VAE introduces an inference machine q_\phi(z|x) that learns to approximate
the posterior p_\theta(z|x). Hence, the objective of this latent variable model
becomes to minimize the following KL-divergence (D_{KL}) term:

D_{KL}\left[ q_\phi(z|x) \,\|\, p_\theta(z|x) \right], \qquad (2.6)
but computing the posterior p_\theta(z|x) is still intractable due to the
presence of the evidence term p_\theta(x). To make this variational inference tractable,
the VAE combines the KL-divergence term with the Evidence Lower
Bound (ELBO) and tries to maximize a lower bound on the data
log-likelihood instead: \log p_\theta(\mathbf{x}) \geq \mathcal{L}(\theta, \phi, \mathbf{x}). The lower bound is written as:

\mathcal{L}(\theta, \phi, \mathbf{x}) = \underbrace{\mathbb{E}_{q_\phi(z|x)}\left[ \log p_\theta(x|z) \right]}_{\text{reconstruction term}} - \underbrace{D_{KL}\left( q_\phi(z|x) \,\|\, p_\theta(z) \right)}_{\text{regularization term}}. \qquad (2.7)
The overall loss term in the case of VAEs is a sum of the reconstruc-
tion term and the KL divergence regularization term as shown in Eq.
2.7. Figure 2.3 shows the general VAE architecture with appropriate
notations used in Eq. 2.7.
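For the common special case of a diagonal Gaussian posterior q_\phi(z|x) = N(\mu, \sigma^2 I), a standard normal prior and a Gaussian decoder (so that the reconstruction term reduces to a mean-squared error), the KL term of Eq. 2.7 has a well-known closed form, and the negative lower bound can be sketched as follows; the log-variance parameterization is an implementation choice.

import numpy as np

def vae_loss(x, x_hat, mu, log_var):
    # Negative of the lower bound in Eq. 2.7 for q(z|x) = N(mu, exp(log_var))
    # and p(z) = N(0, I).
    M = x.shape[0]
    recon = np.sum((x - x_hat) ** 2) / (2 * M)  # reconstruction term
    # Closed-form KL divergence between the diagonal Gaussian posterior
    # and the standard normal prior (regularization term).
    kl = -0.5 * np.sum(1 + log_var - mu ** 2 - np.exp(log_var)) / M
    return recon + kl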
2.2.2 Generative Adversarial Network
Figure 2.4: The general Generative Adversarial Network (GAN) archi-
tecture where the green module is the discriminator (D) network, and
the yellow module is the generator (G) network.
The basic idea of Generative Adversarial Network (GAN) [12] is to set
up a game between two players. One of them is called the generator
G(z). The generator creates samples that are intended to come from the
same distribution as the training data. The other player is called the
discriminator D(x). The discriminator determines whether the samples
are generated (fake) by the generator or taken from the training data
(real). The discriminator is similar to a supervised model classifying
samples into two classes, which are real or fake. The generator learns
to fool the discriminator by producing fake samples similar to the true
training data, and the discriminator learns to catch the counterfeiting
process of the generator. Figure 2.4 shows the basic GAN architecture
as explained above.
The adversarial game between the generator and discriminator can
be formalized as:
\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{data}(x)}\left[ \log D(x) \right] + \mathbb{E}_{z \sim p(z)}\left[ \log\left(1 - D(G(z))\right) \right]. \qquad (2.8)
In Eq. 2.8, p_{data}(x) denotes the data distribution and p(z) denotes the
latent distribution for sampling the noise vector, which is given as input
to the generator. Generally, the noise distribution is assumed to be a
standard normal distribution. V(D, G) is the overall GAN objective
function. For implementation purposes, Eq. 2.8 can be broken into two
separate objectives for the discriminator and generator networks:
Discriminator Network
\max_D V(D, G) = \underbrace{\mathbb{E}_{x \sim p_{data}(x)}\left[ \log D(x) \right]}_{\text{maximize prob. of } D(\text{real})} + \underbrace{\mathbb{E}_{z \sim p(z)}\left[ \log\left(1 - D(G(z))\right) \right]}_{\text{minimize prob. of } D(\text{fake})}. \qquad (2.9)
In Eq. 2.9, D(x) is trained with the sigmoid cross-entropy loss func-
tion with label 1 for a real sample and 0 for a fake sample.
Generator Network
\min_G V(D, G) = \underbrace{\mathbb{E}_{z \sim p(z)}\left[ \log\left(1 - D(G(z))\right) \right]}_{\text{maximize prob. of } D(\text{fake})}. \qquad (2.10)
For the generator update in Eq. 2.10, the sigmoid cross-entropy loss
function is used with the labels flipped, i.e., with label 1 for a fake
sample. The discriminator is trained on two mini-batches of data: one
coming from the dataset and the other coming from the generator.
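For implementation, both objectives are usually written as minimization problems. The sketch below assumes raw discriminator logits and uses the non-saturating generator loss described above, with label 1 for fake samples:

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def discriminator_loss(logits_real, logits_fake, eps=1e-12):
    # Eq. 2.9 as a minimization: push D(real) toward 1 and D(fake) toward 0.
    return -np.mean(np.log(sigmoid(logits_real) + eps)
                    + np.log(1.0 - sigmoid(logits_fake) + eps))

def generator_loss(logits_fake, eps=1e-12):
    # Non-saturating form of Eq. 2.10: push D(fake) toward 1.
    return -np.mean(np.log(sigmoid(logits_fake) + eps))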
As a result of this learning procedure, the generator learns to create
samples that are drawn from the same distribution as the training data.
A GAN [13] can be considered an implicit generative model. In a
GAN, the model does not explicitly represent the probability distribution
over the data, as VAEs do; instead, the model interacts with the
probability distribution indirectly. We can sample directly from the
distribution represented by the model itself.
GANs are known to be hard to train due to several reasons. First, the
formulation from Equation 2.8 can become unstable if the discriminator
learns too quickly. In this case, the loss of the generator saturates before
reaching an equilibrium. Second, GANs suffer from ‘mode collapse’
[38], where the generator can get stuck in a parameter setting in which
it generates only one mode of the data.
Chapter 3
Related Work
Semi-supervised learning is a well-studied topic in the literature [44].
Some of the well-studied semi-supervised learning methods include
self-training [36], generative models [11], graph-based methods [43],
co-training [3] and multi-view training [4]. In self-training algorithms,
the model is bootstrapped with additional labeled data, which is obtained
from highly confident predictions on the unlabeled data. In graph-based
methods, the model tries to propagate label information by connecting
similar observations from labeled
and unlabeled samples. Graph-based approaches are computationally
expensive and limited to small scale problems [10]. The co-training
method is based on the assumption that features can be split into two
conditionally independent sub-feature sets. In co-training, two sep-
arate classifiers are learned based on the sub-feature sets, and these
classifiers are trained to agree upon the labels from unlabeled data as
well as labeled data. In this work, we only focus on deep generative
model-based techniques for semi-supervised learning.
3.1 Deep Generative Models
Recent deep generative models like [38, 32, 6, 8] have emerged as strong
candidates for unsupervised and semi-supervised learning of compli-
cated distributions like images. The learning model needs to discover
the abstract structures hidden within the unlabeled image data. Two
main classes of successful generative models for semi-supervised learn-
ing are Variational Autoencoder (VAE) and Generative Adversarial
Network (GAN).
3.1.1 VAE-based Methods
VAEs [20, 35] are a class of deep generative models that allow us to
learn latent variable generative models for the input data. Kingma
et al. [19] introduced the first successful deep generative model for
semi-supervised learning, but it needs to be coupled with a pretrained
feature extractor to perform well. Recently, several other competitive
VAE based methods [29, 42] have been proposed in the literature for
semi-supervised learning. Most of the VAE-based semi-supervised
methods are limited to simpler datasets like MNIST, SVHN and NORB,
but in this work, we successfully show semi-supervised classification
on real-world data. While the VAE-based semi-supervised learning
methods require pre-training of the autoencoder, our method can be
trained in an end-to-end manner without any preprocessing step on
the network.
3.1.2 GAN-based Methods
Most of the state-of-the-art semi-supervised methods are based on
GANs [23, 24, 40, 38]. In GANs, two neural networks namely, Generator
and Discriminator play a zero-sum game. The learned discriminator
module can be used for the application of semi-supervised learning
[8, 38]. For semi-supervised learning, a slight modification is made to
the output layer of the discriminator to accommodate the extra fake
class. Therefore, the dimension of the classifier output is increased
from K to K + 1, where K is the number of classes. While typical
GAN-based methods try to match the data distribution directly, our
approach aims to match the latent distribution of the autoencoder to a
prior distribution using a GAN.
3.1.3 Hybrid Methods
Recently, numerous hybrid works [1, 7, 8, 25, 32] of GAN and VAE have
been proposed in the literature for generative modeling. They try to
establish a connection between VAE and GAN to simultaneously learn
a good generative model while learning an efficient inference network.
In other words, the hybrid model can perform well on both the tasks of
image generation and latent space modeling. Adversarial Variational
Bayes [34] uses a more general GAN inference framework within a max-
imum likelihood setting. Adversarially Learned Inference [8] is another
framework which uses the GAN framework to approximate maximum
likelihood. One such hybrid approach is Adversarial Autoencoders
[32] (AAE). The AAE model replaces the KL-divergence term in VAEs
[20] with an adversarial training method. In adversarial training, a
discriminator is jointly trained to distinguish between posterior and
prior samples. This method provides a better approach to matching
the latent representation with the prior distribution as compared to
VAE. In our work, we use the learning principles of AAE and extend
it for higher-resolution real-world image data. Several other newer
techniques besides AAE [32] use implicit distributions to learn posterior
approximations. One of the latest hybrid methods, PixelGAN
Autoencoders [31], proposed as an improvement to AAE, captures the data
distribution jointly with the latent code and an autoregressive decoder.
3.1.4 Real-world Applications
There is relatively little work on real-world applications of semi-su-
pervised learning. Recent work on semi-supervised haptic material
recognition [9] is one of the few successful works in robotics using
deep generative models. They used a GAN based approach for semi-
supervised learning from tactile sensory data. In another exception
[28], the authors use semi-supervised learning for object recognition with an
ensemble manifold regularization method. Both methods mentioned
above use low dimensional sensory input data, thus making it easier to
learn a semi-supervised model.
Developing online learning systems is another new emerging area of
research especially in robotics. There are a few related works on online
learning [26] which try to learn multiple tasks by sharing knowledge
among associated tasks. Other application-oriented online learning
works include [5, 37] where the system mines the web to learn visual
concepts and text-based relationships. [5] uses image search engines
to get weak labels for the images. In our work, we treat all the data
fetched from the Internet as strongly labeled.
Our work is the first of its kind to the best of our knowledge, where
a generative modeling based semi-supervised learning method has
been used for a real-world application like object recognition. In this
work, along with semi-supervised object recognition on real-world
image data, we also show a potential application of this method for con-
tinuous learning. Our method is highly relevant for mobile robotics
where robots can independently interact with the objects and improve
their performance without supervision. Our continuous learning setup
is devoid of any interactive web services, in contrast to previous
approaches, and it focuses only on improving the performance of the
same task through unsupervised interactions.
Chapter 4
Methodology
4.1 Adversarial Autoencoders
Adversarial Autoencoder (AAE) is a method for regularizing an au-
toencoder. It imposes a prior distribution on the latent code of the
autoencoder using GANs. The Adversarial Autoencoder also converts
the autoencoder into a probabilistic generative model that allows
sampling. In this chapter, we first discuss the motivation behind using an
Adversarial Autoencoder for our application. In Section 4.1.2, we dis-
cuss the structure of the basic AAE architecture where we explain how
the adversarial network is used to regularize the latent distribution. In
Section 4.1.3, we discuss another variant of the AAE architecture which
is essential to demonstrate its ability to learn the desired latent distribu-
tion in a semi-supervised setting. In the next section, we describe our
approach to scale the semi-supervised AAE method for classification
with high-resolution real-world images.
4.1.1 Motivation
One of the main drawbacks of VAEs is that the KL divergence
term (the regularization term in Eq. 2.7) does not have a closed-form
solution except for a few basic distributions, because it requires access
to the exact functional form of the prior distribution. Training such a
model can also be difficult because backpropagation through the stochastic
hidden units is not directly possible and requires a reparameterization trick
to make the network differentiable.
The Adversarial Autoencoder model drops the KL divergence term
completely by making use of adversarial learning, and the model can
be learned in an end-to-end manner. Since AAE just needs to be able
to sample from the prior distribution, it allows imposing any arbitrary
prior distribution on the output of the neural network by regularizing it
using a GAN network. The AAE architecture can also be used to disentangle
distinct aspects of the data into separate latent variables. This feature
of AAE is further utilized for semi-supervised learning. In the recent
literature, researchers have proposed a lot of different semi-supervised
methods for image classification on various low-resolution datasets like
MNIST and SVHN. Scaling these methods to high-resolution images
is hard due to various challenges involved with GAN networks like
mode-collapsing, also discussed in Section 2.2.2. Although GANs can
accurately model complex distributions, they are known to be challeng-
ing to train due to instabilities caused by the difficult minimax optimization
problem, whereas AAE does not suffer from such challenges because it
uses the GAN to learn a simple distribution on the latent code.
Figure 4.1: Architecture of a basic Adversarial Autoencoder. The ‘+’
and ‘−’ signs show the positive and negative inputs to the adversarial
discriminator network, respectively.
4.1.2 Basic AAE Architecture
Let x be the input and z be the latent code of the autoencoder. Let
p(z) be the prior distribution of the latent code, q(z|x) be the encoding
distribution and p(x|z) be the decoding distribution. Let pdata(x) be the
data distribution and p(x) be the model distribution to be learned. The
aggregated posterior distribution q(z) on the hidden code is defined
by the encoding network q(z|x) as follows:
q(z) = \int_{x} q(z|x)\, p_{data}(x)\, dx \qquad (4.1)
AAE is a modified autoencoder, where the latent code is regularized
by matching the above (Eq. 4.1) aggregated posterior distribution to the
prior distribution p(z) using the adversarial network.
Figure 4.1 schematically shows how the AAE works with a Gaussian
prior on the latent code. The top structure in the network is a standard
autoencoder that reconstructs the image x from the latent code z. The
autoencoder is trained using the standard reconstruction loss function:

\mathcal{L} = \frac{1}{2M} \sum_{i=1}^{M} \lVert \mathbf{x}_i - \hat{\mathbf{x}}_i \rVert_2^2, \qquad (4.2)

where M is the total number of input samples, \mathbf{x}_i is the original data
and \hat{\mathbf{x}}_i is its reconstruction. The bottom structure is another network that
discriminates whether a sample is taken from the latent code of the
autoencoder or if it is sampled from the distribution p(z) specified by
the user. The discriminator receives z from the encoder q(z|x) and z′
sampled from the true prior distribution. The discriminator is trained
to distinguish between generated z and sampled z′ using the following
loss function:
\mathcal{L}_{D_z} = -\frac{1}{m} \sum_{k=1}^{m} \left[ \log D(z'_k) + \log\left(1 - D(z_k)\right) \right], \qquad (4.3)
where m is the minibatch size, D is the discriminator network. In Eq.
4.3, D(·) is trained with the sigmoid cross-entropy loss function with
label 1 for the true sample and 0 for the generated sample.
Then, the generator (encoder network) is updated using the follow-
ing loss function:
\mathcal{L}_G = -\frac{1}{m} \sum_{k=1}^{m} \log D(z_k) \qquad (4.4)
The two loss functions above counteract
each other. As a result of careful training, the discriminator learns to
recognize fake (generated) samples and the generator learns to fool the
discriminator. Thus, we learn a good model distribution p(x) and ag-
gregated posterior distribution q(z). Optimization of this AAE network
involves three objective functions as described above. We do not train
with all the objective functions simultaneously, but rather alternate
between them for each mini-batch training process.
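The following TensorFlow sketch illustrates this alternating scheme with small fully-connected stand-in networks; the layer sizes, optimizer settings and latent dimension are illustrative assumptions, not the settings of this thesis (the convolutional networks actually used are described in Section 4.1.5).

import tensorflow as tf

latent_dim = 8  # illustrative latent size

# Fully-connected stand-ins for the encoder, decoder and discriminator.
encoder = tf.keras.Sequential([tf.keras.layers.Dense(256, activation="relu"),
                               tf.keras.layers.Dense(latent_dim)])
decoder = tf.keras.Sequential([tf.keras.layers.Dense(256, activation="relu"),
                               tf.keras.layers.Dense(784, activation="sigmoid")])
disc = tf.keras.Sequential([tf.keras.layers.Dense(256, activation="relu"),
                            tf.keras.layers.Dense(1)])  # outputs a logit

opt_ae = tf.keras.optimizers.Adam(1e-4)
opt_d = tf.keras.optimizers.Adam(1e-4)
opt_g = tf.keras.optimizers.Adam(1e-4)
bce = tf.keras.losses.BinaryCrossentropy(from_logits=True)

def train_step(x):
    n = tf.shape(x)[0]
    # Phase 1, reconstruction (Eq. 4.2, here in rescaled mean form):
    # update encoder and decoder.
    with tf.GradientTape() as tape:
        loss_rec = 0.5 * tf.reduce_mean(tf.square(x - decoder(encoder(x))))
    v = encoder.trainable_variables + decoder.trainable_variables
    opt_ae.apply_gradients(zip(tape.gradient(loss_rec, v), v))

    # Phase 2, regularization, discriminator step (Eq. 4.3): a sample z'
    # from the prior is "real" (label 1), the encoder output is "fake" (0).
    z_prior = tf.random.normal((n, latent_dim))
    with tf.GradientTape() as tape:
        loss_d = (bce(tf.ones((n, 1)), disc(z_prior)) +
                  bce(tf.zeros((n, 1)), disc(encoder(x))))
    v = disc.trainable_variables
    opt_d.apply_gradients(zip(tape.gradient(loss_d, v), v))

    # Phase 3, regularization, generator step (Eq. 4.4): update the
    # encoder so that its latent codes fool the discriminator.
    with tf.GradientTape() as tape:
        loss_g = bce(tf.ones((n, 1)), disc(encoder(x)))
    v = encoder.trainable_variables
    opt_g.apply_gradients(zip(tape.gradient(loss_g, v), v))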
4.1.3 Learning Latent Distributions
Figure 4.2: Adversarial Autoencoder architecture with extra regulariza-
tion using label information.
Although with the basic AAE architecture it is possible to impose any
distribution on the latent code, we require an extra regularization to get
a class separation in the latent distribution. This extra regularization
requires label information. With the AAE approach this regularization
is realizable even when only a few labeled samples are available. Fig-
ure 4.2 shows how the label information can be leveraged to strictly
regularize the distribution of the latent code. In this network, the one-
hot vector is given as input to the discriminator network to select the
mode of the corresponding class from the prior distribution. For la-
beled samples, the true associated one-hot vector is given as input to
the discriminator network; for unlabeled samples, an extra one-hot
vector in which the 11th category is switched on is given as input
instead. Thus, the discriminator can infer whether the
input comes from the labeled sample or from the unlabeled sample. As
a result of this model, the network learns to map the unlabeled samples
to the mode corresponding to the true class in the latent distribution.
This experiment helps us evaluate the complexity of the dataset in a
semi-supervised setting. Since this is a semi-supervised approach to
learn a latent distribution, we consider this model as a preliminary step
for the success of the semi-supervised classification.
We performed various experiments with this model on several
datasets to evaluate the proportion of labeled samples that is suffi-
cient for semi-supervised learning. The exact experimental setup and
results are discussed in Section 5.2. In the results section, we also visu-
alize the latent distribution in 2-dimensions. This model is also trained
with three objective functions as described in Section 4.1.2 using the
same alternating procedure.
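As a minimal sketch, the label conditioning described above can be realized by appending a one-hot vector to the latent code before it enters the discriminator; the helper name below is hypothetical, and concatenation is one possible way of providing the label input:

import numpy as np

def discriminator_input(z, label=None, num_classes=10):
    # Append a one-hot vector of length num_classes + 1 to the latent code;
    # the extra (11th) entry is switched on for unlabeled samples.
    one_hot = np.zeros(num_classes + 1)
    one_hot[num_classes if label is None else label] = 1.0
    return np.concatenate([z, one_hot])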
4.1.4 Semi-supervised AAE
In Section 4.1.3, we studied how the latent distribution can be imposed
in a semi-supervised manner. In this section, we show how these
architectures can be combined to formulate a semi-supervised AAE
architecture for classification.
We assume that the information in each image can be decomposed into,
and reconstructed from, two sets of independent components, namely, style
and class-label information. A continuous latent distribution can cap-
ture the style information, and a categorical latent distribution can cap-
ture class-label details. The network is designed such that the encoding
network can predict the class variable and continuous style variable
using labeled as well as unlabeled data, and the decoder network can
reconstruct the input image using both latent variables. Two separate
adversarial networks regularize the style and class-label hidden repre-
sentations, thus ensuring that the two codes carry independent information,
as shown in Figure 4.3.
In this architecture, let p(z) and p(y) be the continuous prior dis-
tribution on the style part of the latent code and the categorical prior
distribution on the class-label part of the latent code, respectively. This
semi-supervised AAE architecture comprises three different types of
modules that work in conjunction with each other, namely, the autoencoder,
adversarial and classification modules:
Autoencoder Module
In AAE, q(z|x) is a probabilistic encoder approximating the true poste-
rior distribution p(z|x), and p(x|z) is a generative decoder. We use the
standard reconstruction loss as written in Eq. 4.2. The autoencoder module
is an important part of the semi-supervised architecture for learning
the data representation from unlabeled samples.
Figure 4.3: Semi-supervised Adversarial Autoencoder architecture for
object recognition. The blue-highlighted box indicates the autoencoder
module, the red box corresponds to the adversarial module comprising
two adversarial networks with a common generator, and the green-
highlighted box corresponds to the classification module.
Adversarial Module
The adversarial module comprises two adversarial networks with a
common generator network, one capturing style and the other class-label
information. This module is trained with unlabeled data. As discussed
in Section 2.2.2, the generator network of a GAN tries to mimic the
examples from the training data. Typically in GANs, the generator does
this by transforming a random noise sample, which is given as input,
into a synthetic or fake sample. In most image-based GAN applications,
the generator network is an expansion-type network, and the generated
synthetic samples are images. In contrast to typical image-based GANs,
the generator network in AAE is a compression-type network that
produces the synthetic sample by transforming the image input into
a low-dimensional latent vector. In AAE, the encoder network of the
autoencoder acts as the generator network during adversarial training,
and a separate discriminator network distinguishes between real and
fake latent codes.
The aggregated posterior distribution of the latent code, q(z), is
matched to an arbitrary prior, p(z), using an adversarial network as
illustrated in Figure 4.1. We use the term ‘posterior’ synonymously with
‘aggregated posterior’, defined in Eq. 4.1, for the rest of the report for
convenience. In the semi-supervised AAE architecture, two separate
adversarial networks regularize style and class-label information of the
latent code. The first adversarial network ensures that the class-label
part (y) of the latent code does not carry any style information and the
aggregated posterior distribution of y matches the Categorical distri-
bution Cat(y). We assume that once the label information is removed,
the remaining information can be captured using a continuous distribu-
tion. Therefore, the second adversarial network imposes a continuous
distribution p(z) on the style part, (z), of the latent code.
For learning the adversarial discriminator network for the continuous
distribution, the loss function of Eq. 4.3 is used. Similarly, for learning
the second adversarial discriminator for the categorical distribution, we
use the following loss function:
\mathcal{L}_{D_y} = -\frac{1}{m} \sum_{k=1}^{m} \left[ \log D(y'_k) + \log\left(1 - D(y_k)\right) \right] \qquad (4.5)
The generator loss can be combined for both adversarial networks as:

\mathcal{L}_G = -\frac{1}{m} \left( \sum_{k=1}^{m} \log D(z_k) + \sum_{k=1}^{m} \log D(y_k) \right), \qquad (4.6)
where m is the size of the mini-batch training data and D(·) is the
discriminator network.
Classification Module
The classification module is the same as a classical convolutional neural
network as discussed in Section 2.1.1. With labeled data, the encoder
part of the autoencoder is used as the classification network with the
categorical output. We use the softmax cross-entropy loss for training
the classification network.
All four sub-networks of the AAE model, namely the two adversarial
networks, the autoencoder network and the classification network, are
trained synchronously in an end-to-end manner in three phases: the
reconstruction phase, the regularization phase and the semi-supervised
classification phase. We further discuss the training procedure in detail
in Section 5.3.1.
4.1.5 Convolutional Semi-supervised AAE Architecture
Figure 4.4: Semi-supervised Adversarial Autoencoder architecture for
object recognition. The ‘+’ and ‘−’ signs show the positive and negative
inputs to the adversarial discriminator network, respectively. The corre-
sponding color coding for different layers is shown in the right column.
The size of the filter and number of filters are mentioned on the top and
bottom of the layer, respectively. Usage of special activation functions
(if used) is mentioned in between two layers. The connection between a
convolution and fully-connected layer includes reshaping of the input
layer (not shown in the diagram).
The convolutional AAE architecture is obtained after an extensive
search over several hyperparameters: type of architecture, number
of layers, type of convolutional and upconvolutional layers, normal-
ization techniques, activation functions and loss functions. Figure 4.4
shows the resulting architecture. Since the AAE is a modular
approach, we divide and explain
the architecture for each module separately. Appendix Tables B.1, B.2
and B.3 show the convolutional semi-supervised AAE architecture with
details for different sub-networks separately. The training procedure
is discussed in detail in Section 5.3.1, where we discuss how these dif-
ferent sub-networks are optimized in an alternating manner to perform
semi-supervised object recognition.
Autoencoder Network
The autoencoder network, shown in Table B.2, is a standard convolutional
autoencoder. The encoder network of the autoencoder is the
same as the classification network shown in Table B.3, with an additional
output for the style-latent code from FC-1, as shown in Figure 4.3. The
decoder network gets a concatenated input from both latent parts.
In the decoder, we use the transposed convolution operation, upscaling
the feature-map size by a factor of two at each step. The output
layer has the sigmoid activation function to match the scale of the input
sample. The dimensionality of the label representation is 10, and for the
style representation, we use 30 dimensions for the real-world datasets
and 10 dimensions for the MNIST dataset.
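The following Keras-style sketch illustrates such a decoder for the real-world datasets; the filter counts and kernel sizes are illustrative assumptions (the exact values are listed in Appendix Table B.2), but the structure follows the description above: a concatenated 10 + 30 dimensional latent input, transposed convolutions doubling the spatial size from 4 × 4 up to 64 × 64, and a sigmoid output.

import tensorflow as tf
from tensorflow.keras import layers

decoder = tf.keras.Sequential([
    tf.keras.Input(shape=(40,)),  # concatenated 10-D label + 30-D style code
    layers.Dense(4 * 4 * 256),
    layers.Reshape((4, 4, 256)),
    # Each transposed convolution doubles the size: 4 -> 8 -> 16 -> 32 -> 64.
    layers.Conv2DTranspose(128, 5, strides=2, padding="same", activation="relu"),
    layers.Conv2DTranspose(64, 5, strides=2, padding="same", activation="relu"),
    layers.Conv2DTranspose(32, 5, strides=2, padding="same", activation="relu"),
    # Sigmoid output matches the [0, 1] scale of the input images.
    layers.Conv2DTranspose(3, 5, strides=2, padding="same", activation="sigmoid"),
])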
Classification Network/Adversarial Network: Generator
We implemented various competitive convolutional neural network
architectures for the encoding network, such as DenseNet [17], the VGG
network [39] and AlexNet [21]. We found the VGG-type network to be the most
effective for our application. Table B.3 shows the architecture of the
encoding network which is also used for semi-supervised classification
on the test set. This network is also used as a generator network during
adversarial training. The architecture is designed for an input image
of 64 × 64 × 3. It contains exactly seven convolutional layers followed
by two fully-connected layers for classification, with the softmax
activation function as the final operator. Each convolutional layer is
followed by batch normalization over the mini-batch and a Leaky-ReLU
[30] non-linear activation function. The max-pooling operations reduce
the spatial size from 64 × 64 to 4 × 4. The
network is trained with different learning rates at different phases of
training, which is discussed in detail in Section 5.3.1.
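The following Keras-style sketch illustrates a VGG-type encoder of this shape; the number of blocks and the filter widths are illustrative assumptions, not the exact configuration of Appendix Table B.3.

import tensorflow as tf
from tensorflow.keras import layers

def conv_block(filters):
    # Convolution followed by batch normalization and Leaky-ReLU.
    return [layers.Conv2D(filters, 3, padding="same"),
            layers.BatchNormalization(),
            layers.LeakyReLU()]

encoder = tf.keras.Sequential(
    [tf.keras.Input(shape=(64, 64, 3))]
    + conv_block(64) + [layers.MaxPool2D()]    # 64 x 64 -> 32 x 32
    + conv_block(128) + [layers.MaxPool2D()]   # 32 x 32 -> 16 x 16
    + conv_block(256) + [layers.MaxPool2D()]   # 16 x 16 -> 8 x 8
    + conv_block(256) + [layers.MaxPool2D()]   # 8 x 8 -> 4 x 4
    + [layers.Flatten(),
       layers.Dense(1024), layers.LeakyReLU(),
       layers.Dense(10, activation="softmax")])  # class-label output y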
Adversarial Network: Discriminator
The discriminator sub-network of the adversarial network, shown
in Table B.1, is a regular neural network with two fully connected
layers of 1024 hidden units each. This discriminator distinguishes the
low-dimensional latent code from a vector sampled from the low-
dimensional (≤ 30) prior distribution. The last layer has a sigmoid
activation function such that we can use the sigmoid cross-entropy loss
function.
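A minimal Keras-style sketch of this discriminator follows; the hidden-layer activation is an assumption (Leaky-ReLU, matching the rest of the architecture), since only the layer sizes and the sigmoid output are specified above.

import tensorflow as tf
from tensorflow.keras import layers

discriminator = tf.keras.Sequential([
    layers.Dense(1024), layers.LeakyReLU(),
    layers.Dense(1024), layers.LeakyReLU(),
    # Sigmoid output: prior sample (real) vs. latent code (fake).
    layers.Dense(1, activation="sigmoid"),
])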
Chapter 5
Experiments and Results
In this chapter, we first describe our datasets and their compilation
procedure in Section 5.1. Subsequently, we discuss the results of all the
experiments performed using different AAE architectures for learning
latent distributions and semi-supervised classification in Section 5.2
and Section 5.3 respectively.
5.1 Datasets
We tested our convolutional adversarial autoencoder model on three
different datasets, namely the MNIST, Internet (WIS) and Real-world
(RW) datasets, for semi-supervised learning.
5.1.1 MNIST Dataset
MNIST is a database of handwritten digits. It contains 50,000 training
samples and 10,000 testing samples. It includes grayscale images of
resolution 28 × 28, where each digit is centered. It is a well-known
dataset for experimenting with and analyzing new learning techniques
in the machine learning community.
5.1.2 Internet Dataset
As we discussed in Chapter 1, collecting and annotating a large dataset
is an expensive and time-consuming procedure. Therefore, we use a
web-based image search engine to collect this real-world image dataset.
We call this internet dataset the ‘WIS’ dataset since it is fetched using the
web-based image search engine. Using the Internet, it is possible to
collect a very diverse and abundant amount of data from different
sources, scales, domains, etc. Figure 5.1 gives a glimpse of the WIS
dataset.
Figure 5.1: Visualization of a few samples from the Internet Dataset. This
dataset is fetched using a reverse image search engine.
We collect images belonging to 10 categories of objects. We only
consider small household objects for this dataset, namely: banana,
bottle, bowl, calculator, can, cup/mug, orange, scissors, soccer ball,
watering-can. To reduce the human effort for collecting and annotating
the images, we fetch images from the web using reverse image search
engine. Using this technique, we can obtain around 100 good quality
images for each image-query. For collecting this dataset, we select 40
exemplar images for each object category and query the web using
these selected images. This downloading step is followed by a filtration
process (discussed in Section 5.1.4) to remove unnecessary images from
the raw dataset. After filtering, this dataset contains approximately
24K labeled images. We split the dataset 80:20 for training and testing
respectively. Figure 5.2a shows more statistics about the WIS dataset.
Figure 5.2: (a) shows the number of images per class in the WIS dataset.
(b) shows the number of videos collected per class for the RW dataset.
5.1.3 Real-world Dataset
The Real-world (RW) dataset is another object recognition dataset com-
prising everyday objects. The RW object dataset consists of more
than 200 daily household objects. The objects are categorized into ten
classes (same as WIS dataset): banana, bottle, bowl, calculator, can,
cup/mug, orange-fruit, scissors, soccer-ball, watering-can. This dataset
is collected from video streams using hand-held cellphone cameras and
then later sampled to fetch image frames. The dataset contains 836
video streams of these 200 object instances. The data was captured by
different persons with some minimal instructions and no prior knowl-
edge about our work. Therefore, the resolution and quality of the video
data vary across different video streams, but they are all captured
at a common frequency of 30 Hz. In each video, the camera operator
moves around the object in a slow and random motion to capture the
object from different angles, at various scales, and with different back-
grounds. The data is captured in a very natural setting and is highly
diverse with respect to illumination, clutter around the object, and the
distance of the object from the camera. Figure 5.3 gives a glimpse of the RW
dataset.
Figure 5.3: Visualization of a few samples from the Real-world Dataset,
which is captured using hand-held cameras.
This dataset contains at least 36 videos for each object category. The
average duration of the videos is approximately 8 seconds. We sample
frames from each video at a rate of 6 Hz for our application. Before
using the data for learning the task, each sampled frame passes through
a filter that removes noisy frames not containing any of the 10
given object categories. We obtain approximately 22K image frames
equally distributed over all the classes. Figure 5.2b shows the dataset
statistics: number of videos captured per class.
5.1.4 Preprocessing
Both the WIS and RW datasets contain many noisy images, in which none
of the 10 categories is present in the image frame. We propose an
automatic filtering approach to remove these noisy images from the
dataset. This filtering process is discussed in detail below.
Filtering the Dataset
Web images and their labels are easier to obtain, but directly
training on them can result in underperformance due to the presence of
noisy web-query results and noisy labels. This can adversely affect the
precision of the manifolds learned from the unlabeled data and also the
semi-supervised classification performance. Therefore, we need to filter
the web query results before training any model using them. We use an
image retrieval technique [41], which builds a precise image descriptor
for object retrieval. This method encodes several image regions into a
single feature vector without feeding multiple inputs to the network.
We use this feature vector to filter the query results by matching the
cosine distance between the corresponding feature vectors. We use the
cosine distance d, averaged over all the exemplar images
from that class:
\[
d = \frac{1}{N} \sum_{i=1}^{N} \cos(x_i, x_j), \tag{5.1}
\]
where x_i is the feature vector of the i-th exemplar query image and x_j is
the feature vector of the j-th web-query result image.
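As a concrete illustration, the following sketch implements Equation 5.1 on precomputed descriptors. It assumes the [41]-style feature vectors are already available as NumPy arrays; the rejection threshold tau is a hypothetical parameter, not a value reported in this work.

```python
import numpy as np

def filter_query_results(exemplar_feats, result_feats, tau):
    """Keep web-query results whose averaged cosine similarity to the
    N exemplar descriptors of a class exceeds a threshold tau.

    exemplar_feats: (N, D) descriptors of the exemplar images.
    result_feats:   (M, D) descriptors of the web-query results.
    Returns the indices of the accepted results.
    """
    # L2-normalize so that dot products equal cosine similarities.
    ex = exemplar_feats / np.linalg.norm(exemplar_feats, axis=1, keepdims=True)
    res = result_feats / np.linalg.norm(result_feats, axis=1, keepdims=True)
    # d_j = (1/N) * sum_i cos(x_i, x_j), as in Equation 5.1.
    d = (res @ ex.T).mean(axis=1)   # average over the N exemplars
    return np.where(d > tau)[0]
```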
This filtering technique is used for cleaning both WIS and RW
datasets. As a result of this cleaning procedure, approximately 10%
of the query results are rejected. Figure A.1 shows some qualitative
results of this filtering process on the WIS data distribution. Since these
datasets are not annotated for each sample individually, and our pro-
posed method does not guarantee 100% correct filtering, they contain a
small amount of label impurity.
5.2 Learning of the Latent Distribution
Figure 5.4: Visualization of the latent space learned using the unsupervised
basic AAE architecture. (a) A 2-D Gaussian distribution is used as
the prior distribution on the MNIST dataset. (b) A mixture of ten 2-D
Gaussians is used as the prior distribution on the MNIST dataset.
(c) A 2-D Gaussian distribution is used as the prior distribution
on the WIS dataset. (d) A mixture of ten 2-D Gaussians is used
as the prior distribution on the WIS dataset.
In this section, we test the ability of the AAEs to learn different la-
tent distributions using unsupervised, supervised and semi-supervised
learning. We first show results obtained with the basic AAE architec-
ture in an unsupervised setting for the MNIST dataset. In the next
experiments, we demonstrate the ability of the convolutional AAE
method to learn an arbitrary latent distribution using supervised and
semi-supervised learning for real images.
Experiment-1: Unsupervised Learning
It is possible to learn a visually comprehensible latent space for the
MNIST dataset without using any label information with the basic AAE
architecture (Figure 4.1). We can see that the samples of different
classes are mapped to individual clusters in the latent space, as shown in
Figures 5.4a and 5.4b. However, this unsupervised AAE model fails to
produce such a 2-D visualization for real-image datasets because of their
complex data distributions, as shown in Figures 5.4c and 5.4d. In these
figures (5.4c and 5.4d), all the class mappings overlap with each other.
This shows that the method can learn simple distributions like the MNIST
data distribution in an unsupervised manner, but it fails in the case of
complex distributions like the real-world WIS dataset. In the next
experiments, we try to learn the latent distributions using the semi-
supervised learning model for such complex data distributions.
Figure 5.5: Visualization of the latent space on the WIS dataset. In this
experiment, we leverage the label information to better regularize the
latent space. This model is trained using all labeled samples. (a) The
mixture of ten 2-D Gaussians imposed as the prior distribution on the
latent code. (b) The posterior distribution of the latent space on the
training data. (c) The posterior distribution of the latent space on the
testing data.
Experiment-2: Supervised Learning
This experiment uses the AAE architecture shown in Figure 4.2, where
label information can be utilized in a supervised learning setup to regu-
larize the latent distribution. We assume a mixture of ten 2-D Gaussians
as the prior distribution, where each Gaussian represents the
distribution of a separate class. Figure 5.5 shows the prior and posterior
distributions of the latent code when 100% of the label information is used.
These results indicate that it is possible to shape the latent distribution
using the proposed convolutional encoder with an adversarial training
procedure.
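For reference, a sampler for such a prior could look as follows. Following the original AAE work [32], we assume the ten components are placed uniformly on a circle; the radius and standard deviation used here are illustrative values, not the settings of this thesis.

```python
import numpy as np

def sample_mixture_prior(labels, radius=4.0, std=0.5, n_classes=10):
    """Sample 2-D latent codes from a mixture of ten Gaussians,
    one component per class, arranged uniformly on a circle.

    labels: (batch,) integer class labels selecting the component.
    """
    angles = 2 * np.pi * labels / n_classes
    # Component means lie on a circle of the given radius.
    means = radius * np.stack([np.cos(angles), np.sin(angles)], axis=1)
    # Isotropic Gaussian noise around the selected component mean.
    return means + std * np.random.randn(len(labels), 2)
```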
Figure 5.6: Experiment-2. Visualization of the posterior distribution of
the latent space on the WIS dataset. In this experiment, we leverage the
label information for better regularization, as shown in Figure 4.2. Figures
(a) and (d) show the learned latent distribution on the training and test set,
respectively, when the model is trained with 50% labeled samples and
50% unlabeled samples. (b) and (e) show the latent distribution learned
on the training and testing set, respectively, when the model is
trained with 20% labeled samples and 80% unlabeled samples. Simi-
larly, (c) and (f) show the latent space distribution when trained with only
10% labeled samples and 90% unlabeled samples.
Experiment-3: Semi-supervised Learning
In this experiment, we test the same architecture in a semi-supervised
setting where only a proportion of labels is available along with unla-
beled samples. The only difference is that there are 11 categories in the
one-hot input vector, where the 11th category is switched on when the
label of the input is unknown (see the encoding sketch at the end of this
section). Figure 5.6 shows the posterior distribu-
tion of the 2-D latent representation code for different ratios of labeled
to unlabeled data. We perform this experiment for three different ratios,
where 50%, 20%, and 10% of the samples from the whole dataset are labeled.
The qualitative result (Figure 5.6f) shows that it is possible to achieve
a visually discernible distribution even with only 10% labeled data.
The architecture is able to correctly map the unlabeled samples to the
right mode of the class distribution using limited label information.
This experiment verifies the feasibility of this AAE method for semi-
supervised learning on real-world data. As expected, the quality of the
posterior latent distribution degrades with lower ratios of labeled to
unlabeled samples, as shown in Figures 5.6d to 5.6f. We consider
this experiment to be a preliminary step before using this method for
semi-supervised classification on a new dataset.
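For clarity, the 11-way one-hot encoding mentioned above can be sketched as follows; the helper name and the convention of passing None for unlabeled samples are ours.

```python
import numpy as np

UNKNOWN = 10  # the 11th category, switched on for unlabeled samples

def encode_label(label, n_classes=10):
    """One-hot encode a class label, or the 'unknown' flag if label is None."""
    one_hot = np.zeros(n_classes + 1, dtype=np.float32)
    one_hot[UNKNOWN if label is None else label] = 1.0
    return one_hot
```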
5.3 Semi-supervised Classification
In this section, we demonstrate the performance of the semi-supervised
adversarial autoencoder architecture, described in Section 4.1.4, for
object recognition. First, we discuss the implementation details of
the semi-supervised classification experiment, mainly concerning the
non-trivial training procedure. Then, we analyze the dynamics of the
learning procedure with the help of all the loss curves. Finally, we
show the object recognition results on all the datasets. We compare
the results of our semi-supervised method with a competitive CNN
baseline method.
5.3.1 Implementation Details
The exact details of the neural network architecture used for semi-
supervised classification are shown in Tables B.1, B.2 and B.3. As
mentioned in Section 4.1.4, the semi-supervised AAE classification
model is concurrently learned in three phases: the reconstruction phase,
the regularization phase, and the semi-supervised classification phase.
Figure 5.7: Training procedure of the semi-supervised AAE divided into
three phases: the reconstruction phase (I), the regularization phase (II),
and the semi-supervised classification phase (III). The yellow box indicates
the encoder (generator) network, the red box indicates the decoder network,
and the green boxes indicate the discriminator network for adversarial
training.
Training Procedure
Training this combination of networks is a little tricky and requires care-
ful calibration of learning rates for optimal results. The semi-supervised
AAE model is trained in three phases with different loss functions as
discussed in Section 4.1.3. A phase comprises a single mini-batch train-
ing step. Figure 5.7 shows the training process in its different phases;
the active modules in the figure are highlighted in red for each
training phase.
The training procedure runs for 500 epochs with diminishing learn-
ing rates. The objective functions are trained using the Adam optimizer
with a batch size of 50. The initial learning rates are different for dif-
ferent phases of training as mentioned in the architecture tables. The
learning rate is reduced by a factor of 10 at 100 epochs and then by a
further factor of 10 at 300 epochs for all training phases. In the implementation,
this three-phase model is trained jointly in two iteration steps: the first two
phases (Phase-I and Phase-II) are trained together in a single iteration
step, and the third phase is trained in the next iteration step. These two
training iteration steps are repeated alternately, as sketched below.
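A schematic of this alternating procedure is given below. The loss and optimizer objects are assumed to be constructed elsewhere (the names and the dictionary-based interface are ours); only the phase ordering and the step-decay schedule follow the description above.

```python
def train_semi_supervised_aae(loader, losses, optimizers, n_epochs=500):
    """Alternating two-step training of the three-phase AAE model.

    losses:     dict of callables 'recon', 'adv', 'cls'; each maps a
                mini-batch to a scalar loss tensor (assumed given).
    optimizers: dict of torch.optim.Adam instances 'ae', 'adv', 'cls'.
    """
    for epoch in range(n_epochs):
        # Step decay: divide all learning rates by 10 at epochs 100 and 300.
        if epoch in (100, 300):
            for opt in optimizers.values():
                for group in opt.param_groups:
                    group["lr"] /= 10.0
        for batch in loader:
            # Iteration step 1: Phase-I (reconstruction) and Phase-II
            # (adversarial regularization) trained together.
            for phase, opt_name in (("recon", "ae"), ("adv", "adv")):
                optimizers[opt_name].zero_grad()
                losses[phase](batch).backward()
                optimizers[opt_name].step()
            # Iteration step 2: Phase-III, semi-supervised classification.
            optimizers["cls"].zero_grad()
            losses["cls"](batch).backward()
            optimizers["cls"].step()
```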
The data samples are scaled to the range 0 to 1. We use dropout
at the fully-connected layers with a dropout rate of 50%. No other
dropout or Gaussian-noise regularization is used in any other layer.
The labeled examples are chosen at random, but it is ensured that they
are evenly distributed across all the classes (see the sketch below). The
unlabeled examples belong to one of the ten classes. Batch normalization
is used in all the convolutional layers.
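The class-balanced selection of labeled examples can be implemented as a simple stratified draw; a sketch (names are ours) follows:

```python
import numpy as np

def sample_balanced_labeled_set(labels, n_labeled, n_classes=10, seed=0):
    """Randomly pick n_labeled sample indices, evenly split across classes."""
    rng = np.random.RandomState(seed)
    per_class = n_labeled // n_classes
    chosen = []
    for c in range(n_classes):
        idx = np.where(labels == c)[0]                      # samples of class c
        chosen.extend(rng.choice(idx, size=per_class, replace=False))
    return np.array(chosen)
```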
Learning Curves
The training and testing classification-loss curves, shown in Figure
5.8b, converge well, similar to a fully supervised learning procedure, and
do not suffer from overfitting even with only 500 labeled samples. The
model learns to perform as well as the supervised model (with the
same number of labeled samples) within the first 20 epochs. Since this ar-
chitecture trains with four different loss functions concurrently, some
care is required while training the model. Learning rates are kept
very low to allow convergence of each training loss. We can observe
in Figure 5.8a that the accuracy increases slowly over 500 epochs. We
can also verify, in Figure 5.8e, that the adversarial training is conducted
successfully. The blue and green curves in Figure 5.8e denote the
adversarial-discriminator loss functions. Their average values are con-
stant around 2 ln 2, which means the discriminator is fairly confused
between samples generated by the encoder and samples picked from
the prior distributions. In Figure 5.8f, we observe that the number of
samples produced for each class is approximately the same; this
verifies that the generator of the categorical adversarial network learns
an unbiased model. It also shows that this adversarial training pro-
cedure does not suffer from challenges like mode collapse. The
reconstruction loss (Figure 5.8c) decreases until 150 epochs and then
remains constant for the rest of the training process.
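The value 2 ln 2 follows directly from the discriminator's cross-entropy objective: writing D for the discriminator and q(z|x) for the encoder's output code (notation introduced here for this derivation only), a maximally confused discriminator assigns probability 1/2 to both prior samples and generated codes, so

\[
\mathcal{L}_D = -\ln D(z_{\text{prior}}) - \ln\bigl(1 - D(q(z \mid x))\bigr)
= -\ln\tfrac{1}{2} - \ln\tfrac{1}{2} = 2\ln 2 \approx 1.386.
\]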
To achieve the best performance for this semi-supervised model, the
model should be trained over 1000 epochs with diminishing learning
rates. In this work, all our experiments are limited to only 500 epochs.
Figure 5.8: These figures show different loss values and their behavior
as learning progresses. These figures are obtained from the model
trained with 1000 labeled samples and 15000 unlabeled samples on
the WIS dataset. (a) The semi-supervised classification accuracy, (b)
the cross-entropy classification loss for training (magenta) and testing
data (blue), (c) the autoencoder reconstruction loss, (d) the adversarial
generator loss curve, (e) the adversarial discriminator loss for class-
label (green) and style (blue) latent codes, (f) the output frequency of
the categorical generator for each class.
Importance of Latent Distributions
Imposing a categorical distribution on the output of the encoder helps
in making confident decisions about the class label of the inputs. This
ensures that the latent code y does not carry any continuous style
information; such information is captured only by the second part of the
latent code. The adversarial regularization using the categorical distribution
also ensures that the output y of the encoder follows a uniform distribution
over all the labels. This result can be observed in Figure 5.8f, where each
colored curve corresponds to the number of samples generated for each
class in one epoch.
Imposing a continuous distribution on the output of the encoder
captures the remaining non-class-label information, which is termed
style information. This disentanglement of style information can be easily
observed on the MNIST dataset [32], but it is difficult to comprehend
visually for high-resolution real-world images. From our exper-
iments, we found that, in the case of the MNIST dataset, it is not possible
to learn anything without a continuous prior on the style distribution.
According to our investigation, for real image data, this regularization
on the latent code helps in stabilizing the semi-supervised training
procedure. It also improves the training speed and alleviates overfitting
of the AAE network.
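As an illustration of this two-part latent code, a minimal PyTorch sketch of the encoder's output heads is given below, matching the dimensions in Table B.2 (10-way softmax for y, 30-D linear code for z). The class and variable names are ours, and for simplicity both heads branch from the same feature vector, whereas in Table B.2 they branch at different depths.

```python
import torch
import torch.nn as nn

class LatentHeads(nn.Module):
    """Maps encoder features to the categorical code y and style code z."""
    def __init__(self, in_features=512, n_classes=10, style_dim=30):
        super().__init__()
        self.to_y = nn.Linear(in_features, n_classes)  # class-label code
        self.to_z = nn.Linear(in_features, style_dim)  # continuous style code

    def forward(self, h):
        y = torch.softmax(self.to_y(h), dim=1)  # categorical distribution
        z = self.to_z(h)                        # linear mapping (default)
        return y, z
```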
5.3.2 Object Recognition Results
Here, we show the performance of our convolutional semi-supervised
AAE method on all three datasets. We compare the semi-supervised
classification results of our method with a competitive supervised CNN
baseline method.
CNN Baseline Method
In the following section, we compare our semi-supervised classifica-
tion results with a Convolutional Neural Network (CNN) baseline
method. Our CNN baseline network is a VGG-type 9-layer network
with 7 convolutional layers and 2 fully-connected layers. The complete
architecture of the CNN baseline network is shown in Table B.3. The
baseline CNN method is also trained for 500 epochs, with the learning
rate reduced after 100 and 300 epochs. The learning rates are
mentioned in the hyperparameter section of Table B.3.
Method                  MNIST (100)   MNIST (1000)   MNIST (All)
NN Baseline             33.21         8.60           1.70
VAE [20]                3.33          2.40           0.96
AAE-NN [32]             2.92          2.40           1.50
Conv-AAE [ours]         2.81          1.75           0.53
State-of-the-art [15]   0.89          0.74           0.36
Table 5.2: Semi-supervised classification performance (error rate in %)
on the MNIST dataset. The numbers in (·) indicate the number of labeled
samples used in each case, with 50,000 unlabeled samples. The category
‘All’ denotes that all 50,000 labeled samples were used to train the AAE
model. The ‘NN Baseline’ is a regular neural network architecture made
up of two hidden layers with 1024 hidden units each.
We also tested other standard CNN architectures, such as AlexNet [21]
and DenseNet [17], as the CNN baseline, but we found the VGG-type
network to be the most effective. Table 5.1 shows the comparison
between the different networks.
Method      Depth       CNN Accuracy
VGG-type    9 layers    79.19
AlexNet     7 layers    66.89
DenseNet    10 layers   65.26
Table 5.1: CNN baseline performance with different standard network
architectures. This performance is based on only 1000 labeled samples
from the WIS dataset.
MNIST Dataset
We performed our first experiments on the standard MNIST dataset
to verify the correctness of our implementation of the adversarial
autoencoder. The implementation described in [32] is based on fully-
connected networks. We first implemented the same architecture and
obtained similar results on semi-supervised classification. We then
implemented our convolutional AAE architecture on the MNIST dataset
and obtained better performance than the fully-connected AAE network.
The semi-supervised classification results are shown in Table 5.2.
Internet Dataset
Method         WIS (500)   WIS (1000)   WIS (4000)   WIS (All)
CNN Baseline   73.12       79.19        87.53        91.67
AAE            76.92       81.98        88.45        92.68
Increase       3.80        2.79         0.92         1.01
Table 5.3: Semi-supervised classification performance (accuracy) on the
Internet (WIS) dataset. The category ‘All’ denotes that all 15,000 labeled
samples were used to train the AAE model.
We consider the WIS dataset to be complete in terms of volume and
variety, which makes it very well suited for semi-supervised learning
methods. Our semi-supervised learning experiments on WIS validate
the workings of the convolutional AAE method. We obtain a perfor-
mance gain of 3.8% using unlabeled samples along with 3% labeled
samples, as compared to the CNN baseline method. Table 5.3 shows the
performance increase of the AAE method over the CNN baseline for
varying proportions of labeled samples. In this experiment, the number
of unlabeled samples is kept constant (15,000).
Real-world Dataset
Method         WIS (1000)    WIS (4000)    WIS (10000)
               RW (15000)    RW (15000)    RW (15000)
CNN Baseline   52.57         54.50         57.41
AAE            56.54         59.41         61.05
Increase       3.97          4.91          3.64
Table 5.4: Semi-supervised classification performance (accuracy) on the
Real-world (RW) dataset. In this experiment, we use labeled samples
from the WIS dataset along with unlabeled samples from the RW dataset.
The two rows in the table heading show the number of labeled samples
from the WIS dataset and the number of unlabeled samples from the
RW dataset.
The RW dataset is a challenging dataset because it also captures the
natural artifacts that are present in the real world. Although this dataset
does not contain many variants of object instances, we can achieve
competitive performance for semi-supervised classification as compared
to the CNN baseline method. In this semi-supervised learning technique,
we believe that the unlabeled samples are useful for identifying all the
underlying manifolds of the dataset, while the labeled samples reinforce
the manifolds that are useful for the classification task. It is important for
semi-supervised learning that the labeled samples strongly represent the
true class. Thus, we expect the labeled samples to cover as much variety
as possible and to be of high quality.
Since the RW dataset lacks variety and quality, it is difficult to
achieve a consistent classification performance. Therefore, for this
experiment, we use the labeled samples from the WIS dataset and
unlabeled samples from the RW dataset for training. We randomly
select the labeled samples from the WIS dataset because we need cleaner
and more diverse labeled data. The training and test splits of the RW
dataset contain different object instances. The semi-supervised
classification performance on the RW dataset is measured on 2000
samples (200 samples for each class) belonging to unseen object
instances. In this experiment, we achieve a performance gain of about
4%, as shown in Table 5.4. The lack of variation in the dataset explains
the inconsistency in the performance gain.
Hyperbolic Mapping of Latent (style) Code
Mapping   WIS (500-4.0)   WIS (500-2.0)
Linear    76.92           76.81
Tanh      75.73           77.45
Table 5.5: Results of semi-supervised classification (accuracy) with
hyperbolic mapping of the style distribution on the WIS dataset. The
values in (·) indicate the number of labeled samples along with the
variance of the prior distribution. The values are averaged over five
runs.
The default semi-supervised AAE architecture contains a linear map-
ping from the last hidden layer to the latent (style) representation. We
experimented with different non-linear mappings instead of the linear
operation and found that mapping the style distribution using the
hyperbolic tangent (tanh) function results in a further performance gain
of approximately 0.5%. We speculate that using the tanh mapping on the
style part of the latent representation reduces the magnitude of the
encoder parameters used to regularize the style distribution and helps
to capture more robust class-label information in the categorical
distribution.
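In terms of the LatentHeads sketch from Section 5.3.1, this variant only replaces the linear style mapping with a bounded one; how the prior is matched to the (-1, 1) range of tanh is our assumption, not a detail given in this work.

```python
import torch

class TanhLatentHeads(LatentHeads):
    """Same two heads as before, but with a tanh-mapped style code."""
    def forward(self, h):
        y = torch.softmax(self.to_y(h), dim=1)  # class-label code, unchanged
        z = torch.tanh(self.to_z(h))            # style code bounded to (-1, 1)
        return y, z
```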
Scalability
Another advantage of this method is that it works irrespective of the
scale of the input image. Therefore, higher performance can be obtained
if input images of higher resolution are used. In GAN-based methods,
by contrast, it becomes more difficult to learn a stable model for higher-
resolution images. This experiment shows that the performance of
semi-supervised AAE classification improves along with the baseline
accuracy as the resolution of the input image increases. The results at
different scales are shown in Table 5.6.
Scale       CNN (%)   AAE (%)   Increase
32 × 32     76.28     78.27     1.99
64 × 64     79.19     81.98     2.79
96 × 96     80.10     83.04     2.94
128 × 128   81.89     84.47     2.58
Table 5.6: The semi-supervised classification performance (accuracy)
for input images at different scales, increasing from 32 × 32 to 128 × 128.
This experiment is conducted on the WIS dataset, trained with 1000
labeled samples and 15000 unlabeled samples. The network architecture
is adjusted by adding or removing a (convolutional + pooling) layer for
different resolutions of the input image.
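The architecture adjustment mentioned in Table 5.6 amounts to choosing how many (convolution + pooling) stages to stack so that the spatial map reaching the fully-connected layers stays roughly fixed. A sketch of that bookkeeping follows; the target spatial size of 4 matches the 64 × 64 network in Table B.2, but extending it to the other resolutions this way is our reading.

```python
import math

def n_pool_stages(input_size, target_size=4):
    """Number of stride-2 pooling stages that reduce a square input of
    side input_size down to (roughly) target_size."""
    return int(math.log2(input_size // target_size))

# e.g. 32 -> 3 stages, 64 -> 4 stages, 128 -> 5 stages
```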
5.4 Online Learning with AAE
In this section, we demonstrate that this semi-supervised AAE method
can also be used for online learning. In an online learning setup, the
system is expected to learn autonomously over time with minimal or
no supervision. Using a naive approach, we show that this method may
be extended to continuous online learning, although this is beyond
the scope of this work. In our naive setting, the model is trained with
only a few labeled samples, and then it learns independently from
the unlabeled samples in a continuous manner without any further
supervision.
Phase     Labeled   Unlabeled   Accuracy
Phase-1   1000      0           79.19%
Phase-2   1000      3000        79.74%
Phase-3   1000      6000        80.23%
Phase-4   1000      9000        81.11%
Phase-5   1000      12000       81.78%
Phase-6   1000      15000       81.98%
Phase-7   1000      18000       82.12%
Table 5.7: The table shows that the performance of the semi-supervised
AAE improves as the number of unlabeled samples is increased. This
verifies that the system can be used in an online learning setup. The
values in the table are averaged over 3 runs each. The performance of
this system is shown on the WIS dataset.
In this experiment, we randomly select and label only 100 samples
per class from the dataset. Then, we increase the number of unlabeled
samples by 3000 for different phases of learning. These 3000 samples
are randomly picked from the dataset; therefore, the number of samples
per class may be slightly unbalanced. This experiment is performed on
the WIS dataset. We observe that the semi-supervised classification
accuracy consistently improves with a larger number of unlabeled
samples. The results are shown in Table 5.7, and a schematic of the
protocol follows below.
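A schematic of this phased protocol, with the training routine abstracted as a callable (the function names and the fixed random order are ours):

```python
import numpy as np

def run_online_phases(labeled_idx, unlabeled_pool, train_fn,
                      step=3000, n_phases=7):
    """Grow the unlabeled set by `step` randomly drawn samples per phase
    and retrain; the labeled set (100 samples per class) stays fixed.

    train_fn: callable taking (labeled_idx, unlabeled_idx) and returning
              the test accuracy of the retrained model (assumed given).
    """
    rng = np.random.RandomState(0)
    order = rng.permutation(len(unlabeled_pool))
    accuracies = []
    for phase in range(n_phases):
        unlabeled_idx = order[: phase * step]   # 0, 3000, 6000, ...
        accuracies.append(train_fn(labeled_idx, unlabeled_idx))
    return accuracies
```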
We also conducted an online learning experiment where samples are
continuously added in a single run to demonstrate its feasibility. In
this run, the number of unlabeled samples is increased by 1000 every
100 epochs until epoch 1900. We observe an increase in performance
of 2.64% from epoch 100 to epoch 1900, as shown in Figure 5.9. This
run is conducted with unlimited memory, where all the labeled and
unlabeled samples are stored and iterated over during training.
Figure 5.9: Performance of the AAE method for online learning. The
red curve corresponds to the case where all the samples are stored;
it shows a consistent performance gain with a larger number of unlabeled
samples. The blue curve, corresponding to the limited-memory case,
also shows a performance gain, but with a few fluctuations.
In another run, we performed the same experiment with limited
memory. In this run, the system is allowed to store only 1000 labeled
and 1000 unlabeled images at a time. In this experiment, new unla-
beled data is given in batches of 1000 images, whereas the labeled data
is always available during the training process. We observe sudden
drops in performance when new data is exposed to the learning
system. Overall, we observe an increase in classification accuracy as
compared to the CNN baseline, but the model underperforms compared
to the unlimited-memory case. This drop in performance is due to a
well-studied phenomenon in the literature known as ‘catastrophic
forgetting’ or ‘catastrophic interference’ [33]. When we remove the old
unlabeled data and add new unlabeled data, we believe that the learned
manifolds are disturbed, resulting in a sudden drop in performance.
The performance of both cases is shown in Figure 5.9.
5.5 Discussions
While studying different AAE architectures in this work, we observed
numerous interesting features and properties of the convolutional ad-
versarial autoencoder model:
• This network can be trained using an end-to-end training proce-
dure, and the same architecture can be easily scaled for different
datasets with consistent performance.
• The network trains without any overfitting, even with only 3% labeled
samples on the WIS dataset.
• Although the original work on AAE [32] suggests the addition of
Gaussian noise to the input layer, this convolutional version of the
AAE does not require the addition of noise to supplement learning.
• Finding the optimal learning rates is challenging since four differ-
ent objective functions work collectively on different parts of the
network.
• Due to the intricate training procedure, the model needs to be
trained at very low learning rates. Therefore, it requires a long training
time for optimal results.
• The reconstructed image from the decoder is blurry, but a better
reconstruction does not guarantee a better classification perfor-
mance.
Chapter 6
Conclusion and Future Work
6.1 Conclusion
In this thesis, we investigated whether semi-supervised learning
approaches based on deep generative models can be used for real-
world object recognition. We provided an introduction
to the latest deep generative models which use deep neural networks
for inference. Further, we summarized the existing literature on semi-
supervised learning based on different generative models and real-
world applications of semi-supervised learning.
We proposed convolutional adversarial autoencoder architectures
for learning on real-world data. Deep neural network models
often lack interpretability, but the modularity of this approach helped us
understand the learning dynamics to a good extent. We evaluated the
presented approach at increasing levels of model complexity. This
allowed us to investigate all the components of the semi-supervised
AAE classification architecture, leading to the final network architecture.
In this work, we show that our proposed convolutional AAE archi-
tecture can be successfully used for semi-supervised object recognition
on real-world data. We achieve a performance gain of approximately
4% for semi-supervised object recognition as compared to the fully
supervised method on real-world datasets. We obtain competitive
semi-supervised classification performance on the MNIST dataset com-
pared to state-of-the-art semi-supervised learning techniques, and we
also outperform the fully-connected AAE model for the MNIST dataset
proposed in [32]. The method performs consistently well over different
datasets without any major change in the network architecture or train-
ing procedure.
We also compiled two new real-world datasets for object recognition,
which are highly diverse and can also be used in tandem. Using our
dataset compilation approach, the internet-based dataset can be easily
expanded since we do not need to annotate the samples individually.
We also realized that the training dataset must contain a minimum
level of variation in terms of object variety. Finally, through some simple
experiments, we also found that our semi-supervised AAE approach
can be applied to lifelong learning.
6.2 Future Work
Most of the current methods in machine learning work under the closed-
world assumption. They assume that the world comprises only a certain
number of classes, predetermined before learning begins. However, our
world is changing rapidly: new categories appear and old ones disappear.
Therefore, the system should learn to adapt to these changes.
In this work, we notice that the performance gain saturates once the
number of unlabeled samples increases beyond a certain limit.
To further improve the performance of the system, we need to label
a few more samples from the unlabeled dataset. This gap between
the human and the autonomous system can be efficiently bridged with
methods like active learning and novelty detection, although it is quite
challenging to develop such methods for complex data distributions.
Using these techniques, the autonomous system can smartly query
the human to annotate the most valuable samples, and also notify the
user if it detects samples from a new category. This way, we can further improve
lifelong learning performance for visual recognition tasks.
Bibliography
[1] Anders Boesen Lindbo Larsen, Søren Kaae Sønderby, Hugo
Larochelle, and Ole Winther. “Autoencoding beyond pixels using a
learned similarity metric”. In: CoRR (2015).
[2] Christopher M. Bishop. Pattern Recognition and Machine Learning
(Information Science and Statistics). Secaucus, NJ, USA: Springer-
Verlag New York, Inc., 2006.
[3] Avrim Blum and Tom Mitchell. “Combining Labeled and Un-
labeled Data with Co-training”. In: Proceedings of the Eleventh
Annual Conference on Computational Learning Theory. 1998.
[4] Ulf Brefeld, Christoph Büscher, and Tobias Scheffer. “Multi-view
Discriminative Sequential Learning”. In: Machine Learning: ECML
2005: 16th European Conference on Machine Learning, Porto, Portugal,
October 3-7, 2005. Proceedings. 2005.
[5] Xinlei Chen, Abhinav Shrivastava, and Abhinav Gupta. “NEIL:
Extracting Visual Knowledge from Web Data”. In: International
Conference on Computer Vision (ICCV). CMU-RI-TR-. Pittsburgh,
PA, 2013.
[6] Emily L. Denton, Soumith Chintala, Arthur Szlam, and Robert
Fergus. “Deep Generative Image Models using a Laplacian Pyra-
mid of Adversarial Networks”. In: CoRR (2015).
[7] Alexey Dosovitskiy and Thomas Brox. “Generating Images with
Perceptual Similarity Metrics based on Deep Networks”. In: CoRR
(2016).
[8] Vincent Dumoulin, Ishmael Belghazi, Ben Poole, Alex Lamb, Mar-
tin Arjovsky, Olivier Mastropietro, and Aaron Courville. “Adver-
sarially Learned Inference.” In: CoRR (2016).
[9] Zackory M. Erickson, Sonia Chernova, and Charles C. Kemp.
“Semi-Supervised Haptic Material Recognition for Robots using
Generative Adversarial Networks”. In: CoRR (2017).
[10] Rob Fergus, Yair Weiss, and Antonio Torralba. “Semi-Supervised
Learning in Gigantic Image Collections”. In: Advances in Neural
Information Processing Systems 22. 2009.
[11] Akinori Fujino, Naonori Ueda, and Kazumi Saito. “A Hybrid
Generative/Discriminative Approach to Semi-supervised Clas-
sifier Design”. In: Proceedings of the 20th National Conference on
Artificial Intelligence - Volume 2. 2005.
[12] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu,
David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua
Bengio. “Generative Adversarial Nets”. In: Advances in Neural In-
formation Processing Systems 27. Ed. by Z. Ghahramani, M. Welling,
C. Cortes, N. D. Lawrence, and K. Q. Weinberger. 2014.
[13] Ian J. Goodfellow. “NIPS 2016 Tutorial: Generative Adversarial
Networks”. In: CoRR (2017).
[14] Jiuxiang Gu, Zhenhua Wang, Jason Kuen, Lianyang Ma, Amir
Shahroudy, Bing Shuai, Ting Liu, Xingxing Wang, and Gang
Wang. “Recent Advances in Convolutional Neural Networks”.
In: CoRR (2015).
[15] Philip Häusser, Alexander Mordvintsev, and Daniel Cremers.
“Learning by Association - A versatile semi-supervised training
method for neural networks”. In: CoRR (2017).
[16] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. “Deep
Residual Learning for Image Recognition”. In: CoRR (2015).
[17] Gao Huang, Zhuang Liu, Laurens van der Maaten, and Kilian Q
Weinberger. “Densely connected convolutional networks”. In:
Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition. 2017.
[18] Sergey Ioffe and Christian Szegedy. “Batch Normalization: Accel-
erating Deep Network Training by Reducing Internal Covariate
Shift”. In: CoRR (2015).
[19] Diederik P. Kingma, Danilo Jimenez Rezende, Shakir Mohamed,
and Max Welling. “Semi-Supervised Learning with Deep Genera-
tive Models”. In: CoRR (2014).
[20] Diederik P. Kingma and Max Welling. “Auto-Encoding Varia-
tional Bayes.” In: CoRR (2013).
[21] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. “Ima-
geNet Classification with Deep Convolutional Neural Networks”.
In: Proceedings of the 25th International Conference on Neural Infor-
mation Processing Systems. USA, 2012.
[22] S. Kullback and R. A. Leibler. “On Information and Sufficiency”.
In: Ann. Math. Statist. 22.1 (Mar. 1951).
[23] Abhishek Kumar, Prasanna Sattigeri, and P. Thomas Fletcher.
“Improved Semi-supervised Learning with GANs using Manifold
Invariances”. In: CoRR (2017).
[24] Samuli Laine and Timo Aila. “Temporal Ensembling for Semi-
Supervised Learning”. In: CoRR (2016).
[25] Alex Lamb, Vincent Dumoulin, and Aaron C. Courville. “Dis-
criminative Regularization for Generative Models”. In: CoRR
(2016).
[26] Jeongtae Lee, Jaehong Yun, Sungju Hwang, and Eunho Yang.
“Lifelong Learning with Dynamically Expandable Networks”. In:
CoRR (2017).
[27] Fei-Fei Li, Andrej Karpathy, and Justin Johnson. Stanford Lecture
CS231n: Convolutional Neural Networks for Visual Recognition. 2016.
URL: http://cs231n.stanford.edu/.
[28] Shan Luo, Xiaozhou Liu, Kaspar Althoefer, and Hongbin Liu.
“Tactile Object Recognition with Semi-Supervised Learning”. In:
(Aug. 2015).
[29] Lars Maaløe, Casper Kaae Sønderby, Søren Kaae Sønderby, and
Ole Winther. “Auxiliary Deep Generative Models”. In: Proceedings
of the 33rd International Conference on International Conference on
Machine Learning - Volume 48. 2016.
[30] Andrew L. Maas, Awni Y. Hannun, and Andrew Y. Ng. “Rectifier
nonlinearities improve neural network acoustic models”. In: in
ICML Workshop on Deep Learning for Audio, Speech and Language
Processing. 2013.
[31] Alireza Makhzani and Brendan J. Frey. “PixelGAN Autoencoders”.
In: CoRR (2017).
[32] Alireza Makhzani, Jonathon Shlens, Navdeep Jaitly, and Ian J.
Goodfellow. “Adversarial Autoencoders”. In: CoRR (2015).
[33] Michael McCloskey and Neal J. Cohen. “Catastrophic Interfer-
ence in Connectionist Networks: The Sequential Learning Prob-
lem”. In: Psychology of Learning and Motivation - Advances in Re-
search and Theory (1989).
[34] Lars M. Mescheder, Sebastian Nowozin, and Andreas Geiger.
“Adversarial Variational Bayes: Unifying Variational Autoencoders
and Generative Adversarial Networks”. In: CoRR (2017).
[35] Danilo J. Rezende, Shakir Mohamed, and Daan Wierstra. “Stochas-
tic Backpropagation and Approximate Inference in Deep Gener-
ative Models”. In: Proceedings of the 31st International Conference
on Machine Learning (ICML-14). JMLR Workshop and Conference
Proceedings, 2014.
[36] Chuck Rosenberg, Martial Hebert, and Henry Schneiderman.
“Semi-Supervised Self-Training of Object Detection Models”. In:
Proceedings of the Seventh IEEE Workshops on Application of Com-
puter Vision (WACV/MOTION’05) - Volume 1 - Volume 01. 2005.
[37] Fereshteh Sadeghi, Santosh K Divvala, and Ali Farhadi. “VisKE:
Visual Knowledge Extraction and Question Answering by Vi-
sual Verification of Relation Phrases”. In: Proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition. 2015.
[38] Tim Salimans, Ian J. Goodfellow, Wojciech Zaremba, Vicki Che-
ung, Alec Radford, and Xi Chen. “Improved Techniques for Train-
ing GANs”. In: CoRR (2016).
[39] Karen Simonyan and Andrew Zisserman. “Very Deep Convolu-
tional Networks for Large-Scale Image Recognition”. In: CoRR
(2014).
[40] Jost Tobias Springenberg. “Unsupervised and Semi-supervised
Learning with Categorical Generative Adversarial Networks”. In:
CoRR (2015).
[41] Giorgos Tolias, Ronan Sicre, and Hervé Jégou. “Particular ob-
ject retrieval with integral max-pooling of CNN activations”. In:
CoRR (2015).
[42] Junbo Jake Zhao, Michaël Mathieu, Ross Goroshin, and Yann
LeCun. “Stacked What-Where Auto-encoders.” In: CoRR (2015).
[43] Denny Zhou, Jiayuan Huang, and Bernhard Schölkopf. “Learning
from Labeled and Unlabeled Data on a Directed Graph”. In: Pro-
ceedings of the 22nd International Conference on Machine Learning.
ACM Press, 2005.
[44] Xiaojin Zhu. Semi-Supervised Learning Literature Survey. 2006.
Appendix A
Datasets
A.1 Dataset Filtering
Figure A.1: This figure shows qualitative results of the filtering process
discussed in Section 5.1.4. The filtering approach helps in removing
most of the false positives obtained from the image search engine.
A.2 Real-world Dataset: Video Streams
Here, we show two video streams captured using a hand-held camera.
We sample image frames from such video streams to create our RW
dataset.
Figure A.2: Samples from one of the video streams used for collecting
the RW dataset.
Figure A.3: Samples from another video stream used for collecting the
RW dataset.
Appendix B
Architecture Details
B.1 Semi-supervised Convolutional AAE
B.1.1 Adversarial Network: Discriminator
Operation   Hidden Units   BN?   Dropout   Non-lin
Input: 30-D
FC-1        1024           —     0.0       ReLU
FC-2        1024           —     0.0       ReLU
Output      2              —     0.0       Sigmoid
Hyperparameters
Learning rate (epochs)   0–100   100–300   300–500
Opt:Adam-Adv (α)         1e-5    1e-6      1e-7
Opt:Adam-Adv (β)         β1 = 0.1, β2 = 0.999
Epochs = 500, Batch Size = 50
Weight initialization    Isotropic Gaussian (µ = 0, σ = 0.02)
Bias initialization      Constant (0.1)
Table B.1: Adversarial module: Discriminator sub-network of the ad-
versarial network for both categorical and continuous distribution. BN
in the column heading stands for ‘Batch-Normalization’. FC denotes
fully-connected layer. ‘Non-lin’ stands for non-linearity type of the
activation function used for the corresponding layer.
B.1.2 Autoencoder Network
Operation Filters Kernel Strides BN? Dropout Non-lin
Input-64× 64× 3 3 — — — — —
Convolution-1 64 3× 3 1× 1 0.0 LReLU
Convolution-2 64 3× 3 1× 1 0.0 LReLU
Max-Pooling-1 64 2× 2 2× 2 0.0 LReLU
Convolution-3 128 3× 3 1× 1 0.0 LReLU
Convolution-4 128 3× 3 1× 1 0.0 LReLU
Max-Pooling-2 128 2× 2 2× 2 0.0 LReLU
Convolution-5 128 3× 3 1× 1 0.0 LReLU
Convolution-6 256 3× 3 1× 1 0.0 LReLU
Max-Pooling-3 256 2× 2 2× 2 0.0 LReLU
Convolution-7 256 3× 3 1× 1 0.0 LReLU
Max-Pooling-4 256 2× 2 2× 2 0.0 LReLU
FC-1y,z 1024 — — — 0.5 LReLU
FC-2y 512 — — — 0.5 LReLU
Latent-Code (lc-y) 10 — — — 0.0 Softmax
Latent-Code (lc-z) 30 — — — 0.0 Linear
Concat (lc-y + lc-z) 40 — — — — —
FC-3y+z 16384 — — — 0.0 Linear
Reshape-8×8×256 256 — — — 0.0 Linear
Up-convolution-1 128 3× 3 2× 2 0.0 LReLU
Up-convolution-2 64 3× 3 2× 2 0.0 LReLU
Up-convolution-3 3 3× 3 2× 2 0.0 Sigmoid
Output 64× 64× 3
Hyperparameters
Learning rate (epochs)   0–100   100–300   300–500
Opt:Adam-AE (α)          5e-7    5e-8      5e-9
Opt:Adam-AE (β)          β1 = 0.9, β2 = 0.999
Epochs = 500, Batch Size = 50, Leaky ReLU slope = 0.01
Weight initialization: Isotropic Gaussian (µ = 0, σ = 0.02)
Bias initialization: Constant (0.1)
Table B.2: ‘AE’ stands for autoencoder. The superscripts ‘y’ and ‘z’
represent the class-label and style latent variables, respectively. lc-y and
lc-z represent the latent codes for the class-label and style, respectively.
For FC layers, ‘Filters’ corresponds to the number of hidden units.
B.1.3 Classification/Adversarial Network: Generator
Operation Filters Kernel Strides BN? Dropout Non-lin
Input-64×64×3 3 — — — — —
Convolution-1 64 3× 3 1× 1 0.0 LReLU
Convolution-2 64 3× 3 1× 1 0.0 LReLU
Max-Pooling-1 64 2× 2 2× 2 0.0 LReLU
Convolution-3 128 3× 3 1× 1 0.0 LReLU
Convolution-4 128 3× 3 1× 1 0.0 LReLU
Max-Pooling-2 128 2× 2 2× 2 0.0 LReLU
Convolution-5 128 3× 3 1× 1 0.0 LReLU
Convolution-6 256 3× 3 1× 1 0.0 LReLU
Max-Pooling-3 256 2× 2 2× 2 0.0 LReLU
Convolution-7 256 3× 3 1× 1 0.0 LReLU
Max-Pooling-4 256 2× 2 2× 2 0.0 LReLU
FC-1 1024 — — 0.5 LReLU
FC-2 512 — — 0.5 LReLU
Output-y 10 — — — 0.0 Softmax
Hyperparameters
Learning rate (epochs)   0–100   100–300   300–500
Opt:Adam-CNN (α)         1e-5    1e-6      1e-7
Opt:Adam-CNN (β)         β1 = 0.9, β2 = 0.999
Opt:Adam-Gen (α)         1e-4    1e-5      1e-6
Opt:Adam-Gen (β)         β1 = 0.1, β2 = 0.999
Epochs = 500, Batch Size = 50, Leaky ReLU slope = 0.01
Weight initialization: Isotropic Gaussian (µ = 0, σ = 0.02)
Bias initialization: Constant (0.1)
Table B.3: Semi-supervised classification module: This architecture
is also used as the CNN Baseline architecture. ’Gen’ indicates Gen-
erator network, which is active during adversarial training. ’CNN’
indicates ’Convolutional Neural Network’ which is active during the
semi-supervised classification phase.