
Rethinking Class Relations: Absolute-relative Few-shot Learning

Hongguang Zhang 1,2,3∗, Philip H. S. Torr 2, Hongdong Li 1,4, Songlei Jian 5, Piotr Koniusz 3,1

1 Australian National University, 2 University of Oxford, 3 Data61/CSIRO, 4 Australian Centre for Robotic Vision, 5 National University of Defense Technology

firstname.lastname@{anu.edu.au1, eng.ox.ac.uk2, data61.csiro.au3}

Abstract

The majority of existing few-shot learning methods describe image relations with {0, 1} binary labels. However, such binary relations are insufficient to teach the network complicated real-world relations, due to the lack of decision smoothness. Furthermore, current few-shot learning models capture only the similarity via relation labels, but they are not exposed to class concepts associated with objects, which is likely detrimental to the classification performance due to under-utilization of the available class labels. To paraphrase, while children learn the concept of tiger from a few examples with ease, and while they learn from comparisons of tiger to other animals, they are also taught the actual concept names. Thus, we hypothesize that both similarity and class concept learning must occur simultaneously. With these observations at hand, we study the fundamental problem of simplistic class modeling in current few-shot learning, we rethink the relations between class concepts, and we propose a novel absolute-relative learning paradigm to fully take advantage of label information, refining the image representations and correcting the relation understanding. Our proposed absolute-relative learning paradigm improves the performance of several state-of-the-art models on publicly available datasets.

1. Introduction

Deep learning, a popular learning paradigm in computer vision, has improved the performance on numerous computer vision tasks, such as category recognition, scene understanding and action recognition. However, deep models heavily rely on large amounts of labeled training data, demanding costly data collection and labelling processes.

In contrast, humans possess the ability to learn and memorize new complex visual concepts from very few examples.

∗This work is under review. Please respect the authors' efforts by not copying/plagiarizing bits and pieces of this work for your own gain. If you find anything inspiring in this work, be kind enough to cite it, thus showing you care for the CV community.

Figure 1: Our few-shot learning paradigm. Absolute Learning (AL) refers to the strategy in which a pipeline learns to predict absolute object information, e.g., the object or concept class. Relative Learning (RL) denotes similarity (relation) learning with the use of binary {0, 1} and/or soft [0, 1] similarity labels. Absolute-relative Learning (ArL) is a combination of AL and RL which is akin to multi-task learning but also seems conceptually closer to how humans learn from few examples. (Absolute labels: class label, scales, rotations, jigsaw, colours, shapes, materials, WordNet; relative labels: colours, shapes, materials, WordNet.)

Inspired by this observation, researchers have focused on the so-called few-shot learning problem, for which a network is trained with only a few labeled training instances. Recently, deep networks based on relation learning have gained popularity [29, 27, 28, 26, 33]. Such approaches can be viewed as a form of metric learning adapted to the few-shot learning task. These works learn to capture image relations (similarity between pairs of images) on so-called base classes and can be evaluated on images containing novel classes.


However, there are two major problems in these relation learning pipelines: (i) binary labels {0, 1} are used to express the similarity between pairs of images, which cannot capture the similarity nuances of the real-world setting due to the hardness of such modeling, leading to biases in the relation-based models; (ii) only pair-wise relation labels are used in these pipelines, so the models have no knowledge of the actual class concepts. In other words, these models are trained to learn the similarity between image pairs while discarding the explicit object classes which are otherwise available for use at the training stage.

We conjecture that it is these two problems that lead to the inconsistency between current few-shot machine learning approaches and human cognitive processes. To validate this, we propose Absolute-relative Learning (ArL) to expose few-shot learners to both similarity and class labels, and we employ semantic annotations to circumvent the issue with the somewhat rigid binary similarity labels {0, 1}.

Our ArL consists of two separate learning modules, namely Absolute Learning (AL) and Relative Learning (RL). AL denotes the strategy for which we learn to predict the actual object categories or class concepts in addition to learning the class relations. In this way, the feature encoding network is exposed to additional object- or concept-related knowledge. RL refers to the similarity learning strategy for which (apart from binary {0, 1} labels) we employ various semantic annotations to promote realistic similarity between image pairs, that is, we employ labels continuous on the [0, 1] interval. Such labels are further used as the supervisory cue in relation learning. In this way, the relation network is trained to better capture the actual soft object relations beyond the binary similarity labels {0, 1}.

By combining the AL and RL strategies, which constitute ArL, the relation network is simultaneously taught the class/object concepts together with more realistic class/object relations, thus naturally yielding an improved accuracy.

Our approach is somewhat related to multi-modal learning, which leverages multiple sources of data for training and testing. However, while multi-modal learning approaches feed multiple streams of data into the network inputs, our ArL models the semantic annotations in the label space, that is, we use them for the network output. We believe that our strategy of using multiple abstractions of labels (relative vs. absolute) encourages the network to preserve more information about objects relevant to the classification task. Thus, our strategy draws on experience in multi-task learning, where learning two tasks simultaneously is expected to help each task and outperform a naive fusion of two separate tasks.

We note that obtaining the semantic information for novel classes (the testing step in few-shot learning) is not always easy or possible. Since our pipeline design is akin to multi-task rather than multi-modal learning, our model does not require additional labeling at the testing stage, which is a more realistic setting than that of existing approaches.

Below, we summarize our contributions:

i. We propose the novel Absolute-relative Learning paradigm, which can be embedded into popular few-shot pipelines to exploit both similarity and object/concept labelling.

ii. We investigate the influence of different types of semantic annotations and similarity measures in Relative Learning, e.g., hard vs. soft similarity labelling.

iii. We investigate the influence of different Absolute Learning branches on the classification performance.

To the best of our knowledge, we are the first to perform an in-depth analysis of object and class relation modeling in the context of few-shot learning, and to propose the Absolute-relative Learning paradigm and soft relation learning via semantic annotations.

2. Related Work

Below we describe recent zero-, one- and few-shot learning algorithms, followed by semantic-based approaches.

2.1. Learning From Few Samples

For deep learning algorithms, the ability of "learning from only a few examples, the desired characteristic to emulate in any brain-like system" [23], is a desired operating principle which poses a challenge, as typical CNNs are designed for large-scale visual category recognition [25]. One- and Few-shot Learning has been studied widely in computer vision in both shallow [19, 18, 7, 4, 6, 16] and deep learning scenarios [15, 29, 27, 8, 28, 33].

Early works [6, 16] propose one-shot learning methods motivated by the observation that humans can learn new concepts from very few examples. Siamese Network [15] presents a two-stream convolutional neural network approach which generates image descriptors and learns the similarity between them. Matching Network [29] introduces the concept of the support set and the L-way Z-shot learning protocol. It captures the similarity between one testing image and several support images, thus casting the one-shot learning problem as set-to-set learning. Prototypical Networks [27] learn a model that computes distances between a datapoint and prototype representations of each class. Model-Agnostic Meta-Learning (MAML) [8] introduces a meta-learning model trained on a variety of different learning tasks. Relation Net [28] is an efficient end-to-end network for learning the relationship between testing and support images. Conceptually, this model is similar to Matching Network [29]. However, Relation Net leverages an additional deep neural network to learn similarity on top of the image-descriptor generating network.

Figure 2: Motivation for relative learning: binary vs. semantic labels. (Left) 'Ibizan Hound' and 'Walker Hound' are close in the feature space, so image pairs from these two classes (whether or not they come from the same class) are likely to randomly flip from 0 to 1 (or 1 to 0) due to the quantization error if the network is learnt with binary labels. (Right) If the network is trained with semantic labels, it can assign a more accurate score to class pairs which are close to each other, limiting the quantization error.

Second-order Similarity Network (SoSN) [33] is similar to Relation Net [28] in that it consists of a feature encoder and a relation network. However, approach [28] uses first-order representations for similarity learning, whereas SoSN investigates second-order representations to capture co-occurrences of features. Graph Neural Networks (GNN) have also been applied to few-shot learning in many recent works [9, 14, 11], achieving promising results.
Zero-shot Learning can be implemented within similarity learning frameworks which follow [15, 29, 27, 28, 32]. Below we summarize popular zero-shot learning methods. Attribute Label Embedding (ALE) [1] uses attribute vectors as the label embedding and an objective inspired by the structured WSABIE ranking method, which assigns more importance to the top of the ranking list. Embarrassingly Simple Zero-Shot Learning (ESZSL) [24] implements regularization terms for a linear mapping that penalize the projection of feature vectors onto the attribute space and the projection of attribute vectors onto the feature space.

2.2. Learning From Semantic Labels

Semantic labels are often used in various computer vision tasks, e.g., object classification, face and emotion recognition, image retrieval, transfer learning, and especially zero-shot learning. Metric learning approaches often use semantic information, as detailed below.

Approach [5] describes an image retrieval system which utilizes the semantics of images via probabilistic modeling. Paper [31] presents a novel bi-relational graph model that comprises both the data graph and the semantic label graph, and connects them by an additional bipartite graph built from label assignments. Approach [22] proposes a classifier based on semantic annotations and provides a theoretical bound linking the error rate of the classifier to the number of instances required for training; the proposed classifier guarantees a fast rate of convergence. Approach [12] improves metric learning via the use of semantic labels: the authors study how to effectively learn a distance metric from datasets that contain semantic annotations, and propose novel metric learning mechanisms for different types of semantic annotations. Our relative learning is somewhat similar in its use of semantic information to learn the metric; however, we use similarity measure functions to simulate realistic relation labels, which is especially beneficial for few-shot learning.

2.3. Multi-task and Self-supervised Learning

Multi-task learning aims at learning feature descriptions across a set of multiple related tasks. Early work [3] treats multi-task learning as a convex optimization problem and proposes an iterative algorithm to solve it. [13] proposes a principled approach that considers the homoscedastic uncertainty of each task to weight multiple loss functions. Self-supervised learning can be regarded as a specialized type of multi-task learning, which simultaneously learns to predict image attributes, e.g., rotation, scale and colour, while solving classic learning tasks such as image classification or action recognition. Recent work [10] applies self-supervision to few-shot learning pipelines by learning to predict image rotations, which boosts the model's performance. Our absolute learning is somewhat related to multi-task and self-supervised learning; however, we focus on how to refine the backbone by learning the knowledge of class concepts within a single task to address the above-mentioned problem in few-shot learning.

3. Background

The concept of, and the pipeline for, few-shot similarity learning are described in this section.

3.1. Similarity Network

A few-shot learning network typically consists of two parts: (i) a feature encoder and (ii) a relation network.


The role of the feature encoder is to generate convolutional features used as image descriptors. The role of the relation network is to learn the relation between so-called support and query image descriptors in order to compare them. To assess the impact of ArL on few-shot learning, we implement it within three recent few-shot learning pipelines: Relation Net [28], SoSN [33] and SalNet [34].

Below we take the two-stage 'feature encoder-relation network' design [28, 33, 34] as an example in order to illustrate the main aspects of few-shot learning pipelines.

A basic relation network [28, 33] contains 2-4 convolutional blocks and 2 fully-connected layers. Each convolutional block is the combination of a convolution with filters, Batch Normalization, ReLU activation and a max-pooling layer.

Let us define the feature encoding network as an operator $f:(\mathbb{R}^{W\times H};\mathbb{R}^{|\mathcal{F}|})\rightarrow\mathbb{R}^{K\times N}$, where $W$ and $H$ denote the width and height of an input image, $K$ is the length of feature vectors (number of filters), and $N=N_W\cdot N_H$ is the total number of spatial locations in the last convolutional feature map. For simplicity, we denote an image descriptor by $\Phi\in\mathbb{R}^{K\times N}$, where $\Phi=f(\mathbf{X};\mathcal{F})$ for an image $\mathbf{X}\in\mathbb{R}^{W\times H}$, and $\mathcal{F}$ are the parameters-to-learn of the encoding network.

The relation network, whose role is to compare two data points encoded as some $K'$-dimensional vectorized second-order representations, is denoted by the operator $r:(\mathbb{R}^{K'};\mathbb{R}^{|\mathcal{P}|})\rightarrow\mathbb{R}$. Typically, we write $r(\boldsymbol{\psi};\mathcal{P})$, where $\boldsymbol{\psi}\in\mathbb{R}^{K'}$ and $\mathcal{P}$ are the parameters-to-learn of the relation network.
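To make the two-stage design concrete, the sketch below gives a minimal PyTorch rendition of a Conv-4-64-style encoder $f$ and a relation network $r$. It is an illustrative assumption of the architecture (for brevity the relation network keeps only the fully-connected part), not the exact configuration of [28, 33].

```python
import torch
import torch.nn as nn

def conv_block(cin, cout):
    # convolution + Batch Normalization + ReLU + max-pooling, as described above
    return nn.Sequential(
        nn.Conv2d(cin, cout, kernel_size=3, padding=1),
        nn.BatchNorm2d(cout),
        nn.ReLU(),
        nn.MaxPool2d(2))

class FeatureEncoder(nn.Module):
    """f: maps an image X (W x H) to a descriptor Phi of shape (K, N)."""
    def __init__(self, k=64):
        super().__init__()
        self.blocks = nn.Sequential(
            conv_block(3, k), conv_block(k, k),
            conv_block(k, k), conv_block(k, k))

    def forward(self, x):                  # x: (B, 3, W, H)
        fmap = self.blocks(x)              # (B, K, N_W, N_H)
        return fmap.flatten(2)             # (B, K, N), N = N_W * N_H

class RelationNetwork(nn.Module):
    """r: maps a vectorized pair representation psi (K'-dim.) to a score."""
    def __init__(self, k_prime, hidden=8):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(k_prime, hidden), nn.ReLU(),
            nn.Linear(hidden, 1), nn.Sigmoid())   # similarity in [0, 1]

    def forward(self, psi):                # psi: (B, K')
        return self.fc(psi).squeeze(-1)    # (B,) similarity scores

enc = FeatureEncoder()
phi = enc(torch.randn(2, 3, 84, 84))       # (2, 64, 25) for 84x84 inputs
```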

3.2. Problem Formulation

For the $L$-way 1-shot problem, we assume one support image $\mathbf{X}$ with its image descriptor $\Phi$ and one query image $\mathbf{X}^*$ with its image descriptor $\Phi^*$. In general, we use '$*$' to indicate query-related variables. Moreover, each of the above descriptors belongs to one of $L$ classes in the subset $\{c_1,...,c_L\}\subset\mathcal{I}_C$ that forms the so-called $L$-way learning problem, and the class subset $\{c_1,...,c_L\}$ is chosen at random from $\mathcal{I}_C\equiv\{1,...,C\}$. Then, the $L$-way 1-shot learning step can be defined as learning the similarity:

$$SI(\Phi,\Phi^*;\mathcal{P}) = r\big(\vartheta(\Phi,\Phi^*);\mathcal{P}\big), \qquad (1)$$

where $SI$ refers to the similarity prediction for a given image pair, $r$ refers to the relation network, and $\mathcal{P}$ denotes the network parameters that have to be learnt. $\vartheta$ is the operator on the features of image pairs; in this paper it denotes concatenation.

For the $L$-way $Z$-shot problem, we assume some support images $\{\mathbf{X}_s\}_{s\in\mathcal{W}}$ from a set $\mathcal{W}$ and their corresponding image descriptors $\{\Phi_s\}_{s\in\mathcal{W}}$, which can be considered as a $Z$-shot descriptor. Moreover, we assume one query image $\mathbf{X}^*$ with its image descriptor $\Phi^*$. Again, both the $Z$-shot and the query descriptors belong to one of $L$ classes in the subset $\mathcal{C}^\ddagger\equiv\{c_1,...,c_L\}\subset\mathcal{I}_C$. Then, the $L$-way $Z$-shot learning step can be defined as learning the similarity:

$$SI(\{\Phi_s\}_{s\in\mathcal{W}},\Phi^*;\mathcal{P}) = r\big(\vartheta(\{\Phi_s\}_{s\in\mathcal{W}},\Phi^*);\mathcal{P}\big). \qquad (2)$$

Following approach [28] and SoSN [33], the Mean Square Error (MSE) is employed as the objective function:

$$L=\sum_{c\in\mathcal{C}^\ddagger}\sum_{c'\in\mathcal{C}^\ddagger}\Big(SI\big(\{\Phi_s\}_{s\in\mathcal{W}_c},\Phi^*_{q\in\mathcal{Q}:\ell(q)=c'};\mathcal{P}\big)-\delta(c-c')\Big)^2,$$
$$\text{s.t.}\quad \Phi_s=f(\mathbf{X}_s;\mathcal{F}) \;\text{ and }\; \Phi^*_q=f(\mathbf{X}^*_q;\mathcal{F}). \qquad (3)$$

In the above equation, $\mathcal{W}_c$ is a randomly chosen set of support image descriptors of class $c\in\mathcal{C}^\ddagger$, $\mathcal{Q}$ is a randomly chosen set of $L$ query image descriptors such that its consecutive elements belong to the consecutive classes in $\mathcal{C}^\ddagger\equiv\{c_1,...,c_L\}$, $\ell(q)$ corresponds to the label of $q\in\mathcal{Q}$, and $\delta$ refers to the indicator function.
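As a minimal sketch, the objective in Eq. (3) can be computed from a matrix of similarity scores and the episode labels; the tensor shapes and function names below are our assumptions, not the authors' code:

```python
import torch

def binary_relation_mse(scores, query_labels, support_classes):
    # scores: (L, Q) similarity SI for each support class c and each query q
    # query_labels: (Q,) class id of each query; support_classes: (L,) class ids
    # delta(c - c') becomes a 0/1 target matrix via broadcasting
    targets = (support_classes.unsqueeze(1) == query_labels.unsqueeze(0)).float()
    return ((scores - targets) ** 2).sum()

# toy 5-way episode with 10 query images
scores = torch.rand(5, 10)
query_labels = torch.randint(0, 5, (10,))
print(binary_relation_mse(scores, query_labels, torch.arange(5)))
```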

4. Approach

Below, we first illustrate the Relative Learning and Absolute Learning modules, followed by an introduction of the Absolute-relative Learning pipeline. We note that all types of auxiliary information, e.g., attributes and Word2vec embeddings, are used in the label space. They are not used as extra input datapoints.

4.1. Relative Learning

In conventional few-shot learning approaches, binary class labels are employed to train the CNNs in order to model the relations between pairs of images. However, labeling such pairs as similar/dissimilar (think {0, 1}) cannot fully reflect the actual relations between objects.

In Figure 2, we illustrate the difference between using binary labels and semantic labels for few-shot similarity learning. As can be seen, in the conventional setup, pairs formed from samples belonging to similar concepts are likely to confuse the training process.

In this paper, we take a deeper look at how to represent relations in the few-shot learning scenario. To better illustrate class relations in the label space, we introduce semantic annotations, e.g., attributes and Word2vec. Based on these semantic annotations, we develop novel strategies to investigate how semantic relation labels influence the final few-shot learning performance.

The way we use semantic annotations differs from multi-modal few-shot learning, which leverages the semantic information as an extra input. Instead, we use semantic annotations as training labels, and we do not use them during testing at all. The difference between SoSN (on which we build) and our semantic-based model is demonstrated in Figure 3.

A natural way to utilize the semantic information is to generate soft labels from semantic annotations through some similarity measure. The first similarity measure applied by us is based on the RBF similarity, defined as follows:

Figure 3: The pipeline: standard Relation Learning, Relative Learning and Absolute Learning. The original SoSN uses binary labels for supervision, while we introduce different types of labelling which can be thought of as soft vs. hard, and relative vs. absolute. Relative Learning employs semantic annotations (e.g., attributes, Word2vec) to capture more realistic object relations, while Absolute Learning leverages an extra classifier branch to predict object annotations, e.g., class labels, attributes, and Word2vec embeddings. (Panel legend: Relation Learning learns whether two objects are from the same class; Relative Learning additionally learns how similar the two objects are; Absolute Learning additionally learns what the two objects are.)

$$k_1(c,c') = e^{-\frac{\|\psi(c)-\psi(c')\|_2^2}{\sigma^2}}, \qquad (4)$$

where $c$ and $c'$ denote the classes of the two images, respectively, the vector $\psi(c)$ is the semantic embedding of class $c$, and $\sigma$ is the RBF radius.

A perhaps better similarity measure, presented below, employs the $\alpha$ parameter, which interpolates the shape of the slope between Laplacian and Gaussian kernels:

$$k_2(c,c') = \frac{(1+\alpha)\,k_1(c,c')}{1+\alpha\, k_1(c,c')}. \qquad (5)$$

The profiles of the above two similarity measures are shown in Figure 4. The similarity measure given by Eq. (5) can vary from Laplacian- to Gaussian-shaped as we vary $\alpha$. The matrices for the two similarity measures generated from miniImagenet classes are also shown in Figure 4.
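For concreteness, the two measures can be computed directly from semantic class embeddings; below is a small sketch in which the embedding dimension and the $\sigma$, $\alpha$ values are placeholders:

```python
import numpy as np

def k1(psi_c, psi_cp, sigma=1.0):
    # RBF similarity of Eq. (4) on semantic class embeddings
    return np.exp(-np.sum((psi_c - psi_cp) ** 2) / sigma ** 2)

def k2(psi_c, psi_cp, sigma=1.0, alpha=1.0):
    # Eq. (5): interpolates between Laplacian- and Gaussian-like profiles;
    # note k2 = 1 whenever k1 = 1, so same-class pairs keep the label 1
    k = k1(psi_c, psi_cp, sigma)
    return (1 + alpha) * k / (1 + alpha * k)

# toy Word2vec-like embeddings for two classes
a, b = np.random.randn(300), np.random.randn(300)
print(k1(a, b), k2(a, b, alpha=4.0))
```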

The objective function for the few-shot learning pipeline with the semantic similarity measure is defined as follows:

$$L_{rel}=\sum_{c\in\mathcal{C}^\ddagger}\sum_{c'\in\mathcal{C}^\ddagger}\Big(SI\big(\{\Phi_s\}_{s\in\mathcal{W}_c},\Phi^*_{q\in\mathcal{Q}:\ell(q)=c'};\mathcal{P}\big)-k(c,c')\Big)^2,$$
$$\text{s.t.}\quad \Phi_s=f(\mathbf{X}_s;\mathcal{F}) \;\text{ and }\; \Phi^*_q=f(\mathbf{X}^*_q;\mathcal{F}), \qquad (6)$$

where $k$ is the similarity measure selected according to Eq. (4) or (5).
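Relative learning then amounts to swapping the binary indicator $\delta(c-c')$ in Eq. (3) for the soft label $k(c,c')$; a self-contained sketch using the RBF measure of Eq. (4) (names and shapes are again our assumptions):

```python
import torch

def soft_relation_mse(scores, query_labels, support_classes, emb, sigma=1.0):
    # emb: (C, D) semantic embedding per base class; the targets are
    # the soft labels k1(c, c') of Eq. (4) instead of the 0/1 indicator
    d2 = ((emb[support_classes].unsqueeze(1) -
           emb[query_labels].unsqueeze(0)) ** 2).sum(-1)   # (L, Q) sq. distances
    targets = torch.exp(-d2 / sigma ** 2)                  # soft labels in (0, 1]
    return ((scores - targets) ** 2).sum()

# toy episode: 50 base classes with 300-dim embeddings, 5-way, 10 queries
emb = torch.randn(50, 300)
support_classes = torch.randperm(50)[:5]
query_labels = support_classes[torch.randint(0, 5, (10,))]
print(soft_relation_mse(torch.rand(5, 10), query_labels, support_classes, emb))
```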

4.2. Absolute Learning

In contrast to relative learning, which applies relative soft relation labels to supervise the relation network, absolute learning refers to the strategy in which the network learns predefined object information, e.g., class labels, attributes, or Word2vec embeddings.

Figure 4: Semantic soft labels obtained with Eq. (4) (1st row, σ increasing) and Eq. (5) (2nd row, α increasing) for miniImagenet classes, compared against binary labels.

Figure 5: The proposed pipeline for our absolute-relative relation learning. It consists of three blocks: (i) a feature encoder to extract the image representations, (ii) an absolute learning module to enhance the feature quality with auxiliary supervision, and (iii) a relative learning module to learn image relations based on multi-modal relation supervision. With our absolute-relative learning, we want to not only learn if two objects are from the same class, but also additionally learn how similar the two objects are (relative learning) and what the two objects are (absolute learning).

The motivation behind absolute learning is that most current few-shot learning pipelines use binary relation labels as supervision, which prevents the network from capturing object concepts. In other words, the network knows whether two objects are similar (or not) but it does not know what the objects are.

The pipeline for absolute learning is shown in Figure 3. In this paper, we apply an additional network branch following the feature encoder to learn the absolute object information.

Firstly, consider class prediction as an example. Once we obtain the second-order representation $\Phi_i$ for image $i$, we feed it into the object class predictor, denoted by $h$. The output $p^c_i$ of the class predictor is defined as follows:

$$p^c_i = h(\Phi_i;\mathcal{H}). \qquad (7)$$

Then we apply a cross-entropy loss function to train the predictor:

$$L_{abs} = -\sum_{i}^{N}\log\left(\frac{\exp(p_i[l^c_i])}{\sum_j \exp(p_i[j])}\right). \qquad (8)$$
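Eq. (8) is the standard summed cross-entropy over the class logits, so the absolute branch reduces to a predictor plus cross-entropy; the sketch below is illustrative, with assumed layer sizes:

```python
import torch
import torch.nn as nn

num_base_classes, feat_dim = 64, 1024        # illustrative sizes, not the paper's
h = nn.Linear(feat_dim, num_base_classes)    # object class predictor h(.; H)

def absolute_loss(phi, class_ids):
    # phi: (N, feat_dim) image representations; Eq. (8) is the summed
    # negative log-softmax, i.e., cross-entropy with reduction='sum'
    return nn.functional.cross_entropy(h(phi), class_ids, reduction='sum')

phi = torch.randn(8, feat_dim)
print(absolute_loss(phi, torch.randint(0, num_base_classes, (8,))))
```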

To predict the attribute and Word2vec embeddings, we use the MSE loss on the embedding vectors rather than the cross-entropy loss to train the absolute predictor.

This absolute learning module is somewhat similar to so-called self-supervised learning. However, we use a discriminator to recognize the different types of object annotations, while typical self-supervision recognizes the patterns of data augmentation. We believe our strategy helps refine the feature encoder to capture both the notion of similarity and concrete object concepts.

4.3. Absolute-relative Learning

Combining the proposed relative and absolute learning forms our final Absolute-relative Learning (ArL). Specifically, for this strategy we simultaneously train the relation network with relative object similarity labels and introduce an auxiliary task which learns specific object labels. The pipeline of ArL is shown in Figure 5, which highlights that the ArL model uses auxiliary semantic soft labels to train the relation network to capture more realistic image relations, while employing auxiliary predictor branches to infer different types of object information, thus refining the feature representations and the feature encoder.

We minimize the following objective for absolute-relative learning:

$$\underset{\mathcal{F},\mathcal{P},\mathcal{H}}{\arg\min}\; L_r + \beta L_{rel} + \gamma L_{abs}, \qquad (9)$$

where $L_r$ denotes the binary relation loss of Eq. (3), and $\beta$, $\gamma$ are hyper-parameters that control the contribution of each term to the gradient backpropagation.
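The combined objective of Eq. (9) is then simply a weighted sum of the three losses, backpropagated jointly; a minimal sketch with placeholder loss values and weights:

```python
import torch

# stand-ins for the per-episode losses from the sketches above
l_r, l_rel, l_abs = [torch.rand(1, requires_grad=True).sum() for _ in range(3)]

beta, gamma = 0.5, 0.5                 # hyper-parameters of Eq. (9); placeholders
total = l_r + beta * l_rel + gamma * l_abs
total.backward()                       # one backward pass trains F, P and H jointly
```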

In what follows, we choose different relative-learning modules (attribute-based or Word2vec-based soft labels) and absolute-learning branches (class, attribute and Word2vec predictors) to form various absolute-relative learning pipelines.

5. Experiments

Below, we demonstrate the usefulness of our approach by evaluating it on the miniImagenet [29], fine-grained CUB-200-2011 [30] and Flower102 [21] datasets.

We employ the recent Relation Net [28], SoSN [33] and SalNet [34] as our baseline models to evaluate our relative learning and absolute learning modules, as well as the combined absolute-relative learning approach.


Table 1: Evaluations on the miniImagenet dataset (5-way acc. given). We evaluate the relative and absolute learning modules on RelationNet [28], SoSN [33], and SalNet [34].

Model                   Backbone    1-shot                 5-shot
Matching Nets [29]      -           43.56 ± 0.84           55.31 ± 0.73
Meta Nets [20]          -           49.21 ± 0.96           -
Prototypical Net [27]   Conv-4-64   49.42 ± 0.78           68.20 ± 0.66
MAML [8]                Conv-4-64   48.70 ± 1.84           63.11 ± 0.92
RelationNet [28]        Conv-4-64   51.36 ± 0.82           65.32 ± 0.70
SoSN [33]               Conv-4-64   53.73 ± 0.83           68.58 ± 0.70
MAML++ [2]              Conv-4-64   52.15 ± 0.26           68.32 ± 0.44
MetaOpt [17]            Conv-4-64   52.87 ± 0.57           68.76 ± 0.48
SalNet [34]             Conv-4-64   57.45 ± 0.88           72.01 ± 0.67

Relative Learning
RelationNet - RL        Conv-4-64   53.20 ± 0.65 (↑1.84)   66.74 ± 0.45 (↑1.32)
SoSN - RL               Conv-4-64   55.49 ± 0.68 (↑1.76)   70.86 ± 0.44 (↑2.28)
SalNet - RL             Conv-4-64   58.67 ± 0.66 (↑1.22)   73.01 ± 0.46 (↑1.00)

Absolute Learning
RelationNet - AL        Conv-4-64   52.67 ± 0.67 (↑1.31)   66.97 ± 0.46 (↑1.65)
SoSN - AL               Conv-4-64   55.61 ± 0.67 (↑1.88)   71.03 ± 0.43 (↑2.45)
SalNet - AL             Conv-4-64   58.94 ± 0.68 (↑1.49)   73.12 ± 0.47 (↑1.11)

Absolute-relative Learning
RelationNet - ArL       Conv-4-64   53.46 ± 0.68 (↑2.10)   67.13 ± 0.43 (↑1.79)
SoSN - ArL              Conv-4-64   55.92 ± 0.65 (↑2.18)   71.52 ± 0.46 (↑2.94)
SalNet - ArL            Conv-4-64   59.12 ± 0.67 (↑1.67)   73.56 ± 0.45 (↑1.55)

The Adam solver is used for model training. We set the initial learning rate to 0.001 and decay it by 0.5 every 50000 iterations.
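This schedule maps onto a standard optimizer/scheduler pair; a sketch (the model here is a stand-in for the full pipeline):

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 5)   # stand-in for the full few-shot pipeline
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
# halve the learning rate every 50000 iterations, as stated above
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=50000, gamma=0.5)
# per training iteration: loss.backward(); optimizer.step(); scheduler.step()
```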

5.1. Datasets

Below, we describe our experimental setup, the standard benchmarks, the fine-grained datasets with semantic annotations, and our evaluations.
miniImagenet [29] consists of 60000 RGB images from 100 classes, each class containing 600 samples. We follow the standard protocol [29]: we use 80 classes for training and the remaining 20 classes for testing, and use images of size 84×84 for a fair comparison with other methods. For semantic annotations, we manually annotate 31 attributes for each class. We also leverage Word2vec embeddings extracted from GloVe as the class embeddings.

Table 2: Ablation studies re. the impact of absolute and relative learning modules on the miniImagenet dataset (5-way acc. given).

Model                                 1-shot   5-shot
Relative Learning
RelationNet-RL (w2v.)                 53.20    66.21
RelationNet-RL (att.)                 52.38    66.74
RelationNet-RL (bin. + att. + w2v.)   52.38    66.73
SoSN-RL (w2v.)                        54.31    69.64
SoSN-RL (att.)                        54.49    70.21
SoSN-RL (bin. + att. + w2v.)          55.49    70.86

Absolute Learning
RelationNet-AL (cls.)                 51.41    66.01
RelationNet-AL (att.)                 52.35    66.53
RelationNet-AL (w2v.)                 52.67    66.91
RelationNet-AL (cls. + att. + w2v.)   52.30    66.51
SoSN-AL (cls.)                        55.12    70.91
SoSN-AL (att.)                        55.61    71.03
SoSN-AL (w2v.)                        54.78    70.85
SoSN-AL (cls. + att. + w2v.)          55.40    71.02

Caltech-UCSD-Birds 200-2011 (CUB-200-2011) [30] has 11788 images of 200 bird species. 150 classes are randomly selected for training and the remaining 50 categories for testing; 312 attributes are provided for each class.
Flower102 [21] is a fine-grained category recognition dataset that contains 102 classes of various flowers. Each class consists of 40-258 images. We randomly select 80 classes for training and 22 classes for testing; 1024 attributes are provided for each class.

5.2. Performance Analysis

Relative Learning (RL). Table 1 (miniImagenet) compares our methods with the state-of-the-art models which do not use any semantic information, i.e., Matching Nets, Meta Nets, Prototypical Net, MAML, Relation Net, SoSN and SalNet. Furthermore, Table 2 (miniImagenet as an example) illustrates the semantically enhanced relation performance for Relation Net, SoSN and SalNet. The results indicate that the performance of few-shot similarity learning can be improved by employing the semantic relation labels at the training stage. For instance, SoSN with attribute soft labels (att.) achieves 0.6% and 1.7% improvements for 1- and 5-shot compared to the baseline (SoSN) in Table 1. The results on CUB-200-2011 and Flower102 indicate similar gains, as demonstrated in Table 3. We further compare the two similarity measures, i.e., Eq. (4) and Eq. (5), on miniImagenet in Table 4. A better performance is obtained with Eq. (5), which is consistent with our previous analysis regarding 'soft' similarity measures.

Moreover, we investigate how to simultaneously employ multiple types of relation labels for training.


Table 3: Evaluations on the CUB-200-2011 and Flower102 datasets (5-way acc. given).

                                CUB-200-2011                   Flower102
Model                           1-shot         5-shot          1-shot         5-shot
Prototypical Net [27]           37.42          51.57           62.81          82.11
Relation Net [28]               40.56          53.91           68.26          80.94
Relation Net-RL (att.)          42.15          54.45           68.76          81.57
Relation Net-RL (bin. + att.)   42.56 (↑2.00)  54.89 (↑0.98)   69.02 (↑0.72)  82.01 (↑1.07)
Relation Net-AL (cls.)          41.08          54.76           69.30          82.17
Relation Net-AL (att.)          42.06 (↑1.50)  55.12 (↑1.21)   69.76 (↑1.50)  82.31 (↑1.37)
Relation Net-AL (cls. + att.)   41.38          54.49           68.95          81.73
Relation Net-ArL                43.79 (↑3.23)  56.31 (↑2.40)   70.21 (↑1.95)  82.93 (↑1.99)
SoSN [33]                       46.72          60.34           71.90          84.87
SoSN-RL (att.)                  47.53          60.98           74.67          86.75
SoSN-RL (bin. + att.)           49.24 (↑2.52)  64.04 (↑3.70)   74.96 (↑3.06)  87.21 (↑2.34)
SoSN-AL (cls.)                  46.88          60.90           72.97          85.35
SoSN-AL (att.)                  48.85 (↑2.13)  63.64 (↑3.30)   74.31 (↑2.41)  86.97
SoSN-AL (cls. + att.)           48.70          61.77           74.10          87.01 (↑2.14)
SoSN-ArL                        49.58 (↑2.86)  64.56 (↑4.22)   75.37 (↑3.47)  87.23 (↑2.36)

Figure 6: The t-SNE visualization of different models on Flower102 for 1-shot learning: (a) vanilla SoSN, (b) SoSN + RL, (c) SoSN + AL, (d) SoSN + ArL.

To find the optimal relative-learning combinations, we set a group of hyper-parameters to re-weight the relation objective terms. The ablation studies for different types of relative learning modules are illustrated in Table 2, which shows that combining binary and soft relation labels (bin. + att. + w2v.) yields the best performance on miniImagenet for all baseline models, except for Relation Net, which achieves the best accuracy by using (w2v.) soft labels alone.

Absolute Learning (AL). Table 1 shows that the different absolute learning modules help improve the performance on miniImagenet. SoSN with the attribute predictor (SoSN-AL) achieves the best performance of 55.61 on 1-shot and 71.03 on 5-shot. We also present the ablation studies for absolute learning in Table 2, which shows that applying multiple absolute learning modules does not improve the accuracy further. Table 3 shows that the attribute predictor (att.) also works best among all variants on the CUB-200-2011 and Flower102 datasets. For instance, SoSN with the attribute predictor achieves 2.1% and 3.3% improvements on CUB-200-2011, and 2.4% and 2.1% improvements on Flower102, for 1- and 5-shot respectively.

Table 4: Ablation studies of the similarity measure functions on the miniImagenet dataset (5-way acc. given).

Model                   Measure   1-shot   5-shot
RelationNet+RL (att.)   Eq. 4     52.38    66.74
RelationNet+RL (att.)   Eq. 5     52.86    66.95
SoSN+RL (att.)          Eq. 4     54.49    70.21
SoSN+RL (att.)          Eq. 5     54.98    70.56

We note that the class predictor (cls.) does not work well on either of the two fine-grained classification datasets.

Absolute-relative Learning (ArL). Tables 1 and 3 show that combining AL and RL into ArL further boosts the performance on all datasets, by 0.5 to 1.0% on 1-shot and 0.1 to 0.5% on 5-shot. In Figure 6, we apply the t-SNE visualization to the 1-shot baseline, baseline-RL, baseline-AL and baseline-ArL models trained on Flower102, which shows that RL, AL and ArL help refine the class boundaries, boosting the few-shot learning performance.

6. Conclusions

In this paper, we demonstrate that the binary labels commonly used in few-shot learning cannot capture complex class relations well, leading to inferior results. To this end, we have introduced semantic annotations to aid the modeling of more realistic class relations during network training. Further, we have proposed a novel absolute-relative learning paradigm which combines similarity learning with concept learning. This surprisingly simple strategy appears to work well on all datasets, and it perhaps resembles the human learning process a bit more closely than mere similarity or concept learning does. In contrast to multi-modal learning, we only use semantic annotations as labels during training, and do not use them during testing. Our proposed approach achieves state-of-the-art performance on all few-shot learning protocols.


Acknowledgements. This research is supported in part by the Australian Research Council through the Australian Centre for Robotic Vision (CE140100016), Australian Research Council grants (DE140100180), and the China Scholarship Council (CSC Student ID 201603170283).

References

[1] Zeynep Akata, Florent Perronnin, Zaid Harchaoui, and Cordelia Schmid. Label-embedding for attribute-based classification. CVPR, pages 819-826, 2013.
[2] Antreas Antoniou, Harrison Edwards, and Amos Storkey. How to train your MAML. arXiv preprint arXiv:1810.09502, 2018.
[3] Andreas Argyriou, Theodoros Evgeniou, and Massimiliano Pontil. Multi-task feature learning. In Advances in Neural Information Processing Systems, pages 41-48, 2007.
[4] Evgeniy Bart and Shimon Ullman. Cross-generalization: Learning novel classes from a single example by feature replacement. CVPR, pages 672-679, 2005.
[5] Ben Bradshaw. Semantic based image retrieval: a probabilistic approach. In Proceedings of the Eighth ACM International Conference on Multimedia, pages 167-176. ACM, 2000.
[6] Li Fei-Fei, Rob Fergus, and Pietro Perona. One-shot learning of object categories. PAMI, 28(4):594-611, 2006.
[7] Michael Fink. Object classification from a single example utilizing class relevance metrics. NIPS, pages 449-456, 2005.
[8] Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In ICML, pages 1126-1135, 2017.
[9] Victor Garcia and Joan Bruna. Few-shot learning with graph neural networks. arXiv preprint arXiv:1711.04043, 2017.
[10] Spyros Gidaris, Andrei Bursuc, Nikos Komodakis, Patrick Pérez, and Matthieu Cord. Boosting few-shot visual learning with self-supervision. arXiv preprint arXiv:1906.05186, 2019.
[11] Spyros Gidaris and Nikos Komodakis. Generating classification weights with GNN denoising autoencoders for few-shot learning. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.
[12] Mengdi Huai, Chenglin Miao, Yaliang Li, Qiuling Suo, Lu Su, and Aidong Zhang. Metric learning from probabilistic labels. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 1541-1550. ACM, 2018.
[13] Alex Kendall, Yarin Gal, and Roberto Cipolla. Multi-task learning using uncertainty to weigh losses for scene geometry and semantics. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
[14] Jongmin Kim, Taesup Kim, Sungwoong Kim, and Chang D. Yoo. Edge-labeling graph neural network for few-shot learning. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.
[15] Gregory Koch, Richard Zemel, and Ruslan Salakhutdinov. Siamese neural networks for one-shot image recognition. In ICML Deep Learning Workshop, volume 2, 2015.
[16] Brenden M. Lake, Ruslan Salakhutdinov, Jason Gross, and Joshua B. Tenenbaum. One shot learning of simple visual concepts. CogSci, 2011.
[17] Kwonjoon Lee, Subhransu Maji, Avinash Ravichandran, and Stefano Soatto. Meta-learning with differentiable convex optimization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 10657-10665, 2019.
[18] Fei Fei Li, Rufin VanRullen, Christof Koch, and Pietro Perona. Rapid natural scene categorization in the near absence of attention. Proceedings of the National Academy of Sciences, 99(14):9596-9601, 2002.
[19] E. G. Miller, N. E. Matsakis, and P. A. Viola. Learning from one example through shared densities on transforms. CVPR, 1:464-471, 2000.
[20] Tsendsuren Munkhdalai and Hong Yu. Meta networks. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pages 2554-2563. JMLR.org, 2017.
[21] M-E. Nilsback and A. Zisserman. Automated flower classification over a large number of classes. In Proceedings of the Indian Conference on Computer Vision, Graphics and Image Processing, Dec 2008.
[22] Peng Peng, Raymond Chi-Wing Wong, and Phillp S Yu. Learning on probabilistic labels. In Proceedings of the 2014 SIAM International Conference on Data Mining, pages 307-315. SIAM, 2014.
[23] Jagath Chandana Rajapakse and Lipo Wang. Neural Information Processing: Research and Development. Springer-Verlag Berlin and Heidelberg GmbH & Co. KG, 2004.
[24] Bernardino Romera-Paredes and Philip Torr. An embarrassingly simple approach to zero-shot learning. ICML, pages 2152-2161, 2015.
[25] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. ImageNet large scale visual recognition challenge. IJCV, 115(3):211-252, 2015.
[26] Adam Santoro, David Raposo, David G Barrett, Mateusz Malinowski, Razvan Pascanu, Peter Battaglia, and Tim Lillicrap. A simple neural network module for relational reasoning. NIPS, 2017.
[27] Jake Snell, Kevin Swersky, and Richard Zemel. Prototypical networks for few-shot learning. In NIPS, pages 4077-4087, 2017.
[28] Flood Sung, Yongxin Yang, Li Zhang, Tao Xiang, Philip HS Torr, and Timothy M Hospedales. Learning to compare: Relation network for few-shot learning. CoRR:1711.06025, 2017.
[29] Oriol Vinyals, Charles Blundell, Tim Lillicrap, Daan Wierstra, et al. Matching networks for one shot learning. In NIPS, pages 3630-3638, 2016.
[30] C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie. The Caltech-UCSD Birds-200-2011 dataset. Technical Report CNS-TR-2011-001, California Institute of Technology, 2011.
[31] Hua Wang, Heng Huang, and Chris Ding. Image annotation using bi-relational graph of images and semantic labels. In CVPR 2011, pages 793-800. IEEE, 2011.
[32] Hongguang Zhang and Piotr Koniusz. Zero-shot kernel learning. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
[33] Hongguang Zhang and Piotr Koniusz. Power normalizing second-order similarity network for few-shot learning. In 2019 IEEE Winter Conference on Applications of Computer Vision (WACV), pages 1185-1193. IEEE, 2019.
[34] Hongguang Zhang, Jing Zhang, and Piotr Koniusz. Few-shot learning via saliency-guided hallucination of samples. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.
