

0162-8828 (c) 2019 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TPAMI.2019.2963387, IEEETransactions on Pattern Analysis and Machine Intelligence

IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. XX, NO. X, AUGUST 20XX 1

Direction Concentration Learning: Enhancing Congruency in Machine Learning

Yan Luo, Yongkang Wong, Member, IEEE, Mohan Kankanhalli, Fellow, IEEE, and Qi Zhao, Member, IEEE

Abstract—One of the well-known challenges in computer vision tasks is the visual diversity of images, which could result in an agreement or disagreement between the learned knowledge and the visual content exhibited by the current observation. In this work, we first define such an agreement in a concept learning process as congruency. Formally, given a particular task and a sufficiently large dataset, the congruency issue occurs in the learning process whereby the task-specific semantics in the training data are highly varying. We propose a Direction Concentration Learning (DCL) method to improve congruency in the learning process, where enhancing congruency influences the convergence path to be less circuitous. The experimental results show that the proposed DCL method generalizes to state-of-the-art models and optimizers, and improves the performance on the saliency prediction, continual learning, and classification tasks. Moreover, it helps mitigate the catastrophic forgetting problem in the continual learning task. The code is publicly available at https://github.com/luoyan407/congruency.

Index Terms—Optimization, Machine Learning, Computer Vision, Accumulated Gradient, Congruency.


1 INTRODUCTION

Deep learning has been receiving considerable attention due to its success in various computer vision tasks [4], [13], [16], [27] and challenges [6], [32]. To prevent model overfitting and enhance the generalization ability, a training process often sequentially updates the model with gradients w.r.t. a mini-batch of training samples, as opposed to using a larger batch [12]. Due to the complexity and diversity in the nature of image data and task-specific semantics, the discrepancy between the currently and previously observed mini-batches could result in a circuitous convergence path, which possibly hinders the convergence to a local minimum.

To better understand the circuitousness/straightforwardness of a learning process, we introduce congruency to quantify the agreement between the new information used for an update and the knowledge learned from previous iterations. The word “congruency” is borrowed from a psychology study [51] that inspects the influence of an object which is inconsistent with the scene in the visual attention perception task. In this work, we define congruency ν as the cosine similarity between the gradient g to be used for an update and a referential gradient ĝ that indicates a general descent direction resulting from previous updates, i.e.,

$$\nu = \cos \alpha(g, \hat{g}). \qquad (1)$$

(The detailed formulation is presented in Section 3.) Figure 1 presents an illustration of congruency in the saliency prediction task. Due to the similar scenes (i.e., dining) and similar fixations on faces and foods, the update of sample S2 (i.e., Δw_{S_2}) is congruent with Δw_{S_1}. In contrast, the scene and fixations in sample S3 are different from those in samples S1 and S2. This leads to a large angle (> 90°) between Δw_{S_3} and Δw_{S_2} (or Δw_{S_1}).

Congruency reflects the diversity of task-specific semantics in training samples (i.e., images and the corresponding ground-truths).

• Y. Luo and Q. Zhao are with the Department of Computer Science and Engineering, University of Minnesota, Minneapolis, MN, 55455. E-mail: [email protected] & [email protected]

• Y. Wong and M. Kankanhalli are with the School of Computing, National University of Singapore, Singapore, 117417. E-mail: [email protected] & [email protected]


Figure 1: An illustration of congruency in the saliency prediction task. Assuming training samples are provided in a sequential manner, an incongruency occurs since the food item is related to different saliency values across these samples. Here, S_j stands for sample j ∈ {1, 2, 3}, w_i is the weight at time step i, Δw_{S_j} is the weight update generated with S_j for w_i, and the arrows indicate updates for the model. Specifically, Δw_{S_j} = −η g_{S_j}, where η is the learning rate and g_{S_j} is the gradient w.r.t. S_j. The update of S2 (i.e., Δw_{S_2}) is congruent with Δw_{S_1}, whereas Δw_{S_3} is incongruent with Δw_{S_1} and Δw_{S_2}.

In the visual attention task, attention is explained by various hypotheses [2], [3], [50] and can be affected by many factors, such as bottom-up features, top-down feature guidance, scene structure, and meaning [55]. As a result, objects in the same category may exhibit disagreements with each other across various images in terms of attracting attention. Therefore, there is high variability in the mapping between visual appearance and the corresponding fixations. Another task that has a considerable amount of diversity is continual learning, which learns continually from a stream of data related to new concepts (i.e., unseen labels) [33]. The diversity of the data among multiple classification


subtasks may be so discrepant that learning from new data violates previously established knowledge (i.e., catastrophic forgetting) in the learning process. Moreover, congruency issues can also be found in the classification task. Compared to saliency prediction and continual learning, the source of diversity in the classification task is relatively simple, namely, the diverse visual appearances w.r.t. various labels in real-world images. In summary, saliency prediction, continual learning, and classification are challenging scenarios susceptible to the effects of congruency.

In machine learning, congruency can be considered as a factor that influences the convergence of optimization methods, such as stochastic gradient descent (SGD) [42], RMSProp [14], or Adam [23]. Without specific rectification, the diversity among training samples is implicitly and passively involved in a learning process and affects the descent direction in convergence. To understand the effects of congruency on convergence, we explicitly formulate a direction concentration learning (DCL) method by sensing and restricting the angle of deviation between an update gradient and a referential gradient that indicates the descent direction according to the previous updates. Inspired by Nesterov's accelerated gradient [37], we consider the accumulated gradient as the referential gradient in the proposed DCL method.

We comprehensively evaluate the proposed DCL method with various models and optimizers on the saliency prediction, continual learning, and classification tasks. The experimental results show that the constraints restricting the angular deviation between the gradient for an update and the accumulated referential gradient help the learning process converge efficiently, compared to approaches without such constraints. Furthermore, we present the congruency patterns to show how the task-specific semantics affect congruency in a learning process. Last but not least, our analysis shows that enhancing congruency in continual learning can improve backward transfer.

The main contributions in this work are as follows:

• We define congruency to quantify the agreement between new information and the learned knowledge in a learning process, which is useful for understanding model convergence in terms of tractability.

• We propose a direction concentration learning (DCL) method to enhance congruency so that the disagreement between new information and the learned knowledge can be alleviated. It also generally adapts to various optimizers (e.g., SGD, RMSProp, and Adam) and various tasks (e.g., saliency prediction, continual learning, and classification).

• The experimental results from the continual learning task demonstrate that enhancing congruency can improve backward transfer. Note that large negative backward transfer is known as catastrophic forgetting [33].

• A general method for analyzing congruency is presented, and it can be used with both conventional models and models with the proposed DCL method. Comprehensive analyses w.r.t. saliency prediction and classification show that our DCL method generally enhances the congruency of the corresponding learning processes.

The rest of the paper is organized as follows. We begin by highlighting related works in Section 2. Then, we formulate the problem of congruency and discuss its factors in Section 3. The proposed DCL method is introduced in Section 4. The experiments and analyses are provided in Sections 5 and 6, respectively. Section 7 concludes the paper.

2 RELATED WORKS

2.1 State-of-the-art Models for Classification

Convolutional networks (ConvNets) [13], [16], [27], [56] have exhibited their power in the classification task. AlexNet [27] is a typical ConvNet consisting of a series of convolutional, pooling, activation, and fully-connected layers; it achieved the best performance on ILSVRC 2012 [6]. Since then, there have been more and more attempts to delve into the architecture of ConvNets. He et al. proposed residual blocks to solve the vanishing gradient problem, and the resulting model, i.e., ResNet [13], achieved the best performance on ILSVRC 2015. Along a similar line to ResNet, ResNeXt [56] extends residual blocks to a multi-branch architecture, and DenseNet [16] establishes connections between each layer and later layers in a feed-forward fashion. Both models achieve desirable performance. Recently, Tan and Le [47] studied how network depth, width, and resolution influence classification performance and proposed EfficientNet, which achieves state-of-the-art performance on ImageNet. In this work, we use ResNet, ResNeXt, DenseNet, and EfficientNet in the image classification experiments.

Yang et al. [60] introduce a regularized feature selection framework for multi-task classification. Specifically, the trace norm of a low-rank matrix is used in the objective function to share common knowledge across multiple classification tasks. Congruency generally works with gradient-based optimization methods, whereas the trace norm works with a specific optimization method. Moreover, congruency measures the agreement (or disagreement) between new information learned from a sample and the established knowledge, whereas the trace norm is based on the weights of multiple classifiers and only measures the correlation between established knowledge w.r.t. different classification tasks.

2.2 Computational Modelling of Visual Attention

Saliency prediction models an attentional mechanism that focuses limited perceptual and cognitive resources on the most pertinent subset of the available sensory data. Itti et al. [19] implemented the first computational model to predict saliency maps by integrating bottom-up features. Recently, Huang et al. [17] proposed a data-driven DNN model, named SALICON, to model visual attention. Cornia et al. [5] propose a convolutional LSTM to iteratively refine the predictions, and Kummerer et al. [28] design a readout network that is fed with the output features of VGG [46] to improve saliency prediction. Yang et al. [59] introduce an end-to-end Dilated Inception Network (DINet) that captures multi-scale contextual features for saliency prediction and achieves state-of-the-art performance. In this work, we adopt the SALICON model and DINet in the saliency prediction experiments.

There are several insightful works [11], [51], [52] exploring the effects of congruency/incongruency in visual attention. In particular, according to perception experiments, Gordon finds that an object which is inconsistent with the scene, e.g., a live chicken standing on a kitchen table, has a significant influence on attentive allocation [11]. Underwood and Foulsham [51] find an unexpected interaction between saliency and negative congruency in the search task; that is, the congruency of the conspicuous object does not influence the delay in its fixation, but it is fixated earlier when the other object in the scene is incongruent. Furthermore, Underwood et al. [52] investigate whether the effects of semantic inconsistency appear in free viewing. In their studies, inconsistent


objects were fixated for significantly longer durations than consistent objects. These works inspire us to explore the congruency between the current and previous updates. In saliency prediction, negative congruency may result from the disagreement among training samples in terms of visual appearance and ground-truth.

2.3 Catastrophic Forgetting

The catastrophic forgetting problem has been extensively studied in [9], [10], [35], [39]. McCloskey and Cohen [35] study the problem that new learning may interfere catastrophically with old learning when models are trained sequentially. New learning may alter weights that are involved in representing old learning, and this may lead to catastrophic interference. Along the same line, Ratcliff [39] further investigates the causes of catastrophic forgetting, and two problems are observed: 1) sequential learning is prone to rapidly forgetting well-learned information as new information is learned; 2) discrimination between observed samples and unobserved samples either decreases or is non-monotonic as a function of learning. To address the catastrophic forgetting problem, several works [24], [33], [40] have been proposed that use episodic memory. Kirkpatrick et al. [24] propose an algorithm named elastic weight consolidation (EWC), which can adjust learning to minimize changes in parameters important for previously seen tasks. Moreover, Lopez-Paz and Ranzato [33] introduce the gradient episodic memory (GEM) method to alleviate the catastrophic forgetting problem. However, incongruency can still exist in the training process of GEM.

3 CONGRUENCY IN MACHINE LEARNING

3.1 Problem Statement

We first review the general goal in machine learning. Without loss of generality, given a training set D = {(I_i, y_i)}_{i=1}^n, where a pair (I_i, y_i) represents a training sample composed of an image I_i ∈ R^{N_I} (N_I is the dimension of images) and the corresponding ground-truth y_i ∈ Y, the goal is to learn a model f : R^{N_I} → Y. Specifically, a Deep Neural Network (DNN) model has a trunk net to generate discriminative features x_i ∈ X and a classifier f_w : X → Y to fulfill the task, where w is the weights of the classifier. Note that we consider the DNN as a classifier as a whole, whose input is raw RGB images.

To accomplish the learning process, the conventional approach is to first specify and initialize a model. Next, the empirical risk minimization (ERM) principle [53] is employed to find a desirable w w.r.t. f by minimizing a loss function ℓ : Y × Y → [0, ∞) penalizing prediction errors, i.e.,

$$\min_{w} \ \frac{1}{|D|} \sum_{(x_i, y_i) \in D} \ell(f_w(x_i), y_i).$$

At time step k, the gradient computed from the loss is used to update the model, i.e., w_{k+1} := w_k + Δw_k, where Δw_k is an update as well as a function of the gradient g(w_k; x_k, y_k) = ∇_{w_k} ℓ(f_{w_k}(x_k), y_k). Optimizers, such as SGD [42], RMSProp [14], or Adam [23], determine Δw_k(g(w_k; x_k, y_k)). Without loss of generality, we assume the optimizer is SGD in the following for convenience.
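For concreteness, a minimal PyTorch-style sketch of one such SGD update (sgd_step is our illustrative helper; model and loss_fn are placeholders, not the paper's code):

    import torch

    def sgd_step(model, loss_fn, x, y, lr=0.1):
        # One plain SGD update: w_{k+1} = w_k - lr * g, following Section 3.1.
        loss = loss_fn(model(x), y)
        model.zero_grad()
        loss.backward()                      # g = gradient of the loss w.r.t. w
        with torch.no_grad():
            for p in model.parameters():
                if p.grad is not None:
                    p -= lr * p.grad         # delta_w = -lr * g
        return loss.item()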

There exist two challenges w.r.t. congruency for practical use. First, due to the dynamic nature of the learning process, how do we find a stable referential direction that can quantify the agreement between the current and previous updates? Second, how do we guarantee that the referential direction is beneficial for searching for a local minimum?

As the gradient at a training step implies the direction towards a local minimum given the currently observed mini-batch, the accumulation of all previous gradients provides an overall direction towards a local minimum. Hence, it provides a good referential direction to measure the agreement between a specific update and its previous updates. We denote the accumulated gradient as

$$\hat{g}_k|_{w_m} = \sum_{i=m}^{k} g_i, \qquad (2)$$

where w_m is the weights learned at time step m and ĝ_k|_{w_m} indicates that the accumulation starts from w_m at time step k. If there is no explicit w_m indicated, ĝ_k = ĝ_k|_{w_1}. Figure 2 shows an example of the accumulated gradient, where the gradient of S3 deviates from the accumulated gradient of S1 and S2. This also elicits our solution to measure congruency in a training process.

3.2 Definition

Congruency ν is a metric to measure the agreement of updates in a training process. In general, it is directly related to the angle between the gradient for an update and the accumulated gradient, i.e., α(ĝ_{k−1}|_{w_m}, g_k) ∈ [0, π]. A smaller angle indicates higher congruency. Practically, we use cosine similarity to approximate the angle for computational simplicity. Mathematically, at time step k, ν_k can be defined as

$$\nu_k|_{w_m} = \cos \alpha(g_k, \hat{g}_{k-1}|_{w_m}) = \frac{\hat{g}_{k-1}|_{w_m}^{\top} g_k}{\|\hat{g}_{k-1}|_{w_m}\| \, \|g_k\|}, \quad m \le k, \qquad (3)$$

where w_m is the weight learned at time step m and taken as a reference point in weight space, and α(g_k, ĝ_{k−1}|_{w_m}) is the angle between g_k and ĝ_{k−1}|_{w_m}. Based on ν_k|_{w_m}, the congruency of a training process that starts from w_j and learns w_n can be defined as

$$\nu_{w_j \to w_n}|_{w_m} = \frac{1}{n - j + 1} \sum_{i=j}^{n} \nu_i|_{w_m}, \quad m \le j < n. \qquad (4)$$

Since the concept of congruency is built upon cosine similarity, ν_k|_{w_m} ranges over [−1, 1]. Another advantage of using cosine similarity is tractability. The gradient computed from the loss is considered as a vector in weight space. Hence, cosine similarity can take any pair of gradients, such as the accumulated gradient and the gradient computed by a training sample, or two gradients computed by two respective training samples.
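As a minimal sketch of Eqs. (2)-(4) (the helper name congruency is ours; gradients are assumed to be available as tensors):

    import torch

    def congruency(g, g_acc, eps=1e-12):
        # nu = cosine similarity between the current gradient g and the
        # accumulated gradient g_acc, as in Eq. (3).
        g, g_acc = g.flatten(), g_acc.flatten()
        return torch.dot(g_acc, g) / (g_acc.norm() * g.norm() + eps)

    # Accumulate per Eq. (2) with g_acc += g after every step; the congruency
    # of a whole training stretch, Eq. (4), is the mean of the per-step values.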

3.3 Task-specific Factors

Congruency is semantics-aware, as it is based on gradients that are computed with the images and the semantic ground-truth, such as class labels in the classification task or human fixations in the saliency prediction task. Therefore, congruency reflects the task-specific semantics. We discuss congruency task by task in the following.

Saliency Prediction. Visual attention is attracted to visually salient stimuli and is affected by many factors, such as scale, spatial bias, context and scene composition, oculomotor constraints, and so on. These factors result in high variability of fixations across persons. The variability of visual semantics implies that objects of the same class in two images may have different saliency levels, i.e., one object is predicted as a salient object while another object of the same class is not. In this sense, negative congruency in learning for saliency prediction may result from both feature-level and label-level disagreement across images.

Continual Learning. In the continual learning setting [24], [33], [40], a classification model is learned with past observed classes



Figure 2: An illustration of model training with the proposed DCL module. Here, 3 samples are observed in a sequential manner. The gradient generated by S3 is expected to be different from the gradients generated by S1 and S2. Hence, to tackle the expected violation between the update −η g_{S_3} and the accumulated update −η Σ_{i=1}^{2} g_{S_i}, the proposed DCL method finds a corrected update −η g̃_{S_3} (the pink arrow) by solving the quadratic programming problem (5). In this way, the angle α̃ between −η g̃_{S_3} and −η Σ_{i=1}^{2} g_{S_i} (the blue arrow) is guaranteed to be equal to or less than α. Note that the gradient descent process with or without the proposed DCL module is identical in the test phase.

and samples. New samples w.r.t. the unobserved classes may be distinct from previously seen samples in terms of both visual appearance and label. This leads to negative congruency in learning.

Classification. For classification, the class labels are usually deterministic to humans. The factors that cause negative congruency in learning lie in visual appearances. Due to the variability of real-world images, the visual appearance of samples from the same class may be very different across images.

4 METHODOLOGY

In this section, we first give an overview of the proposed DCL method. Then, we introduce its formulation and properties in detail. Finally, we discuss the lower bound of congruency with gradient descent methods. For simplicity, we assume it is at time step k and omit the subscript k in the following formulations unless we explicitly indicate it.

4.1 Overview

Figure 2 demonstrates the basic idea of the proposed DCL method. Given a training sample (I, y), where I is an image and y is the ground-truth, the corresponding features x are first generated from the sample before being passed to the classifier to compute the prediction ŷ = f_w(x). Conventionally, the derivatives g of the loss ℓ(ŷ, y) are computed to determine the update Δw by an optimizer, which back-propagates the error layer by layer. In the proposed DCL method, g is taken to estimate a corrected gradient g̃ that is congruent with previous updates. For example, as shown in Figure 2, the gradient of S3 is expected to have a large deviation angle α from the accumulated anti-gradient −Σ_{i=1}^{2} g_{S_i}, because S1 and S2 share similar visual appearance but S3 is different from them. The proposed DCL method aims to estimate a corrected g̃ which has a smaller deviation angle α̃ to −Σ_{i=1}^{2} g_{S_i}.

4.2 Direction Concentration Learning

The core idea of the proposed DCL method is to concentrate the current update on a certain search direction. The accumulated gradient ĝ is the direction voted by previous updates, which provides information towards the possible local minimum. Ideally, according to the definition of congruency, i.e., Eqs. (3) and (4), cosine similarity should be considered in the optimization. However, minimizing cosine similarity with constraints is complicated. Therefore, similar to GEM [33], we adopt an alternative that optimizes the inner product instead of the cosine similarity. According to Eq. (3), ⟨g_1, g_2⟩ ≥ 0 indicates that the angle between the two vectors is less than or equal to 90°.

As shown in Figure 2, the proposed DCL method uses the accumulated gradient as a referential direction to compute a corrected gradient g̃, i.e.,

$$\min_{\tilde{g}} \ \frac{1}{2} \|g - \tilde{g}\|_2^2 \quad \text{s.t.} \ \langle -\hat{g}_{r_i}, -\tilde{g} \rangle \ge 0, \ 1 \le i \le N_r, \qquad (5)$$

where r_i is a reference point in weight space, ĝ_{r_i} is the accumulated gradient whose accumulation starts from r_i, and N_r is the number of reference points. The accumulated gradient ĝ_{r_i} indicates that the accumulation runs from the reference r_i to the current weights w. The proposed DCL method can take N_r points as the references {r_i | 1 ≤ i ≤ N_r}. Assume that the weights at time step t are taken as the reference r_i, i.e., r_i = w_t; we denote sub(·) as a function that finds the time index of a point in weight space. For example, with t = sub(r_i) = sub(w_t), we can compute the accumulated gradient ĝ_{r_i} = Σ_{j=sub(r_i)} g_j. On the other hand, the function ½‖g − g̃‖²₂ is widely used in gradient-based methods [15], [25], [33], [44], [48] and forces g̃ to be as close to g as possible in Euclidean space. The constraints ⟨−ĝ_{r_i}, −g̃⟩ ≥ 0 guarantee that the gradient used for an update does not substantially deviate from the accumulated gradient.

In practice, instead of directly computing ĝ_{r_i} by its definition (2), we compute it from the difference between the current point w and the reference point r_i, i.e., w − r_i = −η Σ_{j=sub(r_i)} g_j, which points along −ĝ_{r_i}. Hence, the constraints can be written in matrix form as

$$A(-\tilde{g}) = -\begin{bmatrix} (w - r_1)^{\top} \\ (w - r_2)^{\top} \\ \vdots \\ (w - r_{N_r})^{\top} \end{bmatrix} \tilde{g} \ge 0, \qquad (6)$$

where the i-th row of A is (w − r_i)^⊤.

Figure 3 demonstrates the effect of the constraints in the optimization.



Figure 3: An illustration of DCL constraints with two reference points r_0 = w_0 and r_1 = w_1. ĝ_{r_0} is the pink arrow while ĝ_{r_1} is the green one. The colored dashed lines indicate the borders of the feasible regions with regard to −ĝ_{r_i}, i ∈ {0, 1}, since Constraint (6) forces −η g̃_k to have an angle which is smaller than or equal to 90° w.r.t. ĝ_{r_0} and ĝ_{r_1}.

The dashed line in the same color indicates the border of the feasible region with regard to −ĝ_{r_i}, i ∈ {0, 1}, as Constraint (6) forces g̃ to form an angle smaller than or equal to 90° with each ĝ_{r_i}. Due to the two references in this example, the intersection of the two feasible regions, i.e., the shaded region, is the feasible region for the optimization. Note that an accumulated gradient determines a half-plane (more generally, a half-space) as the feasible region, instead of the full plane as in the conventional gradient descent case.

The optimization problem (5) is a classic quadratic programming problem, and we can easily solve it with off-the-shelf solvers like quadprog (https://github.com/rmcgibbo/quadprog) or CVXOPT (https://cvxopt.org/). However, since the size of g̃ can be sufficiently large, a straightforward solution may be computationally expensive in terms of both time and storage. As introduced by Dorn [7], we apply a primal-dual method for quadratic programs to solve it efficiently.

A general quadratic problem can be formulated as follows:

$$\min_{z} \ \frac{1}{2} z^{\top} C z + q^{\top} z \quad \text{s.t.} \ Bz \ge b, \qquad (7)$$

and the corresponding dual problem to Problem (7) is

$$\min_{u,v} \ \frac{1}{2} u^{\top} C u + b^{\top} v \quad \text{s.t.} \ B^{\top} v - C u = q, \ v \ge 0. \qquad (8)$$

Dorn provides the proof of the connection between Problem (7) and Problem (8).

Theorem 4.1 (Duality). If z = z* is a solution to Problem (7), then a solution (u, v) = (u*, v*) to Problem (8) exists. Conversely, if a solution (u, v) = (u*, v*) to Problem (8) exists, then a solution to Problem (7) which satisfies Cz = Cu* also exists.

Due to the equality constraint B^⊤v − Cu = q, assuming C is full rank, we can plug u = C^{−1}(B^⊤v − q) back into the objective function to further simplify Problem (8), i.e.,

$$\min_{v} \ \frac{1}{2} v^{\top} B (C^{-1})^{\top} B^{\top} v + (b - B C^{-1} q)^{\top} v \quad \text{s.t.} \ v \ge 0. \qquad (9)$$

It now turns out to be a quadratic problem w.r.t. v only.

The DCL quadratic problem can be solved by the aforementioned primal-dual method. Specifically, ‖g − g̃‖²₂ = (g − g̃)^⊤(g − g̃) = g̃^⊤g̃ − 2g^⊤g̃ + g^⊤g. By omitting the constant term g^⊤g, the objective takes the quadratic form ½ g̃^⊤g̃ − g^⊤g̃. Since we know the primal problem (7) can be converted to its dual problem (8), the related coefficient matrices/vectors are easily determined by

$$C = I, \quad B = -A, \quad b = 0, \quad q = -g,$$

where I is the identity matrix. With these coefficients at hand, we have the corresponding dual problem

$$\min_{v} \ \frac{1}{2} v^{\top} A A^{\top} v - g^{\top} A^{\top} v \quad \text{s.t.} \ v \ge 0. \qquad (10)$$

By solving (10), we obtain v*. On the other hand, Cg̃ = Cu* with C = I, so we obtain the solution g̃* by

$$\tilde{g}^* = C u^* = B^{\top} v^* - q = -A^{\top} v^* + g. \qquad (11)$$

Note that g, u ∈ R^p, v ∈ R^{N_r}, A ∈ R^{N_r×p}, and b ∈ R^{N_r}, where p is the size of w. If taking the fully-connected layer of ResNet as w, p = 2048. In contrast with p, N_r is usually small, i.e., 1, 2, or 3. As N_r becomes larger, the possibility increases that the constraints are inconsistent. Thus, N_r ≪ p. This implies that solving Problem (10) in R^{N_r} is more efficient than solving Problem (5) in R^p.
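A minimal sketch of this dual route with the quadprog package (the helper name dcl_correct and the small ridge term added for numerical positive definiteness are our assumptions, not the paper's released implementation):

    import numpy as np
    import quadprog

    def dcl_correct(g, w, refs, eps=1e-6):
        # Rows of A are (w - r_i)^T; solve the dual QP (10),
        #   min_v 0.5 v^T A A^T v - g^T A^T v  s.t. v >= 0,
        # then recover the corrected gradient via Eq. (11).
        A = np.stack([w - r for r in refs])      # N_r x p
        G = A @ A.T + eps * np.eye(len(refs))    # keep G positive definite
        a = A @ g                                # quadprog minimizes 0.5 v^T G v - a^T v
        C = np.eye(len(refs))                    # imposes C^T v >= 0, i.e., v >= 0
        v = quadprog.solve_qp(G, a, C, np.zeros(len(refs)))[0]
        return g - A.T @ v                       # g_tilde = -A^T v* + g

With a single reference (N_r = 1), this reduces to projecting g onto the half-space {g̃ : (w − r)^⊤ g̃ ≤ 0} whenever the constraint is violated.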

4.3 Theoretical Lower Bound

Here, we discuss the congruency lower bound with gradient descent methods. First, we recall the theoretical characteristics of gradient descent methods.

Proposition 4.2 (Quadratic upper bound [36]). If the gradient of a function f : R^n → R is Lipschitz continuous with Lipschitz constant L, i.e., for any x, y ∈ R^n,

$$\|\nabla f(y) - \nabla f(x)\| \le L \|y - x\|, \qquad (12)$$

then

$$f(y) \le f(x) + \nabla f(x)^{\top} (y - x) + \frac{L}{2} \|y - x\|^2. \qquad (13)$$

On the other hand, there is a proven bound w.r.t. the loss.

Corollary 4.3 (The bound on the loss at one iteration [8], [49]). Let x_k be the k-th iterate of gradient descent and η_k ≥ 0 the k-th step size. If ∇f is L-Lipschitz continuous, then

$$f(x_{k+1}) \le f(x_k) - \eta_k \left(1 - \frac{L\eta_k}{2}\right) \|\nabla f(x_k)\|^2. \qquad (14)$$

By adding up a collection of such inequalities, we can move further along this line to obtain the following corollary.

Corollary 4.4. Let x_k be the k-th iterate of gradient descent and η_k ≥ 0 the k-th step size. If ∇f is L-Lipschitz continuous, then

$$f(x_k) \le f(x_0) - \sum_{i=0}^{k-1} \eta_i \left(1 - \frac{L\eta_i}{2}\right) \|\nabla f(x_i)\|^2. \qquad (15)$$


Theorem 4.5 (Congruency lower bound). Assume the gradient descent method uses a fixed step size η and the gradient of the loss function f : R^n → R is Lipschitz continuous with Lipschitz constant L. Then the congruency ν_k|_{x_0}, referring to the initial point x_0, at the k-th iteration has the following lower bound:

$$\nu_k|_{x_0} \ge \max\Bigg\{(1 - L\eta)\frac{\sum_{i=0}^{k-1}\|\nabla f(x_i)\|}{\|\nabla f(x_k)\|} - L\eta\,\frac{\sum_{i=0}^{k-1}\|\nabla f(x_i)\|\,\big\|\sum_{j=0}^{i-1} \nabla f(x_j)\big\|}{\|\nabla f(x_k)\|\,\big\|\sum_{i=0}^{k-1} \nabla f(x_i)\big\|},\; -1\Bigg\}. \qquad (16)$$

Proof: Given x_k and x_0, according to Proposition 4.2 we have

$$\nabla f(x_k)^{\top}(x_k - x_0) \le f(x_k) - f(x_0) + \frac{L}{2}\|x_k - x_0\|^2.$$

Since $x_k = x_0 - \eta \sum_{i=0}^{k-1} \nabla f(x_i)$ and $\nu_k|_{x_0} = (-\nabla f(x_k))^{\top}(-\sum_{i=0}^{k-1} \nabla f(x_i)) \,/\, (\|\nabla f(x_k)\| \, \|\sum_{i=0}^{k-1} \nabla f(x_i)\|)$, we have

$$\nabla f(x_k)^{\top}(x_k - x_0) = -\eta\,(-\nabla f(x_k))^{\top}\Big(-\sum_{i=0}^{k-1} \nabla f(x_i)\Big) = -\eta\,\|\nabla f(x_k)\|\,\Big\|\sum_{i=0}^{k-1} \nabla f(x_i)\Big\|\,\nu_k|_{x_0}.$$

Plugging this into the inequality yields

$$\nu_k|_{x_0} \ge \frac{1}{\eta}\cdot\frac{f(x_0) - f(x_k) - \frac{L\eta^2}{2}\big\|\sum_{i=0}^{k-1} \nabla f(x_i)\big\|^2}{\|\nabla f(x_k)\|\,\big\|\sum_{i=0}^{k-1} \nabla f(x_i)\big\|}.$$

According to Corollary 4.4, the inequality can be rewritten as

$$\nu_k|_{x_0} \ge \frac{(1 - \frac{L\eta}{2})\sum_{i=0}^{k-1}\|\nabla f(x_i)\|^2 - \frac{L\eta}{2}\big\|\sum_{i=0}^{k-1} \nabla f(x_i)\big\|^2}{\|\nabla f(x_k)\|\,\big\|\sum_{i=0}^{k-1} \nabla f(x_i)\big\|}. \qquad (17)$$

By polynomial expansion and the Cauchy-Schwarz inequality, we can expand the term $\|\sum_{i=0}^{k-1} \nabla f(x_i)\|^2$ as

$$\Big\|\sum_{i=0}^{k-1} \nabla f(x_i)\Big\|^2 = \Big\|\nabla f(x_{k-1}) + \sum_{i=0}^{k-2} \nabla f(x_i)\Big\|^2 \le \|\nabla f(x_{k-1})\|^2 + 2\|\nabla f(x_{k-1})\|\Big\|\sum_{i=0}^{k-2} \nabla f(x_i)\Big\| + \Big\|\sum_{i=0}^{k-2} \nabla f(x_i)\Big\|^2.$$

Recursively, $\|\sum_{i=0}^{k-2} \nabla f(x_i)\|^2$, $\|\sum_{i=0}^{k-3} \nabla f(x_i)\|^2$, and so on down to $\|\sum_{i=0}^{1} \nabla f(x_i)\|^2$ can be expanded in the same way, e.g.,

$$\Big\|\sum_{i=0}^{1} \nabla f(x_i)\Big\|^2 = \|\nabla f(x_1) + \nabla f(x_0)\|^2 \le \|\nabla f(x_1)\|^2 + 2\|\nabla f(x_1)\|\|\nabla f(x_0)\| + \|\nabla f(x_0)\|^2.$$

The above inequalities yield

$$\Big\|\sum_{i=0}^{k-1} \nabla f(x_i)\Big\|^2 \le \sum_{i=0}^{k-1}\|\nabla f(x_i)\|^2 + 2\sum_{i=0}^{k-1}\|\nabla f(x_i)\|\Big\|\sum_{j=0}^{i-1} \nabla f(x_j)\Big\|.$$

Plugging this into Inequality (17), we have

$$\nu_k|_{x_0} \ge (1 - L\eta)\frac{\sum_{i=0}^{k-1}\|\nabla f(x_i)\|^2}{\|\nabla f(x_k)\|\,\big\|\sum_{i=0}^{k-1} \nabla f(x_i)\big\|} - L\eta\,\frac{\sum_{i=0}^{k-1}\|\nabla f(x_i)\|\,\big\|\sum_{j=0}^{i-1} \nabla f(x_j)\big\|}{\|\nabla f(x_k)\|\,\big\|\sum_{i=0}^{k-1} \nabla f(x_i)\big\|}.$$

Figure 4: An illustration demonstrating the concept of the effective window. Given the spiral convergence path, −η ĝ₁₀|_{w_0} restricts the search direction so that the minimum (i.e., the red star) and w₁₁ are unreachable along that direction. In contrast, w₁₁ can be reached along the search direction of −η ĝ₁₀|_{w_7}. To adaptively yield appropriate accumulated gradients that converge to the minimum, we define an effective window to periodically update the reference.

Due to $\sum_{i=0}^{k-1}\|\nabla f(x_i)\| \,/\, \|\sum_{i=0}^{k-1} \nabla f(x_i)\| \ge 1$, the congruency lower bound can be further simplified as

$$\nu_k|_{x_0} \ge (1 - L\eta)\frac{\sum_{i=0}^{k-1}\|\nabla f(x_i)\|}{\|\nabla f(x_k)\|} - L\eta\,\frac{\sum_{i=0}^{k-1}\|\nabla f(x_i)\|\,\big\|\sum_{j=0}^{i-1} \nabla f(x_j)\big\|}{\|\nabla f(x_k)\|\,\big\|\sum_{i=0}^{k-1} \nabla f(x_i)\big\|}.$$

Combining this with the fact that ν_k|_{x_0} ≥ −1, we complete the proof.

Remark 4.6. Theorem 4.5 implies that when we apply a gradient descent method to search for a local minimum, the congruency lower bound at a certain iteration in the learning process is determined by the gradients at the current and previous iterations.

Remark 4.7. Theorem 4.5 implies that the lower bound of congruency with a small step size, i.e., η < 1/L, is tighter than that with a large step size, i.e., η ≥ 1/L. This is consistent with the fact that a large step size can lead to a zigzag convergence path. The negative lower bound of congruency when η ≥ 1/L indicates that a huge turnaround could possibly occur in the learning process.

4.4 Adaptivity to Learning

As the reference is used to compute the accumulated gradient for narrowing down the search direction, a desirable referential direction should orient towards a local minimum. Conversely, an inappropriate referential direction could mislead the training and slow down the convergence. Therefore, it is important to update the references to adapt to the target optimization problem.

In this work, we update the references within a short temporal window so as to yield a locally stable and reliable referential direction. For instance, Figure 4 shows an unfavorable case that takes w₀ as the reference, where the convergence path is spiral. Due to the circuitous manifold, w₀ results in a misleading direction −η ĝ₁₀|_{w_0}. In contrast, taking w₇ as the reference yields an appropriate search direction to reach w₁₁. Therefore, we introduce


(a) Convergence with GD at iteration 50. (b) Convergence with RMSProp at iteration 150. (c) Convergence with Adam at iteration 200. (d) Curves of z v.s. iteration with GD. (e) Curves of z v.s. iteration with RMSProp. (f) Curves of z v.s. iteration with Adam.

Figure 5: An example demonstrating the effect of the proposed DCL method on three optimizers, i.e., gradient descent (GD), RMSProp, and Adam. Given a problem z = f(x, y), we use these optimization algorithms to compute the local minima, i.e., the (x*, y*) that yield the minimal z*. In the experiment, except for the learning rate, the settings and hyperparameters are the same for ALGO and ALGO DCL, where ALGO = {GD, RMSProp, Adam}. The proposed DCL method encourages the convergence paths to be as straight as possible.

an “effective window” to allow the proposed DCL method to find an appropriate search direction. The effective window forces the proposed DCL method to only accumulate the gradients within the window. In Figure 4, the proposed DCL method with a small window size would converge, whereas one with a large window size would diverge. We denote the window size as β_w and the reference offset as β_o. When the time step t satisfies

$$t \bmod \beta_w = \beta_o, \qquad (18)$$

where mod is the modulo operator, the reset mechanism is triggered, i.e., the references are set over again: r_i ← w_t, 1 ≤ i ≤ N_r. β_o indicates the first reference weight point. Once the reset process starts, the proposed DCL method uses g, instead of g̃, for updates until all N_r references are reset.
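A sketch of this reset mechanism (our helper; it re-anchors all references at once, whereas the description above staggers the resets over the N_r references):

    def maybe_reset_references(t, w, refs, beta_w, beta_o):
        # Effective-window reset, Eq. (18): re-anchor the references at the
        # current weights whenever t mod beta_w == beta_o.
        if t % beta_w == beta_o:
            refs[:] = [w.copy() for _ in refs]   # r_i <- w_t for all i
        return refs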

4.5 Effect of DCL

To intuitively understand the effect of the proposed DCL method, we present visual comparisons of the convergence paths with three popular optimizers, i.e., SGD [42], RMSProp [14], and Adam [23], on a publicly available problem (https://github.com/Jaewan-Yun/optimizer-visualization).

In particular, given the problem z = f(x, y), we apply the three optimizers to compute a local minimum (x*, y*). Unlike image classification, the problem does not need a randomized data sequence as input, so there is no stochastic process. For a fair comparison, except for the learning rate, we keep the settings and hyperparameters the same between ALGO and ALGO DCL, where ALGO = {GD, RMSProp, Adam} and GD stands for gradient descent. The convergence paths w.r.t. the optimization algorithms are shown in Figure 5(a)-(c), while the corresponding z v.s. iteration curves are plotted in Figure 5(d)-(f).

We can see that all the baseline curves are circuitous, i.e., they make a sharp turn at the ridge region between two local minima. Moreover, different learning rates lead to different local minima. This implies that the training process in this case is influenceable and fickle in terms of the direction of convergence. The proposed DCL method noticeably improves the convergence direction by choosing a relatively straightforward path with all three optimization algorithms. Note that, as the objective function (5) implies, if we do not take any accumulated gradients (i.e., no constraints), or take the gradient for the coming update as the accumulated gradient (i.e., ĝ_{r_i} = g), the proposed DCL method reduces to the baseline (i.e., g̃ = g).
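As a toy version of this comparison (our own sketch, not the referenced visualization code), plain GD with a single reference admits a closed-form DCL correction, since projecting onto one half-space needs no QP solver:

    import numpy as np

    def gd_dcl_2d(grad_f, w0, lr=0.01, steps=200, beta_w=50):
        # Plain GD on a 2D problem with a single-reference DCL correction.
        # With one reference, QP (5) reduces to projecting g onto the
        # half-space {g_t : (w - r)^T g_t <= 0} whenever it is violated.
        w = np.asarray(w0, dtype=float).copy()
        r = w.copy()
        for t in range(1, steps + 1):
            g = grad_f(w)
            d = w - r                           # = -eta * accumulated gradient
            if d @ g > 0:                       # constraint (6) violated
                g = g - (d @ g) / (d @ d) * d   # closed-form version of Eq. (11)
            w = w - lr * g
            if t % beta_w == 0:                 # effective-window reset, Eq. (18)
                r = w.copy()
        return w

For instance, gd_dcl_2d(lambda w: 2 * w, [3.0, -2.0]) descends the quadratic bowl f(w) = ‖w‖².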

4.6 DCL in Continual Learning

In the previous subsections, we introduced the proposed DCL method for mini-batch learning. By its very nature, it can also work in a continual learning manner. GEM [33] is a recent method proposed for continual learning. The objective function of GEM is the same as that of the proposed DCL method, whereas the constraints of GEM and of the proposed DCL method are devised for their respective purposes. To apply the proposed DCL method in continual learning, we can merge the constraints of the proposed method with those of GEM. Hence, we have a new A as follows:

$$A = \begin{bmatrix} (w - r_1)^{\top} \\ \vdots \\ (w - r_{N_r})^{\top} \\ -g(x_{S_1}, y_{S_1})^{\top} \\ \vdots \\ -g(x_{S_{N_m}}, y_{S_{N_m}})^{\top} \end{bmatrix}, \quad S_i \in \mathcal{M}, \qquad (19)$$


Figure 6: An illustration demonstrating the difference between DCL (left) and GEM [33] (right). The search direction in DCL is determined by the accumulated gradient, while the adjusted gradient (solid line) of GEM is optimized by avoiding the violation between the gradient (dashed line) and the memory samples' gradients (green lines). Since the weights are iteratively updated and the memory samples are preserved, the direction of the adjusted gradient w.r.t. the memory samples could be dynamically varying.

where M is the memory and N_m is the size of the memory. With the proposed DCL constraints, the corrected g̃ is forced to be consistent with both the accumulated gradients and the directions of the gradients generated by the samples in memory.
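A sketch of assembling this merged constraint matrix (grad_on_sample is a placeholder for a routine returning the flattened loss gradient on one memory sample):

    import numpy as np

    def build_constraints(w, refs, memory, grad_on_sample):
        # Stack the DCL rows (w - r_i)^T with the GEM rows -g(x_S, y_S)^T, Eq. (19).
        dcl_rows = [w - r for r in refs]
        gem_rows = [-grad_on_sample(x, y) for (x, y) in memory]
        return np.stack(dcl_rows + gem_rows)     # shape: (N_r + N_m) x p

The resulting A can then be handed to the same dual solver as in Section 4.2.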

4.7 Comparison with Memory-based Constraints

We now discuss the difference between the proposed DCL constraints and the memory-based constraints used in GEM [33].

There are two main differences between the DCL constraints and the GEM constraints. First, as shown in Figure 6, the descent direction in the proposed DCL method is regulated by the accumulated gradient, whereas the gradient for an update in GEM is regulated to avoid violations with the gradients of the memory samples (i.e., images and the corresponding ground-truths). Since the weights are iteratively updated and the memory samples are preserved, the gradients of the memory samples can change at each iteration, so the direction of the adjusted gradient can be dynamically varying. Second, the proposed DCL method only needs to memorize the references, whereas GEM memorizes the images and the corresponding ground-truths. The proposed DCL constraints are efficiently computed by a subtraction in Eq. (6), rather than by computing the corresponding gradients as in GEM.

Although the proposed DCL constraints differ from the GEM constraints in terms of definition, they are able to work with each other in continual learning. We dive into the details in the following experiment section. Moreover, GEM computes the gradients w.r.t. all the parameters of a DNN. This works in situations where the input image resolution is relatively small, e.g., 784 for MNIST [29] or 3072 for CIFAR-10/100 [26]. The networks used to classify these images, like MLPs and ResNet-18, have a small number of weights. However, the number of parameters in a DNN can be huge. For example, ResNeXt-29 (16 × 64) [56] has 68 million parameters. Although GEM applies a primal-dual method to reduce the computation in optimization, the overall computation is still considerably high. In this work, we instead compute the gradients w.r.t. the highest-level layer to generalize the proposed DCL method to any general DNN.
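For instance, restricting the computation to the classifier layer might look like the following sketch (assuming a torchvision-style model whose final layer is named fc; the names are illustrative):

    import torch

    def last_layer_gradient(model, loss):
        # Flatten the gradient of the loss w.r.t. the highest-level
        # (classifier) layer only, keeping the DCL QP small.
        params = list(model.fc.parameters())
        grads = torch.autograd.grad(loss, params, retain_graph=True)
        return torch.cat([g.flatten() for g in grads])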

5 EXPERIMENTS

5.1 Experimental Setup

To comprehensively evaluate the proposed DCL method, we conduct experiments on three tasks, i.e., saliency prediction, continual learning, and classification.

5.1.1 Datasets

For the saliency prediction task, we use SALICON [20] (the 2017 version), MIT1003 [22], and OSIE [58]. For the continual learning task, we follow the same experimental settings as GEM [33] and use MNIST Permutations (MNIST-P), MNIST Rotations (MNIST-R), and incremental CIFAR-100 (iCIFAR-100). For classification, we use CIFAR [26], Tiny ImageNet, and ImageNet [6].

5.1.2 Models

For saliency prediction, we adopt an improved SALICON saliency model [17] and DINet [59] as the baselines. Both baseline models take ResNet-50 [13] as the backbone architecture.

For continual learning, we adopt the same models used in GEM, i.e., a Multilayer Perceptron (MLP) and ResNet-18, as well as EfficientNet-B1 [47], as the backbone architectures for evaluation. EWC [24] and GEM are used for comparison.

For classification, we use the state-of-the-art models without any architecture modifications for a fair evaluation. ResNeXt [56] (i.e., ResNeXt-29), DenseNet [16] (i.e., DenseNet-100-12), and EfficientNet-B1 [47] are used in the evaluation on CIFAR-10 and CIFAR-100. ResNet (i.e., ResNet-101), DenseNet (i.e., DenseNet-169-32), and EfficientNet-B1 [47] are used in the experiments on Tiny ImageNet. ResNet (i.e., ResNet-34 and ResNet-50) is used in the experiments on ImageNet.

5.1.3 Notation

For convenience, we use the notation "model name + optimizer name + DCL-β_w-N_r" for the key experimental details in Tables 1, 2, 6, and 7. β_w = ∞ indicates that the references are never reset once their initialization is finished.

5.1.4 Evaluation Metrics

For saliency prediction, we report the performance using the commonly used metrics, namely area under curve (AUC) [1], [21], shuffled AUC (sAUC) [1], [45], normalized scanpath saliency (NSS) [18], [43], and correlation coefficient (CC) [38]. Human fixations are used to form the positive set, while points sampled from the saliency map form the negative set. With the two sets, an ROC curve of true positive rate v.s. false positive rate is plotted by thresholding over the saliency map. If the points are sampled from a uniform distribution, it is AUC; if the points are sampled from the human fixation points, it is sAUC. NSS averages the response values at human eye positions in a predicted saliency map which has been normalized to be zero-mean and with unit standard deviation. CC measures the strength of the linear correlation between a ground-truth map and a predicted saliency map. For continual learning, we use the same metrics used in GEM [33], i.e., accuracy, backward transfer (BWT), and forward transfer (FWT). For classification, we evaluate the proposed DCL method with the top-1 error rate on the CIFAR experiments, while both the top-1 and top-5 error rates are reported in the experiments on Tiny ImageNet and ImageNet.
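As a small sketch of the NSS computation described above (assuming a 2D saliency map and a boolean fixation mask; this is our illustrative helper, not a reference implementation):

    import numpy as np

    def nss(saliency_map, fixation_mask, eps=1e-12):
        # Z-score the predicted map, then average the normalized responses
        # at the human fixation locations.
        s = (saliency_map - saliency_map.mean()) / (saliency_map.std() + eps)
        return s[fixation_mask.astype(bool)].mean()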


TABLE 1: Saliency prediction performance of models trained on the SALICON 2017 training set and evaluated on the SALICON 2017 validation set. A higher score is better for all metrics. Each experiment is repeated 3 times and the mean and std of the scores are reported. We follow [59] to only use Adam as the optimizer for DINet.

Model                  | NSS           | sAUC          | AUC           | CC
ResNet-50 RMSP         | 1.7933±0.0083 | 0.8311±0.0017 | 0.8393±0.0039 | 0.8472±0.0048
ResNet-50 RMSP GEM     | 1.7522±0.0150 | 0.8267±0.0017 | 0.8341±0.0016 | 0.8291±0.0033
ResNet-50 RMSP DCL-∞-1 | 1.8226±0.0014 | 0.8376±0.0017 | 0.8445±0.0016 | 0.8569±0.0032
ResNet-50 Adam         | 1.7978±0.0019 | 0.8328±0.0007 | 0.8405±0.0011 | 0.8495±0.0004
ResNet-50 Adam GEM     | 1.7962±0.0034 | 0.8344±0.0021 | 0.8399±0.0009 | 0.8494±0.0034
ResNet-50 Adam DCL-∞-1 | 1.8019±0.0024 | 0.8360±0.0023 | 0.8430±0.0023 | 0.8548±0.0038
DINet Adam [59]        | 1.8786±0.0063 | 0.8426±0.0008 | 0.8489±0.0008 | 0.8799±0.0010
DINet Adam GEM         | 1.8746±0.0067 | 0.8423±0.0014 | 0.8492±0.0012 | 0.8791±0.0030
DINet Adam DCL-500-1   | 1.8857±0.0006 | 0.8430±0.0002 | 0.8493±0.0002 | 0.8804±0.0009

TABLE 2: Saliency prediction performance of models trained on OSIE and tested on MIT1003. Each experiment is repeated 3 times and the mean and std of the scores are reported.

Model                  | NSS           | sAUC          | AUC           | CC
ResNet-50 RMSP         | 2.4047±0.0055 | 0.7612±0.0019 | 0.8455±0.0028 | 0.7595±0.0002
ResNet-50 RMSP GEM     | 2.3960±0.0057 | 0.7566±0.0045 | 0.8412±0.0055 | 0.7500±0.0037
ResNet-50 RMSP DCL-∞-1 | 2.4252±0.0053 | 0.7620±0.0018 | 0.8469±0.0027 | 0.7658±0.0016
ResNet-50 Adam         | 2.4064±0.0015 | 0.7597±0.0012 | 0.8429±0.0021 | 0.7618±0.0005
ResNet-50 Adam GEM     | 2.3685±0.0065 | 0.7594±0.0007 | 0.8427±0.0017 | 0.7524±0.0011
ResNet-50 Adam DCL-∞-1 | 2.4108±0.0063 | 0.7613±0.0007 | 0.8442±0.0008 | 0.7617±0.0007
DINet Adam             | 2.4406±0.0058 | 0.7570±0.0005 | 0.8442±0.0016 | 0.7534±0.0005
DINet Adam GEM         | 2.4456±0.0037 | 0.7571±0.0005 | 0.8432±0.0003 | 0.7540±0.0006
DINet Adam DCL-120-1   | 2.4566±0.0007 | 0.7611±0.0011 | 0.8476±0.0008 | 0.7597±0.0008

5.1.5 Experimental & Training Details

In the saliency prediction experiments, we use the Adam [23] and RMSProp (RMSP) [14] optimizers. With Adam, we use η = 0.0002 and weight decay 1e-5, while η = 0.0005 and weight decay 1e-5 are used with RMSP. The momentum is set to 0.9 for both Adam and RMSP. η is decayed over the epochs, i.e., η_{k+1} ← η_0 × 0.5^{k−1}, where k is the current epoch. The batch size is 8 by default. To fairly evaluate the performances of the models, we use cross-dataset validation, i.e., the models are trained on the SALICON 2017 training set and evaluated on the SALICON 2017 validation set, and trained on OSIE and evaluated on MIT1003.
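As an illustration of this halving schedule, a minimal PyTorch sketch is given below (the model, eta0, and num_epochs are placeholders; this is not the authors' training script):

```python
import torch

model = torch.nn.Conv2d(3, 1, kernel_size=1)   # placeholder model
eta0 = 0.0002                                  # initial learning rate (Adam setting)
optimizer = torch.optim.Adam(model.parameters(), lr=eta0, weight_decay=1e-5)

num_epochs = 10                                # placeholder
for k in range(1, num_epochs + 1):             # k: current epoch (1-indexed)
    for group in optimizer.param_groups:
        group['lr'] = eta0 * 0.5 ** (k - 1)    # eta_{k+1} <- eta_0 * 0.5^(k-1)
    # ... one epoch of training with `optimizer` ...
```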

We follow the experimental settings in [33] for continual learning. Specifically, MNIST-P and MNIST-R have 20 tasks and each task has 1000 examples from 10 different classes. On iCIFAR-100, there are 20 tasks and each task has 2500 examples from 5 different classes. For each task, the first 256 training samples are selected and stored as the memory on MNIST-P, MNIST-R, and iCIFAR-100. In this work, the GEM constraints are concatenated with the DCL constraints by Eq. (19), as sketched below. As different concepts are learned across the episodes, i.e., the tasks, the accumulation of gradients only takes place within each episode.
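A rough sketch of the concatenation step, assuming each constraint is a row vector and Eq. (19) amounts to stacking the two constraint sets before projecting the update (all names and dimensions are hypothetical):

```python
import numpy as np

# Hypothetical 2304-dim final-layer gradients, one constraint per row.
gem_refs = np.random.randn(1, 2304)    # gradient from the episodic memory (GEM)
dcl_refs = np.random.randn(1, 2304)    # accumulated-gradient reference (DCL)
constraints = np.vstack([gem_refs, dcl_refs])  # concatenation, cf. Eq. (19)
# The corrected update z is then required to satisfy constraints @ z >= 0.
```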

In the classification task, we evaluate the models with the SGD optimizer [42]. The hyperparameters are kept at their defaults, i.e., weight decay 5e-4, initial η = 0.1, and 300 total epochs. η is changed to 0.01 and 0.001 at epoch 150 and 225, respectively. For the Tiny ImageNet experiments, we train the models for 30 epochs with weight decay 1e-4 and initial η = 0.001. η is changed to 1e-4 and 1e-5 at epoch 11 and 21, respectively. The momentum is 0.9 by default. The batch size is 128 in the CIFAR experiments and 64 in the Tiny ImageNet experiments. In the ImageNet experiments, we use a batch size of 512 to train ResNet-50.
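The milestone schedule above maps directly onto PyTorch's built-in scheduler; a minimal sketch under the CIFAR settings (the model is a placeholder):

```python
import torch

model = torch.nn.Linear(512, 100)  # placeholder model
# CIFAR setting: SGD with momentum 0.9, weight decay 5e-4, initial lr 0.1
optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
                            momentum=0.9, weight_decay=5e-4)
# drop lr to 0.01 at epoch 150 and to 0.001 at epoch 225
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[150, 225], gamma=0.1)

for epoch in range(300):
    # ... train one epoch with `optimizer` ...
    scheduler.step()
```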

In addition, we present the performance of GEM for reference as well. Note that more samples in memory may lead to inconsistent constraints. We set the memory size to 1 and reset the memory at the beginning of each epoch, which is analogous to GEM for continual learning resetting the memory at the beginning of each episode. The implementations of this work are built upon PyTorch4 and the quadprog package is employed to solve the quadratic programming problems.
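To illustrate how quadprog is used in this kind of projection step, below is a minimal sketch in the spirit of the GEM/DCL constrained update (an assumption-laden sketch: it solves the small primal QP directly rather than the dual problem (8), and refs stands for whatever reference gradients are active):

```python
import numpy as np
import quadprog

def project_gradient(g, refs):
    """Find the closest (L2) update z to gradient g such that z does not
    conflict with any reference direction: refs @ z >= 0.
    quadprog solves: min 0.5 x'Gx - a'x  s.t.  C'x >= b."""
    n = g.shape[0]
    G = np.eye(n)                  # quadratic term, i.e. 0.5*||z||^2
    a = g.astype(np.float64)       # linear term -g'z => objective ~ 0.5*||z - g||^2
    C = refs.T.astype(np.float64)  # each column is one inequality constraint
    b = np.zeros(refs.shape[0])
    return quadprog.solve_qp(G, a, C, b)[0]

# toy usage: a gradient that conflicts with one stored reference direction
g = np.array([1.0, -1.0])
refs = np.array([[0.0, 1.0]])      # reference (e.g., accumulated) direction
print(project_gradient(g, refs))   # -> approximately [1., 0.]
```

Since the correction is applied only to the final layer (see Section 6.6), the QP stays small enough to solve at every iteration.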

5.2 Performance Evaluation

5.2.1 Saliency Prediction

Table 1 reports the mean and standard deviation (std) of the scores in NSS, sAUC, AUC, and CC over 3 runs on the SALICON 2017 validation set. We can see that the proposed DCL method overall improves the saliency prediction performance with both ResNet-50 and DINet over all the metrics. Moreover, the small stds w.r.t. the proposed DCL method show that the randomness caused by the stochastic process does not contribute much to the improvement. Table 2 shows that the proposed DCL method trained on OSIE consistently improves the saliency prediction performance on MIT1003.

Note that Adam and RMSP are different algorithms for computing effective step sizes based on the gradients. The consistency of the improvement with both optimizers shows that the proposed DCL method generally works with these optimizers.

5.2.2 Continual Learning

As introduced in Section 4, we apply the proposed DCL method to enhance the congruency of the learning process for continual learning.

4. https://github.com/pytorch/pytorch


TABLE 3: Performance on MNIST-R in the continual learning setting using SGD [42] as the optimizer. The reported accuracy is in percentage. MEM indicates that the constraints of GEM [33] are concatenated as Eq. (19) describes.

Method               Accuracy   BWT       FWT
EWC                  54.61      -0.2087   0.5574
GEM                  83.35      -0.0047   0.6521
MLP DCL-30-1 MEM     84.08      0.0094    0.6423
MLP DCL-40-1 MEM     84.02      0.0127    0.6351
MLP DCL-50-1 MEM     82.77      0.0238    0.6111

TABLE 4: Performance on MNIST-P in the continual learning setting using SGD as the optimizer.

Method               Accuracy   BWT       FWT
EWC                  59.31      -0.1960   -0.0075
GEM                  82.44      0.0224    -0.0095
MLP DCL-3-1 MEM      82.30      0.0248    -0.0038
MLP DCL-4-1 MEM      82.58      0.0402    -0.0092
MLP DCL-5-1 MEM      82.10      0.0464    -0.0095

TABLE 5: Performance on iCIFAR-100 in the continual learning setting using SGD as the optimizer. EffNet stands for EfficientNet [47].

Method                 Accuracy   BWT       FWT
EWC                    48.33      -0.1050   0.0216
iCARL                  51.56      -0.0848   0.0000
ResNet GEM             66.67      0.0001    0.0108
ResNet DCL-4-1 MEM     67.92      0.0063    0.0102
ResNet DCL-8-1 MEM     67.27      0.0104    0.0190
ResNet DCL-12-1 MEM    66.58      0.0089    0.0139
ResNet DCL-20-1 MEM    66.56      0.0030    0.0102
ResNet DCL-24-1 MEM    64.97      0.0082    0.0238
ResNet DCL-32-1 MEM    66.10      0.0305    0.0176
ResNet DCL-50-1 MEM    64.86      0.0244    0.0125
EffNet GEM             80.80      0.0318    -0.0050
EffNet DCL-4-1 MEM     81.55      0.0383    -0.0048
EffNet DCL-8-1 MEM     80.84      0.0367    0.0068
EffNet DCL-12-1 MEM    79.45      0.0322    0.0011
EffNet DCL-20-1 MEM    79.33      0.0316    -0.0095
EffNet DCL-24-1 MEM    79.05      0.0375    -0.0006
EffNet DCL-32-1 MEM    79.97      0.0452    -0.0145
EffNet DCL-50-1 MEM    77.87      0.0602    -0.0101

Specifically, following Eq. (19), we concatenate the DCL constraints with the GEM constraints [33]. As reported in Table 3, the proposed DCL method improves the classification accuracy by 0.7% on MNIST-R. Similarly, the proposed DCL method improves the classification accuracy on MNIST-P as well (see Table 4). The marginal improvement may result from the difference between MNIST-R and MNIST-P: digits with permuted pixels are harder to recognize than digits rotated by a fixed angle, which makes the accumulated gradient less informative in terms of leading to the solution. We observe that a shorter effective window size helps improve accuracy in the continual learning task. This is because the training process of continual learning is one-off, and a fast variation can be caused by the limited images with brand-new labels in each episode. The experiments on iCIFAR-100 in Table 5 confirm this pattern. The proposed DCL method with ResNet and βw = 4 improves the accuracy by 1.25% on iCIFAR-100.

TABLE 6: Top-1 error rate (in %) on CIFAR with various models.

Model                          CIFAR-10   CIFAR-100
ResNeXt-29 SGD                 3.53       17.30
ResNeXt-29 SGD GEM             7.70       32.70
ResNeXt-29 SGD DCL-∞-1         3.33       17.02
DenseNet-100-12 SGD            4.54       22.88
DenseNet-100-12 SGD GEM        6.92       33.72
DenseNet-100-12 SGD DCL-90-1   4.32       22.16
EfficientNet-B1 SGD [47]       1.91       11.81
EfficientNet-B1 SGD GEM        3.06       19.48
EfficientNet-B1 SGD DCL-5-1    1.79       11.65

There are two other metrics for continual learning, i.e., forward transfer (FWT) and backward transfer (BWT). FWT measures how learning a task helps learning future tasks. In particular, positive FWT is correlated with n-shot learning. Since the proposed DCL method utilizes the directional information of past updates, it has less influence on FWT. Hence, we focus on BWT. BWT is the influence that learning a task has on the performance on previous tasks. Positive BWT is correlated with congruency in the learning process, while large negative BWT is referred to as catastrophic forgetting. Tables 3 and 4 show that the proposed DCL method is useful in improving BWT on MNIST-R and MNIST-P. The BWT of GEM is negative (-0.0047) and the proposed DCL method improves it to 0.0238 on MNIST-R. Similarly, the BWT of GEM is 0.0224 and the proposed DCL method improves it to 0.0464 on MNIST-P. Likewise, in Table 5, the proposed DCL method with ResNet improves the BWT of GEM from 0.0001 to 0.0305, while the proposed DCL method with EfficientNet [47] improves BWT to 0.0602.
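For reference, the three metrics can be computed from the task-accuracy matrix following the definitions in GEM [33]; a minimal sketch (R and b are hypothetical inputs):

```python
import numpy as np

def continual_metrics(R, b):
    """Metrics from GEM [33]. R[i, j]: test accuracy on task j after
    training on task i (T x T matrix). b[j]: accuracy of a randomly
    initialized model on task j (baseline used by FWT)."""
    T = R.shape[0]
    acc = R[-1].mean()                                        # final accuracy
    bwt = np.mean([R[-1, j] - R[j, j] for j in range(T - 1)])  # backward transfer
    fwt = np.mean([R[j - 1, j] - b[j] for j in range(1, T)])   # forward transfer
    return acc, bwt, fwt
```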

5.2.3 Classification

Table 6 reports the top-1 error rates on CIFAR-10 and CIFAR-100 with ResNeXt, DenseNet, and EfficientNet. In all cases, the proposed DCL method outperforms the baseline, i.e., ResNeXt-29 SGD, DenseNet-100-12 SGD, and EfficientNet-B1 SGD. Specifically, the proposed DCL method with ResNeXt decreases the error rate by 0.2% on CIFAR-10 and by 0.28% on CIFAR-100, while the proposed DCL method with EfficientNet decreases the error rate by 0.12% on CIFAR-10 and by 0.16% on CIFAR-100. Similar improvements can be found in the experiments with DenseNet, which shows that the proposed DCL method generally works with various models. Moreover, it can be seen in Table 6 that GEM has a higher error rate than the baseline in the experiments with ResNeXt, DenseNet, and EfficientNet. Because of the dynamic update process in learning, the gradient of the samples in memory does not guarantee a direction that leads to the solution. The direction can even be worse, e.g., it is possible to move in the opposite direction to the solution.

A consistent improvement w.r.t. the proposed DCL method can be found in the experiments on Tiny ImageNet (see Table 7). The proposed DCL method decreases the top-1 error rate by 0.45% with ResNet, by 0.69% with DenseNet, and by 0.12% with EfficientNet. Also, the performance degradation caused by GEM [33] can be observed: the top-1 error rate of GEM with ResNet is increased by almost 4.44% compared to the baseline ResNet.

Table 8 reports the mean and std of the 1-crop validation error of ResNet-50 on ImageNet. Compared to Tiny ImageNet and CIFAR, ImageNet has more categories and higher-resolution


Figure 7: The congruencies (Cong.) generated by the given references (Ref.) and samples with the baseline ResNet-50 RMSP in Table 2. The cosine similarities (Sim.) between reference images and sample images are provided for comparison purposes. Source images and the corresponding ground truths, i.e., fixation maps, are displayed along with the congruencies. The first and second blocks are the results of a subset that contains persons in various scenes. The third block shows examples from the food subset. The rightmost block shows a subset with mixed image categories, i.e., objects of various categories in various scenes.

TABLE 7: Top-1 and top-5 error rate (in %) on the validation set of Tiny ImageNet with various models.

Model                          Top-1 error   Top-5 error
ResNet-101 SGD                 17.34         4.82
ResNet-101 SGD GEM             21.78         7.21
ResNet-101 SGD DCL-60-1        16.89         4.50
DenseNet-169-32 SGD            20.24         6.11
DenseNet-169-32 SGD GEM        26.81         9.43
DenseNet-169-32 SGD DCL-50-1   19.55         6.09
EfficientNet-B1 SGD            15.73         3.90
EfficientNet-B1 SGD GEM        28.74         11.31
EfficientNet-B1 SGD DCL-8-1    15.61         3.75

images. Given such difficulties, the proposed DCL method reduces the mean of the top-1 errors by 0.24% over three runs. In summary, the improvement gained by the proposed DCL method comes from the better solution found by solving the DCL quadratic programming problem (5).

6 ANALYSIS

In this section, we first validate the defined congruency through qualitative comparisons. Then, an ablation study w.r.t. β and Nr is presented. Moreover, we provide a congruency analysis of the training processes for the three tasks. In the end, the comparison between training from scratch and fine-tuning, as well as the computational cost, are provided.

6.1 Validity of Congruency Metric

In this subsection, we conduct a sanity check on the validity of the defined congruency. To do this, we consider a simple case where we directly take the gradients g_{S1} and g_{S2} of two samples S1 and S2 to compute the corresponding congruency, i.e.,

ν = g_{S1}^⊤ g_{S2} / (‖g_{S1}‖ ‖g_{S2}‖).

For comparison purposes, the cosine similarity Sim between the raw images S1 and S2 is also computed as

Sim = S1^⊤ S2 / (‖S1‖ ‖S2‖).

Note that congruency is semantics-aware, whereas the cosine similarity between the two raw images is semantics-blind. This is because the gradients are computed from the images and their semantic ground truth, e.g., the class label in the classification task or the human fixations in the saliency prediction task.
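A minimal PyTorch sketch of this sanity check (the model, loss, and sample names are placeholders; it simply computes the two cosine quantities defined above):

```python
import torch
import torch.nn.functional as F

def sample_gradient(model, loss_fn, x, y):
    """Flattened gradient of the loss on a single sample (x, y)."""
    model.zero_grad()
    loss_fn(model(x), y).backward()
    return torch.cat([p.grad.reshape(-1) for p in model.parameters()
                      if p.grad is not None])

def congruency(model, loss_fn, s1, s2):
    """nu = cos(g_S1, g_S2): semantics-aware, uses images AND ground truth."""
    g1 = sample_gradient(model, loss_fn, *s1)
    g2 = sample_gradient(model, loss_fn, *s2)
    return F.cosine_similarity(g1, g2, dim=0)

def image_similarity(x1, x2):
    """Sim = cos(S1, S2): semantics-blind, compares raw pixels only."""
    return F.cosine_similarity(x1.reshape(-1), x2.reshape(-1), dim=0)
```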

For the analysis in the saliency prediction task, we sample 3 subsets from SALICON: 20 training samples w.r.t. person, 20 training

TABLE 8: Top-1 and top-5 1-crop validation error (in %) on ImageNet with the SGD optimizer. βw = 5 and Nr = 1 are used for ResNet-50 DCL. Within the same experimental settings, ResNet-50 GEM does not converge in this experiment. The mean and std of the errors are computed over three runs.

Model                    Top-1 error   Top-5 error
ResNet-50 [13]           24.70         7.80
ResNet-50 (reproduced)   24.33±0.08    7.30±0.07
ResNet-50 DCL            24.09±0.03    7.23±0.02

samples w.r.t. food, and 20 training samples w.r.t. various scenes and categories. For the analysis in the classification task, 3 subsets were sampled from Tiny ImageNet, comprising 100 images of tabby cat and Egyptian cat to form an intra-similar-class subset, 100 images of tabby cat and German shepherd dog to form an inter-class subset, and 50 images from various classes to form a mixed subset. In this way, we can analyze the correlation between the samples in terms of congruency. With these subsets, we use the baselines, i.e., ResNet-50 for saliency prediction and ResNet-101 for classification, to yield the sample gradients without updating the model.

Figure 7 demonstrates the congruencies w.r.t. the references and various samples (image + fixation map). In contrast to the deterministic nature of the classification task, saliency is context-related and semantics-based. It implies that the same objects within two different scenarios may have different saliency labels. Hence, we select examples of the same/similar objects for this experiment. In Figure 7, the first and second blocks on the left are based on the person subset within various scenarios. The first block consists of images of persons and dining tables. Taking the first-row sample as reference, the sample in the second row has a higher congruency (0.4155) compared to the bottom-row sample (0.3699). Although the fixation maps of all the samples are different, the pizza in the second image is more similar to the reference image, whereas the food in the bottom sample is inconspicuous. In the second block, both the portrait of the fisherman (reference) and the portrait of the baseball player (second sample) are similar in terms of layout, compared to the persons in the dining room (third sample). Their fixation maps are similar as well.

The congruency of the reference and the second sample (0.6377)


Figure 8: The congruencies (Cong.) generated by the given references (Ref.) and samples with the baseline ResNet-101 SGD in Table 7. The images and their labels are displayed along with the congruencies. The cosine similarities (Sim.) between reference images and sample images are provided for comparison purposes. The first block shows the results of the intra-similar-class subset consisting of images of tabby cat and Egyptian cat. The middle block shows the results of the inter-class subset consisting of images of tabby cat and German shepherd dog. The value in brackets indicates the number of images. The bottom block shows the results for images of various labels.

is higher than that of the reference and the third sample (0.2731). In the third block, the reference image shows three hot dogs, and its fixation map is similar to that of the second sample. The two hot dog samples have similar visual appearance and layout of fixations, yielding a higher congruency (0.6696). In contrast, the third sample differs from the reference in terms of visual appearance and layout of fixations, yielding a lower congruency (0.5075). The rightmost block shows an interesting fact: two outdoor samples yield a positive congruency of 0.1667, whereas the outdoor reference and the indoor sample yield a negative congruency of −0.1965. One possible reason is that the fixation patterns differ between the reference and the bottom indoor sample. In addition, visual appearance factors such as illumination may be another cause of this discrepancy.

For classification, Figure 8 shows the congruencies w.r.t. the references and given samples in each subset. In all cases, we first observe that images with the same genuine class as the references yield high congruency, i.e., larger than 0.94 in all cases. This shows that the gradients of samples with the same label are similar in the direction of the updates. Another observation is that the congruencies of pairs with different labels are significantly smaller than those of the matched-label counterparts. In Figure 8, the congruencies of the reference (tabby cat) and the Egyptian cat images are below 0.03, while the congruencies of the reference and the German shepherd dog images are below 0.016 in the middle block. This demonstrates that the gradients of inter-class samples are nearly perpendicular to each other. The reference of class 'Dugong' has positive congruencies w.r.t. all the images that fall into the category of animal, except for the image of a chain, which falls into a non-animal category. Last but not least, given images with different labels, similar visual appearance leads to relatively higher congruency. For

example, the congruencies between tabby cat and Egyptian cat are overall higher than those between tabby cat and German shepherd dog. In summary, the labels are an important factor influencing the direction of the gradient in the classification task. Secondly, visual appearance is another factor for congruency.

In contrast with congruency, the cosine similarity between two raw images makes less sense in the context of a specific task. For example, two similar dining scenes in the first column of Figure 7 yield a negative cosine similarity of −0.0066 in the saliency prediction task. Similarly, the first two cat images in the first row of Figure 8, which belong to the same category, yield a negative cosine similarity of −0.0013. A negative cosine similarity between two images with the same or similar ground truth is counterintuitive. It results from the fact that the cosine similarity between two images only measures the difference between two sets of pixels and ignores the semantics associated with the pixels.

6.2 Ablation Study

In this subsection, we study the effects of the effective window size βw and the reference number Nr on the saliency prediction task (with SALICON) and the classification task (with Tiny ImageNet).

In the saliency prediction experiment, Figure 9a shows the curve of sAUC vs. βw based on DCL-βw-1, while Figure 9b shows the curve of sAUC vs. Nr based on DCL-∞-Nr. Note that, for the reference number study, the training process on SALICON consists of 12500 iterations, so βw ≥ 12500 is equivalent to βw = ∞, which means that the references are never reset in the whole learning process. It can be observed that different βw and Nr

yield relatively similar performance in sAUC. This aligns with the nature of saliency prediction, which maps features to the salient label and the non-salient label. The features w.r.t. the salient label


Figure 9: Ablation study w.r.t. effective window size βw and reference number Nr: (a) sAUC vs. βw and (b) sAUC vs. Nr on the SALICON validation set; (c) top-1 error vs. βw and (d) top-1 error vs. Nr on the Tiny ImageNet validation set. βw = ∞ in (b) and βw = 50 in (d).

are highly related to each other, so βw and Nr pervasively help the learning process make use of congruency.

In the classification experiment, Figure 9c shows the curve of top-1 error vs. βw based on DCL-βw-1. We can see that only a small βw range, i.e., between 20 and 70, yields relatively lower errors than the other βw values. On the other hand, Figure 9d shows that using only one reference is helpful in the learning process for classification. Different from the pattern shown in Figures 9a and 9b, where the curves are relatively flat, the pattern in Figures 9c and 9d implies that the gradients in the learning process for classification change dramatically in angle to satisfy the 200-way prediction. Hence, the learning process for classification does not prefer large βw and Nr.

In summary, the nature of the task should be taken into account when determining the values of βw and Nr. Both parameters can lead to significantly different performances if the task-specific semantics in the data are highly varying. Specifically, as Nr increases, the feasible region for searching for a local minimum possibly becomes narrower, as shown in Figure 3. If the local minimum is not in the narrowed feasible region, a large Nr could lead to slower convergence or even divergence.

6.3 Congruency Analysis

In this section, we focus on analyzing the patterns of congruency in saliency prediction, continual learning, and classification. For saliency prediction and classification, to study how the gradients of GEM and the proposed DCL method vary in the training process, we compute the congruency of each epoch in the training process by Eq. (4). Specifically, it becomes ν_{w_es→w_ee|w_1}, where w_es and w_ee are the weights at the first and last iteration of each epoch, respectively. Here, w_0 is randomly initialized and w_1 represents the starting point of the training. For convenience, we simplify the notation of the average congruency ν_{w_es→w_ee|w_1} for each epoch as ν_{w_1}. Correspondingly, we define the average magnitude d_{w_1} of the accumulated gradients over the iterations in an epoch, i.e.,

d_{w_1} = (1 / (sub(w_ee) − sub(w_es) + 1)) Σ_{i=sub(w_es)}^{sub(w_ee)} ‖w_i − w_1‖_2,   (20)

where sub(·) returns the iteration index of a weight, and d_{w_1} measures the magnitudes of the accumulated gradients taking w_1 as the reference. Note that Eq. (20) works not only with an absolute reference (e.g., w_1), but also with a relative reference (e.g., w_{i−1}). Specifically, we can substitute w_{i−1} for w_1 in Eq. (20) to compute d_{w_{i−1}}. Eq. (20) allows us to peek into the convergence process in the high-dimensional weight space, where it is difficult to visualize the convergence. Taking an absolute reference (e.g., w_1) provides an overview of how the learning process converges to the local minimum from a fixed reference, while a relative reference (e.g., w_{i−1}) helps reveal the iterative pattern.
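A small sketch of both variants of Eq. (20), assuming the per-iteration weights of an epoch have been flattened and snapshotted (the snapshot names are placeholders):

```python
import torch

def avg_magnitude(snapshots, w1):
    """d_{w_1}: mean L2 distance of per-iteration weights from a fixed
    reference w_1 within an epoch, Eq. (20)."""
    return torch.stack([(w - w1).norm(p=2) for w in snapshots]).mean()

def avg_magnitude_relative(snapshots):
    """d_{w_{i-1}}: same measurement, but each iterate is compared to the
    previous one, revealing the iterative (step-size) pattern."""
    return torch.stack([(snapshots[i] - snapshots[i - 1]).norm(p=2)
                        for i in range(1, len(snapshots))]).mean()

# toy usage with hypothetical flattened weight snapshots of one epoch
w1 = torch.zeros(4)
snapshots = [torch.full((4,), float(i)) for i in range(1, 6)]
print(avg_magnitude(snapshots, w1))       # distance from the fixed start
print(avg_magnitude_relative(snapshots))  # per-step movement
```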

For the experiments on continual learning, since GEM uses the samples in memory to regulate the optimization direction, we follow this setting and check the effect of the proposed DCL method on the cosine similarities between the corrected gradient and the gradients generated by the samples in memory. More concretely, the average cosine similarity is defined as

(1 / N_it) (1 / |M|) Σ_{i=1}^{N_it} Σ_{s∈M} cos(g_s, g_{GEM_i}),

where N_it is the number of iterations in an epoch and g_{GEM_i} is the gradient of GEM at the i-th iteration.

6.3.1 Saliency Prediction

We analyze the models from Table 2, i.e., ResNet-50 Adam (baseline), ResNet-50 Adam GEM (GEM), and ResNet-50 Adam DCL-∞-1 (DCL). As the training sample sequence is affected by the stochastic process, which may influence the proposed DCL method, we present two settings, i.e., independent stochastic processes and the same stochastic process shared across the training of the three models, to gauge the influence of the stochastic process on the proposed DCL method. Specifically, Figures 10a and 10b show the curves with independent stochastic processes on OSIE and SALICON, respectively, whereas the same permuted sample sequences are used in the training of the three models in Figures 10c and 10d. We can see that they are similar in pattern, which implies that the permutation of the training samples has little influence on the proposed DCL method. Moreover, the proposed DCL method consistently gives rise to a more congruent learning process than the baseline and GEM.

6.3.2 Continual Learning

Figures 11a, 11b, and 11c show the congruency along the tasks, which are the episodes in which the new classes are learned. It can be seen that the proposed DCL method significantly enhances the cosine similarities between the gradients used for updates and the gradients generated by the samples in memory on MNIST-R. There are improvements made by the proposed DCL method on early tasks on MNIST-P. Moreover, an overall consistent improvement of the proposed DCL method can be observed on iCIFAR-100. Overall, the corrected updates for the model computed by the proposed DCL method are more congruent with its previous updates. This consistently results in the improvement of BWT in Tables 3, 4, and 5.


Figure 10: Congruencies along the epochs in saliency prediction learning, as defined in Eq. (4): (a) stochastic permutations on OSIE, (b) stochastic permutations on SALICON, (c) fixed permutations on OSIE, and (d) fixed permutations on SALICON. The sample sequences for training are determined by independent stochastic processes in (a) and (b), while the permuted sample sequences are pre-determined and fixed for all models in (c) and (d). The baseline, GEM, and DCL are ResNet-50 Adam, ResNet-50 Adam GEM, and ResNet-50 Adam DCL-∞-1 (see Table 2), respectively.

Figure 11: The average congruencies over epochs in training on the three datasets for continual learning: (a) MNIST-R, (b) MNIST-P, and (c) iCIFAR-100.

6.3.3 Classification

We analyze the models from Table 6, i.e., ResNeXt-29 SGD (baseline), ResNeXt-29 SGD GEM (GEM), and ResNeXt-29 SGD DCL-∞-1 (DCL), in terms of the resulting congruency of each epoch in the learning process on CIFAR. Similarly, ResNet-101 SGD, ResNet-101 SGD GEM, and ResNet-101 SGD DCL-50-1 in Table 7 are used for the analysis on Tiny ImageNet. The curves of the average congruencies are shown in Figures 12a, 12d, and 12g, while Figures 12b, 12e, and 12h show the average magnitudes.

As shown in Figures 12a, 12d, and 12g, the congruency of the proposed DCL method is significantly higher than that of the baseline and GEM along all epochs on CIFAR-10 and CIFAR-100. Higher congruency indicates that the convergence path is flatter and smoother. For example, if the congruency of every epoch were 1 (i.e., a zero angle between successive updates), the convergence path would be a straight line.

On the other hand, the average magnitudes of the proposed DCL method are relatively flat and smooth in Figures 12b, 12e, and 12h, compared to the baseline and GEM. Connecting the magnitudes with the congruencies in Figures 12a, 12d, and 12g, we can infer two points. First, the proposed DCL method finds a local minimum nearer to its initialized weights on CIFAR-10 and CIFAR-100, because the magnitudes of the proposed DCL method are the smallest among the three methods. Second, the convergence

Figure 12: Analyses of the congruencies and magnitudes along the epochs in the classification task, as defined in Eq. (4) and Eq. (20): (a) average congruency, (b) d_{w_1}, and (c) d_{w_{i−1}} on CIFAR-10; (d) average congruency, (e) d_{w_1}, and (f) d_{w_{i−1}} on CIFAR-100; (g) average congruency, (h) d_{w_1}, and (i) d_{w_{i−1}} on Tiny ImageNet.

path of the proposed DCL method is the least oscillatory, because its congruencies are overall higher than those of the other two methods and its magnitudes are the lowest among the three methods.

We take a further look at the training error vs. iteration curves in Figure 13 to better understand the convergence. To give an overview across all epochs, we compute the mean and standard deviation of the training errors at each epoch and plot them on a logarithmic scale in Figures 13a and 13b, respectively. The results show that the proposed DCL method yields lower training errors from epoch 1 to epoch 16. From epoch 15 onwards, the proposed DCL method differs little from the baseline in terms of the mean, as both are around 0.1. Therefore, we plot the representative curves at epochs 1, 5, 10, and 15 in Figures 13c-13f.

6.4 Empirical Convergence

Figure 14 shows the validation losses w.r.t. the three tasks, i.e., saliency prediction (a), continual learning (b), and classification (c). In general, the proposed DCL method achieves lower loss


Figure 13: Training error vs. iteration on Tiny ImageNet with ResNet-101. (a) and (b) plot the mean and standard deviation of training errors at each epoch, respectively. (c)-(f) show four representative curves of training error vs. iteration at epochs 1, 5, 10, and 15, respectively.

than the baseline and GEM, which is aligned with the fact that the proposed DCL method outperforms the baseline and GEM. Note that the classification losses of GEM are above 1.0, so they are not shown in Figure 14c.

6.5 Training from Scratch vs. Fine-tuning

We analyze the proposed approach with two training schemes on the validation set of Tiny ImageNet. The first training scheme trains the models from scratch using the training set of the target dataset, whereas the second training scheme fine-tunes the pre-trained ImageNet models on Tiny ImageNet. For ease of comparison, both the results of training the models from scratch on Tiny ImageNet and the fine-tuning results are shown in Table 9. Similar to the fine-tuning results, the proposed DCL method achieves a lower top-1 error (i.e., 67.56%) and top-5 error (i.e., 40.74%) than the baseline and GEM.

6.6 Computational Cost

We report the computational cost on Tiny ImageNet, and on SALICON and ImageNet, in Tables 10 and 11, respectively. Specifically, the number of parameters of the models and the corresponding processing time per image are presented. The processing time per image is computed as (batch time − data time)/batch size, where batch time is the time to complete the processing of a batch of images, and data time is the time to load a batch of images. Note that the processes of gradient descent with or without the proposed DCL method are the same in the testing phase.
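A rough sketch of this timing protocol (assuming a standard PyTorch DataLoader and a CUDA device; not the exact instrumentation used in the paper):

```python
import time
import torch

def per_image_time(model, loader, device='cuda'):
    """Processing time per image: (batch_time - data_time) / batch_size."""
    model.eval()
    times = []
    end = time.time()
    with torch.no_grad():
        for images, _ in loader:
            data_time = time.time() - end        # time spent loading the batch
            model(images.to(device))
            if torch.cuda.is_available():
                torch.cuda.synchronize()         # wait for GPU work to finish
            batch_time = time.time() - end       # total time for this batch
            times.append((batch_time - data_time) / images.size(0))
            end = time.time()
    return sum(times) / len(times)
```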

Figure 14: Validation loss vs. epoch/task: (a) saliency prediction on SALICON, (b) continual learning on iCIFAR-100, and (c) classification on Tiny ImageNet. In (c), the dashed blue curve indicates that the classification losses of GEM on Tiny ImageNet are all above 1.0, so they are not shown in the figure for clarity.

TABLE 9: Top-1 and top-5 error rate (in %) on the validation set of Tiny ImageNet. We compare (1) training the models from scratch (TFS) on Tiny ImageNet and (2) fine-tuning (FT) the pre-trained ImageNet models on Tiny ImageNet. ResNet-101 SGD DCL uses βw = 60 and Nr = 1. The validation errors of FT are from Table 7.

                       Top-1 error        Top-5 error
Model                  TFS      FT        TFS      FT
ResNet-101 SGD         68.21    17.34     42.72    4.82
ResNet-101 SGD GEM     77.20    21.78     53.18    7.21
ResNet-101 SGD DCL     67.56    16.89     40.74    4.50

We train the models on 3 NVIDIA 1080 Ti graphics cards for the experiments on Tiny ImageNet and SALICON, and on 8 NVIDIA V100 graphics cards for the experiment on ImageNet.

ResNet-101 DCL on Tiny ImageNet uses βw = 60 and Nr = 1. ResNet-50 DCL uses βw = ∞ and Nr = 1 on SALICON, and βw = 1 and Nr = 1 on ImageNet. ResNet-50 GEM uses Nr = 1 on all the datasets. The difference in the number of parameters between the baseline and the proposed DCL method (or GEM) lies in the final layer, i.e., the 1×1 convolutional layer for saliency prediction and the fully connected layer for classification. The proposed DCL method has more parameters in order to store the weights of the final layer for the references.
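This parameter gap can be sanity-checked by counting the final-layer parameters of a stock torchvision ResNet-50 (used here as a stand-in for the models above):

```python
import torchvision

model = torchvision.models.resnet50(num_classes=1000)
fc_params = sum(p.numel() for p in model.fc.parameters())       # final layer only
total_params = sum(p.numel() for p in model.parameters())       # whole network
print(fc_params, total_params)  # ~2.05M final-layer params out of ~25.6M
```

The roughly 2.05M final-layer parameters are consistent with the 25.55M vs. 23.50M gap between the DCL/GEM variants and the baseline in the ImageNet column of Table 11.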

In the experiment on Tiny ImageNet, the proposed DCL method with ResNet-101 takes 2 more milliseconds than the baseline to solve the constrained quadratic problem (5). Similarly, with ResNet-50, it takes 1 and 2 more milliseconds than the baseline on SALICON and ImageNet, respectively. This shows that quadratic problems with high-dimensional input can be efficiently solved with the quadprog tool. Hence, the proposed DCL method is practically accessible. On the other hand, GEM [33] is less efficient than the other two methods across the three datasets. This is because GEM


TABLE 10: Computational cost of training models on Tiny ImageNet. The processing time (proc time) per image is calculated as (batch time − data time)/batch size.

Model             # params   proc time
ResNet-101        42.50M     47 ms
ResNet-101 GEM    42.91M     78 ms
ResNet-101 DCL    42.91M     49 ms

has to compute the gradients according to the memory, i.e., the input features of the final layer, at each iteration. Instead, the proposed DCL method uses a subtraction operation (i.e., Eq. (6)) to compute the accumulated gradient. Thus, it is faster than GEM.
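The contrast can be sketched as follows (illustrative only; it assumes plain gradient descent within the effective window, so that the accumulated gradient is recoverable from two weight snapshots by a single subtraction, in the spirit of Eq. (6)):

```python
# With plain gradient descent, w = w_ref - eta * (sum of gradients since w_ref),
# so the accumulated direction follows from two weight snapshots alone.
import torch

eta = 0.1
w_ref = torch.randn(4)           # reference snapshot (start of effective window)
w = w_ref.clone()
for _ in range(5):               # a few hypothetical descent steps
    g = torch.randn(4)           # per-iteration gradient (placeholder)
    w = w - eta * g

accumulated = (w_ref - w) / eta  # one subtraction, no stored samples needed
```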

Moreover, we discuss the effects of βw and Nr on the computational cost. The computational cost w.r.t. βw and Nr with ResNet on Tiny ImageNet is reported in Table 12. As βw indicates the effective window, it is implemented by a subtraction operation according to Eq. (6), and updating the reference point is a copy operation in RAM, which is fast. Therefore, βw does not affect the computational cost. On the other hand, the time difference between various Nr is small because we only apply the proposed DCL method to the downstream layer, i.e., the final layer, where the parameters are much fewer than those of the whole network. For example, there are only 2304 parameters in the final convolutional layer for saliency prediction. Any quadratic programming solver like quadprog can efficiently handle the corresponding dual problem (8) at such a small scale.

6.7 Discussion of Generalization

Incongruency is ubiquitous in the learning process. It results from the diversity of the input data, e.g., real-world images, and rich task-specific semantics. The proposed DCL method can effectively alleviate the incongruency problem in saliency prediction, continual learning, and classification. Specifically, saliency prediction can be seen as a typical regression problem, while continual learning and classification can be seen as typical learning problems that aim to predict a discrete label. In this sense, the input-output mappings and the learning settings of the three tasks are fundamental to other vision tasks.

From the point of view of task-dependent incongruency, we consider general vision tasks to be cast into three groups according to the form of input and output. The first group consists of visual tasks that take images as input for classification or regression, e.g., object detection [41] and visual sentiment analysis [61]. In object detection, the visual appearance of a region of interest can be diverse in terms of its label and location, while an arbitrary sentiment class can have a number of visual representations in visual sentiment analysis. Since tasks in this group have similar incongruency to that in image classification, i.e., the diversity of raw image features w.r.t. a certain label, the proposed DCL method is expected to boost this type of vision task. The second group consists of visual tasks that have complex outputs for regression or classification, e.g., visual relationship detection [30], [34] and human-object interaction [31], [57], whose output can involve multiple possible relationships among two or more objects that belong to various visual concepts. The incongruency of tasks in this group lies in the diversity of raw image features w.r.t. a higher-dimensional variable, e.g., a relationship that involves multiple objects and corresponding predicates. Last but not least,

TABLE 11: Computational cost of training models on SALICON and ImageNet.

                 SALICON                ImageNet
Model            # params   proc time   # params   proc time
ResNet-50        23.51M     64 ms       23.50M     6 ms
ResNet-50 GEM    23.51M     102 ms      25.55M     10 ms
ResNet-50 DCL    23.51M     65 ms       25.55M     8 ms

TABLE 12: The effect of βw and Nr on the computational cost (i.e., proc time) with ResNet on Tiny ImageNet. Note that βw does not affect the computational cost because βw indicates the effective window and resetting the references is implemented as a subtraction operation according to Eq. (6).

βw (Nr = 1)    proc time
10             49 ms
20             49 ms
30             49 ms
40             49 ms
50             49 ms

Nr (βw = 50)   proc time
1              49 ms
5              51 ms
10             53 ms
15             54 ms
20             54 ms

the third group consists of visual tasks that take a series of images as input, e.g., action recognition [54]. Usually, such a task takes a clip of video as input and incorporates temporal information. The incongruency of tasks in this group lies in the diversity of temporal raw image features w.r.t. a certain label, and the feature space of clips is often more complicated than that of static images. Therefore, the incongruency of tasks in the second and third groups could be more pronounced than that of tasks in the first group. Note that the proposed DCL method is gradient-based and not restricted to specific forms of input or output. Therefore, it could naturally generalize, or be used as a starting point, to alleviate incongruency for tasks with different forms of input and output in the three groups.

7 CONCLUSION

In this work, we define congruency as the agreement between new information and the learned knowledge in a learning process. We propose a Direction Concentration Learning (DCL) method that takes congruency into account in a learning process when searching for a local minimum. We study congruency in three tasks, i.e., saliency prediction, continual learning, and classification. The proposed DCL method generally improves the performance on the three tasks. More importantly, our analysis shows that the proposed DCL method mitigates catastrophic forgetting.

REFERENCES

[1] A. Borji, D. N. Sihite, and L. Itti. Quantitative analysis of human-model agreement in visual saliency modeling: A comparative study. IEEE Transactions on Image Processing, 22(1):55–69, 2013.
[2] N. Bruce and J. Tsotsos. Attention based on information maximization. Journal of Vision, 7(9):950–950, 2007.
[3] G. A. Carpenter, S. Grossberg, and G. W. Lesher. The what-and-where filter: A spatial mapping neural network for object recognition and image understanding. Computer Vision and Image Understanding, 69(1):1–22, 1998.
[4] X. Chang, Y.-L. Yu, Y. Yang, and E. P. Xing. Semantic pooling for complex event analysis in untrimmed videos. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(8):1617–1632, 2017.


[5] M. Cornia, L. Baraldi, G. Serra, and R. Cucchiara. Predicting human eye fixations via an LSTM-based saliency attentive model. IEEE Transactions on Image Processing, 27(10):5142–5154, 2018.
[6] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255, 2009.
[7] W. S. Dorn. Duality in quadratic programming. Quarterly of Applied Mathematics, 18(2):155–162, 1960.
[8] C. Fernandez-Granda. Lecture notes: optimization algorithms, February 2016. Available at https://math.nyu.edu/~cfgranda/pages/OBDA_spring16/material/optimization_algorithms.pdf.
[9] R. M. French. Catastrophic forgetting in connectionist networks. Trends in Cognitive Sciences, 3(4):128–135, 1999.
[10] I. J. Goodfellow, M. Mirza, D. Xiao, A. Courville, and Y. Bengio. An empirical investigation of catastrophic forgetting in gradient-based neural networks. In International Conference on Learning Representations, 2014.
[11] R. D. Gordon. Attentional allocation during the perception of scenes. Journal of Experimental Psychology: Human Perception and Performance, 30(4):760, 2004.
[12] P. Goyal, P. Dollar, R. Girshick, P. Noordhuis, L. Wesolowski, A. Kyrola, A. Tulloch, Y. Jia, and K. He. Accurate, large minibatch SGD: Training ImageNet in 1 hour. arXiv preprint arXiv:1706.02677, 2017.
[13] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
[14] G. Hinton, N. Srivastava, and K. Swersky. Neural networks for machine learning: Lecture 6a - Overview of mini-batch gradient descent, 2012. Available at https://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf.
[15] S. C. H. Hoi, J. Wang, and P. Zhao. LIBOL: A library for online learning algorithms. Journal of Machine Learning Research, 15(1):495–499, 2014.
[16] G. Huang, Z. Liu, L. van der Maaten, and K. Q. Weinberger. Densely connected convolutional networks. In IEEE Conference on Computer Vision and Pattern Recognition, pages 4700–4708, 2017.
[17] X. Huang, C. Shen, X. Boix, and Q. Zhao. SALICON: Reducing the semantic gap in saliency prediction by adapting deep neural networks. In IEEE International Conference on Computer Vision, pages 262–270, 2015.
[18] L. Itti, N. Dhavale, and F. Pighin. Realistic avatar eye and head animation using a neurobiological model of visual attention. In Proceedings of SPIE 48th Annual International Symposium on Optical Science and Technology, Aug. 2003.
[19] L. Itti, C. Koch, and E. Niebur. A model of saliency-based visual attention for rapid scene analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(11):1254–1259, 1998.
[20] M. Jiang, S. Huang, J. Duan, and Q. Zhao. SALICON: Saliency in context. In IEEE Conference on Computer Vision and Pattern Recognition, pages 1072–1080, 2015.
[21] T. Judd, F. Durand, and A. Torralba. A benchmark of computational models of saliency to predict human fixations. Technical Report MIT-CSAIL-TR-2012-001, 2012.
[22] T. Judd, K. Ehinger, F. Durand, and A. Torralba. Learning to predict where humans look. In IEEE International Conference on Computer Vision, pages 2106–2113, 2009.
[23] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations, 2015.
[24] J. Kirkpatrick, R. Pascanu, N. Rabinowitz, J. Veness, G. Desjardins, A. A. Rusu, K. Milan, J. Quan, T. Ramalho, A. Grabska-Barwinska, et al. Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences, 114(13):3521–3526, 2017.
[25] E. Knight and O. Lerner. Natural gradient deep Q-learning. CoRR, abs/1803.07482, 2018.
[26] A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images. Technical Report 4, University of Toronto, 2009.
[27] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.
[28] M. Kummerer, T. S. A. Wallis, L. A. Gatys, and M. Bethge. Understanding low- and high-level contributions to fixation prediction. In IEEE International Conference on Computer Vision, pages 4789–4798, 2017.
[29] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
[30] J. Li, Y. Wong, Q. Zhao, and M. S. Kankanhalli. Dual-glance model for deciphering social relationships. In IEEE International Conference on Computer Vision, pages 2669–2678, 2017.
[31] Y.-L. Li, S. Zhou, X. Huang, L. Xu, Z. Ma, H.-S. Fang, Y. Wang, and C. Lu. Transferable interactiveness knowledge for human-object interaction detection. In IEEE Conference on Computer Vision and Pattern Recognition, pages 3585–3594, 2019.
[32] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollar, and C. L. Zitnick. Microsoft COCO: Common objects in context. In European Conference on Computer Vision, volume 8693 of Lecture Notes in Computer Science, pages 740–755, 2014.
[33] D. Lopez-Paz and M. A. Ranzato. Gradient episodic memory for continual learning. In Advances in Neural Information Processing Systems, pages 6470–6479, 2017.
[34] C. Lu, R. Krishna, M. Bernstein, and L. Fei-Fei. Visual relationship detection with language priors. In European Conference on Computer Vision, volume 9905 of Lecture Notes in Computer Science, pages 852–869, 2016.
[35] M. McCloskey and N. J. Cohen. Catastrophic interference in connectionist networks: The sequential learning problem. Psychology of Learning and Motivation, 24:109–165, 1989.
[36] Y. Nesterov. Introductory Lectures on Convex Optimization: A Basic Course. Springer Publishing Company, Incorporated, 1st edition, 2014.
[37] Y. E. Nesterov. A method for solving the convex programming problem with convergence rate O(1/k^2). In Dokl. Akad. Nauk SSSR, volume 269, pages 543–547, 1983.
[38] N. Ouerhani, R. Von Wartburg, H. Hugli, and R. Muri. Empirical validation of the saliency-based model of visual attention. Electronic Letters on Computer Vision and Image Analysis, 3(1):13–24, 2004.
[39] R. Ratcliff. Connectionist models of recognition memory: Constraints imposed by learning and forgetting functions. Psychological Review, 97(2):285, 1990.
[40] S. Rebuffi, A. Kolesnikov, G. Sperl, and C. H. Lampert. iCaRL: Incremental classifier and representation learning. In IEEE Conference on Computer Vision and Pattern Recognition, pages 5533–5542, 2017.
[41] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems, pages 91–99, 2015.
[42] H. Robbins and S. Monro. A stochastic approximation method. The Annals of Mathematical Statistics, 22(3):400–407, 1951.
[43] A. L. Rothenstein and J. K. Tsotsos. Attention links sensing to recognition. Image and Vision Computing, 26(1):114–126, 2008.
[44] T. Salimans and D. P. Kingma. Weight normalization: A simple reparameterization to accelerate training of deep neural networks. In Advances in Neural Information Processing Systems, pages 901–909, 2016.
[45] H. J. Seo and P. Milanfar. Static and space-time visual saliency detection by self-resemblance. Journal of Vision, 9(12):15–15, 2009.
[46] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In International Conference on Learning Representations, 2015.
[47] M. Tan and Q. Le. EfficientNet: Rethinking model scaling for convolutional neural networks. In Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 6105–6114, 2019.
[48] D. Tao, X. Li, X. Wu, and S. J. Maybank. Geometric mean for subspace selection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31(2):260–274, 2009.
[49] R. Tibshirani. Lecture notes: optimization, September 2013. Available at http://www.stat.cmu.edu/~ryantibs/convexopt-F13/scribes/lec6.pdf.
[50] A. M. Treisman and G. Gelade. A feature-integration theory of attention. Cognitive Psychology, 12(1):97–136, 1980.
[51] G. Underwood and T. Foulsham. Visual saliency and semantic incongruency influence eye movements when inspecting pictures. Quarterly Journal of Experimental Psychology, 59(11):1931–1949, 2006.
[52] G. Underwood, L. Humphreys, and E. Cross. Congruency, saliency and gist in the inspection of objects in natural scenes. In Eye Movements, pages 563–VII. Elsevier, 2007.
[53] V. N. Vapnik. An overview of statistical learning theory. IEEE Transactions on Neural Networks, 10(5):988–999, 1999.
[54] H. Wang and C. Schmid. Action recognition with improved trajectories. In IEEE International Conference on Computer Vision, pages 3551–3558, 2013.
[55] J. M. Wolfe and T. S. Horowitz. Five factors that guide attention in visual search. Nature Human Behaviour, 1(3):0058, 2017.
[56] S. Xie, R. Girshick, P. Dollar, Z. Tu, and K. He. Aggregated residual transformations for deep neural networks. In IEEE Conference on Computer Vision and Pattern Recognition, pages 5987–5995, 2017.
[57] B. Xu, Y. Wong, J. Li, Q. Zhao, and M. S. Kankanhalli. Learning to detect human-object interactions with knowledge. In IEEE Conference on Computer Vision and Pattern Recognition, pages 2019–2028, 2019.
[58] J. Xu, M. Jiang, S. Wang, M. S. Kankanhalli, and Q. Zhao. Predicting human gaze beyond pixels. Journal of Vision, 14(1):28–28, 2014.
[59] S. Yang, G. Lin, Q. Jiang, and W. Lin. A dilated inception network for visual saliency prediction. IEEE Transactions on Multimedia, 2019.
[60] Y. Yang, Z. Ma, A. G. Hauptmann, and N. Sebe. Feature selection for multimedia analysis by sharing information among multiple tasks. IEEE Transactions on Multimedia, 15(3):661–669, 2012.


[61] Q. You, H. Jin, and J. Luo. Visual sentiment analysis by attending on local image regions. In Thirty-First AAAI Conference on Artificial Intelligence, pages 231–237, 2017.

Yan Luo is currently pursuing the Ph.D. degree with the Department of Computer Science and Engineering, University of Minnesota at Twin Cities, Minneapolis, MN, USA. He received the B.Sc. degree in computer science from Xi'an University of Science and Technology. In 2013, he joined the Sensor-enhanced Social Media (SeSaMe) Centre, Interactive and Digital Media Institute, National University of Singapore, as a Research Assistant. In 2015, he joined the Visual Information Processing Laboratory at the National University of Singapore as a Ph.D. student. He worked in industry for several years on distributed systems. His research interests include computer vision, computational visual cognition, and deep learning.

Yongkang Wong is a senior research fellow at the School of Computing, National University of Singapore. He is also the Assistant Director of the NUS Centre for Research in Privacy Technologies (N-CRiPT). He obtained his BEng from the University of Adelaide and PhD from the University of Queensland. He worked as a graduate researcher at NICTA's Queensland laboratory, Brisbane, QLD, Australia, from 2008 to 2012. His current research interests are in the areas of image/video processing, machine learning, action recognition, and human-centric analysis. He has been a member of the IEEE since 2009.

Mohan Kankanhalli is the Provost's Chair Professor at the Department of Computer Science of the National University of Singapore. He is the director of N-CRiPT and also the Dean, School of Computing at NUS. Mohan obtained his BTech from IIT Kharagpur and MS & PhD from the Rensselaer Polytechnic Institute. His current research interests are in multimedia computing, multimedia security and privacy, image/video processing, and social media analysis. He is on the editorial boards of several journals. Mohan is a Fellow of the IEEE.

Qi Zhao is an assistant professor in the Department of Computer Science and Engineering at the University of Minnesota, Twin Cities. Her main research interests include computer vision, machine learning, cognitive neuroscience, and mental disorders. She received her Ph.D. in computer engineering from the University of California, Santa Cruz in 2009. She was a postdoctoral researcher in the Computation & Neural Systems and Division of Biology at the California Institute of Technology from 2009 to 2011. Prior to joining the University of Minnesota, Qi was an assistant professor in the Department of Electrical and Computer Engineering and the Department of Ophthalmology at the National University of Singapore. She has published more than 50 journal and conference papers in top computer vision, machine learning, and cognitive neuroscience venues, and edited a book with Springer, titled Computational and Cognitive Neuroscience of Vision, that provides a systematic and comprehensive overview of vision from various perspectives, ranging from neuroscience to cognition, and from computational principles to engineering developments. She has been a member of the IEEE since 2004.
