
Adversarial Visual Robustness by Causal Intervention

Kaihua Tang
Nanyang Technological University
[email protected]

Mingyuan Tao
Alibaba Group
[email protected]

Hanwang Zhang
Nanyang Technological University
[email protected]

Abstract

Adversarial training is the de facto most promising defense against adversarial examples. Yet, its passive nature inevitably prevents it from being immune to unknown attackers. To achieve a proactive defense, we need a more fundamental understanding of adversarial examples, beyond the popular bounded threat model. In this paper, we provide a causal viewpoint of adversarial vulnerability: the cause is the confounder ubiquitously existing in learning, where attackers are precisely exploiting the confounding effect. Therefore, a fundamental solution for adversarial robustness is causal intervention. As the confounder is unobserved in general, we propose to use the instrumental variable, which achieves intervention without the need for confounder observation. We term our robust training method Causal intervention by instrumental Variable (CiiV). It has a differentiable retinotopic sampling layer and a consistency loss, which is stable and guaranteed not to suffer from gradient obfuscation. Extensive experiments with a wide spectrum of attackers and settings on the MNIST, CIFAR-10, and mini-ImageNet datasets empirically demonstrate that CiiV is robust to adaptive attacks.^1

1. Introduction

Despite the remarkable progress achieved by Deep Neural Networks (DNNs), adversarial vulnerability [20] has kept haunting the computer vision community since it was spotted by Szegedy et al. [61]. Over the years, we have witnessed many defenders who claimed to be "well-rounded" but were soon found to lack fair benchmarking (e.g., adaptive adversary [64, 44]) or to misconduct the attack (e.g., obfuscated gradients [3, 7, 21]).

^1 Code is available at the following GitHub project: https://github.com/KaihuaTang/Adversarial-Robustness-by-Causal-Intervention.pytorch

Figure 1. (a) A confounded "1/2" classifier. (b) Changing the stripe from horizontal to vertical attacks the model. (c) Conventional training based on association is confounded. (d) Adversarial perturbation essentially works by tampering with confounders.

Therefore, the most promising defender still remains the intuitive Adversarial Training and its variants [29, 76]. Due to its "training" nature, adversarial robustness largely depends on whether the training set contains sufficient adversarial samples from as many attackers as possible [65]; yet enumerating all attackers by brute force is prohibitively expensive, and adversarial training is well known to overfit to the attackers seen in training [57]. Particularly, in few-/zero-shot scenarios, it is even impossible to collect enough adversarial training samples based on the out-of-distribution/unseen original samples [79].

In other words, adversarial training is a "passive immunization" that cannot react responsively to ever-evolving attacks. To proactively defend against all known and unknown attackers, we have to find the "origin" of adversarial examples. Previous methods blame adversarial vulnerability on the inherent flaws of fitting models to high-dimensional and limited data [20, 19, 55]. However, simply regarding adversarial samples as "bugs" cannot explain their well-generalizing behaviors [73, 28].


Recent studies [28, 53] show that adversarial examples are not "bugs" but predictive features that can only be exploited by machines. Such results urge us to investigate the essential difference between machines and humans, which is perhaps a never-ending philosophical debate per se.

However, we believe it is too early to shirk this responsibility and leave it to an ever-elusive open problem before we answer the following two key questions.

Q1: What are the non-robust but predictive features? Ilyas et al. [28] use the definition of adversarial samples to define the robust and non-robust features. However, this only allows us to recognize the features as "adversarial samples" again, which is, unfortunately, circular reasoning. Therefore, we need a more fundamental definition of the features beyond the conventional adversarial one.

Q2: Why do complex systems (human vision) ignore the "good" predictive features that simple systems (DNNs) can capture? Given that biological vision is more complex than machines in terms of both neuron count [25] and diversity [39, 31], there is no reason for human vision to extract fewer features than machines. Therefore, there must be a mechanism in human vision that deliberately ignores these features.

In this paper, we answer the above two questions from a causal perspective [46], a powerful lens for seeing through the generative nature of adversarial attacks. For Q1, we postulate that adversarial attacks are confounding effects, i.e., spurious correlations established by non-causal features. Take Figure 1 (a) as an intuitive example, where horizontal stripes co-occur with the digit "1" in most samples. As a result, a model trained by associating samples with labels will recklessly use the stripe pattern, i.e., the confounding effect, to classify "1" and "2". Therefore, once the stripes are tampered with, which is much easier than editing the digit directly, the confounding effect will mislead the model prediction, as shown in Figure 1 (b).

In general, any pattern that co-occurs with certain labels can constitute a confounder (e.g., local textures, small edges, and faint shadows). Since almost all DNN models are based on the statistical association between input and output, they inevitably learn these spurious correlations, which are "predictive" when the confounders remain the same in training and testing, as shown in Figure 1 (c). However, when such confounders are tampered with by adversarial perturbations, the confounding effect will seriously attack the model, as shown in Figure 1 (d). In Sections 4.1 & 4.2, we formally revisit existing adversarial attackers and defenders from the causal viewpoint, where we hope readers will understand why it is so hard to defend against adversarial attacks: removing unobserved and unknown confounders is challenging in practice [14].

Figure 2. The proposed CiiV framework (detailed in Section 4.4) illustrated with a toy example: (a) retinotopic sampling that serves as the instrumental variable; (b) the proposed causal graph to estimate the causal effect; (c) the overall causal intervention that eliminates non-robust confounding effects.

Unlike machine vision, which scans all the pixels of an image at once, human vision perceives an image via "retinotopic sampling" [2] through non-uniformly distributed retinal photoreceptors at each time frame, as shown in Figure 2 (a). We conjecture that such sampling is the answer to Q2, because it can be viewed as causal intervention using an instrumental variable [46], denoted as R in Figure 2 (b). With the help of R, the confounded image observation X is no longer dictated only by the confounders, whose effects are thus mitigated even though C is unobserved, because the effect of C on Y is blocked along the path R → X ← C → Y by the colliding junction R → X ← C [47]. Intuitively, the sampling breaks the non-robust confounder patterns (e.g., stripes) into random sub-patterns, which become non-learnable for machines. However, simply introducing R causes inconsistent observations of neighboring samples in a highly non-linear system like a DNN. To this end, in Section 4.4, we propose an effective consistency loss that minimizes the asymptotic variance of features, and thus ensures the convergence of the de-confounded causal effect calculated from finite samples, as shown in Figure 2 (c).

Our key contributions are as follows:

• We introduce a Causal intervention by instrumental Variable (CiiV, pronounced \si:v\) framework for training robust DNNs against adversarial attacks. It not only offers a proactive defender that avoids endless adversarial training, but also opens a novel yet fundamental viewpoint for adversary research.

• CiiV has a fully differentiable retinotopic sampling layer that conducts instrumental variable estimation with minimized asymptotic variance. Therefore, CiiV is stable and guaranteed not to suffer from gradient obfuscation.

• We strictly follow the adversarial evaluation checklist of Carlini et al. [8] to conduct sanity checks over a wide range of attacker settings on MNIST, CIFAR-10, and mini-ImageNet, demonstrating that CiiV withstands adaptive attacks.

2. Related Work

Adversarial Examples. Adversarial examples undermine the reliability and interpretability of DNN models in various domains [69, 48, 18, 75, 72, 12, 10, 27] and settings [15, 40, 34, 45, 9, 81]. A variety of defenders have been proposed to improve adversarial robustness. Some defenders [51, 31] are inspired by biological vision systems. Generally, the defenders fall into the following five categories: adversarial training [61, 71], data augmentation [80], generative classifiers [37, 16], de-noising [7, 76], and certified defenses [13]. In Section 4.2 we revisit them and compare them to CiiV from a causal viewpoint.

Causality in Computer Vision. Recently, causal inference has been shown to be effective in mitigating data bias in computer vision tasks [62, 49, 78, 77, 42, 26]. These works use confounder adjustment to de-bias the causal effect measured in domain-specific tasks. However, the confounder that causes adversarial perturbations is generally unobserved and ubiquitous. Although recent surveys [56, 30] have mentioned the potential of causality for adversarial robustness, to the best of our knowledge, we are the first to establish a feasible and formal causal framework for interpreting and tackling adversarial attacks.

3. Preliminaries

Causal Graph. We adopt Pearl's graphical model [46], where a directed edge indicates the causality between two node variables. The causal graph of the proposed CiiV framework is illustrated in Figure 2 (b), where R, C, X, Y indicate retinotopic sampling, confounding perturbation, image, and prediction, respectively. X ← C → Y denotes that there is a common cause, the confounder C, that affects the distributions of both X and Y, e.g., the stripe pattern in Figure 1. X → Y denotes the desired causality that a robust prediction model is expected to learn.

Causal Intervention. The ultimate goal of causal intervention is to identify the causal effect of X → Y by removing all spurious correlations [47], denoted as P(Y|do(X = x)). By d-separation [47], observing the confounder can block the spurious path, e.g., given C = c, the path X ← C → Y is blocked. There are three major interventions: backdoor adjustment, front-door adjustment, and the instrumental variable. The first two assume that confounders or mediators are observed, which is generally impractical for adversarial robustness. Therefore, we are interested in the last one, which does not require such assumptions.

Instrumental Variable [22, 41]. As shown in Figure 2 (b), a valid instrumental variable should satisfy: 1) it is independent of the confounder variable; and 2) it affects Y only through X. Therefore, the instrumental variable can help to extract the causal effect of X → Y from R → X → Y, which is not confounded by C (it is d-separated), thanks to the colliding junction R → X ← C.

Figure 3. The causal mechanisms of adversarial attack and defense methods: (a) adversarial attack, (b) adversarial training, (c) data augmentation, (d) generative classifier, (e) de-noising, and (f) randomized smoothing, where X is the input image, C the confounding perturbation, Y the prediction, and ε the Gaussian noise (with ε ≫ c + δ).

4. Approach

4.1. Causal View of Adversarial Attack

In causality [47], the total effect and the causal effect of X on Y can be defined as P(Y|X) and P(Y|do(X = x)), respectively. Given the proposed causal graph in Figure 1, the confounding path X ← C → Y causes the inequality between the two, and thus the confounding effect can be represented as their difference.

The general adversarial attack can be formulated as maximizing the probability of a tampered prediction Y = ỹ within the budget D_ε as follows [52]:

max_{δ∈D_ε} P(Y = ỹ | X = x + δ) ∝ Σ_i ỹ_i log( e^{f_i(x+δ)} / Σ_j e^{f_j(x+δ)} ),   (1)

where δ is the adversarial perturbation added to the image x, f_i(·) and f_j(·) denote the outputs of the DNN model for categories i and j, ỹ = y′ (y′ ≠ y) for a targeted attack and ỹ = −y for an untargeted attack, and the calculation of the budget D_ε varies with the threat model [81, 34], e.g., the most popular definition is an enclosing ball under the l2/l∞ norm with radius ε.
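To make Eq. (1) concrete, the following is a minimal PGD-style sketch of this objective under an l∞ budget, written for a generic PyTorch classifier. The function name, step size, and clamping to [0, 1] are illustrative assumptions rather than the exact attack implementations evaluated in Section 5.

```python
import torch
import torch.nn.functional as F

def pgd_attack(model, x, y, eps=8/255, alpha=2/255, steps=20, targeted=False):
    """Search for delta within the l_inf ball D_eps that maximizes Eq. (1).

    Untargeted: ascend the cross-entropy of the ground-truth label y.
    Targeted:   descend the cross-entropy of a chosen target label y.
    """
    x_adv = x.clone().detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        with torch.no_grad():
            step = alpha * grad.sign()
            x_adv = x_adv - step if targeted else x_adv + step
            x_adv = x + (x_adv - x).clamp(-eps, eps)   # project back into the budget D_eps
            x_adv = x_adv.clamp(0, 1)                  # keep a valid image
    return x_adv.detach()
```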

Notably, since a valid D_ε for an attack requires retaining the semantic patterns, i.e., the perturbation δ cannot change the causal features, P(Y|do(X = x)) should stay the same; otherwise, the attack is called "poisoning", which is beyond our scope [63]. Therefore, an adversarial attack that optimizes the total effect P(Y|X) towards Y = ỹ actually maximizes the tampered confounding effect.

In general, all attacks, including gradient-based [20, 38], gradient-free [6, 11], and physical [34] attacks, can be viewed as Figure 3 (a): they violate X → Y towards the ground-truth category y by maximizing the confounding effect X ← C → Y. To intuitively demonstrate the above causal theory, we further demonstrate the equivalence between the confounding effect and adversarial examples on a confounded toy dataset in Section 5.4.

4.2. Causal View of Adversarial Defense

Due to the unknown and unobserved confounder C in the input image X, directly adjusting C for P(Y|do(X = x)) is impractical [14].


Therefore, all existing adversarial defenders can be regarded as either removing the dependence C → X or undermining the correlation C → Y. Specifically, they can be roughly summarized into the following five categories.

Adversarial Training. As the most promising defenders, adversarial training and its variants [61, 71] (Figure 3 (b)) use adversarial examples to maximize the overall prediction under the attack X = x + δ, where δ is calculated by Eq. (1). As long as the attacks at inference are similar to the attacks used to generate the training samples, it prevents the tampered confounder δ from changing the original Y = y, and thus blocks C → Y (C ↛ Y).

Data Augmentation. Since some previous methods blame adversarial vulnerability on limited data [55], data augmentation is also utilized by defenders. For example, Mixup [80] enhances the adversarial robustness of DNN models by augmenting training samples with linear combinations of different (x, y) pairs, forcing the effect on Y to be proportional to the magnitude of the perturbation, i.e., a small δ only causes a small confounding effect. Therefore, it undermines the link C → Y, as shown in Figure 3 (c). However, no data augmentation can enumerate all confounders, so strong attackers like PGD [38] can easily defeat Mixup models.

Generative Classifier. Instead of directly predicting Y from X, defenders based on generative classifiers [57, 37, 16] attempt to find which particular category y = i can generate the x_i that best represents the input x. For example, Schott et al. [57] used a VAE to generate a class-induced x_i for each category y = i and then predicted Y based on the highest joint probability p(x_i, y). Although the generated x_i undermines the confounder c from the original image (C ↛ X), incomplete reconstruction may also hurt causal features. Besides, their computational overhead increases linearly with the number of categories, which is prohibitive on large-scale datasets like ImageNet [53].

De-noising. De-noising methods (Figure 3 (e)) adopt either pre-network or in-network structures to prevent confounding effects from reaching the final prediction. The pre-network de-noising methods [74, 54, 7, 60, 21] usually conduct non-differentiable transformations to remove the noisy details (C ↛ X), but Athalye et al. [3] recently found that some of them suffer from the obfuscated-gradient problem. Meanwhile, in-network de-noising methods [36, 76] purify the feature maps, removing the spurious correlations within the network (C ↛ Y), yet most of them have to be combined with adversarial training to achieve reliable robustness.

Certified Defense. The last major category of defenders is certified defenses [50, 70, 59]. Among them, the most typical method and the one most related to our causal intervention is randomized smoothing [13]. It achieves certifiable robustness towards l2-norm adversarial perturbations by introducing a larger Gaussian noise ε that overrides both the natural and adversarial confounders c + δ.

Figure 4. The causal graphs with and without an instrumental variable on linear models with one-dimensional variables.

Although ε undermines the confounding path X ← C → Y (Figure 3 (f)), this noise node lacks biological evidence and does not involve a consistency loss, so it cannot be regarded as a valid CiiV. In particular, it is significantly more vulnerable under anti-obfuscated-gradient attacks, as reported in Table 5.
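For comparison, here is a minimal sketch of the randomized-smoothing prediction rule discussed above (Gaussian noise added to the input, then a majority vote over noisy copies). The function is illustrative; σ and n simply mirror the RS settings reported in Section 5.3.

```python
import torch

@torch.no_grad()
def smoothed_predict(model, x, sigma=0.25, n=10, num_classes=10):
    """Monte-Carlo estimate of the smoothed classifier g(x) = argmax_c P(f(x + e) = c),
    with e ~ N(0, sigma^2 I): the large noise e is what overrides the tampered
    confounders c + delta in Figure 3 (f)."""
    counts = torch.zeros(num_classes, device=x.device)
    for _ in range(n):
        noisy = x + sigma * torch.randn_like(x)          # sample the noise node e
        pred = model(noisy).argmax(dim=1)                # base classifier's vote
        counts += torch.bincount(pred, minlength=num_classes).float()
    return counts.argmax().item()                        # majority-vote class
```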

4.3. Instrumental Variable Estimation

To demonstrate the use of an instrumental variable, we design the two causal graphs illustrated in Figure 4, where each node is a one-dimensional variable, the confounder C is sampled from the normal distribution N(0, 1), R is the instrumental variable, and every causal link is modeled by a linear weight denoted as w_∗.

In the confounded model of Figure 4 (a), the overall effect on the model prediction can be formulated as P(Y|X) ∝ w_xy x + w_cy c. However, since x depends on the unknown confounder as x = w_cx c + h, where h is the causal component of x, we cannot directly estimate the causal effect P(Y|do(X = x)) ∝ w_xy x by simply observing (x, y) pairs.

If C is observable, the causal intervention can be conducted using the backdoor adjustment P(y|do(x)) = Σ_c P(y|x, c)P(c). Therefore, the causal effect is estimated from the total effect with the observed C:

P(Y|do(X = x)) ∝ w_xy x + w_cy Σ_c c · p(c) = w_xy x,   (2)

where Σ_c c · p(c) = 0 as c is sampled from N(0, 1).

However, if C is unobservable, neither the backdoor adjustment nor the causal graph in Figure 4 (a) can remove the confounding effect. To this end, the instrumental variable R is introduced as in Figure 4 (b), where X is now manipulated by both C and R as x = w_cx c + w_rx r + h. Since the confounding path R → X ← C → Y is blocked by the collider X [47], the path R → X → Y, i.e., w_rx w_xy, can be jointly estimated as w_ry by accessing (r, x, y) triplets. Hence, P(Y|do(X = x)) is obtained without any knowledge of C as follows:

P(Y|do(X = x)) ∝ w_rx^{-1} w_ry x = w_xy x,   (3)

where w_rx^{-1} is estimated from (r, x) pairs. Therefore, with the help of instrumental variable estimation, the causal intervention can be conducted equally well with or without knowledge of C in linear models.
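A small NumPy simulation of the linear toy model of Figure 4 (b) illustrates Eq. (3): regressing y directly on x is biased by the confounding path X ← C → Y, while the ratio w_ry / w_rx recovers w_xy. The ground-truth weights below are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
w_cx, w_cy, w_rx, w_xy = 2.0, 3.0, 1.5, 0.7   # arbitrary ground-truth weights

c = rng.normal(size=n)   # unobserved confounder, C ~ N(0, 1)
r = rng.normal(size=n)   # instrument: independent of C, affects Y only through X
h = rng.normal(size=n)   # causal (exogenous) component of x
x = w_cx * c + w_rx * r + h
y = w_xy * x + w_cy * c

# Naive association P(Y|X): biased by the confounding path X <- C -> Y.
w_naive = np.cov(x, y)[0, 1] / np.var(x)

# Instrumental-variable estimate of Eq. (3): w_xy = w_rx^{-1} * w_ry,
# with w_ry fitted from (r, y) pairs and w_rx fitted from (r, x) pairs.
w_ry_hat = np.cov(r, y)[0, 1] / np.var(r)
w_rx_hat = np.cov(r, x)[0, 1] / np.var(r)
w_iv = w_ry_hat / w_rx_hat

print(f"true w_xy = {w_xy:.2f}, naive = {w_naive:.2f}, IV = {w_iv:.2f}")
# the naive slope is biased (about 1.5 here), while the IV estimate recovers about 0.7
```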


4.4. The Proposed CiiV

The CiiV framework is composed of two parts: 1) the implementation of retinotopic sampling and causal effect estimation, and 2) the consistency loss that minimizes the asymptotic variance. Note that we use functions g_rx(·) and g_ry(·) to denote the generalized w_rx and w_ry of Eq. (3).

Retinotopic Sampling. The biological retina is known to consist of photoreceptors, bipolar cells, and a variety of other neurons [51]. Retinotopic sampling is the result of the non-uniform spatial distribution of these photoreceptors [33, 2]. As shown in Figure 5 (a), we build a sampling distribution r, whose sampling frequency decreases from a random center to the peripheries (detailed in the Appendix), and then apply it to the input x. The sampling procedure is denoted as x_r = g_rx(x, r), which is implemented by the following function in CiiV:

g_rx(x, r) = r ⊙ (1/N) Σ_i ReLU([ (x + ε_i)/|x + ε_i|_dt ; (ε̄_i − x)/|ε̄_i − x|_dt ]),   (4)

where ⊙ is the element-wise product, |·|_dt denotes the detached magnitude, ε and ε̄ are uniformly sampled from (−ω, ω) and (1 − ω, 1 + ω), respectively, ω is a hyper-parameter depending on the best exposure time of the images, as shown in Figure 5 (b), the concatenation [ ; ] imitates the ON-/OFF-bipolar cells [39] that respond to light and dark, respectively, and sampling N times mimics the over-completeness of neural representations [43].
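Below is a minimal PyTorch sketch of the sampling function in Eq. (4). The mask r is assumed to be given (its construction is detailed in Appendix B), the small clamp in the denominators is only a numerical safeguard that is not part of the formula, and the tensor shapes are assumptions for illustration.

```python
import torch

def retinotopic_sample(x, r, N=3, omega=0.8):
    """Sketch of Eq. (4): g_rx(x, r) = r * (1/N) * sum_i ReLU([on_i ; off_i]).

    x : (B, C, H, W) images in [0, 1]
    r : (B, 1, H, W) binary retinotopic sampling mask (Appendix B)
    The ON/OFF channels imitate bipolar cells responding to light and dark; the
    magnitudes in the denominators are detached so that gradients are not shattered.
    """
    samples = []
    for _ in range(N):
        eps = torch.empty_like(x).uniform_(-omega, omega)                   # eps_i ~ U(-w, w)
        eps_bar = torch.empty_like(x).uniform_(1 - omega, 1 + omega)        # eps_bar_i ~ U(1-w, 1+w)
        on = (x + eps) / (x + eps).abs().detach().clamp_min(1e-6)           # ON-bipolar channel
        off = (eps_bar - x) / (eps_bar - x).abs().detach().clamp_min(1e-6)  # OFF-bipolar channel
        samples.append(torch.relu(torch.cat([on, off], dim=1)))             # [ ; ] channel concat
    return r * torch.stack(samples).mean(dim=0)                             # r ⊙ (1/N) Σ_i (·)
```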

It is worth noting that the proposed g_rx(x, r) is fully differentiable and does not shatter the gradients [3]. As a result, the overall function g_ry(·) of R → X → Y can be modeled as any DNN model equipped with a retinotopic sampling layer. Here, the input image can be either an original image x or an adversarial sample x + δ. Meanwhile, an intuitive implementation of the inverse retinotopic sampling g_rx^{-1}(·) is ensembling. Similar to Eq. (3), the generalized causal effect of X can be approximated as:

P(Y|do(X = x)) = g_rx^{-1}(g_ry(x, r)) ≈ (1/T_r) Σ_r g_ry(x, r),   (5)

where T_r is the total number of retinotopic samplings with independent sampling distributions r.

Asymptotic-Variance Convergence Loss. The remaining problem is that both g_rx(·) and g_ry(·) are non-linear functions, so the above estimation only provides a rough bound on the de-confounded causal effect [5], because Eq. (5) will not naturally converge with limited samples under a non-linear model. In other words, the above estimation could be overly sensitive to tiny changes in r and thus make CiiV unstable.

Inspired by previous non-linear instrumental variable estimation methods [23, 22], we introduce the Asymptotic-Variance Convergence (AVC) loss as part of our optimization targets, which aims to narrow the estimation bound and thus better approximate the de-confounded causal effect.

Figure 5. Examples of retinotopic sampling with different retinotopic distributions r (e.g., r_center and r_corner) and different sampling parameters ω (from 0.5 to 0.9).

The proposed AVC loss can be formulated as:

min AVC_r(g_ry(x, r))   (6)
= min lim_{T_r→∞} (1/T_r) Σ_r ( g_ry(x, r) − (1/T_r) Σ_r g_ry(x, r) )²   (7)
= min lim_{T_r→∞} (1/T_r) Σ_r ‖ h(x, r) − (1/T_r) Σ_r h(x, r) ‖_1,   (8)

where h(·) denotes the DNN modules before the classifier. Since the common classifier is only a linear layer, we directly apply the AVC loss in feature space for better optimization results; ‖·‖_1 is used due to the sparse property of the feature space. The biological interpretation of the proposed AVC loss is the visual consistency over eye movements [35].

Overall, the proposed CiiV jointly optimizes Eq. (5) through both a cross-entropy loss and the AVC loss with a trade-off parameter β. It is worth noting that although T_r is required to be large (e.g., T_r = 10) during training due to the definition of asymptotic variance, a small T_r (e.g., T_r = 1) can be used at inference once the AVC loss has converged, as shown in Figure 7.
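A sketch of this joint objective is given below: the cross-entropy is computed on the ensembled estimate of Eq. (5) and the L1 AVC term of Eq. (8) is computed over the T_r feature sets. The `sampler`, `backbone`, and `classifier` callables are placeholders for the retinotopic sampling layer, the feature extractor h(·), and the linear classifier, respectively.

```python
import torch
import torch.nn.functional as F

def ciiv_objective(backbone, classifier, sampler, x, y, Tr=10, beta=0.1):
    """Cross-entropy on the averaged features (Eq. (5)) + beta * AVC loss (Eq. (8))."""
    feats = []
    for _ in range(Tr):
        r = sampler.draw_mask(x)               # an independent retinotopic distribution r
        feats.append(backbone(sampler(x, r)))  # h(x, r): features before the linear classifier
    feats = torch.stack(feats)                 # (Tr, B, D)
    mean_feat = feats.mean(dim=0)              # (1/Tr) Σ_r h(x, r)

    ce = F.cross_entropy(classifier(mean_feat), y)   # classification on the de-confounded estimate
    avc = (feats - mean_feat).abs().mean()           # L1 deviation from the mean over the Tr samples
    return ce + beta * avc
```

Since the classifier is a single linear layer, classifying the averaged feature is equivalent to averaging the per-sample predictions of Eq. (5).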

5. Experiments

5.1. Datasets and Settings

Datasets. We evaluated the proposed CiiV and other defenders on three benchmark datasets: MNIST, CIFAR-10, and mini-ImageNet [68]. MNIST and CIFAR-10 contain 60k gray-scale and 50k colour training samples, respectively; the image sizes are 28×28 for MNIST and 32×32 for CIFAR-10, and both have 10 categories and 10k test images. mini-ImageNet was originally proposed by Vinyals et al. [68] for few-shot recognition; it consists of 100 classes, each with 600 images of size 84×84. We split it into train/val/test sets with 42k/6k/12k images.

Training Details. All models were trained using the SGD optimizer with 0.9 momentum and 5e-4 weight decay. Experiments were conducted on one GTX 2080Ti GPU with a batch size of 128 for 100 epochs in total. The learning rate lr and the trade-off parameter β both started at 0.1 and were multiplied by factors of 0.2 and 10, respectively, after every 25 epochs. A ResNet18 [24] backbone was used for all models except in the black-box experiments.
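For reference, a sketch of this optimization schedule (SGD with momentum 0.9 and weight decay 5e-4; lr and β starting at 0.1 and multiplied by 0.2 and 10, respectively, every 25 epochs) could look as follows; the backbone and the per-epoch training body are placeholders.

```python
import torch
from torchvision.models import resnet18

model = resnet18(num_classes=10)  # ResNet18 backbone, 10 classes for MNIST/CIFAR-10
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9, weight_decay=5e-4)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=25, gamma=0.2)  # lr x0.2 every 25 epochs

beta = 0.1
for epoch in range(100):
    # ... one training epoch with batch size 128 and the CiiV objective goes here ...
    scheduler.step()
    if (epoch + 1) % 25 == 0:
        beta *= 10.0  # the AVC trade-off beta grows by a factor of 10 every 25 epochs
```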

5.2. Evaluation Checklist

To verify that CiiV does not suffer from flawed or incomplete evaluations, we followed Carlini et al. [8] and conducted a series of sanity checks in our diagnosis:

• Iterative attacks are stronger than single-step attacks, e.g., PGD vs. FGSM in Tables 1 & 2 and PGD-1 vs. PGD-20 in Figure 6 (b).

• Unbounded adversarial examples reduce the model to random guessing or 0% accuracy (Figure 6 (a)), and the accuracy converges as the number of attack steps increases (Figure 6 (b)).

• Transferred black-box adversarial examples are weaker than white-box adversarial examples: Tables 1 & 3.

• Gradient-based attacks are stronger than gradient-free attacks, and brute-force search is not as competitive as other attacks: Tables 1 & 6.

• Both targeted and untargeted attacks are investigated, e.g., the attack success rates reported in Table 4.

• BPDA-/EOT-versions of the attackers are evaluated to check for potential obfuscated gradients in the defender: Table 5.

5.3. Diagnosis of Adversarial Robustness

Details of Threat Models. We mainly evaluated the defenders on the clean test set together with four threat models: FGSM [20], PGD-20 [38], PGD-20 (l2) [38], and C&W [9]. Other threat models involved in our experiments are introduced in the corresponding sections. For FGSM and PGD-20, the adversarial perturbation was created under the l∞ norm, where the budget radius ε was 32/255 for MNIST and 8/255 for CIFAR-10 and mini-ImageNet. PGD-20 ran 20 steps with a step size of 2/255. For PGD-20 (l2) and C&W, D_ε was calculated under the l2 norm. The step size and ε of PGD-20 (l2) were 20/255 and 128/255 for all datasets. C&W ran 1k steps with a 0.1 learning rate and a 1e-4 box constraint. The implementations of these threat models were mostly modified from the project [32].

Details of Defenders. We divided the defenders into Adversarial Training (AT) and non-AT approaches. For AT, we adopted three popular defenders: FGSM-AT, PGD-20-AT, and PGD-20-AT (l2), using the same parameters as the corresponding threat models. We did not report C&W-AT, as its computational overhead was prohibitive due to the 1k attack steps. For non-AT methods, we investigated Mixup [80], BPFC [1], and randomized smoothing (RS) [13]. The implementations and hyper-parameters of Mixup and BPFC were adopted from their official GitHub repositories. RS was re-implemented in our framework with σ = 0.25 and the number of test trials n = 10. The proposed CiiV is also a non-AT method; its modules and hyper-parameters are discussed in the ablative studies.

Figure 6. (a) Unbounded Attack: performance under unbounded attack budgets on CIFAR-10. (b) Attack Convergence: performance with increasing attack iterations on CIFAR-10.

Adversarial Robustness Against White-box Attack. As reported in Table 1 and Table 2, we tested white-box attacks, which have full access to the architectures and weights of the DNN models, on the MNIST, CIFAR-10, and mini-ImageNet datasets. The proposed CiiV achieved the best overall performance on CIFAR-10 and mini-ImageNet among both AT and non-AT methods. On MNIST, we found that all defenders achieved good adversarial robustness even when we allowed a larger attack budget radius. This is because the clean background limits the number of potential confounding patterns, which further supports the proposed causal theory of adversarial examples, as the number of co-occurring patterns significantly influences the number of confounding features. Such limited confounding effects allow intuitive AT defenders to exhaustively block all potential confounders and obtain outstanding performance on MNIST, e.g., conventional PGD-20-AT achieved 0.21% higher overall performance than the proposed CiiV. Meanwhile, on CIFAR-10 and mini-ImageNet, which are based on real-world images, the proposed CiiV achieved 6.76% and 2.27% higher overall performance than the runner-up defender. In particular, the runner-up RS had significantly better adversarial robustness against l2-norm attackers because it is a certified l2-norm defender, yet its l∞ robustness was relatively worse. Across all three datasets, the proposed CiiV consistently improved the robustness in both the l2 and l∞ settings.

Adversarial Robustness Against Black-box Attack. To test the adversarial robustness against transferred adversarial examples, we report an FGSM-based black-box attack in Table 3, where the network architectures and weights were not accessible to the attack methods. We investigated three source models: ResNet18, ResNet50 [24], and VGG16 [58] on the CIFAR-10 (ε = 8/255) and MNIST (ε = 32/255) datasets. The target defender models used ResNet18, whose random seeds were different from those of the source ResNet18 models. We noticed that the black-box robustness of all models was better than the corresponding white-box robustness, which supports the validity of the proposed CiiV according to [8].

Adversarial Robustness Against Targeted Attack. The attack success rates of the proposed CiiV under untargeted and targeted white-box attackers are shown in Table 4 on the CIFAR-10 dataset.


Table 1. White-box Attack: the performance of defenders on CIFAR-10 and MNIST. The upper half contains the baseline and adversarial training methods; the bottom half reports adversary-free defenders, including the proposed CiiV. The budget radius ε is 8/255 and 32/255 for all l∞ attackers on CIFAR-10 and MNIST, respectively, and 128/255 for PGD-20(l2) on both datasets.

CIFAR-10 (ε = 8/255):
Defender         Clean  FGSM   PGD-20  PGD-20(l2)  C&W    Overall
Baseline         87.44  15.55   0.22    9.61       12.11  24.99
FGSM-AT          73.58  42.63  31.35   52.61       38.21  47.68
PGD-20-AT        71.65  44.01  37.03   56.25       38.05  49.40
PGD-20-AT(l2)    80.25  34.45  18.75   55.51       28.61  43.51
Mixup [80]       86.88  27.47   1.02   12.91       20.31  29.72
BPFC [1]         67.31  39.21  33.01   53.66       33.91  45.42
RS [13]          80.81  44.30  21.74   56.16       76.74  55.95
CiiV (ours)      75.00  50.73  38.03   74.54       75.23  62.71

MNIST (ε = 32/255):
Defender         Clean  FGSM   PGD-20  PGD-20(l2)  C&W    Overall
Baseline         99.65  78.02  26.12   96.20       98.84  79.77
FGSM-AT          99.07  99.13  32.66   66.67       98.52  79.21
PGD-20-AT        99.56  97.19  96.97   98.28       99.22  98.24
PGD-20-AT(l2)    99.64  93.36  78.94   98.63       99.16  93.95
Mixup [80]       99.66  91.75  29.31   92.91       98.46  82.42
BPFC [1]         99.07  93.26  91.22   97.49       98.40  95.89
RS [13]          99.51  89.27  80.89   98.05       99.57  93.46
CiiV (ours)      99.47  96.10  95.80   99.28       99.52  98.03

Table 2. White-box Attack: experiments on mini-ImageNet.

Defender         Clean  FGSM   PGD-20  PGD-20(l2)  C&W    Overall
Baseline         64.27   2.21   0.00    4.02        1.59  14.42
FGSM-AT          45.75  22.48  12.67   25.92       17.97  24.96
PGD-20-AT        44.22  21.02  17.71   32.92       15.69  26.31
PGD-20-AT(l2)    58.66   7.42   1.10   42.05        6.33  23.11
Mixup            63.66   8.91   0.05    8.03        3.96  16.92
BPFC             42.61  18.48  14.45   36.71       15.22  25.49
RS               54.43  16.46   2.93   41.38       47.20  32.48
CiiV (ours)      45.25  22.89  15.61   45.05       44.95  34.75

Table 3. Black-box Attack: the performance against transferred black-box FGSM attacks. The columns list the source models used to craft the attacks; all defenders use ResNet18.

                 CIFAR-10                      MNIST
Defender         ResNet18  ResNet50  VGG16     ResNet18  ResNet50  VGG16
FGSM-AT          53.73     55.18     54.10     97.76     97.73     98.06
PGD-20-AT        51.93     52.56     53.55     97.05     97.34     95.03
PGD-20-AT(l2)    52.12     54.50     53.49     95.57     95.83     94.99
Mixup            46.23     51.94     46.69     92.08     92.76     94.96
BPFC             46.10     48.63     59.36     95.28     95.35     96.71
RS               57.06     58.13     66.71     94.36     95.49     94.20
CiiV (ours)      57.41     58.21     61.06     96.96     97.15     96.34

Table 4. Untargeted vs. Targeted Attack: the attack success rates against the proposed CiiV on CIFAR-10.

Attack protocol           FGSM   PGD-20  PGD-20(l2)  C&W
untargeted                49.27  61.97   25.46       24.77
targeted (most likely)    47.84  57.11   25.32       24.57
targeted (least likely)   34.51  33.40   24.58       24.53
targeted (random)         36.93  39.09   24.75       24.55

The attack success rate is the probability of the model predicting a wrong label. The untargeted results used the same default settings as in Table 1. The targeted results were divided into three protocols: 1) most-likely category, 2) least-likely category, and 3) random category, i.e., the targeted (non-ground-truth) category had the highest, lowest, or an arbitrary logit during clean testing. The results reveal that targeted categories with a higher confounding effect are easier to attack. Besides, untargeted attacks were stronger than all targeted ones, as they can make use of all confounding effects.

Adversarial Robustness Against Unbounded Attack. To evaluate the validity of the defenders, we show the performance of the proposed CiiV together with FGSM-AT and PGD-20-AT under unbounded attacks in Figure 6 (a). When the budget radius increased from 8 to 64, all three defenders either converged to 0% accuracy for the strong PGD-20 attack or converged to random guessing for the weak FGSM attack.

Table 5. Anti-Obfuscated-Gradient Attack: we tested the EOT and BPDA [3] attackers on all AT-free defenders.

                 CIFAR-10                      MNIST
Defender         BPDA-PGD  EOT-PGD  EOT        BPDA-PGD  EOT-PGD  EOT
Mixup             1.02      1.02     4.07      29.31     29.31    11.21
BPFC             33.01     33.01    32.60      91.22     91.22    85.60
RS               26.98     19.62    22.34      88.06     80.62    77.20
CiiV (ours)      63.82     35.57    42.30      96.72     95.10    93.47

This is an inevitable phenomenon: with a large enough budget D_ε, attackers poison the normal feature extraction and affect the causal effect X → Y. Being successfully defeated by unbounded attacks shows that CiiV did not exploit any distribution bias.

Adversarial Robustness Against Anti-Obfuscated-Gradient Attack. The biggest concern for any non-AT defender is, without doubt, the obfuscated-gradient issue [3]. Therefore, we used the BPDA and EOT attackers to evaluate all the investigated non-AT defenders in Table 5. BPDA-PGD followed the same PGD-20 settings but inserted an identity mapping layer to skip all noising or sampling layers. EOT-PGD was a PGD-20 variant that accumulated gradients over 10 samples before each PGD update. EOT was directly adopted from the official implementation [4]. Based on the reported results, CiiV is effectively immune to obfuscated gradients, while randomized smoothing is significantly more vulnerable to these attacks.

Adversarial Robustness Against Gradient-Free Attack. We investigated three gradient-free attackers: 1) GN (Gaussian Noise), 2) BFS (Brute-Force Search) [8], and 3) SPSA [67]. GN added Gaussian noise with sigma = 0.1/0.25 for CIFAR-10/MNIST, respectively, to the input images. BFS ran GN 100 times and reported the most vulnerable adversarial examples. Note that both GN and BFS were constrained by the l∞ budget, 8/255 for CIFAR-10 and 32/255 for MNIST. For SPSA, which numerically approximates the gradients, the hyper-parameters were set as follows: δ = 0.1, steps = 20, lr = 0.1, batch size = 16. All gradient-free attackers were significantly weaker than the gradient-based attackers, as expected [8]. Among all the defenders, the proposed CiiV and RS had the closest GN vs. BFS performance, which indicates that they significantly narrow the confounding features, making them hard to find by brute-force search.


Figure 7. (a) The performance of CiiV trained with different T_r. (b) The performance of CiiV tested with different T_r.

Table 6. Gradient-Free Attack: the performance under the GN (Gaussian Noise), BFS [8], and SPSA [67] attackers.

                 CIFAR-10                  MNIST
Defender         GN     BFS    SPSA        GN     BFS    SPSA
Baseline         86.28  75.77  70.46       99.49  98.17  99.59
FGSM-AT          73.66  70.54  71.73       98.97  91.72  99.06
PGD-20-AT        71.31  68.82  70.52       99.51  99.15  99.47
PGD-20-AT(l2)    80.11  77.02  77.78       99.61  99.22  99.48
Mixup            86.13  75.21  66.52       98.71  94.78  99.04
BPFC             67.26  65.23  65.98       99.05  98.30  99.00
RS               78.42  78.08  77.95       99.59  99.41  99.59
CiiV (ours)      74.89  74.24  74.94       99.45  99.35  99.40

Table 7. Ablative Studies: different modules of the proposed CiiV evaluated on CIFAR-10, where w/o Bipolar, N1, and L2 denote CiiV without ON-/OFF-bipolar duplication, CiiV using N = 1 during sampling, and CiiV using the l2 version of the AVC loss, respectively.

Variant             Clean  FGSM   PGD-20  PGD-20(l2)  C&W    Overall
Reti-Sampling       72.53  44.81  30.87   71.55       72.18  58.39
CiiV-w/o Bipolar    65.39  49.77  41.91   64.90       65.16  57.43
CiiV-N1             65.85  51.42  42.53   65.25       65.64  58.14
CiiV-L2             73.46  50.05  38.63   72.66       73.53  61.67
CiiV (default)      75.00  50.73  38.03   74.54       75.23  62.71

Adversarial Robustness Against Adaptive Attack. Since CiiV contains newly proposed modules and losses, beyond the above studies we further designed two adaptive strategies to better evaluate the adversarial robustness achieved by CiiV. 1) Due to the stochasticity of the retinotopic sampling layer, an identity mapping function can skip this layer to generate deterministic adversarial samples. Therefore, we introduced BPDA-PGD in the anti-obfuscated-gradient attack, which replaces the sampling layer with an identity mapping during back-propagation. As shown in Table 5, CiiV achieved even better performance under BPDA-PGD than under the original PGD-20, proving that retinotopic sampling does not cause obfuscated gradients. 2) As to the AVC loss, we further applied it within the PGD-20 attacker, dubbed Adaptive-PGD, which generates adversarial examples using the joint combination of the cross-entropy loss and the AVC loss. We achieved 36.56 and 98.29 on the CIFAR-10 and MNIST datasets, respectively, where the original PGD-20 performances were 38.03 and 95.80. The proposed adaptive attack only had a limited influence on the CIFAR-10 performance and even improved on MNIST due to its insufficient confounding features, proving the reliability of the proposed CiiV framework.

Ablation Studies. In this section, we evaluate the effectiveness of the different modules of the proposed CiiV.

Figure 8. (a) Generated perturbations of the baseline model and CiiV on mini-ImageNet. (b) Adversarial examples that deceive the baseline model and CiiV on a confounded toy dataset.

As reported in Table 7: 1) Reti-Sampling: simply using retinotopic sampling without the AVC loss achieves only limited robustness against the various attacks; 2) CiiV-w/o Bipolar: removing the bipolar augmentation during sampling obtains higher PGD-20 performance but becomes less robust in the other settings; 3) CiiV-N1: setting N = 1 during sampling increases the l∞ robustness but makes the other settings unstable; 4) CiiV-L2: the l2 version of the AVC loss is slightly worse than the l1 version due to the sparsity of the feature space. We also investigated different T_r during the training and inference stages, as shown in Figure 7. A larger training T_r leads to better convergence of the AVC loss due to its asymptotic nature, while the inference performance is not sensitive to T_r. Overall, the default settings of CiiV were N = 3, T_r = 15/20/10, and ω = 0.8/0.9/0.8 for CIFAR-10/MNIST/mini-ImageNet, respectively. More detailed studies are given in the Appendix.

5.4. Visualization

Perturbations for mini-ImageNet. We visualize the generated PGD-20 perturbations for the baseline model and CiiV in Figure 8 (a). It shows that the proposed CiiV successfully removes the imperceptible confounders, as the generated perturbations try to either remove or modify the semantically meaningful patterns, i.e., the causal features.

Adversarial Examples for a Confounded Toy Dataset. To illustrate the proposed confounding theory of adversarial examples, we also constructed a Confounded Toy (CToy) dataset made of geometric shapes and color blocks, where the deterministic shapes are causal features and the color blocks, sampled from a biased distribution, are confounders. The details of the CToy dataset are introduced in the Appendix. As shown in Figure 8 (b), the PGD-20 adversarial examples of the baseline aim at changing the confounding colors, while those of CiiV change the causal shapes.

6. Conclusion

In this paper, we presented the CiiV framework, which serves as a proactive defender without adversarial training. CiiV consists of a differentiable retinotopic sampling layer and an AVC loss, performing causal intervention with instrumental variables on DNN models. Following the checklist of Carlini et al. [8], we demonstrated that the proposed CiiV is a valid defender against various adaptive attacks. CiiV provides a fundamental causal viewpoint of adversarial examples, which has great potential in guiding the design of future defenders.

A. Details of CToy Dataset

In Section 5.4 of the main paper, we introduced a Confounded Toy (CToy) dataset to demonstrate the equivalence between the confounding effect and the adversarial perturbation. The proposed CToy is a three-way classification task containing triangles, squares, and circles. It has 10k/1k/1k images for the train/val/test splits, respectively. All samples are 64×64 colour images. Besides the causal shapes, there are also confounding patterns, i.e., red/blue/green blocks of size 4×4 pixels. Unlike the deterministic shape, the color of each block is sampled from a biased distribution. For triangle, square, and circle images, each co-occurring block has an 80%/10%/10%, 10%/80%/10%, and 10%/10%/80% probability of being blue/green/red, respectively. Therefore, if the confounding distribution stays the same in both the training and testing phases, these patterns are indeed "predictive" features. Yet, learning these confounding patterns would significantly reduce the generalization ability of the model, because there will always be samples dominated by rare color blocks. The causal graphs of the data generation procedure and more examples of the CToy dataset are illustrated in Figure 9 (a) & (b).
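A simplified NumPy sketch of this generation process is given below: the class decides the shape deterministically (the causal feature), while the 4×4 colour blocks are drawn from the class-biased distribution (the confounder). The exact shape rendering and colours used in the paper may differ; this is only an illustration of the sampling scheme.

```python
import numpy as np

rng = np.random.default_rng(0)
CLASSES = ["triangle", "square", "circle"]
# P(block colour | class); columns are (blue, green, red)
COLOUR_BIAS = np.array([[0.8, 0.1, 0.1],   # triangles -> mostly blue blocks
                        [0.1, 0.8, 0.1],   # squares   -> mostly green blocks
                        [0.1, 0.1, 0.8]])  # circles   -> mostly red blocks
PALETTE = np.array([[0, 0, 255], [0, 255, 0], [255, 0, 0]], dtype=np.uint8)  # blue, green, red (RGB)

def make_ctoy_sample(label, size=64, block=4):
    """Tile biased 4x4 colour blocks (confounder), then draw the class shape (causal feature)."""
    img = np.zeros((size, size, 3), dtype=np.uint8)
    for i in range(0, size, block):                      # confounding background blocks
        for j in range(0, size, block):
            img[i:i + block, j:j + block] = PALETTE[rng.choice(3, p=COLOUR_BIAS[label])]
    yy, xx = np.mgrid[:size, :size]                      # deterministic causal shape (white)
    if CLASSES[label] == "circle":
        mask = (yy - size / 2) ** 2 + (xx - size / 2) ** 2 < (size / 4) ** 2
    elif CLASSES[label] == "square":
        mask = (abs(yy - size / 2) < size / 4) & (abs(xx - size / 2) < size / 4)
    else:  # a rough triangle: apex at the top quarter, widening downwards
        mask = (yy > size / 4) & (abs(xx - size / 2) < (yy - size / 4) / 2)
    img[mask] = 255
    return img, label

img, y = make_ctoy_sample(label=0)   # a triangle sample whose blocks are mostly blue
```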

Based on the specifically designed CToy, which contains only two kinds of patterns, causal shapes and confounding colors, we are able to understand which pattern causes the adversarial vulnerability. As we can see from Figure 9 (c), the PGD-20 (ε = 128/255) adversarial examples generated from the baseline DNN model mainly erase the original color blocks for squares and circles, i.e., the adversarial perturbation is indeed trying to maximize the tampered confounding effect. As for the adversarial examples of triangles from the baseline DNN model, although they contain perceptible stains, the blue blocks are all replaced by green backgrounds, which means the confounding effect still dominates the adversarial perturbations. On the other hand, the adversarial examples of the proposed CiiV do not change the overall colors; instead, they directly try to modify the shapes.

Figure 9. (a) The causal graphs that generate the CToy dataset. (b) More examples (triangles, squares, and circles) from the proposed CToy. (c) More adversarial examples of the baseline DNN model and the CiiV model on CToy.

This proves that CiiV successfully prevents the model from learning confounding effects, and thus the attacker can only attack the causal features.

With the help of the CToy dataset, we are not only able to verify the proposed confounding theory of adversarial examples for conventional DNN models but also to visualize the working mechanism of the proposed CiiV framework, i.e., forcing the model to learn from causal patterns rather than the confounding colors.

B. Details of Retinotopic Sampling

The proposed retinotopic sampling layer has a centralized sampling distribution r, whose sampling frequency decreases from a random center to the peripheries. Since human vision benefits from continuous eye movement over different observation centers [35], we use diverse sampling centers to imitate eye movement. Without loss of generality, the sampling center (x, y) is randomly selected from 9 candidates in the proposed CiiV: (w/6, h/6), (w/6, h/2), (w/6, 5h/6), (w/2, h/6), (w/2, h/2), (w/2, 5h/6), (5w/6, h/6), (5w/6, h/2), and (5w/6, 5h/6), where w and h are the width and height of the image. These 9 candidate centers are illustrated in Figure 10. Therefore, given the sampling center (x, y), the retinotopic sampling distribution r(x, y) is defined as follows:

r_ij(x, y) = g(‖(i, j) − (x, y)‖_2) + ε > τ,   (9)


Table 8. Performance on CIFAR-10 using different non-linear functions g(·) for the retinotopic sampling, which shows that the proposed CiiV is not sensitive to the implementation of the retinotopic sampling distribution r as long as the sampling frequency decreases smoothly from the center to the peripheries.

g(·)       Clean  FGSM   PGD-20  PGD-20(l2)  C&W    Overall
g1(·)      74.77  50.11  37.96   73.88       74.65  62.27
g2(·)      75.22  50.41  37.20   74.50       75.18  62.50
default    75.00  50.73  38.03   74.54       75.23  62.71

Figure 10. The nine candidate centers for the retinotopic sampling distribution r(x, y).

where i ∈ [0, w] and j ∈ [0, h] are the indices of the image pixels, g(·) is a non-linear mapping that can be implemented by various smoothing functions, ε is uniformly sampled from [0, 1], and τ = 0.9 is the sampling threshold.

Note that the non-linear function g(·) of r(x, y) can take various implementations, which do not affect the performance of the proposed CiiV much as long as the sampling frequency decreases from the center (x, y) to the peripheries, as shown in Figure 10. We adopt a normalized mapping g(x) = h((max(x) − x + α)^γ) with α = 10.0, γ = 0.3, and h(x) = x/max(x) as our default setting, and we further tested two simpler non-linear functions, g1(x) = 1.0 − x/100 and g2(x) = 2.5/(0.5 × x^0.5), in Table 8. According to the experimental results, all versions of g(·) perform well on CIFAR-10, which shows that CiiV is a general framework that is not sensitive to the detailed implementation. The main reason we chose a more complex implementation of g(x) is that it dynamically fits the image size; the two simpler functions g1(x) and g2(x) require changing parameters for different image sizes, which is less convenient than our default g(x).
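A NumPy sketch of the mask construction in Eq. (9) with the default g(·) described above is shown below; the nine candidate centres follow Figure 10, and the exact normalization details are illustrative.

```python
import numpy as np

def retinotopic_mask(w, h, alpha=10.0, gamma=0.3, tau=0.9, rng=np.random.default_rng()):
    """Build a binary sampling mask r(x, y) as in Eq. (9) with the default g(.)."""
    # pick a random observation centre from the nine candidates of Figure 10
    centres = [(w * a, h * b) for a in (1/6, 1/2, 5/6) for b in (1/6, 1/2, 5/6)]
    cx, cy = centres[rng.integers(len(centres))]

    ii, jj = np.mgrid[:w, :h]
    dist = np.sqrt((ii - cx) ** 2 + (jj - cy) ** 2)      # ||(i, j) - (x, y)||_2

    # default g(x) = h((max(x) - x + alpha)^gamma) with h(x) = x / max(x):
    # the sampling frequency decays smoothly from the centre to the peripheries
    g = (dist.max() - dist + alpha) ** gamma
    g = g / g.max()

    eps = rng.uniform(0.0, 1.0, size=dist.shape)         # eps ~ U[0, 1]
    return (g + eps > tau).astype(np.float32)            # r_ij = [g(.) + eps > tau]

mask = retinotopic_mask(32, 32)    # e.g. a CIFAR-10-sized mask
```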

Table 9. Ablation studies on the hyper-parameters of the proposed CiiV, where the default setting uses N = 3, ω = 0.9, β = 0.1, (T)Tr = 15, (E)Tr = 15. (T) and (E) denote the training and evaluation stages, respectively.

Setting      Clean  FGSM   PGD-20  PGD-20(l2)  C&W    Overall
N = 1        65.58  50.72  41.54   65.21       65.50  57.71
N = 2        71.74  50.84  39.79   71.14       72.15  61.13
N = 4        76.01  49.19  35.32   74.99       76.22  62.35
N = 5        76.59  47.77  32.76   75.73       76.66  61.90
ω = 1.2      68.59  48.93  39.83   68.08       68.48  58.78
ω = 1.1      70.22  48.93  38.76   69.52       70.25  59.54
ω = 1.0      72.90  49.28  38.02   71.72       72.76  60.94
ω = 0.8      76.21  51.44  37.25   75.35       76.35  63.32
ω = 0.7      77.12  49.49  35.50   75.95       76.96  63.00
β = 0.01     76.98  49.92  35.06   75.57       76.53  62.81
β = 1.0      68.32  48.48  40.33   67.70       68.12  58.59
(T)Tr = 1    72.53  44.81  30.87   71.55       72.18  58.39
(T)Tr = 2    73.46  47.89  36.30   72.35       73.30  60.66
(T)Tr = 5    74.46  49.37  36.73   73.78       74.21  61.71
(T)Tr = 10   74.94  49.85  37.76   74.02       74.74  62.26
(E)Tr = 1    73.12  49.46  37.78   72.00       72.68  61.01
(E)Tr = 2    74.32  50.11  38.01   72.99       74.10  61.91
(E)Tr = 5    74.67  50.00  37.87   73.96       74.71  62.24
(E)Tr = 10   75.36  50.47  38.00   74.55       75.26  62.73
default      75.00  50.73  38.03   74.54       75.23  62.71

C. Additional Ablation Studies

In this section, we present additional ablation studies on the hyper-parameters N, ω, β, and T_r of the proposed CiiV.

As shown in Table 9, for the number of sampling times N, which mimics the over-completeness of neural representations [43], a larger N increases the clean accuracy while reducing the l∞ adversarial robustness: the proposed CiiV relies on the instrumental sampling to break the non-robust confounder patterns into random sub-patterns, and a larger N may approximate the original image, i.e., lead to learnable sub-patterns. As for the sampling parameter ω, which represents the best exposure time of the images, its value may vary from dataset to dataset; on CIFAR-10 our default setting is ω = 0.9. Larger and smaller ω correspond to over-exposure and under-exposure, as shown in Figure 5 (b), causing parts of the images to be erased by white or dark blocks. As for the initial value of the trade-off parameter β, it balances the cross-entropy classification loss and the AVC loss: a larger β decreases the classification performance, yet a smaller β hurts the convergence of the de-confounded causal effect. Therefore, the trade-off parameter β is set to 0.1 based on the experiments shown in Table 9. Finally, the number of independent retinotopic samplings T_r is set to 15 for both the training (T) and evaluation (E) stages. As discussed in the main paper, the AVC loss requires a large training T_r to converge due to its asymptotic nature, and the performance keeps increasing with the training T_r. However, thanks to the convergence of the AVC loss, the evaluation T_r can take a smaller value, e.g., (E)Tr = 10 is even slightly better than (E)Tr = 15, which makes the evaluation of the proposed CiiV more efficient.


Figure 11. PGD-20 (ε = 32/255) adversarial examples and perturbations of the proposed CiiV and the baseline model on MNIST.

Figure 12. PGD-20 (ε = 8/255) adversarial examples and perturbations of the proposed CiiV and the baseline model on CIFAR-10.


D. Computation Overhead

Due to the AVC loss, the computational overhead of the proposed CiiV is higher than that of conventional training. Taking CIFAR-10 as an example, it takes 264.37s and 17.72s to train each epoch with CiiV and the baseline model, respectively, using one GTX 2080Ti GPU and a batch size of 128. However, compared with the corresponding adversarial training, the proposed CiiV is still much faster, because it only requires a single backward propagation, e.g., it takes 296.35s to train one epoch with PGD-20-AT. Besides, if we adopt a parallel implementation of the multiple T_r samplings, the training speed of CiiV could be almost as fast as conventional training, while PGD-20-AT has to run each attack iteration sequentially. In the evaluation stage, since a small T_r is sufficient, the computational overhead of inference is much cheaper, e.g., it takes 5e-3s and 2e-3s to evaluate each batch (128 images) for the (E)Tr = 1 CiiV and the baseline model, respectively.

Figure 13. PGD-20 (ε = 8/255) adversarial examples and perturbations of the proposed CiiV and the baseline model on mini-ImageNet.

E. Additional Visualized Results

This section provides more PGD-20 adversarial examples and perturbations on the MNIST, CIFAR-10, and mini-ImageNet datasets for the baseline model and the proposed CiiV. As shown in Figure 11, for MNIST with its clean backgrounds the co-occurring confounders are limited, so for both the baseline model and CiiV the attacker can only exploit random patterns to override the original digits in the adversarial examples. For CIFAR-10 and mini-ImageNet, as shown in Figures 12 and 13, the attacker against the baseline models mainly changes the texture of the images, i.e., it uses tampered confounding effects to attack the model. Meanwhile, the adversarial perturbations for CiiV contain significantly more structured patterns, i.e., causal patterns, which means the proposed CiiV successfully prevents the model from learning confounding effects, so the attacker can only modify the causal features. Besides, it is well known that adversarial perturbations can partially reveal the interpretability of DNN models [17, 66]. Therefore, the proposed CiiV also improves the interpretability of the DNN model.
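For reference, visualized examples of this kind can be produced with a standard l∞ PGD-20 attack. The sketch below is generic PGD, not a CiiV-specific attack; the model and the data batch (x, y) are assumed to exist, and ε matches the figure captions (32/255 for MNIST, 8/255 for CIFAR-10 and mini-ImageNet).

```python
# Generic l_inf PGD-20 sketch used only to illustrate how the visualized
# adversarial examples and perturbations can be generated.
import torch
import torch.nn.functional as F

def pgd_attack(model, x, y, eps=8/255, alpha=2/255, steps=20):
    """Run `steps` gradient-sign updates, projecting back into the eps-ball around x."""
    x_adv = (x + torch.empty_like(x).uniform_(-eps, eps)).clamp(0, 1).detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        x_adv = x_adv.detach() + alpha * grad.sign()
        x_adv = (x + (x_adv - x).clamp(-eps, eps)).clamp(0, 1).detach()  # project into the eps-ball
    return x_adv

# The visualized perturbation is simply x_adv - x; rescale it (e.g., pert / (2 * eps) + 0.5)
# before saving it as an image.
```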

References

[1] Sravanti Addepalli, Arya Baburaj, Gaurang Sriramanan, and R Venkatesh Babu. Towards achieving adversarial robustness by enforcing feature consistency across bit planes. In CVPR, 2020.

[2] Michael J Arcaro, Stephanie A McMains, Benjamin D Singer, and Sabine Kastner. Retinotopic organization of human ventral visual cortex. Journal of Neuroscience, 2009.

[3] Anish Athalye, Nicholas Carlini, and David Wagner. Obfuscated gradients give a false sense of security: Circumventing defenses to adversarial examples. In International Conference on Machine Learning, 2018.


[4] Anish Athalye, Logan Engstrom, Andrew Ilyas, and Kevin Kwok. Synthesizing robust adversarial examples. In ICML. PMLR, 2018.

[5] Alexander Balke and Judea Pearl. Bounds on treatment effects from studies with imperfect compliance. Journal of the American Statistical Association, 1997.

[6] Wieland Brendel, Jonas Rauber, and Matthias Bethge. Decision-based adversarial attacks: Reliable attacks against black-box machine learning models. ICLR, 2018.

[7] Jacob Buckman, Aurko Roy, Colin Raffel, and Ian Goodfellow. Thermometer encoding: One hot way to resist adversarial examples. In ICLR, 2018.

[8] Nicholas Carlini, Anish Athalye, Nicolas Papernot, Wieland Brendel, Jonas Rauber, Dimitris Tsipras, Ian Goodfellow, Aleksander Madry, and Alexey Kurakin. On evaluating adversarial robustness. arXiv, 2019.

[9] Nicholas Carlini and David Wagner. Towards evaluating the robustness of neural networks. In Symposium on Security and Privacy (SP). IEEE, 2017.

[10] Nicholas Carlini and David Wagner. Audio adversarial examples: Targeted attacks on speech-to-text. In Security and Privacy Workshops (SPW), 2018.

[11] Pin-Yu Chen, Huan Zhang, Yash Sharma, Jinfeng Yi, and Cho-Jui Hsieh. Zoo: Zeroth order optimization based black-box attacks to deep neural networks without training substitute models. In Workshop on Artificial Intelligence and Security, 2017.

[12] Moustapha Cisse, Yossi Adi, Natalia Neverova, and Joseph Keshet. Houdini: Fooling deep structured prediction models. arXiv, 2017.

[13] Jeremy Cohen, Elan Rosenfeld, and Zico Kolter. Certified adversarial robustness via randomized smoothing. In ICML, 2019.

[14] Alexander D'Amour. On multi-cause causal inference with unobserved confounding: Counterexamples, impossibility, and alternatives. In AISTATS, 2019.

[15] Yunfeng Diao, Tianjia Shao, Yong-Liang Yang, Kun Zhou, and He Wang. Basar: Black-box attack on skeletal action recognition. In CVPR, 2021.

[16] Xinshuai Dong, Hong Liu, Rongrong Ji, Liujuan Cao, Qixiang Ye, Jianzhuang Liu, and Qi Tian. Api-net: Robust generative classifier via a single discriminator. In ECCV, 2020.

[17] Christian Etmann, Sebastian Lunz, Peter Maass, and Carola Schonlieb. On the connection between adversarial robustness and saliency map interpretability. In ICML, 2019.

[18] Samuel G Finlayson, John D Bowers, Joichi Ito, Jonathan L Zittrain, Andrew L Beam, and Isaac S Kohane. Adversarial attacks on medical machine learning. Science, 2019.

[19] Justin Gilmer, Luke Metz, Fartash Faghri, Samuel S Schoenholz, Maithra Raghu, Martin Wattenberg, and Ian Goodfellow. Adversarial spheres. Workshop of ICLR, 2018.

[20] Ian J Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. In ICML, 2015.

[21] Chuan Guo, Mayank Rana, Moustapha Cisse, and Laurens Van Der Maaten. Countering adversarial images using input transformations. In ICLR, 2018.

[22] Zijian Guo and Dylan S Small. Control function instrumental variable estimation of nonlinear causal effect models. The Journal of Machine Learning Research, 2016.

[23] Jason Hartford, Greg Lewis, Kevin Leyton-Brown, and Matt Taddy. Deep IV: A flexible approach for counterfactual prediction. In ICML, 2017.

[24] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.

[25] Suzana Herculano-Houzel. The remarkable, yet not extraordinary, human brain as a scaled-up primate brain and its associated cost. Proceedings of the National Academy of Sciences, 2012.

[26] Xinting Hu, Kaihua Tang, Chunyan Miao, Xian-Sheng Hua, and Hanwang Zhang. Distilling causal effect of data in class-incremental learning. In CVPR, 2021.

[27] Sandy Huang, Nicolas Papernot, Ian Goodfellow, Yan Duan, and Pieter Abbeel. Adversarial attacks on neural network policies. arXiv, 2017.

[28] Andrew Ilyas, Shibani Santurkar, Logan Engstrom, Brandon Tran, and Aleksander Madry. Adversarial examples are not bugs, they are features. NeurIPS, 2019.

[29] Harini Kannan, Alexey Kurakin, and Ian Goodfellow. Adversarial logit pairing. arXiv, 2018.

[30] Niki Kilbertus, Giambattista Parascandolo, and Bernhard Scholkopf. Generalization in anti-causal learning. arXiv preprint arXiv:1812.00524, 2018.

[31] Edward Kim, Jocelyn Rego, Yijing Watkins, and Garrett T Kenyon. Modeling biological immunity to adversarial examples. In CVPR, 2020.

[32] Hoki Kim. Torchattacks: A pytorch repository for adversarial attacks. arXiv, 2020.

[33] Helga Kolb, Eduardo Fernandez, and Ralph Nelson. Webvision: The organization of the retina and visual system. Book, 1995.

[34] Alexey Kurakin, Ian Goodfellow, Samy Bengio, et al. Adversarial examples in the physical world, 2016.

[35] Michael Land. Eye movements in man and other animals. Vision Research, 2019.

[36] Guanlin Li, Shuya Ding, Jun Luo, and Chang Liu. Enhancing intrinsic adversarial robustness via feature pyramid decoder. In CVPR, 2020.

[37] Yingzhen Li, John Bradshaw, and Yash Sharma. Are generative classifiers more robust to adversarial attacks? In ICML, 2019.

[38] Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks, 2018.

[39] Richard H Masland. The fundamental plan of the retina. Nature Neuroscience, 2001.

[40] Seyed-Mohsen Moosavi-Dezfooli, Alhussein Fawzi, and Pascal Frossard. Deepfool: A simple and accurate method to fool deep neural networks. In CVPR, 2016.

[41] Whitney K Newey and James L Powell. Instrumental variable estimation of nonparametric models. Econometrica, 2003.


[42] Yulei Niu, Kaihua Tang, Hanwang Zhang, Zhiwu Lu, Xian-Sheng Hua, and Ji-Rong Wen. Counterfactual VQA: A cause-effect look at language bias. CVPR, 2021.

[43] Bruno A Olshausen and David J Field. How close are we to understanding V1? Neural Computation, 2005.

[44] Nicolas Papernot, Patrick McDaniel, Ian Goodfellow, Somesh Jha, Z Berkay Celik, and Ananthram Swami. Practical black-box attacks against machine learning. In Asia Conference on Computer and Communications Security, 2017.

[45] Nicolas Papernot, Patrick McDaniel, Somesh Jha, Matt Fredrikson, Z Berkay Celik, and Ananthram Swami. The limitations of deep learning in adversarial settings. In EuroS&P, 2016.

[46] Judea Pearl. Causality. Cambridge University Press, 2009.

[47] Judea Pearl, Madelyn Glymour, and Nicholas P Jewell. Causal inference in statistics: A primer. John Wiley & Sons, 2016.

[48] Gege Qi, Lijun Gong, Yibing Song, Kai Ma, and Yefeng Zheng. Stabilized medical image attacks. In ICLR, 2021.

[49] Jiaxin Qi, Yulei Niu, Jianqiang Huang, and Hanwang Zhang. Two causal principles for improving visual dialog. In CVPR, 2020.

[50] Aditi Raghunathan, Jacob Steinhardt, and Percy Liang. Certified defenses against adversarial examples. ICLR, 2018.

[51] Manish Vuyyuru Reddy, Andrzej Banburski, Nishka Pant, and Tomaso Poggio. Biologically inspired mechanisms for adversarial robustness. NeurIPS, 2020.

[52] Kui Ren, Tianhang Zheng, Zhan Qin, and Xue Liu. Adversarial attacks and defenses in deep learning. Engineering, 2020.

[53] Hadi Salman, Andrew Ilyas, Logan Engstrom, Ashish Kapoor, and Aleksander Madry. Do adversarially robust imagenet models transfer better? arXiv preprint arXiv:2007.08489, 2020.

[54] Pouya Samangouei, Maya Kabkab, and Rama Chellappa. Defense-GAN: Protecting classifiers against adversarial attacks using generative models. ICLR, 2018.

[55] Ludwig Schmidt, Shibani Santurkar, Dimitris Tsipras, Kunal Talwar, and Aleksander Madry. Adversarially robust generalization requires more data. NeurIPS, 2018.

[56] Bernhard Scholkopf, Francesco Locatello, Stefan Bauer, Nan Rosemary Ke, Nal Kalchbrenner, Anirudh Goyal, and Yoshua Bengio. Toward causal representation learning. Proceedings of the IEEE, 2021.

[57] Lukas Schott, Jonas Rauber, Matthias Bethge, and Wieland Brendel. Towards the first adversarially robust neural network model on MNIST. In ICLR, 2019.

[58] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. ICLR, 2015.

[59] Aman Sinha, Hongseok Namkoong, Riccardo Volpi, and John Duchi. Certifying some distributional robustness with principled adversarial training. ICLR, 2018.

[60] Yang Song, Taesup Kim, Sebastian Nowozin, Stefano Ermon, and Nate Kushman. Pixeldefend: Leveraging generative models to understand and defend against adversarial examples. ICLR, 2018.

[61] Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. Intriguing properties of neural networks. ICLR, 2013.

[62] Kaihua Tang, Yulei Niu, Jianqiang Huang, Jiaxin Shi, and Hanwang Zhang. Unbiased scene graph generation from biased training. In CVPR, 2020.

[63] Florian Tramer, Jens Behrmann, Nicholas Carlini, Nicolas Papernot, and Jorn-Henrik Jacobsen. Fundamental tradeoffs between invariance and sensitivity to adversarial perturbations. In ICML, 2020.

[64] Florian Tramer, Nicholas Carlini, Wieland Brendel, and Aleksander Madry. On adaptive attacks to adversarial example defenses. arXiv, 2020.

[65] Florian Tramer, Alexey Kurakin, Nicolas Papernot, Ian Goodfellow, Dan Boneh, and Patrick McDaniel. Ensemble adversarial training: Attacks and defenses. ICLR, 2018.

[66] Dimitris Tsipras, Shibani Santurkar, Logan Engstrom, Alexander Turner, and Aleksander Madry. Robustness may be at odds with accuracy. In ICLR, 2018.

[67] Jonathan Uesato, Brendan O'Donoghue, Pushmeet Kohli, and Aaron Oord. Adversarial risk and the dangers of evaluating against weak attacks. In ICML, 2018.

[68] Oriol Vinyals, Charles Blundell, Timothy Lillicrap, Daan Wierstra, et al. Matching networks for one shot learning. NeurIPS, 2016.

[69] He Wang, Feixiang He, Zhexi Peng, Yong-Liang Yang, Tianjia Shao, Kun Zhou, and David Hogg. Understanding the robustness of skeleton-based action recognition under adversarial attack. In CVPR, 2021.

[70] Eric Wong and Zico Kolter. Provable defenses against adversarial examples via the convex outer adversarial polytope. In ICML, 2018.

[71] Eric Wong, Leslie Rice, and J Zico Kolter. Fast is better than free: Revisiting adversarial training. In ICLR, 2019.

[72] Chong Xiang, Charles R Qi, and Bo Li. Generating 3d adversarial point clouds. In CVPR, 2019.

[73] Cihang Xie, Mingxing Tan, Boqing Gong, Jiang Wang, Alan L Yuille, and Quoc V Le. Adversarial examples improve image recognition. In CVPR, 2020.

[74] Cihang Xie, Jianyu Wang, Zhishuai Zhang, Zhou Ren, and Alan Yuille. Mitigating adversarial effects through randomization. ICLR, 2018.

[75] Cihang Xie, Jianyu Wang, Zhishuai Zhang, Yuyin Zhou, Lingxi Xie, and Alan Yuille. Adversarial examples for semantic segmentation and object detection. In ICCV, 2017.

[76] Cihang Xie, Yuxin Wu, Laurens van der Maaten, Alan L Yuille, and Kaiming He. Feature denoising for improving adversarial robustness. In CVPR, 2019.

[77] Zhongqi Yue, Hanwang Zhang, Qianru Sun, and Xian-Sheng Hua. Interventional few-shot learning. NeurIPS, 2020.

[78] Dong Zhang, Hanwang Zhang, Jinhui Tang, Xian-Sheng Hua, and Qianru Sun. Causal intervention for weakly-supervised semantic segmentation. NeurIPS, 2020.


[79] Huan Zhang, Hongge Chen, Zhao Song, Duane Boning, Inderjit S Dhillon, and Cho-Jui Hsieh. The limitations of adversarial training and the blind-spot attack. In ICLR, 2019.

[80] Hongyi Zhang, Moustapha Cisse, Yann N Dauphin, and David Lopez-Paz. mixup: Beyond empirical risk minimization. ICLR, 2018.

[81] Tianhang Zheng, Changyou Chen, and Kui Ren. Distributionally adversarial attack. In AAAI, 2019.
