Analysis methods in NLP: Adversarial testing


Page 1: Analysis methods in NLP: Adversarial testing

Analysis methods in NLP: Adversarial testing

Christopher Potts

Stanford Linguistics

CS224u: Natural language understanding

Page 2: Analysis methods in NLP: Adversarial testing

Standard evaluations

1. Create a dataset from a single process.

2. Divide the dataset into disjoint train and test sets, and set the test set aside.

3. Develop a system on the train set.

4. Only after all development is complete, evaluate the system based on accuracy on the test set.

5. Report the results as providing an estimate of the system's capacity to generalize. (A minimal sketch in code follows.)
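A minimal sketch of this protocol, assuming a generic text-classification task; load_dataset is a hypothetical loader standing in for step 1, and everything else is standard scikit-learn:

    # Steps 1-2: one dataset, split into disjoint train and test portions.
    from sklearn.model_selection import train_test_split
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score

    texts, labels = load_dataset()  # hypothetical loader for the single dataset

    X_train, X_test, y_train, y_test = train_test_split(
        texts, labels, test_size=0.2, random_state=42)  # test set is set aside

    # Step 3: all development happens on the train set only.
    vec = CountVectorizer()
    model = LogisticRegression(max_iter=1000)
    model.fit(vec.fit_transform(X_train), y_train)

    # Steps 4-5: one final accuracy measurement, reported as an
    # estimate of the system's capacity to generalize.
    acc = accuracy_score(y_test, model.predict(vec.transform(X_test)))
    print(f"Test accuracy (generalization estimate): {acc:.3f}")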


Page 3: Analysis methods in NLP: Adversarial testing

Adversarial evaluations

1. Create a dataset by whatever means you like.

2. Develop and assess the system using that dataset, according to whatever protocols you choose.

3. Develop a new test dataset of examples that you suspect or know will be challenging given your system and the original dataset.

4. Only after all system development is complete, evaluate the system based on accuracy on the new test dataset.

5. Report the results as providing an estimate of the system's capacity to generalize. (A sketch continuing the previous one follows.)
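Continuing the sketch above, the only change is where the final evaluation data comes from; load_challenge_set is again a hypothetical loader, here for the separately constructed challenge examples:

    # Steps 1-2: develop and assess however you like (e.g., as above).
    # Step 3: build a new test set you suspect or know will be hard.
    X_challenge, y_challenge = load_challenge_set()  # hypothetical loader

    # Steps 4-5: evaluate only after all development is complete.
    adv_acc = accuracy_score(
        y_challenge, model.predict(vec.transform(X_challenge)))
    print(f"Challenge-set accuracy (generalization estimate): {adv_acc:.3f}")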


Page 4: Analysis methods in NLP: Adversarial testing

A bit of history

MIND: A Quarterly Review of Psychology and Philosophy, October 1950

I. COMPUTING MACHINERY AND INTELLIGENCE

1. The Imitation Game. I propose to consider the question, 'Can machines think?' This should begin with definitions of the meaning of the terms 'machine' and 'think'. The definitions might be framed so as to reflect so far as possible the normal use of the words, but this attitude is dangerous. If the meaning of the words 'machine' and 'think' are to be found by examining how they are commonly used it is difficult to escape the conclusion that the meaning and the answer to the question, 'Can machines think?' is to be sought in a statistical survey such as a Gallup poll. But this is absurd. Instead of attempting such a definition I shall replace the question by another, which is closely related to it and is expressed in relatively unambiguous words.

The new form of the problem can be described in terms of a game which we call the 'imitation game'. It is played with three people, a man (A), a woman (B), and an interrogator (C) who may be of either sex. The interrogator stays in a room apart from the other two. The object of the game for the interrogator is to determine which of the other two is the man and which is the woman. He knows them by labels X and Y, and at the end of the game he says either 'X is A and Y is B' or 'X is B and Y is A'. The interrogator is allowed to put questions to A and B thus:

C: Will X please tell me the length of his or her hair? Now suppose X is actually A, then A must answer. It is A's…

Page 5: Analysis methods in NLP: Adversarial testing

A bit of history

Cognitive Psychology 3(1), 1–191 (1972)

Understanding Natural Language

TERRY WINOGRAD, Massachusetts Institute of Technology, Cambridge, Massachusetts

This paper describes a computer system for understanding English. The system answers questions, executes commands, and accepts information in an interactive English dialog.

It is based on the belief that in modeling language understanding, we must deal in an integrated way with all of the aspects of language: syntax, semantics, and inference. The system contains a parser, a recognition grammar of English, programs for semantic analysis, and a general problem solving system. We assume that a computer cannot deal reasonably with language unless it can understand the subject it is discussing. Therefore, the program is given a detailed model of a particular domain. In addition, the system has a simple model of its own mentality. It can remember and discuss its plans and actions as well as carrying them out. It enters into a dialog with a person, responding to English sentences with actions and English replies, asking for clarification when its heuristic programs cannot understand a sentence through the use of syntactic, semantic, contextual, and physical knowledge. Knowledge in the system is represented in the form of procedures, rather than tables of rules or lists of patterns. By developing special procedural representations for syntax, semantics, and inference, we gain flexibility and power. Since each piece of knowledge can be a procedure, it can call directly on any other piece of knowledge in the system.

1. OVERVIEW OF THE LANGUAGE UNDERSTANDING PROGRAM

1.1. Introduction. When a person sees or hears a sentence, he makes full use of his knowledge and intelligence to understand it. This includes not only grammar, but also his knowledge about words, the context of the sentence…


Page 7: Analysis methods in NLP: Adversarial testing

Winograd sentences

1. The trophy doesn't fit into the brown suitcase because it's too small. What is too small? The suitcase / The trophy

2. The trophy doesn't fit into the brown suitcase because it's too large. What is too large? The suitcase / The trophy

3. The council refused the demonstrators a permit because they feared violence. Who feared violence? The council / The demonstrators

4. The council refused the demonstrators a permit because they advocated violence. Who advocated violence? The council / The demonstrators

Winograd 1972; Levesque 2013

Page 8: Analysis methods in NLP: Adversarial testing

Levesque's (2013) adversarial framing

Could a crocodile run a steeplechase?
"The intent here is clear. The question can be answered by thinking it through: a crocodile has short legs; the hedges in a steeplechase would be too tall for the crocodile to jump over; so no, a crocodile cannot run a steeplechase."

Foiling cheap tricks
"Can we find questions where cheap tricks like this will not be sufficient to produce the desired behaviour? This unfortunately has no easy answer. The best we can do, perhaps, is to come up with a suite of multiple-choice questions carefully and then study the sorts of computer programs that might be able to answer them."

Page 9: Analysis methods in NLP: Adversarial testing

Analytical considerations

What can adversarial testing tell us? (And what can't it tell us?)

Page 10: Analysis methods in NLP: Adversarial testing

No need to be too adversarial

The evaluation need not be adversarial per se. It could just be oriented towards assessing a particular set of phenomena, as in the following questions (a probe-set sketch follows the list):

1. Has my system learned anything about numerical terms?
2. Does my system understand how negation works?
3. Does my system work with a new style or genre?
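A phenomenon-oriented evaluation can be as small as a hand-built probe set. A sketch for the negation question, with predict_nli as a hypothetical wrapper around whatever system is being assessed (the examples are invented for illustration):

    # A tiny hand-built probe set targeting negation in NLI.
    negation_probes = [
        (("The cat is on the mat.", "The cat is not on the mat."), "contradiction"),
        (("No students passed the exam.", "Some students passed the exam."), "contradiction"),
        (("The door is not closed.", "The door is open."), "entailment"),
    ]

    correct = sum(
        predict_nli(premise, hypothesis) == gold  # hypothetical model wrapper
        for (premise, hypothesis), gold in negation_probes)
    print(f"Negation probe accuracy: {correct}/{len(negation_probes)}")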


Page 11: Analysis methods in NLP: Adversarial testing

Metrics

The limitations of accuracy-based metrics are generally left unaddressed by the adversarial paradigm.
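One way to see the worry, using invented predictions: accuracy can look excellent while a rare class is never predicted correctly. Per-class reports of the kind shown later in these slides make the failure visible:

    from sklearn.metrics import accuracy_score, classification_report

    # Invented labels: 95 majority-class cases plus 5 rare-class cases
    # that the system always gets wrong.
    y_gold = ["entailment"] * 95 + ["neutral"] * 5
    y_pred = ["entailment"] * 100

    print(accuracy_score(y_gold, y_pred))  # 0.95: looks strong
    print(classification_report(y_gold, y_pred, zero_division=0))  # neutral F1 = 0.0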


Page 12: Analysis methods in NLP: Adversarial testing

Model failing or dataset failing?

Liu et al. (2019): "What should we conclude when a system fails on a challenge dataset? In some cases, a challenge might exploit blind spots in the design of the original dataset (dataset weakness). In others, the challenge might expose an inherent inability of a particular model family to handle certain natural language phenomena (model weakness). These are, of course, not mutually exclusive."

Page 13: Analysis methods in NLP: Adversarial testing

Model failing or dataset failing?

Geiger et al. (2019): However, for any evaluation method, we should ask whether it is fair. Has the model been shown data sufficient to support the kind of generalization we are asking of it? Unless we can say "yes" with complete certainty, we can't be sure whether a failed evaluation traces to a model limitation or a data limitation that no model could overcome.

Page 14: Analysis methods in NLP: Adversarial testing


Model failing or dataset failing?

3 3 5 4 . . .

What number comes next?



Page 16: Analysis methods in NLP: Adversarial testing

Inoculation by fine-tuning

Inoculation by Fine-Tuning: A Method for Analyzing Challenge Datasets

Nelson F. Liu, Roy Schwartz, Noah A. Smith
Paul G. Allen School of Computer Science & Engineering, University of Washington; Department of Linguistics, University of Washington; Allen Institute for Artificial Intelligence, Seattle, WA, USA
{nfliu,roysch,nasmith}@cs.washington.edu

Abstract

Several datasets have recently been constructed to expose brittleness in models trained on existing benchmarks. While model performance on these challenge datasets is significantly lower compared to the original benchmark, it is unclear what particular weaknesses they reveal. For example, a challenge dataset may be difficult because it targets phenomena that current models cannot capture, or because it simply exploits blind spots in a model's specific training set. We introduce inoculation by fine-tuning, a new analysis method for studying challenge datasets by exposing models (the metaphorical patient) to a small amount of data from the challenge dataset (a metaphorical pathogen) and assessing how well they can adapt. We apply our method to analyze the NLI "stress tests" (Naik et al., 2018) and the Adversarial SQuAD dataset (Jia and Liang, 2017). We show that after slight exposure, some of these datasets are no longer challenging, while others remain difficult. Our results indicate that failures on challenge datasets may lead to very different conclusions about models, training datasets, and the challenge datasets themselves.

1 Introduction

NLP research progresses through the construction of dataset-benchmarks and the development of systems whose performance on them can be fairly compared. A recent pattern involves challenges to benchmarks:¹ manipulations to input data that result in severe degradation of system performance, but not human performance. These challenges have been used as evidence that current systems are brittle (Belinkov and Bisk, 2018; Mudrakarta et al., 2018; Zhao et al., 2018; Glockner et al., 2018; Ebrahimi et al., 2018; Ribeiro et al., 2018, inter alia). For instance, Naik et al. (2018) generated natural language inference challenge data by applying simple textual transformations to existing examples from MultiNLI (Williams et al., 2018) and SNLI (Bowman et al., 2015). Similarly, Jia and Liang (2017) built an adversarial evaluation dataset for reading comprehension based on SQuAD (Rajpurkar et al., 2016).

¹Often referred to as "adversarial datasets" or "attacks".

Figure 1: An illustration of the standard challenge evaluation procedure (e.g., Jia and Liang, 2017) and our proposed analysis method. "Original" refers to the standard dataset (e.g., SQuAD) and "Challenge" refers to the challenge dataset (e.g., Adversarial SQuAD). Outcomes are discussed in Section 2.

What should we conclude when a system fails on a challenge dataset? In some cases, a challenge might exploit blind spots in the design of the original dataset (dataset weakness). In others, the challenge might expose an inherent inability of a particular model family to handle certain natural language phenomena (model weakness). These are, of course, not mutually exclusive.

We introduce inoculation by fine-tuning, a…
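The analysis the abstract describes can be sketched as a simple loop; fine_tune, evaluate, and the dataset names here are hypothetical stand-ins, not the authors' code:

    # Inoculation by fine-tuning, schematically (after Liu et al. 2019):
    # expose the model to k challenge examples, then compare performance
    # on the original and challenge test sets.
    for k in [100, 500, 1000]:
        inoculated = fine_tune(model, challenge_train[:k])  # hypothetical
        orig_acc = evaluate(inoculated, original_test)      # hypothetical
        chal_acc = evaluate(inoculated, challenge_test)
        print(k, orig_acc, chal_acc)
        # Outcome 1: chal_acc recovers, orig_acc stays high -> dataset weakness.
        # Outcome 2: chal_acc stays low, orig_acc stays high -> model weakness.
        # Outcome 3: orig_acc degrades -> dataset artifacts or other problems.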


Liu et al. 2019

Page 17: Analysis methods in NLP: Adversarial testing

SQuAD leaderboards

Screenshot of the SQuAD leaderboard

Rajpurkar et al. 2016

Page 18: Analysis methods in NLP: Adversarial testing

SQuAD adversarial testing

Passage: Peyton Manning became the first quarterback ever to lead two different teams to multiple Super Bowls. He is also the oldest quarterback ever to play in a Super Bowl at age 39. The past record was held by John Elway, who led the Broncos to victory in Super Bowl XXXIII at age 38 and is currently Denver's Executive Vice President of Football Operations and General Manager.

Question: What is the name of the quarterback who was 38 in Super Bowl XXXIII?

Answer: John Elway

Jia and Liang 2017


Page 20: Analysis methods in NLP: Adversarial testing

SQuAD adversarial testing

Passage: Peyton Manning became the first quarterback ever to lead two different teams to multiple Super Bowls. He is also the oldest quarterback ever to play in a Super Bowl at age 39. The past record was held by John Elway, who led the Broncos to victory in Super Bowl XXXIII at age 38 and is currently Denver's Executive Vice President of Football Operations and General Manager. Quarterback Leland Stanford had jersey number 37 in Champ Bowl XXXIV.

Question: What is the name of the quarterback who was 38 in Super Bowl XXXIII?

Answer: John Elway

Model: Leland Stanford

Jia and Liang 2017
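The perturbation itself is simple string surgery; a sketch, with the distractor taken from the slide (real Adversarial SQuAD distractors are generated automatically from the question rather than written by hand):

    def add_distractor(passage: str, distractor: str, prepend: bool = False) -> str:
        """Insert a distractor sentence at the end (or start) of a passage."""
        return f"{distractor} {passage}" if prepend else f"{passage} {distractor}"

    passage = "Peyton Manning became the first quarterback ever to lead ..."
    distractor = ("Quarterback Leland Stanford had jersey number 37 "
                  "in Champ Bowl XXXIV.")
    appended = add_distractor(passage, distractor)                 # this slide
    prepended = add_distractor(passage, distractor, prepend=True)  # next slide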


Page 22: Analysis methods in NLP: Adversarial testing

SQuAD adversarial testing

Passage: Quarterback Leland Stanford had jersey number 37 in Champ Bowl XXXIV. Peyton Manning became the first quarterback ever to lead two different teams to multiple Super Bowls. He is also the oldest quarterback ever to play in a Super Bowl at age 39. The past record was held by John Elway, who led the Broncos to victory in Super Bowl XXXIII at age 38 and is currently Denver's Executive Vice President of Football Operations and General Manager.

Question: What is the name of the quarterback who was 38 in Super Bowl XXXIII?

Answer: John Elway

Model: Leland Stanford

Jia and Liang 2017


Page 24: Analysis methods in NLP: Adversarial testing

SQuAD adversarial testing

System        Original   Adversarial
ReasoNet-E    81.1       39.4
SEDT-E        80.1       35.0
BiDAF-E       80.0       34.2
Mnemonic-E    79.1       46.2
Ruminating    78.8       37.4
jNet          78.6       37.9
Mnemonic-S    78.5       46.6
ReasoNet-S    78.2       39.4
MPCM-S        77.0       40.3
SEDT-S        76.9       33.9
RaSOR         76.2       39.5
BiDAF-S       75.5       34.3
Match-E       75.4       29.4
Match-S       71.4       27.3
DCR           69.4       37.8
Logistic      50.4       23.2

Page 25: Analysis methods in NLP: Adversarial testing

SQuAD adversarial testing

System        Original Rank   Adversarial Rank
ReasoNet-E    1               5
SEDT-E        2               10
BiDAF-E       3               12
Mnemonic-E    4               2
Ruminating    5               9
jNet          6               7
Mnemonic-S    7               1
ReasoNet-S    8               5
MPCM-S        9               3
SEDT-S        10              13
RaSOR         11              4
BiDAF-S       12              11
Match-E       13              14
Match-S       14              15
DCR           15              8
Logistic      16              16
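The reranking in these two tables can be quantified with a rank correlation; a sketch using scipy, with the (original, adversarial) rank pairs transcribed from the table above:

    from scipy.stats import spearmanr

    # (original rank, adversarial rank), one pair per system, from the table.
    ranks = [(1, 5), (2, 10), (3, 12), (4, 2), (5, 9), (6, 7), (7, 1), (8, 5),
             (9, 3), (10, 13), (11, 4), (12, 11), (13, 14), (14, 15), (15, 8),
             (16, 16)]
    orig, adv = zip(*ranks)
    rho, p = spearmanr(orig, adv)
    print(f"Spearman rho between rankings: {rho:.2f} (p = {p:.3f})")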


Page 26: Analysis methods in NLP: Adversarial testing

Comparison with regular testing

Plot of Original vs. Adversarial scores for SQuAD

Page 27: Analysis methods in NLP: Adversarial testing

Comparison with regular testing

Do ImageNet Classifiers Generalize to ImageNet?

Figure 1: Model accuracy on the original test sets vs. our new test sets. Each data point corresponds to one model in our testbed (shown with 95% Clopper-Pearson confidence intervals). The plots reveal two main phenomena: (i) There is a significant drop in accuracy from the original to the new test sets. (ii) The model accuracies closely follow a linear function with slope greater than 1 (1.7 for CIFAR-10 and 1.1 for ImageNet). This means that every percentage point of progress on the original test set translates into more than one percentage point on the new test set. The two plots are drawn so that their aspect ratio is the same, i.e., the slopes of the lines are visually comparable. The red shaded region is a 95% confidence region for the linear fit from 100,000 bootstrap samples.

…is to find a model $\hat f$ that minimizes the population loss

$L_{\mathcal{D}}(\hat f) = \mathbb{E}_{(x,y)\sim\mathcal{D}}\big[\,\mathbb{I}[\hat f(x) \neq y]\,\big]. \quad (1)$

Since we usually do not know the distribution $\mathcal{D}$, we instead measure the performance of a trained classifier via a test set $S$ drawn from the distribution $\mathcal{D}$:

$L_S(\hat f) = \frac{1}{|S|} \sum_{(x,y)\in S} \mathbb{I}[\hat f(x) \neq y]. \quad (2)$

We then use this test error $L_S(\hat f)$ as a proxy for the population loss $L_{\mathcal{D}}(\hat f)$. If a model $\hat f$ achieves a low test error, we assume that it will perform similarly well on future examples from the distribution $\mathcal{D}$. This assumption underlies essentially all empirical evaluations in machine learning since it allows us to argue that the model $\hat f$ generalizes.

In our experiments, we test this assumption by collecting a new test set $S'$ from a data distribution $\mathcal{D}'$ that we carefully control to resemble the original distribution $\mathcal{D}$. Ideally, the original test accuracy $L_S(\hat f)$ and new test accuracy $L_{S'}(\hat f)$ would then match up to the random sampling error. In contrast to this idealized view, our results in Figure 1 show a large drop in accuracy from the original test set $S$ to our new test set $S'$. To understand this accuracy drop in more detail, we decompose the difference between $L_S(\hat f)$ and $L_{S'}(\hat f)$ into three parts (dropping the dependence on $\hat f$ to simplify notation):

$L_S - L_{S'} = \underbrace{(L_S - L_{\mathcal{D}})}_{\text{adaptivity gap}} + \underbrace{(L_{\mathcal{D}} - L_{\mathcal{D}'})}_{\text{distribution gap}} + \underbrace{(L_{\mathcal{D}'} - L_{S'})}_{\text{generalization gap}}.$

We now discuss to what extent each of the three terms can lead to accuracy drops.

Generalization Gap. By construction, our new test set $S'$ is independent of the existing classifier $\hat f$. Hence the third term $L_{\mathcal{D}'} - L_{S'}$ is the standard generalization gap commonly studied in machine learning. It is determined solely by the random sampling error. A first guess is that this inherent sampling error suffices to explain the accuracy drops in Figure 1 (e.g., the new test set $S'$ could have sampled certain "harder" modes of the distribution $\mathcal{D}$ more often). However, random fluctuations of this magnitude are unlikely for the size of our test sets. With 10,000 data points (as in our new ImageNet test set), a Clopper-Pearson 95% confidence interval for the test accuracy has size of at most ±1%. Increasing the confidence level to 99.99% yields a confidence interval of size at most ±2%. Moreover, these confidence intervals become smaller for higher accuracies, which is the relevant regime for the best-performing models. Hence random chance alone cannot explain the accuracy drops observed in our experiments.²
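The ±1% claim is easy to check numerically; a sketch using statsmodels' exact (Clopper-Pearson) interval, with an assumed 90%-accurate model on a 10,000-example test set:

    from statsmodels.stats.proportion import proportion_confint

    n = 10_000       # size of the new test set
    correct = 9_000  # assume a model at 90% accuracy

    # method="beta" gives the exact Clopper-Pearson interval.
    lo, hi = proportion_confint(correct, n, alpha=0.05, method="beta")
    print(f"95% CI: [{lo:.4f}, {hi:.4f}]")  # half-width comfortably under 1%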

Adaptivity Gap. We call the term $L_S - L_{\mathcal{D}}$ the adaptivity gap. It measures how much adapting the model $\hat f$ to the test set $S$ causes the test error $L_S$ to underestimate the population loss $L_{\mathcal{D}}$. If we assumed that our model $\hat f$ is independent of the test set $S$, this term would follow the…

²We remark that the sampling process for the new test set $S'$ could indeed systematically sample harder modes more often than under the original data distribution $\mathcal{D}$. Such a systematic change in the sampling process would not be an effect of random chance but captured by the distribution gap described below.


Recht et al. 2019

Page 28: Analysis methods in NLP: Adversarial testing

Example: NLI

Bowman et al. 2015


Page 30: Analysis methods in NLP: Adversarial testing

An SNLI adversarial evaluation

             Premise                                       Relation   Hypothesis

Train        A little girl kneeling in the dirt crying.    entails    A little girl is very sad.
Adversarial  (same premise)                                entails    A little girl is very unhappy.

Train        An elderly couple are sitting outside a       entails    A couple drinking wine.
             restaurant, enjoying wine.
Adversarial  (same premise)                                neutral    A couple drinking champagne.

Glockner et al. 2018

Page 31: Analysis methods in NLP: Adversarial testing

An SNLI adversarial evaluation

Model                        Train set         SNLI test set   New test set   Δ
Decomposable Attention       SNLI              84.7%           51.9%          -32.8
(Parikh et al., 2016)        MultiNLI + SNLI   84.9%           65.8%          -19.1
                             SciTail + SNLI    85.0%           49.0%          -36.0
ESIM (Chen et al., 2017)     SNLI              87.9%           65.6%          -22.3
                             MultiNLI + SNLI   86.3%           74.9%          -11.4
                             SciTail + SNLI    88.3%           67.7%          -20.6
Residual-Stacked-Encoder     SNLI              86.0%           62.2%          -23.8
(Nie and Bansal, 2017)       MultiNLI + SNLI   84.6%           68.2%          -16.8
                             SciTail + SNLI    85.0%           60.1%          -24.9
WordNet Baseline             -                 -               85.8%          -
KIM (Chen et al., 2018)      SNLI              88.6%           83.5%          -5.1

Table 3: Accuracy of various models trained on SNLI or a union of SNLI with another dataset (MultiNLI, SciTail), and tested on the original SNLI test set and the new test set.

We chose models which are amongst the best performing within their approaches (excluding ensembles) and have available code. All models are based on pre-trained GloVe embeddings (Pennington et al., 2014), which are either fine-tuned during training (RESIDUAL-STACKED-ENCODER and ESIM) or stay fixed (DECOMPOSABLE ATTENTION). All models predict the label using a concatenation of features derived from the sentence representations (e.g. maximum, mean), for example as in Mou et al. (2016). We use the recommended hyper-parameters for each model, as they appear in the provided code.

With External Knowledge. We provide a simple WORDNET BASELINE, in which we classify a sentence-pair according to the WordNet relation that holds between the original word w_p and the replaced word w_h. We predict entailment if w_p is a hyponym of w_h or if they are synonyms, neutral if w_p is a hypernym of w_h, and contradiction if w_p and w_h are antonyms or if they share a common hypernym ancestor (up to 2 edges). Word pairs with no WordNet relations are classified as other.

We also report the performance of KIM (Knowledge-based Inference Model, Chen et al., 2018), an extension of ESIM with external knowledge from WordNet, which was kindly provided to us by Qian Chen. KIM improves the attention mechanism by taking into account the existence of WordNet relations between the words. The lexical inference component, operating over pairs of aligned words, is enriched with a vector encoding the specific WordNet relations between the words.

4.2 Experimental Settings

We trained each model on 3 different datasets: (1) SNLI train set, (2) a union of the SNLI train set and the MultiNLI train set, and (3) a union of the SNLI train set and the SciTail train set. The motivation is that while SNLI might lack the training data needed to learn the required lexical knowledge, it may be available in the other datasets, which are presumably richer.

4.3 Results

Table 3 displays the results for all the models on the original SNLI test set and the new test set. Despite the task being considerably simpler, the drop in performance is substantial, ranging from 11 to 33 points in accuracy. Adding MultiNLI to the training data somewhat mitigates this drop in accuracy, thanks to almost doubling the amount of training data. We note that adding SciTail to the training data did not similarly improve the performance; we conjecture that this stems from the differences between the datasets.

KIM substantially outperforms the other neural models, demonstrating that lexical knowledge is the only requirement for good performance on the new test set, and stressing the inability of the other models to learn it. Both WordNet-informed models leave room for improvement: possibly due to limited WordNet coverage and the implications of applying lexical inferences within context.

5 Analysis

We take a deeper look into the predictions of the models that don't employ external knowledge, focusing on the models trained on SNLI.

5.1 Accuracy by Category

Table 4 displays the accuracy of each model per replacement-word category. The neural models tend to perform well on categories which are frequent in the training set, such as colors, and badly…

Slide annotation: the WordNet Baseline and KIM rows are models that have access to the resources used to create the adversarial examples.
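A minimal sketch of the WordNet baseline's decision rule as described in the excerpt, using NLTK; sense selection and the 2-edge common-hypernym check for contradiction are simplified away here, so this approximates the rule rather than reproducing the authors' code:

    from nltk.corpus import wordnet as wn  # requires nltk.download('wordnet')

    def wordnet_relation(w_p: str, w_h: str) -> str:
        """Label a pair by the WordNet relation between the original
        premise word w_p and its replacement w_h."""
        syns_p, syns_h = wn.synsets(w_p), wn.synsets(w_h)
        for sp in syns_p:
            for sh in syns_h:
                if sp == sh:
                    return "entailment"  # synonyms share a synset
                if sh in sp.closure(lambda s: s.hypernyms()):
                    return "entailment"  # w_p is a hyponym of w_h
                if sp in sh.closure(lambda s: s.hypernyms()):
                    return "neutral"     # w_p is a hypernym of w_h
        antonyms = {a.name() for s in syns_p for l in s.lemmas()
                    for a in l.antonyms()}
        if any(l.name() in antonyms for s in syns_h for l in s.lemmas()):
            return "contradiction"
        return "other"

    print(wordnet_relation("wine", "champagne"))  # expect 'neutral', as in the slide example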


Page 32: Analysis methods in NLP: Adversarial testing

An SNLI adversarial evaluation
RoBERTa-MNLI, off-the-shelf

multinli_glockner_adversarial_roberta

March 23, 2020

[1]: import nli, os, torch
     from sklearn.metrics import classification_report

[2]: # available from https://github.com/BIU-NLP/Breaking_NLI
     breaking_nli_src_filename = os.path.join("../new-data/data/dataset.jsonl")
     reader = nli.NLIReader(breaking_nli_src_filename)

[3]: exs = [((ex.sentence1, ex.sentence2), ex.gold_label) for ex in reader.read()]

[4]: X_test_str, y_test = zip(*exs)

[5]: model = torch.hub.load('pytorch/fairseq', 'roberta.large.mnli')
     _ = model.eval()

Using cache found in /Users/cgpotts/.cache/torch/hub/pytorch_fairseq_master

[6]: X_test = [model.encode(*ex) for ex in X_test_str]

[7]: pred_indices = [model.predict('mnli', ex).argmax() for ex in X_test]

[8]: to_str = {0: 'contradiction', 1: 'neutral', 2: 'entailment'}

[9]: preds = [to_str[c.item()] for c in pred_indices]

[10]: print(classification_report(y_test, preds))

               precision  recall  f1-score  support

contradiction       0.99    0.97      0.98     7164
   entailment       0.86    1.00      0.92      982
      neutral       0.15    0.15      0.15       47

     accuracy                         0.97     8193
    macro avg       0.67    0.71      0.68     8193
 weighted avg       0.97    0.97      0.97     8193


Page 34: Analysis methods in NLP: Adversarial testing

A MultiNLI adversarial evaluation

Antonyms
  Premise: I love the Cinderella story.
  Relation: contradicts
  Hypothesis: I hate the Cinderella story.

Numerical
  Premise: Tim has 350 pounds of cement in 100, 50, and 25 pound bags.
  Relation: contradicts
  Hypothesis: Tim has less than 750 pounds of cement in 100, 50, and 25 pound bags.

Word overlap
  Premise: Possibly no other country has had such a turbulent history.
  Relation: entails
  Hypothesis: The country's history has been turbulent and true is true

Negation
  Premise: Possibly no other country has had such a turbulent history.
  Relation: entails
  Hypothesis: The country's history has been turbulent and false is not true

Also ‘Length mismatch’ and ‘Spelling errors’; Naik et al. 2018
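The word-overlap and negation rows are generated by appending a tautology to the hypothesis, so the gold label cannot change; a sketch of the two transformations, with the appended strings taken from the examples above:

    def word_overlap(premise: str, hypothesis: str) -> tuple:
        """Append a tautology; the correct label is unchanged."""
        return premise, hypothesis.rstrip(".") + " and true is true"

    def negation(premise: str, hypothesis: str) -> tuple:
        """Append a negated tautology containing 'not'; label unchanged."""
        return premise, hypothesis.rstrip(".") + " and false is not true"

    print(negation("Possibly no other country has had such a turbulent history.",
                   "The country's history has been turbulent."))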

Page 35: Analysis methods in NLP: Adversarial testing

A MultiNLI adversarial evaluation

Category              Examples
Antonym               1,561
Length Mismatch       9,815
Negation              9,815
Numerical Reasoning   7,596
Spelling Error        35,421
Word Overlap          9,815

Naik et al. 2018

Page 36: Analysis methods in NLP: Adversarial testing

A MultiNLI adversarial evaluation

          Original       Competence Test           Distraction Test                                Noise Test
          MultiNLI Dev   Antonymy     Numerical    Word Overlap   Negation      Length Mismatch    Spelling Error
System    Mat    Mis     Mat    Mis   Reasoning    Mat    Mis     Mat    Mis    Mat    Mis         Mat    Mis

NB        74.2   74.8    15.1   19.3  21.2         47.2   47.1    39.5   40.0   48.2   47.3        51.1   49.8
CH        73.7   72.8    11.6    9.3  30.3         58.3   58.4    52.4   52.2   63.7   65.0        68.3   69.1
RC        71.3   71.6    36.4   32.8  30.2         53.7   54.4    49.5   50.4   48.6   49.6        66.6   67.0
IS        70.3   70.6    14.4   10.2  28.8         50.0   50.2    46.8   46.6   58.7   59.4        58.3   59.4
BiLSTM    70.2   70.8    13.2    9.8  31.3         57.0   58.5    51.4   51.9   49.7   51.2        65.0   65.1
CBOW      63.5   64.2     6.3    3.6  30.3         53.6   55.6    43.7   44.2   48.0   49.3        60.3   60.6

Table 3: Classification accuracy (%) of state-of-the-art models on our constructed stress tests. Accuracies shown on both genre-matched and mismatched categories for each stress set. For reference, random baseline accuracy is 33%.

3.3 Noise Test Construction

This class consists of an adversarial example set which tests model robustness to spelling errors. Spelling errors occur often in MultiNLI data, due to involvement of Turkers and noisy source text (Ghaeini et al., 2018), which is problematic as some NLI systems rely heavily on word embeddings. Inspired by Belinkov and Bisk (2017), we construct a stress test for "spelling errors" by performing two types of perturbations on a word sampled randomly from the hypothesis: random swap of adjacent characters within the word (for example, "I saw Tipper with him at teh movie."), and random substitution of a single alphabetical character with the character next to it on the English keyboard. For example, "Agencies have been further restricted and given less choice in selecting contractimg methods".
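A sketch of the two perturbations just described; the keyboard-neighbor map here is a small illustrative subset, not the authors' table:

    import random

    # Illustrative subset of English-keyboard neighbors.
    KEYBOARD_NEIGHBORS = {"a": "s", "s": "d", "d": "f", "e": "r", "t": "y",
                          "h": "j", "i": "o", "n": "m", "m": "n", "c": "v"}

    def swap_adjacent(word: str, rng: random.Random) -> str:
        """Randomly swap two adjacent characters: 'the' can become 'teh'."""
        if len(word) < 2:
            return word
        i = rng.randrange(len(word) - 1)
        return word[:i] + word[i + 1] + word[i] + word[i + 2:]

    def keyboard_substitute(word: str, rng: random.Random) -> str:
        """Replace one character with a keyboard neighbor:
        'contracting' can become 'contractimg'."""
        positions = [i for i, ch in enumerate(word) if ch in KEYBOARD_NEIGHBORS]
        if not positions:
            return word
        i = rng.choice(positions)
        return word[:i] + KEYBOARD_NEIGHBORS[word[i]] + word[i + 1:]

    rng = random.Random(0)
    print(swap_adjacent("the", rng), keyboard_substitute("contracting", rng))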

4 Experiments

4.1 Experimental Setup

We focus on the following sentence-encoder models, which achieve strong performance on MultiNLI:

Nie and Bansal (2017) (NB): This model uses a sentence encoder consisting of stacked BiLSTM-RNNs with shortcut connections and fine-tuning of embeddings. It achieves the top non-ensemble result in the RepEval-2017 shared task (Nangia et al., 2017).

Chen et al. (2017) (CH): This model also uses a sentence encoder consisting of stacked BiLSTM-RNNs with shortcut connections. Additionally, it makes use of character-composition word embeddings learned via CNNs, intra-sentence gated attention and ensembling to achieve the best overall result in the RepEval-2017 shared task.

Balazs et al. (2017) (RiverCorners - RC): This model uses a single-layer BiLSTM with mean pooling and intra-sentence attention.

Conneau et al. (2017) (InferSent - IS): This model uses a single-layer BiLSTM-RNN with max-pooling. It is shown to learn robust universal sentence representations which transfer well across several inference tasks.

We also set up two simple baseline models:

BiLSTM: The simple BiLSTM baseline model described by Nangia et al. (2017).
CBOW: A bag-of-words sentence representation from word embeddings.

4.2 Model Performance on Stress Tests

Table 3 shows the classification accuracy of all six models on our stress tests and the original MultiNLI development set. We see that performance of all models drops across all stress tests. On competence stress tests, no model is a clear winner, with RC and CH performing best on antonymy and numerical reasoning respectively. On distraction tests, CH is the best-performing model, suggesting that their gated-attention mechanism handles shallow word-level distractions to some extent. Interestingly, our…


Page 37: Analysis methods in NLP: Adversarial testing

A MultiNLI adversarial evaluation

Figure 3: Inoculation by fine-tuning results. Panel columns, Outcome 1: (a) Word Overlap, (b) Negation; Outcome 2: (c) Spelling Errors, (d) Length Mismatch; Outcome 3: (e) Numerical Reasoning, (f) Adversarial SQuAD. (a-e): NLI accuracy for the ESIM and decomposable attention (DA) models. (f): Reading comprehension F1 scores for the BiDAF and QANet models. Fine-tuning on a small number of word overlap (a) and negation (b) examples erases the performance gap (Outcome 1). Fine-tuning does not yield significant improvement on spelling errors (c) and length mismatch (d), but does not degrade original performance either (Outcome 2). Fine-tuning on numerical reasoning (e) closes the gap entirely, but also reduces performance on the original dataset (Outcome 3). On Adversarial SQuAD (f), around 60% of the performance gap is closed after fine-tuning, though performance on the original dataset decreases (Outcome 3). On each challenge dataset, we observe similar trends between different models.

…produced by running each token through a character bidirectional GRU (Cho et al., 2014).

Adversarial SQuAD. Jia and Liang (2017) created a challenge dataset for reading comprehension by appending automatically-generated distractor sentences to SQuAD passages. The appended distractor sentences are crafted to look similar to the question while not contradicting the correct answer or misleading humans (Figure 2). The authors released model-independent Adversarial SQuAD examples, which we analyze. For our analysis, we use the BiDAF model (Seo et al., 2017) and the QANet model (Yu et al., 2018).

3.2 Results

We refer to the difference between a model's pre-inoculation performance on the original test set and the challenge test set as the performance gap.

NLI Stress Tests. Figure 3 presents NLI accuracy for the ESIM and DA models on the word overlap, negation, spelling errors, length mismatch and numerical reasoning challenge datasets after fine-tuning on a varying number of challenge examples.

For the word overlap and negation challenge datasets, both ESIM and DA quickly close the performance gap when fine-tuning (Outcome 1). For instance, on both of the aforementioned challenge datasets, ESIM requires only 100 examples to close over 90% of the performance gap while maintaining high performance on the original dataset. Since these performance gaps are closed after seeing a few challenge dataset examples (< 0.03% of the original MultiNLI training dataset), these challenges are likely difficult because they exploit easily-recoverable gaps in the models' training dataset rather than highlighting their inability to capture semantic phenomena.

In contrast, on spelling errors and length mismatch, fine-tuning does not allow either model to close a substantial portion of the performance gap, while performance on the original dataset…

Slide annotation on the three outcomes: Outcome 1 (dataset weakness), Outcome 2 (model weakness), Outcome 3 (dataset artifacts or other problem).

Liu et al. 2019; Antonym not tested because its label is always 'contradiction'

Page 38: Analysis methods in NLP: Adversarial testing

References

Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. 2015. A large annotated corpus for learning natural language inference. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 632–642, Stroudsburg, PA. Association for Computational Linguistics.

Atticus Geiger, Ignacio Cases, Lauri Karttunen, and Christopher Potts. 2019. Posing fair generalization tasks for natural language inference. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 4475–4485, Stroudsburg, PA. Association for Computational Linguistics.

Max Glockner, Vered Shwartz, and Yoav Goldberg. 2018. Breaking NLI systems with sentences that require simple lexical inferences. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 650–655, Melbourne, Australia. Association for Computational Linguistics.

Robin Jia and Percy Liang. 2017. Adversarial examples for evaluating reading comprehension systems. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2021–2031. Association for Computational Linguistics.

Hector J. Levesque. 2013. On our best behaviour. In Proceedings of the Twenty-Third International Joint Conference on Artificial Intelligence, Beijing.

Nelson F. Liu, Roy Schwartz, and Noah A. Smith. 2019. Inoculation by fine-tuning: A method for analyzing challenge datasets. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 2171–2179, Minneapolis, Minnesota. Association for Computational Linguistics.

Aakanksha Naik, Abhilasha Ravichander, Norman Sadeh, Carolyn Rose, and Graham Neubig. 2018. Stress test evaluation for natural language inference. In Proceedings of the 27th International Conference on Computational Linguistics, pages 2340–2353, Santa Fe, New Mexico, USA. Association for Computational Linguistics.

Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2383–2392. Association for Computational Linguistics.

Benjamin Recht, Rebecca Roelofs, Ludwig Schmidt, and Vaishaal Shankar. 2019. Do ImageNet classifiers generalize to ImageNet? In Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 5389–5400, Long Beach, California, USA. PMLR.

Terry Winograd. 1972. Understanding natural language. Cognitive Psychology, 3(1):1–191.