

Shi Feng, Eric Wallace, Alvin Grissom II, Pedro Rodriguez, Mohit Iyyer, and Jordan Boyd-Graber. Pathologies of Neural Models Make Interpretation Difficult. Empirical Methods in Natural Language Processing, 2018, 9 pages.

@inproceedings{Feng:Wallace:II:Rodriguez:Iyyer:Boyd-Graber-2018,
  Title = {Pathologies of Neural Models Make Interpretation Difficult},
  Author = {Shi Feng and Eric Wallace and Alvin Grissom II and Pedro Rodriguez and Mohit Iyyer and Jordan Boyd-Graber},
  Booktitle = {Empirical Methods in Natural Language Processing},
  Year = {2018},
  Location = {Brussels, Belgium},
  Url = {http://umiacs.umd.edu/~jbg//docs/2018_emnlp_rs.pdf}
}

Links:

• Blog Post [https://medium.com/uci-nlp/summary-pathologies-of-neural-models-make-interpretations-difficult-emnlp-2018-62280abe9df8]

Downloaded from http://umiacs.umd.edu/~jbg/docs/2018_emnlp_rs.pdf

Contact Jordan Boyd-Graber ([email protected]) for questions about this paper.



Pathologies of Neural Models Make Interpretations Difficult

Shi Feng1  Eric Wallace1  Alvin Grissom II2  Mohit Iyyer3,4  Pedro Rodriguez1  Jordan Boyd-Graber1
1University of Maryland  2Ursinus College  3UMass Amherst  4Allen Institute for Artificial Intelligence
{shifeng,ewallac2,entilzha,jbg}@umiacs.umd.edu, [email protected], [email protected]

Abstract

One way to interpret neural model predictions is to highlight the most important input features—for example, a heatmap visualization over the words in an input sentence. In existing interpretation methods for NLP, a word's importance is determined by either input perturbation—measuring the decrease in model confidence when that word is removed—or by the gradient with respect to that word. To understand the limitations of these methods, we use input reduction, which iteratively removes the least important word from the input. This exposes pathological behaviors of neural models: the remaining words appear nonsensical to humans and do not match the words interpretation methods deem important. As we confirm with human experiments, the reduced examples lack information to support the prediction of any label, but models still make the same predictions with high confidence. To explain these counterintuitive results, we draw connections to adversarial examples and confidence calibration: pathological behaviors reveal difficulties in interpreting neural models trained with maximum likelihood. To mitigate their deficiencies, we fine-tune the models by encouraging high entropy outputs on reduced examples. Fine-tuned models become more interpretable under input reduction without accuracy loss on regular examples.

1 Introduction

Many interpretation methods for neural networks explain the model's prediction as a counterfactual: how does the prediction change when the input is modified? Adversarial examples (Szegedy et al., 2014; Goodfellow et al., 2015) highlight the instability of neural network predictions by showing how small perturbations to the input dramatically change the output.

SQuAD
Context     In 1899, John Jacob Astor IV invested $100,000 for Tesla to further develop and produce a new lighting system. Instead, Tesla used the money to fund his Colorado Springs experiments.
Original    What did Tesla spend Astor's money on ?
Reduced     did
Confidence  0.78 → 0.91

Figure 1: SQuAD example from the validation set. Given the original Context, the model makes the same correct prediction ("Colorado Springs experiments") on the Reduced question as the Original, with even higher confidence. For humans, the reduced question, "did", is nonsensical.

A common, non-adversarial form of model interpretation is feature attribution: features that are crucial for predictions are highlighted in a heatmap. One can measure a feature's importance by input perturbation. Given an input for text classification, a word's importance can be measured by the difference in model confidence before and after that word is removed from the input—the word is important if confidence decreases significantly. This is the leave-one-out method (Li et al., 2016b). Gradients can also reveal feature importance: a feature influences predictions if its gradient is a large positive value. Both perturbation and gradient-based methods can generate heatmaps, implying that the model's prediction is highly influenced by the highlighted, important words.

Instead, we study how the model's prediction is influenced by unimportant words. We use input reduction, iteratively removing unimportant words from the input while maintaining the model's prediction. Intuitively, the words remaining after input reduction should be important for prediction. Moreover, the words should match the leave-one-out method's selections, which closely align with human perception (Li et al., 2016b; Murdoch et al., 2018). However, rather than providing explanations of the original prediction, our reduced examples more closely resemble adversarial examples. The reduced input is meaningless to a human but retains the same model prediction with high confidence (Figure 1). Gradient-based input reduction exposes pathological model behaviors that contradict what one expects based on existing interpretation methods.

In Section 2, we construct these counterintuitive examples by augmenting input reduction with beam search and experiment with three tasks: SQuAD (Rajpurkar et al., 2016) for reading comprehension, SNLI (Bowman et al., 2015) for textual entailment, and VQA (Antol et al., 2015) for visual question answering. Input reduction with beam search consistently reduces the input sentence to very short lengths—often only one or two words—without lowering model confidence on its original prediction. The reduced examples appear nonsensical to humans, which we verify with crowdsourced experiments. In Section 3, we draw connections to adversarial examples and confidence calibration; we explain why the observed pathologies are a consequence of the overconfidence of neural models. This elucidates limitations of interpretation methods that rely on model confidence. In Section 4, we encourage high model uncertainty on reduced examples with entropy regularization. The pathological model behavior under input reduction is mitigated, leading to more reasonable reduced examples.

2 Input Reduction

To explain model predictions using a set of important words, we must first define importance. After defining input perturbation and the gradient-based approximation, we describe input reduction with these importance metrics. Input reduction drastically shortens inputs without causing the model to change its prediction or significantly decrease its confidence. Crowdsourced experiments confirm that reduced examples appear nonsensical to humans: input reduction uncovers pathological model behaviors.

2.1 Importance from Input Gradient

Ribeiro et al. (2016) and Li et al. (2016b) define importance by seeing how confidence changes when a feature is removed: if removing a feature drastically changes a prediction, it must have been important. A natural approximation is to use the gradient (Baehrens et al., 2010; Simonyan et al., 2014). We formally define these importance metrics in natural language contexts and introduce the efficient gradient-based approximation. For each word in an input sentence, we measure its importance by the change in the confidence of the original prediction when we remove that word from the sentence. We switch the sign so that when the confidence decreases, the importance value is positive.

Formally, let $x = \langle x_1, x_2, \ldots, x_n \rangle$ denote the input sentence, $f(y \mid x)$ the predicted probability of label $y$, and $y = \mathrm{argmax}_{y'} f(y' \mid x)$ the original predicted label. The importance is then

$$g(x_i \mid x) = f(y \mid x) - f(y \mid x_{-i}). \tag{1}$$

To calculate the importance of each word in a sentence with n words, we need n forward passes of the model, each time with one of the words left out. This is highly inefficient, especially for longer sentences. Instead, we approximate the importance value with the input gradient. For each word in the sentence, we calculate the dot product of its word embedding and the gradient of the output with respect to the continuous vector that represents the word: the embedding. The importance of the n words can thus be computed with a single forward-backward pass. This gradient approximation has been used for various interpretation methods for natural language classification models (Li et al., 2016a; Arras et al., 2016); see Ebrahimi et al. (2017) for further details on the derivation. We use this approximation in all our experiments as it selects the same words for removal as an exhaustive search (no approximation).
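A minimal sketch of this gradient approximation for a single unbatched example, assuming a PyTorch model whose embedding lookup and post-embedding forward pass are exposed as `model.embed` and `model.forward_from_embeddings` (both names are illustrative, not the authors' code):

```python
import torch

def gradient_importance(model, token_ids, label):
    """First-order approximation of Eq. (1): the importance of token i is the
    dot product of its embedding with the gradient of the predicted label's
    probability with respect to that embedding."""
    embeds = model.embed(token_ids).detach().requires_grad_(True)    # (n, d)
    probs = torch.softmax(model.forward_from_embeddings(embeds), dim=-1)
    grad, = torch.autograd.grad(probs[label], embeds)                # (n, d)
    # Removing word i roughly corresponds to moving its embedding to zero, so
    # f(y | x_{-i}) ~ f(y | x) - e_i . grad_i, giving importance e_i . grad_i.
    return (embeds * grad).sum(dim=-1)                               # (n,)
```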

2.2 Removing Unimportant Words

Instead of looking at the words with high importance values—what interpretation methods commonly do—we take a complementary approach and study how the model behaves when the supposedly unimportant words are removed. Intuitively, the important words should remain after the unimportant ones are removed.

Our input reduction process iteratively removes the unimportant words. At each step, we remove the word with the lowest importance value until the model changes its prediction. We experiment with three popular datasets: SQuAD (Rajpurkar et al., 2016) for reading comprehension, SNLI (Bowman et al., 2015) for textual entailment, and VQA (Antol et al., 2015) for visual question answering. We describe each of these tasks and the model we use below, providing full details in the Supplement.

In SQuAD, each example is a context paragraph and a question. The task is to predict a span in the paragraph as the answer. We reduce only the question while keeping the context paragraph unchanged. The model we use is the DrQA Document Reader (Chen et al., 2017).

In SNLI, each example consists of two sentences: a premise and a hypothesis. The task is to predict one of three relationships: entailment, neutral, or contradiction. We reduce only the hypothesis while keeping the premise unchanged. The model we use is Bilateral Multi-Perspective Matching (BiMPM) (Wang et al., 2017).

In VQA, each example consists of an image and a natural language question. We reduce only the question while keeping the image unchanged. The model we use is Show, Ask, Attend, and Answer (Kazemi and Elqursh, 2017).

During the iterative reduction process, we ensure that the prediction does not change (exact span for SQuAD); consequently, the model accuracy on the reduced examples is identical to the original. The predicted label is used for input reduction and the ground truth is never revealed. We use the validation set for all three tasks.
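The greedy reduction loop itself is only a few lines. A sketch, assuming illustrative `predict` and `importance` wrappers around the model (for example, the gradient scores from the sketch in Section 2.1); this is not the authors' implementation:

```python
def input_reduction(predict, importance, words):
    """Iteratively remove the least important word while the model's
    prediction stays unchanged.  `predict(words)` -> (label, confidence)
    and `importance(words, label)` -> per-word scores are assumed wrappers."""
    label, _ = predict(words)
    while len(words) > 1:
        scores = importance(words, label)
        i = min(range(len(words)), key=lambda j: scores[j])  # least important
        candidate = words[:i] + words[i + 1:]
        new_label, _ = predict(candidate)
        if new_label != label:        # stop just before the prediction flips
            break
        words = candidate
    return words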

Most reduced inputs are nonsensical to humans (Figure 2), as they lack the information needed for any reasonable human prediction. However, models make confident predictions, at times even more confident than on the original input.

To find the shortest possible reduced inputs (potentially the most meaningless), we relax the requirement of removing only the least important word and augment input reduction with beam search. We limit the removal to the k least important words, where k is the beam size, and decrease the beam size as the remaining input is shortened.1 We empirically select a beam size of five as it produces results comparable to larger beam sizes with reasonable computation cost. The requirement of maintaining the model prediction is unchanged.
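A simplified sketch of this beam-search variant, reusing the illustrative `predict` and `importance` wrappers above and the shrinking beam schedule max(1, min(k, L − 3)) given in footnote 1; the exact search bookkeeping in the paper may differ:

```python
def beam_input_reduction(predict, importance, words, max_beam=5):
    """Beam-search input reduction: at each step, try deleting each of the
    k least-important words from every candidate, keeping only candidates
    that preserve the original prediction."""
    label, _ = predict(words)
    beam, shortest = [words], words
    while beam:
        next_beam = []
        for cand in beam:
            k = max(1, min(max_beam, len(cand) - 3))      # footnote 1 schedule
            scores = importance(cand, label)
            order = sorted(range(len(cand)), key=lambda j: scores[j])
            for i in order[:k]:                           # k least-important words
                reduced = cand[:i] + cand[i + 1:]
                if reduced and predict(reduced)[0] == label:
                    next_beam.append(reduced)
                    if len(reduced) < len(shortest):
                        shortest = reduced
        beam = sorted(next_beam, key=len)[:max_beam]      # keep the best few
    return shortest
```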

With beam search, input reduction finds extremely short reduced examples with little to no decrease in the model's confidence on its original predictions. Figure 3 compares the length of input sentences before and after the reduction.

1 We set the beam size to max(1, min(k, L − 3)), where k is the maximum beam size and L is the current length of the input sentence.

SNLI
Premise     Well dressed man and woman dancing in the street
Original    Two man is dancing on the street
Reduced     dancing
Answer      Contradiction
Confidence  0.977 → 0.706

VQA
Original    What color is the flower ?
Reduced     flower ?
Answer      yellow
Confidence  0.827 → 0.819

Figure 2: Examples of original and reduced inputs where the models predict the same Answer. Reduced shows the input after reduction. We remove words from the hypothesis for SNLI, and from the questions for SQuAD and VQA. Given the nonsensical reduced inputs, humans would not be able to provide the answer with high confidence, yet the neural models do.

For all three tasks, we can often reduce the sentence to only one word. Figure 4 compares the model's confidence on original and reduced inputs. On SQuAD and SNLI the confidence decreases slightly, and on VQA the confidence even increases.

2.3 Humans Confused by Reduced Inputs

On the reduced examples, the models retain their original predictions despite the short input lengths. The following experiments examine whether these predictions are justified or pathological, based on how humans react to the reduced inputs.

For each task, we sample 200 examples that are correctly classified by the model and generate their reduced examples. In the first setting, we compare human accuracy on original and reduced examples. We recruit two groups of crowd workers and task them with textual entailment, reading comprehension, or visual question answering. We show one group the original inputs and the other the reduced. Humans are no longer able to give the correct answer, showing a significant accuracy loss on all three tasks (compare Original and Reduced in Table 1).

The second setting examines how random the reduced examples appear to humans.


Figure 3: Distribution of input sentence length before and after reduction (mean length: SQuAD 11.5 → 2.3, SNLI 7.5 → 1.5, VQA 6.2 → 2.3). For all three tasks, the input is often reduced to one or two words without changing the model's prediction.

Figure 4: Density distribution of model confidence on reduced inputs is similar to the original confidence. In SQuAD, we predict the beginning and the end of the answer span, so we show the confidence for both.

For each of the original examples, we generate a version where words are randomly removed until the length matches the one generated by input reduction. We present the original example along with the two reduced examples and ask crowd workers their preference between the two reduced ones. The workers' choice is almost fifty-fifty (the vs. Random in Table 1): the reduced examples appear almost random to humans.
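A hypothetical sketch of the length-matched random baseline used in this comparison (function and variable names are illustrative):

```python
import random

def random_reduction(words, target_length):
    """Length-matched random baseline: keep a uniformly random subset of
    words, preserving order, so the result is as short as the corresponding
    input-reduction output."""
    keep = sorted(random.sample(range(len(words)), target_length))
    return [words[i] for i in keep]
```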

These results leave us with two puzzles: why are the models highly confident on the nonsensical reduced examples? And why, when the leave-one-out method selects important words that appear reasonable to humans, does input reduction select words that appear nonsensical?

3 Making Sense of Reduced Inputs

Having established the incongruity of our definition of importance vis-à-vis human judgments, we now investigate possible explanations for these results.

Dataset    Original   Reduced   vs. Random
SQuAD      80.58      31.72     53.70
SNLI-E     76.40      27.66     42.31
SNLI-N     55.40      52.66     50.64
SNLI-C     76.20      60.60     49.87
VQA        76.11      40.60     61.60

Table 1: Human accuracy on Reduced examples drops significantly compared to the Original examples; however, model predictions are identical. The reduced examples also appear random to humans—they do not prefer them over random inputs (vs. Random). For SQuAD, accuracy is reported using F1 scores; the other numbers are percentages. For SNLI, we report results on the three classes separately: entailment (-E), neutral (-N), and contradiction (-C).

We explain why model confidence can empower methods such as leave-one-out to generate reasonable interpretations but also lead to pathologies under input reduction. We attribute these results to two issues of neural models.

3.1 Model Overconfidence

Neural models are overconfident in their predictions (Guo et al., 2017). One explanation for overconfidence is overfitting: the model overfits the negative log-likelihood loss by learning to output low-entropy distributions over classes. Neural models are also overconfident on examples outside the training data distribution. As Goodfellow et al. (2015) observe for image classification, samples from pure noise can sometimes trigger highly confident predictions. These so-called rubbish examples are degenerate inputs that a human would trivially classify as not belonging to any class but for which the model predicts with high confidence. Goodfellow et al. (2015) argue that rubbish examples exist for the same reason that adversarial examples do: the surprising linear nature of neural models.


SQuAD
Context: The Panthers used the San Jose State practice facility and stayed at the San Jose Marriott. The Broncos practiced at Stanford University and stayed at the Santa Clara Marriott.

Question:
(0.90, 0.89) Where did the Broncos practice for the Super Bowl ?
(0.92, 0.88) Where did the practice for the Super Bowl ?
(0.91, 0.88) Where did practice for the Super Bowl ?
(0.92, 0.89) Where did practice the Super Bowl ?
(0.94, 0.90) Where did practice the Super ?
(0.93, 0.90) Where did practice Super ?
(0.40, 0.50) did practice Super ?

Figure 5: A reduction path for a SQuAD example. The model prediction is always correct and its confidence stays high (shown on the left in parentheses) throughout the reduction. Each line shows the input at that step, with an underline indicating the word to remove next. The question becomes unanswerable immediately after "Broncos" is removed in the first step. However, in the context of the original question, the input gradient considers "Broncos" the least important word.

In short, the confidence of a neural model is not a robust estimate of its prediction uncertainty.

Our reduced inputs satisfy the definition of rubbish examples: humans have a hard time making predictions based on the reduced inputs (Table 1), but models make predictions with high confidence (Figure 4). Starting from a valid example, input reduction transforms it into a rubbish example.

The nonsensical, almost random results are best explained by looking at a complete reduction path (Figure 5). In this example, the transition from valid to rubbish happens immediately after the first step: following the removal of "Broncos", humans can no longer determine which team the question is asking about, but model confidence remains high. Not being able to lower its confidence on rubbish examples—as it is not trained to do so—the model neglects "Broncos" and eventually the process generates nonsensical results.

In this example, the leave-one-out method will not highlight "Broncos". However, this is not a failure of the interpretation method but of the model itself. The model assigns a low importance to "Broncos" in the first step, causing it to be removed—leave-one-out would be able to expose this by not highlighting "Broncos". However, in cases where a similar issue only appears after a few unimportant words are removed, leave-one-out would fail to expose the unreasonable model behavior.

Input reduction can expose deeper issues of model overconfidence and can stress test a model's uncertainty estimation and interpretability.

SQuAD
Context: QuickBooks sponsored a "Small Business Big Game" contest, in which Death Wish Coffee had a 30-second commercial aired free of charge courtesy of QuickBooks. Death Wish Coffee beat out nine other contenders from across the United States for the free advertisement.

Question:
What company won free advertisement due to QuickBooks contest ?
What company won free advertisement due to QuickBooks ?
What company won free advertisement due to ?
What company won free due to ?
What won free due to ?
What won due to ?
What won due to
What won due
What won
What

Figure 6: The heatmap generated with leave-one-out shifts drastically despite only removing the least important word (underlined) at each step. For instance, "advertisement" is the most important word in step two but becomes the least important in step three.

3.2 Second-order Sensitivity

Thus far, we have seen that a neural model's output changes wildly with small changes in its input. We call this first-order sensitivity, because interpretation based on the input gradient is a first-order Taylor expansion of the model near the input (Simonyan et al., 2014). However, the interpretation (e.g., which words are important) also shifts drastically with small input changes (Figure 6). We call this second-order sensitivity.

The shifting heatmap suggests a mismatch between the model's first- and second-order sensitivities. The heatmap shifts when, with respect to the removed word, the model has low first-order sensitivity but high second-order sensitivity.
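One way to observe this second-order effect concretely is to recompute the heatmap after a single deletion, as in Figure 6. The sketch below reuses the illustrative `predict` and `importance` wrappers from Section 2 and simply reports how the top-ranked word changes; it is an illustration, not the paper's measurement protocol:

```python
def heatmap_shift(importance, predict, words):
    """Remove the least important word, recompute the heatmap, and compare
    which word is ranked most important before and after the deletion."""
    label, _ = predict(words)
    before = importance(words, label)
    i = min(range(len(words)), key=lambda j: before[j])      # least important
    reduced = words[:i] + words[i + 1:]
    after = importance(reduced, label)
    top_before = words[max(range(len(words)), key=lambda j: before[j])]
    top_after = reduced[max(range(len(reduced)), key=lambda j: after[j])]
    # The two top words often differ, even though only one supposedly
    # unimportant word was removed.
    return top_before, top_after
```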

Similar issues complicate comparable interpretation methods for image classification models. For example, Ghorbani et al. (2017) modify image inputs so the highlighted features in the interpretation change while maintaining the same prediction. To achieve this, they iteratively modify the input to maximize changes in the distribution of feature importance. In contrast, our shifting heatmap occurs by only removing the least impactful features, without a targeted optimization. They speculate that the steepest gradient directions for the first- and second-order sensitivity values are generally orthogonal. Loosely speaking, the shifting heatmap suggests that the direction of the smallest gradient value can sometimes align with very steep changes in second-order sensitivity.

Page 7: Shi Feng, Eric Wallace, Alvin Grissom II, Pedro Rodriguez, Mohit …users.umiacs.umd.edu/~jbg/docs/2018_emnlp_rs.pdf · 2020-06-05 · Pathologies of Neural Models Make Interpretations

When explaining individual model predictions, the heatmap suggests that the prediction is made based on a weighted combination of words, as in a linear model, which is not true unless the model is indeed taking a weighted sum, such as in a DAN (Iyyer et al., 2015). When the model composes representations by a non-linear combination of words, a linear interpretation oblivious to second-order sensitivity can be misleading.

4 Mitigating Model Pathologies

The previous section explains the observed pathologies from the perspective of overconfidence: models are too certain on rubbish examples when they should not make any prediction. Human experiments (Section 2.3) confirm that reduced examples fit the definition of rubbish examples. Hence, a natural way to mitigate the pathologies is to maximize model uncertainty on the reduced examples.

4.1 Regularization on Reduced Inputs

To maximize model uncertainty on reduced examples, we use the entropy of the output distribution as an objective. Given a model f trained on a dataset (X, Y), we generate reduced examples using input reduction for all training examples X. Beam search often yields multiple reduced versions with the same minimum length for each input x, and we collect all of these versions together to form X̃ as the "negative" example set.

Let $H(\cdot)$ denote the entropy and $f(y \mid x)$ the probability of the model predicting $y$ given $x$. We fine-tune a model to maximize log-likelihood on regular examples and entropy on reduced examples:

$$\sum_{(x, y)} \log\big(f(y \mid x)\big) + \lambda \sum_{\tilde{x} \in \tilde{X}} H\big(f(y \mid \tilde{x})\big), \tag{2}$$

where the hyperparameter λ trades off between the two terms. Similar entropy regularization is used by Pereyra et al. (2017), but not in combination with input reduction; their entropy term is calculated on regular examples rather than reduced examples.
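A minimal PyTorch sketch of this fine-tuning objective, with batching and data loading omitted; `model(x)` is assumed to return logits, `reduced_batch` would come from running input reduction over the training set as described above, and the value of `lam` is illustrative rather than the paper's setting:

```python
import torch
import torch.nn.functional as F

def fine_tune_step(model, optimizer, regular_batch, reduced_batch, lam=1e-3):
    """One step of Eq. (2): maximize log-likelihood on regular examples and
    output entropy on reduced examples."""
    x, y = regular_batch
    log_probs = F.log_softmax(model(x), dim=-1)
    nll = F.nll_loss(log_probs, y)                              # -log f(y | x)

    reduced_log_probs = F.log_softmax(model(reduced_batch), dim=-1)
    entropy = -(reduced_log_probs.exp() * reduced_log_probs).sum(dim=-1).mean()

    loss = nll - lam * entropy        # minimizing -entropy maximizes uncertainty
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```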

4.2 Regularization Mitigates Pathologies

On regular examples, entropy regularization does no harm to model accuracy, with a slight increase for SQuAD (Accuracy in Table 2).

After entropy regularization, input reduction produces more reasonable reduced inputs (Figure 7).

           Accuracy           Reduced length
           Before    After    Before    After
SQuAD      77.41     78.03    2.27      4.97
SNLI       85.71     85.72    1.50      2.20
VQA        61.61     61.54    2.30      2.87

Table 2: Model Accuracy on regular validation examples remains largely unchanged after fine-tuning. However, the length of the reduced examples (Reduced length) increases on all three tasks, making them less likely to appear nonsensical to humans.

In the SQuAD example from Figure 1, the reduced question changed from "did" to "spend Astor money on ?" after fine-tuning. The average length of reduced examples also increases across all tasks (Reduced length in Table 2). To verify that model overconfidence is indeed mitigated—that the reduced examples are less "rubbish" compared to before fine-tuning—we repeat the human experiments from Section 2.3.

Human accuracy increases across all three tasks (Table 3). We also repeat the vs. Random experiment: we re-generate the random examples to match the lengths of the new reduced examples from input reduction, and find humans now prefer the reduced examples to random ones. The increase in both human accuracy and preference suggests that the reduced examples are more reasonable; model pathologies have been mitigated.

This is promising, but do we need input reduction to reduce overconfidence? To provide a baseline, we fine-tune models using randomly reduced inputs (with the same lengths as input reduction). This baseline improves neither model accuracy nor interpretability (i.e., the reduced examples are just as short). Input reduction provides negative examples that curb overconfidence.

5 Discussion

Rubbish examples have been studied in images (Goodfellow et al., 2015; Nguyen et al., 2015), but to our knowledge not for NLP. Our input reduction gradually transforms a valid input into a rubbish example. We can often determine which word's removal is the tipping point—for example, removing "Broncos" in Figure 5. These rubbish examples are interesting, as they are also adversarial: they resemble valid examples. In contrast, image rubbish examples generated from noise lie outside the training data distribution and resemble static.


SQuAD
Context     In 1899, John Jacob Astor IV invested $100,000 for Tesla to further develop and produce a new lighting system. Instead, Tesla used the money to fund his Colorado Springs experiments.
Original    What did Tesla spend Astor's money on ?
Answer      Colorado Springs experiments
Before      did
After       spend Astor money on ?
Confidence  0.78 → 0.91 → 0.52

SNLI
Premise     Well dressed man and woman dancing in the street
Original    Two man is dancing on the street
Answer      Contradiction
Before      dancing
After       two man dancing
Confidence  0.977 → 0.706 → 0.717

VQA
Original    What color is the flower ?
Answer      yellow
Before      flower ?
After       What color is flower ?
Confidence  0.847 → 0.918 → 0.745

Figure 7: SQuAD example from Figure 1, SNLI and VQA (image omitted) examples from Figure 2. We apply input reduction to models both Before and After entropy regularization. The models still predict the same Answer, but the reduced examples after fine-tuning appear more reasonable to humans.

The robustness of NLP models has been studied extensively (Papernot et al., 2016; Jia and Liang, 2017; Iyyer et al., 2018; Ribeiro et al., 2018), and most studies define adversarial examples similarly to the image domain: small perturbations to the input lead to large changes in the output. HotFlip (Ebrahimi et al., 2017) uses a gradient-based approach, similar to image adversarial examples, to flip the model prediction by perturbing a few characters or words. Our work and Belinkov and Bisk (2018) both identify cases where noisy user inputs become adversarial by accident: common misspellings break neural machine translation models; we show that incomplete user input can lead to unreasonably high model confidence.

Other failures of interpretation methods have been explored in the image domain. The sensitivity issue of gradient-based interpretation methods, similar to our shifting heatmaps, is observed by Ghorbani et al. (2017) and Kindermans et al. (2017). They show that various forms of input perturbation—from adversarial changes to simple constant shifts in the image input—cause significant changes in the interpretation.

           Accuracy           vs. Random
           Before    After    Before    After
SQuAD      31.72     51.61    53.70     62.75
SNLI-E     27.66     32.37    42.31     50.62
SNLI-N     52.66     50.50    50.64     58.94
SNLI-C     60.60     63.90    49.87     56.92
VQA        40.60     51.85    61.60     61.88

Table 3: Human Accuracy increases after fine-tuning the models. Humans also prefer gradient-based reduced examples over randomly reduced ones, indicating that the reduced examples are more meaningful to humans after regularization.

Ghorbani et al. (2017) make a similar observation about second-order sensitivity, that "the fragility of interpretation is orthogonal to fragility of the prediction".

Previous work studies biases in the annotation process that lead to datasets that are easier than desired or expected, which eventually induce pathological models. We attribute our observed pathologies primarily to the lack of accurate uncertainty estimates in neural models trained with maximum likelihood. SNLI hypotheses contain artifacts that allow training a model without the premises (Gururangan et al., 2018); we apply input reduction at test time to the hypothesis. Similarly, VQA images are surprisingly unimportant for training a model; we reduce the question. The recent SQuAD 2.0 (Rajpurkar et al., 2018) augments the original reading comprehension task with an uncertainty modeling requirement, the goal being to make the task more realistic and challenging. Question answering tasks such as Quizbowl combine knowing what to answer with knowing when there is sufficient information for the agent or an opponent to answer (He et al., 2016).

Section 3.1 explains the pathologies from the overconfidence perspective. One explanation for overconfidence is overfitting: Guo et al. (2017) show that, late in maximum likelihood training, the model learns to minimize loss by outputting low-entropy distributions without improving validation accuracy. To examine if overfitting can explain the input reduction results, we run input reduction using DrQA model checkpoints from every training epoch. Input reduction still achieves similar results on earlier checkpoints, suggesting that better convergence in maximum likelihood training cannot fix the issues by itself—we need new training objectives with uncertainty estimation in mind.


5.1 Methods for Mitigating Pathologies

We regularize the model using input reduction's reduced examples and improve its interpretability. This resembles adversarial training (Goodfellow et al., 2015), where adversarial examples are added to the training set to improve model robustness. The objectives are different: entropy regularization encourages high uncertainty on rubbish examples, while adversarial training makes the model less sensitive to adversarial perturbations.

Pereyra et al. (2017) apply entropy regularization on regular examples from the start of training to improve model generalization. A similar method is label smoothing (Szegedy et al., 2016). In comparison, we fine-tune a model with entropy regularization on the reduced examples for better uncertainty estimates and interpretations.

To mitigate overconfidence, Guo et al. (2017) post-hoc fine-tune a model's confidence with Platt scaling. This method adjusts the softmax function's temperature parameter using a small held-out dataset to align confidence with accuracy. However, because the output is calibrated using the entire confidence distribution, not individual values, this does not reduce overconfidence on specific inputs, such as the reduced examples.
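As a point of reference, a minimal sketch of temperature scaling, the single-parameter variant of Platt scaling used by Guo et al. (2017), fit on a small held-out set; all names and hyperparameters here are illustrative:

```python
import torch

def calibrate_temperature(logits, labels, lr=0.01, steps=200):
    """Fit a single temperature T on held-out (logits, labels) so that
    softmax(logits / T) is better calibrated."""
    log_t = torch.zeros(1, requires_grad=True)            # T = exp(log_t) > 0
    optimizer = torch.optim.LBFGS([log_t], lr=lr, max_iter=steps)

    def closure():
        optimizer.zero_grad()
        loss = torch.nn.functional.cross_entropy(logits / log_t.exp(), labels)
        loss.backward()
        return loss

    optimizer.step(closure)
    return log_t.exp().item()
```

Because a single temperature rescales every prediction uniformly, it cannot selectively lower confidence on the reduced examples, which is the limitation noted above.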

5.2 Generalizability of Findings

To highlight the erratic model predictions on short examples and provide a more intuitive demonstration, we present paired-input tasks. On these tasks, the reduced questions and hypotheses are obviously too short for a human to make a prediction (as further supported by our human studies). We also apply input reduction to single-input tasks including sentiment analysis (Maas et al., 2011) and Quizbowl (Boyd-Graber et al., 2012), achieving similar results.

Interestingly, the reduced examples transfer to other architectures. In particular, when we feed fifty reduced SNLI inputs from each class—generated with the BiMPM model (Wang et al., 2017)—through the Decomposable Attention Model (Parikh et al., 2016),2 the same prediction is triggered 81.3% of the time.

2 http://demo.allennlp.org/textual-entailment

6 Conclusion

We introduce input reduction, a process that iteratively removes unimportant input words while maintaining a model's prediction. Combined with gradient-based importance estimates, we expose pathological behaviors of neural models. Without lowering model confidence on its original predictions, an input sentence can be reduced to the point where it appears nonsensical, often consisting of only one or two words. Humans cannot understand the reduced examples, but neural models maintain their original predictions.

We explain these pathologies with known issues of neural models: overconfidence and sensitivity to small input changes. Inaccurate uncertainty estimates cause the nonsensical reduced examples: the model cannot lower its confidence on ill-formed inputs. Second-order sensitivity also explains why gradient-based interpretation methods contradict human expectations: a small change in the input can cause a minor change in the prediction but a large change in the interpretation. Input reduction's many small perturbations—a stress test—can expose deeper issues of model overconfidence and oversensitivity that other methods cannot.

To properly interpret neural models, it is important to understand their fundamental characteristics: the nature of their decision surfaces, robustness against adversaries, and limitations of their training objectives. We explain fundamental difficulties of interpretation due to pathologies in neural models trained with maximum likelihood. Our work suggests several future directions to improve interpretability: more thorough evaluation of interpretation methods, better uncertainty and confidence estimates, and interpretation beyond bag-of-words heatmaps.

More practically, NLP systems are often designed as pipelines: speech recognition or machine translation feeds into a downstream classification task. The classifier is typically trained on uncorrupted English text. Input reduction suggests that both the outputs and confidences of such naïve pipelines should be treated with greater suspicion.


Acknowledgments

Feng was supported under subcontract to Raytheon BBN Technologies by DARPA award HR0011-15-C-0113. JBG is supported by NSF Grant IIS-1652666. Any opinions, findings, conclusions, or recommendations expressed here are those of the authors and do not necessarily reflect the view of the sponsor. The authors would like to thank Hal Daumé III, Alexander M. Rush, Nicolas Papernot, members of the CLIP lab at the University of Maryland, and the anonymous reviewers for their feedback.

References

Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C. Lawrence Zitnick, and Devi Parikh. 2015. VQA: Visual question answering. In International Conference on Computer Vision.

Leila Arras, Franziska Horn, Grégoire Montavon, Klaus-Robert Müller, and Wojciech Samek. 2016. Explaining predictions of non-linear classifiers in NLP. In Workshop on Representation Learning for NLP.

David Baehrens, Timon Schroeter, Stefan Harmeling, Motoaki Kawanabe, Katja Hansen, and Klaus-Robert Müller. 2010. How to explain individual classification decisions. Journal of Machine Learning Research.

Yonatan Belinkov and Yonatan Bisk. 2018. Synthetic and natural noise both break neural machine translation. In Proceedings of the International Conference on Learning Representations.

Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. 2015. A large annotated corpus for learning natural language inference. In Proceedings of Empirical Methods in Natural Language Processing.

Jordan L. Boyd-Graber, Brianna Satinoff, He He, and Hal Daumé III. 2012. Besting the quiz master: Crowdsourcing incremental classification games. In Proceedings of Empirical Methods in Natural Language Processing.

Danqi Chen, Adam Fisch, Jason Weston, and Antoine Bordes. 2017. Reading Wikipedia to answer open-domain questions. In Proceedings of the Association for Computational Linguistics.

Javid Ebrahimi, Anyi Rao, Daniel Lowd, and Dejing Dou. 2017. HotFlip: White-box adversarial examples for text classification. In Proceedings of the Association for Computational Linguistics.

Amirata Ghorbani, Abubakar Abid, and James Y. Zou. 2017. Interpretation of neural networks is fragile. arXiv preprint arXiv:1710.10547.

Ian J. Goodfellow, Jonathon Shlens, and Christian Szegedy. 2015. Explaining and harnessing adversarial examples. In Proceedings of the International Conference on Learning Representations.

Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q. Weinberger. 2017. On calibration of modern neural networks. In Proceedings of the International Conference of Machine Learning.

Suchin Gururangan, Swabha Swayamdipta, Omer Levy, Roy Schwartz, Samuel R. Bowman, and Noah A. Smith. 2018. Annotation artifacts in natural language inference data. In Conference of the North American Chapter of the Association for Computational Linguistics.

He He, Kevin Kwok, Jordan Boyd-Graber, and Hal Daumé III. 2016. Opponent modeling in deep reinforcement learning. In Proceedings of the International Conference of Machine Learning.

Mohit Iyyer, Varun Manjunatha, Jordan Boyd-Graber, and Hal Daumé III. 2015. Deep unordered composition rivals syntactic methods for text classification. In Proceedings of the Association for Computational Linguistics.

Mohit Iyyer, John Wieting, Kevin Gimpel, and Luke S. Zettlemoyer. 2018. Adversarial example generation with syntactically controlled paraphrase networks. In Conference of the North American Chapter of the Association for Computational Linguistics.

Robin Jia and Percy Liang. 2017. Adversarial examples for evaluating reading comprehension systems. In Proceedings of Empirical Methods in Natural Language Processing.

Vahid Kazemi and Ali Elqursh. 2017. Show, ask, attend, and answer: A strong baseline for visual question answering. arXiv preprint arXiv:1704.03162.

Pieter-Jan Kindermans, Sara Hooker, Julius Adebayo, Maximilian Alber, Kristof T. Schütt, Sven Dähne, Dumitru Erhan, and Been Kim. 2017. The (un)reliability of saliency methods. arXiv preprint arXiv:1711.00867.

Jiwei Li, Xinlei Chen, Eduard H. Hovy, and Daniel Jurafsky. 2016a. Visualizing and understanding neural models in NLP. In Conference of the North American Chapter of the Association for Computational Linguistics.

Jiwei Li, Will Monroe, and Daniel Jurafsky. 2016b. Understanding neural networks through representation erasure. arXiv preprint arXiv:1612.08220.

Andrew Maas, Raymond Daly, Peter Pham, Dan Huang, Andrew Ng, and Christopher Potts. 2011. Learning word vectors for sentiment analysis. In Proceedings of the Association for Computational Linguistics.

W. James Murdoch, Peter J. Liu, and Bin Yu. 2018. Beyond word importance: Contextual decomposition to extract interactions from LSTMs. In Proceedings of the International Conference on Learning Representations.

Anh Mai Nguyen, Jason Yosinski, and Jeff Clune. 2015. Deep neural networks are easily fooled: High confidence predictions for unrecognizable images. In Computer Vision and Pattern Recognition.

Nicolas Papernot, Patrick D. McDaniel, Ananthram Swami, and Richard E. Harang. 2016. Crafting adversarial input sequences for recurrent neural networks. In IEEE Military Communications Conference.

Ankur P. Parikh, Oscar Täckström, Dipanjan Das, and Jakob Uszkoreit. 2016. A decomposable attention model for natural language inference. In Proceedings of Empirical Methods in Natural Language Processing.

Gabriel Pereyra, George Tucker, Jan Chorowski, Lukasz Kaiser, and Geoffrey E. Hinton. 2017. Regularizing neural networks by penalizing confident output distributions. In Proceedings of the International Conference on Learning Representations.

Pranav Rajpurkar, Robin Jia, and Percy Liang. 2018. Know what you don't know: Unanswerable questions for SQuAD. In Proceedings of the Association for Computational Linguistics.

Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of Empirical Methods in Natural Language Processing.

Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. 2016. "Why should I trust you?": Explaining the predictions of any classifier. In Knowledge Discovery and Data Mining.

Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. 2018. Semantically equivalent adversarial rules for debugging NLP models. In Proceedings of the Association for Computational Linguistics.

Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. 2014. Deep inside convolutional networks: Visualising image classification models and saliency maps. In Proceedings of the International Conference on Learning Representations.

Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. 2016. Rethinking the inception architecture for computer vision. In Computer Vision and Pattern Recognition.

Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian J. Goodfellow, and Rob Fergus. 2014. Intriguing properties of neural networks. In Proceedings of the International Conference on Learning Representations.

Zhiguo Wang, Wael Hamza, and Radu Florian. 2017. Bilateral multi-perspective matching for natural language sentences. In International Joint Conference on Artificial Intelligence.