
What Question Answering can Learn from Trivia Nerds

Jordan Boyd-Graber (iSchool, CS, UMIACS, LSC, University of Maryland; Google Research, Zürich)
Benjamin Börschinger (Google Research, Zürich)
jbg@umiacs.umd.edu, {jbg, bboerschinger}@google.com

Abstract

In addition to machines answering questions, question answering (QA) research creates interesting, challenging questions that reveal the best systems. We argue that creating a QA dataset—and its ubiquitous leaderboard—closely resembles running a trivia tournament: you write questions, have agents—humans or machines—answer questions, and declare a winner. However, the research community has ignored the lessons from decades of the trivia community creating vibrant, fair, and effective QA competitions. After detailing problems with existing QA datasets, we outline several lessons that transfer to QA research: removing ambiguity, discriminating skill, and adjudicating disputes.

1 Introduction

This paper takes an unconventional approach to answering "where we've been and where we're going" in question answering (QA). Instead of approaching the question only as ACL researchers, we apply the best practices of trivia tournaments to QA datasets.

The QA community is obsessed with evaluation. Schools, companies, and newspapers hail new SOTAs and topping leaderboards, giving rise to claims that an "AI model tops humans" (Najberg, 2018) because it 'won' some leaderboard, putting "millions of jobs at risk" (Cuthbertson, 2018). But what is a leaderboard? A leaderboard is a statistic about QA accuracy that induces a ranking over participants.

Newsflash: this is the same as a trivia tournament. The trivia community has been doing this for decades (Jennings, 2006); Section 2 details this overlap between the qualities of a first-class QA dataset (and its requisite leaderboard). The experts running these tournaments are imperfect, but they've learned from their past mistakes (see Appendix A for a brief historical perspective) and created a community that reliably identifies those best at question answering. Beyond the format of the competition, important safeguards ensure individual questions are clear, unambiguous, and reward knowledge (Section 3).

We are not saying that academic QA should surrender to trivia questions or the community—far from it! The trivia community does not understand the real-world information-seeking needs of users or what questions challenge computers. However, they know how, given a bunch of questions, to declare that someone is better at answering questions than another. This collection of tradecraft and principles can help the QA community.

Beyond these general concepts that QA can learn from, Section 4 reviews how the "gold standard" of trivia formats, Quizbowl, can improve traditional QA. We then briefly discuss how research that uses fun, fair, and good trivia questions can benefit from the expertise, pedantry, and passion of the trivia community (Section 5).

2 Surprise, this is a Trivia Tournament!

"My research isn't a silly trivia tournament," you say. That may be, but let us first tell you a little about what running a tournament is like, and perhaps you might see similarities.

First, the questions. Either you write them yourself or you pay someone to write them (sometimes people on the Internet). There is a fixed number of questions you need to hit by a particular date.

Then, you advertise. You talk about your questions: who is writing them, what subjects are covered, and why people should try to answer them.

Next, you have the tournament. You keep your questions secure until test time, collect answers from all participants, and declare a winner. Afterward, people use the questions to train for future tournaments.



These have natural analogs to crowdsourcing questions, writing the paper, advertising, and running a leaderboard. The biases of academia put much more emphasis on the paper, but there are components where trivia tournament best practices could help. In particular, we focus on fun, well-calibrated, and discriminative tournaments.

2.1 Are we Having Fun?

Many datasets use crowdworkers to establish human accuracy (Rajpurkar et al., 2016; Choi et al., 2018). However, these are not the only humans who should answer a dataset's questions. So should the datasets' creators.

In the trivia world, this is called a play test: put yourself in the shoes of someone answering the questions. If you find them boring, repetitive, or uninteresting, so will crowdworkers; and if you, as a human, can find shortcuts to answer questions (Rondeau and Hazen, 2018), so will a computer.

Concretely, Weissenborn et al. (2017) catalog artifacts in SQuAD (Rajpurkar et al., 2018), arguably the most popular computer QA leaderboard; and indeed, many of the questions are not particularly fun to answer from a human perspective—they are fairly formulaic. If you see a list like "Along with Canada and the United Kingdom, what country. . . ", you can ignore the rest of the question and just type Ctrl+F (Russell, 2020; Yuan et al., 2019) to find the third country—Australia in this case—that appears with "Canada and the UK" (a shortcut sketched in code after the footnotes below). Other times, a SQuAD playtest would reveal frustrating questions that are i) answerable given the information in the paragraph but not with a direct span [1], ii) answerable only given facts beyond the given paragraph [2], iii) unintentionally embedded in a discourse, resulting in linguistically odd questions with arbitrary correct answers [3], or iv) non-questions.

[1] A source paragraph says "In [Commonwealth countries]. . . the term is generally restricted to. . . Private education in North America covers the whole gamut. . . "; thus, the question "What is the term private school restricted to in the US?" has the information needed but not as a span.

[2] A source paragraph says "Sculptors [in the collection include] Nicholas Stone, Caius Gabriel Cibber, [...], Thomas Brock, Alfred Gilbert, [...] and Eric Gill[.]", i.e., a list of names; thus, the question "Which British sculptor whose work includes the Queen Victoria memorial in front of Buckingham Palace is included in the V&A collection?" should be unanswerable in traditional machine reading.

[3] A question "Who else did Luther use violent rhetoric towards?" has the gold answer "writings condemning the Jews and in diatribes against Turks".
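To make the Ctrl+F shortcut concrete, here is a minimal sketch (ours, not from any published SQuAD analysis) of a pattern-matching "system" that answers the "Along with Canada and the United Kingdom, what country. . . " style of question without reading what the question actually asks; the paragraph and regular expression are illustrative.

import re

# Illustrative paragraph and question; neither is taken from SQuAD itself.
paragraph = ("The monarch serves as head of state of Australia, "
             "along with Canada and the United Kingdom.")
question = ("Along with Canada and the United Kingdom, what country "
            "retains the monarch as head of state?")

# A Ctrl+F-style shortcut: grab the capitalized token that co-occurs with the
# two countries named in the question, ignoring everything else the question asks.
match = re.search(r"([A-Z][a-z]+), along with Canada and the United Kingdom", paragraph)
if match:
    print(match.group(1))  # prints "Australia" with no understanding of the question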

Or consider SearchQA (Dunn et al., 2017), derived from the game Jeopardy!, which asks "An article that he wrote about his riverboat days was eventually expanded into Life on the Mississippi." The young apprentice and newspaper writer who wrote the article is named Samuel Clemens; however, the reference answer is that author's later pen name, Mark Twain. Most QA evaluation metrics would count Samuel Clemens as incorrect. In a real game of Jeopardy!, this would not be an issue (Section 3.1).

Of course, fun is relative, and any dataset is bound to contain at least some errors. However, playtesting is an easy way to find systematic problems in your dataset: unfair, unfun playtests make for ineffective leaderboards. Eating your own dog food can help diagnose artifacts, scoring issues, or other shortcomings early in the process.

Boyd-Graber et al. (2012) created an interface for play testing that was fun enough that people played for free. After two weeks the site was taken down, but it was popular enough that the trivia community forked the open source code to create a bootleg version that is still going strong almost a decade later.

The deeper issues when creating a QA task are: have you designed a task that is internally consistent, supported by a scoring metric that matches your goals (more on this in a moment), and using gold annotations that correctly reward those who do the task well? Imagine someone who loves answering the questions your task poses: would they have fun on your task? If so, you may have a good dataset. Gamification (von Ahn, 2006) harnesses users' passion, a better motivator than traditional paid labeling. Even if you pay crowdworkers, if your questions are particularly unfun, you may need to think carefully about your dataset and your goals.

2.2 Am I Measuring what I Care About?

Answering questions requires multiple skills: identifying where an answer is mentioned (Hermann et al., 2015), knowing the canonical name for the answer (Yih et al., 2015), realizing when to answer and when to abstain (Rajpurkar et al., 2018), or being able to justify an answer explicitly with evidence (Thorne et al., 2018). In QA, the emphasis on SOTA and leaderboards has focused attention on single automatically computable metrics—systems tend to be compared by their 'SQuAD score' or their 'NQ score', as if this were all there is to say about their relative capabilities.


Figure 1: Two datasets with 0.16 annotation error, but the top better discriminates QA ability. In the good dataset (top), most questions are challenging but not impossible. In the bad dataset (bottom), there are more trivial or impossible questions, and annotation error is concentrated on the challenging, discriminative questions. Thus, a smaller fraction of questions decide who sits atop the leaderboard, requiring a larger test set.

Like QA leaderboards, trivia tournaments need to decide on a single winner, but they explicitly recognize that there are more interesting comparisons to be made.

For example, a tournament may recognize different backgrounds/resources—high school, small school, undergrads (Henzel, 2018). Similarly, more practical leaderboards would reflect training time or resource requirements (see Dodge et al., 2019), including 'constrained' or 'unconstrained' training (Bojar et al., 2014). Tournaments also give awards for specific skills (e.g., fewest incorrect answers). Again, there are obvious leaderboard analogs that would go beyond a single number. For example, in SQuAD 2.0, abstaining contributes the same to the overall F1 as a fully correct answer, obscuring whether a system is more precise or an effective abstainer. If the task recognizes both abilities as important, reporting a single score risks implicitly prioritizing one balance of the two.

As a positive example of being explicit, the 2018 FEVER shared task (Thorne et al., 2018) favors getting the correct final answer over exhaustively justifying said answer, with a metric that requires only one piece of evidence per question. However, by still breaking out the full justification precision and recall scores in the leaderboard, it is clear that the most precise system was only in fourth place on the "primary" leaderboard, and that the top three "primary metric" winners compromise on evidence selection.

2.3 Do my Questions Separate the Best?

Let us assume that you have picked a metric (or a set of metrics) that captures what you care about: systems answer questions correctly, abstain when they cannot, explain why they answered the way they did, or whatever facet of QA is most important for your dataset. Now, this leaderboard can rack up citations as people chase the top spot. But your leaderboard is only useful if it is discriminative: does it separate the best from the rest?

There are many ways questions might not be discriminative. If every system gets a question right (e.g., abstains on "asdf" or correctly answers "What is the capital of Poland?"), it does not separate participants. Similarly, if every system flubs "what is the oldest north-facing kosher restaurant", it also does not discriminate between systems. Sugawara et al. (2018) call these questions "easy" and "hard"; we instead argue for a three-way distinction.

In between easy questions (system answers correctly with probability 1.0) and hard ones (probability 0.0), questions with probabilities nearer to 0.5 are more interesting. Taking a cue from Vygotsky's proximal development theory of human learning (Chaiklin, 2003), these discriminative questions—rather than the easy or the hard ones—should help improve QA systems the most. These Goldilocks questions [4] are also most important for deciding who will sit atop the leaderboard; ideally they (and not random noise) will decide. Unfortunately, many existing datasets seem to have many easy questions; Sugawara et al. (2020) find that ablations like shuffling word order (Feng and Boyd-Graber, 2019), shuffling sentences, or only offering the most similar sentence do not impair systems, and Rondeau and Hazen (2018) hypothesize that most QA systems do little more than pattern matching, impressive performance numbers notwithstanding.

2.4 Why so few Goldilocks Questions?

This is a common problem in trivia tournaments, particularly pub quizzes (Diamond, 2009), where questions that are too difficult can scare off patrons. Many quiz masters prefer popularity over discrimination and thus prefer easier questions.

Sometimes there are fewer Goldilocks questions not by choice but by chance: a dataset becomes less discriminative through annotation error. All datasets have some annotation error; if this annotation error is concentrated on the Goldilocks questions, the dataset will be less useful.

[4] In a British folktale first recorded by Robert Southey, an interloper, "Goldilocks", finds three bowls of porridge: one too hot, one too cold, and one "just right". Goldilocks questions' difficulty is likewise "just right".


As we write this in 2019, humans and computers sometimes struggle on the same questions. Thus, annotation error is likely to be correlated with which questions will determine who sits atop a leaderboard.

Figure 1 shows two datasets with the same annotation error and the same number of overall questions. However, they have different difficulty distributions and correlations of annotation error and difficulty. The dataset that has more discriminative questions and consistent annotator error has fewer questions that are effectively useless for determining the winner of the leaderboard. We call this the effective dataset proportion ρ. A dataset with no annotation error and only discriminative questions has ρ = 1.0, while the bad dataset in Figure 1 has ρ = 0.16. Figure 2 shows the test set size required to reliably discriminate systems for different values of ρ, based on a simulation described in Appendix B.
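As a rough illustration of why ρ matters (a minimal sketch of our own, not the simulation described in Appendix B), one can posit two systems whose accuracies differ by a few points, assume only a fraction ρ of the test set can separate them at all, and measure how often a sign test over their disagreements detects the better system:

import numpy as np
from scipy.stats import binomtest

def detection_rate(n_questions, acc_a, acc_b, rho, alpha=0.05, trials=1000, seed=0):
    """Fraction of simulated test sets on which a one-sided sign test detects
    that system A is better than system B, when only a fraction rho of the
    questions can discriminate between systems at all."""
    rng = np.random.default_rng(seed)
    n_eff = int(n_questions * rho)          # effective (discriminative) questions
    detected = 0
    for _ in range(trials):
        a = rng.random(n_eff) < acc_a       # per-question correctness of system A
        b = rng.random(n_eff) < acc_b       # per-question correctness of system B
        wins_a = int(np.sum(a & ~b))        # A right, B wrong
        wins_b = int(np.sum(b & ~a))        # B right, A wrong
        if wins_a + wins_b == 0:
            continue
        p = binomtest(wins_a, wins_a + wins_b, 0.5, alternative="greater").pvalue
        detected += p < alpha
    return detected / trials

# A five-point accuracy gap on a 2,500-question test set: rho decides whether we see it.
for rho in (0.25, 1.0):
    print(rho, detection_rate(2500, acc_a=0.75, acc_b=0.70, rho=rho))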

At this point, you might be despairing about how big you need your dataset to be. [5] The same terror transfixes trivia tournament organizers. We discuss a technique they use to make individual questions more discriminative, a property called pyramidality, in Section 4.

3 The Craft of Question Writing

One thing that trivia enthusiasts agree on is that questions need to be good, well-written questions. Research shows that asking "good questions" requires sophisticated pragmatic reasoning (Hawkins et al., 2015), and pedagogy explicitly acknowledges the complexity of writing effective questions for assessing student performance (for a book-length treatment of writing good multiple-choice questions, see ?). There is no question that writing good questions is a craft in its own right.

QA datasets, however, are often collected from the wild or written by untrained crowdworkers. Even assuming they are confident users of English (the primary language of QA datasets), crowdworkers lack experience in crafting questions and may introduce idiosyncrasies that shortcut machine learning (Geva et al., 2019). Similarly, data collected from the wild, such as Natural Questions (Kwiatkowski et al., 2019), by design have vast variations in quality.

[5] Indeed, using a more sophisticated simulation approach, Voorhees (2003) found that the TREC 2002 QA test set could not discriminate systems with less than a seven-point absolute score difference.

In the previous section, we focused on how datasets as a whole should be structured. Now, we focus on how specific questions should be structured to make the dataset as valuable as possible.

3.1 Avoiding Ambiguity and Assumptions

Ambiguity in questions not only frustrates answerers who resolved the ambiguity 'incorrectly'; it also undermines the usefulness of questions for precisely assessing knowledge at all. For this reason, the US Department of Transportation explicitly bans ambiguous questions from exams for flight instructors (?); and the trivia community has developed rules that prevent ambiguity from arising. While this is true in many contexts, examples are rife in a format called Quizbowl (Boyd-Graber et al., 2012), whose very long questions [6] showcase trivia writers' tactics. For example, Zhu Ying (2005 PARFAIT) warns [emphasis added]:

He's not Sherlock Holmes, but his address is 221B. He's not the Janitor on Scrubs, but his father is played by R. Lee Ermey. [. . . ] For ten points, name this misanthropic, crippled, Vicodin-dependent central character of a FOX medical drama.
ANSWER: Gregory House, MD

to head off these wrong answers.

In contrast, QA datasets often contain ambiguous and under-specified questions. While this sometimes reflects real-world complexities such as actual under-specified or ill-formed search queries (Faruqui and Das, 2018; Kwiatkowski et al., 2019), simply ignoring this ambiguity is problematic. As a concrete example, consider Natural Questions (Kwiatkowski et al., 2019), where the gold answer to "what year did the us hockey team won [sic] the olympics" has the answers 1960 and 1980, ignoring the US women's team, which won in 1998 and 2018, and further assuming the query is about ice rather than field hockey (also an Olympic sport). This ambiguity is a post hoc interpretation, as the associated Wikipedia pages are, indeed, about the United States men's national ice hockey team. We contend the ambiguity persists in the original question, and the information retrieval step arbitrarily provides one of many interpretations. True to its name, such under-specified queries make up a substantial (if not almost the entire) fraction of natural questions in search contexts.

[6] Like Jeopardy!, they are not syntactically questions but still are designed to elicit knowledge-based responses; for consistency, we will still call them questions.


[Figure 2: three panels showing the number of test questions needed versus the accuracy difference between two systems, for effective dataset proportions ρ = 0.25, 0.85, and 1.00; curves vary with the systems' average accuracy.]

Figure 2: How much test data do you need to discriminate two systems with 95% confidence? This depends on both the difference in accuracy between the systems (x axis) and the average accuracy of the systems (closer to 50% is harder). Test set creators do not have much control over those. They do have control, however, over how many questions are discriminative. If all questions are discriminative (right), you only need 2500 questions, but if three quarters of your questions are too easy, too hard, or have annotation errors (left), you'll need 15000.

The problem is neither that such questions exist nor that MRQA considers questions relative to the associated context. The problem is that tasks do not explicitly acknowledge the original ambiguity and instead gloss over the implicit assumptions used in creating the data. This introduces potential noise and bias (i.e., giving a bonus to systems that make the same assumptions as the dataset) into leaderboard rankings. At best, these will become part of the measurement error of datasets (no dataset is perfect). At worst, they will recapitulate the biases that went into the creation of the datasets. Then, the community will implicitly equate the biases with correctness: you get high scores if you adopt this set of assumptions. From there, these assumptions enter into real-world systems, further perpetuating the bias. Playtesting can reveal these issues (Section 2.1), as implicit assumptions can rob a player of correctly answered questions. If you wanted to answer 2014 to "when did Michigan last win the championship"—when the Michigan State Spartans won the Women's Cross Country championship—and you cannot because you chose the wrong school, the wrong sport, and the wrong gender, you would complain as a player; researchers instead discover latent assumptions that crept into the data. [7]

It is worth emphasizing that this is not a purely hypothetical problem. For example, Open Domain Retrieval Question Answering (Lee et al., 2019) deliberately avoids providing a reference context for the question in its framing but, in re-purposing data such as Natural Questions, opaquely relies on it for the gold answers.

[7] Where to draw the line is a matter of judgment; computers—who lack common sense—might find questions ambiguous where humans would not.


3.2 Avoiding Superficial Evaluations

A related issue is that, in the words of Voorhees and Tice (2000), "there is no such thing as a question with an obvious answer". As a consequence, trivia question authors take care to delineate acceptable and unacceptable answers.

For example, Robert Chu (Harvard Fall XI) uses a mental model of an answerer to explicitly delineate the range of acceptable correct answers:

In Newtonian gravity, this quantity satisfies Poisson's equation. [. . . ] For a dipole, this quantity is given by negative the dipole moment dotted with the electric field. [. . . ] For 10 points, name this form of energy contrasted with kinetic.
ANSWER: potential energy (prompt on energy; accept specific types like electrical potential energy or gravitational potential energy; do not accept or prompt on just "potential")

Likewise, the style guides for writing questions stipulate that you must give the answer type clearly and early on. These mentions specify whether you want a book, a collection, a movement, etc. They also signal the level of specificity requested. For example, a question about a date must state "day and month required" (September 11), "month and year required" (April 1968), or "day, month, and year required" (September 1, 1939). This is true for other answers as well: city and team, party and country, or more generally "two answers required".


Despite all of these conventions, no pre-defined set of answers is perfect, and a process for adjudicating answers is an integral part of trivia competitions.
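As a minimal sketch of what machine-readable answer lines with this level of detail could look like (the schema and field names below are our own illustration, not an existing standard), an adjudication-aware checker might distinguish accepted, prompted, and rejected responses:

from dataclasses import dataclass, field
from typing import List

@dataclass
class AnswerLine:
    canonical: str
    accept: List[str] = field(default_factory=list)   # full credit
    prompt: List[str] = field(default_factory=list)   # ask the player to be more specific
    reject: List[str] = field(default_factory=list)   # explicitly no credit

def judge(response: str, line: AnswerLine) -> str:
    r = response.strip().lower()
    if r in (x.lower() for x in line.reject):
        return "reject"
    if r == line.canonical.lower() or r in (x.lower() for x in line.accept):
        return "accept"
    if r in (x.lower() for x in line.prompt):
        return "prompt"
    return "reject"

# Encoding the potential-energy answer line from the example above.
potential_energy = AnswerLine(
    canonical="potential energy",
    accept=["electrical potential energy", "gravitational potential energy"],
    prompt=["energy"],
    reject=["potential"],
)
print(judge("energy", potential_energy))     # prompt
print(judge("potential", potential_energy))  # reject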

In major high school and college national competitions and game shows, if low-level staff cannot resolve the issue by either throwing out a single question or accepting minor variations (America instead of USA), they contact the tournament director. The tournament director—with their deeper knowledge of rules and questions—often decides the issue. If not, the protest goes through an adjudication process designed to minimize bias: [8] write a summary of the dispute, get all parties to agree to the summary, and then hand the decision off to mutually agreed experts from the tournament's phone tree. The substance of the disagreement is communicated (without identities), and the experts apply the rules and decide.

For example, a particularly inept Jeopardy! contestant [9] answered endoscope to "Your surgeon could choose to take a look inside you with this type of fiber-optic instrument". Since the van Doren scandal (Freedman, 1997), every television trivia contestant has an advocate assigned from an auditing company. In this case, the advocate initiated a process that went to a panel of judges, who then ruled that endoscope (a more general term) was also correct.

The need for a similar process seems to have been well recognized in the earliest days of QA system bake-offs such as TREC-QA, and Voorhees (2008) notes that

[d]ifferent QA runs very seldom return exactly the same [answer], and it is quite difficult to determine automatically whether the difference [. . . ] is significant.

In stark contrast to this, QA datasets typically only provide a single string or, if one is lucky, several strings. A correct answer means exactly matching these strings or at least having a high token-overlap F1, and failure to agree with the pre-recorded admissible answers will put you at an uncontestable disadvantage on the leaderboard (Section 2.2).
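To make concrete how unforgiving this is, here is a simplified sketch of the token-overlap F1 used in SQuAD-style evaluation (a condensed approximation of the official evaluation script, not a replacement for it):

import re
from collections import Counter

def normalize(text):
    """Lowercase, strip punctuation, and drop articles (SQuAD-style normalization)."""
    text = re.sub(r"[^a-z0-9 ]", " ", text.lower())
    return [t for t in text.split() if t not in {"a", "an", "the"}]

def token_f1(prediction, gold):
    """Token-overlap F1 between a predicted and a gold answer string."""
    pred, ref = normalize(prediction), normalize(gold)
    common = Counter(pred) & Counter(ref)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

print(token_f1("Samuel Clemens", "Mark Twain"))   # 0.0: the right person, zero credit
print(token_f1("the endoscope", "an endoscope"))  # 1.0 after normalization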

To illustrate how current evaluations fall short of meaningful discrimination, we qualitatively analyze two near-SOTA systems on SQuAD V1.1 on CodaLab: the original XLNet (Yang et al., 2019) and a subsequent iteration called XLNet-123. [10]

[8] https://www.naqt.com/rules/#protest
[9] http://www.j-archive.com/showgame.php?game_id=6112
[10] We could not find a paper describing XLNet-123 in detail; the submission is by http://tia.today.

Despite XLNet-123's margin of almost four absolute F1 points (94 vs. 98) on development data, a manual inspection of a sample of 100 of XLNet-123's wins indicates that around two-thirds are 'spurious': 56% are likely to be considered not only equally good but essentially identical; 7% are cases where the answer set omits a correct alternative; and 5% of cases are 'bad' questions. [11]

Our goal is not to dwell on the exact proportions, to minimize the achievements of these strong systems, or to minimize the usefulness of quantitative evaluations. We merely want to raise the limitation of 'blind automation' for distinguishing between systems on a leaderboard.

Taking our cue from the trivia community, here is an alternative for MRQA. Test sets are only created for a specific time; all systems are submitted simultaneously. Then, all questions and answers are revealed. System authors can protest correctness rulings on questions, directly addressing the issues above. After agreement is reached, quantitative metrics are computed for comparison purposes—despite their inherent limitations, they at least can be trusted. Adopting this for MRQA would require creating a new, smaller test set every year. However, this would gradually refine the annotations and process.

This suggestion is not novel: Voorhees and Tice (2000) conclude that automatic evaluations are sufficient "for experiments internal to an organization where the benefits of a reusable test collection are most significant (and the limitations are likely to be understood)" (our emphasis) but that "satisfactory techniques for [automatically] evaluating new runs" have not been found yet. We are not aware of any change on this front—if anything, we seem to have become more insensitive as a community to just how limited our current evaluations are.

3.3 Focus on the Bubble

While every question should be perfect, time and resources are limited. Thus, authors of tournaments have a policy of "focusing on the bubble", where the "bubble" comprises the questions most likely to discriminate between top teams.

For humans, authors and editors focus on the questions and clues that they predict will decide the tournament. These questions are thoroughly playtested, vetted, and edited. Only after these questions have been perfected will the other questions undergo the same level of polish.

[11] Examples in Appendix C.


For computers, the same logic applies. Authors should ensure that these questions are correct, free of ambiguity, and unimpeachable.

However, as far as we can tell, the authors of QA datasets do not give any special attention to these questions.

Unlike a human trivia tournament—with the finite patience of its participants—this does not mean that you should necessarily remove all of the easy or hard questions from your dataset; rather, spend more of your time/effort/resources on the bubble. You would not want to introduce a sampling bias that leads to inadvertently forgetting how to answer questions like "who is buried in Grant's tomb?" (Dwan, 2000, Chapter 7).

4 Why Quizbowl is the Gold Standard

We now narrow our thus far wide-ranging QA discussion to a specific format: Quizbowl, which has many of the desirable properties outlined above. We have no delusion that mainstream QA will universally adopt this format. However, given the community's emphasis on fair evaluation, computer QA can borrow aspects from the gold standard of human QA. We discuss what this may look like in Section 5, but first we describe the format itself.

We have shown several examples of Quizbowl questions, but we have not yet explained in detail how the format works; see Rodriguez et al. (2019) for a more comprehensive description. You might be scared off by how long the questions are. However, in real Quizbowl trivia tournaments, they are rarely read to completion: the questions are designed to be interrupted.

Interruptible A moderator reads a question. Once someone knows the answer, they use a signaling device to "buzz in". If the player who buzzed is right, they get points. Otherwise, they lose points and the question continues for the other team.

Not all trivia games with buzzers have this property, however. For example, take Jeopardy!, the subject of Watson's tour de force (Ferrucci et al., 2010). While Jeopardy! also uses signaling devices, these only work at the end of the question; Ken Jennings, one of the top Jeopardy! players (and also a Quizbowler), explains it in a Planet Money interview (Malone, 2019):

Jennings: The buzzer is not live until Alex finishes reading the question. And if you buzz in before your buzzer goes live, you actually lock yourself out for a fraction of a second. So the big mistake on the show is people who are all adrenalized and are buzzing too quickly, too eagerly.
Malone: OK. To some degree, Jeopardy! is kind of a video game, and a crappy video game where it's, like, light goes on, press button—that's it.
Jennings: (Laughter) Yeah.

Thus, Jeopardy!'s buzzers are a gimmick to ensure good television; Quizbowl buzzers, by contrast, discriminate knowledge (Section 2.3). Similarly, while TriviaQA (Joshi et al., 2017) is written by knowledgeable writers, the questions are not pyramidal.

Pyramidal Recall that an effective dataset (tournament) discriminates the best from the rest—the higher the proportion of effective questions ρ, the better. Quizbowl's ρ is nearly 1.0 because discrimination happens within a question: after every word, an answerer must decide whether they have enough information to answer the question. Quizbowl questions are arranged to be maximally pyramidal: they begin with hard clues—ones that require deep understanding—and move to accessible clues that are well known.
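For intuition, here is a minimal sketch of how an interruptible, pyramidal question can be scored word by word; the point values and the halfway "power" mark are illustrative simplifications, not an official rule set.

from typing import Callable, List, Optional, Tuple

def play_question(words: List[str], answer: str,
                  agent: Callable[[List[str]], Tuple[Optional[str], bool]]) -> int:
    """Score one pyramidal question for one agent: the agent sees a growing
    prefix of the question and returns (guess, buzz). A correct early buzz
    earns more than a late one; a wrong buzz is penalized and ends the
    question for this agent."""
    for i in range(1, len(words) + 1):
        guess, buzz = agent(words[:i])
        if not buzz:
            continue
        if guess is not None and guess.lower() == answer.lower():
            return 15 if i <= len(words) // 2 else 10   # early vs. late correct buzz
        return -5                                       # wrong buzz ("neg")
    return 0                                            # never buzzed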

Well-Edited Quizbowl questions are created in phases. First, the author selects the answer and assembles (pyramidal) clues. A subject editor then removes ambiguity, adjusts acceptable answers, and tweaks clues to optimize discrimination. Finally, a packetizer ensures the overall set is diverse, has uniform difficulty, and is without repeats.

Unnatural Trivia questions are fake: the asker already knows the answer. But they're no more fake than a course's final exam, which, like a leaderboard, is designed to test knowledge.

Experts know when questions are ambiguous; while "what play has a character whose father is dead" could be Hamlet, Antigone, or Proof, a good writer's knowledge avoids the ambiguity. When authors omit these cues, the question is derided as a "hose" (Eltinge, 2013), which robs the tournament of fun (Section 2.1).

One of the benefits of contrived formats is a focus on specific phenomena. Dua et al. (2019) exclude questions an existing MRQA system could answer to focus on challenging quantitative reasoning. One of the trivia experts consulted in Wallace et al. (2019) crafted a question that tripped up neural QA systems with "this author opens Crime and Punishment"; the top system confidently answers


Fyodor Dostoyevski. However, that phrase was embedded in a longer question, "The narrator in Cogwheels by this author opens Crime and Punishment to find it has become The Brothers Karamazov". Again, this shows the inventiveness and linguistic dexterity of the trivia community.

A counterargument is that when real humans ask questions—e.g., on Yahoo! Questions (Szpektor and Dror, 2013), Quora (Iyer et al., 2017), or web search (Kwiatkowski et al., 2019)—they ignore the craft of question writing. Real humans tend to react to unclear questions with confusion or divergent answers, often making explicit in their answers how they interpreted the original question ("I assume you meant. . . ").

Given that real-world applications will have to deal with the inherent noise and ambiguity of unclear questions, our systems must cope with them. However, coping with unclear questions is not the same as glossing over their complexities.

Complicated Quizbowl is more complex than other datasets. Unlike other datasets, where you just need to decide what to answer, you also need to choose when to answer the question. While this improves the dataset's discrimination, it can hurt popularity because you cannot copy/paste code from other QA tasks. The complicated pyramidal structure also makes some questions—e.g., what is log base four of sixty-four—difficult [12] to ask. However, the underlying mechanisms (e.g., reinforcement learning) share properties with other tasks, such as simultaneous translation (Grissom II et al., 2014; Ma et al., 2019), human incremental processing (Levy et al., 2008; Levy, 2011), and opponent modeling (He et al., 2016).

5 A Call to Action

You may disagree with the superiority of Quizbowl as a QA framework (even among trivia nerds, not all agree. . . de gustibus non est disputandum). In this final section, we hope to distill our advice into a call to action regardless of your question format of choice.

[12] But not always impossible, as IHSSBCA shows:
This is the smallest counting number which is the radius of a sphere whose volume is an integer multiple of π. It is also the number of distinct real solutions to the equation x^7 − 19x^5 = 0. This number also gives the ratio between the volumes of a cylinder and a cone with the same heights and radii. Give this number equal to the log base four of sixty-four.

Here are our recommendations if you want to have an effective leaderboard.

Talk to Trivia Nerds You should talk to trivia nerds because they have useful information (not just about the election of 1876). Trivia is not just the accumulation of information but also connecting disparate facts (Jennings, 2006). These skills are exactly those that we want computers to develop.

Trivia nerds are writing questions anyway; we can save money and time if we pool resources. Computer scientists benefit if the trivia community writes questions that aren't trivial for computers to solve (e.g., avoiding quotes and named entities). The trivia community benefits from tools that make their job easier: show related questions, link to Wikipedia, or predict where humans will answer.

Likewise, the broader public has unique knowledge and skills. In contrast to low-paid crowdworkers, public platforms for question answering and citizen science (Bowser et al., 2013) are brimming with free expertise if you can engage the relevant communities. For example, the Quora query "Is there a nuclear control room on nuclear aircraft carriers?" is purportedly answered by someone who worked in such a room (Humphries, 2017). As machine learning algorithms improve, the "good enough" crowdsourcing that got us this far may simply not be enough for continued progress.

Many question answering datasets benefit from the efforts of the trivia community. Ethically using the data requires acknowledging their contributions and using their input to create datasets (Jo and Gebru, 2020, Consent and Inclusivity).

Eat Your Own Dog Food As you develop new question answering tasks, you should feel comfortable playing the task as a human. Importantly, this is not just to replicate what crowdworkers are doing (also important) but to remove hidden assumptions, institute fair metrics, and define the task well. For this to feel real, you will need to keep score; have all of your coauthors participate and compare their scores.

Again, we emphasize that human and computer skills are not identical, but this is a benefit: humans' natural aversion to unfairness will help you create a better task, while computers will blindly optimize a broken objective function (Bostrom, 2003; ?). As you go through the process of playing on your question–answer dataset, you can see


where you might have fallen short on the goals we outline in Section 3.

Won't Somebody Look at the Data? After QA datasets are released, there should also be deeper, more frequent discussion of actual questions within the NLP community. Part of every post-mortem of a trivia tournament is a detailed discussion of the questions, where good questions are praised and bad questions are excoriated. This is not meant to shame the writers but rather to help build and reinforce cultural norms: questions should be well written, precise, and fulfill the creator's goals. Just like trivia tournaments, QA datasets resemble a product for sale. Creators want people to invest time and sometimes money (e.g., GPU hours) in using their data and submitting to their leaderboards. It is "good business" to build a reputation for quality questions and to discuss individual questions.

Similarly, discussing and comparing the actual predictions made by the competing systems should be part of any competition culture—without it, it is hard to tell what a couple of points on some leaderboard mean. To make this possible, we recommend that leaderboards include an easy way for anyone to download a system's development predictions for qualitative analyses.

Make Questions Discriminative We argue that questions should be discriminative (Section 2.3), and while Quizbowl is one solution (Section 4), not everyone is crazy enough to adopt this (beautiful) format. For more traditional QA tasks, you can maximize the usefulness of your dataset by ensuring as many questions as possible are challenging (but not impossible) for today's QA systems.

But you can use some Quizbowl intuitions to improve discrimination. In visual QA, you can offer increasing resolutions of the image. For other settings, create pyramidality by adding metadata: coreference, disambiguation, or alignment to a knowledge base. In short, consider multiple versions/views of your data that progress from difficult to easy. This not only makes more of your dataset discriminative but also reveals what makes a question answerable.
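As one hedged sketch of what such difficulty-ordered views could look like (our own illustration; the field names and the particular ordering are assumptions, not an existing dataset schema):

from typing import Callable, Dict, List, Optional

def build_views(example: Dict[str, str]) -> List[str]:
    """Difficulty-ordered views of one QA example, hardest first: the bare
    question, then the question plus linked-entity metadata, then plus the
    most relevant context sentence."""
    return [
        example["question"],
        example["question"] + " [entities: " + example["linked_entities"] + "]",
        example["question"] + " [context: " + example["best_sentence"] + "]",
    ]

def earliest_solvable(views: List[str], gold: str,
                      system: Callable[[str], str]) -> Optional[int]:
    """Index of the hardest view the system already answers correctly
    (0 = hardest); None if even the easiest view is not enough."""
    for i, view in enumerate(views):
        if system(view).strip().lower() == gold.strip().lower():
            return i
    return None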

Embrace Multiple Answers or Specify Specificity As QA moves to more complicated formats and answer candidates, what constitutes a correct answer becomes more complicated. Fully automatic evaluations are valuable for both training and quick-turnaround evaluation. In cases where annotators disagree, the question should explicitly state what level of specificity is required (e.g., September 1, 1939 vs. 1939, or Leninism vs. socialism). Or, if not all questions have a single answer, link answers to a knowledge base with multiple surface forms or explicitly enumerate which answers are acceptable.

Appreciate Ambiguity If your intended QA application has to handle ambiguous questions, do justice to the ambiguity by making it part of your task—for example, recognize the original ambiguity and resolve it ("did you mean. . . ") instead of giving credit for happening to 'fit the data'.

To ensure that our datasets properly "isolate the property that motivated it in the first place" (?), we need to explicitly appreciate the unavoidable ambiguity instead of silently glossing over it. [13]

This is already an active area of research, with conversational QA being a new setting actively explored by several datasets (Reddy et al., 2018; Choi et al., 2018), and other work explicitly focusing on identifying useful clarification questions (Rao and Daumé III, 2018), thematically linked questions (Elgohary et al., 2018), or resolving ambiguities that arise from coreference or pragmatic constraints by rewriting underspecified question strings in context (Elgohary et al., 2019).

Revel in Spectacle With more complicated systems and evaluations, however, a return to the yearly evaluations of TREC QA may be the best option. This improves not only the quality of evaluation (we can have real-time human judging) but also lets the test set reflect the build-it/break-it cycle (Ruef et al., 2016), as attempted by the 2019 iteration of FEVER. Moreover, another lesson the QA community could learn from trivia games is to turn evaluation into a spectacle: exciting games with a telegenic host. This has a benefit for the public, who see how QA systems fail on difficult questions, and for QA researchers, who have a spoonful of fun sugar to inspect their systems' output and their competitors'.

In between are automatic metrics that mimic the flexibility of human raters, inspired by machine translation evaluations (Papineni et al., 2002; Specia and Farzindar, 2010) or summarization (Lin, 2004). However, we should not forget that these metrics were introduced as 'understudies': good enough when quick evaluations are needed for system building but no substitute for a proper evaluation.

[13] Not surprisingly, 'inherent' ambiguity is not limited to QA; Pavlick and Kwiatkowski (2019) show natural language inference data have 'inherent disagreements' between humans and advocate for recovering the full range of accepted inferences.


In machine translation, Laubli et al. (2020) reveal that crowdworkers cannot spot the errors that neural MT systems make—fortunately, trivia nerds are cheaper than professional translators.

Be Honest in Crowning QA Champions While it is tempting—particularly for leaderboards—to turn everything into a single number, recognize that there are often different sub-tasks and types of players who deserve recognition. A simple model that requires less training data or runs in under ten milliseconds may be objectively more useful than a bloated, brittle monster of a system that has a slightly higher F1 (Dodge et al., 2019). While you may only rank by a single metric (this is what trivia tournaments do too), you may want to recognize the highest-scoring model that was built by undergrads, took no more than one second per example, was trained only on Wikipedia, etc.

Finally, if you want to make human–computer comparisons, pick the right humans. Paraphrasing a participant of the 2019 MRQA workshop (Fisch et al., 2019), a system better than the average human at brain surgery does not imply superhuman performance in brain surgery. Likewise, beating a distracted crowdworker on QA is not QA's endgame. If your task is realistic, fun, and challenging, you will find experts to play against your computer. Not only will this give you human baselines worth reporting—they can also tell you how to fix your QA dataset. . . after all, they've been at it longer than you have.

Acknowledgements Many thanks to Massimiliano Ciaramita, Jon Clark, Christian Buck, Emily Pitler, and Michael Collins for insightful discussions that helped frame these ideas. Thanks to Kevin Kwok for permission to use the Protobowl screenshot and information.

ReferencesLuis von Ahn. 2006. Games with a purpose. Computer,

39:92 – 94.

David Baber. 2015. Television Game Show Hosts: Bi-ographies of 32 Stars. McFarland.

Ondrej Bojar, Christian Buck, Christian Federmann,Barry Haddow, Philipp Koehn, Johannes Leveling,

Christof Monz, Pavel Pecina, Matt Post, HerveSaint-Amand, Radu Soricut, Lucia Specia, and AlešTamchyna. 2014. Findings of the 2014 workshop onstatistical machine translation. In Proceedings of theNinth Workshop on Statistical Machine Translation,pages 12–58, Baltimore, Maryland, USA. Associa-tion for Computational Linguistics.

Nick Bostrom. 2003. Ethical issues in advanced arti-ficial intelligence. Institute of Advanced Studies inSystems Research and Cybernetics, 2:12–17.

Anne Bowser, Derek Hansen, Yurong He, CarolBoston, Matthew Reid, Logan Gunnell, and JenniferPreece. 2013. Using gamification to inspire new cit-izen science volunteers. In Proceedings of the FirstInternational Conference on Gameful Design, Re-search, and Applications, Gamification ’13, pages18–25, New York, NY, USA. ACM.

Jordan Boyd-Graber, Brianna Satinoff, He He, andHal Daume III. 2012. Besting the quiz master:Crowdsourcing incremental classification games. InProceedings of Empirical Methods in Natural Lan-guage Processing.

Seth Chaiklin. 2003. The Zone of Proximal Devel-opment in Vygotsky’s Analysis of Learning and In-struction, Learning in Doing: Social, Cognitiveand Computational Perspectives, page 39–64. Cam-bridge University Press.

Eunsol Choi, He He, Mohit Iyyer, Mark Yatskar, Wen-tau Yih, Yejin Choi, Percy Liang, and Luke Zettle-moyer. 2018. QuAC: Question answering in con-text. In Proceedings of Empirical Methods in Nat-ural Language Processing.

Anthony Cuthbertson. 2018. Robots can now read bet-ter than humans, putting millions of jobs at risk.

Paul Diamond. 2009. How To Make 100 Pounds ANight (Or More) As A Pub Quizmaster. DP Quiz.

Jesse Dodge, Suchin Gururangan, Dallas Card, RoySchwartz, and Noah A. Smith. 2019. Show yourwork: Improved reporting of experimental results.In Proceedings of the 2019 Conference on EmpiricalMethods in Natural Language Processing and the9th International Joint Conference on Natural Lan-guage Processing (EMNLP-IJCNLP), pages 2185–2194, Hong Kong, China. Association for Computa-tional Linguistics.

Dheeru Dua, Yizhong Wang, Pradeep Dasigi, GabrielStanovsky, Sameer Singh, and Matt Gardner. 2019.DROP: A reading comprehension benchmark requir-ing discrete reasoning over paragraphs. In Confer-ence of the North American Chapter of the Asso-ciation for Computational Linguistics, pages 2368–2378, Minneapolis, Minnesota.

Matthew Dunn, Levent Sagun, Mike Higgins,V. Ugur Güney, Volkan Cirik, and KyunghyunCho. 2017. SearchQA: A new Q&A dataset aug-mented with context from a search engine. CoRR,abs/1704.05179.

Page 11: What Question Answering can Learn from Trivia Nerds · 2020-04-23 · What Question Answering can Learn from Trivia Nerds iSchool, CS, UMIACS, LSC University of Maryland jbg@umiacs.umd.edu

R. Dwan. 2000. As Long as They’re Laughing: Grou-cho Marx and You Bet Your Life. Midnight MarqueePress.

Ahmed Elgohary, Denis Peskov, and Jordan Boyd-Graber. 2019. Can you unpack that? learning torewrite questions-in-context. In Empirical Methodsin Natural Language Processing.

Ahmed Elgohary, Chen Zhao, and Jordan Boyd-Graber.2018. Dataset and baselines for sequential open-domain question answering. In Empirical Methodsin Natural Language Processing.

Stephen Eltinge. 2013. Quizbowl lexicon.

Manaal Faruqui and Dipanjan Das. 2018. Identifyingwell-formed natural language questions. In Proceed-ings of the 2018 Conference on Empirical Methodsin Natural Language Processing, pages 798–803,Brussels, Belgium. Association for ComputationalLinguistics.

Shi Feng and Jordan Boyd-Graber. 2019. What ai cando for me: Evaluating machine learning interpreta-tions in cooperative play. In International Confer-ence on Intelligent User Interfaces.

David Ferrucci, Eric Brown, Jennifer Chu-Carroll,James Fan, David Gondek, Aditya A. Kalyanpur,Adam Lally, J. William Murdock, Eric Nyberg,John Prager, Nico Schlaefer, and Chris Welty. 2010.Building Watson: An Overview of the DeepQAProject. AI Magazine, 31(3).

Adam Fisch, Alon Talmor, Robin Jia, Minjoon Seo, Eu-nsol Choi, and Danqi Chen, editors. 2019. Proceed-ings of the 2nd Workshop on Machine Reading forQuestion Answering. Association for ComputationalLinguistics, Hong Kong, China.

Morris Freedman. 1997. The fall of Charlie Van Doren.The Virginia Quarterly Review, 73(1):157–165.

Mor Geva, Yoav Goldberg, and Jonathan Berant. 2019.Are we modeling the task or the annotator? an inves-tigation of annotator bias in natural language under-standing datasets. In Proceedings of the 2019 Con-ference on Empirical Methods in Natural LanguageProcessing and the 9th International Joint Confer-ence on Natural Language Processing (EMNLP-IJCNLP), pages 1161–1166, Hong Kong, China. As-sociation for Computational Linguistics.

Alvin Grissom II, He He, Jordan Boyd-Graber, andJohn Morgan. 2014. Don’t until the final verb wait:Reinforcement learning for simultaneous machinetranslation. In Proceedings of Empirical Methodsin Natural Language Processing.

Robert X. D. Hawkins, Andreas Stuhlmüller, Judith De-gen, and Noah D. Goodman. 2015. Why do youask? good questions provoke informative answers.In Proceedings of the 37th Annual Meeting of theCognitive Science Society.

He He, Jordan Boyd-Graber, Kevin Kwok, and HalDaumé III. 2016. Opponent modeling in deep re-inforcement learning. In Proceedings of the Interna-tional Conference of Machine Learning.

R. Robert Henzel. 2018. Naqt eligibility overview.

Karl Moritz Hermann, Tomas Kocisky, Edward Grefen-stette, Lasse Espeholt, Will Kay, Mustafa Suleyman,and Phil Blunsom. 2015. Teaching machines to readand comprehend. In Proceedings of Advances inNeural Information Processing Systems.

Bryan Humphries. 2017. Is there a nuclear controlroom on nuclear aircraft carriers?

Shankar Iyer, Nikhil Dandekar, , and Kornél Csernai.2017. Quora question pairs.

Ken Jennings. 2006. Brainiac: adventures in the cu-rious, competitive, compulsive world of trivia buffs.Villard.

Eun Seo Jo and Timnit Gebru. 2020. Lessons fromarchives: Strategies for collecting sociocultural datain machine learning. In Proceedings of the Confer-ence on Fairness, Accountability, and Transparency.

Mandar Joshi, Eunsol Choi, Daniel Weld, and LukeZettlemoyer. 2017. TriviaQA: A large scale dis-tantly supervised challenge dataset for reading com-prehension. In Proceedings of the 55th Annual Meet-ing of the Association for Computational Linguistics(Volume 1: Long Papers), volume 1, pages 1601–1611.

Tom Kwiatkowski, Jennimaria Palomaki, Olivia Red-field, Michael Collins, Ankur Parikh, Chris Alberti,Danielle Epstein, Illia Polosukhin, Matthew Kelcey,Jacob Devlin, Kenton Lee, Kristina N. Toutanova,Llion Jones, Ming-Wei Chang, Andrew Dai, JakobUszkoreit, Quoc Le, and Slav Petrov. 2019. Natu-ral questions: a benchmark for question answeringresearch. Transactions of the Association of Compu-tational Linguistics.

Samuel Laubli, Sheila Castilho, Graham Neubig,Rico Sennrich, Qinlan Shen, and Antonio Toral.2020. A set of recommendations for assessing hu-man–machine parity in language translation. Jour-nal of Artificial Intelligence Research, 67.

Kenton Lee, Ming-Wei Chang, and Kristina Toutanova.2019. Latent retrieval for weakly supervised opendomain question answering. In Proceedings of theAssociation for Computational Linguistics.

Roger Levy. 2011. Integrating surprisal and uncertain-input models in online sentence comprehension: for-mal techniques and empirical results. In Proceed-ings of the Association for Computational Linguis-tics, pages 1055–1065.

Roger P. Levy, Florencia Reali, and Thomas L. Grif-fiths. 2008. Modeling the effects of memory on hu-man online sentence processing with particle filters.

Page 12: What Question Answering can Learn from Trivia Nerds · 2020-04-23 · What Question Answering can Learn from Trivia Nerds iSchool, CS, UMIACS, LSC University of Maryland jbg@umiacs.umd.edu

In Proceedings of Advances in Neural InformationProcessing Systems, pages 937–944.

Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Proceedings of the Workshop on Text Summarization Branches Out (WAS 2004).

Mingbo Ma, Liang Huang, Hao Xiong, Renjie Zheng, Kaibo Liu, Baigong Zheng, Chuanqiang Zhang, Zhongjun He, Hairong Liu, Xing Li, Hua Wu, and Haifeng Wang. 2019. STACL: Simultaneous translation with implicit anticipation and controllable latency using prefix-to-prefix framework. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3025–3036, Florence, Italy. Association for Computational Linguistics.

Kenny Malone. 2019. How uncle Jamie broke Jeopardy.

Adam Najberg. 2018. Alibaba AI model tops humans in reading comprehension.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the Association for Computational Linguistics.

Ellie Pavlick and Tom Kwiatkowski. 2019. Inherent disagreements in human textual inferences. Transactions of the Association for Computational Linguistics, 7:677–694.

Pranav Rajpurkar, Robin Jia, and Percy Liang. 2018. Know what you don't know: Unanswerable questions for SQuAD. In Proceedings of the Association for Computational Linguistics.

Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of Empirical Methods in Natural Language Processing.

Sudha Rao and Hal Daumé III. 2018. Learning to ask good questions: Ranking clarification questions using neural expected value of perfect information. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2737–2746, Melbourne, Australia. Association for Computational Linguistics.

Siva Reddy, Danqi Chen, and Christopher D. Manning. 2018. CoQA: A conversational question answering challenge. Transactions of the Association for Computational Linguistics, 7:249–266.

Pedro Rodriguez, Shi Feng, Mohit Iyyer, He He, and Jordan L. Boyd-Graber. 2019. Quizbowl: The case for incremental question answering. CoRR, abs/1904.04792.

Marc-Antoine Rondeau and T. J. Hazen. 2018. Systematic error analysis of the Stanford question answering dataset. In Proceedings of the Workshop on Machine Reading for Question Answering, pages 12–20, Melbourne, Australia. Association for Computational Linguistics.

Andrew Ruef, Michael Hicks, James Parker, Dave Levin, Michelle L. Mazurek, and Piotr Mardziel. 2016. Build it, break it, fix it: Contesting secure development. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, CCS '16, pages 690–703, New York, NY, USA. ACM.

Dan Russell. 2020. Why control-F is the single most important thing you can teach someone about search. SearchReSearch.

Lucia Specia and Atefeh Farzindar. 2010. Estimating machine translation post-editing effort with HTER. In AMTA 2010 Workshop: Bringing MT to the User: MT Research and the Translation Industry. The 9th Conference of the Association for Machine Translation in the Americas.

Saku Sugawara, Kentaro Inui, Satoshi Sekine, and Akiko Aizawa. 2018. What makes reading comprehension questions easier? In Proceedings of Empirical Methods in Natural Language Processing.

Saku Sugawara, Pontus Stenetorp, Kentaro Inui, and Akiko Aizawa. 2020. Assessing the benchmarking capacity of machine reading comprehension datasets. In Association for the Advancement of Artificial Intelligence.

Idan Szpektor and Gideon Dror. 2013. From query to question in one click: Suggesting synthetic questions to searchers. In Proceedings of WWW 2013.

David Taylor, Colin McNulty, and Jo Meek. 2012. Your starter for ten: 50 years of University Challenge.

James Thorne, Andreas Vlachos, Oana Cocarascu, Christos Christodoulopoulos, and Arpit Mittal, editors. 2018. Proceedings of the First Workshop on Fact Extraction and VERification (FEVER). Association for Computational Linguistics, Brussels, Belgium.

Ellen M. Voorhees. 2003. Evaluating the evaluation: A case study using the TREC 2002 question answering track. In Proceedings of the 2003 Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics, pages 260–267.

Ellen M. Voorhees. 2008. Evaluating Question Answering System Performance, pages 409–430. Springer Netherlands, Dordrecht.

Ellen M. Voorhees and Dawn M. Tice. 2000. Building a question answering test collection. In Proceedings of the 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 200–207. ACM.


Eric Wallace, Pedro Rodriguez, Shi Feng, Ikuya Yamada, and Jordan Boyd-Graber. 2019. Trick me if you can: Human-in-the-loop generation of adversarial question answering examples. Transactions of the Association of Computational Linguistics, 10.

Dirk Weissenborn, Georg Wiese, and Laura Seiffe. 2017. Making neural QA as simple as possible but not simpler. In Conference on Computational Natural Language Learning.

Zhilin Yang, Zihang Dai, Yiming Yang, Jaime G. Carbonell, Ruslan Salakhutdinov, and Quoc V. Le. 2019. XLNet: Generalized autoregressive pretraining for language understanding. CoRR, abs/1906.08237.

Wen-tau Yih, Ming-Wei Chang, Xiaodong He, and Jianfeng Gao. 2015. Semantic parsing via staged query graph generation: Question answering with knowledge base. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (ACL 2015), pages 1321–1331. ACL.

Xingdi Yuan, Jie Fu, Marc-Alexandre Cote, Yi Tay, Christopher Pal, and Adam Trischler. 2019. Interactive machine comprehension with information seeking agents.


Appendix

Figure, footnote, and table numbers continue from the main article.

A An Abridged History of Modern Trivia

In the United States, modern trivia exploded immediately after World War II via countless game shows including College Bowl (Baber, 2015), the precursor to Quizbowl. The craze spread to the United Kingdom in a bootlegged version of Quizbowl called University Challenge (now licensed by ITV) and pub quizzes (Taylor et al., 2012).

The initial explosion, however, was not without controversy. A string of cheating scandals, most notably the Van Doren scandal (Freedman, 1997; the subject of the film Quiz Show), and the 1977 entry of Quizbowl into intercollegiate competition forced trivia to "grow up". Professional organizations and more "grownup" game shows like Jeopardy! (the "all responses in the form of a question" gimmick grew out of how some game shows gave contestants the answers) helped create formalized structures for trivia.

As the generation that grew up with formalized trivia reached adulthood, they sought to make the outcomes of trivia competitions more rigorous, eschewing the randomness that makes for good television. Organizations like National Academic Quiz Tournaments and the Academic Competition Federation created routes for the best players to help direct how trivia competitions would be run. As of 2019, these organizations have institutionalized the best practices of "good trivia" described here.

B Simulating the Test Set Size Needed

We simulate a head-to-head trivia competition where System A and System B each have an accuracy (probability of getting a question right), separated by some difference aA − aB ≡ ∆. We then simulate this on a test set of size N, scaled by the effective dataset proportion ρ, via draws from two Binomial distributions with success probabilities aA and aB, respectively:

Ra ∼ Binomial(ρN, aA),    Rb ∼ Binomial(ρN, aB)    (1)

and see the minimum number of test set questions (using an experiment size of 5,000) needed to detect the better system 95% of the time (i.e., the minimum N such that Ra > Rb from Equation 1 in 0.95 of the experiments). Our emphasis, however, is on ρ: the smaller the percentage of discriminative questions (either because of difficulty or because of annotation error), the larger your test set must be.14
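The following Python sketch implements this simulation under the assumptions above; the function name, the grid of candidate test set sizes, and the accuracies in the usage line are illustrative choices of ours, not values from the experiments.

```python
import numpy as np

def min_test_set_size(a_A, a_B, rho, n_experiments=5000, confidence=0.95,
                      candidate_sizes=range(100, 50001, 100)):
    """Smallest N such that the better system (accuracy a_A > a_B) wins a
    head-to-head comparison on rho * N discriminative questions in at
    least `confidence` of the simulated experiments (cf. Equation 1)."""
    rng = np.random.default_rng(0)
    for n in candidate_sizes:
        effective_n = int(rho * n)  # only discriminative questions count
        r_a = rng.binomial(effective_n, a_A, size=n_experiments)
        r_b = rng.binomial(effective_n, a_B, size=n_experiments)
        if np.mean(r_a > r_b) >= confidence:
            return n
    return None  # no candidate size on the grid was large enough

# Illustrative values: a two-point accuracy gap, half the questions discriminative.
print(min_test_set_size(a_A=0.82, a_B=0.80, rho=0.5))
```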

C Qualitative Analysis Examples

We provide some concrete examples for the classes into which we classified the XLNet-123 wins over XLNet. We indicate gold answer spans (provided by the human annotators) by underlining (there may be several), the XLNet answer span by bold face, and the XLNet-123 answer span by italics, combining formatting for tokens shared between spans as appropriate.

C.1 Insignificant and Significant Span Differences

QUESTION: What type of vote must the Parliament have to either block or suggest changes to the Commission's proposals?
CONTEXT: [. . . ] The essence is there are three readings, starting with a Commission proposal, where the Parliament must vote by a majority of all MEPs (not just those present) to block or suggest changes, [. . . ]

Here, "a majority of all MEPs" is as good an answer as "majority", yet its Exact Match score is 0. Note that the problem is not merely one of picking a soft metric; even its Token-F1 score is merely 0.4, effectively penalizing a system for giving a more complete answer. The limitations of Token-F1 become even clearer in light of the following significant span difference:

QUESTION: What measure of a computational problem broadly defines the inherent difficulty of the solution?
CONTEXT: A problem is regarded as inherently difficult if its solution requires significant resources, whatever the algorithm used. [. . . ]

We agree with the automatic evaluation that a system answering "significant resources" to this question should not be given full credit (and should possibly be given none), as it fails to mention relevant context. Nevertheless, the Token-F1 of this answer is 0.57, i.e., larger than for the insignificant span difference just discussed.
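For reference, the scores discussed above can be reproduced with a minimal re-implementation of Exact Match and Token-F1 in the style of the SQuAD evaluation script (lowercasing, stripping punctuation and English articles before comparing token bags); the helper names here are ours.

```python
import re
import string
from collections import Counter

def normalize(text):
    """Lowercase, strip punctuation, drop English articles, squash whitespace."""
    text = "".join(ch for ch in text.lower() if ch not in set(string.punctuation))
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction, gold):
    return float(normalize(prediction) == normalize(gold))

def token_f1(prediction, gold):
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    num_same = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

# The "insignificant span difference" above: a fuller answer is punished.
print(exact_match("a majority of all MEPs", "majority"))  # 0.0
print(token_f1("a majority of all MEPs", "majority"))     # 0.4
```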

14 Disclaimer: This should be only one of many considerations in deciding on the size of your test set. Other factors may include balancing for demographic properties, covering linguistic variation, or capturing task-specific phenomena.


C.2 Missing Gold Answers

We also observed 7 (out of 100) cases of missing gold answers. As an example, consider

QUESTION: What would someone who is civilly disobedient do in court?
CONTEXT: Steven Barkan writes that if defendants plead not guilty, "they must decide whether their primary goal will be to win an acquittal and avoid imprisonment or a fine, or to use the proceedings as a forum to inform the jury and the public of the political circumstances surrounding the case and their reasons for breaking the law via civil disobedience." [. . . ] In countries such as the United States whose laws guarantee the right to a jury trial but do not excuse lawbreaking for political purposes, some civil disobedients seek jury nullification.

While annotators did mark two distinct spans as gold answers, they did not mark up "jury nullification", which is a fine answer to the question and should be rewarded. Any individual case can be debated as to whether this truly is a missing answer or whether the question rules it out due to some specific subtlety in its phrasing. This is precisely the point: relying on a pre-collected set of static answer strings, without a process for adjudicating disagreements in official comparisons, does not do justice to the complexity of question answering.

C.3 Bad Questions

We also observed 5 cases of genuinely bad questions. Consider

QUESTION: What library contains the Selmur Productions catalogue?
CONTEXT: Also part of the library is the aforementioned Selznick library, the Cinerama Productions/Palomar theatrical library and the Selmur Productions catalog that the network acquired some years back [. . . ]

This is simply an annotation error: the correct answer to the question is not available from the paragraph and would have to be (the American Broadcasting Company's) Programming Library. While we have to live with annotation errors as part of reality, it is not clear that we ought to simply accept them for official evaluations; any human taking a closer look at the paragraph, as part of an adjudication process, would at least note that the question is problematic.

Other cases of 'annotation' error are more subtle, involving meaning-changing typos, for example:

QUESTION: Which French kind [sic] issued this declaration?

CONTEXT: They retained the religious provisions of the Edict of Nantes until the rule of Louis XIV, who progressively increased persecution of them until he issued the Edict of Fontainebleau (1685), which abolished all legal recognition of Protestantism in France[.] [. . . ]

While one could debate whether systems ought to be able to make 'charitable' reinterpretations of the question text, this is part of the point: cases like these warrant discussion and should not be silently glossed over when 'computing the score'.