
Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pages 1319–1330, August 1–6, 2021. ©2021 Association for Computational Linguistics


GCRC: A New MRC Dataset from Gaokao Chinese for Explainable Evaluation

Hongye Tan1*, Xiaoyue Wang1*, Yu Ji1, Ru Li1†, Xiaoli Li2, Zhiwei Hu1, Yunxiao Zhao1 and Xiaoqi Han1

1 School of Computer and Information Technology, Shanxi University, Taiyuan, China
2 Institute for Infocomm Research, A*Star, Singapore

{wangxy0808, jiyu0515, yunxiaomr, xiaoqisev}@163.com, {liru, tanhongye}@sxu.edu.cn

xlli@ntu.edu.sg, zhiweihu@whu.edu.cn

Abstract

Recently, driven by numerous publicly available machine reading comprehension (MRC) datasets, MRC systems have made some progress. These datasets, however, have two major limitations: 1) the defined tasks are relatively simple, and 2) they do not provide explainable evaluation, which is critical to objectively and comprehensively review the reasoning capabilities of current MRC systems. In this paper, we propose GCRC, a new dataset with challenging and high-quality multi-choice questions, collected from Gaokao Chinese (the Chinese subject of the National College Entrance Examination of China). We have manually labelled three types of evidence to evaluate MRC systems' reasoning process: 1) sentence-level supporting facts in an article required for answering a given question, 2) error reasons of a distractor (i.e., an incorrect option), explaining why a distractor should be eliminated, which is an important reasoning step for multi-choice questions, and 3) types of reasoning skills required for answering questions. Extensive experiments show that our proposed dataset is more challenging and very useful for identifying the limitations of existing MRC systems in an explainable way, facilitating researchers to develop novel machine learning and reasoning approaches to tackle this challenging research problem.1

* These authors contributed equally.
† Corresponding author.
1 Resources will be available through https://github.com/SXUNLP/GCRC

1 Introduction

Machine Reading Comprehension (MRC) is a critical task in many real-world applications, which requires machines to understand a text passage and answer relevant questions. It evaluates machines' understanding and reasoning capabilities on the underlying natural language text. Numerous MRC datasets have been proposed and have facilitated the progress of MRC systems, which have achieved near-human performance on some datasets. However, this does not indicate that MRC systems possess human-like language understanding and reasoning capabilities.

One reason is that questions in current datasets are not challenging enough, leading to most of them getting "solved" very soon. Sugawara et al. (2018) demonstrate that in many MRC datasets, a considerable number of easy questions can be answered based on the first few tokens of the questions or on word matching, without complex reasoning capabilities. The other reason is that most datasets only provide a black-box, overall evaluation of the accuracy of predicted answers, which does not provide explainable evaluation of a system's internal reasoning capabilities. In other words, it is unable to explain whether a system gets a correct answer via the right reasoning process, and it is not enough to identify the specific limitations of a system.

Recently, in order to address the problems mentioned above, some datasets have been proposed that focus on reasoning and provide additional information to evaluate the internal reasoning steps of a system. For example, MultiRC (Khashabi et al., 2018) and HotpotQA (Yang et al., 2018) include questions requiring multi-sentence or multi-hop reasoning. Moreover, MultiRC and HotpotQA both provide sentence-level supporting facts (SFs), which can be used as a kind of internal explanation for the answers, although locating SFs is just the first step for question answering (QA), as many questions need to integrate the SF information with reasoning. R4C (Inoue et al., 2020) and 2WikiMultiHopQA (Ho et al., 2020) introduce a chain of facts (reflecting entity relationships) as the derivation steps to evaluate the internal reasoning steps of systems.

Figure 1: An annotated instance in GCRC, taken from the real Gaokao Chinese exam in 2019 (the marked option is the correct one). The sentences marked in yellow/green are respectively the SFs for Options A and C. As mentioned in Option B, the SFs of Option B are the first paragraph; similarly, the SFs of Option D are the whole article. The contents marked in blue are the required reasoning skills for the options and the ERs of distractors.

However, these kinds of reasoning are quite limited and not enough to answer complex and comprehensive questions. For example, Figure 1 shows a multi-choice question where Option D, "This article reflects the concerns about biodiversity crisis and proposes countermeasures", is a summarization of the whole given article, and its judgement cannot be made by simply reasoning over some entity relationships. Instead, reasoning over the full information across the article is needed.

To address the above real challenges, we propose GCRC, a new challenging dataset with 8719 multi-choice questions collected from reading comprehension (RC) tasks of Gaokao Chinese (short for the Chinese subject of the National College Entrance Examination of China). GCRC is of high quality and high difficulty level, because Gaokao Chinese examinations are designed by educational experts, with high-level comprehensive questions and complicated articles, aiming to test the language comprehension of adult examinees.

In order to provide explainable evaluation, instances (e.g., Figure 1) in the dev and test sets of GCRC (including 1725 questions and 6900 options, as presented in Table 2 and Section 4) are annotated with three kinds of information: (1) sentence-level supporting facts (SFs), which serve as the basic evaluation of a system's internal reasoning; (2) error reasons (ERs) of a distractor (i.e., an incorrect option), explaining why a distractor should be eliminated. In Gaokao Chinese, distractors are often designed in a very confusing way and look like a correct option although they are incorrect. Identifying the semantic difference between a distractor and the given article, and knowing exactly why a distractor is wrong, could help systems or examinees to choose the correct answers. Therefore, we spend considerable effort to thoroughly understand the articles and manually label seven types of ERs of a distractor as a form of internal reasoning step for multi-choice questions; and (3) types of reasoning skills required for answering questions in Gaokao Chinese. We introduce eight typical reasoning skills in GCRC, which enable us to evaluate whether a system has the corresponding reasoning capability at the individual question level.

Our main contributions can be summarized as follows:

• We construct a new challenging dataset, GCRC, collected from Gaokao Chinese and consisting of 8719 multiple-choice questions, which will be released on Github in the future.

• Three kinds of critical information are manually annotated for explainable evaluation. To the best of our knowledge, GCRC is the first Chinese MRC dataset to provide the most comprehensive, challenging and high-quality articles and questions for deeper explainable evaluation of different MRC systems' performance. In particular, error reasons of a distractor are introduced for the first time in this area as an important reasoning step for complex multi-choice questions.

• Extensive experiments show that GCRC is not only a more challenging benchmark that facilitates researchers to develop novel models, but also helps us identify the limitations of existing MRC systems in an explainable way.

2 Related Work

2.1 Datasets from standard tests

There exist some datasets collected from standard exams/tests, including English datasets RACE (Lai et al., 2017), DREAM (Sun et al., 2019), ARC (Clark et al., 2018), ReClor (Yu et al., 2020), etc., and Chinese datasets C3 (Sun et al., 2020), MCQA (Guo et al., 2017), GeoSQA (Huang et al., 2019) and MedQA (Zhang et al., 2018), etc.

Some of these datasets are of specific domains. For example, ARC is of the science domain, and its questions come from American science exams from 3rd to 9th grade. ReClor targets logical reasoning, and its questions are collected from the Law School Admission Council. MCQA and GeoSQA are extracted from Gaokao History and Gaokao Geography, respectively. Other datasets cover generic topics. For example, RACE and DREAM are both collected from English exams for Chinese students, while C3 is collected from Chinese exams for Chinese-as-a-second-language learners.

GCRC is more similar to C3, RACE and DREAM, as their questions are all generic ones. However, their question difficulty levels are different, because questions in GCRC target adult native speakers, while questions in the other datasets target second-language learners. As such, GCRC is much more challenging than the other datasets. Additionally, GCRC provides three kinds of rich information for deeper explainable evaluation.

2.2 Datasets with explanations

Some datasets provide explanation information. Wiegreffe and Marasovic (2021) identified three types of explanations: highlights, free-text and structured explanations. Highlights are subsets of the input elements (words, phrases, snippets or full sentences) that explain the prediction. Free-text explanations are textual or natural language explanations containing information beyond the given input. Structured explanations have various forms for specific datasets; one of the most common forms is a chain of facts as the derivation of multi-hop reasoning for an answer.

Inoue et al. (2020) classified explanations into two types: justification explanations (collections of SFs for a decision) and introspective explanations (a derivation for a decision), which correspond respectively to highlights and structured explanations.

For the MRC task, a few datasets come with explanation information. For example, MultiRC (Khashabi et al., 2018) and HotpotQA (Yang et al., 2018) provide sentence-level SFs, belonging to justification explanations, to evaluate a system's ability to identify the SFs in an article that are related to a question/option. R4C (Inoue et al., 2020) and 2WikiMultiHopQA (Ho et al., 2020) provide both justification and introspective explanations. In both datasets, introspective explanations are a set of factual entity relationships. Their difference is that the explanation in R4C is of semi-structured form, while the explanation in 2WikiMultiHopQA is of structured data form.

The dataset C3 provides the types of required prior knowledge for 600 questions, and its goal is to study how to leverage various knowledge to improve a system's ability for text comprehension.

Inspired by these datasets, we provide richer explanations in GCRC. Different from existing work, besides SFs we provide two additional kinds of information. Specifically, we propose innovative ERs of a distractor as a reasoning step for multi-choice questions. In addition, we annotate the required reasoning skills for each instance, which enables us to clearly identify the limitations of existing MRC systems. Note that the prior knowledge annotated in C3 is different from the reasoning skills in GCRC, as prior knowledge in C3 mainly includes linguistic, domain-specific and general world knowledge, while reasoning skills in GCRC focus on the abilities of making an inference. Generally, reasoning needs to integrate prior knowledge and the information in the given text. As far as we know, GCRC is the first Chinese MRC dataset with rich explainable information for evaluation purposes.

3 Dataset Overview

Questions in GCRC are of multi-choice style. Specifically, given an article D, a question Q and its options O, the evaluation based on GCRC measures a system from the following aspects:

• QA accuracy. It is a common evaluation metric also provided by other QA datasets.

• The performance of locating the SFs in D, which is a basic evaluation metric to assess whether an MRC system can collect all necessary sentences from the article before reasoning.

• The performance of identifying the ERs of a distractor, which evaluates whether a system is able to exclude incorrect options by deep reasoning for multi-choice questions.

• The performance on the different reasoning skills required by questions. This evaluation shows the limitations of an MRC system's reasoning skills.

Next, we define the ERs of a distractor and the reasoning skills required for choosing a correct option in Section 3.1 and Section 3.2, respectively. The ERs of a distractor are concrete and focus on the forms of the errors, while reasoning skills are more abstract, referring to the abilities of making an inference. A sketch of how one annotated instance could be represented is given below.
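To make the annotation scheme concrete, the following is a minimal sketch of how one annotated GCRC instance might be represented in memory. It is illustrative only: the field names (article, question, options, answer, supporting_facts, error_reasons, reasoning_skills) and the layout are our assumptions, not the released file format.

```python
# Hypothetical in-memory form of one annotated GCRC instance.
# Field names and layout are illustrative; the released data may differ.
example_instance = {
    "article": ["sentence 0 ...", "sentence 1 ...", "..."],  # article split into sentences
    "question": "Which of the following statements is inconsistent with the article?",
    "options": {"A": "...", "B": "...", "C": "...", "D": "..."},
    "answer": "D",
    # Sentence-level supporting facts: per option, the indices of the article
    # sentences needed to judge that option.
    "supporting_facts": {"A": [2, 5], "B": [0, 1, 2], "C": [7], "D": [0, 1, 2, 3, 4, 5, 6, 7]},
    # Error reasons are annotated only for options whose content is wrong.
    "error_reasons": {"D": "irrelevant to the article"},
    # One of the eight reasoning skills is annotated for each option.
    "reasoning_skills": {
        "A": "detail understanding",
        "B": "appreciative analysis",
        "C": "cause-effect comprehension",
        "D": "inductive reasoning",
    },
}
```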

3.1 Error reasons of a distractor

As mentioned in Section 1, knowing exactly why a distractor is wrong will help MRC systems or an examinee select correct answers. Therefore, we introduce the error reasons of a distractor as an important internal reasoning step, and present seven typical ERs of a distractor after investigating the instances in GCRC.

Wrong details. The distractor is lexically similar to the original text but has a different meaning, caused by alterations of some words such as modifiers or qualifier words.

Wrong temporal properties. The distractor describes an event with wrong temporal properties. Generally, we consider the five temporal properties defined in (Zhou et al., 2019): duration, temporal ordering, typical time, frequency and stationarity.

Wrong subject-predicate-object triple relationship. The distractor contains a subject-predicate-object (sub-pred-obj) triple, but the relationship conflicts with the ground-truth triple in the given article, caused by substituting one of the components in the triple. For example, in Figure 1, Option A, "The global ecosystem, which is deeply influenced by human beings, helps alleviate the biodiversity crisis", has the sub-pred-obj triple "the global ecosystem—alleviate—the biodiversity crisis". However, from the given article, the correct triple is "the global ecosystem—result in—the biodiversity crisis", which makes the distractor wrong.

Wrong necessary and sufficient conditions. The distractor intentionally misinterprets the necessary and sufficient conditions expressed in the original article. A necessary condition is a condition that must be present for an event to occur, while a sufficient condition is a condition or set of conditions that will produce the event. For example, the statement "Plants will grow as long as there is air" is wrong, as air is a necessary condition for plant growth, but besides air, plants also need water and sunshine to grow.

Wrong causality. The distractor reverses/confuses the causes and effects mentioned in the original article, or adds non-existent causality. Generally, the cause-effect relationship is explicitly expressed with causal connectives in distractors.

Irrelevant to the question. The distractor's contents are indeed mentioned in the given article but are irrelevant to the current question.

Irrelevant to the article. The distractor's contents are not mentioned in the original article. For example, in Figure 1, "proposes countermeasures" in Option D, "This article reflects the concerns about biodiversity crisis and proposes countermeasures", is not mentioned in the given article. Thus, Option D should be excluded.

3.2 Required reasoning skills

In Gaokao Chinese, RC tasks measure an examinee's text understanding and logical reasoning abilities from different perspectives. We investigate the skills required for answering questions in GCRC and introduce eight typical skills, which are organized into the following three levels according to the amount of information needed and the complexity of reasoning for QA. Some of the reasoning skills are similar to those proposed by Sugawara et al. (2017).

3.2.1 Level 1: Basic information capturing

This level covers the ability to capture relevant detailed information distributed across the article and combine it to match with the options.

Detail understanding. It focuses on distinguishing the semantic differences between the given article and an option. The option most of the time preserves most of the lexical surface form of the original article but has some minor differences in details, introduced by using different modifiers or qualifying words.

3.2.2 Level 2: Local information integration

This level covers identifying different types of relationships between sentences in an article. In Gaokao Chinese, these relationships are often expressed implicitly in the given article.

Temporal/spatial reasoning. It aims to understand various temporal or spatial properties of events, entities and states.

Coreference resolution. It aims to understand coreference and anaphoric chains by recognizing the expressions referring to the same entity in the given article.

Deductive reasoning. It focuses on taking a general rule or key idea described in an article and applying it to make inferences about a specific example or phenomenon expressed in the option. For example (translated from a real Gaokao Chinese exam in 2018), the statement "we rely on mobile navigation to travel and lose the ability to identify routes" is a specific phenomenon of the statement "while we are training artificial intelligence systems, we may also be trained by artificial intelligence systems".

Mathematical reasoning. It performs some mathematical operations, such as numerical sorting and comparison, to obtain the correct option.

Cause-effect comprehension. It aims to understand the causal relationships explicitly or implicitly expressed in a given article.

3.2.3 Level 3: Global information integration

This level involves integrating information from multiple sentences or the whole article to comprehend the main ideas, the article's organization structure, and the author's emotion and sentiment.

Inductive reasoning. It integrates information from separate words and sentences and makes inferences about an option, which is often a summarization of several sentences, a paragraph or the whole article.

Appreciative analysis. It aims to understand the article's organization method, the author's writing style and method, attitude, opinion and emotional state. For example, in Figure 1, Option B, "The first paragraph highlights the severity of the biodiversity crisis by giving some statistics", is the analysis result of the author's writing style and method.

4 Construction and Annotation of GCRC

We have spent tremendous effort to construct the GCRC dataset. Firstly, we search and download about 10000 recent (year 2015 to 2020) multiple-choice questions from real and mock Gaokao Chinese exams from five websites. Then, for preprocessing, we remove duplicated questions and questions with figures and tables, and keep the questions with four options (only one of which is correct). Next, we identify and rectify some mistakes such as typos in the articles or the corresponding questions/options. Finally, the total number of questions in GCRC is 8719. Moreover, we adjust the labels of the correct answers, making them evenly distributed over the four different options, i.e., A, B, C, D.
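The label-balancing step can be pictured with a small sketch like the one below. It shows one straightforward way to redistribute the correct answers evenly over A–D, under assumed field names, and is not necessarily the exact procedure used here.

```python
import random

def balance_answer_labels(questions, labels=("A", "B", "C", "D"), seed=0):
    """Hypothetical sketch of label balancing: shuffle each question's options
    so that correct answers end up roughly evenly distributed over A-D."""
    rng = random.Random(seed)
    for q in questions:  # q: {"options": [four strings], "answer": "B", ...}
        order = list(range(4))
        rng.shuffle(order)
        correct_idx = labels.index(q["answer"])
        q["options"] = [q["options"][i] for i in order]
        q["answer"] = labels[order.index(correct_idx)]
    return questions
```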

Next, we annotate the questions according to the following three steps. We would emphasize that the annotation process is extremely challenging, as it requires human annotators to fully understand both the syntactic structure and the semantic information of the articles and the corresponding questions and options, as well as identify the error reasons of distractors and the reasoning skills required.

• Step 1: Annotation preparation. We first prepare an annotation guideline, including the task definition and annotated examples. Then we invite 12 graduate students of our team to participate in the annotation work. To maintain high-quality and consistent annotations, our annotators first annotate these questions individually and subsequently discuss and reach agreement if there is any discrepancy between two annotators. This process further improves the annotation guideline and better trains our annotators.

• Step 2: Initial annotations. Firstly, each question is annotated by an annotator independently, where we show an article, a question, all candidate options and the label of the answer. Annotators complete the following three tasks: (1) select the sentences in the article which are needed for reasoning, as the sentence-level SFs; (2) provide the ERs of a distractor from the types discussed in Section 3.1; (3) provide the reasoning skills required for each option based on the skills described in Section 3.2.

• Step 3: Annotation consistency. When two annotators disagree with their own annotations, we invite a third annotator to discuss with them and reach the final annotations. In the rare cases where they cannot agree with each other, we keep the annotations with at least two supports.

The annotators' consistency is evaluated by the Inter-Annotator Agreement (IAA) value, and the IAA value is 83.8%.

As mentioned before, the manual annotation of GCRC is expensive and complicated, because questions in Gaokao Chinese are designed for adult native speakers and thus are challenging; deep language comprehension is required to solve them. As a result, the explainable annotations made in GCRC across multiple levels make GCRC a very precious dataset with valuable annotations. Although the size of the dataset is not too big, it is big enough to provide a diagnosis of existing MRC systems. It also provides an ideal testbed for researchers to propose novel transfer learning or few-shot learning methods to solve the tasks.

Statistics. We partition our data into training (80%), development (dev, 10%) and test (10%) sets, mainly according to the number of questions. We have annotated three types of information for a subset of GCRC. Specifically, we annotate the sentence-level SFs for 8084 options (of 2021 questions) sampled from the training set and 6900 options (of 1725 questions) in the dev and test sets.

In addition, we annotate the ERs of 6159 distractors in the training, dev and test sets, and the reasoning skills for 6900 options in the dev and test sets. We believe our dataset, with relatively large annotation sizes, can ensure that we identify the limitations of existing RC systems in an explainable way. Table 1 shows the details of the GCRC data splitting and the corresponding annotation sizes.

Table 2 shows a detailed comparison of GCRC with three other RC datasets collected from standard exams, including C3 (Sun et al., 2020), RACE (Lai et al., 2017) and DREAM (Sun et al., 2019). As shown in Table 2, GCRC has the longest average length of articles, questions and options.

Table 3 shows the distribution of types of reasoning skills based on the dev and test sets. We observe that 48.77% of questions need detail understanding and 33.10% of questions require inductive reasoning. In addition, Figure 2 presents the distribution of ER types of distractors based on the dev and test sets, in which 33.2% of distractors have wrong details and 26.4% of distractors include information irrelevant to the corresponding articles.

| Splitting | Train | Dev | Test | Total |
| # of articles | 3790 | 683 | 620 | 5093 |
| # of questions | 6994 | 863 | 862 | 8719 |
| # of questions/options with SFs | 2021/8084 | 863/3452 | 862/3448 | 3746/14984 |
| # of questions/distractors with ERs | 2000/3261 | 863/1428 | 862/1470 | 3725/6159 |
| # of questions/options with reasoning skills | - | 863/3452 | 862/3448 | 1725/6900 |

Table 1: Statistics of data splitting and annotation size.

| | GCRC (Chinese characters) | C3 (Chinese characters) | GCRC (tokens) | C3 (tokens) | RACE (tokens) | DREAM (tokens) |
| Article len (avg) | 1119.2 | 116.9 | 329.6 | 53.8 | 352.4 | 66.1 |
| Question len (avg) | 20.9 | 12.2 | 14.2 | 7.8 | 11.3 | 7.4 |
| Option len (avg) | 43.8 | 5.5 | 27.1 | 3.2 | 6.7 | 4.2 |

Table 2: Statistics and comparison among four MRC datasets collected from standard exams.

5 Experiments

With our newly constructed GCRC dataset, it is interesting to evaluate the performance of existing models and to better understand GCRC's characteristics compared with other existing datasets.

5.1 QA accuracy and GCRC difficulty level

We evaluate the QA performance of several popular MRC systems on GCRC, which reflects the difficulty level of GCRC. Specifically, for a comprehensive evaluation, we employ four models, including one rule-based model and three recent state-of-the-art models based on neural networks.

• Sliding window (Richardson et al., 2013). It is a rule-based baseline that chooses the answer with the highest matching score. In particular, it uses a TF-IDF style representation and calculates the lexical similarity between a sentence (formed by concatenating a question and one of its candidate options) and each span of a fixed window size in the given article (a minimal sketch of this scoring is given after the model list).

• Co-Matching (Wang et al., 2018). It is a Bi-LSTM-based model and consists of a co-matching component and a hierarchical aggregation component. The model not only matches the article with a question and each candidate option at the word level, but also captures the sentence structure of the article. It has achieved promising results on RACE. Note that we use 300-dimensional word embeddings based on GloVe (Global Vectors for Word Representation).3

Figure 2: Distribution (%) of types of ERs of distractors based on the dev and test sets, including 3725 questions and 6159 distractors.

| Reasoning skills | Dev | Test | All |
| Level 1 (Basic information capturing): Detail | 47.11 | 50.44 | 48.77 |
| Level 2 (Local information integration): Temporal/spatial | 0.84 | 0.73 | 0.79 |
| Level 2 (Local information integration): Co-reference | 5.42 | 4.88 | 5.15 |
| Level 2 (Local information integration): Deductive | 3.53 | 3.79 | 3.66 |
| Level 2 (Local information integration): Mathematical | 0.45 | 0.67 | 0.56 |
| Level 2 (Local information integration): Cause-effect | 5.67 | 4.76 | 5.21 |
| Level 3 (Global information integration): Inductive | 34.75 | 31.45 | 33.10 |
| Level 3 (Global information integration): Appreciative | 2.23 | 3.30 | 2.76 |
| Level 1: Basic information capturing (total) | 47.11 | 50.44 | 48.77 |
| Level 2: Local information integration (total) | 15.91 | 14.83 | 15.36 |
| Level 3: Global information integration (total) | 36.98 | 34.75 | 35.87 |

Table 3: Distribution (%) of types of required skills based on the annotated examples, in total including 1725 questions and 6900 options.


• BERT (Devlin et al., 2019). It is a pre-trained language model which adopts multiple bidirectional Transformer layers and self-attention mechanisms to learn contextual relations between the words in a text. BERT has achieved very good performance on numerous NLP tasks, including the RC task defined by SQuAD. We employ the Chinese BERT-base and English BERT-base models released on the website.4

• XLNet (Yang et al., 2019). It is a generalized autoregressive pretraining model which uses a permutation language modeling objective to combine the advantages of autoregressive and autoencoding methods. XLNet outperforms BERT on 20 tasks. For experiments on Chinese datasets, we use XLNet-large and its Chinese version released on the website.5

3 The English word embedding: https://nlp.stanford.edu/projects/glove; the Chinese word embedding: https://nlp.stanford.edu/projects/glove

4 https://github.com/google-research/bert
5 https://github.com/brightmart/xlnet_zh
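For reference, the sliding-window scoring of the rule-based baseline above can be sketched as follows. This is a simplified reconstruction (the inverse-frequency weighting is computed from the article alone and the tokenizer is left abstract), not the exact implementation of Richardson et al. (2013).

```python
import math
from collections import Counter

def sliding_window_score(article_tokens, question_tokens, option_tokens, window=20):
    """Minimal sketch of the sliding-window baseline: score an option by the best
    inverse-frequency-weighted overlap between (question + option) and any article window."""
    counts = Counter(article_tokens)
    idf = {w: math.log(1.0 + len(article_tokens) / counts[w]) for w in counts}
    target = set(question_tokens) | set(option_tokens)
    best = 0.0
    for start in range(max(1, len(article_tokens) - window + 1)):
        span = article_tokens[start:start + window]
        best = max(best, sum(idf.get(w, 0.0) for w in set(span) & target))
    return best

def choose_answer(article_tokens, question_tokens, options_tokens):
    """Return the index of the option with the highest matching score."""
    scores = [sliding_window_score(article_tokens, question_tokens, opt)
              for opt in options_tokens]
    return max(range(len(scores)), key=lambda i: scores[i])
```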


In addition to evaluating the performance of the different models on GCRC, we also want to see how these models perform on other datasets, namely C3, RACE and DREAM. In order to compare with them fairly and objectively, we randomly sample from these datasets and create three new datasets with the same split sizes (train, dev, test) as GCRC (shown in Table 1). The hyper-parameters of these neural baselines can be seen in the Appendix.

Human Performance. We obtain human performance on 200 questions randomly sampled from the GCRC test set. We invite 20 high school students to answer these questions, where they are provided with the questions, options and articles. The average accuracy of humans is 83.18%.

We first use the training sets from the four datasets to train the four models, which are subsequently used to evaluate their accuracies on the respective test data. The dev sets are used to tune the parameter values of the different models. Table 4 shows the comparison results. We observe that the neural network based models outperform the rule-based sliding window model. In addition, pretrained models perform better than the non-pretrained models in most cases. Moreover, it is interesting to observe that all four existing models generally perform worse on GCRC than on the other three datasets, and the human performance on GCRC is also the lowest among the four datasets. This clearly indicates that GCRC is more challenging. Meanwhile, the performance gap between humans and the best system on GCRC is 47.45%, suggesting that there is sufficient room for improvement, which contradicts some overly optimistic claims that machines have exceeded humans on QA. As there is a significant performance gap between machines and humans, the GCRC dataset facilitates researchers to develop novel learning approaches to better understand its questions and bridge the huge gap.

| | GCRC | C3 | RACE | DREAM |
| Sliding window | 27.70 | 36.70 | 30.85 | 40.08 |
| Co-Matching | 35.73 | 45.01 | 35.38 | 48.91 |
| BERT | 30.74 | 64.96 | 46.13 | 53.36 |
| XLNet | 35.15 | 60.90 | 50.00 | 59.63 |
| Human | 83.18 | 96.0* | 95.5* | 94.5* |

Table 4: Accuracy comparison between four computational models and humans on four benchmark datasets (* indicates the performance is based on the annotated test set and copied from the corresponding paper).

In the next few subsections, we study whether a representative model, i.e., BERT, can perform well on the explainability-related subtasks, namely locating sentence-level SFs, identifying ERs, and handling the different reasoning skills required by questions.

5.2 Performance of locating sentence-level SFs using BERT

We investigate whether BERT benefits from sentence-level SFs. We conduct an experiment in which we input the ground-truth SFs instead of the given article to BERT for question answering. We observe that the accuracy of QA on the test set of GCRC is 37.47%. Compared with the result shown in Table 4, the accuracy is increased by 6.73%, indicating that locating SFs is helpful for QA. On the other hand, we can see that the improvement is still far from closing the gap to human performance, because the questions are difficult and need further reasoning to solve.

Due to the usefulness of SFs, we train a BERT model to identify whether a sentence belongs to the SFs. We regard this task as a sentence-pair (i.e., an original sentence in the given article and an option) classification task. We use the GCRC subset (2021/863/862 questions, as shown in Table 1) annotated with SFs for training and testing, where the training, dev and test sets include 200798/85024/83656 sentence-option pairs, respectively. The experimental results are shown in Table 5. We can see that the performance of locating the SFs is still low (in terms of precision P, recall R and F1 measure), indicating that accurately obtaining the SFs in GCRC is challenging and that directly applying BERT will not work well.

| | P | R | F1 |
| Dev | 78.07 | 50.18 | 61.10 |
| Test | 70.20 | 51.46 | 59.39 |

Table 5: Results of BERT on locating SFs.
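A minimal sketch of this sentence-pair formulation is given below, assuming the Hugging Face transformers library and the bert-base-chinese checkpoint (the paper does not specify the toolkit). After fine-tuning on the annotated sentence-option pairs, each article sentence is paired with an option and classified as a supporting fact or not.

```python
# Sketch only: not the authors' code. Assumes Hugging Face `transformers`
# and the `bert-base-chinese` checkpoint; fine-tuning is omitted.
import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
model = BertForSequenceClassification.from_pretrained("bert-base-chinese", num_labels=2)
model.eval()

def is_supporting_fact(article_sentence: str, option: str) -> bool:
    """Predict whether `article_sentence` is a supporting fact for `option`."""
    inputs = tokenizer(article_sentence, option, truncation=True,
                       max_length=320, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    return logits.argmax(dim=-1).item() == 1  # label 1 = supporting fact
```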

5.3 Performance of identifying ERs of a distractor using BERT

In order to evaluate whether the BERT model understands why a distractor should be excluded, we modify the BERT model and add a new component to identify distractors' ERs. The component is a multi-label (i.e., 8 labels, including 1 correct type and 7 types of ERs) classifier that predicts the probability distribution over the types of ERs. The classifier's objective is to minimize a cross-entropy loss; it is jointly optimized with normal QA, and the two share the low-level representations. For this task, the contextual input of BERT is the ground-truth SFs instead of the given article. The results are shown in Table 6. We observe that the performance of identifying ERs is quite low, indicating the significance and value of our dataset, which is to identify the limitations of existing systems on explainability and to facilitate researchers to develop novel learning models.

| ERs of options | # of options | R | P |
| Wrong details | 484 | 1.65 | 9.09 |
| Wrong temporal properties | 29 | 13.79 | 0.55 |
| Wrong subject-predicate-object triple | 190 | 2.11 | 5.88 |
| Wrong causality | 218 | 24.77 | 7.62 |
| Wrong necessary and sufficient conditions | 106 | 1.89 | 2.22 |
| Irrelevant to the question | 85 | 17.65 | 2.21 |
| Irrelevant to the article | 352 | 4.83 | 17.35 |
| Correct options | 1984 | 28.18 | 56.69 |

Table 6: Results of BERT on identifying ERs of distractors.
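A rough sketch of this joint setup is shown below, assuming PyTorch and Hugging Face transformers (which the paper does not name): a shared BERT encoder feeds both an option-scoring head for QA and an 8-way head over {correct, 7 ER types}. For simplicity the ER head is treated here as single-label multi-class.

```python
# Illustrative sketch under assumed tooling; not the authors' implementation.
import torch
import torch.nn as nn
from transformers import BertModel

class BertWithERHead(nn.Module):
    """Shared BERT encoder with (1) an option-scoring head for QA and
    (2) an 8-way head over {correct, 7 error-reason types} per option."""
    def __init__(self, name="bert-base-chinese", num_er_types=8):
        super().__init__()
        self.bert = BertModel.from_pretrained(name)
        hidden = self.bert.config.hidden_size
        self.qa_head = nn.Linear(hidden, 1)           # score for one option
        self.er_head = nn.Linear(hidden, num_er_types)

    def forward(self, input_ids, attention_mask):
        # input_ids: (batch * 4 options, seq_len), each row pairs the ground-truth
        # SFs with one question+option.
        pooled = self.bert(input_ids, attention_mask=attention_mask).pooler_output
        return self.qa_head(pooled), self.er_head(pooled)

def joint_loss(qa_logits, er_logits, answer_idx, er_labels, alpha=1.0):
    """Answer selection over 4 options plus ER classification per option."""
    qa_loss = nn.functional.cross_entropy(qa_logits.view(-1, 4), answer_idx)
    er_loss = nn.functional.cross_entropy(er_logits, er_labels)
    return qa_loss + alpha * er_loss
```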

5.4 Performance of different reasoning skills required by questions using BERT

We also investigate BERT's performance on the questions requiring different reasoning skills. We report the performance for each type of reasoning skill on the test set of GCRC. Table 7 shows the results. Note that the reasoning skills are annotated for the options of questions. We observe that BERT obtains the lowest score on the options requiring deductive reasoning. Overall, the system generally performs poorly on each type of option, indicating that its reasoning power is not strong enough and needs to be significantly improved.

| Reasoning skill | Detail understanding | Temporal/spatial | Coreference | Deductive |
| # of options | 1683 | 42 | 179 | 143 |
| Accuracy | 31.97 | 61.90 | 39.66 | 35.66 |

| Reasoning skill | Mathematical | Cause-effect | Inductive | Appreciative |
| # of options | 40 | 175 | 1056 | 130 |
| Accuracy | 60.00 | 40.00 | 33.14 | 47.69 |

Table 7: Results of BERT on GCRC by reasoning skills required for QA.

From the above, it can be seen that no single baseline has been built that outputs answers, sentence-level SFs and ERs of distractors together, but this does not affect the ability of our dataset to diagnose the limitations of existing RC models. For each task, we modify the existing BERT model, output the corresponding explanations and report their performance; thus, the limitations of the model can be clearly identified. In the future, we will build a baseline that can do all the tasks together, and we will design a new joint metric for evaluating the whole question-answering process.

6 Conclusions

In this paper, we present a new challenging machine reading comprehension dataset (GCRC) collected from Gaokao Chinese, consisting of 8719 high-level comprehensive multiple-choice questions. To the best of our knowledge, this is currently the most comprehensive, challenging and high-quality dataset in the MRC domain. In addition, we spend considerable effort to label three types of information, including sentence-level SFs, ERs of a distractor and the reasoning skills required for QA, aiming to comprehensively evaluate systems in an explainable way. Through experiments, we observe that GCRC is a very challenging dataset for existing models, and we hope it can inspire innovative machine learning and reasoning approaches to tackle this challenging problem and make MRC an enabling technology for many real-world applications.

Acknowledgments

We thank the anonymous reviewers for their helpful comments and suggestions. This work was supported by the National Key Research and Development Program of China (No. 2018YFB1005103).

References

Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. 2018. Think you have solved question answering? Try ARC, the AI2 reasoning challenge. CoRR, abs/1803.05457.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.

Shangmin Guo, Xiangrong Zeng, Shizhu He, Kang Liu, and Jun Zhao. 2017. Which is the effective way for Gaokao: Information retrieval or neural networks? In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, pages 111–120, Valencia, Spain. Association for Computational Linguistics.

Xanh Ho, Anh-Khoa Duong Nguyen, Saku Sugawara, and Akiko Aizawa. 2020. Constructing a multi-hop QA dataset for comprehensive evaluation of reasoning steps. In Proceedings of the 28th International Conference on Computational Linguistics, pages 6609–6625, Barcelona, Spain (Online). International Committee on Computational Linguistics.

Zixian Huang, Yulin Shen, Xiao Li, Yu'ang Wei, Gong Cheng, Lin Zhou, Xinyu Dai, and Yuzhong Qu. 2019. GeoSQA: A benchmark for scenario-based question answering in the geography domain at high school level. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 5866–5871, Hong Kong, China. Association for Computational Linguistics.

Naoya Inoue, Pontus Stenetorp, and Kentaro Inui. 2020. R4C: A benchmark for evaluating RC systems to get the right answer for the right reason. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 6740–6750, Online. Association for Computational Linguistics.

Daniel Khashabi, Snigdha Chaturvedi, Michael Roth, Shyam Upadhyay, and Dan Roth. 2018. Looking beyond the surface: A challenge set for reading comprehension over multiple sentences. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 252–262, New Orleans, Louisiana. Association for Computational Linguistics.

Guokun Lai, Qizhe Xie, Hanxiao Liu, Yiming Yang, and Eduard Hovy. 2017. RACE: Large-scale ReAding Comprehension dataset from Examinations. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 785–794, Copenhagen, Denmark. Association for Computational Linguistics.

Matthew Richardson, Christopher J.C. Burges, and Erin Renshaw. 2013. MCTest: A challenge dataset for the open-domain machine comprehension of text. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 193–203, Seattle, Washington, USA. Association for Computational Linguistics.

Saku Sugawara, Kentaro Inui, Satoshi Sekine, and Akiko Aizawa. 2018. What makes reading comprehension questions easier? In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 4208–4219, Brussels, Belgium. Association for Computational Linguistics.

Saku Sugawara, Yusuke Kido, Hikaru Yokono, and Akiko Aizawa. 2017. Evaluation metrics for machine reading comprehension: Prerequisite skills and readability. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 806–817, Vancouver, Canada. Association for Computational Linguistics.

Kai Sun, Dian Yu, Jianshu Chen, Dong Yu, Yejin Choi, and Claire Cardie. 2019. DREAM: A challenge data set and models for dialogue-based reading comprehension. Transactions of the Association for Computational Linguistics, 7:217–231.

Kai Sun, Dian Yu, Dong Yu, and Claire Cardie. 2020. Investigating prior knowledge for challenging Chinese machine reading comprehension. Transactions of the Association for Computational Linguistics, 8:141–155.

Shuohang Wang, Mo Yu, Jing Jiang, and Shiyu Chang. 2018. A co-matching model for multi-choice reading comprehension. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 746–751, Melbourne, Australia. Association for Computational Linguistics.

Sarah Wiegreffe and Ana Marasovic. 2021. Teach me to explain: A review of datasets for explainable NLP. CoRR, abs/2102.12060.

Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Russ R. Salakhutdinov, and Quoc V. Le. 2019. XLNet: Generalized autoregressive pretraining for language understanding. In Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc.

Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William Cohen, Ruslan Salakhutdinov, and Christopher D. Manning. 2018. HotpotQA: A dataset for diverse, explainable multi-hop question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2369–2380, Brussels, Belgium. Association for Computational Linguistics.

Weihao Yu, Zihang Jiang, Yanfei Dong, and Jiashi Feng. 2020. ReClor: A reading comprehension dataset requiring logical reasoning. In International Conference on Learning Representations.

Xiao Zhang, Ji Wu, Zhiyang He, Xien Liu, and Ying Su. 2018. Medical exam question answering with large-scale reading comprehension. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence (AAAI-18), the 30th innovative Applications of Artificial Intelligence (IAAI-18), and the 8th AAAI Symposium on Educational Advances in Artificial Intelligence (EAAI-18), New Orleans, Louisiana, USA, February 2-7, 2018, pages 5706–5713. AAAI Press.

Ben Zhou, Daniel Khashabi, Qiang Ning, and Dan Roth. 2019. "Going on a vacation" takes longer than "going for a walk": A study of temporal commonsense understanding. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3363–3369, Hong Kong, China. Association for Computational Linguistics.


Appendix

A Hyper-parameters of Neural Baselines

The hyper-parameters of Co-Matching, BERT and XLNet are shown in Table 8, Table 9 and Table 10, respectively.

| Co-Matching | GCRC | C3 | RACE | DREAM |
| train batch size | 8 | 16 | 16 | 32 |
| dev batch size | 8 | 16 | 8 | 32 |
| test batch size | 8 | 16 | 8 | 32 |
| epoch | 50 | 100 | 50 | 100 |
| learning rate | 3e-5 | 1e-3 | 1e-3 | 2e-3 |
| seed | 128 | 64 | 256 | 128 |
| dropoutP | 0.2 | 0.2 | 0.2 | 0.2 |
| emb dim | 300 | 300 | 300 | 300 |
| mem dim | 150 | 150 | 150 | 150 |

Table 8: Hyper-parameters of Co-Matching.

| BERT | GCRC | C3 | RACE | DREAM |
| train batch size | 32 | 16 | 16 | 32 |
| dev batch size | 4 | 16 | 4 | 16 |
| test batch size | 4 | 16 | 4 | 16 |
| len | 320 | 384 | 450 | 384 |
| epoch | 6 | 3 | 10 | 3 |
| learning rate | 3e-5 | 1e-5 | 2e-5 | 1e-5 |
| gradient accumulation steps | 8 | 8 | 8 | 8 |
| seed | 42 | 42 | 42 | 42 |

Table 9: Hyper-parameters of BERT.

| XLNet | GCRC | C3 | RACE | DREAM |
| train batch size | 2 | 2 | 1 | 2 |
| dev batch size | 2 | 2 | 1 | 2 |
| test batch size | 2 | 2 | 1 | 2 |
| len | 320 | 320 | 320 | 180 |
| epoch | 16 | 16 | 5 | 8 |
| learning rate | 2e-4 | 1e-3 | 2e-5 | 1e-5 |
| gradient accumulation steps | 2 | 2 | 24 | 2 |

Table 10: Hyper-parameters of XLNet.

1320

Figure 1 An annotated instance in GCRC which is from the real Gaokao Chinese in 2019 ( indicatesthe correct option) The sentences marked in yellowgreen are respectively the SFs for Option A and CAs mentioned in Option B the SFs of Option B are the first paragraph Similarly the SFs of Option D arethe whole article The contents marked in blue are the required reasoning skills for the options and theERs of distractors

or multi-hop reasoning Moreover MultiRC andHotpotQA both provide sentence-level supportingfacts (SFs) which can be used as a kind of internalexplanation for the answers although locating SFsis just the first step for question answering (QA) asmany questions need to integrate the SF informa-tion with reasoning R4C (Inoue et al 2020) and2WikiMultiHopQA (Ho et al 2020) introduce achain of facts (reflecting entity relationships) as thederivation steps to evaluate the internal reasoningstep of systems

However these kinds of reasonings are quitelimited and not enough to answer those complexand comprehensive questions For example Fig-ure1 shows a multi-choice question where the op-tion D ldquoThis article reflects the concerns aboutbiodiversity crisis and proposes countermeasuresrdquois a summarization of the whole given article andits judgement cannot be made by simply reasoning

over some entity relationships Instead the reason-ing over the full information across the article willbe needed

To address the above real challenges we pro-pose GCRC a new challenging dataset with 8719multi-choice questions collected from readingcomprehension (RC) tasks of Gaokao Chinese(short for Chinese subject from the National Col-lege Entrance Examination of China) GCRC isof high quality and high difficulty level becauseGaokao Chinese examinations are designed by ed-ucational experts with high-level comprehensivequestions and complicated articles aiming to testthe language comprehension of adult examinees

In order to provide explainable evaluation in-stances (eg Figure 1) in the dev and test sets(including 1725 questions and 6900 options as pre-sented in Table 2 and Section 4) of GCRC are anno-tated with three kinds of information (1) sentence-

1321

level supporting facts (SFs) serves as the basicevaluation for a systemrsquos internal reasoning and(2) error reasons (ERs) of a distractor (ie anincorrect option) for explaining why a distractorshould be eliminated In Gaokao Chinese distrac-tors are often designed in a very confusing way andlook like a correct option although they are incor-rect Identifying the semantic difference betweena distractor and the given article and knowing ex-actly why a distractor is wrong could help systemsor examinees to choose the correct answers There-fore we spend considerable effort to thoroughlyunderstand articles and manually label seven typesof ERs of a distractor as a form of internal reason-ing step for multi-choice questions and (3) types ofreasoning skills required for answering questionsin Gaokao Chinese We introduce eight typicalreasoning skills in GCRC which enable us to eval-uate whether a system has corresponding reasoningcapability at individual question level

Our main contributions can be summarized as

bull We construct a new challenging datasetGCRC that is collected from Gaokao Chi-nese consisting of 8719 multiple-choicequestions which will be released on Githubin future

bull Three kinds of critical information are man-ually annotated for explainable evaluationTo the best of our knowledge GCRC is thefirst Chinese MRC dataset to provide mostcomprehensive challenging high-quality ar-ticles and questions for more deep explain-able evaluation on different MRC system per-formance In particular error reasons of adistractor are the first introduced in this areaas an important reasoning step for complexmulti-choice questions

bull Extensive experiments show that GCRC isnot only a more challenging benchmark tofacilitate researchers to develop novel mod-els but also help us identify the limitationsof existing MRC systems in an explainableway

2 Related Work

21 Datasets from standard tests

There exist some datasets collected from standardexamstests including English datasets RACE (Lai

et al 2017) DREAM (Sun et al 2019) ARC(Clark et al 2018) ReClor (Yu et al 2020) etcand Chinese datasets C3 (Sun et al 2020) MCQA(Guo et al 2017) GeoSQA (Huang et al 2019)and MedQA (Zhang et al 2018) etc

Some of these datasets are of specific domainsFor example ARC is of the science domain andthe questions are from the American science examsfrom 3rd to 9th grade ReClor targets logical rea-soning and the questions are collected from the LawSchool Admission Council MCQA and GeoSQAare extracted from Gaokao History and Gaokao Ge-ography respectively Other datasets cover generictopics For example RACE and DREAM are bothcollected from the English exams for Chinese stu-dents while C3 is collected from the Chinese ex-ams for Chinese-as-a-second-language learners

GCRC is more similar to C3 RACE andDREAM as their questions are all generic onesHowever their question difficulty level is differentbecause questions in GCRC target for adult nativespeakers while questions in other datasets targetfor second-language learners As such the GCRCis much more challenging than other datasets Ad-ditionally GCRC provides three kinds of rich in-formation for more deep explainable evaluation

22 Datasets with explanations

Some datasets provide explanation information(Wiegreffe and Marasovic 2021) identified threetypes of explanations highlights free-text andstructured explanations Highlights are subsets ofthe input elements (words phrases snippets or fullsentences) to explain the prediction Free-text ex-planations are textual or natural language explana-tions containing the information beyond the giveninput Structured explanations have various formsfor specific datasets One of the most common formis a chain of facts as the derivation of multi-hopreasoning for an answer

(Inoue et al 2020) classified explanations intotwo types justification explanation (collections ofSFs for a decision) and introspective explanation(a derivation for a decision) which are respectivelycorresponding to highlights and structured expla-nations

For the MRC task a few datasets are withexplanation information For example MultiRC(Khashabi et al 2018) and HotpotQA (Yang et al2018) provides sentence-level SFs belonging tojustification explanations to evaluate a systemrsquos

1322

ability to identify SFs from articles that related toa questionoption R4C (Inoue et al 2020) and2WikiMultiHopQA (Ho et al 2020) provide bothjustification and introspective explanations In bothdatasets introspective explanations are a set of fac-tual entity relationships Their difference is thatthe explanation of R4C is of semi-structured formwhile the explanation of 2WikiMultiHopQA is ofstructured data form

The dataset C3 provides the types of requiredprior knowledge for 600 questions and its goal isto study how to leverage various knowledge to im-prove the systemrsquos ability for text comprehension

Inspired by these datasets we provide morerich explanations in GCRC Different from the ex-isting work besides SFs we provide two additionalinformation Specifically we propose innovativeERs of a distractor as a reasoning step for multi-choice questions In addition we annotate the re-quired reasoning skills for each instance whichenable us to clearly identify the limitations of ex-isting MRC systems Note that prior knowledgeannotated in C3 is different from reasoning skillsin GCRC as prior knowledge in C3 mainly in-clude linguistic domain specific and general worldknowledge while reasoning skills in GCRC focuson the abilities of making an inference Generallyreasoning needs to integrate prior knowledge andthe information in the given text As far as we knowGCRC is the first Chinese MRC dataset with richexplainable information for evaluation purpose

3 Dataset Overview

Questions in GCRC are of multi-choice styleSpecifically given an article D a question Q andits options O the evaluation based on GCRC willmeasure a system from the following aspects

bull QA accuracy It is a common evaluation met-ric also provided by the other QA datasets

bull The performance of locating the SFs in Dwhich is a basic evaluation metric to assesswhether MRC system can collect all neces-sary sentences from articles before reason-ing

bull The performance of identifying ERs of a dis-tractor which evaluates whether a systemis able to exclude incorrect options by deepreasoning for multi-choice questions

bull The performance of different reasoning skillsrequired by questions This evaluation showsthe limitations of reasoning skills for a MRCsystem

Next we clearly define the ERs of a distractor andthe reasoning skills required for choosing a correctoption in Section 31 and Section 32 respectivelyThe ERs of a distractor are concrete and focus onthe forms of the errors while reasoning skills aremore abstract referring to the abilities of makingan inference

31 Error reasons of a distractor

As mentioned in Section 1 knowing exactly whya distractor is wrong will help MRC systems or anexaminee select correct answers Therefore we in-troduce error reasons of a distractor as an importantinternal reasoning step and present seven typicalERs of a distractor after investigating the instancesin GCRC

Wrong details The distractor is lexically simi-lar to the original text but has a different meaningcaused by alterations of some words such as modi-fiers or qualifier Words

Wrong temporal properties The distractordescribes an events with wrong temporal propertiesGenerally we consider five temporal propertiesdefined in (Zhou et al 2019) duration temporalordering typical time frequency and stationary

Wrong subject-predicate-object triple rela-tionship The distractor has a triple relationshipof subject-predicate-object (sub-pred-obj) but therelationship conflicts with the ground truth triplein the given article caused by substituting one ofthe components in the triple For example in Fig-ure 1 Option A ldquoThe global ecosystem which isdeeply influenced by human beings helps alleviatethe biodiversity crisisrdquo has a sub-pred-obj triple

ldquothe global ecosystemmdashalleviatemdashthe biodiversitycrisis However from the given article the correcttriple is ldquothe global ecosystemmdashresult in mdashthe bio-diversity crisisrdquo leading to the wrong distractor

Wrong necessary and sufficient conditionsThe distractor intentionally misinterprets the nec-essary and sufficient conditions expressed in theoriginal article A necessary condition is a con-dition that must be present for an event to occurwhile a sufficient condition is a condition or setof conditions that will produce the event For ex-ample the statement ldquoPlants will grow as long as

1323

with airrdquo is wrong as air is the necessary conditionfor plant growing but besides air plants also needwater and sunshine to grow

Wrong causality The distractor re-versesconfuses the causes and effects mentionedin an original article or add non-existent causalityGenerally the cause-effect relationship is explicitlyexpressed with causal connectives in distractors

Irrelevant to the question The distractorrsquoscontents are indeed mentioned in the given articlebut are irrelevant to the current question

Irrelevant to the article The distractorrsquos con-tents are not mentioned in the original article Forexample in Figure 1 ldquoproposes countermeasuresrdquoexpressed in Option D ldquoThis article reflects theconcerns about biodiversity crisis and proposescountermeasuresrdquo is not mentioned in the givenarticle Thus Option D should be excluded

3.2 Required reasoning skills

In Gaokao Chinese, RC tasks measure an examinee's text understanding and logical reasoning abilities from different perspectives. We investigate the skills required for answering questions in GCRC and introduce eight typical skills, which are organized into the following three levels according to the amount of information needed and the complexity of reasoning for QA. Some of the reasoning skills are similar to those proposed by Sugawara et al. (2017).

3.2.1 Level 1: Basic information capturing

This level covers the ability to capture relevant detailed information distributed across the article and combine it to match with options.

Detail understanding. It focuses on distinguishing the semantic differences between the given article and an option. Most of the time, the option preserves most of the lexical surface form of the original article but has some minor differences in details by using different modifiers or qualifying words.

3.2.2 Level 2: Local information integration

This level covers how to identify different types of relationships between sentences in an article. In Gaokao Chinese, the expressions of such relationships are often implicit in a given article.

Temporal/spatial reasoning. It aims to understand various temporal or spatial properties of events, entities and states.

Coreference resolution. It aims to understand coreference and anaphoric chains by recognizing the expressions referring to the same entity in the given article.

Deductive reasoning. It focuses on taking a general rule or key idea described in an article and applying it to make inferences about a specific example or phenomenon expressed in the option. For example (translated from real Gaokao Chinese in 2018), the statement "we rely on mobile navigation to travel and lose the ability to identify routes" is a specific phenomenon of the statement "while we are training artificial intelligence systems, we may also be trained by artificial intelligence systems".

Mathematical reasoning. It performs some mathematical operations, such as numerical sorting and comparison, to obtain a correct option.

Cause-effect comprehension. It aims to understand the causal relationships explicitly or implicitly expressed in a given article.

3.2.3 Level 3: Global information integration

This level involves integrating information from multiple sentences or the whole article to comprehend the main ideas, the article's organization structure, and the authors' emotion and sentiment.

Inductive reasoning. It integrates information from separate words and sentences and makes inferences about an option, which is often a summarization of several sentences, a paragraph or the whole article.

Appreciative analysis. It aims to understand the article's organization method, the authors' writing style and method, attitude, opinion and emotional state. For example, in Figure 1, Option B "The first paragraph highlights the severity of the biodiversity crisis by giving some statistics" is the analysis result of the authors' writing style and method.

4 Construction and Annotation of GCRC

We have spent tremendous effort to construct the GCRC dataset. Firstly, we search and download about 10,000 of the latest (year 2015 to 2020) multiple-choice questions of real and mock Gaokao Chinese from five websites. Then, for preprocessing, we remove duplicated questions and questions with figures and tables, and keep questions with four options (only one of them is correct). Next, we identify and rectify some mistakes, such as typos in articles or the corresponding questions/options. Finally, the total number of questions in GCRC is 8,719. Moreover, we adjust the labels of the correct answers, making them evenly distributed over the four different options, i.e., A, B, C, D.
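A minimal sketch of this preprocessing and label-rebalancing pipeline is given below. The record fields (e.g., has_figure_or_table) and the use of per-question option shuffling for rebalancing are our assumptions, since the paper does not describe the exact procedure.

```python
import random

def preprocess(raw_questions):
    """Deduplicate, drop questions with figures/tables, and keep four-option questions (sketch)."""
    seen, kept = set(), []
    for q in raw_questions:  # q: dict with 'article', 'question', 'options', 'answer', ...
        key = (q["article"], q["question"])          # hypothetical deduplication key
        if key in seen or q.get("has_figure_or_table") or len(q["options"]) != 4:
            continue
        seen.add(key)
        kept.append(q)
    return kept

def rebalance_answer_labels(questions, seed=0):
    """Shuffle each question's options so gold labels spread roughly evenly over A-D (sketch)."""
    rng = random.Random(seed)
    for q in questions:
        order = list(range(4))
        rng.shuffle(order)
        gold_idx = "ABCD".index(q["answer"])
        q["options"] = [q["options"][i] for i in order]
        q["answer"] = "ABCD"[order.index(gold_idx)]   # new position of the original gold option
    return questions
```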

Next, we annotate the questions according to the following three steps. We emphasize that the annotation process is extremely challenging, as it requires human annotators to fully understand both the syntactic structure and semantic information of the articles and the corresponding questions and options, as well as identify the error reasons of distractors and the reasoning skills required.

• Step 1: Annotation preparation. We first prepare an annotation guideline, including the task definition and annotated examples. Then, we invite 12 graduate students of our team to participate in the annotation work. To maintain high-quality and consistent annotations, our annotators first annotate these questions individually, and subsequently discuss and reach agreements if there is any discrepancy between two annotators. The process further improves the annotation guideline and better trains our annotators.

• Step 2: Initial annotations. Firstly, each question is annotated by an annotator independently, where we show an article, a question, all candidate options and the label of the answer. Annotators complete the following three tasks: (1) select the sentences in the article which are needed for reasoning as the sentence-level SFs; (2) provide ERs of a distractor from the types discussed in Section 3.1; (3) provide the reasoning skills required for each option based on the skills described in Section 3.2.

• Step 3: Annotation consistency. When two annotators disagree with their own annotations, we invite a third annotator to discuss with them and reach the final annotations. In the rare cases where they cannot agree with each other, we keep the annotations supported by at least two annotators.

The annotators' consistency is evaluated by the Inter-Annotator Agreement (IAA) value, and the IAA value is 83.8%.
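The paper does not state which agreement statistic the IAA value corresponds to; as one plausible reading, the sketch below computes simple percentage agreement between two annotators over parallel label sequences (a chance-corrected measure such as Cohen's kappa would be an alternative).

```python
def percent_agreement(labels_a, labels_b):
    """Percentage of items on which two annotators assign the same label."""
    assert len(labels_a) == len(labels_b) and labels_a
    agree = sum(a == b for a, b in zip(labels_a, labels_b))
    return 100.0 * agree / len(labels_a)

# Example (hypothetical labels):
# percent_agreement(["Wrong details", "Wrong causality"],
#                   ["Wrong details", "Irrelevant to the article"])  # -> 50.0
```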

As mentioned before, manual annotation of GCRC is expensive and complicated, because questions in Gaokao Chinese are designed for adult native speakers and thus are challenging; it requires deep language comprehension to solve these questions. As a result, the explainable annotations in GCRC across multiple levels make it a very precious dataset with valuable annotations. Although the size of the dataset is not very big, it is big enough to provide a diagnosis of existing MRC systems. It also provides an ideal testbed for researchers to propose novel transfer learning or few-shot learning methods to solve the tasks.

Statistics. We partition our data into training (80%), development (dev, 10%) and test (10%) sets, mainly according to the number of questions. We have annotated three types of information for a subset of GCRC. Specifically, we annotate the sentence-level SFs for 8,084 options (of 2,021 questions) sampled from the training set and 6,900 options (of 1,725 questions) in the dev and test sets.

In addition, we annotate ERs for 6,159 distractors in the training, dev and test sets, and reasoning skills for 6,900 options in the dev and test sets. We believe our dataset, with relatively large annotation sizes, can enable us to identify the limitations of existing RC systems in an explainable way. Table 1 shows the details of GCRC data splitting and the corresponding annotation sizes.

Table 2 shows detailed comparisons between GCRC and three other RC datasets collected from standard exams, including C3 (Sun et al., 2020), RACE (Lai et al., 2017) and DREAM (Sun et al., 2019). As shown in Table 2, we observe that GCRC has the longest average length of articles, questions and options.

Table 3 shows the distribution of the types of reasoning skills based on the dev and test sets. We observe that 48.77% of the questions need detail understanding and 33.10% require inductive reasoning. In addition, Figure 2 presents the distribution of ER types of distractors based on the dev and test sets, in which 33.2% of distractors have wrong details and 26.4% of distractors include information irrelevant to the corresponding articles.


Splitting                               Train        Dev        Test       Total
# of articles                           3,790        683        620        5,093
# of questions                          6,994        863        862        8,719
# of questions/options with SFs         2,021/8,084  863/3,452  862/3,448  3,746/14,984
# of questions/distractors with ERs     2,000/3,261  863/1,428  862/1,470  3,725/6,159
# of questions/options with rea. ski.   -            863/3,452  862/3,448  1,725/6,900

Table 1: Statistics of data splitting and annotation size.

                     GCRC      C3       GCRC     C3      RACE    DREAM
                     (in Chinese characters)     (in tokens)
Article len (avg)    1119.2    116.9    329.6    53.8    352.4   66.1
Question len (avg)   20.9      12.2     14.2     7.8     11.3    7.4
Option len (avg)     43.8      5.5      27.1     3.2     6.7     4.2

Table 2: Statistics and comparison among four MRC datasets collected from standard exams.

5 Experiments

With our newly constructed GCRC dataset, it is interesting to evaluate the performance of existing models and better understand GCRC's characteristics compared with other existing datasets.

5.1 QA accuracy and GCRC difficulty level

We evaluate the QA performance of several popular MRC systems on GCRC, which reflects the difficulty level of GCRC. Specifically, for a comprehensive evaluation, we employ four models, including one rule-based model and three recent state-of-the-art models based on neural networks.

• Sliding window (Richardson et al., 2013). It is a rule-based baseline and chooses the answer with the highest matching score. In particular, it uses a TF-IDF-style representation and calculates the lexical similarity between a sentence (formed by concatenating a question and one of its candidate options) and each span in the given article with a fixed window size (a simplified sketch of this baseline is given after this list of models).

• Co-Matching (Wang et al., 2018). It is a Bi-LSTM-based model and consists of a co-matching component and a hierarchical aggregation component. The model not only matches the article with a question and each candidate option at the word level, but also captures the sentence structure of the article. It has achieved promising results on RACE.

Figure 2: Distribution (%) of types of ERs of distractors based on the dev and test sets, including 3,725 questions and 6,159 distractors.

Reasoning skills                                            Dev     Test    All
Level 1: Basic information capturing     Detail             47.11   50.44   48.77
Level 2: Local information integration   Temporal/spatial   0.84    0.73    0.79
                                         Co-reference       5.42    4.88    5.15
                                         Deductive          3.53    3.79    3.66
                                         Mathematical       0.45    0.67    0.56
                                         Cause-effect       5.67    4.76    5.21
Level 3: Global information integration  Inductive          34.75   31.45   33.10
                                         Appreciative       2.23    3.30    2.76
Level 1: Basic information capturing (total)                47.11   50.44   48.77
Level 2: Local information integration (total)              15.91   14.83   15.36
Level 3: Global information integration (total)             36.98   34.75   35.87

Table 3: Distribution (%) of types of required skills based on the annotated examples, totally including 1,725 questions and 6,900 options.

Note that we have used 300-dimensional word embeddings based on GloVe (Global Vectors for Word Representation).

• BERT (Devlin et al., 2019). It is a pre-trained language model which adopts multiple bidirectional Transformer layers and self-attention mechanisms to learn contextual relations between words in a text. BERT has achieved very good performance on numerous NLP tasks, including the RC task defined by SQuAD. We employ the Chinese BERT-base and English BERT-base models released on the website.

• XLNet (Yang et al., 2019). It is a generalized autoregressive pretraining model which uses a permutation language modeling objective to combine the advantages of autoregressive and autoencoding methods. XLNet outperforms BERT on 20 tasks. For experiments on Chinese datasets, we use XLNet-large and its Chinese version released on the website.
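As a concrete illustration of the rule-based baseline above, here is a minimal sketch of a TF-IDF-weighted sliding-window scorer in the spirit of Richardson et al. (2013). It is our own simplification (tokenization, weighting details and window size are assumptions), not the exact implementation used in these experiments.

```python
import math
from collections import Counter

def sliding_window_score(article_tokens, qa_tokens, window_size=20):
    """Score one question+option against the article: the maximum sum of inverse-frequency
    weights of overlapping tokens inside any fixed-size window (simplified)."""
    counts = Counter(article_tokens)
    n = len(article_tokens)
    weight = {w: math.log(1.0 + n / counts[w]) for w in counts}   # rarer tokens weigh more
    target = set(qa_tokens)
    best = 0.0
    for start in range(max(1, n - window_size + 1)):
        window = article_tokens[start:start + window_size]
        best = max(best, sum(weight[w] for w in window if w in target))
    return best

def answer_question(article_tokens, question_tokens, option_token_lists):
    """Choose the option whose concatenation with the question best matches some window."""
    scores = [sliding_window_score(article_tokens, question_tokens + opt)
              for opt in option_token_lists]
    return max(range(len(scores)), key=scores.__getitem__)   # index of the predicted option
```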

3 The English word embedding: https://nlp.stanford.edu/projects/glove/, and the Chinese word embedding: https://nlp.stanford.edu/projects/glove/.

4 https://github.com/google-research/bert
5 https://github.com/brightmart/xlnet_zh


In addition to evaluating the performance of different models on GCRC, we also want to see how these models perform on other datasets, namely C3, RACE and DREAM. In order to compare with them fairly and objectively, we randomly sample from these datasets and create three new datasets with the same split sizes (train, dev, test) as GCRC (shown in Table 1). The hyper-parameters of these neural baselines are given in the Appendix.

Human performance. We obtain the human performance on 200 questions which are randomly sampled from the GCRC test set. We invite 20 high school students to answer these questions, where they are provided with the questions, options and articles. The average human accuracy is 83.18%.

We first use the training sets from the four datasets to train the four models, which are subsequently used to evaluate their accuracies on the respective test data. Dev sets are used to tune the parameter values of the different models. Table 4 shows the comparison results. We observe that the neural-network-based models outperform the rule-based sliding window model. In addition, pre-trained models perform better than the non-pre-trained models in most cases. Moreover, it is interesting to observe that all four existing models generally perform worse on GCRC than on the other three datasets, and the human performance on GCRC is also the lowest among the four datasets. This clearly indicates that GCRC is more challenging. Meanwhile, it can be seen that the performance gap between humans and the best system on GCRC is 46.45%, suggesting that there is sufficient room for improvement, which contradicts some overly optimistic claims that machines have exceeded humans on QA. As there is a significant performance gap between machines and humans, the GCRC dataset facilitates researchers to develop novel learning approaches to better understand its questions and bridge the huge gap.

                 GCRC    C3      RACE    DREAM
Sliding window   27.70   36.70   30.85   40.08
Co-Matching      35.73   45.01   35.38   48.91
BERT             30.74   64.96   46.13   53.36
XLNet            35.15   60.90   50.00   59.63
Human            83.18   96*     95.5*   94.5*

Table 4: Accuracy comparison between four computational models and humans on four benchmark datasets (* indicates the performance is based on the annotated test set and copied from the corresponding paper).

In the next few subsections, we study whether a representative model, i.e., BERT, can perform well on the explainable-evaluation-related subtasks, namely locating sentence-level SFs, identifying ERs, and the different reasoning skills required by questions.

5.2 Performance of locating sentence-level SFs using BERT

We investigate whether BERT benefits from sentence-level SFs. We conduct an experiment in which we input the ground-truth SFs instead of the given article to BERT for question answering. We observe that the QA accuracy on the test set of GCRC is 37.47%. Compared with the result shown in Table 4, the accuracy increases by 6.73%, indicating that locating SFs is helpful for QA. On the other hand, we can see that the improvement is still far from closing the gap to human performance, because the questions are difficult and need further reasoning to solve.

Due to the usefulness of SFs, we train a BERT model to identify whether a sentence belongs to the SFs. We regard this task as a sentence-pair (i.e., an original sentence in the given article and an option) classification task. We use the GCRC subset (2,021/863/862 questions, as shown in Table 1) annotated with SFs for training and testing, where the training, dev and test sets include 200,798, 85,024 and 83,656 sentence-option pairs, respectively. The experimental results are shown in Table 5. We can see that the performance of locating the SFs is still low (in terms of precision P, recall R and F1 measure), indicating that accurately obtaining the SFs in GCRC is challenging and directly applying BERT will not work well.
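As a rough sketch of this sentence-pair formulation (using the HuggingFace transformers library for brevity; the checkpoint name, maximum length and other settings below are assumptions rather than the exact configuration in Table 9):

```python
import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
model = BertForSequenceClassification.from_pretrained("bert-base-chinese", num_labels=2)

def is_supporting_fact(sentence: str, option: str) -> bool:
    """Binary prediction: is this article sentence a supporting fact for the given option?"""
    inputs = tokenizer(sentence, option, return_tensors="pt", truncation=True, max_length=128)
    with torch.no_grad():
        logits = model(**inputs).logits          # shape: (1, 2)
    return bool(logits.argmax(dim=-1).item())

# Training (not shown) would minimize cross-entropy over the annotated sentence-option pairs:
#   loss = model(**batch_inputs, labels=batch_labels).loss
```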

       P       R       F1
Dev    78.07   50.18   61.10
Test   70.20   51.46   59.39

Table 5: Results of BERT locating SFs.

5.3 Performance of identifying ERs of a distractor using BERT

In order to evaluate whether the BERT model understands why a distractor should be excluded, we modify the BERT model and add a new component to perform the identification of distractors' ERs. The component is a multi-label classifier (i.e., 8 labels, including 1 correct type and 7 types of ERs) that predicts the probability distribution over the types of ERs.


ERs of options                              #      R       P
Wrong details                               484    1.65    9.09
Wrong temporal properties                   29     13.79   0.55
Wrong subject-predicate-object triple       190    2.11    5.88
Wrong causality                             218    24.77   7.62
Wrong necessary and sufficient conditions   106    1.89    2.22
Irrelevant to the question                  85     17.65   2.21
Irrelevant to the article                   352    4.83    17.35
Correct options                             1984   28.18   56.69

Table 6: Results of BERT identifying ERs of distractors.

The classifier's objective is to minimize a cross-entropy loss, which is jointly optimized with normal QA, and the two tasks share the low-level representations. For this task, the contextual input of BERT is the ground-truth SFs instead of the given article. The results are shown in Table 6. We observe that the performance of identifying ERs is quite low, indicating the significance and value of our dataset, which is to identify the limitations of existing systems on explainability and to facilitate researchers to develop novel learning models.
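The modified model is not released; the following is a rough sketch of one way such a shared-encoder, two-head architecture could look (class and variable names, the pooling choice and the unweighted sum of the two losses are our assumptions). Because the paper calls the head "multi-label", a sigmoid with binary cross-entropy per label could replace the single-label cross-entropy sketched here if one option may carry several ERs at once.

```python
import torch.nn as nn
import torch.nn.functional as F
from transformers import BertModel

class JointQAWithERHead(nn.Module):
    """BERT encoder shared by (a) a multiple-choice QA scorer and
    (b) an 8-way classifier over {correct, 7 error-reason types} per option (sketch)."""
    def __init__(self, model_name="bert-base-chinese", num_er_labels=8):
        super().__init__()
        self.encoder = BertModel.from_pretrained(model_name)
        hidden = self.encoder.config.hidden_size
        self.qa_scorer = nn.Linear(hidden, 1)              # one score per option
        self.er_classifier = nn.Linear(hidden, num_er_labels)

    def forward(self, input_ids, attention_mask):
        # input_ids: (num_options, seq_len); each row encodes "[CLS] SFs [SEP] question + option [SEP]"
        pooled = self.encoder(input_ids=input_ids, attention_mask=attention_mask).pooler_output
        option_scores = self.qa_scorer(pooled).squeeze(-1)   # (num_options,)
        er_logits = self.er_classifier(pooled)               # (num_options, num_er_labels)
        return option_scores, er_logits

# Joint objective (sketch): answer cross-entropy plus per-option ER cross-entropy.
#   qa_loss = F.cross_entropy(option_scores.unsqueeze(0), answer_idx)   # answer_idx: shape (1,)
#   er_loss = F.cross_entropy(er_logits, er_labels)                     # er_labels: shape (num_options,)
#   loss = qa_loss + er_loss
```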

5.4 Performance of different reasoning skills required by questions using BERT

We also investigate BERT's performance on the questions requiring different reasoning skills. We break down the performance by each type of reasoning skill on the test set of GCRC. Table 7 shows the results. Note that the reasoning skills are annotated for the options of questions. We observe that BERT obtains one of its lowest accuracies on the options requiring deductive reasoning. Overall, the system generally performs poorly on each type of option, indicating that the reasoning power of the system is not strong enough and needs to be significantly improved.
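A breakdown of this kind can be computed by grouping option-level predictions by their annotated skill; the record format below is hypothetical.

```python
from collections import defaultdict

def accuracy_by_skill(option_records):
    """option_records: iterable of dicts like {'skill': 'Deductive', 'correct': True}
    (hypothetical format). Returns accuracy (%) per annotated reasoning skill."""
    stats = defaultdict(lambda: [0, 0])              # skill -> [num_correct, num_total]
    for rec in option_records:
        stats[rec["skill"]][0] += int(rec["correct"])
        stats[rec["skill"]][1] += 1
    return {skill: 100.0 * c / t for skill, (c, t) in stats.items()}
```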

Reasoning      Detail understanding   Temporal/spatial   Coreference   Deductive
# of options   1683                   42                 179           143
Accuracy       31.97                  61.90              39.66         35.66

Reasoning      Mathematical   Cause-effect   Inductive   Appreciative
# of options   40             175            1056        130
Accuracy       60.00          40.00          33.14       47.69

Table 7: Results of BERT on GCRC by reasoning skills required for QA.

From the above, it can be seen that no single baseline is implemented to output answers, sentence-level SFs and ERs of distractors together, but this does not prevent our dataset from diagnosing the limitations of existing RC models. For each task, we modify the existing BERT model, output the corresponding explanations and report the performance. Thus, the limitations of the model can be clearly identified. In the future, we will build such a baseline that can do all the tasks together, and we will design a new joint metric for evaluating the whole question-answering process.

6 Conclusions

In this paper, we present a new and challenging machine reading comprehension dataset (GCRC) collected from Gaokao Chinese, consisting of 8,719 high-level, comprehensive multiple-choice questions. To the best of our knowledge, this is currently the most comprehensive, challenging and high-quality dataset in the MRC domain. In addition, we spend considerable effort to label three types of information, including sentence-level SFs, ERs of a distractor, and the reasoning skills required for QA, aiming to comprehensively evaluate systems in an explainable way. Through experiments, we observe that GCRC is a very challenging dataset for existing models, and we hope it can inspire innovative machine learning and reasoning approaches to tackle this challenging problem and make MRC an enabling technology for many real-world applications.

Acknowledgments

We thank the anonymous reviewers for their helpful comments and suggestions. This work was supported by the National Key Research and Development Program of China (No. 2018YFB1005103).

References

Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. 2018. Think you have solved question answering? Try ARC, the AI2 reasoning challenge. CoRR, abs/1803.05457.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.

Shangmin Guo, Xiangrong Zeng, Shizhu He, Kang Liu, and Jun Zhao. 2017. Which is the effective way for Gaokao: Information retrieval or neural networks? In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, pages 111–120, Valencia, Spain. Association for Computational Linguistics.

Xanh Ho, Anh-Khoa Duong Nguyen, Saku Sugawara, and Akiko Aizawa. 2020. Constructing a multi-hop QA dataset for comprehensive evaluation of reasoning steps. In Proceedings of the 28th International Conference on Computational Linguistics, pages 6609–6625, Barcelona, Spain (Online). International Committee on Computational Linguistics.

Zixian Huang, Yulin Shen, Xiao Li, Yu'ang Wei, Gong Cheng, Lin Zhou, Xinyu Dai, and Yuzhong Qu. 2019. GeoSQA: A benchmark for scenario-based question answering in the geography domain at high school level. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 5866–5871, Hong Kong, China. Association for Computational Linguistics.

Naoya Inoue, Pontus Stenetorp, and Kentaro Inui. 2020. R4C: A benchmark for evaluating RC systems to get the right answer for the right reason. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 6740–6750, Online. Association for Computational Linguistics.

Daniel Khashabi, Snigdha Chaturvedi, Michael Roth, Shyam Upadhyay, and Dan Roth. 2018. Looking beyond the surface: A challenge set for reading comprehension over multiple sentences. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 252–262, New Orleans, Louisiana. Association for Computational Linguistics.

Guokun Lai, Qizhe Xie, Hanxiao Liu, Yiming Yang, and Eduard Hovy. 2017. RACE: Large-scale ReAding comprehension dataset from examinations. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 785–794, Copenhagen, Denmark. Association for Computational Linguistics.

Matthew Richardson, Christopher J.C. Burges, and Erin Renshaw. 2013. MCTest: A challenge dataset for the open-domain machine comprehension of text. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 193–203, Seattle, Washington, USA. Association for Computational Linguistics.

Saku Sugawara, Kentaro Inui, Satoshi Sekine, and Akiko Aizawa. 2018. What makes reading comprehension questions easier? In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 4208–4219, Brussels, Belgium. Association for Computational Linguistics.

Saku Sugawara, Yusuke Kido, Hikaru Yokono, and Akiko Aizawa. 2017. Evaluation metrics for machine reading comprehension: Prerequisite skills and readability. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 806–817, Vancouver, Canada. Association for Computational Linguistics.

Kai Sun, Dian Yu, Jianshu Chen, Dong Yu, Yejin Choi, and Claire Cardie. 2019. DREAM: A challenge data set and models for dialogue-based reading comprehension. Transactions of the Association for Computational Linguistics, 7:217–231.

Kai Sun, Dian Yu, Dong Yu, and Claire Cardie. 2020. Investigating prior knowledge for challenging Chinese machine reading comprehension. Transactions of the Association for Computational Linguistics, 8:141–155.

Shuohang Wang, Mo Yu, Jing Jiang, and Shiyu Chang. 2018. A co-matching model for multi-choice reading comprehension. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 746–751, Melbourne, Australia. Association for Computational Linguistics.

Sarah Wiegreffe and Ana Marasovic. 2021. Teach me to explain: A review of datasets for explainable NLP. CoRR, abs/2102.12060.

Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Russ R. Salakhutdinov, and Quoc V. Le. 2019. XLNet: Generalized autoregressive pretraining for language understanding. In Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc.

Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William Cohen, Ruslan Salakhutdinov, and Christopher D. Manning. 2018. HotpotQA: A dataset for diverse, explainable multi-hop question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2369–2380, Brussels, Belgium. Association for Computational Linguistics.

Weihao Yu, Zihang Jiang, Yanfei Dong, and Jiashi Feng. 2020. ReClor: A reading comprehension dataset requiring logical reasoning. In International Conference on Learning Representations.

Xiao Zhang, Ji Wu, Zhiyang He, Xien Liu, and Ying Su. 2018. Medical exam question answering with large-scale reading comprehension. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence (AAAI-18), the 30th innovative Applications of Artificial Intelligence (IAAI-18), and the 8th AAAI Symposium on Educational Advances in Artificial Intelligence (EAAI-18), New Orleans, Louisiana, USA, February 2–7, 2018, pages 5706–5713. AAAI Press.

Ben Zhou, Daniel Khashabi, Qiang Ning, and Dan Roth. 2019. "Going on a vacation" takes longer than "going for a walk": A study of temporal commonsense understanding. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3363–3369, Hong Kong, China. Association for Computational Linguistics.


Appendix

A Hyper-parameters of Neural Baselines

The hyper-parameters of Co-Matching, BERT and XLNet are shown in Table 8, Table 9 and Table 10, respectively.

                   GCRC   C3     RACE   DREAM
train batch size   8      16     16     32
dev batch size     8      16     8      32
test batch size    8      16     8      32
epoch              50     100    50     100
learning rate      3e-5   1e-3   1e-3   2e-3
seed               128    64     256    128
dropoutP           0.2    0.2    0.2    0.2
emb dim            300    300    300    300
mem dim            150    150    150    150

Table 8: Hyper-parameters of Co-Matching.

                              GCRC   C3     RACE   DREAM
train batch size              32     16     16     32
dev batch size                4      16     4      16
test batch size               4      16     4      16
len                           320    384    450    384
epoch                         6      3      10     3
learning rate                 3e-5   1e-5   2e-5   1e-5
gradient accumulation steps   8      8      8      8
seed                          42     42     42     42

Table 9: Hyper-parameters of BERT.

                              GCRC   C3     RACE   DREAM
train batch size              2      2      1      2
dev batch size                2      2      1      2
test batch size               2      2      1      2
len                           320    320    320    180
epoch                         16     16     5      8
learning rate                 2e-4   1e-3   2e-5   1e-5
gradient accumulation steps   2      2      24     2

Table 10: Hyper-parameters of XLNet.


bull QA accuracy It is a common evaluation met-ric also provided by the other QA datasets

bull The performance of locating the SFs in Dwhich is a basic evaluation metric to assesswhether MRC system can collect all neces-sary sentences from articles before reason-ing

bull The performance of identifying ERs of a dis-tractor which evaluates whether a systemis able to exclude incorrect options by deepreasoning for multi-choice questions

bull The performance of different reasoning skillsrequired by questions This evaluation showsthe limitations of reasoning skills for a MRCsystem

Next we clearly define the ERs of a distractor andthe reasoning skills required for choosing a correctoption in Section 31 and Section 32 respectivelyThe ERs of a distractor are concrete and focus onthe forms of the errors while reasoning skills aremore abstract referring to the abilities of makingan inference

31 Error reasons of a distractor

As mentioned in Section 1 knowing exactly whya distractor is wrong will help MRC systems or anexaminee select correct answers Therefore we in-troduce error reasons of a distractor as an importantinternal reasoning step and present seven typicalERs of a distractor after investigating the instancesin GCRC

Wrong details The distractor is lexically simi-lar to the original text but has a different meaningcaused by alterations of some words such as modi-fiers or qualifier Words

Wrong temporal properties The distractordescribes an events with wrong temporal propertiesGenerally we consider five temporal propertiesdefined in (Zhou et al 2019) duration temporalordering typical time frequency and stationary

Wrong subject-predicate-object triple rela-tionship The distractor has a triple relationshipof subject-predicate-object (sub-pred-obj) but therelationship conflicts with the ground truth triplein the given article caused by substituting one ofthe components in the triple For example in Fig-ure 1 Option A ldquoThe global ecosystem which isdeeply influenced by human beings helps alleviatethe biodiversity crisisrdquo has a sub-pred-obj triple

ldquothe global ecosystemmdashalleviatemdashthe biodiversitycrisis However from the given article the correcttriple is ldquothe global ecosystemmdashresult in mdashthe bio-diversity crisisrdquo leading to the wrong distractor

Wrong necessary and sufficient conditionsThe distractor intentionally misinterprets the nec-essary and sufficient conditions expressed in theoriginal article A necessary condition is a con-dition that must be present for an event to occurwhile a sufficient condition is a condition or setof conditions that will produce the event For ex-ample the statement ldquoPlants will grow as long as

1323

with airrdquo is wrong as air is the necessary conditionfor plant growing but besides air plants also needwater and sunshine to grow

Wrong causality The distractor re-versesconfuses the causes and effects mentionedin an original article or add non-existent causalityGenerally the cause-effect relationship is explicitlyexpressed with causal connectives in distractors

Irrelevant to the question The distractorrsquoscontents are indeed mentioned in the given articlebut are irrelevant to the current question

Irrelevant to the article The distractorrsquos con-tents are not mentioned in the original article Forexample in Figure 1 ldquoproposes countermeasuresrdquoexpressed in Option D ldquoThis article reflects theconcerns about biodiversity crisis and proposescountermeasuresrdquo is not mentioned in the givenarticle Thus Option D should be excluded

32 Required reasoning skills

In Gaokao Chinese RC tasks measure an exami-neersquos text understanding and logical reasoning abil-ities from different perspectives We investigate theskills required for answering questions in GCRCand introduce eight typical skills which are orga-nized into the following three levels according tothe amount of information needed and the com-plexity of reasoning for QA Some of the reasoningskills (marked with ) are similar to those proposedby Sugawara et al (2017)

321 Level 1 Basic information capturing

This level covers the ability to capture relevantdetailed information distributed across the articleand combine them to match with options

Detail understanding It focuses on distin-guishing the semantic differences between thegiven article and an option The option most of thetime preserves the most lexical surface form of theoriginal article but has some minor differences indetails by using different modifiers or qualifyingwords

322 Level 2 Local information integration

This level covers how to identify different types ofrelationships linked between sentences in an articleIn Gaokao Chinese the relationshiprsquos expressionsare often implicit in a given article

Temporalspatial reasoning It aims to un-derstand various temporal or spatial properties ofevents entities and states

Coreference resolution It aims to under-stand the coreference and anaphoric chains by rec-ognizing the expressions referring to the same en-tity in the given article

Deductive reasoning It focus on taking a gen-eral rule or key idea described in an article andapplying it to make inferences about a specific ex-ample or phenomenon expressed in the option Forexample2 the statement of ldquowe rely on mobile nav-igation to travel and lose the ability to identifyroutesrdquo is a specific phenomenon of the statementof ldquowhile we are training artificial intelligence sys-tems we may also be trained by artificial intelli-gence systemsrdquo

Mathematical reasoning It performs somemathematical operations such as numerical sortingand comparison to obtain a correct option

Cause-effect comprehension It aims to un-derstand the causal relationships explicitly or im-plicitly expressed in a given article

323 Level 3 Global information integration

This level involves information integration of mul-tiple sentences or the whole articles to comprehendthe main ideas article organization structures andthe authorsrsquo emotion and sentiment

Inductive reasoning It integrates informationfrom separate words and sentences and makes in-ferences about an option which is often a summa-rization of several sentences a paragraph or thewhole article

Appreciative analysis It aims to understandthe article organization method the authorsrsquo writ-ing style and method attitude opinion and emo-tional state For example in Figure 1 Option BldquoThe first paragraph highlights the severity of thebiodiversity crisis by giving some statisticsrdquo is theanalysis result of authorsrsquo writing style and method

4 Construction and Annotation ofGCRC

We have spent tremendous effort to construct theimportant GCRC dataset Firstly we search anddownload about 10000 latest (year 2015 to 2020)

2This example is translated from real Gaokao Chinese in2018

1324

multiple-choice questions of real and mock GaokaoChinese from five websites Then for preprocess-ing we remove those duplicated questions andquestions with figures and tables and keep thosequestions with four options (only one of them is cor-rect) Next we identify and rectify some mistakessuch as typos in articles or corresponding ques-tionsoptions Finally the total number of ques-tions in GCRC is 8719 Moreover we adjust thelabels of the correct answers making them evenlydistributed over the four different options ie AB C D

Next we will annotate the questions accordingto the following three steps We would emphasizethat the annotation process is extremely challeng-ing as it requires human annotators to fully under-stand both syntactic structure and semantic infor-mation of the articles and corresponding questionsand options as well as identify the error reasons ofdistractors and reasoning skills required

bull Step 1 Annotation preparation Wefirst prepare an annotation guideline includ-ing task definition and annotated examplesThen we invite 12 graduate students of ourteam to participate in the annotation workTo maintain high quality and consistent an-notations our annotators first annotate thesequestions individually and subsequently dis-cuss and reach agreements if there is any dis-crepancy between two annotators The pro-cess further improves the annotation guide-line and better trains our annotators

bull Step 2 Initial annotations Firstly eachquestion is annotated by an annotator inde-pendently where we show an article ques-tion all candidate options and the label of theanswer Annotators have completed the fol-lowing three tasks (1) select the sentences inthe article which are needed for reasoningas the sentence-level SFs (2) provide ERsof a distractor from the types discussed inSection 31 (3) provide the reasoning skillsrequired for each option based on the skillsdescribed in Section 32

bull Step 3 Annotation consistency Whentwo annotators disagree with their own an-notations we invite the third annotator todiscuss with them and reach the final anno-tations In the rare cases where they cannotagree with each other we will keep the an-notations with at least two supports

The annotatorsrsquo consistency is evaluated by theInter Annotator Agreement IAA) value and theIAA value is 838

As mentioned before manual annotations ofGCRC is expensive and complicated because ques-tions in Gaokao Chinese are designed for adult na-tive speakers and thus are challenging It requiresdeep language comprehension to solve these ques-tions As a result making explainable annotationsin GCRC across multiple levels implies GCRC isa very precious dataset with valuable annotationsAlthough the size of the dataset is not too big it isbig enough to be used for providing diagnosis ofexisting MRC systems It also provides an idealtestbed for researchers to propose novel transferlearning or few-shot learning methods to solve thetasks

Statistics We partition our data into train-ing (80) development (dev 10) and test set(10) mainly according to the number of ques-tions We have annotated three types of informationfor a subset of GCRC Specifically we annotatethe sentence-level SFs for 8084 options (of 2021questions) sampled from the training set and 6900options (of 1725 questions) in the dev and test sets

In addition we annotate ERs of 6159 distrac-tors in the training dev and test sets and reasoningskills for 6900 options in the dev and test sets Webelieve our dataset with relatively big annotationsizes can ensure us to identify the limitations ofexisting RC systems in an explainable way Table1 shows the the details of GCRC data splitting andcorresponding annotation size

Table 2 shows the detailed comparisons forGCRC and other three RC datasets collected fromstandard exams including C3 (Sun et al 2020)RACE (Lai et al 2017) DREAM (Sun et al2019) As shown in Table 2 we observe that GCRCis the longest in terms of the average length of arti-cles questions and options

Table 3 shows the distribution of types of rea-soning skills based on the dev and test sets Weobserve that 4877 questions need detail under-standing and 3310 questions require inductivereasoning In addition Figure 2 presents the ERstypes distribution of distrators based on the dev andtest sets in which 332 of distractors are withwrong details and 264 of distractors include in-formation irrelevant to the corresponding articles

1325

Splitting Train Dev Test Total

of articles 3790 683 620 5093

of questions 6994 863 862 8719

of questionsoptions with SFs 20218084 8633452 8623448 374614984

of questionsdistractors with ERs 20003261 8631428 8621470 37256159

of questionsoptions with rea ski - 8633452 8623448 17256900

Table 1 Statistics of data splitting and annotationsize

GCRC C3 GCRC C3 RACE DREAM(in ChineseCharacters)

(in tokens)

Article len(avg)

11192 1169 3296 538 3524 661

Question len(avg)

209 122 142 78 113 74

Option len(avg)

438 55 271 32 67 42

Table 2 Statistics and comparison among fourMRC datasets collected from standard exams

5 Experiments

With our newly constructed GCRC dataset it isinteresting to evaluate the performance of existingmodels and better understand GCRCrsquos characteris-tics comparing with other existing data sets

51 QA accuracy and GCRC difficultylevel

We evaluate the QA performance of several popularMRC systems on GCRC which will reflect the dif-ficulty level of GCRC Specifically for comprehen-sive evaluation we employ four models includingone rule-based model and three recent state-of-the-art models based on neural networks

bull Sliding window (Richardson et al 2013)It is a rule-based baseline and chooses theanswer with the highest matching score Inparticular it has TFIDF style representationand calculates the lexical similarity betweena sentence (via concatenating a question andone of its candidate options) and each spanin the given article with a fixed window size

bull Co-Matching (Wang et al 2018) It is aBi-LSTM-based model and consists of a co-matching component and a hierarchical ag-gregation component The model not onlymatches the article with a question and eachcandidate option at the word-level but alsocaptures sentence structures of the article Ithas achieved promising results on RACE

Figure 2 Distirbution() of types of ERs of dis-tractors based on the dev and test sets including3725 questions and 6159 distractors

Reason skills Dev Test allLevel 1 Basic information capturing Detail 4711 5044 4877

Level 2 Local information integration

TemporalSpatial 084 073 079Co-reference 542 488 515Deductive 353 379 366Mathematical 045 067 056Cause-effect 567 476 521

Level 3 Global information integration Inductive 3475 3145 3310Appreciative 223 330 276

Leve 1 Basic information capturing 4711 5044 4877Level 2 Local information integration 1591 1483 1536Level 3 Global information integration 3698 3475 3587

Table 3 Distribution () of types of required skillsbased on the annotated examples totally including1725 questions and 6900 options

Note that we have used 300-dimensionalword embeddings based on GloVe (GlobalVectors for Word Representation)3

bull BERT (Devlin et al 2019) It is a pre-trained language model which adopts multi-ple bidirectional transformer layers and self-attention mechanisms to learn contextual re-lations between words in a text BERT hasachieved very good performance on numer-ous NLP tasks including the RC task definedby SQuAD We employ Chinese BERT-baseand English BERT-base released on the web-site4

bull XLNet (Yang et al 2019) It is a generalizedautoregressive pretraining model which usesa permutation language modeling objectiveto combine the advantages of autoregressiveand autoencoding methods XLNet outper-forms BERT on 20 tasks For experiments onChinese datasets we use XLNet-large andits Chinese version released on the website5

3The English word embed-dinghttpsnlpstanfordeduprojectsglove and the Chineseword embedding httpsnlpstanfordeduprojectsglove

4httpsgithubcomgoogle-researchbert5httpsgithubcombrightmartxlnetzh

1326

In addition to evaluate the performance of differ-ent models on GCRC we also want to see howthese models perform on other datasets namelyC3 RACE and DREAM In order to fairly and ob-jectively compare with them we randomly samplefrom these datasets and create three new datasetswith the same sizes of the splits (train dev test) asGCRC (shown in Table 1) The hyper-parametersof these neural baselines can be seen in Appendix

Human Performance We obtain the humanperformance on 200 questions which are randomlysampled from the GCRC test set We invite 20 highschool students to answer these questions wherethey are provided with the questions options andarticles The average accuracy of human is 8318

We first use the training sets from four datasetsto train four models which are subsequently usedto evaluate their accuracies on respective test dataDev sets are used to tune the values of the param-eters of different models Table 4 shows the com-parison results We observe that the neural networkbased models outperform the rule based slidingwindow model In addition pretrained models per-form better than the non-pretrained models in mostcases Moreover it is interesting to observe thatall four existing models generally perform worseon GCRC than other three datasets and the humanperformance on GCRC is also the lowest amongthe four datasets This clearly indicates that GCRCis more challenging Meanwhile it can be seenthat the performance gap between human and thebest system on GCRC is 4645 suggesting thatthere is sufficient room for improvement whichcontradicts with some overly optimistic claims thatmachines have exceeded humans on QA As thereis a significant performance gap between machinesand humans GCRC dataset facilitates researchersto develop novel learning approaches to better un-derstand its questions and bridge the huge gap

GCRC C3 RACE DREAMSliding window 2770 3670 3085 4008Co-Matching 3573 4501 3538 4891BERT 3074 6496 4613 5336XLNet 3515 6090 5000 5963Human 8318 96 955 945

Table 4 Accuracy comparison between four com-putational models and human on four benchmarkdatasets ( indicates the performance is based onthe annotated test set and copied from the corre-sponding paper)

In the next few subsections we will studywhether a representative model ie BERT canperform well in the explainable evaluation relatedsubtasks namely locating sentence-level SFs iden-tifying ERs and the performance of different rea-soning skills required by questions

52 Performance of locating sentence-level SFs using BERT

We investigate whether BERT benefits fromsentence-level SFs We conduct an experimentin which we input the ground-truth SFs insteadof the given article to BERT for question answer-ing We observe that the accuracy of QA on thetest set of GCRC is 3747 Comparing with theresult shown in Table 4 the accuracy is increasedby 673 indicating that locating SFs is helpfulfor QA On the other hand we can see that the im-provement is still far from closing the gap to humanperformance because the questions are difficult andneed further reasoning to solve

Due to the usefulness of SFs we train a BERTmodel to identify whether a sentence belongs toSFs We regard this task as a sentence pair (iean original sentence in the given article and anoption) classification task We use the GCRCsubset (2021863862 questions as shown in Ta-ble 1) annotated with SFs for training and test-ing where the training dev and test sets include2007988502483656 sentence-option pairs re-spectively The experimental results are shown inTable 5 We can see that performance of locatingthe SFs is still low (in terms of precision P recall Rand F1 measure) indicating that accurately obtain-ing the SFs in GCRC is challenging and directlyapplying BERT will not work well

P R F1Dev 7807 5018 6110Test 7020 5146 5939

Table 5 Results of BERT locating SFs

53 Performance of identifying ERs of adistractor using BERT

In order to evaluate whether BERT model under-stands why a distractor is excluded we modifyBERT model and add a new component to performthe identification of distractorsrsquo ERs The com-ponent is a multi-label (ie 8 labels including 1

1327

ERs of options R PWrong details 484 165 909Wrong temporal properties 29 1379 055Wrong subject-predicate-object triple 190 211 588Wrong causality 218 2477 762Wrong necessary and sufficient conditions 106 189 222Irrelevant to the question 85 1765 221Irrelevant to the article 352 483 1735Correct options 1984 2818 5669

Table 6 Results of BERT identifying ERs of dis-tractors

correct type and 7 types of ERs) classifier to pre-dict the probability distribution of the types of ERsand is jointly optimized with normal QA and itshares the low-level representations The classi-fierrsquos objective is to minimize a cross entropy losswhich is jointly optimized with normal QA andthey share the low-level representations For thistask the contextual input of BERT is ground-truthSFs instead of the given article The results areshown in Table 6 We observe that the performanceof identification of ERs is quite low indicating thesignificance and value of our dataset which is toidentify the limitations of existing systems on ex-plainability and facilitates researchers to developnovel learning models

54 Performance of different reasoningskills required by questions usingBERT

We also investigate BERTrsquos performance on thequestions requiring different reasoning skills Wecategorize the performance for each type of reason-ing skills on the test set of GCRC Table 7 showsthe results Note that the reasoning skills are anno-tated for options of questions We observe BERTobtains the lowest score on the options requiringdeductive reasoning Overall the system generallyperforms worse on each type of options indicatingthe reasoning power of the system is not strongenough and needs to be significantly improved

Reasoning Detail understanding Temporalspatial Coreference Deductiveof options 1683 42 179 143Accuracy 3197 6190 3966 3566

Mathematical Cause-effect Inductive Appreciativeof options 40 175 1056 130Accuracy 6000 4000 3314 4769

Table 7 Results of BERT on GCRC by reasoningskills required for QA

From the above it can be seen that no baselineis realized to output answers sentence-level SFsand ERs of a distractors together but it doesnrsquot

affect our dataset to diagnose the limitations of ex-isting RC models For each task we modify theexisting model of BERT output the correspondingexplanations and report their performance Thusthe limitations of the model can be clearly identi-fied In the future we will realize such a baselinethat can do all the tasks together and we will de-sign a new joint metric for evaluating the wholequestion-answering process

6 Conclusions

In this paper we present a new challenging ma-chine reading comprehension dataset (GCRC) col-lected from Gaokao Chinese consisting of 8719high-level comprehensive multiple-choice ques-tions To the best of our knowledge this is cur-rently the most comprehensive challenging andhigh-quality dataset in MRC domain In additionwe spend considerable effort to label three types ofinformation including sentence-level SFs ERs ofa distractor and reasoning skills required for QAaiming to comprehensively evaluate systems in anexplainable way Through experiments we observeGCRC is very challenging data set for existing mod-els and we hope it can inspire innovative machinelearning and reasoning approach to tackle the chal-lenging problem and make MRC as an enablingtechnology for many real-world applications

Acknowledgments

We thank the anonymous reviewers for their help-ful comments and suggestions This work wassupported by the National Key Research and Devel-opment Program of China (No2018YFB1005103)

References

Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. 2018. Think you have solved question answering? Try ARC, the AI2 reasoning challenge. CoRR, abs/1803.05457.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.

Shangmin Guo, Xiangrong Zeng, Shizhu He, Kang Liu, and Jun Zhao. 2017. Which is the effective way for Gaokao: Information retrieval or neural networks? In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, pages 111–120, Valencia, Spain. Association for Computational Linguistics.

Xanh Ho, Anh-Khoa Duong Nguyen, Saku Sugawara, and Akiko Aizawa. 2020. Constructing a multi-hop QA dataset for comprehensive evaluation of reasoning steps. In Proceedings of the 28th International Conference on Computational Linguistics, pages 6609–6625, Barcelona, Spain (Online). International Committee on Computational Linguistics.

Zixian Huang, Yulin Shen, Xiao Li, Yu'ang Wei, Gong Cheng, Lin Zhou, Xinyu Dai, and Yuzhong Qu. 2019. GeoSQA: A benchmark for scenario-based question answering in the geography domain at high school level. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 5866–5871, Hong Kong, China. Association for Computational Linguistics.

Naoya Inoue, Pontus Stenetorp, and Kentaro Inui. 2020. R4C: A benchmark for evaluating RC systems to get the right answer for the right reason. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 6740–6750, Online. Association for Computational Linguistics.

Daniel Khashabi, Snigdha Chaturvedi, Michael Roth, Shyam Upadhyay, and Dan Roth. 2018. Looking beyond the surface: A challenge set for reading comprehension over multiple sentences. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 252–262, New Orleans, Louisiana. Association for Computational Linguistics.

Guokun Lai, Qizhe Xie, Hanxiao Liu, Yiming Yang, and Eduard Hovy. 2017. RACE: Large-scale ReAding comprehension dataset from examinations. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 785–794, Copenhagen, Denmark. Association for Computational Linguistics.

Matthew Richardson, Christopher J.C. Burges, and Erin Renshaw. 2013. MCTest: A challenge dataset for the open-domain machine comprehension of text. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 193–203, Seattle, Washington, USA. Association for Computational Linguistics.

Saku Sugawara, Kentaro Inui, Satoshi Sekine, and Akiko Aizawa. 2018. What makes reading comprehension questions easier? In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 4208–4219, Brussels, Belgium. Association for Computational Linguistics.

Saku Sugawara, Yusuke Kido, Hikaru Yokono, and Akiko Aizawa. 2017. Evaluation metrics for machine reading comprehension: Prerequisite skills and readability. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 806–817, Vancouver, Canada. Association for Computational Linguistics.

Kai Sun, Dian Yu, Jianshu Chen, Dong Yu, Yejin Choi, and Claire Cardie. 2019. DREAM: A challenge data set and models for dialogue-based reading comprehension. Transactions of the Association for Computational Linguistics, 7:217–231.

Kai Sun, Dian Yu, Dong Yu, and Claire Cardie. 2020. Investigating prior knowledge for challenging Chinese machine reading comprehension. Transactions of the Association for Computational Linguistics, 8:141–155.

Shuohang Wang, Mo Yu, Jing Jiang, and Shiyu Chang. 2018. A co-matching model for multi-choice reading comprehension. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 746–751, Melbourne, Australia. Association for Computational Linguistics.

Sarah Wiegreffe and Ana Marasovic. 2021. Teach me to explain: A review of datasets for explainable NLP. CoRR, abs/2102.12060.

Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Russ R. Salakhutdinov, and Quoc V. Le. 2019. XLNet: Generalized autoregressive pretraining for language understanding. In Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc.

Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William Cohen, Ruslan Salakhutdinov, and Christopher D. Manning. 2018. HotpotQA: A dataset for diverse, explainable multi-hop question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2369–2380, Brussels, Belgium. Association for Computational Linguistics.

Weihao Yu, Zihang Jiang, Yanfei Dong, and Jiashi Feng. 2020. ReClor: A reading comprehension dataset requiring logical reasoning. In International Conference on Learning Representations.

Xiao Zhang, Ji Wu, Zhiyang He, Xien Liu, and Ying Su. 2018. Medical exam question answering with large-scale reading comprehension. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence (AAAI-18), the 30th innovative Applications of Artificial Intelligence (IAAI-18), and the 8th AAAI Symposium on Educational Advances in Artificial Intelligence (EAAI-18), New Orleans, Louisiana, USA, February 2-7, 2018, pages 5706–5713. AAAI Press.

Ben Zhou, Daniel Khashabi, Qiang Ning, and Dan Roth. 2019. "Going on a vacation" takes longer than "going for a walk": A study of temporal commonsense understanding. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3363–3369, Hong Kong, China. Association for Computational Linguistics.

Appendix

A Hyper-parameters of Neural Baselines

The hyper-parameters of Co-Matching, BERT and XLNet are shown in Table 8, Table 9 and Table 10.

                  GCRC    C3      RACE    DREAM
train batch size  8       16      16      32
dev batch size    8       16      8       32
test batch size   8       16      8       32
epoch             50      100     50      100
learning rate     3e-5    1e-3    1e-3    2e-3
seed              128     64      256     128
dropoutP          0.2     0.2     0.2     0.2
emb dim           300     300     300     300
mem dim           150     150     150     150

Table 8: Hyper-parameters of Co-Matching.

                             GCRC    C3      RACE    DREAM
train batch size             32      16      16      32
dev batch size               4       16      4       16
test batch size              4       16      4       16
len                          320     384     450     384
epoch                        6       3       10      3
learning rate                3e-5    1e-5    2e-5    1e-5
gradient accumulation steps  8       8       8       8
seed                         42      42      42      42

Table 9: Hyper-parameters of BERT.

                             GCRC    C3      RACE    DREAM
train batch size             2       2       1       2
dev batch size               2       2       1       2
test batch size              2       2       1       2
len                          320     320     320     180
epoch                        16      16      5       8
learning rate                2e-4    1e-3    2e-5    1e-5
gradient accumulation steps  2       2       24      2

Table 10: Hyper-parameters of XLNet.
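As a purely illustrative sketch (assuming the HuggingFace transformers library; this is not the authors' training script, and the variable name and output path are hypothetical), the GCRC column of Table 9 corresponds to a fine-tuning configuration along these lines:

from transformers import TrainingArguments

bert_gcrc_args = TrainingArguments(
    output_dir="outputs/bert-gcrc",   # hypothetical output directory
    per_device_train_batch_size=32,   # "train batch size"
    per_device_eval_batch_size=4,     # "dev/test batch size"
    num_train_epochs=6,               # "epoch"
    learning_rate=3e-5,               # "learning rate"
    gradient_accumulation_steps=8,    # "gradient accumulation steps"
    seed=42,                          # "seed"
)
# The "len" row (320) is a maximum sequence length applied at tokenization
# time, e.g. tokenizer(..., truncation=True, max_length=320); it is not a
# TrainingArguments field.

With gradient accumulation, the effective training batch size on a single device is the per-device batch size multiplied by the accumulation steps, i.e. 32 x 8 = 256 examples per parameter update under these settings.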


From the above it can be seen that no baselineis realized to output answers sentence-level SFsand ERs of a distractors together but it doesnrsquot

affect our dataset to diagnose the limitations of ex-isting RC models For each task we modify theexisting model of BERT output the correspondingexplanations and report their performance Thusthe limitations of the model can be clearly identi-fied In the future we will realize such a baselinethat can do all the tasks together and we will de-sign a new joint metric for evaluating the wholequestion-answering process

6 Conclusions

In this paper we present a new challenging ma-chine reading comprehension dataset (GCRC) col-lected from Gaokao Chinese consisting of 8719high-level comprehensive multiple-choice ques-tions To the best of our knowledge this is cur-rently the most comprehensive challenging andhigh-quality dataset in MRC domain In additionwe spend considerable effort to label three types ofinformation including sentence-level SFs ERs ofa distractor and reasoning skills required for QAaiming to comprehensively evaluate systems in anexplainable way Through experiments we observeGCRC is very challenging data set for existing mod-els and we hope it can inspire innovative machinelearning and reasoning approach to tackle the chal-lenging problem and make MRC as an enablingtechnology for many real-world applications

Acknowledgments

We thank the anonymous reviewers for their help-ful comments and suggestions This work wassupported by the National Key Research and Devel-opment Program of China (No2018YFB1005103)

References

Peter Clark Isaac Cowhey Oren Etzioni TusharKhot Ashish Sabharwal Carissa Schoenick andOyvind Tafjord 2018 Think you have solvedquestion answering try arc the AI2 reasoningchallenge CoRR abs180305457

Jacob Devlin Ming-Wei Chang Kenton Lee andKristina Toutanova 2019 BERT Pre-trainingof deep bidirectional transformers for languageunderstanding In Proceedings of the 2019 Con-ference of the North American Chapter of the

1328

Association for Computational Linguistics Hu-man Language Technologies Volume 1 (Longand Short Papers) pages 4171ndash4186 Minneapo-lis Minnesota Association for ComputationalLinguistics

Shangmin Guo Xiangrong Zeng Shizhu He KangLiu and Jun Zhao 2017 Which is the effec-tive way for Gaokao Information retrieval orneural networks In Proceedings of the 15thConference of the European Chapter of the Asso-ciation for Computational Linguistics Volume1 Long Papers pages 111ndash120 Valencia SpainAssociation for Computational Linguistics

Xanh Ho Anh-Khoa Duong Nguyen Saku Sug-awara and Akiko Aizawa 2020 Constructing amulti-hop QA dataset for comprehensive evalu-ation of reasoning steps In Proceedings of the28th International Conference on ComputationalLinguistics pages 6609ndash6625 Barcelona Spain(Online) International Committee on Computa-tional Linguistics

Zixian Huang Yulin Shen Xiao Li YursquoangWei Gong Cheng Lin Zhou Xinyu Dai andYuzhong Qu 2019 GeoSQA A benchmark forscenario-based question answering in the geog-raphy domain at high school level In Proceed-ings of the 2019 Conference on Empirical Meth-ods in Natural Language Processing and the9th International Joint Conference on NaturalLanguage Processing (EMNLP-IJCNLP) pages5866ndash5871 Hong Kong China Association forComputational Linguistics

Naoya Inoue Pontus Stenetorp and Kentaro Inui2020 R4C A benchmark for evaluating RCsystems to get the right answer for the right rea-son In Proceedings of the 58th Annual Meetingof the Association for Computational Linguis-tics pages 6740ndash6750 Online Association forComputational Linguistics

Daniel Khashabi Snigdha Chaturvedi MichaelRoth Shyam Upadhyay and Dan Roth 2018Looking beyond the surface A challenge set forreading comprehension over multiple sentencesIn Proceedings of the 2018 Conference of theNorth American Chapter of the Association forComputational Linguistics Human LanguageTechnologies Volume 1 (Long Papers) pages252ndash262 New Orleans Louisiana Associationfor Computational Linguistics

Guokun Lai Qizhe Xie Hanxiao Liu YimingYang and Eduard Hovy 2017 RACE Large-scale ReAding comprehension dataset from ex-aminations In Proceedings of the 2017 Con-ference on Empirical Methods in Natural Lan-guage Processing pages 785ndash794 CopenhagenDenmark Association for Computational Lin-guistics

Matthew Richardson Christopher JC Burges andErin Renshaw 2013 MCTest A challengedataset for the open-domain machine compre-hension of text In Proceedings of the 2013Conference on Empirical Methods in NaturalLanguage Processing pages 193ndash203 SeattleWashington USA Association for Computa-tional Linguistics

Saku Sugawara Kentaro Inui Satoshi Sekine andAkiko Aizawa 2018 What makes reading com-prehension questions easier In Proceedingsof the 2018 Conference on Empirical Methodsin Natural Language Processing pages 4208ndash4219 Brussels Belgium Association for Com-putational Linguistics

Saku Sugawara Yusuke Kido Hikaru Yokono andAkiko Aizawa 2017 Evaluation metrics formachine reading comprehension Prerequisiteskills and readability In Proceedings of the 55thAnnual Meeting of the Association for Compu-tational Linguistics (Volume 1 Long Papers)pages 806ndash817 Vancouver Canada Associationfor Computational Linguistics

Kai Sun Dian Yu Jianshu Chen Dong Yu YejinChoi and Claire Cardie 2019 DREAM A chal-lenge data set and models for dialogue-basedreading comprehension Transactions of the As-sociation for Computational Linguistics 7217ndash231

Kai Sun Dian Yu Dong Yu and Claire Cardie2020 Investigating prior knowledge for chal-lenging Chinese machine reading comprehen-sion Transactions of the Association for Com-putational Linguistics 8141ndash155

Shuohang Wang Mo Yu Jing Jiang and ShiyuChang 2018 A co-matching model for multi-choice reading comprehension In Proceedingsof the 56th Annual Meeting of the Associationfor Computational Linguistics (Volume 2 Short

1329

Papers) pages 746ndash751 Melbourne AustraliaAssociation for Computational Linguistics

Sarah Wiegreffe and Ana Marasovic 2021 Teachme to explain A review of datasets for explain-able NLP CoRR abs210212060

Zhilin Yang Zihang Dai Yiming Yang JaimeCarbonell Russ R Salakhutdinov and Quoc VLe 2019 Xlnet Generalized autoregressivepretraining for language understanding In Ad-vances in Neural Information Processing Sys-tems volume 32 Curran Associates Inc

Zhilin Yang Peng Qi Saizheng Zhang YoshuaBengio William Cohen Ruslan Salakhutdinovand Christopher D Manning 2018 HotpotQAA dataset for diverse explainable multi-hopquestion answering In Proceedings of the 2018Conference on Empirical Methods in NaturalLanguage Processing pages 2369ndash2380 Brus-sels Belgium Association for ComputationalLinguistics

Weihao Yu Zihang Jiang Yanfei Dong and JiashiFeng 2020 Reclor A reading comprehensiondataset requiring logical reasoning In Interna-tional Conference on Learning Representations

Xiao Zhang Ji Wu Zhiyang He Xien Liu andYing Su 2018 Medical exam question answer-ing with large-scale reading comprehension InProceedings of the Thirty-Second AAAI Confer-ence on Artificial Intelligence (AAAI-18) the30th innovative Applications of Artificial Intelli-gence (IAAI-18) and the 8th AAAI Symposiumon Educational Advances in Artificial Intelli-gence (EAAI-18) New Orleans Louisiana USAFebruary 2-7 2018 pages 5706ndash5713 AAAIPress

Ben Zhou Daniel Khashabi Qiang Ning and DanRoth 2019 ldquogoing on a vacationrdquo takes longerthan ldquogoing for a walkrdquo A study of tempo-ral commonsense understanding In Proceed-ings of the 2019 Conference on Empirical Meth-ods in Natural Language Processing and the9th International Joint Conference on NaturalLanguage Processing (EMNLP-IJCNLP) pages3363ndash3369 Hong Kong China Association forComputational Linguistics

1330

Appendix

A Hyper-parameters of Neural BaselinesThe hyper-parameters of Co-Matching BERT

XLNet are shown in Table 8 Table 9 and Table 10

GCRC C3 RACE DREAMtrain batch size 8 16 16 32dev batch size 8 16 8 32Test batch size 8 16 8 32epoch 50 100 50 100learning rate 3e-5 1e-3 1e-3 2e-3seed 128 64 256 128dropoutP 02 02 02 02emb dim 300 300 300 300mem dim 150 150 150 150

Table 8 Hyper-parameters of Co-Matching

GCRC C3 RACE DREAMtrain batch size 32 16 16 32dev batch size 4 16 4 16Test batch size 4 16 4 16len 320 384 450 384epoch 6 3 10 3learning rate 3e-5 1e-5 2e-5 1e-5gradient accumulation steps 8 8 8 8seed 42 42 42 42

Table 9 Hyper-parameters of BERT

GCRC C3 RACE DREAMtrain batch size 2 2 1 2dev batch size 2 2 1 2Test batch size 2 2 1 2len 320 320 320 180epoch 16 16 5 8learning rate 2e-4 1e-3 2e-5 1e-5gradient accumulation steps 2 2 24 2

Table 10 Hyper-parameters of XLNet

1323

with airrdquo is wrong as air is the necessary conditionfor plant growing but besides air plants also needwater and sunshine to grow

Wrong causality The distractor re-versesconfuses the causes and effects mentionedin an original article or add non-existent causalityGenerally the cause-effect relationship is explicitlyexpressed with causal connectives in distractors

Irrelevant to the question The distractorrsquoscontents are indeed mentioned in the given articlebut are irrelevant to the current question

Irrelevant to the article The distractorrsquos con-tents are not mentioned in the original article Forexample in Figure 1 ldquoproposes countermeasuresrdquoexpressed in Option D ldquoThis article reflects theconcerns about biodiversity crisis and proposescountermeasuresrdquo is not mentioned in the givenarticle Thus Option D should be excluded

32 Required reasoning skills

In Gaokao Chinese RC tasks measure an exami-neersquos text understanding and logical reasoning abil-ities from different perspectives We investigate theskills required for answering questions in GCRCand introduce eight typical skills which are orga-nized into the following three levels according tothe amount of information needed and the com-plexity of reasoning for QA Some of the reasoningskills (marked with ) are similar to those proposedby Sugawara et al (2017)

321 Level 1 Basic information capturing

This level covers the ability to capture relevantdetailed information distributed across the articleand combine them to match with options

Detail understanding It focuses on distin-guishing the semantic differences between thegiven article and an option The option most of thetime preserves the most lexical surface form of theoriginal article but has some minor differences indetails by using different modifiers or qualifyingwords

322 Level 2 Local information integration

This level covers how to identify different types ofrelationships linked between sentences in an articleIn Gaokao Chinese the relationshiprsquos expressionsare often implicit in a given article

Temporalspatial reasoning It aims to un-derstand various temporal or spatial properties ofevents entities and states

Coreference resolution It aims to under-stand the coreference and anaphoric chains by rec-ognizing the expressions referring to the same en-tity in the given article

Deductive reasoning It focus on taking a gen-eral rule or key idea described in an article andapplying it to make inferences about a specific ex-ample or phenomenon expressed in the option Forexample2 the statement of ldquowe rely on mobile nav-igation to travel and lose the ability to identifyroutesrdquo is a specific phenomenon of the statementof ldquowhile we are training artificial intelligence sys-tems we may also be trained by artificial intelli-gence systemsrdquo

Mathematical reasoning It performs somemathematical operations such as numerical sortingand comparison to obtain a correct option

Cause-effect comprehension It aims to un-derstand the causal relationships explicitly or im-plicitly expressed in a given article

323 Level 3 Global information integration

This level involves information integration of mul-tiple sentences or the whole articles to comprehendthe main ideas article organization structures andthe authorsrsquo emotion and sentiment

Inductive reasoning It integrates informationfrom separate words and sentences and makes in-ferences about an option which is often a summa-rization of several sentences a paragraph or thewhole article

Appreciative analysis It aims to understandthe article organization method the authorsrsquo writ-ing style and method attitude opinion and emo-tional state For example in Figure 1 Option BldquoThe first paragraph highlights the severity of thebiodiversity crisis by giving some statisticsrdquo is theanalysis result of authorsrsquo writing style and method

4 Construction and Annotation ofGCRC

We have spent tremendous effort to construct theimportant GCRC dataset Firstly we search anddownload about 10000 latest (year 2015 to 2020)

2This example is translated from real Gaokao Chinese in2018

1324

multiple-choice questions of real and mock GaokaoChinese from five websites Then for preprocess-ing we remove those duplicated questions andquestions with figures and tables and keep thosequestions with four options (only one of them is cor-rect) Next we identify and rectify some mistakessuch as typos in articles or corresponding ques-tionsoptions Finally the total number of ques-tions in GCRC is 8719 Moreover we adjust thelabels of the correct answers making them evenlydistributed over the four different options ie AB C D

Next we will annotate the questions accordingto the following three steps We would emphasizethat the annotation process is extremely challeng-ing as it requires human annotators to fully under-stand both syntactic structure and semantic infor-mation of the articles and corresponding questionsand options as well as identify the error reasons ofdistractors and reasoning skills required

bull Step 1 Annotation preparation Wefirst prepare an annotation guideline includ-ing task definition and annotated examplesThen we invite 12 graduate students of ourteam to participate in the annotation workTo maintain high quality and consistent an-notations our annotators first annotate thesequestions individually and subsequently dis-cuss and reach agreements if there is any dis-crepancy between two annotators The pro-cess further improves the annotation guide-line and better trains our annotators

bull Step 2 Initial annotations Firstly eachquestion is annotated by an annotator inde-pendently where we show an article ques-tion all candidate options and the label of theanswer Annotators have completed the fol-lowing three tasks (1) select the sentences inthe article which are needed for reasoningas the sentence-level SFs (2) provide ERsof a distractor from the types discussed inSection 31 (3) provide the reasoning skillsrequired for each option based on the skillsdescribed in Section 32

bull Step 3 Annotation consistency Whentwo annotators disagree with their own an-notations we invite the third annotator todiscuss with them and reach the final anno-tations In the rare cases where they cannotagree with each other we will keep the an-notations with at least two supports

The annotatorsrsquo consistency is evaluated by theInter Annotator Agreement IAA) value and theIAA value is 838

As mentioned before manual annotations ofGCRC is expensive and complicated because ques-tions in Gaokao Chinese are designed for adult na-tive speakers and thus are challenging It requiresdeep language comprehension to solve these ques-tions As a result making explainable annotationsin GCRC across multiple levels implies GCRC isa very precious dataset with valuable annotationsAlthough the size of the dataset is not too big it isbig enough to be used for providing diagnosis ofexisting MRC systems It also provides an idealtestbed for researchers to propose novel transferlearning or few-shot learning methods to solve thetasks

Statistics We partition our data into train-ing (80) development (dev 10) and test set(10) mainly according to the number of ques-tions We have annotated three types of informationfor a subset of GCRC Specifically we annotatethe sentence-level SFs for 8084 options (of 2021questions) sampled from the training set and 6900options (of 1725 questions) in the dev and test sets

In addition we annotate ERs of 6159 distrac-tors in the training dev and test sets and reasoningskills for 6900 options in the dev and test sets Webelieve our dataset with relatively big annotationsizes can ensure us to identify the limitations ofexisting RC systems in an explainable way Table1 shows the the details of GCRC data splitting andcorresponding annotation size

Table 2 shows the detailed comparisons forGCRC and other three RC datasets collected fromstandard exams including C3 (Sun et al 2020)RACE (Lai et al 2017) DREAM (Sun et al2019) As shown in Table 2 we observe that GCRCis the longest in terms of the average length of arti-cles questions and options

Table 3 shows the distribution of types of rea-soning skills based on the dev and test sets Weobserve that 4877 questions need detail under-standing and 3310 questions require inductivereasoning In addition Figure 2 presents the ERstypes distribution of distrators based on the dev andtest sets in which 332 of distractors are withwrong details and 264 of distractors include in-formation irrelevant to the corresponding articles

1325

Splitting Train Dev Test Total

of articles 3790 683 620 5093

of questions 6994 863 862 8719

of questionsoptions with SFs 20218084 8633452 8623448 374614984

of questionsdistractors with ERs 20003261 8631428 8621470 37256159

of questionsoptions with rea ski - 8633452 8623448 17256900

Table 1 Statistics of data splitting and annotationsize

GCRC C3 GCRC C3 RACE DREAM(in ChineseCharacters)

(in tokens)

Article len(avg)

11192 1169 3296 538 3524 661

Question len(avg)

209 122 142 78 113 74

Option len(avg)

438 55 271 32 67 42

Table 2 Statistics and comparison among fourMRC datasets collected from standard exams

5 Experiments

With our newly constructed GCRC dataset it isinteresting to evaluate the performance of existingmodels and better understand GCRCrsquos characteris-tics comparing with other existing data sets

51 QA accuracy and GCRC difficultylevel

We evaluate the QA performance of several popularMRC systems on GCRC which will reflect the dif-ficulty level of GCRC Specifically for comprehen-sive evaluation we employ four models includingone rule-based model and three recent state-of-the-art models based on neural networks

bull Sliding window (Richardson et al 2013)It is a rule-based baseline and chooses theanswer with the highest matching score Inparticular it has TFIDF style representationand calculates the lexical similarity betweena sentence (via concatenating a question andone of its candidate options) and each spanin the given article with a fixed window size

bull Co-Matching (Wang et al 2018) It is aBi-LSTM-based model and consists of a co-matching component and a hierarchical ag-gregation component The model not onlymatches the article with a question and eachcandidate option at the word-level but alsocaptures sentence structures of the article Ithas achieved promising results on RACE

Figure 2 Distirbution() of types of ERs of dis-tractors based on the dev and test sets including3725 questions and 6159 distractors

Reason skills Dev Test allLevel 1 Basic information capturing Detail 4711 5044 4877

Level 2 Local information integration

TemporalSpatial 084 073 079Co-reference 542 488 515Deductive 353 379 366Mathematical 045 067 056Cause-effect 567 476 521

Level 3 Global information integration Inductive 3475 3145 3310Appreciative 223 330 276

Leve 1 Basic information capturing 4711 5044 4877Level 2 Local information integration 1591 1483 1536Level 3 Global information integration 3698 3475 3587

Table 3 Distribution () of types of required skillsbased on the annotated examples totally including1725 questions and 6900 options

Note that we have used 300-dimensionalword embeddings based on GloVe (GlobalVectors for Word Representation)3

bull BERT (Devlin et al 2019) It is a pre-trained language model which adopts multi-ple bidirectional transformer layers and self-attention mechanisms to learn contextual re-lations between words in a text BERT hasachieved very good performance on numer-ous NLP tasks including the RC task definedby SQuAD We employ Chinese BERT-baseand English BERT-base released on the web-site4

bull XLNet (Yang et al 2019) It is a generalizedautoregressive pretraining model which usesa permutation language modeling objectiveto combine the advantages of autoregressiveand autoencoding methods XLNet outper-forms BERT on 20 tasks For experiments onChinese datasets we use XLNet-large andits Chinese version released on the website5

3The English word embed-dinghttpsnlpstanfordeduprojectsglove and the Chineseword embedding httpsnlpstanfordeduprojectsglove

4httpsgithubcomgoogle-researchbert5httpsgithubcombrightmartxlnetzh

1326

In addition to evaluate the performance of differ-ent models on GCRC we also want to see howthese models perform on other datasets namelyC3 RACE and DREAM In order to fairly and ob-jectively compare with them we randomly samplefrom these datasets and create three new datasetswith the same sizes of the splits (train dev test) asGCRC (shown in Table 1) The hyper-parametersof these neural baselines can be seen in Appendix

Human Performance We obtain the humanperformance on 200 questions which are randomlysampled from the GCRC test set We invite 20 highschool students to answer these questions wherethey are provided with the questions options andarticles The average accuracy of human is 8318

We first use the training sets from four datasetsto train four models which are subsequently usedto evaluate their accuracies on respective test dataDev sets are used to tune the values of the param-eters of different models Table 4 shows the com-parison results We observe that the neural networkbased models outperform the rule based slidingwindow model In addition pretrained models per-form better than the non-pretrained models in mostcases Moreover it is interesting to observe thatall four existing models generally perform worseon GCRC than other three datasets and the humanperformance on GCRC is also the lowest amongthe four datasets This clearly indicates that GCRCis more challenging Meanwhile it can be seenthat the performance gap between human and thebest system on GCRC is 4645 suggesting thatthere is sufficient room for improvement whichcontradicts with some overly optimistic claims thatmachines have exceeded humans on QA As thereis a significant performance gap between machinesand humans GCRC dataset facilitates researchersto develop novel learning approaches to better un-derstand its questions and bridge the huge gap

GCRC C3 RACE DREAMSliding window 2770 3670 3085 4008Co-Matching 3573 4501 3538 4891BERT 3074 6496 4613 5336XLNet 3515 6090 5000 5963Human 8318 96 955 945

Table 4 Accuracy comparison between four com-putational models and human on four benchmarkdatasets ( indicates the performance is based onthe annotated test set and copied from the corre-sponding paper)

In the next few subsections we will studywhether a representative model ie BERT canperform well in the explainable evaluation relatedsubtasks namely locating sentence-level SFs iden-tifying ERs and the performance of different rea-soning skills required by questions

52 Performance of locating sentence-level SFs using BERT

We investigate whether BERT benefits fromsentence-level SFs We conduct an experimentin which we input the ground-truth SFs insteadof the given article to BERT for question answer-ing We observe that the accuracy of QA on thetest set of GCRC is 3747 Comparing with theresult shown in Table 4 the accuracy is increasedby 673 indicating that locating SFs is helpfulfor QA On the other hand we can see that the im-provement is still far from closing the gap to humanperformance because the questions are difficult andneed further reasoning to solve

Due to the usefulness of SFs we train a BERTmodel to identify whether a sentence belongs toSFs We regard this task as a sentence pair (iean original sentence in the given article and anoption) classification task We use the GCRCsubset (2021863862 questions as shown in Ta-ble 1) annotated with SFs for training and test-ing where the training dev and test sets include2007988502483656 sentence-option pairs re-spectively The experimental results are shown inTable 5 We can see that performance of locatingthe SFs is still low (in terms of precision P recall Rand F1 measure) indicating that accurately obtain-ing the SFs in GCRC is challenging and directlyapplying BERT will not work well

P R F1Dev 7807 5018 6110Test 7020 5146 5939

Table 5 Results of BERT locating SFs

53 Performance of identifying ERs of adistractor using BERT

In order to evaluate whether BERT model under-stands why a distractor is excluded we modifyBERT model and add a new component to performthe identification of distractorsrsquo ERs The com-ponent is a multi-label (ie 8 labels including 1

1327

ERs of options R PWrong details 484 165 909Wrong temporal properties 29 1379 055Wrong subject-predicate-object triple 190 211 588Wrong causality 218 2477 762Wrong necessary and sufficient conditions 106 189 222Irrelevant to the question 85 1765 221Irrelevant to the article 352 483 1735Correct options 1984 2818 5669

Table 6 Results of BERT identifying ERs of dis-tractors

correct type and 7 types of ERs) classifier to pre-dict the probability distribution of the types of ERsand is jointly optimized with normal QA and itshares the low-level representations The classi-fierrsquos objective is to minimize a cross entropy losswhich is jointly optimized with normal QA andthey share the low-level representations For thistask the contextual input of BERT is ground-truthSFs instead of the given article The results areshown in Table 6 We observe that the performanceof identification of ERs is quite low indicating thesignificance and value of our dataset which is toidentify the limitations of existing systems on ex-plainability and facilitates researchers to developnovel learning models

54 Performance of different reasoningskills required by questions usingBERT

We also investigate BERTrsquos performance on thequestions requiring different reasoning skills Wecategorize the performance for each type of reason-ing skills on the test set of GCRC Table 7 showsthe results Note that the reasoning skills are anno-tated for options of questions We observe BERTobtains the lowest score on the options requiringdeductive reasoning Overall the system generallyperforms worse on each type of options indicatingthe reasoning power of the system is not strongenough and needs to be significantly improved

Reasoning Detail understanding Temporalspatial Coreference Deductiveof options 1683 42 179 143Accuracy 3197 6190 3966 3566

Mathematical Cause-effect Inductive Appreciativeof options 40 175 1056 130Accuracy 6000 4000 3314 4769

Table 7 Results of BERT on GCRC by reasoningskills required for QA

From the above it can be seen that no baselineis realized to output answers sentence-level SFsand ERs of a distractors together but it doesnrsquot

affect our dataset to diagnose the limitations of ex-isting RC models For each task we modify theexisting model of BERT output the correspondingexplanations and report their performance Thusthe limitations of the model can be clearly identi-fied In the future we will realize such a baselinethat can do all the tasks together and we will de-sign a new joint metric for evaluating the wholequestion-answering process

6 Conclusions

In this paper we present a new challenging ma-chine reading comprehension dataset (GCRC) col-lected from Gaokao Chinese consisting of 8719high-level comprehensive multiple-choice ques-tions To the best of our knowledge this is cur-rently the most comprehensive challenging andhigh-quality dataset in MRC domain In additionwe spend considerable effort to label three types ofinformation including sentence-level SFs ERs ofa distractor and reasoning skills required for QAaiming to comprehensively evaluate systems in anexplainable way Through experiments we observeGCRC is very challenging data set for existing mod-els and we hope it can inspire innovative machinelearning and reasoning approach to tackle the chal-lenging problem and make MRC as an enablingtechnology for many real-world applications

Acknowledgments

We thank the anonymous reviewers for their help-ful comments and suggestions This work wassupported by the National Key Research and Devel-opment Program of China (No2018YFB1005103)

References

Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. 2018. Think you have solved question answering? Try ARC, the AI2 reasoning challenge. CoRR, abs/1803.05457.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.

Shangmin Guo, Xiangrong Zeng, Shizhu He, Kang Liu, and Jun Zhao. 2017. Which is the effective way for Gaokao: Information retrieval or neural networks? In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, pages 111–120, Valencia, Spain. Association for Computational Linguistics.

Xanh Ho, Anh-Khoa Duong Nguyen, Saku Sugawara, and Akiko Aizawa. 2020. Constructing a multi-hop QA dataset for comprehensive evaluation of reasoning steps. In Proceedings of the 28th International Conference on Computational Linguistics, pages 6609–6625, Barcelona, Spain (Online). International Committee on Computational Linguistics.

Zixian Huang, Yulin Shen, Xiao Li, Yu'ang Wei, Gong Cheng, Lin Zhou, Xinyu Dai, and Yuzhong Qu. 2019. GeoSQA: A benchmark for scenario-based question answering in the geography domain at high school level. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 5866–5871, Hong Kong, China. Association for Computational Linguistics.

Naoya Inoue, Pontus Stenetorp, and Kentaro Inui. 2020. R4C: A benchmark for evaluating RC systems to get the right answer for the right reason. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 6740–6750, Online. Association for Computational Linguistics.

Daniel Khashabi, Snigdha Chaturvedi, Michael Roth, Shyam Upadhyay, and Dan Roth. 2018. Looking beyond the surface: A challenge set for reading comprehension over multiple sentences. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 252–262, New Orleans, Louisiana. Association for Computational Linguistics.

Guokun Lai, Qizhe Xie, Hanxiao Liu, Yiming Yang, and Eduard Hovy. 2017. RACE: Large-scale ReAding comprehension dataset from examinations. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 785–794, Copenhagen, Denmark. Association for Computational Linguistics.

Matthew Richardson, Christopher J.C. Burges, and Erin Renshaw. 2013. MCTest: A challenge dataset for the open-domain machine comprehension of text. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 193–203, Seattle, Washington, USA. Association for Computational Linguistics.

Saku Sugawara, Kentaro Inui, Satoshi Sekine, and Akiko Aizawa. 2018. What makes reading comprehension questions easier? In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 4208–4219, Brussels, Belgium. Association for Computational Linguistics.

Saku Sugawara, Yusuke Kido, Hikaru Yokono, and Akiko Aizawa. 2017. Evaluation metrics for machine reading comprehension: Prerequisite skills and readability. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 806–817, Vancouver, Canada. Association for Computational Linguistics.

Kai Sun, Dian Yu, Jianshu Chen, Dong Yu, Yejin Choi, and Claire Cardie. 2019. DREAM: A challenge data set and models for dialogue-based reading comprehension. Transactions of the Association for Computational Linguistics, 7:217–231.

Kai Sun, Dian Yu, Dong Yu, and Claire Cardie. 2020. Investigating prior knowledge for challenging Chinese machine reading comprehension. Transactions of the Association for Computational Linguistics, 8:141–155.

Shuohang Wang, Mo Yu, Jing Jiang, and Shiyu Chang. 2018. A co-matching model for multi-choice reading comprehension. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 746–751, Melbourne, Australia. Association for Computational Linguistics.

Sarah Wiegreffe and Ana Marasovic. 2021. Teach me to explain: A review of datasets for explainable NLP. CoRR, abs/2102.12060.

Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Russ R. Salakhutdinov, and Quoc V. Le. 2019. XLNet: Generalized autoregressive pretraining for language understanding. In Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc.

Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William Cohen, Ruslan Salakhutdinov, and Christopher D. Manning. 2018. HotpotQA: A dataset for diverse, explainable multi-hop question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2369–2380, Brussels, Belgium. Association for Computational Linguistics.

Weihao Yu, Zihang Jiang, Yanfei Dong, and Jiashi Feng. 2020. ReClor: A reading comprehension dataset requiring logical reasoning. In International Conference on Learning Representations.

Xiao Zhang, Ji Wu, Zhiyang He, Xien Liu, and Ying Su. 2018. Medical exam question answering with large-scale reading comprehension. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence (AAAI-18), the 30th innovative Applications of Artificial Intelligence (IAAI-18), and the 8th AAAI Symposium on Educational Advances in Artificial Intelligence (EAAI-18), New Orleans, Louisiana, USA, February 2-7, 2018, pages 5706–5713. AAAI Press.

Ben Zhou, Daniel Khashabi, Qiang Ning, and Dan Roth. 2019. "Going on a vacation" takes longer than "going for a walk": A study of temporal commonsense understanding. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3363–3369, Hong Kong, China. Association for Computational Linguistics.

Appendix

A Hyper-parameters of Neural Baselines

The hyper-parameters of Co-Matching, BERT and XLNet are shown in Table 8, Table 9 and Table 10, respectively.

                    GCRC    C3      RACE    DREAM
train batch size    8       16      16      32
dev batch size      8       16      8       32
test batch size     8       16      8       32
epoch               50      100     50      100
learning rate       3e-5    1e-3    1e-3    2e-3
seed                128     64      256     128
dropoutP            0.2     0.2     0.2     0.2
emb dim             300     300     300     300
mem dim             150     150     150     150

Table 8: Hyper-parameters of Co-Matching.

                             GCRC    C3      RACE    DREAM
train batch size             32      16      16      32
dev batch size               4       16      4       16
test batch size              4       16      4       16
len                          320     384     450     384
epoch                        6       3       10      3
learning rate                3e-5    1e-5    2e-5    1e-5
gradient accumulation steps  8       8       8       8
seed                         42      42      42      42

Table 9: Hyper-parameters of BERT.
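For readers who want to reproduce the BERT fine-tuning setup, the GCRC column of Table 9 maps naturally onto a standard Hugging Face Trainer configuration. The snippet below is only an assumed wiring of those values; the output directory and the use of TrainingArguments are our choices, and the authors' actual training scripts may differ.

from transformers import TrainingArguments

MAX_LEN = 320  # "len" in Table 9: maximum input length in tokens, applied when tokenizing

args = TrainingArguments(
    output_dir="gcrc-bert",           # hypothetical output path
    per_device_train_batch_size=32,   # train batch size (GCRC column)
    per_device_eval_batch_size=4,     # dev/test batch size
    num_train_epochs=6,
    learning_rate=3e-5,
    gradient_accumulation_steps=8,
    seed=42,
)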

                             GCRC    C3      RACE    DREAM
train batch size             2       2       1       2
dev batch size               2       2       1       2
test batch size              2       2       1       2
len                          320     320     320     180
epoch                        16      16      5       8
learning rate                2e-4    1e-3    2e-5    1e-5
gradient accumulation steps  2       2       24      2

Table 10: Hyper-parameters of XLNet.



correct type and 7 types of ERs) classifier to pre-dict the probability distribution of the types of ERsand is jointly optimized with normal QA and itshares the low-level representations The classi-fierrsquos objective is to minimize a cross entropy losswhich is jointly optimized with normal QA andthey share the low-level representations For thistask the contextual input of BERT is ground-truthSFs instead of the given article The results areshown in Table 6 We observe that the performanceof identification of ERs is quite low indicating thesignificance and value of our dataset which is toidentify the limitations of existing systems on ex-plainability and facilitates researchers to developnovel learning models

54 Performance of different reasoningskills required by questions usingBERT

We also investigate BERTrsquos performance on thequestions requiring different reasoning skills Wecategorize the performance for each type of reason-ing skills on the test set of GCRC Table 7 showsthe results Note that the reasoning skills are anno-tated for options of questions We observe BERTobtains the lowest score on the options requiringdeductive reasoning Overall the system generallyperforms worse on each type of options indicatingthe reasoning power of the system is not strongenough and needs to be significantly improved

Reasoning Detail understanding Temporalspatial Coreference Deductiveof options 1683 42 179 143Accuracy 3197 6190 3966 3566

Mathematical Cause-effect Inductive Appreciativeof options 40 175 1056 130Accuracy 6000 4000 3314 4769

Table 7 Results of BERT on GCRC by reasoningskills required for QA

From the above it can be seen that no baselineis realized to output answers sentence-level SFsand ERs of a distractors together but it doesnrsquot

affect our dataset to diagnose the limitations of ex-isting RC models For each task we modify theexisting model of BERT output the correspondingexplanations and report their performance Thusthe limitations of the model can be clearly identi-fied In the future we will realize such a baselinethat can do all the tasks together and we will de-sign a new joint metric for evaluating the wholequestion-answering process

6 Conclusions

In this paper we present a new challenging ma-chine reading comprehension dataset (GCRC) col-lected from Gaokao Chinese consisting of 8719high-level comprehensive multiple-choice ques-tions To the best of our knowledge this is cur-rently the most comprehensive challenging andhigh-quality dataset in MRC domain In additionwe spend considerable effort to label three types ofinformation including sentence-level SFs ERs ofa distractor and reasoning skills required for QAaiming to comprehensively evaluate systems in anexplainable way Through experiments we observeGCRC is very challenging data set for existing mod-els and we hope it can inspire innovative machinelearning and reasoning approach to tackle the chal-lenging problem and make MRC as an enablingtechnology for many real-world applications

Acknowledgments

We thank the anonymous reviewers for their help-ful comments and suggestions This work wassupported by the National Key Research and Devel-opment Program of China (No2018YFB1005103)

References

Peter Clark Isaac Cowhey Oren Etzioni TusharKhot Ashish Sabharwal Carissa Schoenick andOyvind Tafjord 2018 Think you have solvedquestion answering try arc the AI2 reasoningchallenge CoRR abs180305457

Jacob Devlin Ming-Wei Chang Kenton Lee andKristina Toutanova 2019 BERT Pre-trainingof deep bidirectional transformers for languageunderstanding In Proceedings of the 2019 Con-ference of the North American Chapter of the

1328

Association for Computational Linguistics Hu-man Language Technologies Volume 1 (Longand Short Papers) pages 4171ndash4186 Minneapo-lis Minnesota Association for ComputationalLinguistics

Shangmin Guo Xiangrong Zeng Shizhu He KangLiu and Jun Zhao 2017 Which is the effec-tive way for Gaokao Information retrieval orneural networks In Proceedings of the 15thConference of the European Chapter of the Asso-ciation for Computational Linguistics Volume1 Long Papers pages 111ndash120 Valencia SpainAssociation for Computational Linguistics

Xanh Ho Anh-Khoa Duong Nguyen Saku Sug-awara and Akiko Aizawa 2020 Constructing amulti-hop QA dataset for comprehensive evalu-ation of reasoning steps In Proceedings of the28th International Conference on ComputationalLinguistics pages 6609ndash6625 Barcelona Spain(Online) International Committee on Computa-tional Linguistics

Zixian Huang Yulin Shen Xiao Li YursquoangWei Gong Cheng Lin Zhou Xinyu Dai andYuzhong Qu 2019 GeoSQA A benchmark forscenario-based question answering in the geog-raphy domain at high school level In Proceed-ings of the 2019 Conference on Empirical Meth-ods in Natural Language Processing and the9th International Joint Conference on NaturalLanguage Processing (EMNLP-IJCNLP) pages5866ndash5871 Hong Kong China Association forComputational Linguistics

Naoya Inoue Pontus Stenetorp and Kentaro Inui2020 R4C A benchmark for evaluating RCsystems to get the right answer for the right rea-son In Proceedings of the 58th Annual Meetingof the Association for Computational Linguis-tics pages 6740ndash6750 Online Association forComputational Linguistics

Daniel Khashabi Snigdha Chaturvedi MichaelRoth Shyam Upadhyay and Dan Roth 2018Looking beyond the surface A challenge set forreading comprehension over multiple sentencesIn Proceedings of the 2018 Conference of theNorth American Chapter of the Association forComputational Linguistics Human LanguageTechnologies Volume 1 (Long Papers) pages252ndash262 New Orleans Louisiana Associationfor Computational Linguistics

Guokun Lai Qizhe Xie Hanxiao Liu YimingYang and Eduard Hovy 2017 RACE Large-scale ReAding comprehension dataset from ex-aminations In Proceedings of the 2017 Con-ference on Empirical Methods in Natural Lan-guage Processing pages 785ndash794 CopenhagenDenmark Association for Computational Lin-guistics

Matthew Richardson Christopher JC Burges andErin Renshaw 2013 MCTest A challengedataset for the open-domain machine compre-hension of text In Proceedings of the 2013Conference on Empirical Methods in NaturalLanguage Processing pages 193ndash203 SeattleWashington USA Association for Computa-tional Linguistics

Saku Sugawara Kentaro Inui Satoshi Sekine andAkiko Aizawa 2018 What makes reading com-prehension questions easier In Proceedingsof the 2018 Conference on Empirical Methodsin Natural Language Processing pages 4208ndash4219 Brussels Belgium Association for Com-putational Linguistics

Saku Sugawara Yusuke Kido Hikaru Yokono andAkiko Aizawa 2017 Evaluation metrics formachine reading comprehension Prerequisiteskills and readability In Proceedings of the 55thAnnual Meeting of the Association for Compu-tational Linguistics (Volume 1 Long Papers)pages 806ndash817 Vancouver Canada Associationfor Computational Linguistics

Kai Sun Dian Yu Jianshu Chen Dong Yu YejinChoi and Claire Cardie 2019 DREAM A chal-lenge data set and models for dialogue-basedreading comprehension Transactions of the As-sociation for Computational Linguistics 7217ndash231

Kai Sun Dian Yu Dong Yu and Claire Cardie2020 Investigating prior knowledge for chal-lenging Chinese machine reading comprehen-sion Transactions of the Association for Com-putational Linguistics 8141ndash155

Shuohang Wang Mo Yu Jing Jiang and ShiyuChang 2018 A co-matching model for multi-choice reading comprehension In Proceedingsof the 56th Annual Meeting of the Associationfor Computational Linguistics (Volume 2 Short

1329

Papers) pages 746ndash751 Melbourne AustraliaAssociation for Computational Linguistics

Sarah Wiegreffe and Ana Marasovic 2021 Teachme to explain A review of datasets for explain-able NLP CoRR abs210212060

Zhilin Yang Zihang Dai Yiming Yang JaimeCarbonell Russ R Salakhutdinov and Quoc VLe 2019 Xlnet Generalized autoregressivepretraining for language understanding In Ad-vances in Neural Information Processing Sys-tems volume 32 Curran Associates Inc

Zhilin Yang Peng Qi Saizheng Zhang YoshuaBengio William Cohen Ruslan Salakhutdinovand Christopher D Manning 2018 HotpotQAA dataset for diverse explainable multi-hopquestion answering In Proceedings of the 2018Conference on Empirical Methods in NaturalLanguage Processing pages 2369ndash2380 Brus-sels Belgium Association for ComputationalLinguistics

Weihao Yu Zihang Jiang Yanfei Dong and JiashiFeng 2020 Reclor A reading comprehensiondataset requiring logical reasoning In Interna-tional Conference on Learning Representations

Xiao Zhang Ji Wu Zhiyang He Xien Liu andYing Su 2018 Medical exam question answer-ing with large-scale reading comprehension InProceedings of the Thirty-Second AAAI Confer-ence on Artificial Intelligence (AAAI-18) the30th innovative Applications of Artificial Intelli-gence (IAAI-18) and the 8th AAAI Symposiumon Educational Advances in Artificial Intelli-gence (EAAI-18) New Orleans Louisiana USAFebruary 2-7 2018 pages 5706ndash5713 AAAIPress

Ben Zhou Daniel Khashabi Qiang Ning and DanRoth 2019 ldquogoing on a vacationrdquo takes longerthan ldquogoing for a walkrdquo A study of tempo-ral commonsense understanding In Proceed-ings of the 2019 Conference on Empirical Meth-ods in Natural Language Processing and the9th International Joint Conference on NaturalLanguage Processing (EMNLP-IJCNLP) pages3363ndash3369 Hong Kong China Association forComputational Linguistics

1330

Appendix

A Hyper-parameters of Neural BaselinesThe hyper-parameters of Co-Matching BERT

XLNet are shown in Table 8 Table 9 and Table 10

GCRC C3 RACE DREAMtrain batch size 8 16 16 32dev batch size 8 16 8 32Test batch size 8 16 8 32epoch 50 100 50 100learning rate 3e-5 1e-3 1e-3 2e-3seed 128 64 256 128dropoutP 02 02 02 02emb dim 300 300 300 300mem dim 150 150 150 150

Table 8 Hyper-parameters of Co-Matching

GCRC C3 RACE DREAMtrain batch size 32 16 16 32dev batch size 4 16 4 16Test batch size 4 16 4 16len 320 384 450 384epoch 6 3 10 3learning rate 3e-5 1e-5 2e-5 1e-5gradient accumulation steps 8 8 8 8seed 42 42 42 42

Table 9 Hyper-parameters of BERT

GCRC C3 RACE DREAMtrain batch size 2 2 1 2dev batch size 2 2 1 2Test batch size 2 2 1 2len 320 320 320 180epoch 16 16 5 8learning rate 2e-4 1e-3 2e-5 1e-5gradient accumulation steps 2 2 24 2

Table 10 Hyper-parameters of XLNet

1325

Splitting Train Dev Test Total

of articles 3790 683 620 5093

of questions 6994 863 862 8719

of questionsoptions with SFs 20218084 8633452 8623448 374614984

of questionsdistractors with ERs 20003261 8631428 8621470 37256159

of questionsoptions with rea ski - 8633452 8623448 17256900

Table 1 Statistics of data splitting and annotationsize

GCRC C3 GCRC C3 RACE DREAM(in ChineseCharacters)

(in tokens)

Article len(avg)

11192 1169 3296 538 3524 661

Question len(avg)

209 122 142 78 113 74

Option len(avg)

438 55 271 32 67 42

Table 2 Statistics and comparison among fourMRC datasets collected from standard exams

5 Experiments

With our newly constructed GCRC dataset it isinteresting to evaluate the performance of existingmodels and better understand GCRCrsquos characteris-tics comparing with other existing data sets

51 QA accuracy and GCRC difficultylevel

We evaluate the QA performance of several popularMRC systems on GCRC which will reflect the dif-ficulty level of GCRC Specifically for comprehen-sive evaluation we employ four models includingone rule-based model and three recent state-of-the-art models based on neural networks

bull Sliding window (Richardson et al 2013)It is a rule-based baseline and chooses theanswer with the highest matching score Inparticular it has TFIDF style representationand calculates the lexical similarity betweena sentence (via concatenating a question andone of its candidate options) and each spanin the given article with a fixed window size

bull Co-Matching (Wang et al 2018) It is aBi-LSTM-based model and consists of a co-matching component and a hierarchical ag-gregation component The model not onlymatches the article with a question and eachcandidate option at the word-level but alsocaptures sentence structures of the article Ithas achieved promising results on RACE

Figure 2 Distirbution() of types of ERs of dis-tractors based on the dev and test sets including3725 questions and 6159 distractors

Reason skills Dev Test allLevel 1 Basic information capturing Detail 4711 5044 4877

Level 2 Local information integration

TemporalSpatial 084 073 079Co-reference 542 488 515Deductive 353 379 366Mathematical 045 067 056Cause-effect 567 476 521

Level 3 Global information integration Inductive 3475 3145 3310Appreciative 223 330 276

Leve 1 Basic information capturing 4711 5044 4877Level 2 Local information integration 1591 1483 1536Level 3 Global information integration 3698 3475 3587

Table 3 Distribution () of types of required skillsbased on the annotated examples totally including1725 questions and 6900 options

Note that we have used 300-dimensionalword embeddings based on GloVe (GlobalVectors for Word Representation)3

bull BERT (Devlin et al 2019) It is a pre-trained language model which adopts multi-ple bidirectional transformer layers and self-attention mechanisms to learn contextual re-lations between words in a text BERT hasachieved very good performance on numer-ous NLP tasks including the RC task definedby SQuAD We employ Chinese BERT-baseand English BERT-base released on the web-site4

bull XLNet (Yang et al 2019) It is a generalizedautoregressive pretraining model which usesa permutation language modeling objectiveto combine the advantages of autoregressiveand autoencoding methods XLNet outper-forms BERT on 20 tasks For experiments onChinese datasets we use XLNet-large andits Chinese version released on the website5

3The English word embed-dinghttpsnlpstanfordeduprojectsglove and the Chineseword embedding httpsnlpstanfordeduprojectsglove

4httpsgithubcomgoogle-researchbert5httpsgithubcombrightmartxlnetzh

1326

In addition to evaluate the performance of differ-ent models on GCRC we also want to see howthese models perform on other datasets namelyC3 RACE and DREAM In order to fairly and ob-jectively compare with them we randomly samplefrom these datasets and create three new datasetswith the same sizes of the splits (train dev test) asGCRC (shown in Table 1) The hyper-parametersof these neural baselines can be seen in Appendix

Human Performance We obtain the humanperformance on 200 questions which are randomlysampled from the GCRC test set We invite 20 highschool students to answer these questions wherethey are provided with the questions options andarticles The average accuracy of human is 8318

We first use the training sets from four datasetsto train four models which are subsequently usedto evaluate their accuracies on respective test dataDev sets are used to tune the values of the param-eters of different models Table 4 shows the com-parison results We observe that the neural networkbased models outperform the rule based slidingwindow model In addition pretrained models per-form better than the non-pretrained models in mostcases Moreover it is interesting to observe thatall four existing models generally perform worseon GCRC than other three datasets and the humanperformance on GCRC is also the lowest amongthe four datasets This clearly indicates that GCRCis more challenging Meanwhile it can be seenthat the performance gap between human and thebest system on GCRC is 4645 suggesting thatthere is sufficient room for improvement whichcontradicts with some overly optimistic claims thatmachines have exceeded humans on QA As thereis a significant performance gap between machinesand humans GCRC dataset facilitates researchersto develop novel learning approaches to better un-derstand its questions and bridge the huge gap

GCRC C3 RACE DREAMSliding window 2770 3670 3085 4008Co-Matching 3573 4501 3538 4891BERT 3074 6496 4613 5336XLNet 3515 6090 5000 5963Human 8318 96 955 945

Table 4 Accuracy comparison between four com-putational models and human on four benchmarkdatasets ( indicates the performance is based onthe annotated test set and copied from the corre-sponding paper)

In the next few subsections we will studywhether a representative model ie BERT canperform well in the explainable evaluation relatedsubtasks namely locating sentence-level SFs iden-tifying ERs and the performance of different rea-soning skills required by questions

52 Performance of locating sentence-level SFs using BERT

We investigate whether BERT benefits fromsentence-level SFs We conduct an experimentin which we input the ground-truth SFs insteadof the given article to BERT for question answer-ing We observe that the accuracy of QA on thetest set of GCRC is 3747 Comparing with theresult shown in Table 4 the accuracy is increasedby 673 indicating that locating SFs is helpfulfor QA On the other hand we can see that the im-provement is still far from closing the gap to humanperformance because the questions are difficult andneed further reasoning to solve

Due to the usefulness of SFs we train a BERTmodel to identify whether a sentence belongs toSFs We regard this task as a sentence pair (iean original sentence in the given article and anoption) classification task We use the GCRCsubset (2021863862 questions as shown in Ta-ble 1) annotated with SFs for training and test-ing where the training dev and test sets include2007988502483656 sentence-option pairs re-spectively The experimental results are shown inTable 5 We can see that performance of locatingthe SFs is still low (in terms of precision P recall Rand F1 measure) indicating that accurately obtain-ing the SFs in GCRC is challenging and directlyapplying BERT will not work well

P R F1Dev 7807 5018 6110Test 7020 5146 5939

Table 5 Results of BERT locating SFs

53 Performance of identifying ERs of adistractor using BERT

In order to evaluate whether BERT model under-stands why a distractor is excluded we modifyBERT model and add a new component to performthe identification of distractorsrsquo ERs The com-ponent is a multi-label (ie 8 labels including 1

1327

ERs of options R PWrong details 484 165 909Wrong temporal properties 29 1379 055Wrong subject-predicate-object triple 190 211 588Wrong causality 218 2477 762Wrong necessary and sufficient conditions 106 189 222Irrelevant to the question 85 1765 221Irrelevant to the article 352 483 1735Correct options 1984 2818 5669

Table 6 Results of BERT identifying ERs of dis-tractors

correct type and 7 types of ERs) classifier to pre-dict the probability distribution of the types of ERsand is jointly optimized with normal QA and itshares the low-level representations The classi-fierrsquos objective is to minimize a cross entropy losswhich is jointly optimized with normal QA andthey share the low-level representations For thistask the contextual input of BERT is ground-truthSFs instead of the given article The results areshown in Table 6 We observe that the performanceof identification of ERs is quite low indicating thesignificance and value of our dataset which is toidentify the limitations of existing systems on ex-plainability and facilitates researchers to developnovel learning models

54 Performance of different reasoningskills required by questions usingBERT

We also investigate BERTrsquos performance on thequestions requiring different reasoning skills Wecategorize the performance for each type of reason-ing skills on the test set of GCRC Table 7 showsthe results Note that the reasoning skills are anno-tated for options of questions We observe BERTobtains the lowest score on the options requiringdeductive reasoning Overall the system generallyperforms worse on each type of options indicatingthe reasoning power of the system is not strongenough and needs to be significantly improved

Reasoning Detail understanding Temporalspatial Coreference Deductiveof options 1683 42 179 143Accuracy 3197 6190 3966 3566

Mathematical Cause-effect Inductive Appreciativeof options 40 175 1056 130Accuracy 6000 4000 3314 4769

Table 7 Results of BERT on GCRC by reasoningskills required for QA

From the above it can be seen that no baselineis realized to output answers sentence-level SFsand ERs of a distractors together but it doesnrsquot

affect our dataset to diagnose the limitations of ex-isting RC models For each task we modify theexisting model of BERT output the correspondingexplanations and report their performance Thusthe limitations of the model can be clearly identi-fied In the future we will realize such a baselinethat can do all the tasks together and we will de-sign a new joint metric for evaluating the wholequestion-answering process

6 Conclusions

In this paper we present a new challenging ma-chine reading comprehension dataset (GCRC) col-lected from Gaokao Chinese consisting of 8719high-level comprehensive multiple-choice ques-tions To the best of our knowledge this is cur-rently the most comprehensive challenging andhigh-quality dataset in MRC domain In additionwe spend considerable effort to label three types ofinformation including sentence-level SFs ERs ofa distractor and reasoning skills required for QAaiming to comprehensively evaluate systems in anexplainable way Through experiments we observeGCRC is very challenging data set for existing mod-els and we hope it can inspire innovative machinelearning and reasoning approach to tackle the chal-lenging problem and make MRC as an enablingtechnology for many real-world applications

Acknowledgments

We thank the anonymous reviewers for their help-ful comments and suggestions This work wassupported by the National Key Research and Devel-opment Program of China (No2018YFB1005103)

References

Peter Clark Isaac Cowhey Oren Etzioni TusharKhot Ashish Sabharwal Carissa Schoenick andOyvind Tafjord 2018 Think you have solvedquestion answering try arc the AI2 reasoningchallenge CoRR abs180305457

Jacob Devlin Ming-Wei Chang Kenton Lee andKristina Toutanova 2019 BERT Pre-trainingof deep bidirectional transformers for languageunderstanding In Proceedings of the 2019 Con-ference of the North American Chapter of the

1328

Association for Computational Linguistics Hu-man Language Technologies Volume 1 (Longand Short Papers) pages 4171ndash4186 Minneapo-lis Minnesota Association for ComputationalLinguistics

Shangmin Guo Xiangrong Zeng Shizhu He KangLiu and Jun Zhao 2017 Which is the effec-tive way for Gaokao Information retrieval orneural networks In Proceedings of the 15thConference of the European Chapter of the Asso-ciation for Computational Linguistics Volume1 Long Papers pages 111ndash120 Valencia SpainAssociation for Computational Linguistics

Xanh Ho Anh-Khoa Duong Nguyen Saku Sug-awara and Akiko Aizawa 2020 Constructing amulti-hop QA dataset for comprehensive evalu-ation of reasoning steps In Proceedings of the28th International Conference on ComputationalLinguistics pages 6609ndash6625 Barcelona Spain(Online) International Committee on Computa-tional Linguistics

Zixian Huang Yulin Shen Xiao Li YursquoangWei Gong Cheng Lin Zhou Xinyu Dai andYuzhong Qu 2019 GeoSQA A benchmark forscenario-based question answering in the geog-raphy domain at high school level In Proceed-ings of the 2019 Conference on Empirical Meth-ods in Natural Language Processing and the9th International Joint Conference on NaturalLanguage Processing (EMNLP-IJCNLP) pages5866ndash5871 Hong Kong China Association forComputational Linguistics

Naoya Inoue Pontus Stenetorp and Kentaro Inui2020 R4C A benchmark for evaluating RCsystems to get the right answer for the right rea-son In Proceedings of the 58th Annual Meetingof the Association for Computational Linguis-tics pages 6740ndash6750 Online Association forComputational Linguistics

Daniel Khashabi Snigdha Chaturvedi MichaelRoth Shyam Upadhyay and Dan Roth 2018Looking beyond the surface A challenge set forreading comprehension over multiple sentencesIn Proceedings of the 2018 Conference of theNorth American Chapter of the Association forComputational Linguistics Human LanguageTechnologies Volume 1 (Long Papers) pages252ndash262 New Orleans Louisiana Associationfor Computational Linguistics

Guokun Lai Qizhe Xie Hanxiao Liu YimingYang and Eduard Hovy 2017 RACE Large-scale ReAding comprehension dataset from ex-aminations In Proceedings of the 2017 Con-ference on Empirical Methods in Natural Lan-guage Processing pages 785ndash794 CopenhagenDenmark Association for Computational Lin-guistics

Matthew Richardson Christopher JC Burges andErin Renshaw 2013 MCTest A challengedataset for the open-domain machine compre-hension of text In Proceedings of the 2013Conference on Empirical Methods in NaturalLanguage Processing pages 193ndash203 SeattleWashington USA Association for Computa-tional Linguistics

Saku Sugawara Kentaro Inui Satoshi Sekine andAkiko Aizawa 2018 What makes reading com-prehension questions easier In Proceedingsof the 2018 Conference on Empirical Methodsin Natural Language Processing pages 4208ndash4219 Brussels Belgium Association for Com-putational Linguistics

Saku Sugawara Yusuke Kido Hikaru Yokono andAkiko Aizawa 2017 Evaluation metrics formachine reading comprehension Prerequisiteskills and readability In Proceedings of the 55thAnnual Meeting of the Association for Compu-tational Linguistics (Volume 1 Long Papers)pages 806ndash817 Vancouver Canada Associationfor Computational Linguistics

Kai Sun Dian Yu Jianshu Chen Dong Yu YejinChoi and Claire Cardie 2019 DREAM A chal-lenge data set and models for dialogue-basedreading comprehension Transactions of the As-sociation for Computational Linguistics 7217ndash231

Kai Sun Dian Yu Dong Yu and Claire Cardie2020 Investigating prior knowledge for chal-lenging Chinese machine reading comprehen-sion Transactions of the Association for Com-putational Linguistics 8141ndash155

Shuohang Wang Mo Yu Jing Jiang and ShiyuChang 2018 A co-matching model for multi-choice reading comprehension In Proceedingsof the 56th Annual Meeting of the Associationfor Computational Linguistics (Volume 2 Short

1329

Papers) pages 746ndash751 Melbourne AustraliaAssociation for Computational Linguistics

Sarah Wiegreffe and Ana Marasovic 2021 Teachme to explain A review of datasets for explain-able NLP CoRR abs210212060

Zhilin Yang Zihang Dai Yiming Yang JaimeCarbonell Russ R Salakhutdinov and Quoc VLe 2019 Xlnet Generalized autoregressivepretraining for language understanding In Ad-vances in Neural Information Processing Sys-tems volume 32 Curran Associates Inc

Zhilin Yang Peng Qi Saizheng Zhang YoshuaBengio William Cohen Ruslan Salakhutdinovand Christopher D Manning 2018 HotpotQAA dataset for diverse explainable multi-hopquestion answering In Proceedings of the 2018Conference on Empirical Methods in NaturalLanguage Processing pages 2369ndash2380 Brus-sels Belgium Association for ComputationalLinguistics

Weihao Yu Zihang Jiang Yanfei Dong and JiashiFeng 2020 Reclor A reading comprehensiondataset requiring logical reasoning In Interna-tional Conference on Learning Representations

Xiao Zhang Ji Wu Zhiyang He Xien Liu andYing Su 2018 Medical exam question answer-ing with large-scale reading comprehension InProceedings of the Thirty-Second AAAI Confer-ence on Artificial Intelligence (AAAI-18) the30th innovative Applications of Artificial Intelli-gence (IAAI-18) and the 8th AAAI Symposiumon Educational Advances in Artificial Intelli-gence (EAAI-18) New Orleans Louisiana USAFebruary 2-7 2018 pages 5706ndash5713 AAAIPress

Ben Zhou Daniel Khashabi Qiang Ning and DanRoth 2019 ldquogoing on a vacationrdquo takes longerthan ldquogoing for a walkrdquo A study of tempo-ral commonsense understanding In Proceed-ings of the 2019 Conference on Empirical Meth-ods in Natural Language Processing and the9th International Joint Conference on NaturalLanguage Processing (EMNLP-IJCNLP) pages3363ndash3369 Hong Kong China Association forComputational Linguistics

1330

Appendix

A Hyper-parameters of Neural BaselinesThe hyper-parameters of Co-Matching BERT

XLNet are shown in Table 8 Table 9 and Table 10

GCRC C3 RACE DREAMtrain batch size 8 16 16 32dev batch size 8 16 8 32Test batch size 8 16 8 32epoch 50 100 50 100learning rate 3e-5 1e-3 1e-3 2e-3seed 128 64 256 128dropoutP 02 02 02 02emb dim 300 300 300 300mem dim 150 150 150 150

Table 8 Hyper-parameters of Co-Matching

GCRC C3 RACE DREAMtrain batch size 32 16 16 32dev batch size 4 16 4 16Test batch size 4 16 4 16len 320 384 450 384epoch 6 3 10 3learning rate 3e-5 1e-5 2e-5 1e-5gradient accumulation steps 8 8 8 8seed 42 42 42 42

Table 9 Hyper-parameters of BERT

GCRC C3 RACE DREAMtrain batch size 2 2 1 2dev batch size 2 2 1 2Test batch size 2 2 1 2len 320 320 320 180epoch 16 16 5 8learning rate 2e-4 1e-3 2e-5 1e-5gradient accumulation steps 2 2 24 2

Table 10 Hyper-parameters of XLNet

1326

In addition to evaluate the performance of differ-ent models on GCRC we also want to see howthese models perform on other datasets namelyC3 RACE and DREAM In order to fairly and ob-jectively compare with them we randomly samplefrom these datasets and create three new datasetswith the same sizes of the splits (train dev test) asGCRC (shown in Table 1) The hyper-parametersof these neural baselines can be seen in Appendix

Human Performance We obtain the humanperformance on 200 questions which are randomlysampled from the GCRC test set We invite 20 highschool students to answer these questions wherethey are provided with the questions options andarticles The average accuracy of human is 8318

We first use the training sets from four datasetsto train four models which are subsequently usedto evaluate their accuracies on respective test dataDev sets are used to tune the values of the param-eters of different models Table 4 shows the com-parison results We observe that the neural networkbased models outperform the rule based slidingwindow model In addition pretrained models per-form better than the non-pretrained models in mostcases Moreover it is interesting to observe thatall four existing models generally perform worseon GCRC than other three datasets and the humanperformance on GCRC is also the lowest amongthe four datasets This clearly indicates that GCRCis more challenging Meanwhile it can be seenthat the performance gap between human and thebest system on GCRC is 4645 suggesting thatthere is sufficient room for improvement whichcontradicts with some overly optimistic claims thatmachines have exceeded humans on QA As thereis a significant performance gap between machinesand humans GCRC dataset facilitates researchersto develop novel learning approaches to better un-derstand its questions and bridge the huge gap

GCRC C3 RACE DREAMSliding window 2770 3670 3085 4008Co-Matching 3573 4501 3538 4891BERT 3074 6496 4613 5336XLNet 3515 6090 5000 5963Human 8318 96 955 945

Table 4 Accuracy comparison between four com-putational models and human on four benchmarkdatasets ( indicates the performance is based onthe annotated test set and copied from the corre-sponding paper)

In the next few subsections we will studywhether a representative model ie BERT canperform well in the explainable evaluation relatedsubtasks namely locating sentence-level SFs iden-tifying ERs and the performance of different rea-soning skills required by questions

52 Performance of locating sentence-level SFs using BERT

We investigate whether BERT benefits fromsentence-level SFs We conduct an experimentin which we input the ground-truth SFs insteadof the given article to BERT for question answer-ing We observe that the accuracy of QA on thetest set of GCRC is 3747 Comparing with theresult shown in Table 4 the accuracy is increasedby 673 indicating that locating SFs is helpfulfor QA On the other hand we can see that the im-provement is still far from closing the gap to humanperformance because the questions are difficult andneed further reasoning to solve

Due to the usefulness of SFs we train a BERTmodel to identify whether a sentence belongs toSFs We regard this task as a sentence pair (iean original sentence in the given article and anoption) classification task We use the GCRCsubset (2021863862 questions as shown in Ta-ble 1) annotated with SFs for training and test-ing where the training dev and test sets include2007988502483656 sentence-option pairs re-spectively The experimental results are shown inTable 5 We can see that performance of locatingthe SFs is still low (in terms of precision P recall Rand F1 measure) indicating that accurately obtain-ing the SFs in GCRC is challenging and directlyapplying BERT will not work well

P R F1Dev 7807 5018 6110Test 7020 5146 5939

Table 5 Results of BERT locating SFs

53 Performance of identifying ERs of adistractor using BERT

In order to evaluate whether BERT model under-stands why a distractor is excluded we modifyBERT model and add a new component to performthe identification of distractorsrsquo ERs The com-ponent is a multi-label (ie 8 labels including 1

1327

ERs of options R PWrong details 484 165 909Wrong temporal properties 29 1379 055Wrong subject-predicate-object triple 190 211 588Wrong causality 218 2477 762Wrong necessary and sufficient conditions 106 189 222Irrelevant to the question 85 1765 221Irrelevant to the article 352 483 1735Correct options 1984 2818 5669

Table 6 Results of BERT identifying ERs of dis-tractors

correct type and 7 types of ERs) classifier to pre-dict the probability distribution of the types of ERsand is jointly optimized with normal QA and itshares the low-level representations The classi-fierrsquos objective is to minimize a cross entropy losswhich is jointly optimized with normal QA andthey share the low-level representations For thistask the contextual input of BERT is ground-truthSFs instead of the given article The results areshown in Table 6 We observe that the performanceof identification of ERs is quite low indicating thesignificance and value of our dataset which is toidentify the limitations of existing systems on ex-plainability and facilitates researchers to developnovel learning models

54 Performance of different reasoningskills required by questions usingBERT

We also investigate BERTrsquos performance on thequestions requiring different reasoning skills Wecategorize the performance for each type of reason-ing skills on the test set of GCRC Table 7 showsthe results Note that the reasoning skills are anno-tated for options of questions We observe BERTobtains the lowest score on the options requiringdeductive reasoning Overall the system generallyperforms worse on each type of options indicatingthe reasoning power of the system is not strongenough and needs to be significantly improved

Reasoning Detail understanding Temporalspatial Coreference Deductiveof options 1683 42 179 143Accuracy 3197 6190 3966 3566

Mathematical Cause-effect Inductive Appreciativeof options 40 175 1056 130Accuracy 6000 4000 3314 4769

Table 7 Results of BERT on GCRC by reasoningskills required for QA

From the above it can be seen that no baselineis realized to output answers sentence-level SFsand ERs of a distractors together but it doesnrsquot

affect our dataset to diagnose the limitations of ex-isting RC models For each task we modify theexisting model of BERT output the correspondingexplanations and report their performance Thusthe limitations of the model can be clearly identi-fied In the future we will realize such a baselinethat can do all the tasks together and we will de-sign a new joint metric for evaluating the wholequestion-answering process

6 Conclusions

In this paper we present a new challenging ma-chine reading comprehension dataset (GCRC) col-lected from Gaokao Chinese consisting of 8719high-level comprehensive multiple-choice ques-tions To the best of our knowledge this is cur-rently the most comprehensive challenging andhigh-quality dataset in MRC domain In additionwe spend considerable effort to label three types ofinformation including sentence-level SFs ERs ofa distractor and reasoning skills required for QAaiming to comprehensively evaluate systems in anexplainable way Through experiments we observeGCRC is very challenging data set for existing mod-els and we hope it can inspire innovative machinelearning and reasoning approach to tackle the chal-lenging problem and make MRC as an enablingtechnology for many real-world applications

Acknowledgments

We thank the anonymous reviewers for their help-ful comments and suggestions This work wassupported by the National Key Research and Devel-opment Program of China (No2018YFB1005103)

References

Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. 2018. Think you have solved question answering? Try ARC, the AI2 reasoning challenge. CoRR, abs/1803.05457.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.

Shangmin Guo, Xiangrong Zeng, Shizhu He, Kang Liu, and Jun Zhao. 2017. Which is the effective way for Gaokao: Information retrieval or neural networks? In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, pages 111–120, Valencia, Spain. Association for Computational Linguistics.

Xanh Ho, Anh-Khoa Duong Nguyen, Saku Sugawara, and Akiko Aizawa. 2020. Constructing a multi-hop QA dataset for comprehensive evaluation of reasoning steps. In Proceedings of the 28th International Conference on Computational Linguistics, pages 6609–6625, Barcelona, Spain (Online). International Committee on Computational Linguistics.

Zixian Huang, Yulin Shen, Xiao Li, Yu'ang Wei, Gong Cheng, Lin Zhou, Xinyu Dai, and Yuzhong Qu. 2019. GeoSQA: A benchmark for scenario-based question answering in the geography domain at high school level. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 5866–5871, Hong Kong, China. Association for Computational Linguistics.

Naoya Inoue, Pontus Stenetorp, and Kentaro Inui. 2020. R4C: A benchmark for evaluating RC systems to get the right answer for the right reason. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 6740–6750, Online. Association for Computational Linguistics.

Daniel Khashabi, Snigdha Chaturvedi, Michael Roth, Shyam Upadhyay, and Dan Roth. 2018. Looking beyond the surface: A challenge set for reading comprehension over multiple sentences. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 252–262, New Orleans, Louisiana. Association for Computational Linguistics.

Guokun Lai, Qizhe Xie, Hanxiao Liu, Yiming Yang, and Eduard Hovy. 2017. RACE: Large-scale ReAding comprehension dataset from examinations. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 785–794, Copenhagen, Denmark. Association for Computational Linguistics.

Matthew Richardson, Christopher J.C. Burges, and Erin Renshaw. 2013. MCTest: A challenge dataset for the open-domain machine comprehension of text. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 193–203, Seattle, Washington, USA. Association for Computational Linguistics.

Saku Sugawara, Kentaro Inui, Satoshi Sekine, and Akiko Aizawa. 2018. What makes reading comprehension questions easier? In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 4208–4219, Brussels, Belgium. Association for Computational Linguistics.

Saku Sugawara, Yusuke Kido, Hikaru Yokono, and Akiko Aizawa. 2017. Evaluation metrics for machine reading comprehension: Prerequisite skills and readability. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 806–817, Vancouver, Canada. Association for Computational Linguistics.

Kai Sun, Dian Yu, Jianshu Chen, Dong Yu, Yejin Choi, and Claire Cardie. 2019. DREAM: A challenge data set and models for dialogue-based reading comprehension. Transactions of the Association for Computational Linguistics, 7:217–231.

Kai Sun, Dian Yu, Dong Yu, and Claire Cardie. 2020. Investigating prior knowledge for challenging Chinese machine reading comprehension. Transactions of the Association for Computational Linguistics, 8:141–155.

Shuohang Wang, Mo Yu, Jing Jiang, and Shiyu Chang. 2018. A co-matching model for multi-choice reading comprehension. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 746–751, Melbourne, Australia. Association for Computational Linguistics.

Sarah Wiegreffe and Ana Marasovic. 2021. Teach me to explain: A review of datasets for explainable NLP. CoRR, abs/2102.12060.

Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Russ R. Salakhutdinov, and Quoc V. Le. 2019. XLNet: Generalized autoregressive pretraining for language understanding. In Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc.

Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William Cohen, Ruslan Salakhutdinov, and Christopher D. Manning. 2018. HotpotQA: A dataset for diverse, explainable multi-hop question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2369–2380, Brussels, Belgium. Association for Computational Linguistics.

Weihao Yu, Zihang Jiang, Yanfei Dong, and Jiashi Feng. 2020. ReClor: A reading comprehension dataset requiring logical reasoning. In International Conference on Learning Representations.

Xiao Zhang, Ji Wu, Zhiyang He, Xien Liu, and Ying Su. 2018. Medical exam question answering with large-scale reading comprehension. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence (AAAI-18), the 30th innovative Applications of Artificial Intelligence (IAAI-18), and the 8th AAAI Symposium on Educational Advances in Artificial Intelligence (EAAI-18), New Orleans, Louisiana, USA, February 2–7, 2018, pages 5706–5713. AAAI Press.

Ben Zhou, Daniel Khashabi, Qiang Ning, and Dan Roth. 2019. "Going on a vacation" takes longer than "going for a walk": A study of temporal commonsense understanding. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3363–3369, Hong Kong, China. Association for Computational Linguistics.

Appendix

A  Hyper-parameters of Neural Baselines

The hyper-parameters of Co-Matching, BERT and XLNet are shown in Table 8, Table 9 and Table 10, respectively.

Hyper-parameter    GCRC   C3     RACE   DREAM
train batch size   8      16     16     32
dev batch size     8      16     8      32
test batch size    8      16     8      32
epoch              50     100    50     100
learning rate      3e-5   1e-3   1e-3   2e-3
seed               128    64     256    128
dropout P          0.2    0.2    0.2    0.2
emb dim            300    300    300    300
mem dim            150    150    150    150

Table 8: Hyper-parameters of Co-Matching.

Hyper-parameter               GCRC   C3     RACE   DREAM
train batch size              32     16     16     32
dev batch size                4      16     4      16
test batch size               4      16     4      16
len                           320    384    450    384
epoch                         6      3      10     3
learning rate                 3e-5   1e-5   2e-5   1e-5
gradient accumulation steps   8      8      8      8
seed                          42     42     42     42

Table 9: Hyper-parameters of BERT.

Hyper-parameter               GCRC   C3     RACE   DREAM
train batch size              2      2      1      2
dev batch size                2      2      1      2
test batch size               2      2      1      2
len                           320    320    320    180
epoch                         16     16     5      8
learning rate                 2e-4   1e-3   2e-5   1e-5
gradient accumulation steps   2      2      24     2

Table 10: Hyper-parameters of XLNet.
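For reference, one plausible way to wire the GCRC column of Table 9 into a Hugging Face fine-tuning setup is sketched below. Only the listed hyper-parameter values come from the table; the output directory and the tokenization call are illustrative placeholders, not part of the released training scripts.

```python
from transformers import TrainingArguments

# Values taken from the GCRC column of Table 9; everything else is illustrative.
bert_gcrc_args = TrainingArguments(
    output_dir="./bert-gcrc",            # placeholder path
    per_device_train_batch_size=32,      # train batch size
    per_device_eval_batch_size=4,        # dev/test batch size
    num_train_epochs=6,                  # epoch
    learning_rate=3e-5,                  # learning rate
    gradient_accumulation_steps=8,       # gradient accumulation steps
    seed=42,                             # seed
)

# The sequence length ("len" = 320) is typically applied at tokenization time,
# e.g. tokenizer(article, option, truncation=True, max_length=320).
```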

