

Findings of the Association for Computational Linguistics: EMNLP 2021, pages 2348–2359, November 7–11, 2021. ©2021 Association for Computational Linguistics


What If Sentence-hood is Hard to Define: A Case Study in Chinese Reading Comprehension

Jiawei Wang 1,2,3, Hai Zhao 1,2,3,*, Yinggong Zhao 4, Libin Shen 4

1 Department of Computer Science and Engineering, Shanghai Jiao Tong University
2 Key Laboratory of Shanghai Education Commission for Intelligent Interaction and Cognitive Engineering, Shanghai Jiao Tong University
3 MoE Key Lab of Artificial Intelligence, AI Institute, Shanghai Jiao Tong University
4 Leyan Tech, Shanghai, China
[email protected], [email protected], {ygzhao, libin}@leyantech.com

Abstract

Machine reading comprehension (MRC) is a challenging NLP task because it requires careful handling of all linguistic granularities, from word and sentence to passage. For extractive MRC, the answer span has been shown to be mostly determined by key evidence linguistic units, which in most cases are sentences. However, we recently discovered that sentences may not be clearly defined in many languages, to different extents. This causes the so-called location unit ambiguity problem and, as a result, makes it difficult for a model to determine which sentence exactly contains the answer span when the sentence itself has not been clearly defined at all. Taking the Chinese language as a case study, we explain and analyze this linguistic phenomenon and correspondingly propose a reader with Explicit Span-Sentence Predication to alleviate the problem. Our proposed reader eventually helps achieve a new state-of-the-art on a Chinese MRC benchmark and shows great potential in dealing with other languages.

1 Introduction

Machine reading comprehension (MRC) is a task that requires models to answer a question according to a given passage. This is a challenging task because it demands careful handling of all linguistic granularities, from word and sentence to passage (Zhang et al., 2020b; Zhou et al., 2020). For extractive MRC, the focus of this paper, the answer span has been shown to be mostly determined by key evidence linguistic units, which in most cases are sentences (Zhang et al., 2020a).

∗ Corresponding author. This paper was partially supported by Key Projects of National Natural Science Foundation of China (U1836222 and 61733011).

However, we recently found that sentences may not be clearly defined in many languages, to different extents, which causes the so-called location unit ambiguity problem and makes it harder for a model to determine which sentence exactly contains the answer span when the sentence itself has not been clearly defined at all. In detail, a sentence may include multiple clauses, as in English, or it may consist of a series of sub-sentences, as in Chinese, where all sub-sentences share the same subject, predicate or object (Li et al., 2020b). When a language has relatively strict grammatical means to determine the boundaries of sentence constituents such as clauses or sub-sentences, MRC models can more conveniently focus on a certain range of text to find the answer span. Otherwise, an obvious location unit ambiguity problem arises and hinders the performance of extractive MRC.

In the following, we take the Chinese language as a case study to explain and analyze this linguistic phenomenon and correspondingly find a solution. Regarding the characteristics of Chinese, “In terms of sentence structure, English is determined by rule, while Chinese is determined by man” (Wang, 1984); that is, English focuses more on syntax while Chinese focuses more on semantics. A full long English sentence has to obey strict grammatical means so that clauses can be clearly identified, while in Chinese such a long sentence may be written in a loose way: typically, its subject may be conveniently omitted in all later sub-sentences, so that the boundaries between sentences and sub-sentences are blurred. As a result, there are more independent short sentences in Chinese which could equally be written as a single grammar-rigorous long one in English (Li and Nenkova, 2015; Zhao et al., 2017; Duan and Zhao, 2020).

As shown in the Chinese MRC example in Figure 1, the completely paraphrased sentence that answers the question is given as a series of short sub-sentences in Chinese, which are connected in discourse relation but relatively independent in syntax.



Figure 1: An example of location unit ambiguity of Chinese MRC models compared with English. The main alignment between the two languages is marked in orange. P and G refer to the predicted span and the ground-truth answer span, respectively.

Actually, we provide two groups of English translations in Figure 1, in which the same Chinese ‘long sentence’ may be accurately translated either into a series of short sentences (in small font) or into a strictly well-formed long sentence (in big font). In addition to flexible word order, Chinese expressions tend to adopt ellipsis for every possible constituent, including the shared subject, leading word or conjunctions, which makes it much more difficult to identify a strictly defined long sentence in Chinese than in English. Thus, assuming there is an implicit locating process for the answer-related sentence before extracting the answer span, English MRC models may easily locate the complete answer-related sentence (right part of Figure 1), while Chinese MRC models may face the location unit ambiguity (left part of Figure 1), ignoring some needed sub-sentences (omitting) or focusing on unrelated ones (surplus). Such a specific difficulty in Chinese MRC essentially requires a mechanism capable of teaching the model to locate the exact answer-related sentences in an explicit way.

In this paper, we intend to discover whether this location unit ambiguity caused by the difficulty of sentence definition can be solved well, and we take a case study on Chinese extractive MRC. The basic form of extractive MRC requires models to extract a text span out of the passage to answer the question, given a 〈passage, question〉 pair, as in SQuAD1.1 (Rajpurkar et al., 2016), NQ (Kwiatkowski et al., 2019) and CMRC 2018 (Cui et al., 2019). There are also some other variants: SQuAD2.0 (Rajpurkar et al., 2018), CoQA (Reddy et al., 2019), HotpotQA (Yang et al., 2018), etc. The mainstream scheme of existing models is to cast extractive MRC as a token-level task, that is, to predict the probability of each token being the start/end of the span, so as to extract the most suitable answer span (Devlin et al., 2019).

Specifically, we propose ESPReader (Reader with Explicit Span-sentence Predication), applying an extra explicit span-sentence predication (ESP) subtask to help the model locate the answer-containing sentences more precisely. ESP is automatically constructed from the original span extraction dataset and forces the model to locate the sentence containing the answer span in an explicit way. ESP is jointly trained with the original token-level task. Our model uses self-attention to acquire answer-aware sentence-level representations from ESP and then fuses them with the original token-level representations from the encoder by cross-attention for better span extraction.

Our contributions are summarized as follows:

• To the best of our knowledge, we are the first to report the sentence definition ambiguity in human language together with its negative impact on the MRC task.

• Our proposed ESP can be automatically constructed from the original corpus without extra human tagging.

• Experiments verify the performance and generality of our proposed model, and a new state-of-the-art on base-level models is achieved.

2 Related Work

2.1 Machine Reading Comprehension

Machine reading comprehension (MRC) is one of the main research directions of natural language processing (NLP). MRC tasks aim at testing a machine's comprehension of natural language by requiring it to answer questions given a related passage (Hermann et al., 2015; Zhang et al., 2020d). The main task types include cloze (Hill et al., 2015; Cui et al., 2016), multi-choice (Lai et al., 2017; Sun et al., 2019) and span extraction (Rajpurkar et al., 2016; Cui et al., 2019; Reddy et al., 2019). In this paper, we focus on Chinese MRC of the last style. MRC tasks have made great progress and many strong models have appeared: Read+Verify (Hu et al., 2019), RankQA (Kratzwald et al., 2019), SG-Net (Zhang et al., 2020c), SAE (Tu et al., 2020), Retro-Reader (Zhang et al., 2021), etc. Among them, Reddy et al. (2020) aimed at resolving the partial match problem in English span extraction tasks, which is close to our model design and task purpose for Chinese. Their solution is a two-stage model that first locates an initial answer, then marks it in the raw passage and redoes the reading process. Differently, our method is a fully end-to-end model with a special design that enables the model to learn accurate locations of span-sentences.

2.2 PrLMs and Chinese PrLMs

Pre-trained contextualized language models (PrLMs) like BERT (Devlin et al., 2019) have achieved excellent results in various downstream tasks. PrLMs now dominate the encoder design of many NLP tasks, including MRC (Zhang et al., 2021; Xu et al., 2021). More and more well-designed PrLMs keep emerging, including XLNet (Yang et al., 2019), RoBERTa (Liu et al., 2019), ALBERT (Lan et al., 2019), ELECTRA (Clark et al., 2020), etc. As for Chinese PrLMs, MacBERT (Cui et al., 2020) uses whole word masking and n-gram masking strategies to select candidate tokens for masking and replaces the [MASK] token with similar words for the masking purpose.

2.3 Multi-grained and Hierarchical Models

To handle the location unit ambiguity, our model adopts a middle-level improvement design, so our research has some correlation with multi-grained and hierarchical models (Choi et al., 2017; Wang et al., 2018; Luo et al., 2020). Shen et al. (2018) proposed a multi-grained approach combining character-level, word-level and relation-level information for text embeddings. Ma et al. (2019) proposed a claim verification framework based on hierarchical attention neural networks that learns sentence-level evidence embeddings to obtain claim-specific representations. All the above works used low-level semantic information to obtain high-level semantic representations, which differs from our intent of using sentence-level information to assist a token-level task. Zhang et al. (2020a) proposed a hierarchical network that chooses the top K answer-related sentences from the given passage, scored by cosine and bilinear scores, to build a new passage for further multi-choice tasks. Their work is somewhat similar to our method. However, we let the model directly locate the answer-containing sentence and use this sentence-level information for further token-level span extraction via cross-attention, instead of simply discarding the other lower-scoring sentences.

3 Our Proposed Model

As shown in Figure 2, our proposed Reader with Explicit Span-sentence Predication (ESPReader) consists of three modules: a PrLM encoder, a sentence-level self-attention layer and a fusion cross-attention layer. The details are given below.

Explicit Span-sentence Predication  To enhance the model with the capacity of locating the answer-related sentences more precisely, an explicit span-sentence predication (ESP) is proposed as a sentence-level subtask. For the sake of the integrity of sentence structure and content, paragraphs are divided into natural sentences by ending punctuation (“,”, “.”, “?”, and “!”) rather than by a fixed length. After such segmentation, the sentence containing the answer span is labeled as a span-sentence. During training, our model is required to explicitly locate the span-sentence while extracting the answer span, which may alleviate the location unit ambiguity issue, as span-sentence boundaries are annotated according to the finest sub-sentence segmentation given by the punctuation.
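To make the construction concrete, the following is a minimal sketch (ours, not the authors' released code) of how such span-sentence labels could be derived automatically from a span-extraction example; the punctuation set, helper names and the toy passage are illustrative assumptions.

```python
# Hypothetical sketch: derive ESP (span-sentence) labels from a span-extraction example.
import re

END_PUNCT = "，。？！,.?!"  # assumed ending punctuation used as sub-sentence boundaries

def segment_sentences(passage: str):
    """Split a passage into sub-sentences at ending punctuation, returning
    (start_offset, text) pairs; the punctuation mark stays with its sub-sentence."""
    pieces, start = [], 0
    for match in re.finditer(f"[{re.escape(END_PUNCT)}]", passage):
        pieces.append((start, passage[start:match.end()]))
        start = match.end()
    if start < len(passage):                       # trailing text without ending punctuation
        pieces.append((start, passage[start:]))
    return pieces

def esp_labels(passage: str, answer_start: int, answer_end: int):
    """Return (start_sentence_idx, end_sentence_idx): the sub-sentences containing
    the first and last character of the answer span (answer_end is exclusive)."""
    start_sent = end_sent = None
    for idx, (offset, text) in enumerate(segment_sentences(passage)):
        if offset <= answer_start < offset + len(text):
            start_sent = idx
        if offset <= answer_end - 1 < offset + len(text):
            end_sent = idx
    return start_sent, end_sent

# Toy example: the answer "黄色" lies in the second sub-sentence of this passage.
passage = "拉迪氏鱼产卵期2~3个月，卵为黄色，一次产卵20~30粒，约10日孵化。"
print(esp_labels(passage, passage.index("黄色"), passage.index("黄色") + 2))  # -> (1, 1)
```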


Figure 2: The architecture of our proposed model.

Since an answer span may stretch over multiple sentences, we model the ESP subtask as predicting the locations of the start/end sentences (or sub-sentences), which is consistent with the form of the original span extraction task.

Sentence Position Embedding  We sum up four embeddings, including the sentence position embedding E_s (see Appendix A for details about E_s), as input to let the PrLM encoder yield representations: E_input = E_w + E_p + E_t + E_s, where E_w, E_p and E_t are the word embedding, the position embedding (the token's offset in the whole input sequence) and the token type embedding (whether the token belongs to the question or the passage), respectively.
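A minimal sketch of this embedding sum, assuming a BERT-style embedding layer; the module names and sizes are illustrative, not the paper's released implementation.

```python
# Hypothetical sketch of the input embedding sum E_input = E_w + E_p + E_t + E_s.
import torch
import torch.nn as nn

class InputEmbeddings(nn.Module):
    def __init__(self, vocab_size, max_pos, max_sents, hidden=768):
        super().__init__()
        self.word = nn.Embedding(vocab_size, hidden)   # E_w
        self.pos = nn.Embedding(max_pos, hidden)       # E_p: token offset in the input sequence
        self.type = nn.Embedding(2, hidden)            # E_t: question vs. passage
        self.sent = nn.Embedding(max_sents, hidden)    # E_s: sentence offset (see Appendix A)

    def forward(self, input_ids, token_type_ids, sent_pos_ids):
        positions = torch.arange(input_ids.size(1), device=input_ids.device)
        return (self.word(input_ids) + self.pos(positions)
                + self.type(token_type_ids) + self.sent(sent_pos_ids))
```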

Sentence-level Representation  Reimers and Gurevych (2019) found that using the mean of the output vectors of the last layer of a PrLM as the sentence representation marginally outperforms the overall representation given by the [CLS] token. Li et al. (2020a) claimed that using the average of the last two layers as the sentence embedding is better, and that mapping it to a standard Gaussian latent space can further eliminate the unevenness of the embedding space caused by word frequency differences.

Taking both experimental effectiveness and model simplicity into consideration, we use the average of the last layer's output of the PrLM over all tokens H_t = {h_t^1, h_t^2, ..., h_t^n} in the corresponding sentence as the sentence-level representation S:

    S = {s_1, s_2, ..., s_m},
    s_i = \frac{1}{n_i} \sum_{j=sp_i}^{sp_i + n_i - 1} h_t^j        (1)

where sp_i and n_i are the start position offset and the length of sentence_i, respectively.
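A small sketch of Eq. (1), assuming the last-layer token states of one example are available as a tensor; names are illustrative.

```python
# Hypothetical sketch of Eq. (1): mean-pool token states over each sub-sentence.
import torch

def sentence_representations(h_t, sent_spans):
    """h_t: (seq_len, hidden) last-layer token states; sent_spans: list of (sp_i, n_i)
    pairs giving each sub-sentence's start offset and length in the input sequence."""
    return torch.stack([h_t[sp:sp + n].mean(dim=0) for sp, n in sent_spans])  # (m, hidden)
```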

Sentence-level Self-attention Layer  On top of the PrLM-encoded sentence representations, we apply the multi-head attention mechanism (Vaswani et al., 2017) to calculate the self-attention between sentences, as follows:

    A_s^i = softmax(Q_s^i (K_s^i)^T / \sqrt{d_k}) V_s^i,
    H_s = Concat(A_s^1, A_s^2, ..., A_s^D)        (2)

where A_s^i is the sentence-level attention score of head_i and D is the total number of heads.

    Q_s^i = S W_s^{Q,i},  K_s^i = S W_s^{K,i},  V_s^i = S W_s^{V,i}        (3)

where W_s^{Q,i}, W_s^{K,i}, W_s^{V,i} ∈ R^{d_h × d_k} are all learnable parameter matrices.

Next, H_s is passed through a feed-forward layer with GeLU activation (Hendrycks and Gimpel, 2016), followed by the residual connection and layer normalization, to obtain the final sentence-level output H_s = {h_s^1, h_s^2, ..., h_s^m}.


To predict the start/end sentence, we use a linear layer with a softmax to obtain the probability of each sentence being the start/end sentence separately:

    s_s, s_e = softmax(Linear(H_s))        (4)

where Linear is a linear transformation d_h → 2. Cross-entropy loss is used as the training objective:

    L_s = y_{s_s} \log s_s + y_{s_e} \log s_e        (5)

where y_{s_s} and y_{s_e} are the ground-truth label vectors of the start/end sentence. In this way, H_s is guided to be an answer-aware sentence-level representation.
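The following sketch illustrates the sentence-level self-attention layer and the ESP prediction head of Eqs. (2)-(5). It leans on torch.nn.MultiheadAttention instead of spelling out the per-head projections, and all dimensions and names are assumptions rather than the authors' code.

```python
# Hypothetical sketch of the sentence-level self-attention layer and ESP head.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SentenceSelfAttention(nn.Module):
    def __init__(self, hidden=768, heads=12):
        super().__init__()
        self.attn = nn.MultiheadAttention(hidden, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(hidden, hidden), nn.GELU())
        self.norm1 = nn.LayerNorm(hidden)
        self.norm2 = nn.LayerNorm(hidden)
        self.esp_head = nn.Linear(hidden, 2)   # start/end sentence logits, Eq. (4)

    def forward(self, s):                      # s: (batch, m, hidden) sentence representations
        a, _ = self.attn(s, s, s)              # multi-head self-attention, Eqs. (2)-(3)
        a = self.norm1(s + a)                  # add & norm
        h_s = self.norm2(a + self.ffn(a))      # feed-forward (GeLU) + residual + layer norm
        start_logits, end_logits = self.esp_head(h_s).unbind(dim=-1)
        return h_s, start_logits, end_logits

def esp_loss(start_logits, end_logits, start_sent, end_sent):
    """ESP training objective (Eq. 5) written as standard cross-entropy over sentence indices."""
    return F.cross_entropy(start_logits, start_sent) + F.cross_entropy(end_logits, end_sent)
```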

Fusion Cross-attention Layer  To integrate sentence-level information for span extraction, we conduct cross-attention between the encoder output H_t and the output of the sentence-level self-attention layer H_s. The calculation is almost the same as Eq. (2), except that the sources of the vectors Q, K and V differ: Q comes from H_t, while K and V come from H_s:

    Q_F^i = H_t W_F^{Q,i},  K_F^i = H_s W_F^{K,i},  V_F^i = H_s W_F^{V,i}        (6)

where W_F^{Q,i}, W_F^{K,i}, W_F^{V,i} are all learnable parameter matrices as in Eq. (3). The remaining calculation is exactly the same as in the sentence-level self-attention layer. Through the fusion cross-attention layer, we obtain the token-level fusion output F_t, which is injected with answer-aware sentence-level representations.

Finally, a manual weight α is used to aggregate F_t and the original encoder output H_t into the final token-level output:

    H'_t = α H_t + (1 − α) F_t        (7)

H'_t is then used to make the start/end span predictions t_s and t_e:

    t_s, t_e = softmax(Linear(H'_t))        (8)

Equally, cross-entropy is used as the token-level loss function:

    L_t = y_{t_s} \log t_s + y_{t_e} \log t_e        (9)

where y_{t_s} and y_{t_e} are the ground-truth label vectors of the start/end span.
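Analogously, here is a sketch of the fusion cross-attention layer and the aggregation of Eqs. (6)-(8), again using library attention; names and dimensions are illustrative assumptions.

```python
# Hypothetical sketch of the fusion cross-attention layer and the alpha aggregation.
import torch
import torch.nn as nn

class FusionCrossAttention(nn.Module):
    def __init__(self, hidden=768, heads=12, alpha=0.5):
        super().__init__()
        self.attn = nn.MultiheadAttention(hidden, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(hidden, hidden), nn.GELU())
        self.norm1 = nn.LayerNorm(hidden)
        self.norm2 = nn.LayerNorm(hidden)
        self.span_head = nn.Linear(hidden, 2)   # start/end token logits, Eq. (8)
        self.alpha = alpha                      # manual aggregation weight, Eq. (7)

    def forward(self, h_t, h_s):                # h_t: (b, n, d) tokens; h_s: (b, m, d) sentences
        a, _ = self.attn(h_t, h_s, h_s)         # Eq. (6): Q from H_t, K and V from H_s
        a = self.norm1(h_t + a)
        f_t = self.norm2(a + self.ffn(a))       # token-level fusion output F_t
        h_final = self.alpha * h_t + (1 - self.alpha) * f_t   # Eq. (7)
        start_logits, end_logits = self.span_head(h_final).unbind(dim=-1)
        return start_logits, end_logits
```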

Training and Prediction  During the training phase, we jointly learn span extraction and ESP, and the final loss is:

    L = β L_t + (1 − β) L_s        (10)

where β is a manual weight.

During the prediction phase, we only make start/end span predictions. The straightforward scoring function is:

    Score_raw(i, j) = t_s^i + t_e^j        (11)

where i and j are the start and end token positions, respectively (0 ≤ i ≤ j ≤ n). Considering that ESPReader is forced to pay more attention to whole sentences by the added ESP subtask, which might result in a length growth of the predicted span, we design a scoring function with an inverse length factor (ILF). Note that a shorter span is not always better; shortening only helps when two candidates have similar lengths, for the sake of reducing redundancy. Taking all of this into account, our adopted scoring function is as follows:

    Score_ILF(i, j) = t_s^i + t_e^j + ILF(i, j),
    ILF(i, j) = −µ (j − i) \sqrt{((j − i)/l − 1)^2}        (12)

where l is the average length of all candidate answer spans and µ is a manual weight. When the span length is close to the average, ILF assigns some inhibitory effect to long spans. See Appendix B for a more concrete impression of ILF.
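A small sketch of the prediction-time scoring of Eqs. (11)-(12); variable names are illustrative. At prediction time one would enumerate candidate pairs 0 ≤ i ≤ j ≤ n and keep the highest-scoring span.

```python
# Hypothetical sketch of the ILF scoring function, Eq. (12).
import math

def score_ilf(t_s, t_e, i, j, avg_len, mu=0.1):
    """Score candidate span [i, j]; t_s, t_e are per-token start/end scores and
    avg_len is the average length l of all candidate answer spans."""
    ilf = -mu * (j - i) * math.sqrt(((j - i) / avg_len - 1) ** 2)
    return t_s[i] + t_e[j] + ilf
```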

                      Train     Dev     Test
Questions            10,321   3,351    4,895
Answers per query         1       3        3
Max passage tokens      962     961      980
Max question tokens      89      56       50
Max answer tokens       100      85       92
Avg passage tokens      452     469      472
Avg question tokens      15      15       15
Avg answer tokens        17       9        9

Table 1: Statistics of the CMRC 2018 dataset.

4 Experiment

4.1 Dataset

Our proposed method is evaluated on the extractive Chinese MRC benchmark CMRC 2018 (Cui et al., 2019).


                                        Dev                                Test
Model                          EM          F1          Avg          EM          F1          Avg
Human performance*            91.1        97.4        94.3         92.4        97.9        95.2
BERT_base*                    65.5        84.5        75.0         70.7        87.0        78.9
ELECTRA_base*                 68.4        84.8        76.6         73.1        87.1        80.1
RoBERTa_wwm_ext_base*         67.4        87.2        77.3         72.6        89.4        81.0
MacBERT_base*                 68.5        87.9        78.2         73.2        89.5        81.4
ESPReader
  on BERT_base              68.7 (↑3.2)  86.3 (↑1.8)  77.5 (↑2.5)    –           –           –
  on RoBERTa_wwm_ext_base   70.7 (↑3.3)  88.3 (↑1.1)  79.5 (↑2.2)    –           –           –
  on MacBERT_base           71.8 (↑3.3)  88.7 (↑0.8)  80.3 (↑2.1)  75.6 (↑2.4)  90.0 (↑0.5)  82.8 (↑1.4)
ELECTRA_large*                69.1        85.2        77.2         73.9        87.1        80.5
RoBERTa_wwm_ext_large*        68.5        88.4        78.5         74.2        90.6        82.4
MacBERT_large*                70.7        88.9        79.8         74.8        90.7        82.8
MacBERT_large_extData_v2†      –           –           –           80.4        93.8        87.1
ESPReader
  on RoBERTa_wwm_ext_large  72.3 (↑3.8)  89.4 (↑1.0)  80.9 (↑2.4)    –           –           –
  on MacBERT_large          72.3 (↑1.6)  89.6 (↑0.7)  81.0 (↑1.2)  77.2 (↑2.4)  91.5 (↑0.8)  84.4 (↑1.6)

Table 2: Results on CMRC 2018. Overall best performances are depicted in boldface (base-level and large-level models are marked individually). ↑ refers to the relative increase compared with the corresponding baseline. † refers to unpublished work whose results come from the CMRC 2018 leaderboard. * refers to results from Cui et al. (2020).

CMRC 2018 is similar to SQuAD1.1 (Rajpurkar et al., 2016): given a passage, the model is asked to locate the answer span for a question inside it, and all questions are supposed to be answerable. The official metrics are Exact Match (EM) and the softer F1 score. The dataset details are listed in Table 1.[1]

4.2 Setup

In our ESPReader implementation, we adopt well-trained Chinese PrLMs as the encoder. Meanwhile, for each adopted PrLM, we add a one-layer MLP on its top that directly predicts the start/end positions of the answer span as the default reader, forming the baseline models for comparison.

We consider three Chinese PrLMs: MacBERT (Cui et al., 2020), which helps achieve the current state-of-the-art on CMRC 2018, and the Chinese versions of BERT (base[2]) and RoBERTa (base[3] and large[4]).

Our hyperparameters are in Appendix C.

[1] There is an extra small Challenge set in CMRC 2018. It especially checks the capability of model reasoning, which is beyond the topic of this work focusing on the location unit ambiguity. We test our model on this set and observe an Avg score drop of more than 1%. This is unsurprising and explainable, since our ESP task, which forces the model to locate a certain sentence, might help little with the model's reasoning ability across multiple sentences.

[2] bert-base-chinese
[3] chinese-roberta-wwm-ext
[4] chinese-roberta-wwm-ext-large

4.3 Results

Table 2 shows the experimental results on CMRC 2018. As can be seen, compared with the baselines, our proposed model achieves significant[5] EM and F1 score improvements on both base-level and large-level models, especially on EM, with an overall average increase of more than 2%. Moreover, our ESPReader on MacBERT_base achieves a new state-of-the-art among base-level models on the CMRC 2018 leaderboard[6], gaining an F1 score comparable to MacBERT_large and even outperforming it on EM on both the Dev and Test sets. Besides, ESPReader on MacBERT_large also gains the highest EM and average scores among all published work.

4.4 For Different Types of Chinese MRC Tasks

To validate the generality of our method, we further test ESPReader on two other types of Chinese MRC tasks, DRCD (Shao et al., 2018) and CJRC (Duan et al., 2019) (see Appendix D for dataset details). As shown in Table 3, ESPReader obtains a visible increase on both datasets compared with our baselines.

[5] We perform McNemar's test (McNemar, 1947) to check the statistical significance of our results. For the results in both Tables 2 and 6, we get a p-value < 0.01.
[6] http://ymcui.com/cmrc2018/
[7] We strictly follow the settings provided by Cui et al. (2020) and report the best scores over three individual runs for each baseline.


                            DRCD                   CJRC
Model                  EM / F1 / Avg          EM / F1 / Avg
Cui et al. (2020)
  RoBERTa            89.6 / 94.5 / 92.1     62.4 / 82.2 / 72.3
  MacBERT            91.7 / 95.6 / 93.7     62.9 / 82.5 / 72.7
Our implementation [7]
  RoBERTa            88.8 / 94.1 / 91.5     68.6 / 77.5 / 73.1
  MacBERT            89.8 / 94.9 / 92.4     70.2 / 79.4 / 74.8
ESPReader
  on RoBERTa         89.6 / 94.7 / 92.2     69.9 / 78.6 / 74.3
  on MacBERT         90.3 / 95.1 / 92.7     71.1 / 80.1 / 75.6

Table 3: Results on the Test sets of DRCD and CJRC. RoBERTa and MacBERT refer to chinese-roberta-wwm-ext-large and chinese-macbert-large, respectively.

It is noticed that the improvement on DRCD is not as significant as on CJRC (0.3% vs. 0.8% on MacBERT_large). One possible explanation is that DRCD is a relatively simple task whose average answer length is 4.9, which means most of the answers lie in a single sub-sentence and leave our ESP task unnecessary. To validate this, we compute statistics on the three Chinese MRC datasets to find the proportions of examples where a single sub-sentence is sufficient for extracting the answer span, as shown in Table 4. Note that for 99.2% of the examples in DRCD, the answer span can be found in a single sub-sentence, which is consistent with our assumption.

              Needed sub-sentences
Dataset      one       two      more
CMRC       74.6%     13.1%     12.3%
DRCD       99.2%      0.7%      0.1%
CJRC       94.7%      3.0%      2.3%

Table 4: Proportions of examples classified by the number of sub-sentences needed to extract the answer span.

4.5 For Other Languages

Although ESPReader is specifically designed for Chinese MRC, we also test it on the English MRC benchmark SQuAD2.0 (Rajpurkar et al., 2018).

[8] bert-base-uncased
[9] electra-base-discriminator
[10] Since Asai et al. (2018) only provide test sets for both languages, we fine-tune models on CMRC 2018 (from MacBERT_base) and SQuAD1.1 (from ELECTRA_base) and then directly evaluate on Japanese and French, respectively.

                         SQuAD2.0
Model                EM      F1      Avg
BERT_base* [8]      72.6    74.6    73.6
ELECTRA_base* [9]   80.9    83.8    82.4
ESPReader*
  on BERT_base      74.6    76.2    75.4
  on ELECTRA_base   82.5    85.4    84.0

Table 5: Results on the Dev set of SQuAD2.0. * refers to our implementation.

                  Japanese                French
Model          EM / F1 / Avg          EM / F1 / Avg
Baseline      9.9 / 29.8 / 19.9     25.9 / 45.5 / 35.7
ESPReader    19.6 / 36.2 / 27.9     29.7 / 45.2 / 37.5

Table 6: Zero-shot test[10] on Japanese and French SQuAD datasets.

The results are shown in Table 5. Even though our model is not designed for English tasks, it still achieves an obvious improvement compared with the two English MRC baseline models (1.8% and 1.6% Avg score, respectively), which indicates our model's potential in dealing with tasks in other languages. To validate this further, we conduct a zero-shot test on the Japanese and French datasets provided by Asai et al. (2018), as shown in Table 6. On both languages, our method achieves significant improvements, especially on the former.

The above results indicate that the location unit ambiguity is a common issue in many languages with different degrees of seriousness. As shown in the English MRC example in Figure 3, a clause of the sentence containing the answer span, though strictly defined, can still be totally unrelated to the question.

P: Plants produce oxygen and energy through the photosynthesis of chloroplast, which also exists in Euglena (a unicellular eukaryote).
Q: How do plants produce oxygen and energy?
A: through the photosynthesis of chloroplast
Prediction:
  Baseline: through the photosynthesis of chloroplast, which also exists in Euglena (a unicellular eukaryote)
  Ours: through the photosynthesis of chloroplast

Figure 3: An English MRC example with location unit ambiguity. The span-sentence is underlined.

5 Ablation Study

5.1 Effect of Each Module

To track the improvement sources of our ESPReader, we conduct thorough ablation studies by adding the proposed modules one by one on top of the baseline setting (MacBERT_base).


The results are in Table 7. It is noticed that by only adding the ESP task, the Avg score on the CMRC 2018 Dev set is improved by 1.3%. Further adding the sentence-level self-attention layer or the fusion cross-attention layer separately does not significantly increase the Avg score (0.1% and 0.3%, respectively). However, when both of them are included, another visible improvement (0.9%) is obtained.

Model                         EM      F1      Avg
Baseline (MacBERT_base)      68.5    87.9    78.2
+ ESP                        70.5    88.3    79.4
+ ESP + SSL                  70.9    88.1    79.5
+ ESP + FCL                  71.0    88.4    79.7
+ ESP + SSL + FCL            71.8    88.7    80.3
+ ESP + SSL + FCL - SPE      71.5    88.7    80.1

Table 7: Results on the CMRC 2018 Dev set when adding each module. SSL: sentence-level self-attention layer, FCL: fusion cross-attention layer, SPE: sentence position embedding.

Considering that we additionally introduce the sentence position embedding (E_s) on top of BERT's embedding layer, we also compare the performance of ESPReader with and without E_s. As shown in Table 7, adding E_s brings a marginal improvement (0.2% Avg score).

Model                         EM      F1      Avg
Baseline (BERT_base)         65.5    84.5    75.0
ESPReader
  + RS (BERT_base)           65.8    85.6    75.7
  + ILF (BERT_base)          68.7    86.3    77.5
Baseline (MacBERT_base)      68.5    87.9    78.2
ESPReader
  + RS (MacBERT_base)        68.7    88.1    78.4
  + ILF (MacBERT_base)       71.8    88.7    80.3

Table 8: Ablation study results of scoring functions on the CMRC 2018 Dev set. RS and ILF refer to Score_raw and Score_ILF.

5.2 Scoring Function

We keep the other settings unchanged and adopt the two scoring functions Score_raw and Score_ILF, respectively. The results are listed in Table 8.

It is observed that the proposed scoring function Score_ILF makes a nontrivial contribution to the performance of our model on both EM and F1 scores, especially the former (2.9% on BERT_base and 3.1% on MacBERT_base). This observation is in line with our assumption that ESPReader, forced to pay more attention to whole answer-related sentences by the explicit span-sentence predication, tends to produce longer predicted spans. Note that our model improves both EM and F1 scores to varying degrees even when ILF is not applied. This indicates that the model benefits from the location guidance produced by the ESP task more than it suffers from the span length growth caused by it, which is further lessened by ILF.

                               Error type
Model                      P ⊂ G    P ⊃ G    other
Baseline (MacBERT_base)    21.7%    57.2%    21.1%
ESPReader (RS)             19.9%    63.4%    16.7%
ESPReader (ILF)            26.5%    52.9%    19.5%

Table 9: Overall percentage distribution of each error type on the CMRC 2018 Dev set.

Figure 4: Numbers of each error type on the CMRC 2018 Dev set.

5.3 Error Analysis

To look more deeply into the sources of the precision growth, we further study the distribution of each error type (EM = 0) after applying our method. The overall percentage distribution is shown in Table 9, and the actual counts of each error type are shown in Figure 4. Combining them, we find that with ESP our model decreases both the actual counts and the percentages of all error types (except for the surplus error), of which the actual count of F1 = 0 errors (where the predicted span is totally unrelated) drops by 26.3% (114 → 84). This indicates that ESP effectively corrects the location unit ambiguity issue of Chinese MRC. Note that ILF-based scoring helps reduce surplus errors but causes more omitting errors.

To gain a concrete insight into how our method helps solve the location unit ambiguity, we draw the attention heatmaps of the last encoder layers of MacBERT and our proposed model, as shown in Figure 5.



Figure 5: Visualization of the attention scores (average of all heads) of the last layers of MacBERT_base (left) and ESPReader on MacBERT_base (right). The answer span is depicted in red.

Note that the baseline model attends heavily to the answer-unrelated sub-sentence, whereas the attention distribution of our model is clearly more focused on the span-sentence, which is contributed by our ESP mechanism.

6 Conclusion

This paper addresses the newly discovered difficulty of boundary ambiguity between sentences and sub-sentences, which exists in many languages to different extents and essentially limits the performance of span extraction MRC models, especially in the Chinese setting. We apply explicit span-sentence predication (ESP) to enhance the model's ability to precisely locate the sentences containing the target span. Our proposed model design is evaluated on the Chinese span extraction MRC benchmark CMRC 2018. The experimental results show that our model significantly improves both EM and F1 scores compared with strong baselines and helps achieve a new state-of-the-art performance. Our method also shows generality and potential in dealing with other languages. This work highlights the research line of further improving challenging MRC by analyzing specific linguistic phenomena.

References

Akari Asai, Akiko Eriguchi, Kazuma Hashimoto, and Yoshimasa Tsuruoka. 2018. Multilingual extractive reading comprehension by runtime machine translation. arXiv preprint arXiv:1809.03275.

Eunsol Choi, Daniel Hewlett, Jakob Uszkoreit, Illia Polosukhin, Alexandre Lacoste, and Jonathan Berant. 2017. Coarse-to-fine question answering for long documents. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (ACL), pages 209–220.

Kevin Clark, Minh-Thang Luong, Quoc V. Le, and Christopher D. Manning. 2020. ELECTRA: Pre-training text encoders as discriminators rather than generators. In Proceedings of the International Conference on Learning Representations (ICLR).

Yiming Cui, Wanxiang Che, Ting Liu, Bing Qin, Shijin Wang, and Guoping Hu. 2020. Revisiting pre-trained models for Chinese natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP): Findings, pages 657–668.

Yiming Cui, Ting Liu, Wanxiang Che, Li Xiao, Zhipeng Chen, Wentao Ma, Shijin Wang, and Guoping Hu. 2019. A span-extraction dataset for Chinese machine reading comprehension. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 5886–5891.

Yiming Cui, Ting Liu, Zhipeng Chen, Shijin Wang, and Guoping Hu. 2016. Consensus attention-based neural networks for Chinese reading comprehension. In Proceedings of the 26th International Conference on Computational Linguistics (COLING), pages 1777–1786.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), pages 4171–4186.

Sufeng Duan and Hai Zhao. 2020. Attention is all you need for Chinese word segmentation. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 3862–3872.

Xingyi Duan, Baoxin Wang, Ziyue Wang, Wentao Ma, Yiming Cui, Dayong Wu, Shijin Wang, Ting Liu, Tianxiang Huo, Zhen Hu, et al. 2019. CJRC: A reliable human-annotated benchmark dataset for Chinese judicial reading comprehension. In China National Conference on Chinese Computational Linguistics, pages 439–451. Springer.

Dan Hendrycks and Kevin Gimpel. 2016. Gaussian error linear units (GELUs). arXiv preprint arXiv:1606.08415.

K. M. Hermann, T. Kocisky, E. Grefenstette, L. Espeholt, W. Kay, M. Suleyman, and P. Blunsom. 2015. Teaching machines to read and comprehend. Advances in Neural Information Processing Systems (NIPS), 28.

Felix Hill, Antoine Bordes, Sumit Chopra, and Jason Weston. 2015. The Goldilocks principle: Reading children's books with explicit memory representations. arXiv preprint arXiv:1511.02301.

Minghao Hu, Furu Wei, Yuxing Peng, Zhen Huang, Nan Yang, and Dongsheng Li. 2019. Read+Verify: Machine reading comprehension with unanswerable questions. In Proceedings of the 33rd AAAI Conference on Artificial Intelligence (AAAI-19), pages 6529–6537.

Bernhard Kratzwald, Anna Eigenmann, and Stefan Feuerriegel. 2019. RankQA: Neural question answering with answer re-ranking. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (ACL), pages 6076–6085.

Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Jacob Devlin, Kenton Lee, et al. 2019. Natural Questions: A benchmark for question answering research. Transactions of the Association for Computational Linguistics (TACL), 7:453–466.

Guokun Lai, Qizhe Xie, Hanxiao Liu, Yiming Yang, and Eduard Hovy. 2017. RACE: Large-scale ReAding Comprehension dataset from Examinations. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 785–794.

Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. 2019. ALBERT: A Lite BERT for self-supervised learning of language representations. In Proceedings of the International Conference on Learning Representations (ICLR).

Bohan Li, Hao Zhou, Junxian He, Mingxuan Wang, Yiming Yang, and Lei Li. 2020a. On the sentence embeddings from pre-trained language models. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 9119–9130.

Junyi Jessy Li and Ani Nenkova. 2015. Detecting content-heavy sentences: A cross-language case study. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1271–1281.

Zuchao Li, Rui Wang, Kehai Chen, Masao Utiyama, Eiichiro Sumita, Zhuosheng Zhang, and Hai Zhao. 2020b. Explicit sentence compression for neural machine translation. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 8311–8318.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.

Ying Luo, Fengshun Xiao, and Hai Zhao. 2020. Hierarchical contextualized representation for named entity recognition. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), volume 34, pages 8441–8448.

Jing Ma, Wei Gao, Shafiq Joty, and Kam-Fai Wong. 2019. Sentence-level evidence embedding for claim verification with hierarchical attention networks. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (ACL), pages 2561–2571.

Quinn McNemar. 1947. Note on the sampling error of the difference between correlated proportions or percentages. Psychometrika, 12(2):153–157.

Pranav Rajpurkar, Robin Jia, and Percy Liang. 2018. Know what you don't know: Unanswerable questions for SQuAD. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (ACL), pages 784–789.

Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 2383–2392.

Revanth Gangi Reddy, Md Arafat Sultan, Efsun Sarioglu Kayi, Rong Zhang, Vittorio Castelli, and Avirup Sil. 2020. Answer span correction in machine reading comprehension. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP): Findings, pages 2496–2501.

Siva Reddy, Danqi Chen, and Christopher D. Manning. 2019. CoQA: A conversational question answering challenge. Transactions of the Association for Computational Linguistics (TACL), 7:249–266.

Nils Reimers and Iryna Gurevych. 2019. Sentence-BERT: Sentence embeddings using siamese BERT-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3973–3983.

Chih Chieh Shao, Trois Liu, Yuting Lai, Yiying Tseng, and Sam Tsai. 2018. DRCD: A Chinese machine reading comprehension dataset. arXiv preprint arXiv:1806.00920.

Cun Shen, Tinglei Huang, Xiao Liang, Feng Li, and Kun Fu. 2018. Chinese knowledge base question answering by attention-based multi-granularity model. Information, 9(4):2078–2489.

Kai Sun, Dian Yu, Jianshu Chen, Dong Yu, Yejin Choi, and Claire Cardie. 2019. DREAM: A challenge data set and models for dialogue-based reading comprehension. Transactions of the Association for Computational Linguistics (TACL), 7:217–231.

Ming Tu, Kevin Huang, Guangtao Wang, Jing Huang, Xiaodong He, and Bowen Zhou. 2020. Select, Answer and Explain: Interpretable multi-hop reading comprehension over multiple documents. In Proceedings of the 34th AAAI Conference on Artificial Intelligence (AAAI-20), pages 9073–9080.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems (NIPS), volume 30, pages 5998–6008.

Li Wang. 1984. Chinese grammar theory. Shandong Education Press, China.

Wei Wang, Ming Yan, and Chen Wu. 2018. Multi-granularity hierarchical attention fusion networks for reading comprehension and question answering. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (ACL), pages 1705–1714.

Yi Xu, Hai Zhao, and Zhuosheng Zhang. 2021. Topic-aware multi-turn dialogue modeling. In The Thirty-Fifth AAAI Conference on Artificial Intelligence (AAAI-21).

Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Russ R. Salakhutdinov, and Quoc V. Le. 2019. XLNet: Generalized autoregressive pretraining for language understanding. In Advances in Neural Information Processing Systems (NeurIPS), volume 32, pages 5753–5763.

Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William Cohen, Ruslan Salakhutdinov, and Christopher D. Manning. 2018. HotpotQA: A dataset for diverse, explainable multi-hop question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 2369–2380.

Shuailiang Zhang, Hai Zhao, Yuwei Wu, Zhuosheng Zhang, Xi Zhou, and Xiang Zhou. 2020a. DCMN+: Dual Co-matching Network for multi-choice reading comprehension. In Proceedings of the 34th AAAI Conference on Artificial Intelligence (AAAI-20), pages 9563–9570.

Zhuosheng Zhang, Yuwei Wu, Hai Zhao, Zuchao Li, Shuailiang Zhang, Xi Zhou, and Xiang Zhou. 2020b. Semantics-aware BERT for language understanding. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), volume 34, pages 9628–9635.

Zhuosheng Zhang, Yuwei Wu, Junru Zhou, Sufeng Duan, Hai Zhao, and Rui Wang. 2020c. SG-Net: Syntax guided transformer for language representation. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI).

Zhuosheng Zhang, Junjie Yang, and Hai Zhao. 2021. Retrospective reader for machine reading comprehension. In Proceedings of the 35th AAAI Conference on Artificial Intelligence (AAAI-21).

Zhuosheng Zhang, Hai Zhao, and Rui Wang. 2020d. Machine reading comprehension: The role of contextualized language models and beyond. Computational Linguistics, 1.

Hai Zhao, Deng Cai, Yang Xin, Yuzhu Wang, and Zhongye Jia. 2017. A hybrid model for Chinese spelling check. ACM Transactions on Asian and Low-Resource Language Information Processing (TALLIP), 16(3):1–22.

Junru Zhou, Zhuosheng Zhang, Hai Zhao, and Shuailiang Zhang. 2020. LIMIT-BERT: Linguistics informed multi-task BERT. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP): Findings, pages 4450–4461.


A Details about Sentence Position Embedding

The sentence position embedding (E_s) indicates the sentence offset of each token. For normal tokens, their sentence positions are the offsets of the segmented sentences they belong to. For special tokens, the sentence position of:

• [CLS]: Set as 0.

• [SEP]: Equal to that of the nearest normal token it follows.

• [PAD]: Set as that of the last normal token plus 1.

In this way, every token is assigned a sentence position, and a lookup table then maps these positions to vectors, giving the sentence position embedding, as sketched below.
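A minimal sketch of this assignment rule; the helper name and the way normal-token sentence indices are supplied are assumptions, not the authors' implementation.

```python
# Hypothetical sketch: assign sentence positions to the full wordpiece sequence.
def sentence_positions(tokens, sent_idx_of_normal_token):
    """tokens: the wordpiece sequence including [CLS]/[SEP]/[PAD];
    sent_idx_of_normal_token: iterable yielding the sub-sentence index of each
    normal (non-special) token, in order."""
    positions, last = [], 0
    it = iter(sent_idx_of_normal_token)
    for tok in tokens:
        if tok == "[CLS]":
            positions.append(0)              # [CLS]: set as 0
        elif tok == "[SEP]":
            positions.append(last)           # [SEP]: same as the nearest preceding normal token
        elif tok == "[PAD]":
            positions.append(last + 1)       # [PAD]: last normal token's position plus 1
        else:
            last = next(it)
            positions.append(last)
    return positions
```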

B ILF Curves

To concretely show how ILF responds to different predicted span lengths (j − i), we plot curves of the ILF value, as shown in Figure 6. It can be seen that ILF achieves its minimum value when the span length equals the average length of all candidate spans. Besides, ILF has a more obvious inhibitory effect on longer spans than on shorter ones.

Figure 6: The curves of the ILF value under different average lengths of candidate spans when µ = 0.1.

C Settings of Hyperparameters

For fine-tuning the adopted PrLMs on our tasks, we set the initial learning rate in {3e-5, 5e-5}. The warm-up rate is set to 0.1, with an L2 weight decay of 0. The batch size is 24 for base models and 64 for large models. The number of epochs is 2 in all experiments. Texts are tokenized using wordpieces, with a maximum length of 512 and a doc stride of 128. The manual weights are α = 0.5, β = 0.8 and µ = 0.1.
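For reference, the same settings restated as an illustrative configuration dictionary; the key names are assumptions, not the authors' code.

```python
# Hypothetical restatement of the fine-tuning hyperparameters above.
config = {
    "learning_rate": [3e-5, 5e-5],   # searched per PrLM
    "warmup_rate": 0.1,
    "weight_decay": 0.0,
    "batch_size": {"base": 24, "large": 64},
    "epochs": 2,
    "max_seq_length": 512,
    "doc_stride": 128,
    "alpha": 0.5,                    # aggregation weight, Eq. (7)
    "beta": 0.8,                     # loss weight, Eq. (10)
    "mu": 0.1,                       # ILF weight, Eq. (12)
}
```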

D Details of datasets: DRCD and CJRC

DRCD and CJRC are two Chinese MRC tasks of different types from CMRC 2018.

• DRCD: This is also a span-extraction MRC task, but in Traditional Chinese. Besides, compared with CMRC 2018, DRCD contains many more simple questions with short answers, and its overall average answer length is 4.9.

• CJRC: This is a more complex MRC task, which has yes/no, no-answer and span-extraction questions. This dataset is collected in judicial scenarios. Note that we only use 50% of the samples in big-train-data.json for training, for a fair comparison with previous work.