CoNLL-2012 Shared Task: Modeling Multilingual Unrestricted Coreference in OntoNotes
Sameer Pradhan
Raytheon BBN Technologies, Cambridge, MA 02138, USA

Alessandro Moschitti
University of Trento, 38123 Povo (TN), Italy

Nianwen Xue
Brandeis University, Waltham, MA 02453, USA

Olga Uryupina
University of Trento, 38123 Povo (TN), Italy

Yuchen Zhang
Brandeis University, Waltham, MA 02453, USA
Abstract
The CoNLL-2012 shared task involved predicting coreference in English, Chinese, and Arabic, using the final version, v5.0, of the OntoNotes corpus. It was a follow-on to the English-only task organized in 2011. Until the creation of the OntoNotes corpus, resources in this sub-field of language processing were limited to noun phrase coreference, often on a restricted set of entities, such as the ACE entities. OntoNotes provides a large-scale corpus of general anaphoric coreference that is not restricted to noun phrases or to a specified set of entity types, and covers multiple languages. OntoNotes also provides additional layers of integrated annotation, capturing additional shallow semantic structure. This paper describes the OntoNotes annotation (coreference and the other layers) and the parameters of the shared task, including the format, pre-processing information, and evaluation criteria, and then presents and discusses the results achieved by the participating systems. The task of coreference has had a complex evaluation history. The potentially large number of evaluation conditions has, in the past, made it difficult to judge the improvement of new algorithms over previously reported results. Having a standard test set and evaluation parameters, all based on a resource that provides multiple integrated annotation layers (parses, semantic roles, word senses, named entities, and coreference) in multiple languages, could support joint models and should help ground and energize ongoing research in the task of entity and event coreference.
1 Introduction
The importance of coreference resolution for the entity/event detection task, namely identifying all mentions of entities and events in text and clustering them into equivalence classes, has been well recognized in the natural language processing community. Automatic identification of coreferring entities and events in text has been an uphill battle for several decades, partly because it can require world knowledge, which is not well-defined, and partly owing to the lack of substantial annotated data. Aside from the fact that resolving coreference in text is a very hard problem, two hindrances have further dampened its progress:
(i) Smaller-sized corpora such as MUC, which covered coreference across all noun phrases, and corpora such as ACE, which were larger in size but covered a smaller set of entities and attempted to cover multiple coreference phenomena that are not all annotatable with equally high agreement; and
(ii) Complex evaluation, with multiple evaluation metrics and multiple possible evaluation scenarios, complicated by varying training and test partitions. This led many researchers to report only one or a few of the metrics, under only a subset of the evaluation scenarios, making it harder to gauge the improvements in algorithms over the years (Stoyanov et al., 2009), or to determine which particular areas require further attention. Depending on which numbers in the literature one looks at, the perceived difficulty of the task varies greatly: it can seem to be a very hard problem (Soon et al., 2001) or one that is somewhat easier (Culotta et al., 2007).
A step in the right direction for the benefit of the community was therefore to:
(i) Create a large corpus with high annotator agreement, possibly by restricting the coreference annotation to phenomena that can be annotated with high agreement, while covering an unrestricted set of entities and events; and
(ii) Create a standard evaluation scenario with an official evaluation setup, and possibly several ablation settings to capture the range of performance, which could then be used as a standard benchmark by various researchers.
The creation of the OntoNotes corpus automatically addressed the first issue. The SemEval-2010 coreference task was the first attempt at addressing the second. Among other corpora, a small subset (∼120k words) of the English portion of OntoNotes was used for this purpose. However, it did not receive enough participation, which prevented the organizers from reaching any strong conclusions. The CoNLL-2011 shared task was another attempt to address the second issue. It was well received, but was limited to the English portion of OntoNotes. In addition, the coreference portion of OntoNotes did not have a concrete baseline prior to the 2011 evaluation, making it challenging for participants to gauge the performance of their algorithms in the absence of an established state of the art on this flavor of annotation. The closest comparison was to the results reported by Pradhan et al. (2007b) on the newswire portion of OntoNotes. Since the corpus also covers Chinese and Arabic, it provided a great opportunity for a follow-up task in 2012 covering all three languages. As we will see later, peculiarities of each of these languages had to be considered in creating the evaluation framework.
In the language processing community, the field of speech recognition probably has the longest history of shared evaluations, held primarily by NIST1 (Pallett, 2002). In the past decade, machine translation has also been a topic of shared evaluations by NIST.2 There are many phenomena of syntax and semantics that are not quite amenable to such continued evaluation efforts. CoNLL shared tasks over the past 15 years have tried to fill that gap, helping to establish benchmarks and promoting the advancement of the state of the art in various sub-fields within NLP. The importance of shared tasks has now permeated the domain of clinical NLP (Chapman et al., 2011), and recently a coreference task was organized as part of the i2b2 workshop (Uzuner et al., 2012).
The rest of the paper is organized as follows: Sec-tion 2 gives a quick overview of the research in
1 http://www.itl.nist.gov/iad/mig/publications/ASRhistory/index.html
2 http://www.itl.nist.gov/iad/mig/tests/mt/
coreference. Section 3 presents an overview of the OntoNotes corpus. Section 4 describes the range of phenomena annotated in OntoNotes, along with language-specific issues. Section 5 describes the shared task data and the evaluation parameters, with Section 5.4.2 looking at the performance of state-of-the-art tools on the intermediate layers of annotation. The following sections briefly describe the approaches taken by the various participating systems and present the system results with some analysis, and the final section concludes the paper.
2 Background
Early work on corpus-based coreference resolution dates back to the mid-90s, when McCarthy and Lehnert (1995) experimented with decision trees and hand-written rules. A systematic study using decision trees was then conducted by Soon et al. (2001). Significant improvements have been made in the field of language processing in general, and improved learning techniques have been developed to push the state of the art in coreference resolution forward (Morton, 2000; Harabagiu et al., 2001; McCallum and Wellner, 2004; Culotta et al., 2007; Denis and Baldridge, 2007; Rahman and Ng, 2009; Haghighi and Klein, 2010). Researchers have continued finding novel ways of exploiting ontologies such as WordNet, and various knowledge sources, from shallow semantics to encyclopedic knowledge, are being exploited (Ponzetto and Strube, 2005; Ponzetto and Strube, 2006; Versley, 2007; Ng, 2007). Given that WordNet is a static ontology and as such has limitations on coverage, there have more recently been successful attempts to utilize information from much larger, collaboratively built resources such as Wikipedia (Ponzetto and Strube, 2006). Researchers have also used graph-based algorithms. For a detailed treatment of progress in this field, we refer the reader to a recent survey article (Ng, 2010) and a tutorial (Ponzetto and Poesio, 2009) dedicated to this subject.
In spite of all the progress, current techniques still rely primarily on surface-level features such as string match, proximity, and edit distance; syntactic features such as apposition; and shallow semantic features such as number, gender, named entities, semantic class, Hobbs' distance, etc. Corpora to support supervised learning of this task date back to the Message Understanding Conferences (MUC) (Hirschman and Chinchor, 1997; Chinchor, 2001; Chinchor and Sundheim, 2003). The de facto standard datasets for current coreference studies are the MUC and the ACE3 (Doddington et al., 2004) corpora. These corpora were tagged with coreferring entities identified by noun phrases in the text. The MUC corpora cover all noun phrases in text, but represent small training and test sets. The ACE corpora, on the other hand, have much more annotation, but are restricted to a small subset of entities. They are also less consistent in terms of inter-annotator agreement (ITA) (Hirschman et al., 1998). This lessens the reliability of statistical evidence, in the form of lexical coverage and semantic relatedness, that could be derived from the data and used by a classifier to generate better predictive models. The importance of a well-defined tagging scheme and consistent ITA has been well recognized and studied in the past (Poesio, 2004; Poesio and Artstein, 2005; Passonneau, 2004). There is a growing consensus that in order for these resources to be most useful for language understanding applications such as question answering or distillation – both of which seek to take information access technology to the next level – we need more consistent annotation of larger amounts of broad-coverage data for training better automatic techniques for entity and event identification. Identification and encoding of richer knowledge – possibly linked to knowledge sources – and development of learning algorithms that can effectively incorporate them is a necessary next step towards improving the current state of the art. The computational learning community, in general, is also witnessing a move towards evaluations based on joint inference, with two previous CoNLL tasks (Surdeanu et al., 2008; Hajič et al., 2009) devoted to joint learning of syntactic and semantic dependencies. A principal ingredient for joint learning is the presence of multiple layers of semantic information.
One fundamental question still remains: what would it take to improve the state of the art in coreference resolution that has not been attempted so far? Many different algorithms have been tried in the past 15 years, but one thing that was lacking until now was a corpus comprehensively tagged on a large scale with consistent, rich semantic information. One of the many goals of the OntoNotes project4 (Hovy et al., 2006; Weischedel et al., 2011) was to explore whether it could fill this void and help push the progress further – not only in coreference, but with the various layers of semantics that it tries to capture.
3 http://projects.ldc.upenn.edu/ace/data/
4 http://www.bbn.com/nlp/ontonotes
As one of its layers, OntoNotes provides a corpus for general anaphoric coreference that covers entities and events not limited to noun phrases or a limited set of entity types. As mentioned earlier, the coreference layer in OntoNotes constitutes just one part of a multi-layered, integrated annotation of the shallow semantic structure in text with high inter-annotator agreement, which also provides a unique opportunity for performing joint inference over a substantial body of data.
3 The OntoNotes Corpus

The OntoNotes project has created a corpus of large-scale, accurate, and integrated annotation of multiple levels of the shallow semantic structure in text. The English and Chinese portions each comprise roughly one million words from newswire, magazine articles, broadcast news, broadcast conversation, web data, and conversational speech. The English corpus also contains a further 200k words of the English translation of the New Testament. The Arabic portion is smaller, comprising 300k words of newswire articles. The idea is that this rich, integrated annotation covering many layers will allow for richer, cross-layer models, enabling significantly better automatic semantic analysis. In addition to coreference, this data is also tagged with syntactic trees, high-coverage verb and some noun propositions, partial verb and noun word senses, and 18 named entity types. Over the years of the development of this corpus, various priorities came into play, and therefore not all the data in the corpus is annotated with all the different layers of annotation. However, such multi-layer annotation, with complex, cross-layer dependencies, demands a robust, efficient, scalable mechanism for storage, while providing efficient, convenient, integrated access to the underlying structure. To this end, OntoNotes uses a relational database representation that captures both the inter- and intra-layer dependencies and also provides an object-oriented API for efficient, multi-tiered access to the data (Pradhan et al., 2007a). This should facilitate the creation of cross-layer features in integrated predictive models that make use of these annotations.
OntoNotes comprises the following layers of an-notation:
• Syntax – A syntactic layer representing a revised Penn Treebank (Marcus et al., 1993; Babko-Malaya et al., 2006).
• Propositions – The proposition structure of verbs in the form of a revised PropBank (Palmer et al., 2005; Babko-Malaya et al., 2006), including the Chinese and Arabic PropBanks.
• Word Sense – Coarse-grained word senses are tagged for the most frequent polysemous verbs and nouns, in order to maximize coverage. The word sense granularity is tailored to achieve 90% inter-annotator agreement, as demonstrated by Palmer et al. (2007). These senses are defined in the sense inventory files, and each individual sense has been connected to multiple WordNet senses, providing direct access to the WordNet semantic structure. For the English portion of OntoNotes, there is also a mapping from the word senses to the PropBank frames and to VerbNet (Kipper et al., 2000) and FrameNet (Fillmore et al., 2003). Unfortunately, owing to the lack of resources as comprehensive as WordNet for Chinese or Arabic, neither language has any such inter-resource mappings available.
• Named Entities – The corpus was tagged with a set of 18 proper named entity types that were well-defined and well-tested for inter-annotator agreement by Weischedel and Brunstein (2005).
• Coreference – This layer captures general anaphoric coreference that covers entities and events not limited to noun phrases or a limited set of entity types (Pradhan et al., 2007b). It considers all pronouns (PRP, PRP$), noun phrases (NP), and heads of verb phrases (VP) as potential mentions. Unlike English, Chinese and Arabic have dropped subjects and objects, which were also considered during coreference annotation. We will take a look at this in detail in the next section.
4 Coreference in OntoNotes
General anaphoric coreference that spans a rich set of entities and events – not restricted to a few types, as has been characteristic of most coreference data available until now – has been tagged with a high degree of consistency. Two different types of coreference are distinguished in the OntoNotes data: Identity (IDENT) and Appositive (APPOS). Identity coreference (IDENT) is used for anaphoric coreference, meaning links between pronominal, nominal, and named mentions of specific referents. It does not include mentions of generic, underspecified, or abstract entities. Appositives (APPOS) are treated separately because they function as attributions, as described further below. Coreference is annotated for all specific entities and events. There is no limit on the semantic types of NP entities that can be considered for coreference; in particular, coreference is not limited to ACE types.
4.1 Noun Phrases

The mentions over which IDENT coreference applies are typically pronominal, named, or definite nominal. The annotation process begins by automatically extracting all of the NP mentions from the Penn Treebank, though the annotators can also add additional mentions when appropriate. In the following two examples (and later ones), the phrases notated in bold form the links of an IDENT chain.

(1) She had a good suggestion and it was unanimously accepted by all.

(2) Elco Industries Inc. said it expects net income in the year ending June 30, 1990, to fall below a recent analyst's estimate of $1.65 a share. The Rockford, Ill., maker of fasteners also said it expects to post sales in the current fiscal year that are "slightly above" fiscal 1989 sales of $155 million.
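To make the extraction step concrete, here is a minimal sketch of pulling candidate mention spans – NPs, pronouns (PRP, PRP$), and verbal heads of VPs, per the mention inventory described in Section 3 – from a Penn Treebank style parse. It assumes nltk and an illustrative parse string; it is not the actual annotation pipeline.

from nltk import Tree

def candidate_mentions(parse):
    # Map each leaf's tree position to its flat token index.
    index_of = {p: i for i, p in enumerate(parse.treepositions('leaves'))}

    def span(tp):
        # Flat token span covered by the node at tree position tp.
        covered = [i for p, i in index_of.items() if p[:len(tp)] == tp]
        return min(covered), max(covered)

    mentions = []
    for tp in parse.treepositions():
        node = parse[tp]
        if not isinstance(node, Tree):
            continue
        if node.label() in ('NP', 'PRP', 'PRP$'):
            mentions.append(span(tp))
        elif node.label() == 'VP':
            # Mark only the single-word verbal head of the VP.
            for i, child in enumerate(node):
                if isinstance(child, Tree) and child.label().startswith('VB'):
                    mentions.append(span(tp + (i,)))
                    break
    return sorted(set(mentions))

tree = Tree.fromstring(
    "(S (NP (PRP She)) (VP (VBD had) (NP (DT a) (JJ good) (NN suggestion))))")
print(candidate_mentions(tree))   # [(0, 0), (1, 1), (2, 4)]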
Noun phrases (NPs) in Chinese can be complex noun phrases or bare nouns (nouns that lack a determiner such as "the" or "this"). Complex noun phrases contain structures modifying the head noun, as in the following examples:

(3) [他担任 总统 任内 最后 一 次 的 [亚太 经济 合作 会议 [高峰会]]]
[[His last APEC [summit meeting]] as the President]

(4) [越南 统一 后 [第一 位 前往 当地 访问 的 [美国总统]]]
[[The first [U.S. president]] who went to visit Vietnam after its unification]

In these examples, the smallest phrase in square brackets is the bare noun, and the longer phrases in square brackets include modifying structures. All the expressions in square brackets, however, share the same head noun, i.e., "高峰会 (summit meeting)" and "美国总统 (U.S. president)" respectively. Nested noun phrases, or nested NPs, are contained within longer noun phrases. In the above examples, "summit meeting" and "U.S. president" are nested NPs. Wherever NPs are nested, the largest logical span is used in coreference.
4.2 Verbs

Verbs are added as single-word spans if they can be coreferenced with a noun phrase or with another verb. The intent is to annotate the VP, but the single-word head is marked for convenience. This includes morphologically related nominalizations (5) and noun phrases that refer to the same event, even if they are lexically distinct from the verb (6). In the following two examples, only the chains related to the growth event are shown. The Arabic translation of the same example identifies mentions through a line on top of the tokens.

(5) Sales of passenger cars grew 22%. The strong growth followed year-to-year increases.
[Arabic translation of example (5), with the coreferent mentions marked by a line above the tokens.]
(6) Japan's domestic sales of cars, trucks and buses in October rose 18% from a year earlier to 500,004 units, a record for the month, the Japan Automobile Dealers' Association said. The strong growth followed year-to-year increases of 21% in August and 12% in September.
4.3 Pronouns

All pronouns and demonstratives are linked to anything that they refer to, and pronouns in quoted speech are also marked. Expletive or pleonastic pronouns (it, there) are not considered for tagging, and generic you is not marked. In the following example, the pronouns you and it would not be marked. (In this and following examples, an asterisk (*) before a boldface phrase identifies entity/event mentions that would not be tagged as coreferent.)

(7) Senate majority leader Bill Frist likes to tell a story from his days as a pioneering heart surgeon back in Tennessee. A lot of times, Frist recalls, *you'd have a critical patient lying there waiting for a new heart, and *you'd want to cut, but *you couldn't start unless *you knew that the replacement heart would make *it to the operating room.
In Chinese, if the subject or object can be recovered from the context, or is of little interest to the reader/listener, it can be omitted. In the Chinese Treebank, the position where a subject or object is omitted is annotated with a small *pro*. A *pro* can be linked with overt NPs if they refer to the same entity or event, and the *pro* and the overt NPs do not have to be in the same sentence. Exactly what a *pro* stands for is determined by the linguistic context in which it appears.

(8) 吉林省主管经贸工作的副省长全哲洙说：“[*pro*] 欢迎国际社会同 [我们] 一道，共同推进图门江开发事业，促进区域经济发展，造福东北亚人民。”
Quan Zhezhu, Vice Governor of Jilin Province who is in charge of economics and trade, said: "[*pro*] Welcome international societies to join [us] in the development of Tumen Jiang, so as to promote regional economic development and benefit people in Northeast Asia."

Sometimes, *pro*s cannot be recovered in the text – i.e., they cannot be replaced by an overt NP in the text. For example, a *pro* in an existential sentence usually cannot be recovered or linked in the annotation, as in the following case:

(9) [*pro*] 有二十三项高新技术项目进区开发。
There are 23 high-tech projects under development in the zone.

Also, if a *pro* does not refer to a specific entity or event, it is considered a generic *pro* and is not linked. Finally, *pro*s in idiomatic expressions are not linked.

(10) 肯德基、麦当劳等速食店全大陆都推出了 [*pro*] 买套餐赠送布质或棉质圣诞老人玩具的促销。
In Mainland China, fast food restaurants such as Kentucky Fried Chicken and McDonald's have launched promotional packages providing a free cloth or cotton Santa toy for each combo [*pro*] purchased.

Similar to Chinese, Arabic null subjects and objects are also eligible for coreference. In the Arabic Treebank, these are marked with just an "*".
4.4 Generic mentions

Generic nominal mentions can be linked with referring pronouns and other definite mentions, but are not linked to other generic nominal mentions. This allows linking of the bracketed mentions in (11) and (12), but not in (13).
(11) Officials said they are tired of making the same statements.

(12) Meetings are most productive when they are held in the morning. Those meetings, however, generally have the worst attendance.

(13) Allergan Inc. said it received approval to sell the PhacoFlex intraocular lens, the first foldable silicone lens available for *cataract surgery. The lens' foldability enables it to be inserted in smaller incisions than are now possible for *cataract surgery.
Bare plurals, as in (11) and (12), are always considered generic. In example (14) below, there are three generic instances of parents. These are marked as distinct IDENT chains (with separate chains distinguished by subscripts X, Y, and Z), each containing a generic mention and the related referring pronouns.

(14) ParentsX should be involved with theirX children's education at home, not in school. TheyX should see to it that theirX kids don't play truant; theyX should make certain that the children spend enough time doing homework; theyX should scrutinize the report card. ParentsY are too likely to blame schools for the educational limitations of theirY children. If parentsZ are dissatisfied with a school, theyZ should have the option of switching to another.
In (15) below, the verb "halve" cannot be linked to "a reduction of 50%", since "a reduction" is indefinite.

(15) Argentina said it will ask creditor banks to *halve its foreign debt of $64 billion – the third-highest in the developing world. Argentina aspires to reach *a reduction of 50% in the value of its external debt.
4.5 Pre-modifiers

Proper pre-modifiers can be coreferenced, but proper nouns that are in a morphologically adjectival form are treated as adjectives and are not coreferenced. For example, adjectival forms of GPEs, such as Chinese in "the Chinese leader", would not be linked. Thus we could coreference United States in "the United States policy" with another referent, but not American in "the American policy". GPE and nationality acronyms (e.g., U.S.S.R. or U.S.) are also considered adjectival. Pre-modifier acronyms can be coreferenced unless they refer to a nationality. Thus, in the examples below, FBI can be coreferenced to other mentions, but U.S. cannot.
(16) FBI spokesman
(17) *U.S. spokesman
Dates and monetary amounts can be considered part of a coreference chain even when they occur as pre-modifiers.

(18) The current account deficit on France's balance of payments narrowed to 1.48 billion French francs ($236.8 million) in August from a revised 2.1 billion francs in July, the Finance Ministry said. Previously, the July figure was estimated at a deficit of 613 million francs.

(19) The company's $150 offer was unexpected. The firm balked at the price.
4.6 Copular verbs

Attributes signaled by copular structures are not marked; these are attributes of the referent they modify, and their relationship to that referent will be captured through word sense and propositional argument tagging.

(20) JohnX is a linguist. PeopleY are nervous around JohnX, because heX always corrects theirY grammar.

Copular (or 'linking') verbs are those verbs that function as a copula and are followed by a subject complement. Some common copular verbs are: be, appear, feel, look, seem, remain, stay, become, end up, get. Subject complements following such verbs are considered attributes and are not linked. Since Called is copular, neither IDENT nor APPOS coreference is marked in the following case.

(21) Called Otto's Original Oat Bran Beer, the brew costs about $12.75 a case.
4.7 Small clauses

Like copulas, small clause constructions are not marked. The following example is treated as if the copula were present ("John considers Fred to be an idiot"):
(22) John considers *Fred *an idiot.
4.8 Temporal expressions

Temporal expressions such as the following are linked:
(23) John spent three years in jail. In that time...
Deictic expressions such as now, then, today, tomorrow, yesterday, etc. can be linked, as can other temporal expressions that are relative to the time of the writing of the article, and which may therefore require knowledge of the time of writing to resolve the coreference. Annotators were allowed to use knowledge from outside the text in resolving these cases. In the following example, the end of this period and that time can be coreferenced, as can this period and from three years to seven years.

(24) The limit could range from three years to seven yearsX, depending on the composition of the management team and the nature of its strategic plan. At (the end of (this period)X)Y, the poison pill would be eliminated automatically, unless a new poison pill were approved by the then-current shareholders, who would have an opportunity to evaluate the corporation's strategy and management team at that timeY.

In multi-date temporal expressions, embedded dates are not separately connected to other mentions of that date. For example, in Nov. 2, 1999, Nov. would not be linked to another instance of November later in the text.
4.9 Appositives

Because they logically represent attributions, appositives are tagged separately from identity coreference. They consist of a head, or referent (a noun phrase that points to a specific object/concept in the world), and one or more attributes of that referent. An appositive construction contains a noun phrase that modifies an immediately-adjacent noun phrase (separated only by a comma, colon, dash, or parenthesis). It often serves to rename or further define the first mention. Marking appositive constructions allows us to capture the attributed property even though there is no explicit copula.
(25) Johnhead, a linguistattribute
The head of each appositive construction is distinguished from the attribute according to the following heuristic specificity scale, in decreasing order from top to bottom:

Type                    Example
Proper noun             John
Pronoun                 He
Definite NP             the man
Indefinite specific NP  a man I know
Non-specific NP         man
This leads to the following cases:
(26) Johnhead, a linguistattribute
(27) A famous linguistattribute, hehead studied at ...
(28) a principal of the firmattribute, J. Smithhead
In cases where the two members of the appositive are equivalent in specificity, the left-most member of the appositive is marked as the head/referent. Definite NPs include NPs with a definite marker (the) as well as NPs with a possessive adjective (his). Thus the first element is the head in all of the following cases:
(29) The chairman, the man who never gives up
(30) The sheriff, his friend
(31) His friend, the sheriff
In the specificity scale, specific names of diseases and technologies are classified as proper names, whether they are capitalized or not.
(32) A dangerous bacteria, bacillium, is found
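As a toy encoding of this heuristic (the type labels and the data structure below are our own, purely illustrative):

# Lower rank = more specific; ties are broken by textual position,
# so the left-most member wins, as described above.
SPECIFICITY_RANK = {
    'proper_noun': 0,
    'pronoun': 1,
    'definite_np': 2,
    'indefinite_specific_np': 3,
    'non_specific_np': 4,
}

def appositive_head(members):
    # members: list of (position, mention_type) pairs in textual order.
    return min(members, key=lambda m: (SPECIFICITY_RANK[m[1]], m[0]))

# "The sheriff, his friend": both definite, so the left-most is the head.
print(appositive_head([(0, 'definite_np'), (1, 'definite_np')]))
# "A famous linguist, he ...": the pronoun outranks the indefinite NP.
print(appositive_head([(0, 'indefinite_specific_np'), (1, 'pronoun')]))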
When the entity to which an appositive refers is also mentioned elsewhere, only the single span containing the entire appositive construction is included in the larger IDENT chain; none of the nested NP spans are linked, and the sub-spans are not included separately in the IDENT chain. In the example below, the entire span can be linked to later mentions of Richard Godown.

(33) Richard Godown, president of the Industrial Biotechnology Association
Ages are tagged as attributes (as if they were ellipses of, for example, a 42-year-old):

(34) Mr. Smithhead, 42attribute
Similar rules apply for Chinese and Arabic.
4.10 Special Issues

In addition to the cases above, there are some special ones, such as:
• No coreference is marked between an organi-zation and its members.
• GPEs are linked to references to their governments, even when the references are nested NPs, or the modifier and head of a single NP.
Type               Description
Annotator Error    An annotator error; a catch-all category for cases of errors that do not fit in the other categories.
Genuine Ambiguity  Genuinely ambiguous; often the case with pronouns that have no clear antecedent (especially this and that).
Generics           One annotator thought this was a generic mention, and the other did not.
Guidelines         The guidelines need to be clearer about this example.
Callisto Layout    Something to do with the usage/design of Callisto.
Referents          Each annotator thought this was referring to two completely different things.
Possessives        One annotator did not mark this possessive.
Verb               One annotator did not mark this verb.
Pre Modifiers      One annotator did not mark this pre-modifier.
Appositive         One annotator did not mark this appositive.
Extent             Both annotators marked the same entity, but one annotator's mention was longer.
Copula             Disagreement arose because the mention is part of a copular structure: (a) each annotator marked a different half of the copula, or (b) one annotator unnecessarily marked both.

Figure 1: Description of the various disagreement types.
[Figure 2 shows the distribution as a bar chart: Annotator Error 26%, Genuine Ambiguity 25%, Generics 11%, Guidelines 8%, Callisto Layout 8%, Referents 7%, Possessives 4%, Verbs 3%, Pre Modifiers 3%, Appositives 3%, Copulae 2%.]

Figure 2: The distribution of disagreements across the various types in Figure 1, for a sample of 15K disagreements in the English portion of the corpus.
• In extremely rare cases, metonymic mentions can be coreferenced. This is done only when the two mentions clearly and without a doubt refer to the same entity. For example:

(35) In a statement released this afternoon, [10 Downing Street] called the bombings in Casablanca "a strike against all peace-loving people."

(36) In a statement, [Britain] called the Casablanca bombings "a strike against all peace-loving people."

In this case, it is obvious that "10 Downing Street" and "Britain" are being used interchangeably in the text. If there is any ambiguity, however, these terms are not coreferenced with each other.
• In Arabic, verbal inflections are not consideredpronominal and are not coreferenced.
4.11 Annotator Agreement and Analysis

Table 1 shows the inter-annotator and annotator-adjudicator agreement on all the genres of OntoNotes. We also analyzed about 15K disagreements in various parts of the English data, and grouped them into the categories shown in Figure 1. Figure 2 shows the distribution of these different types in that sample. It can be seen that genuine ambiguity and annotator error are the biggest contributors – the latter of which is usually caught during adjudication, which explains the increased agreement between the adjudicated version and the individual annotator versions.
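Since both Table 1 and the shared-task evaluation rely on the MUC metric, the following sketch shows the link-based MUC computation in the style of Vilain et al. (1995). Chains are represented as sets of mention identifiers; this is an illustration, not the official scorer.

def muc(key, response):
    # partitions(S): number of pieces chain S is cut into by the other
    # partitioning; mentions absent from every other chain count as
    # singleton pieces.
    def partitions(chain, other_chains):
        pieces = [chain & c for c in other_chains if chain & c]
        covered = set().union(*pieces) if pieces else set()
        return len(pieces) + len(chain - covered)

    def link_score(gold, system):
        num = sum(len(c) - partitions(c, system) for c in gold)
        den = sum(len(c) - 1 for c in gold)
        return num / den if den else 0.0

    recall = link_score(key, response)
    precision = link_score(response, key)
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return recall, precision, f1

key = [{'a', 'b', 'c'}, {'d', 'e'}]
response = [{'a', 'b'}, {'c', 'd', 'e'}]
print(muc(key, response))   # recall = precision = F1 = 2/3 here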
5 CoNLL-2012 Coreference Task
The CoNLL-2012 shared task was held on all three languages of the OntoNotes v5.0 data – English, Chinese, and Arabic – which come from quite different language families. The task was to automatically identify mentions of entities and events in text and to link the coreferring mentions together to form entity/event chains.
Language  Genre                   A1-A2  A1-ADJ  A2-ADJ
English   Newswire                80.9   85.2    88.3
          Broadcast News          78.6   83.5    89.4
          Broadcast Conversation  86.7   91.6    93.7
          Magazine                78.4   83.2    88.8
          Web                     85.9   92.2    91.2
          Call Home               81.3   94.1    84.7
          New Testament           89.4   96.0    92.0
Chinese   Newswire                73.6   84.8    75.1
          Broadcast News          80.5   86.4    91.6
          Broadcast Conversation  84.1   90.7    91.2
          Magazine                74.9   81.2    80.0
          Web                     87.6   92.3    93.5
          Call Home               65.6   86.6    77.1
Arabic    Newswire                73.8   88.1    75.6

Table 1: Inter-annotator (A1-A2) and annotator-adjudicator (A1-ADJ, A2-ADJ) agreement for the coreference layer in OntoNotes, measured in terms of the MUC score.
The coreference decisions had to be made using automatically predicted information on the other structural and semantic layers, including the parses, semantic roles, word senses, and named entities. Given various factors, such as the lack of resources, state-of-the-art tools, and time constraints, we could not provide some layers of information for the Chinese and Arabic portions of the data. The morphology of these languages is quite different: Arabic has a complex morphology, English has limited morphology, and Chinese has essentially no morphology. English word segmentation amounts to rule-based tokenization and is close to perfect; for Chinese and Arabic, although segmentation is not as good as for English, the accuracies are in the high 90s. Syntactically, there are many dropped subjects and objects in Arabic and Chinese, whereas English has none. Another difference is the amount of resources available for each language: English probably has the most resources at its disposal, whereas Chinese and Arabic lack significantly – Arabic more so than Chinese. Given this, plus the facts that the CoNLL format cannot handle multiple segmentations and that alternative segmentations would complicate scoring since we are using exact token boundaries, we decided to use gold, treebank segmentation for all languages. In the case of Chinese, the words themselves are lemmas, so no additional information needs to be provided. For Arabic, we decided to also provide correct, gold lemmas, along with the correct vocalized version of the tokens. Table 2 lists all the predicted layers of information provided for each language.
Layer            English  Chinese  Arabic
Segmentation     •        •        •
Lemma            √        –        •
Parse            √        √        √
Proposition      √        √        ×
Predicate Frame  √        ×        ×
Word Sense       √        √        √
Named Entities   √        ×        ×
Speaker          •        •        –

Table 2: Summary of predicted layers provided for each language. A "•" indicates gold annotation.
As is customary for CoNLL tasks, there were two primary tracks, closed and open. For the closed track, systems were limited to using the distributed resources, in order to allow a fair comparison of algorithm performance, while the open track allowed almost unrestricted use of external resources in addition to the provided data. Within each of the closed and open tracks, we had an optional supplementary track which allowed us to run some ablation studies over a few different input conditions. This allowed us to evaluate the systems given: i) gold parses; ii) gold mention boundaries; and iii) gold mentions.
5.1 Primary Evaluation

The primary evaluation comprises the closed and open tracks, where predicted information is provided on all layers of the test set other than coreference. As mentioned earlier, we provide gold lemma and vocalization information for Arabic, and we use gold treebank segmentation for all three languages.
5.1.1 Closed Track

In the closed track, systems were limited to the provided data. For the training and test data, in addition to the underlying text, predicted versions of all the supplementary layers of annotation were provided using off-the-shelf tools (parsers, semantic role labelers, named entity taggers, etc.) retrained on the training portion of the OntoNotes v5.0 data, as described in Section 5.4.2. For the training data, however, in addition to predicted values for the other layers, we also provided manual gold-standard annotations for all the layers. Participants were allowed to use either the gold-standard or the predicted annotation for training their systems. They were also free to use the gold-standard data to train their own models for the various layers of annotation, if they judged that those would provide more accurate predictions, or alternative predictions for use as multiple views, or if they wished to use a lattice of predictions.
More so than in previous CoNLL tasks, coreference predictions depend on world knowledge, and many state-of-the-art systems use information from external resources such as WordNet, which can add a layer of information that helps a system recognize semantic connections between the various lexicalized mentions in the text. Therefore, in the case of English, similar to the previous year's task, we allowed the use of WordNet for the closed track. Since word senses in OntoNotes are predominantly coarse-grained groupings of WordNet senses,6 systems could also map from the predicted or gold-standard word senses to the sets of underlying WordNet senses. Another significant piece of knowledge that is particularly useful for coreference, but that is not available in the layers of OntoNotes, is number and gender. There are many different ways of predicting these values, with differing accuracies, so in order to ensure that participants in the closed track were working from the same data, thus allowing clearer algorithmic comparisons, we specified a particular table of number and gender predictions generated by Bergsma and Lin (2006), for use during both training and testing. Unfortunately, neither Arabic nor Chinese has comparable resources available that we could allow participants to use.
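As a sketch of how such a table might be consumed, consider the loader below; the exact file layout (a phrase, then masculine/feminine/neuter/plural counts) is an assumption made for illustration, not a specification of the distributed resource.

def load_gender_table(path):
    # Assumed layout: one phrase per line, a TAB, then four
    # whitespace-separated counts (masculine, feminine, neuter,
    # plural); the most frequent value wins.
    labels = ('masculine', 'feminine', 'neuter', 'plural')
    table = {}
    with open(path, encoding='utf-8') as f:
        for line in f:
            phrase, counts = line.rstrip('\n').split('\t')
            counts = [int(c) for c in counts.split()]
            table[phrase] = labels[max(range(4), key=counts.__getitem__)]
    return table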
5.1.2 Open Track

In addition to the resources available in the closed track, systems in the open track were allowed to use external resources such as Wikipedia, gazetteers, etc. This track mainly serves to get an idea of a performance ceiling on the task, at the cost of not getting a comparison across all systems. Another advantage of the open track is that it might reduce the barriers to participation by allowing participants to field existing research systems that already depend on external resources – especially if there are hard dependencies on those resources – so they can participate in the task with minimal or no modification to their existing systems.
5.2 Supplementary Evaluation

In addition to the option to select between the official closed and open tracks, the participants also had the option to run their systems in some ablation settings.
Gold Parses In this case, for each language, we replaced the predicted parses in the closed track data with manual, gold parses.
6 There are a few instances of novel senses introduced in OntoNotes which were not present in WordNet, and so lack a mapping back to the WordNet senses.
Gold Mention Boundaries In this case, we provided all possible correct mention boundaries in the test data. This essentially entails all NPs and PRPs in the data extracted from the gold parse trees, as well as the mentions that do not align with any parse constituent, for example, verb mentions and some named entities.
Gold Mentions In this dataset, we provided only, and all, the correct mentions for the test sets, thereby reducing the task to one of pure coreference linking, and eliminating the task of mention detection and anaphoricity determination.7
5.3 Train, Development and Test Splits

We used the same algorithm as in CoNLL-2011 to create the train/development/test partitions for English, Chinese, and Arabic. We tried to reuse previously used training/development/test partitions for Chinese and Arabic, but they either were not in the selection used for OntoNotes or were partially overlapping; unfortunately, unlike the English WSJ partitions, there was no clean way of reusing them. Algorithm 1 details the procedure. The lists of training, development, and test document IDs can be found on the task webpage.8 Following recent CoNLL tradition, participants were allowed to use both the training and the development data for training the final model.
5.4 Data Preparation

This section gives details of the different annotation layers, including the automatic models that were used to predict them, and describes the formats in which the data were provided to the participants.
5.4.1 Manual Annotation Gold Layers

We will take a look at the manually annotated, or gold, layers of information that were made available for the training data.
Coreference The manual coreference annotation is stored as chains of linked mentions connecting multiple mentions of the same entity.
7 Mention detection interacts with anaphoricity determination, since the corpus does not contain any singleton mentions.
8 http://conll.bbn.com/download/conll-train.id
  http://conll.bbn.com/download/conll-dev.id
  http://conll.bbn.com/download/conll-test.id
These are more general lists, which include documents that have at least one of the layers of annotation; in other words, they also include documents that do not have any coreference annotation.
Algorithm 1 Procedure used to create OntoNotes training, development and test partitions.
Procedure: Generate Partitions(OntoNotes) returns Train, Dev, Test
1:  Train ← ∅
2:  Dev ← ∅
3:  Test ← ∅
4:  for all Source ∈ OntoNotes do
5:    if Source = Wall Street Journal then
6:      Train ← Train ∪ Sections 02-21
7:      Dev ← Dev ∪ Sections 00, 01, 22, 24
8:      Test ← Test ∪ Section 23
9:    else
10:     if Number of files in Source ≥ 10 then
11:       Train ← Train ∪ File IDs ending in 1-8
12:       Dev ← Dev ∪ File IDs ending in 0
13:       Test ← Test ∪ File IDs ending in 9
14:     else
15:       Dev ← Dev ∪ File IDs ending in 0
16:       Test ← Test ∪ File ID ending in the highest number
17:       Train ← Train ∪ Remaining File IDs for the Source
18:     end if
19:   end if
20: end for
21: return Train, Dev, Test
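For concreteness, Algorithm 1 can be transcribed directly into Python as below. The representation of each source as a name mapped to a list of numeric file IDs is our assumption, and the "file ID ending in the highest number" rule is read here as the largest ID.

def generate_partitions(ontonotes):
    # ontonotes: dict mapping a source name to its numeric file IDs.
    train, dev, test = [], [], []
    for source, file_ids in ontonotes.items():
        if source == "wsj":
            for fid in file_ids:
                section = fid // 100          # e.g. 2301 -> section 23
                if 2 <= section <= 21:
                    train.append((source, fid))
                elif section in (0, 1, 22, 24):
                    dev.append((source, fid))
                elif section == 23:
                    test.append((source, fid))
        elif len(file_ids) >= 10:
            for fid in file_ids:
                if fid % 10 == 0:
                    dev.append((source, fid))
                elif fid % 10 == 9:
                    test.append((source, fid))
                else:
                    train.append((source, fid))
        else:
            top = max(file_ids)               # the highest-numbered file
            for fid in file_ids:
                if fid == top:
                    test.append((source, fid))
                elif fid % 10 == 0:
                    dev.append((source, fid))
                else:
                    train.append((source, fid))
    return train, dev, test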
Coreference is the only document-level phenomenon in OntoNotes, and the complexity of annotation increases non-linearly with the length of a document. Unfortunately, some of the documents – especially ones in the broadcast conversation, weblog, and telephone conversation genres – are very long, which prohibited us from efficiently annotating them in their entirety. These had to be split into smaller parts. We conducted a few passes to join some adjacent parts, but since some documents had as many as 17 parts, there are still multi-part documents in the corpus. Since the coreference chains are coherent only within each of these document parts, for this task each such part is treated as a separate document. Another thing to note is that there were some cases of sub-token annotation in the corpus, owing to the fact that tokens were not split at hyphens: in cases such as pro-WalMart, the sub-span WalMart was linked with another instance of WalMart. The recent Treebank revision, which split tokens at most hyphens, made a majority of these sub-token annotations go away, but some residual ones remained. Since sub-token annotations cannot be represented in the CoNLL format, and they were a very small quantity – much less than even half a percent – we decided to ignore them. Unlike English, Chinese and Arabic have coreference annotation on elided subjects/objects. Recovering these entities in text is a hard problem; the most recently reported numbers in the literature for Chinese are around an F-score of 50. Considering the level of prediction accuracy for these tokens and their relative frequency, plus the fact that the CoNLL tabular format is not amenable to a variable number of tokens, we decided not to consider them as part of the task. In other words, we removed the manually identified traces (*pro* in the Chinese Treebank and * in the Arabic Treebank), and we do not consider the links formed by these tokens in the gold evaluation key.
For various reasons, not all the documents in OntoNotes have been annotated with all the different layers of annotation with full coverage.11 There is a core portion, however – roughly 1.6M English words, 950K Chinese words, and 300K Arabic words – which has been annotated with all the layers. This is the portion that we used for the shared task.
11 Given the nature of word sense annotation, and changes in project priorities, we could not annotate all the low-frequency verbs and nouns in the corpus. Furthermore, PropBank annotation currently covers mostly verb predicates and only a few noun predicates.
                                 Words                        Documents
Corpora          Language  Total  Train  Dev   Test   Total          Train          Dev        Test
MUC-6            English   25K    12K    –     13K    60             30             –          30
MUC-7            English   40K    19K    –     21K    67             30             –          37
ACE (2000-2004)  English   960K   745K   –     215K   –              –              –          –
                 Chinese   615K   455K   –     150K   –              –              –          –
                 Arabic    500K   350K   –     150K   –              –              –          –
OntoNotes        English   1.6M   1.3M   160K  170K   2,384 (3,493)  1,940 (2,802)  222 (343)  222 (348)
                 Chinese   950K   750K   110K  90K    1,729 (2,280)  1,391 (1,810)  172 (252)  166 (218)
                 Arabic    300K   240K   30K   30K    447 (447)      359 (359)      44 (44)    44 (44)

Table 3: Number of words and documents in the OntoNotes data, with a comparison to the MUC and ACE data sets. The numbers in parentheses for the OntoNotes corpus indicate the total number of parts that correspond to the documents. Each part was considered a separate document for evaluation purposes.
                              Train            Development      Test
Language  Syntactic category  Count    %       Count    %       Count    %
English   NP                  81,866   52.89   10,274   53.71   10,357   52.51
          PRP                 65,529   42.33   7,705    40.28   8,157    41.35
          NNP                 2,902    1.87    478      2.50    503      2.55
          V                   2,493    1.61    295      1.54    342      1.73
          Other N             1,024    0.66    226      1.18    191      0.97
          Other               978      0.63    150      0.78    175      0.89
Chinese   NP                  101,049  98.94   13,876   98.63   12,593   99.01
          PRP                 467      0.46    107      0.76    70       0.55
          NR                  71       0.07    8        0.06    8        0.06
          Other N             75       0.07    11       0.08    3        0.02
          V                   187      0.18    37       0.26    13       0.10
          Other               282      0.28    30       0.21    32       0.25
Arabic    NP                  23,157   85.79   2,828    86.94   2,673    84.86
          PRP                 2,977    11.03   344      10.57   410      13.02
          NNP                 608      2.25    49       1.51    36       1.14
          NN                  71       0.26    10       0.31    8        0.25
          V                   25       0.09    4        0.12    0        0.00
          Other               154      0.57    18       0.55    23       0.73

Table 4: Distribution of mentions in the data by their syntactic category.
The number of documents in the corpus for this task, for each of the different languages, is shown in Table 3. Tables 4 and 5 show the distribution of mentions by syntactic category and the counts of entities, links, and mentions in the corpus, respectively. All of this data has been Treebanked and PropBanked, either as part of the OntoNotes effort or some preceding effort.
For comparison purposes, Table 3 also lists the number of documents in the MUC-6, MUC-7, and ACE (2000-2004) corpora. The MUC-6 data was taken from the Wall Street Journal, whereas the MUC-7 data was from the New York Times. The ACE data spanned many different languages and genres, similar to the ones in OntoNotes.
Parse Trees This represents the syntactic layer, a revised version of the treebanks in English, Chinese, and Arabic.
Language  Type             Train    Development  Test    All
English   Entities/Chains  35,143   4,546        4,532   44,221
          Links            120,417  14,610       15,232  150,259
          Mentions         155,560  19,156       19,764  194,480
Chinese   Entities/Chains  28,257   3,875        3,559   35,691
          Links            74,597   10,308       9,242   94,147
          Mentions         102,854  14,183       12,801  129,838
Arabic    Entities/Chains  8,330    936          980     10,246
          Links            19,260   2,381        2,255   23,896
          Mentions         27,590   3,313        3,235   34,138

Table 5: Number of entities, links and mentions in the OntoNotes 5.0 data.
The Arabic treebank has probably seen the most revision over the past few years, to increase consistency. For the purposes of this task, traces were removed from the syntactic trees, since the CoNLL-style data format, being indexed by tokens, does not provide any good means of conveying that information. As mentioned in the previous section, this includes the traces in Chinese and Arabic that represent legitimate dropped subjects/objects. Function tags were also removed, since the parsers that we used for the predicted syntax layer do not provide them. One thing that needs to be dealt with in conversational data is the presence of disfluencies (restarts, etc.). Tokens that were part of disfluencies were removed from the English portion of the data, but kept in the Chinese portion; since the Arabic portion of the corpus is all newswire, this had no impact on it. In the English OntoNotes parses, the disfluencies are marked using a special EDITED12 phrase tag, as was the case for the Switchboard Treebank. Given the frequency of disfluencies and the performance
12 There is another phrase type, EMBED, in the telephone conversation genre, which is similar to the EDITED phrase type: it sometimes identifies insertions, but sometimes contains logical continuations of phrases, so we decided not to remove it from the data.
with which one can identify them automatically,13 a probable processing pipeline would filter them out before parsing. Since we did not have a readily available tagger for disfluencies, we decided to remove them using the oracle information available in the Treebank, and the coreference chains were remapped to the trees without disfluencies. Owing to various constraints, the disfluencies were retained in the Chinese data.
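A sketch of this oracle filtering step – deleting EDITED subtrees and building an old-to-new token index map so that coreference spans can be remapped – might look as follows (nltk-based, illustrative only):

from nltk import Tree

def strip_edited(tree):
    kept = []          # old token indices that survive, in order
    state = {'i': -1}  # running old token index

    def walk(node):
        if not isinstance(node, Tree):        # a leaf token
            state['i'] += 1
            kept.append(state['i'])
            return node
        if node.label() == 'EDITED':          # drop the whole disfluency
            state['i'] += len(node.leaves())  # but still advance the index
            return None
        children = [c for c in (walk(ch) for ch in node) if c is not None]
        return Tree(node.label(), children) if children else None

    new_tree = walk(tree)
    remap = {old: new for new, old in enumerate(kept)}
    return new_tree, remap

tree = Tree.fromstring("(S (EDITED (NP (PRP I))) (NP (PRP I)) (VP (VBP agree)))")
new_tree, remap = strip_edited(tree)   # remap == {1: 0, 2: 1}
# A coreference span (1, 1) becomes (0, 0); spans inside the edit disappear.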
Propositions The propositions in OntoNotes constitute PropBank semantic roles. Most of the verb predicates in the corpus have been annotated with their arguments. Recent enhancements to the PropBank, made to synchronize it better with the Treebank (Babko-Malaya et al., 2006), have enriched the propositions with two types of LINKs, representing pragmatic coreference (LINK-PCR) and selectional preferences (LINK-SLC). More details can be found in the addendum to the PropBank guidelines14 in the OntoNotes 5.0 release. Since the community is not used to this representation, which relies heavily on the trace structure in the Treebank that we are excluding, we decided to unfold the LINKs back to their original representation, as in the PropBank release 1.0. This functionality is part of the OntoNotes DB Tool.15
Word Sense Gold word sense annotation was supplied using sense numbers as specified in the OntoNotes sense inventory for each lemma.16
Named Entities Named entities in the OntoNotes data are specified using a catalog of 18 name types.
Other Layers Discourse plays a vital role in coreference resolution. In the case of broadcast conversation or telephone conversation data, it partially manifests in the form of the speakers of a given utterance, whereas in weblogs or newsgroups it does so as the writer or commenter of a particular article or thread. This information provides an important clue for correctly linking anaphoric pronouns with the right antecedents.
13 A study by Charniak and Johnson (2001) shows that one can identify and remove edits from transcribed conversational speech with an F-score of about 78, with roughly 95 precision and 67 recall.
14 doc/propbank/english-propbank.pdf
15 http://cemantix.org/ontonotes.html
16 It should be noted that word sense annotation in OntoNotes is not complete, so only some of the verbs and nouns have word sense tags specified.
This information could be automatically deduced, but since that would add additional complexity to the already complex task, we decided to provide oracle information for this metadata during both training and testing. In other words, speaker and author identification was not treated as an annotation layer that needed to be predicted. This information was provided as another column in the .conll table. There were some cases of interruptions and interjections that would ideally associate parts of a sentence with two different speakers, but since these were quite infrequent, we decided to assume one speaker/writer per sentence.
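For illustration, a simplified reader for the coreference column of the resulting .conll files is sketched below. It assumes the shared-task layout in which the last column carries entity IDs opened and closed with parentheses (e.g. "(12", "12)", "(12)"), with multiple mentions joined by "|"; it is a simplified illustration, not the official reader.

import re
from collections import defaultdict

def read_coref_column(lines):
    # Returns {entity_id: [(start_tok, end_tok), ...]} for one document part.
    chains = defaultdict(list)
    open_spans = defaultdict(list)   # entity id -> stack of open start indices
    tok = 0                          # running token index within the part
    for line in lines:
        line = line.strip()
        if not line or line.startswith('#'):
            continue                 # sentence break or #begin/#end marker
        coref = line.split()[-1]
        if coref != '-':
            for part in coref.split('|'):
                m = re.match(r'(\()?(\d+)(\))?$', part)
                eid = int(m.group(2))
                if m.group(1):       # "(12" or "(12)" opens a span here
                    open_spans[eid].append(tok)
                if m.group(3):       # "12)" or "(12)" closes the innermost span
                    chains[eid].append((open_spans[eid].pop(), tok))
        tok += 1
    return dict(chains)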
5.4.2 Predicted Annotation Layers

The predicted annotation layers were derived using automatic models trained with cross-validation on other portions of the OntoNotes data. As mentioned earlier, there are some portions of the OntoNotes corpus that have not been annotated for coreference but have been annotated for the other layers. For training the models for each layer, where feasible, we used all the data available for that layer from the training portion of the entire OntoNotes release.
Parse Trees Predicted parse trees for English were produced using the Charniak parser (Charniak and Johnson, 2005).17 Some additional tag types used in the OntoNotes trees were added to the parser's tagset, including the NML tag that has recently been introduced to capture internal NP structure, and the rules used to determine head words were extended appropriately. Chinese and Arabic parses were generated using the Berkeley parser. In the case of Arabic, the parsing community uses a mapping from the rich Arabic part-of-speech tags to Penn-style part-of-speech tags; we used that mapping, which is included with the Arabic treebank. Each parser was re-trained on the training portion of the release 5.0 data using 10-fold cross-validation. Table 6 shows the performance of the re-trained parsers on the CoNLL-2012 test set. We did not get a chance to re-train the re-ranker available for English, and since the stock re-ranker crashes when run on n-best parses containing NMLs (because it has not seen that tag in training), we could not make use of it.
Word Sense This year we used the IMS (It Makes Sense) word sense tagger (Zhong and Ng, 2010).18
17 http://bllip.cs.brown.edu/download/reranking-parserAug06.tar.gz
18 We offer special thanks to Hwee Tou Ng and his student Zhi Zhong for training the IMS models and providing output for the development and test sets.
                                       All sentences                       Sentences with len < 40
Language  Genre                        N      POS    R      P      F       N      R      P      F
English   Broadcast Conversation (BC)  2,194  95.93  84.30  84.46  84.38   2,124  85.83  85.97  85.90
          Broadcast News (BN)          1,344  96.50  84.19  84.28  84.24   1,278  85.93  86.04  85.98
          Magazine (MZ)                780    95.14  87.11  87.46  87.28   736    87.71  88.04  87.87
          Newswire (NW)                2,273  96.95  87.05  87.45  87.25   2,082  88.95  89.27  89.11
          Telephone Conversation (TC)  1,366  93.52  79.73  80.83  80.28   1,359  79.88  80.98  80.43
          Weblogs and Newsgroups (WB)  1,658  94.67  83.32  83.20  83.26   1,566  85.14  85.07  85.11
          New Testament                1,217  96.87  92.48  93.66  93.07   1,217  92.48  93.66  93.07
          Overall                      9,615  96.03  85.25  85.43  85.34   9,145  86.86  87.02  86.94
Chinese   Broadcast Conversation (BC)  885    94.79  79.35  80.17  79.76   824    80.92  81.86  81.38
          Broadcast News (BN)          929    93.85  80.13  83.49  81.78   756    81.82  84.65  83.21
          Magazine (MZ)                451    97.06  83.85  88.48  86.10   326    85.64  89.80  87.67
          Newswire (NW)                481    94.07  77.28  82.26  79.69   406    79.06  83.84  81.38
          Telephone Conversation (TC)  968    92.22  69.19  71.90  70.52   942    69.59  72.24  70.89
          Weblogs and Newsgroups (WB)  758    92.37  78.92  82.57  80.70   725    79.30  83.10  81.16
          Overall                      4,472  94.12  78.93  82.23  80.55   3,979  79.80  82.79  81.27
Arabic    Newswire (NW)                1,003  94.12  75.67  74.71  75.19   766    77.44  74.99  76.19
          Overall                      1,003  94.12  75.67  74.71  75.19   766    77.44  74.99  76.19

Table 6: Parser performance on the CoNLL-2012 test set.
It was trained on all the word sense data present in the training portion of the OntoNotes corpus, using cross-validated predictions on the input layers, as with the proposition tagging. During testing, for English, IMS must first use the automatic POS tagger to identify the nouns and verbs in the test data, and then assign senses to them. Since automatic POS tagging is not perfect, IMS does not always output a sense for every word token that needs to be sense-tagged, owing to wrongly predicted POS tags; as a result, recall is not the same as precision on the English test data. For Chinese, sense tags are defined with respect to a Chinese lemma. Since we provide gold word segmentation, IMS attempts to sense-tag all correctly segmented Chinese words, so recall and precision (and therefore F1) are identical. Note that in Chinese, the word senses are defined against lemmas and are independent of the part of speech. For Arabic, sense tags are likewise defined with respect to a lemma. Since we used gold-standard lemmas, IMS attempts to sense-tag all correctly determined lemmas, so in this case too recall, precision, and F1 are identical. Table 7 gives the number of lemmas covered by the word sense inventory in the English, Chinese, and Arabic portions of OntoNotes.

Table 8 shows the performance of this classifier over both the verbs and nouns in the CoNLL-2012 test set.
For English, no gold-standard senses were available for the pt and tc genres, and for Chinese, none were available for the tc and wb genres, so their accuracies could not be computed.
                   English        Chinese  Arabic
Layer              Verb   Noun    All      Verb   Noun
Sense Inventories  2,702  2,194   763      150    111
Frames             5,672  1,335   20,134   2,743  532

Table 7: Number of senses and frames defined for English, Chinese and Arabic in the OntoNotes v5.0 corpus.
Propositions To predict propositional structure, ASSERT19 (Pradhan et al., 2005) was used, re-trained on all of the training portion of the OntoNotes 5.0 data using cross-validated predicted parses. Given time constraints, we had to make two modifications: i) instead of a single model that predicts all arguments, including NULL arguments, we had to use the two-stage mode, where the NULL arguments are first filtered out and the remaining NON-NULL arguments are classified into one of the argument types; and ii) the argument identification module used an ensemble of ten classifiers – each trained on a tenth of the training data – and performed an unweighted vote among them. This should still give close to state-of-the-art performance, given that argument identification performance tends to become asymptotic around 10k training instances. The CoNLL-2005 scorer was used to compute the scores. At first glance, the performance on the newswire genre is much lower than what has been reported for WSJ Section 23. This could be attributed to two factors: i) the fact that we had to compromise on the training method,
19 http://cemantix.org/assert.html
                                     Accuracy
                                 R      P      F
English
  Broadcast Conversation (BC)    81.2   81.3   81.2
  Broadcast News (BN)            82.0   81.5   81.7
  Magazine (MZ)                  79.1   78.8   79.0
  Newswire (NW)                  85.7   85.7   85.7
  Weblogs (WB)                   77.5   77.6   77.5
  All                            82.5   82.5   82.5
Chinese
  Broadcast Conversation (BC)    -      -      80.5
  Broadcast News (BN)            -      -      85.4
  Magazine (MZ)                  -      -      82.4
  Newswire (NW)                  -      -      89.1
  All                            -      -      84.3
Arabic
  Newswire (NW)                  -      -      77.6
  All                            -      -      77.6

Table 8: Word sense performance over both verbs and nouns in the CoNLL-2012 test set
                                 Frameset   Total       Total          % Perfect      Argument ID + Class
                                 Accuracy   Sentences   Propositions   Propositions   P      R      F
English
  Broadcast Conversation (BC)    0.92       2,037        5,021         52.18          82.55  64.84  72.63
  Broadcast News (BN)            0.91       1,252        3,310         53.66          81.64  64.46  72.04
  Magazine (MZ)                  0.89         780        2,373         47.16          79.98  61.66  69.64
  Newswire (NW)                  0.93       1,898        4,758         39.72          80.53  62.68  70.49
  Telephone Conversation (TC)    0.90       1,366        1,725         45.28          79.60  63.41  70.59
  Weblogs and Newsgroups (WB)    0.92         929        2,174         39.19          81.01  60.65  69.37
  Pivot Corpus (PT)              0.92       1,217        2,853         50.54          86.40  72.61  78.91
  Overall                        0.91       9,479       24,668         44.69          81.47  61.56  70.13
Chinese
  Broadcast Conversation (BC)    -            885        2,323         31.34          53.92  68.60  60.38
  Broadcast News (BN)            -            929        4,419         35.44          64.34  66.05  65.18
  Magazine (MZ)                  -            451        2,620         31.68          65.04  65.40  65.22
  Newswire (NW)                  -            481        2,210         27.33          69.28  55.74  61.78
  Telephone Conversation (TC)    -            968        1,622         32.74          48.70  59.12  53.41
  Weblogs and Newsgroups (WB)    -            758        1,761         35.21          62.35  68.87  65.45
  Overall                        -          4,472       14,955         32.62          61.26  64.48  62.83

Table 9: Performance on the propositions and framesets in the CoNLL-2012 test set.
Framesets   Lemmas
1            2,722
2              321
> 2            181

Table 10: Frameset polysemy across lemmas
but more importantly, because ii) the newswire in OntoNotes not only contains WSJ data, but also Xinhua news. One could try to verify this using just the WSJ portion of the data, but it would be hard: not only is it a subset of the documents on which performance has previously been reported, but the annotation has also been significantly revised; it includes propositions for be verbs missing from the original PropBank, and the training data is a subset of the original data as well. The newly added New Testament data shows very good performance. This is not surprising, since the same is the trend for the automatic parses. Table 9 shows the detailed performance numbers. In addition to automatically predicting the arguments, we also trained a classifier to tag PropBank frameset IDs for the English data. Table 7 lists the number of framesets available across the three languages.20 An overwhelming number of them are monosemous, but the more frequent verbs tend to be polysemous. Table 10 gives the distribution of the number of framesets per lemma in the PropBank layer of the English OntoNotes 5.0 data. During automatic processing of the data, we tagged all the tokens that were tagged with a part of speech VBx. This means that there would be cases where the wrong token would be tagged with propositions.
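To make the two-stage setup above concrete, the unweighted vote over the ten argument identification classifiers can be pictured as follows. This is only a schematic sketch: the classifiers, the feature extractor, and the candidate nodes are hypothetical placeholders standing in for ASSERT's actual models, not its real interface.

    def ensemble_filter(candidates, classifiers, features):
        # candidates: parse-tree nodes considered as argument spans.
        # classifiers: ten binary scorers (callables returning 0 or 1),
        # each trained on a different tenth of the data -- placeholders.
        kept = []
        for node in candidates:
            votes = sum(clf(features(node)) for clf in classifiers)
            if 2 * votes > len(classifiers):   # majority says NON-NULL
                kept.append(node)
        return kept  # survivors go on to the argument-type classifier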
Named Entities BBN's IdentiFinder™ system was used to predict the named entities. Given the time constraints, we could not re-train it on the Chinese and Arabic data, so we only re-trained it on the English portion of the OntoNotes training data. Table 11 shows the overall performance of the tagger on the CoNLL-2012 English test set, as well as the performance broken down by individual name types.
Other Layers As noted earlier, systems were allowed to make use of gender and number predictions for NPs using the table from Bergsma and Lin (Bergsma and Lin, 2006), the speaker metadata for broadcast and telephone conversations,
20 The number of lemmas for English in Table 10 does not add up to this number because not all of them have examples in the training data, where the total number of instantiated senses amounts to 4,229.
and the author or poster metadata for weblogs and newsgroups.
5.4.3 Data Format
In order to organize the multiple, rich layers of annotation, the OntoNotes project has created a database representation for the raw annotation layers, along with a Python API to manipulate them (Pradhan et al., 2007a). In the OntoNotes distribution the data is organized as one file per layer, per document. The API requires a certain hierarchical structure, with documents at the leaves inside a hierarchy of language, genre, source and section. It comes with various ways of cleanly querying and manipulating the data, and allows convenient access to the sense inventory and PropBank frame files instead of having to interpret the raw .xml versions. However, maintaining format consistency with earlier CoNLL tasks was deemed convenient for sites that already had tools configured to deal with that format. Therefore, in order to distribute the data so that one could get the best of both worlds, we created a new file type called .conll, which logically served as another layer in addition to the .parse, .prop, .name and .coref layers. Each .conll file contained a merged representation of all the OntoNotes layers in the CoNLL-style tabular format, with one line per token and multiple columns for each token specifying the input annotation layers relevant to that token, with the final column specifying the target coreference layer. Because OntoNotes is not authorized to distribute the underlying text, and many of the layers contain inline annotation, we had to provide a skeletal form (.skel) of the .conll file, which was essentially the .conll file but with the word column replaced with a dummy string. We provided an assembly script that participants could use to create a .conll file,
taking as input the .skel file and the top-level directory of the OntoNotes distribution that they had separately downloaded from the LDC.21 Once the .conll file is created, it can be used to create the individual layers such as .parse, .name, .coref, etc. In the CoNLL-2011 task there were a few issues where some teams accidentally used the test data during training. To prevent this, this year we distributed the data in two installments: the first for training and development, and the other for testing. The test data release from LDC did not contain the coreference layer.
21 OntoNotes is deeply grateful to the Linguistic Data Consortium for making the source data freely available to the task participants.
                 All      Genre
                          BC       BN       MZ       NW       TC       WB
                 F        F        F        F        F        F        F
English
  Cardinal       68.76    58.52    75.34    72.57    83.62    32.26    57.14
  Date           78.60    73.46    80.61    71.60    84.12    63.89    65.48
  Event          44.63    30.77    50.00    36.36    50.00     0.00    66.67
  Facility       47.29    64.20    43.14    40.00    54.17     0.00    28.57
  GPE            89.77    89.40    93.83    92.87    92.56    81.19    91.36
  Language       47.06    -        75.00    50.00    33.33    22.22    66.67
  Law            48.00     0.00   100.00     0.00    50.98     0.00   100.00
  Location       59.00    54.55    61.36    54.84    67.10    -        44.44
  Money          75.45    33.33    63.64    77.78    79.12    92.31    58.18
  NORP           88.58    94.55    93.92    94.87    90.70    78.05    85.15
  Ordinal        71.39    74.16    80.49    79.07    74.34    84.21    55.17
  Organization   76.00    60.90    78.57    69.97    84.76    48.98    51.08
  Percent        89.11   100.00    83.33    75.00    91.41    83.33    72.73
  Person         78.75    93.35    94.36    87.47    85.80    73.39    76.49
  Product        52.76     0.00    77.65     0.00    42.55     0.00     0.00
  Quantity       50.00    17.14    66.67    62.86    81.82     0.00    30.77
  Time           60.65    66.13    67.33    66.67    64.29    27.03    55.56
  Work of Art    34.03    42.42    35.62    28.57    54.24     0.00     8.70
  All NE         77.95    77.02    84.95    80.33    84.73    62.17    69.47

Table 11: Named Entity performance on the CoNLL-2012 test set
Column   Type                    Description
1        Document ID             This is a variation on the document filename.
2        Part number             Some files are divided into multiple parts numbered as 000, 001, 002, etc.
3        Word number             This is the word index in the sentence.
4        Word                    The word itself.
5        Part of Speech          Part of speech of the word.
6        Parse bit               This is the bracketed structure broken before the first open parenthesis in the
                                 parse, with the word/part-of-speech leaf replaced with a *. The full parse can be
                                 created by substituting the asterisk with the ([pos] [word]) string (or leaf) and
                                 concatenating the items in the rows of that column.
7        Predicate lemma         The predicate lemma is given for the rows that have semantic role information.
                                 All other rows are marked with a -.
8        Predicate Frameset ID   This is the PropBank frameset ID of the predicate in Column 7.
9        Word sense              This is the word sense of the word in Column 3.
10       Speaker/Author          This is the speaker or author name where available, mostly in broadcast
                                 conversation and web log data.
11       Named Entities          This column identifies the spans representing various named entities.
12:N     Predicate Arguments     There is one column of predicate argument structure information for each
                                 predicate mentioned in Column 7.
N        Coreference             Coreference chain information encoded in a parenthesis structure.

Table 12: Format of the .conll file used in the shared task
#begin document (nw/wsj/07/wsj_0771); part 000
...
nw/wsj/07/wsj_0771 0 0  ‘‘         ‘‘  (TOP(S(S*     -         -  - - -         *  *      (ARG1*      *           *       -
nw/wsj/07/wsj_0771 0 1  Vandenberg NNP (NP*          -         -  - - (PERSON)  *  (ARG1* *           *           *       (8|(0)
nw/wsj/07/wsj_0771 0 2  and        CC  *             -         -  - - -         *  *      *           *           *       -
nw/wsj/07/wsj_0771 0 3  Rayburn    NNP *)            -         -  - - (PERSON)  *) *      *           *           *       (23)|8)
nw/wsj/07/wsj_0771 0 4  are        VBP (VP*          be        01 1 -           *  (V*)   *           *           *       -
nw/wsj/07/wsj_0771 0 5  heroes     NNS (NP(NP*)      -         -  - - *         (ARG2*    *           *           *       -
nw/wsj/07/wsj_0771 0 6  of         IN  (PP*          -         -  - - *         *  *      *           *           *       -
nw/wsj/07/wsj_0771 0 7  mine       NN  (NP*))))      -         -  5 - *         *) *      *           *           *       (15)
nw/wsj/07/wsj_0771 0 8  ,          ,   *             -         -  - - *         *  *      *           *           *       -
nw/wsj/07/wsj_0771 0 9  ’’         ’’  *)            -         -  - - *         *  *      *)          *           *       -
nw/wsj/07/wsj_0771 0 10 Mr.        NNP (NP*          -         -  - - *         *  *      (ARG0*      (ARG0*      *       (15
nw/wsj/07/wsj_0771 0 11 Boren      NNP *)            -         -  - - (PERSON)  *  *      *)          *)          *       15)
nw/wsj/07/wsj_0771 0 12 says       VBZ (VP*          say       01 1 -           *  *      (V*)        *           *       -
nw/wsj/07/wsj_0771 0 13 ,          ,   *             -         -  - - *         *  *      *           *           *       -
nw/wsj/07/wsj_0771 0 14 referring  VBG (S(VP*        refer     01 2 -           *  *      (ARGM-ADV*  (V*)        *       -
nw/wsj/07/wsj_0771 0 15 as         RB  (ADVP*        -         -  - - *         *  *      *           (ARGM-DIS*  *       -
nw/wsj/07/wsj_0771 0 16 well       RB  *)            -         -  - - *         *  *      *           *)          *       -
nw/wsj/07/wsj_0771 0 17 to         IN  (PP*          -         -  - - *         *  *      *           (ARG1*      *       -
nw/wsj/07/wsj_0771 0 18 Sam        NNP (NP(NP*       -         -  - - (PERSON*  *  *      *           *           *       (23
nw/wsj/07/wsj_0771 0 19 Rayburn    NNP *)            -         -  - - *)        *  *      *           *           *       -
nw/wsj/07/wsj_0771 0 20 ,          ,   *             -         -  - - *         *  *      *           *           *       -
nw/wsj/07/wsj_0771 0 21 the        DT  (NP(NP*       -         -  - - *         *  *      *           *           (ARG0*  -
nw/wsj/07/wsj_0771 0 22 Democratic JJ  *             -         -  - - (NORP)    *  *      *           *           *       -
nw/wsj/07/wsj_0771 0 23 House      NNP *             -         -  - - (ORG)     *  *      *           *           *       -
nw/wsj/07/wsj_0771 0 24 speaker    NN  *)            -         -  - - *         *  *      *           *           *)      -
nw/wsj/07/wsj_0771 0 25 who        WP  (SBAR(WHNP*)  -         -  - - *         *  *      *           *           (R-ARG0*) -
nw/wsj/07/wsj_0771 0 26 cooperated VBD (S(VP*        cooperate 01 1 -           *  *      *           *           (V*)    -
nw/wsj/07/wsj_0771 0 27 with       IN  (PP*          -         -  - - *         *  *      *           *           (ARG1*  -
nw/wsj/07/wsj_0771 0 28 President  NNP (NP*          -         -  - - *         *  *      *           *           *       -
nw/wsj/07/wsj_0771 0 29 Eisenhower NNP *)))))))))))  -         -  - - (PERSON)  *  *      *)          *)          *)      23)
nw/wsj/07/wsj_0771 0 30 .          .   *))           -         -  - - *         *  *      *           *           *       -

nw/wsj/07/wsj_0771 0 0  ‘‘         ‘‘  (TOP(S*       -         -  - - *         *       *       *        -
nw/wsj/07/wsj_0771 0 1  They       PRP (NP*)         -         -  - - *         (ARG0*) *       *        (8)
nw/wsj/07/wsj_0771 0 2  allowed    VBD (VP*          allow     01 1 -           *       (V*)    *        -
nw/wsj/07/wsj_0771 0 3  this       DT  (S(NP*        -         -  - - *         (ARG1*  (ARG1*  *        (6
nw/wsj/07/wsj_0771 0 4  country    NN  *)            -         -  3 - *         *       *)      *        6)
nw/wsj/07/wsj_0771 0 5  to         TO  (VP*          -         -  - - *         *       *       *        -
nw/wsj/07/wsj_0771 0 6  be         VB  (VP*          be        01 1 -           *       *       (V*)     (16)
nw/wsj/07/wsj_0771 0 7  credible   JJ  (ADJP*)))))   -         -  - - *         *)      *       (ARG2*)  -
nw/wsj/07/wsj_0771 0 8  .          .   *))           -         -  - - *         *       *       *        -
#end document
Figure 3: Sample portion of the .conll file.
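To illustrate the parse-bit convention of Table 12 and Figure 3, the following minimal Python sketch rebuilds a sentence's bracketed parse from the word, part-of-speech and parse-bit columns; the token triples in the comment are hypothetical, not taken from the sample above.

    def rebuild_parse(rows):
        # rows: (word, pos, parse_bit) triples for one sentence, in order.
        # Each parse bit carries exactly one '*' standing for the
        # '([pos] [word])' leaf; concatenating the substituted bits
        # yields the full bracketed parse.
        return "".join(bit.replace("*", "(%s %s)" % (pos, word), 1)
                       for word, pos, bit in rows)

    # Hypothetical example:
    # rebuild_parse([("They", "PRP", "(TOP(S(NP*)"),
    #                ("slept", "VBD", "(VP*)"),
    #                (".", ".", "*))")])
    # -> "(TOP(S(NP(PRP They))(VP(VBD slept))(. .)))"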
Unlike previous CoNLL tasks, this test data contained some truly unseen documents, which made it easier to spot potential training errors such as the ones identified in the CoNLL-2011 task. Since the proposition and word sense layers are inherently standoff annotation, they were provided as is, and did not require the extra merging step. One thing that made the English data creation process a bit tricky was the fact that we had dissected some of the trees for the conversation data to remove the EDITED phrases. Table 12 describes the data provided in each column of the .conll format. Figure 3 shows a sample from a .conll file.
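In outline, the assembly step only has to restore the word column. The sketch below is a minimal illustration of that idea; it assumes whitespace-delimited non-comment lines with the dummy word in the fourth column, and a stream of the document's tokens recovered from the separately downloaded OntoNotes release. It is not the distributed assembly script.

    def assemble_conll(skel_lines, words):
        # words: iterator over the document's tokens, in order (assumed
        # to come from the LDC-distributed OntoNotes source files).
        out, tok = [], iter(words)
        for line in skel_lines:
            if line.startswith("#") or not line.strip():
                out.append(line)        # keep document delimiters / blanks
                continue
            cols = line.split()
            cols[3] = next(tok)         # column 4 (index 3) holds the word
            out.append(" ".join(cols))
        return out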
5.5 Evaluation
This section describes the evaluation criteria used. Unlike for propositions, word sense and named entities, where it is simply a matter of counting the correct answers, or for parsing, where there are several established metrics, evaluating the accuracy of coreference continues to be contentious. Various alternative metrics have been proposed, as mentioned below, which weight different features of a proposed coreference pattern differently. The choice is not clear, in part because the value of a particular set of coreference predictions is integrally tied to the consuming application.
A further issue in defining a coreference metric concerns the granularity of the mentions, and how closely the predicted mentions are required to match those in the gold standard for a coreference prediction to be counted as correct. Our evaluation criterion was in part driven by the OntoNotes data structures. OntoNotes coreference distinguishes between identity coreference and appositive coreference, treating the latter separately because it is already captured explicitly by other layers of the OntoNotes annotation. Thus we evaluated systems only on the identity coreference task, which links all categories of entities and events together into equivalence classes. The situation with mentions for OntoNotes is also different than it was for MUC or ACE. OntoNotes data does not explicitly identify the minimum extents of an entity mention, but it does include hand-tagged syntactic parses. Thus for the official evaluation, we decided to use the exact spans of mentions for determining correctness. The NP boundaries for the test data were pre-extracted from the hand-tagged Treebank for annotation, and events triggered by verb phrases were tagged using the verbs themselves. This choice means that scores for the CoNLL-2012 coreference task are likely to be lower than for coreference evaluations based on MUC,
where the mention spans are specified in the input,22
or those based on ACE data, where an approximate match is often allowed based on the specified head of the NP mention.
5.5.1 Metrics
As noted above, the choice of an evaluation
metric for coreference has been a tricky issue, and there does not appear to be any silver-bullet approach that addresses all the concerns. Three metrics have been proposed for evaluating coreference performance over an unrestricted set of entity types: i) the link-based MUC metric (Vilain et al., 1995), ii) the mention-based B-CUBED metric (Bagga and Baldwin, 1998) and iii) the entity-based CEAF (Constrained Entity Aligned F-measure) metric (Luo, 2005). Very recently the BLANC (BiLateral Assessment of Noun-Phrase Coreference) measure (Recasens and Hovy, 2011) has been proposed as well. Each of these metrics tries to address the shortcomings or biases of the earlier ones. Given a set of key entities K and a set of response entities R, with each entity comprising one or more mentions, each metric generates its own variation of a precision and recall measure. The MUC measure is the oldest and most widely used. It focuses on the links (or pairs of mentions) in the data.23 The number of common links between entities in K and R divided by the number of links in K represents recall, whereas precision is the number of common links between entities in K and R divided by the number of links in R. This metric prefers systems that have more mentions per entity – a system that creates a single entity out of all the mentions will get 100% recall without significant degradation in its precision. And it ignores recall for singleton entities, or entities with only one mention. The B-CUBED metric tries to address MUC's shortcomings by focusing on the mentions, and computes recall and precision scores for each mention. If K is the key entity containing mention M, and R is the response entity containing mention M, then recall for the mention M is computed as |K∩R|/|K| and precision for the same is computed as |K∩R|/|R|. Overall recall and precision are the averages of the individual mention scores. CEAF aligns every response entity with at most one key entity by finding the best one-to-one mapping between the entities using an entity similarity metric. This is a maximum bipartite matching problem and can be solved by the Kuhn-Munkres algorithm.
22 As is the case in this evaluation with gold mentions. 23 The MUC corpora did not tag single-mention entities.
It is thus an entity-based measure. Depending on the similarity, there are two variations – an entity-based CEAF, CEAFe, and a mention-based CEAF, CEAFm. Recall is the total similarity divided by the number of mentions in K, and precision is the total similarity divided by the number of mentions in R. Finally, BLANC uses a variation on the Rand index (Rand, 1971) suitable for evaluating coreference. There are a few other measures – one being the ACE value – but since this is specific to a restricted set of entities (ACE types), we did not consider it.
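For concreteness, the link- and mention-based definitions above can be written down in a few lines. This is a minimal sketch, assuming each entity is given as a set of hashable mention identifiers and treating mentions absent from the response as singletons; the archived scorer (Section 5.5.3), not this sketch, is the reference implementation. Precision is the same computation with key and response swapped.

    def muc_recall(key, response):
        # Vilain et al. (1995): links recovered over links in the key.
        ent_of = {m: i for i, ent in enumerate(response) for m in ent}
        num = den = 0
        for k in key:
            touched = {ent_of[m] for m in k if m in ent_of}
            missing = sum(1 for m in k if m not in ent_of)
            num += len(k) - (len(touched) + missing)  # |k| minus partitions of k
            den += len(k) - 1
        return num / den if den else 0.0

    def b3_recall(key, response):
        # Bagga and Baldwin (1998): per-mention |K ∩ R| / |K|, averaged.
        ent_of = {m: ent for ent in map(frozenset, response) for m in ent}
        scores = [len(k & ent_of.get(m, frozenset([m]))) / len(k)
                  for k in map(frozenset, key) for m in k]
        return sum(scores) / len(scores) if scores else 0.0

    # e.g. key = [{1, 2}, {3}]; response = [{1, 2, 3}]
    # muc_recall(key, response) == 1.0; muc_recall(response, key) == 0.5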
5.5.2 Official Evaluation Metric
In order to determine the best performing system in the shared task, we needed to associate a single number with each system. This could have been one of the metrics above, or some combination of more than one of them. The choice was not simple, and while we consulted various researchers in the field, hoping for a strong consensus, their conclusion seemed to be that each metric had its pros and cons. We settled on the MELA metric of Denis and Baldridge (2009), which takes a weighted average of three metrics: MUC, B-CUBED, and CEAF. The rationale for the combination is that each of the three metrics represents a different important dimension, the MUC measure being based on links, B-CUBED on mentions, and CEAF on entities. We decided to use CEAFe instead of CEAFm. For a given task, a weighted average of the three might be optimal, but since we do not have an end task in mind, we decided to use the unweighted mean of the three metrics as the score on which the winning system was judged. This still gives us a score for only one language. We wanted to encourage researchers to run their systems on all three languages. Therefore, we decided to compute the final score that would determine the winning submission as the average of the MELA metric across all three languages, giving a MELA score of zero for every language that a particular group did not run its system on.
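Under these definitions the official score reduces to a few lines. A minimal sketch, assuming the per-language MUC, B-CUBED and CEAFe F-scores are already computed; the illustrative figures in the comment combine fernandes's numbers from Tables 17 and 20.

    def official_score(per_language):
        # per_language maps 'english' / 'chinese' / 'arabic' to the
        # triple (MUC F, B-CUBED F, CEAFe F); a missing language scores 0.
        langs = ("english", "chinese", "arabic")
        mela = [sum(per_language[l]) / 3.0 if l in per_language else 0.0
                for l in langs]
        return sum(mela) / len(langs)

    # e.g. fernandes: English (70.51 + 71.24 + 48.37)/3 = 63.37, and
    # averaging the language scores (63.37, 58.49, 54.22) gives 58.69.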
5.5.3 Scoring Metrics Implementation
We used the same core scorer implementation24 that was used for the SEMEVAL-2010 task, which implemented all the different metrics. A couple of modifications were made to this scorer after it was used for the SEMEVAL-2010 task.
1. Only exact matches were considered correct.
24 http://www.lsi.upc.edu/~esapena/downloads/index.php?id=3
Previously, for SEMEVAL-2010, non-exact matches were judged partially correct, with a 0.5 score, if the heads were the same and the mention extent did not exceed the gold mention.
2. The modifications suggested by Cai and Strube (2010) were incorporated in the scorer.
Since there are differences between the version used for CoNLL and the one available on the download site, and it is possible that the latter will be revised in the future, we have archived the version of the scorer on the CoNLL-2012 task webpage.25
6 Participants
A total of 41 different groups demonstrated interest in the shared task by registering on the task webpage. Of these, 16 groups from 6 countries submitted system outputs on the test set during the evaluation week. 15 groups participated in at least one language in the closed task, and only one group participated solely in the open track. One participant did not submit a final task paper. Tables 13 and 14 list the distribution of the participants by country and the participation by language and track type.
Country        Participants
Brazil         1
China          8
Germany        3
Italy          1
Switzerland    1
USA            2

Table 13: Participation by Country
           Open   Closed   Combined
English    1      15       16
Chinese    3      13       14
Arabic     1      7        8

Table 14: Participation across languages and tracks
7 Approaches
Tables 15 and 16 summarize the approaches taken by the participating systems along some important dimensions. Most of the systems divided the problem into the typical two phases – first identifying the potential mentions in the text, and then linking the mentions to form coreference chains, or entities. Many systems used rule-based approaches for
25 http://conll.bbn.com/download/scorer.v4.tar.gz
mention detection, though one, yang, used trained models, and li used a hybrid approach, adding mentions from a trained model to the ones identified using rules. All systems ran a post-processing stage, after linking potential mentions together, to delete the remaining unlinked mentions. It was common for the systems to represent the markables (mentions) internally in terms of the parse-tree NP constituent span, but some systems used a shared attribute model, where the attributes of the merged entity are determined collectively by heuristically merging the attribute types and values of the different constituent mentions. Various types of trained models were used for predicting coreference. For a learning-based system, the generation of positive and negative examples is very important. The participating systems used a range of sentence windows surrounding the anaphor in generating these examples. Among the systems that used trained models, many used the approach described in Soon et al. (2001) for selecting the positive and negative training examples (a compact sketch of this scheme appears at the end of this section), while others used some of the alternative approaches that have been introduced in the literature more recently. Following the success of the rule-based linking model in the CoNLL-2011 shared task, many systems used a completely rule-based linking model, or used it as an initializing or intermediate step in a learning-based system. A hybrid approach seems to be a central theme of many high-scoring systems. Taking a cue from last year's systems, almost all systems trained pleonastic it classifiers, and used speaker-based constraints/features for the conversation genres. Many systems used the Arabic POS tags that were mapped down to Penn-style POS tags, but stamborg used a heuristic to convert them back to the complex POS type using the more frequent mapping, to get better performance for Arabic. The fernandes system uses feature templates defined on mention pairs. bjorkelund mentions that disallowing transitive closures gave performance improvements of 0.6 and 0.4 for English and Chinese/Arabic respectively. bjorkelund also mentions seeing a considerable increase in performance after adding features that correspond to the Shortest Edit Script (Myers, 1986) between surface forms and unvocalised Buckwalter forms, respectively. These could be better at capturing the differences in gender and number signaled by certain morphemes than hand-crafted rules. chen builds upon the sieve architecture proposed by Raghunathan et al. (2010), adding one sieve – head match – for Chinese and modifying two sieves. Some participants tried to incorporate some peculiarities of the corpus in their systems. For example, martschat excluded
fernandes — Track: C; Languages: A,C,E; Syntax: P; Learning framework: Latent Structure Perceptron; Markable identification: all noun phrases, pronouns and named entities; Verbs: ×; Feature selection: latent feature induction and feature templates (196 templates for English, 197 for Chinese and 223 for Arabic); Train: T+D.
bjorkelund — Track: C; Languages: A,C,E; Syntax: P; Learning framework: LIBLINEAR for linking, and Maximum Entropy (Mallet) for anaphoricity; Markable identification: NP, PRP and PRP$ in all languages; PN and NR in Chinese; all NE in English; a classifier to exclude non-referential pronouns in English (with a probability of 0.95); Verbs: ×; Feature selection: ×; Train: T+D.
chen — Track: C,O; Languages: A,C,E; Syntax: P; Learning framework: hybrid – sieve approach followed by language-specific heuristic pruning and language-independent learning-based pruning; genre-specific models; Markable identification: NP, PRP, PRP$ and selected NE in English; NP and QP in Chinese, excluding the Chinese interrogative pronouns what and where; NP and selected NE in Arabic; learning to prune non-referential mentions; Verbs: ×; Feature selection: backward elimination; Train: T.
stamborg — Track: C; Languages: A,C,E; Syntax: D; Learning framework: logistic regression (LIBLINEAR); Markable identification: NP, PRP and PRP$ in all languages; PN in Chinese; all NE in English; exclude pleonastic it in English; prune smaller mentions with the same head; Verbs: ×; Feature selection: forward + backward, starting from the CoNLL-2011 feature set for English and the Soon feature set for Chinese and Arabic (18–37 features); Train: T+D.
martschat — Track: C; Languages: C,E; Syntax: D; Learning framework: directed multigraph representation where the weights are learned over the training data (on top of BART (Versley et al., 2008)); Markable identification: eight different mention types for English, with adjectival uses for nations and a few NEs filtered, as well as embedded mentions and pleonastic pronouns; four mention types in Chinese; copulas are also handled appropriately; Verbs: ×; Features: in the form of negative and positive relations; Train: 20% of T (E), 15% of T (C).
chang — Track: C; Languages: E,C; Syntax: P; Learning framework: latent structure learning; Markable identification: all noun phrases, pronouns and named entities; Verbs: ×; Features: Chang et al., 2011; Train: T+D.
uryupina — Track: C; Languages: A,C,E; Syntax: P; Learning framework: modification of BART using multi-objective optimization; domain-specific classifiers for the nw and bc genres; Markable identification: standard rules for English, and a classifier to identify markable NPs in Chinese and Arabic; Verbs: ×; Features: ∼45; Train: T+D.
zhekova — Track: C; Languages: A,C,E; Syntax: P; Learning framework: memory-based learning (TiMBL); Markable identification: NP, PRP and PRP$ in English, and all NPs in Chinese and Arabic; singleton classifier; Verbs: ×; Features: 33; Train: T+D.
li — Track: C; Languages: A,C,E; Syntax: P; Learning framework: MaxEnt; Markable identification: all phrase types that are mentions in training are considered as mentions, and a classifier is trained to identify potential mentions; Verbs: √; Features: 11 feature groups; Train: T+D.
yuan — Track: C,O; Languages: E,C; Syntax: P; Learning framework: C4.5 and deterministic rules; Markable identification: all noun phrases, pronouns and named entities; Verbs: ×; Train: T+D.
xu — Track: C; Languages: E,C; Syntax: P; Learning framework: decision tree classifier and deterministic rules; Markable identification: all noun phrases, pronouns and selected named entities; overlapping mentions are considered when they are second-level NPs inside an NP, for example coordinated NPs; Verbs: ×; Features: 51 (E) and 61 (C); Train: T.
chunyang — Track: C; Languages: E,C; Syntax: P; Learning framework: rule-based (adaptation of the Lee et al. (2011) sieve structure); Markable identification: Chinese NPs and pronouns using part of speech PN, and names using part of speech NR, excluding measure words and certain names; Verbs: ×.
yang — Track: C; Languages: E; Syntax: P; Learning framework: MaxEnt (OpenNLP); Markable identification: mention detection classifier; Verbs: ×; Feature selection: same feature set, but per classifier; Features: 40; Train: T.
xinxin — Track: C; Languages: E,C; Syntax: P; Learning framework: MaxEnt; Markable identification: NP, PRP and PRP$ in English and Chinese; Verbs: ×; Feature selection: greedy forward-backward; Features: 71; Train: T+D.
shou — Track: C; Languages: E; Syntax: P; Learning framework: modified version of the Lee et al. (2011) sieve system.
xiong — Track: O; Languages: A,C,E; Syntax: P; Learning framework: Lee et al. (2011) system.

Table 15: Participating system profiles – Part I. In the Track column, C/O represents whether the system participated in the closed, open or both tracks. In the Syntax column, a P represents that the system used a phrase structure grammar representation of syntax, whereas a D represents that it used a dependency representation. In the Train column, T represents training data and D represents development data.

fernandes — Examples: identifies likely mentions, aiming to generate high recall, using the sieve method proposed in (Lee et al., 2011); creates directed arcs between mention pairs using a set of rules. Decoding: a constrained latent predictor finds the maximum-scoring document tree among the possible candidates.
bjorkelund — Positive: closest antecedent (Soon, 2001). Negative: examples in between anaphor and closest antecedent (Soon, 2001). Decoding: stacked resolvers – i) best first, ii) pronoun closest first (closest first for pronouns and best first for other mentions) and iii) cluster-mention; disallow transitive nesting; proper noun mentions processed first, followed by other nouns and pronouns.
chen — Rule-based sieve approach followed by heuristic and learning-based pruning.
stamborg — Positive: closest antecedent (Soon, 2001). Negative: examples in between anaphor and closest antecedent (Soon, 2001). Decoding: Chinese and Arabic – closest-first clustering for pronouns and best-first clustering otherwise; English – closest-first for pronouns and averaged best-first clustering otherwise.
martschat — Weights are trained on part of the training data. Decoding: greedy clustering for English; spectral clustering followed by greedy clustering for Chinese to reduce the number of candidate antecedents.
chang — Positive: closest antecedent (Soon, 2001). Negative: all preceding mentions in a union of gold and predicted mentions; pairs where the first mention is a pronoun and the other is not are not considered. Decoding: best-link strategy; separate classifiers for pronominal and non-pronominal mentions for English; a single classifier for Chinese.
uryupina — Positive: closest antecedent (Soon, 2001). Negative: examples in between anaphor and closest antecedent (Soon, 2001). Decoding: mention-pair model without ranking, as in Soon (2001).
zhekova — Positive: closest antecedent (Soon, 2001). Negative: examples in between anaphor and closest antecedent (Soon, 2001). Decoding: all definite phrases used to create a pair for each anaphor with each mention preceding it within a window of 10 (English, Chinese) or 7 (Arabic) sentences.
li — Decoding: best-first clustering.
yuan — Decoding: deterministic NP-NP followed by PP-NP.
xu — Examples: a window of sentences is used to determine positive and negative examples – a window of 5 sentences for English and 10 sentences for Chinese. Decoding: all-pair linking followed by pruning or correction using a set of rules for NE-NE and NP-NP mentions, for sentences outside of a 5/10-sentence window in English and Chinese respectively.
chunyang — Lee et al. (2011) system.
yang — Examples: pre-cluster pair models, separate for each pair type NP-NP, NP-PRP and PRP-PRP. Decoding: pre-clusters, with singleton pronoun pre-clusters, and closest-first clustering; different link models based on the type of the linking mentions – NP-PRP, PRP-PRP and NP-NP.
xinxin — Positive: closest antecedent (Soon, 2001). Negative: examples in between anaphor and closest antecedent (Soon, 2001). Decoding: best-first clustering method.
shou — Modified version of the Lee et al. (2011) coreference system.
xiong — Lee et al. (2011) system.

Table 16: Participating system profiles – Part II. This focuses on the way positive and negative examples were generated and the decoding strategy used.
adjectival nation names. Unlike in English, and especially in the absence of an external resource, it is hard to make a gender distinction in Arabic and Chinese. martschat used the information that 先生 (sir) and 女士 (lady) often suggest gender information. bo and martschat used the plurality marker 们 to identify plurals, e.g. 同学 (student) is singular and 同学们 (students) is plural. bo also used a heuristic that if the word 和 (and) appears in the middle of a mention A, and the two parts separated by 和 are sub-mentions of A, then mention A is considered to be plural. Other words which have a similar meaning to 和, such as 同, 与 and 跟, are also considered. uryupina used the rich POS tags to classify pronouns into subtype, person, number and gender. Chinese and Arabic do not have definite noun phrase markers like the in English. In contrast to English, there is no strict enforcement of using definite noun phrases when referring to an antecedent in Chinese. Both 这次演说 (the talk) and 演说 (talk) can corefer with the antecedent 克林顿在河内大选的演说 (Clinton's talk during the Hanoi election). This makes it very difficult to distinguish generic expressions from referential ones. martschat checks whether the phrase starts with a definite/demonstrative indicator (e.g. 这 (this) or 那 (that)) in order to identify demonstrative and definite noun phrases. For Arabic, uryupina considers as definite all mentions with definite head nouns (prefixed with "Al") and all idafa constructs with a definite modifier. chang use the training data to identify inappropriate mention boundaries, performing a relaxed matching between predicted mentions and gold mentions that ignores punctuation marks and mentions that start with one of the following: adverb, verb, determiner, or cardinal number. At another extreme, xiong translated Chinese and Arabic to English, ran an English system, and projected mentions back to the source languages. This did not work particularly well by itself. One issue they faced was that many instances of pronouns did not have a corresponding mention in the source language (since we do not consider mentions formed by dropped subjects/objects). Nevertheless, using this in addition to performing coreference resolution in these languages could be useful. As last year, most participants appear not to have focused much on eventive coreference, those coreference chains that build off verbs in the data. This usually meant that mentions that should have linked to the eventive verb were instead linked in with some other entity, or remained unlinked. Participants may have chosen not to focus on events because they pose unique challenges while making up only a small portion of the data. Roughly 91% of mentions in the data are NPs and pronouns. Many of the trained systems were also able to improve their performance by using feature selection, though things varied somewhat depending on the example selection strategy and the classifier used.
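Since several entries in Tables 15 and 16 reference the Soon et al. (2001) instance-creation scheme, a compact sketch may be helpful. It assumes mentions in document order with a gold entity id for each anaphoric mention, and is an illustration of the scheme rather than any participant's actual code.

    def soon_examples(mentions, entity_of):
        # mentions: the document's mentions in textual order.
        # entity_of: mention -> gold entity id (absent for non-anaphoric).
        examples = []
        for j, anaphor in enumerate(mentions):
            eid = entity_of.get(anaphor)
            if eid is None:
                continue                       # no antecedent to find
            for i in range(j - 1, -1, -1):     # scan right-to-left for the
                if entity_of.get(mentions[i]) == eid:  # closest antecedent
                    examples.append((anaphor, mentions[i], 1))   # positive
                    examples += [(anaphor, mentions[k], 0)       # negatives:
                                 for k in range(i + 1, j)]       # in between
                    break
            # the first mention of an entity yields no examples
        return examples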
8 Results
In this section we first give a performance overview of the various systems, and then look at the performance for each language in various settings separately. For the official test, beyond the raw source text, coreference systems were provided only with the predictions (from automatic engines) of the other annotation layers (parses, semantic roles, word senses, and named entities). When referring to the participating systems, as a convention, we use the last name of the contact person from the participating team. It is almost always the last name of the first author of the system paper, or the first name in case of conflicting last names (xinxin26). The only exception is chunyang, which is the first name of the second author for that system. A high-level summary of the results for the systems on the primary evaluation for both open and closed tracks is shown in Table 17. The scores under the columns for each language are the average of MUC, BCUBED and CEAFe for that language. The Official Score column is the average of those per-language averages, but only for the closed track. If a participant did not participate in all three languages, they got a score of zero for the languages that were not attempted. The systems are sorted in descending order of this final official score. The last two columns indicate whether the systems used only the training data or both training and development data for the final submissions. Note that all the results reported here still used the same predicted information for all input layers. Most top-performing systems used both training and development data for training the final system.
It can be seen that the fernandes system got the highest combined score (58.69) across all three languages and metrics. While this is lower than the figures cited for other corpora, it is as expected, given that the task here includes predicting the underlying mentions and mention boundaries, the insistence on exact match, and the fact that the relatively easier appositive coreference cases are not included in this measure. The combined score across all languages is purely for ranking purposes, and does not really tell much about each individual language. Looking at
26 They did not submit a final system description paper.
Participant   Open                          Closed                        Official   Final model
              English   Chinese   Arabic    English   Chinese   Arabic    Score      Train   Dev
fernandes                                   63.37     58.49     54.22     58.69      √       √
bjorkelund                                  61.24     59.97     53.55     58.25      √       √
chen                    63.53               59.69     62.24     47.13     56.35      √       ×
stamborg                                    59.36     56.85     49.43     55.21      √       √
uryupina                                    56.12     53.87     50.41     53.47      √       √
zhekova                                     48.70     44.53     40.57     44.60      √       √
li                                          45.85     46.27     33.53     41.88      √       √
yuan                    61.02               58.68     60.69               39.79      √       √
xu                                          57.49     59.22               38.90      √       ×
martschat                                   61.31     53.15               38.15      √       ×
chunyang                                    59.24     51.83               37.02      –       –
chang                                       60.18     45.71               35.30      √       ×
xinxin                                      48.77     51.76               33.51      √       √
shou                                        58.25                         19.42      √       ×
yang                                        55.29                         18.43      √       ×
xiong         59.23     44.35     44.37                                    0.00      √       √

Table 17: Performance on primary open and closed tracks using all predicted information
Participant   Open                          Closed                        Suppl.    Final model
              English   Chinese   Arabic    English   Chinese   Arabic    Score     Train   Dev
fernandes                                   63.16     61.48     53.90     59.51     √       √
bjorkelund                                  60.75     62.76     53.50     59.00     √       √
chen                    70.00               60.33     68.55     47.27     58.72     √       ×
stamborg                                    57.35     54.30     49.59     53.75     √       √
zhekova                                     49.30     44.93     40.24     44.82     √       √
li                                          43.04     43.28     31.46     39.26     √       √
yuan                                        59.50     64.42               41.31     √       √
xu                                          56.47     64.08               40.18     √       ×
chang                                       60.89                         20.30     √       √

Table 18: Performance on supplementary open and closed tracks using all predicted information, given gold mention boundaries
Participant   Open                          Closed                        Suppl.    Final model
              English   Chinese   Arabic    English   Chinese   Arabic    Score     Train   Dev
fernandes                                   69.35     66.36     63.49     66.40     √       √
bjorkelund                                  68.20     69.92     59.14     65.75     √       √
chen                    78.98               70.46     77.77     52.26     66.83     √       ×
stamborg                                    68.66     66.97     53.35     62.99     √       √
zhekova                                     59.06     51.44     55.72     55.41     √       √
li                                          51.40     59.93     40.62     50.65     √       √
yuan                                        69.88     76.05               48.64     √       √
xu                                          63.46     69.79               44.42     √       ×
chang                                       77.22                         25.74     √       √

Table 19: Performance on supplementary open and closed tracks using all predicted information, given gold mentions
the English performance, we can see that the fernandes system gets the best average across the three selected metrics (MUC, BCUBED and CEAFe). The next best system for English is martschat (61.31), followed very closely by bjorkelund (61.24) and then chang (60.18). Owing to the ordering based on official score, not all the best performing systems for a particular language are in sequential order. Therefore, for easier reading, the scores of the top ranking system are in red, and the top four systems are underlined in the table. The performance differences between the better-scoring systems were not large, with only about three points separating the top four systems, and only 5 out of a total of 16 systems got a score lower than 58 points.27
In the case of Chinese, the chen system performs best, with a score of 62.24. It is followed by yuan (60.69), and then bjorkelund (59.97) and xu (59.22). It is interesting to note that the scores for the top performing systems for both English and Chinese are very close; for all we know, this is just a coincidence. Also, for both English and Chinese, the top performing system is almost 2 points higher than the second best system.
On the Arabic front, once again, fernandes has the highest score of 54.22, followed closely by bjorkelund (53.55) and then uryupina (50.41).
Tables 18 and 19 show similar information for the two supplementary tasks – one given gold mention boundaries and one given correct, gold mentions. We have, however, kept the same relative ordering of the participants as in Table 17 for ease of reading. Looking at Table 18 carefully, we can see that for English and Arabic the relative rankings of the systems remain almost the same, except for a few outliers. The chang system performs the best given gold mentions – by almost 7 points over the next best performing system. In the case of Chinese, the chen system performs almost 6 points better than the official performance given gold boundaries, and another 9 points better given gold mentions – in the latter case, almost 8 points better than the next best system. We will look at more details in the following sections.
Figure 4 shows a performance plot for the eight participating systems that attempted both of the supplementary tasks – GB and GM, in addition to the main NB task – for at least one of the three languages. These are all in the closed setting. At the bottom of the plot are dots that indicate the test condition to which a particular point refers.
27 Which also happens to be the highest performing score last year. A more precise comparison appears later in Section 9.
In most cases, for the hardest task – NB – the English and Chinese performances track quite close to each other. When provided with gold mention boundaries (GB), the chen, xu and yuan systems do significantly better for Chinese. There is almost no positive effect on the English performance across the board. In fact, the performance of the stamborg and li systems drops noticeably. There is also a drop in performance for the bjorkelund system, but the difference is probably not significant. Finally, when provided with gold mentions, the performance of all systems increases across all languages, with the chang system showing the highest gain for English, and the chen system showing the highest gain for Chinese.
Figure 5 is a box and whiskers plot of the performance of all the systems for each language and variation – NB, GB, and GM. The circle in the center indicates the mean of the performances. The horizontal line inside the box indicates the median, and the bottom and top of the box indicate the first and third quartiles respectively, with the whiskers indicating the lowest and highest performance on that task. It can easily be seen that the English systems have the least divergence, with the divergence for the GM case large probably owing to the chang system. This is somewhat expected, as this is the second year for the English task, and so it shows more mature, more stable performance. On the other hand, both the Chinese and Arabic plots show much more divergence, with the Chinese and Arabic GB cases showing the highest divergence. Also, except for the Chinese GM condition, there is some skewness in the score distribution one way or the other.
Some participants ran their systems on six of the twelve possible combinations for all three languages. Figure 6 shows a plot for these three participants – fernandes, bjorkelund, and chen. As in Figure 4, the dots at the bottom help identify which particular combination of parameters a point on the plot represents. In addition to the three test conditions related to mention quality, there are now also two more test conditions relating to syntax. We can see that the fernandes and bjorkelund system performances track very close to each other. In other words, using gold standard parses during testing does not show much benefit in those cases. In the case of the chen system, however, using gold parses shows a significant jump in scores for the NB condition. It seems that somehow the chen system makes much better use of the gold parses. In fact, the performance is very close to that with the GB condition. It is not clear what this system is doing differently that makes this possible.
Figure 4: Performance for eight participating systems for the three languages, across the three mention qualities.
Adding more information, i.e., the GM condition, improves the performance by almost the same delta as going from NB to GB.
Finally, Figure 7 shows the plot for one system – bjorkelund – that was run on ten of the twelve different settings. As usual, the dots at the bottom help identify the conditions for a point on the plot. Now there is a condition related to the quality of syntax during training as well. For some reason, using gold syntax hurts performance – though slightly – in the NB and GB settings. Chinese does show some improvement when gold parses are used for training, but only when gold mentions are available during testing.
One point to note is that we cannot compare these results to the ones obtained in the SEMEVAL-2010 coreference task, which used a small portion of the OntoNotes data, because it used only nominal entities and had heuristically added singleton mentions.28
In the following sections we will look at the results for the three languages, in various settings, in more detail.
It might help to describe the format of the tables first. Given that our choice of the official metric was somewhat arbitrary, it is also useful to look at the individual metrics. The tables are similar in structure to Table 20.
28 The documentation that comes with the SEMEVAL data package from LDC (LDC2011T01) states: "Only nominal mentions and identical (IDENT) types were taken from the OntoNotes coreference annotation, thus excluding coreference relations with verbs and appositives. Since OntoNotes is only annotated with multi-mention entities, singleton referential elements were identified heuristically: all NPs and possessive determiners were annotated as singletons excluding those functioning as appositives or as pre-modifiers, but for NPs in the possessive case. In coordinated NPs, single constituents as well as the entire NPs were considered to be mentions. There is no reliable heuristic to automatically detect English expletive pronouns, thus they were (although inaccurately) also annotated as singletons."
Each table provides results across multiple dimensions. For completeness, the tables include the raw precision and recall scores from which the F-scores were derived. Each table shows the scores for a particular system for the tasks of mention detection and coreference resolution separately. The tables also include two additional scores (BLANC and CEAFm) that did not factor into the official score. Useful further analysis may be possible based on these results, beyond the preliminary results presented here. Recall that OntoNotes does not contain any singleton mentions. Owing to this peculiar nature of the data, the mention detection scores cannot be interpreted independently of the coreference resolution scores. In this scenario, a mention is effectively an anaphoric mention that has at least one other mention coreferent with it in the document. Most systems removed singletons from the response as a post-processing step, so not only will they not get credit for the singleton entities that they incorrectly removed from the data, but they will be
penalized for the ones that they accidentally linked with another mention. What this number does indicate is the ceiling on recall that a system would have had in the absence of being penalized for making mistakes in coreference resolution. The tables are sub-divided into several logical horizontal sections separated by two horizontal lines. Each horizontal section can be categorized by a combination of three parameters. Two of these apply to the test set, and one to the training set. We have divided the parameters into two types: i) syntax and ii) mention quality. Syntax can take two values – automatic or gold – and the mention quality can be of three types: i) no boundaries (NB), ii) gold mention boundaries (GB) and iii) gold mentions (GM). There are a total of 12 combinations that can be formed using these parameters. Of these, we thought six were particularly interesting: the product of the three cases of mention quality – NB, GB and GM – and whether gold syntax (GS) or predicted syntax (PS) was used for the test set.
Figure 5: A box plot of the performance for the three languages across the three mention qualities.
Figure 6: Performance of the fernandes, bjorkelund and chen systems in six different settings.
Just as we used the dots below the graphs earlier to indicate the parameters chosen for a particular point on the plot, we use small black squares in the tables, after the participant name, to indicate the conditions chosen for the results on that particular row. Since there are many rows in each table, in order to make it easier to find which number we are referring to, we have added an ID column which uses the letters e, c, and a to refer to the three languages – English, Chinese and Arabic. This is followed by a decimal number, in which the number before the decimal identifies the logical block within the table that shares the same experiment parameters, and the one after the decimal indicates the index of a particular system in that block. Systems are sorted by the official score within each block. All the systems with NB are listed first, followed by GB, followed by GM. One participant (bjorkelund) ran more variations than we had originally planned, but since they fall under the general permutations and combinations of the settings that we were considering, it makes sense to list those results here as well.
8.1 English Closed
Table 20 shows the performance for the English language in greater detail.
Official Setting Recall is quite important in the mention detection stage, because the full coreference system has no way to recover if the mention detection stage misses a potentially anaphoric mention. The linking stage indirectly impacts the final mention detection accuracy: after a complete pass through the system, some correct mentions could remain unlinked to any other mention and would be deleted, thereby lowering recall. Most systems tend to achieve a close balance between recall and precision on the mention detection task. A few systems had a considerable gap between the final mention detection recall and precision (fernandes, xu, yang, li and xinxin). It is not clear why this might be the case. One commonality among the ones that had a much higher precision than recall is that they used machine-learned classifiers for mention detection. This could be because any classifier that is trained will not normally contain singleton mentions (as none have been annotated in the data) unless one explicitly adds them to the set of training examples (which is not mentioned in any of the respective system papers). A hybrid rule-based and machine-learned model (fernandes) performed the best. Apart from some local differences, the ranking of all the systems is roughly the same irrespective of which metric is chosen.
Figure 7: Performance of the bjorkelund system in twelve different settings.
The CEAFe measure seems to penalize systems more harshly than the other measures. If the CEAFe measure does indicate the accuracy of entities in the response, this suggests that the fernandes system is doing better at getting coherent entities than any other system.
Gold Mention Boundaries One difficulty with this supplementary evaluation using gold mention boundaries is that those boundaries alone provide only very partial information. For the roughly 10% of mentions that the automatic parser did not correctly identify, while the systems knew the correct boundaries, they had no structural syntactic or semantic information, and they also had to further approximate the already heuristic head word identification. This incomplete data complicated the systems' task and also complicates interpretation of the results. While most systems did slightly better here in terms of raw scores, the performance was not much different from the official task, indicating that mention boundary errors resulting from problems in parsing do not contribute significantly to the final output.29
29 It would be interesting to measure the overlap between the entity clusters for these two cases, to see whether there was any substantial difference in the mention chains, besides the expected differences in boundaries for individual mentions.
Gold Mentions Another supplementary condition that we explored was whether the systems could be supplied with the manually annotated spans for all and only those mentions that participate in the gold standard coreference chains. This supplies significantly more information than the previous case, where exact spans were supplied for all NPs, since the gold mentions will also include verb headwords that are linked to event NPs, but will not include singleton mentions, which do not end up as part of any chain. The latter constraint makes this test seem artificial, since it directly reveals part of what the systems are designed to determine, but it still has some value in quantifying the impact that mention detection and anaphoricity determination have on the overall task, and what the results are if they are perfectly known. The results show that performance does go up significantly, indicating that it is markedly easier for the systems to generate better entities given gold mentions. Although, ideally, one would expect a perfect mention detection score, many of the systems did not get 100% recall. This could possibly be owing to unlinked singletons that were removed in post-processing. The chang system, along with the fernandes system, are the only systems that got a perfect 100% recall. The reason is most likely that they had a hard constraint to link every mention with at least one other mention.
The chang system (77.22 [e7.00]) stands out in that it has a 7 point lead on the next best system in this category. This indicates that the linking algorithm for this system is significantly superior to those of the other systems – especially since the performance of the only other system that gets a 100% mention score is much lower (69.35 [e7.03]).
Gold Test Parses Looking at Table 20, it can be seen that there is a slight increase (∼1 point) in performance across all the systems when gold parses are used, across all settings – NB, GB, and GM. In the case of the bjorkelund system, for the NB setting, the overall performance improves by about a point when using gold parses during testing (61.24 [e0.02] vs 62.23 [e1.02]), but strangely, if gold parses are used during training as well, the performance is slightly lower (61.71 [e3.00]), although this difference is probably not statistically significant.
8.2 Chinese Closed
Table 21 shows the performance for the Chinese language in greater detail.
8.2.1 Official Setting
In this case, it turns out that the chen system does about 2 points better than the next best system across all the metrics. We know that this system had some more Chinese-specific improvements. It is strange that the fernandes system has a much lower mention recall, with a much higher precision, as compared to chen. As far as the system descriptions go, both systems seem to have used the same set of mentions – except for the chen system including QP phrases and not considering interrogative pronouns. One thing we found about the chen system is that it dealt with nested NPs differently in the NW genre. This unfortunately seems to be addressing a quirk in the Chinese newswire data owing to a possible data inconsistency.
8.2.2 Gold Mention Boundaries
Unlike for English, just the addition of gold mention boundaries improves the performance of almost all systems significantly. The delta improvement for the fernandes system turns out to be small, but it does gain on mention recall as compared to the NB case. It is not clear why this might be the case. One explanation could be that the parser performance for constituents that represent mentions – primarily NPs – is significantly worse than that for English. The mention recall of all the systems is boosted by roughly 10%.
8.2.3 Gold Mentions
Providing gold mention information further boosts all systems significantly. This is even more the case for the chen system [c8.00], which gains another 8 points over the gold mention boundary condition, in spite of the fact that it does not have perfect recall. On the other hand, the fernandes system gets perfect mention recall and precision, but ends up with a 10 point lower performance [c8.05] than the chen system. Another thing to note is that for the CEAFe metric, the incremental drop in performance from the best system to the next best and so on is substantial, with a difference of 17 points between the chen system and the fernandes system. It does seem that the chen and yuan systems' linking algorithms are much better than the others.
8.2.4 Gold Test Parses
When provided with gold parses for the test set, there is a substantial increase in performance in the NB condition – numerically more so than in the case of English. The degree of improvement decreases for the GB and GM conditions.
8.3 Arabic Closed
Table 22 shows the performance for the Arabic language in greater detail.
8.3.1 Official Setting
Unlike English or Chinese, none of the systems was particularly tuned for Arabic. This gives us a unique opportunity to test the performance of a mostly statistical, roughly language-independent mechanism, although there could still be a significant bias that the Arabic language brings to the mix. The overall performance for Arabic seems to be about ten points below both English and Chinese. On the mention detection front, most of the systems have balanced precision and recall, and the drop in performance across systems seems quite steady. The bjorkelund system has a slight edge over the fernandes system on the MUC, BCUBED and BLANC metrics, but fernandes has a much larger lead on both CEAF metrics, putting it on top in the official score. We have not reported the development set numbers here, but another thing to note, especially for Arabic, is that performance on the test set is significantly better than on the development set. This is probably because of the smaller size of the training set and therefore a higher relative increment over the training set.
[Table 20: Performance of systems in the primary and supplementary evaluations for the closed track for English. Each row (indexed e0.00–e9.00, as referenced in the text) lists the participant, the train/test syntax setting (automatic vs. gold parses) and the mention quality setting (NB = no gold boundaries, the primary setting; GB = gold mention boundaries; GM = gold mentions), mention detection R/P/F, coreference resolution R/P/F under the MUC, BCUBED, CEAFm, CEAFe and BLANC metrics, and the official score (F1 + F2 + F3)/3. The individual cell values are not legible in this transcript.]
[Table 21: Performance of systems in the primary and supplementary evaluations for the closed track for Chinese. The layout matches Table 20, with rows indexed c0.00–c9.06 as referenced in the text; the individual cell values are not legible in this transcript.]
[Table 22: Performance of systems in the primary and supplementary evaluations for the closed track for Arabic. The layout matches Table 20, with rows indexed a0.00–a9.00 as referenced in the text; the individual cell values are not legible in this transcript.]
The size of the training set (which is roughly a third of that for either English or Chinese) could also be a factor that explains the lower performance, and Arabic performance might gain from more data. The chen system did not use the development data for the final models; using it could have increased their score.
8.3.2 Gold Mention Boundaries
The system performance given gold boundaries followed the English trend more than the Chinese one: there was not much improvement over the primary NB evaluation. Interestingly, the chen system, which uses gold boundaries so effectively for Chinese, does not get any performance improvement. This might indicate that the technique that helped that system in Chinese does not generalize well across languages.
8.3.3 Gold Mentions
Performance given gold mentions seems to be about ten points higher than in the NB case. The bjorkelund system does better on the BLANC metric than fernandes even after taking a big hit in mention detection recall. In the absence of the chang system, the fernandes system is the only one that explicitly adds a constraint for the GM case and gets a perfect mention detection score. All the other systems fall significantly short on recall.
8.3.4 Gold Test Parses
Finally, providing gold parses during testing does not have much of an impact on the scores.
8.3.5 All Languages Open
Tables 23, 24 and 25 give the performance for the systems that participated in the open track. Not many systems participated in this track, so there is not a lot to observe. One thing to note is that the chen system modified its precise constructs sieve to add named entity information in the open track, which gave it a point improvement in performance. With gold mentions and gold syntax during testing, the chen system's performance almost approaches an F-score of 80 (79.79).
8.3.6 Headword-based and Genre-specific scores
Since last year's task showed only some very local differences in ranking between systems scored using strict boundaries and those scored using head-word based scoring, we did not compute the head-word based evaluation. Similarly, since there was no particular pattern in scores across genres for the CoNLL test set, we did not report genre-specific scores either; we have computed them, but there is no space for them in the paper, and they could be made available on the task webpage.
9 Comparison with CoNLL-2011
Table 26 shows the performance of the systems on the CoNLL-2011 test set. The models use about 200k words more training data for English, but this is a small fraction of the total. Given that the training data for CoNLL-2011 amounted to more than 80% of the CoNLL-2012 training data (1M vs 1.2M words respectively), and the fact that coreference scores have been shown to asymptote after a small fraction of the total training data, it can be inferred that the roughly 5% absolute gap between the best performing systems of last year and this year is most likely owing to algorithmic improvements, and possibly to better rules. Also, since the 2011 winning system was purely rule-based, it is unlikely that the 200k words of additional data would have added to the rules.
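As a quick check of the relative sizes cited above:

\[ \frac{1.0\,\mathrm{M}}{1.2\,\mathrm{M}} \approx 0.83 \]

so the CoNLL-2011 training set was indeed slightly over 80% of this year's, and the additional 0.2M words represent only about a 20% relative increase.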
It is interesting to note that although the winning system in the CoNLL-2011 task was a completely rule-based one, a modified version of the same system, used by shou and xiong, ranked close to tenth this year. This does indicate that a hybrid approach has some advantage over a purely rule-based system. The improvement seems to be owing mostly to higher precision in mention detection, MUC and BCUBED, and higher recall in CEAFe.
10 Conclusions
In this paper we described the anaphoric coreference information and other layers of annotation in the OntoNotes corpus, over three languages – English, Chinese and Arabic – and presented the results from an evaluation on learning such unrestricted entities and events in text. The following represent our conclusions on reviewing the results:
• Most top performing systems used a hybrid approach, combining rule-based strategies with machine learning. A rule-based approach does seem to bring a system close to the best-performing region. The most significant advantage of the rule-based approach seems to be that it captures the most confident links before considering less confident ones. Discourse information, when present, is quite helpful for disambiguating pronominal mentions. Using information from appositives and copular constructions seems beneficial for bridging across various lexicalized mentions. It is not clear how much more can be gained using further strategies. The features for coreference prediction are certainly more complex than for many other language processing tasks, which makes it more challenging to generate effective feature combinations.
[Table 23: Performance of systems in the primary and supplementary evaluations for the open track for English (participant: xiong). The layout matches Table 20; the individual cell values are not legible in this transcript.]
[Table 24: Performance of systems in the primary and supplementary evaluations for the open track for Chinese (participants: chen, yuan and xiong). The layout matches Table 20; the individual cell values are not legible in this transcript.]
[Table 25: Performance of systems in the primary and supplementary evaluations for the open track for Arabic (participant: xiong). The layout matches Table 20; the individual cell values are not legible in this transcript.]
[Table 26: Performance on the CoNLL-2011 test set, comparing the 2011 best system with this year's systems (fernandes, martschat, bjorkelund, chang, chen, stamborg, chunyang, yuan, shou, xu, uryupina, yang, xinxin, zhekova, li). The layout matches Table 20; the individual cell values are not legible in this transcript.]
• Gold parses during testing do seem to help quite a bit. Gold boundaries are not of much significance for English (and possibly Arabic), but seem to be very useful for Chinese. The reason probably has some roots in the parser performance gap for Chinese.
• It does seem that collecting information about an entity by merging information across the various attributes of the mentions that comprise it can be useful, though not all systems that attempted this achieved a benefit, and it has to be done carefully.
• It is noteworthy that systems did not seem to attempt the kind of joint inference that could make use of the full potential of the various layers available in OntoNotes, but this could well have been owing to the limited time available for the shared task.
• We had expected to see more attention paid to event coreference, which is a novel feature in this data, but again, given the time constraints and given that events represent only a small portion of the total, it is not surprising that most systems chose not to focus on it.
• Scoring coreference seems to remain a significant challenge. There does not seem to be an objective way to establish one metric in preference to another in the absence of a specific application. On the other hand, the system rankings do not seem terribly sensitive to the particular metric chosen. It is interesting that both versions of the CEAF metric – which try to capture the goodness of the entities in the output – produce scores much lower than the other metrics, though it is not clear whether that means the systems are doing a poor job of creating coherent entities or whether the metric is just especially harsh.
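To make the link-based versus entity-based contrast concrete, a standard formulation of MUC recall over key entities S_i, where p(S_i) is the partition of S_i induced by the response entities (precision is obtained by swapping key and response), is

\[ R_{\mathrm{MUC}} = \frac{\sum_i \bigl( |S_i| - |p(S_i)| \bigr)}{\sum_i \bigl( |S_i| - 1 \bigr)} \]

so a key entity {a, b, c} that a response splits into {a, b} and {c} scores a recall of (3 - 2)/(3 - 1) = 1/2. CEAF instead scores the best one-to-one alignment between key and response entities, evaluating whole entities rather than individual links, which is one reason the two metric families can diverge.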
Acknowledgments
We gratefully acknowledge the support of the Defense Advanced Research Projects Agency (DARPA/IPTO) under the GALE program, DARPA/CMO Contract No. HR0011-06-C-0022.
We would like to thank all the participants. Without their hard work, patience and perseverance this evaluation would not have been a success. We would also like to thank the Linguistic Data Consortium for making the OntoNotes 5.0 corpus freely and timely available in training/development/test sets to the participants, and Emili Sapena, who graciously allowed the use of his scorer implementation. Finally, we would like to thank Hwee Tou Ng and his student Zhong Zhi for training the word sense models and providing outputs for the training/development and test sets.