Multilingual Corpora Workshop, 27 March 2003

Corpora and Evaluation Tools for Multilingual Named Entity Grammar Development

Christian Bering, Witold Drożdżyński, Gregor Erbach,

Clara Guasch, Petr Homola, Sabine Lehmann, Hong Li,

Hans-Ulrich Krieger, Jakub Piskorski, Ulrich Schäfer,

Atsuko Shimada, Melanie Siegel, Feiyu Xu, Dorothee Ziegler-Eisele

DFKI GmbH, LT Lab, Saarbrücken

Saarland University Computational Linguistics Dept, Saarbrücken

Acrolinx GmbH, Berlin


Outline

Motivation

SProUT – shallow processing toolkit

Multilingual NE grammar development

• Shared output structures

• Shared token classes

• Shared grammars

Multilingual NE corpora

Evaluation tool


Motivation

Named Entity Recognition is fundamental to a number of information management applications (search engines, question answering, text mining …)

Many of these applications deal with different languages

Development of multilingual named entity grammars, supported by BMBF in the projects WHITEBOARD and COLLATE, and by the EU in the project AIRFORCE


Challenges in multilingual NER

Different alphabets, character sets and character encodings

Different tokenization conventions

Different time and currency formats

Different representations of proper names

• Identical (New York, George Bush, IBM)

• Different for some languages (London vs. Londres, Firenze vs. Florence vs. Florenz, NATO vs. OTAN, München vs. Munich vs. Monaco)


SProUT - Objectives

platform for the development of multilingual and domain adaptive shallow text processing and information extraction systems

trade-off between efficiency and expressiveness

modularity (fine-grained modeling of linguistic components into clear-cut modules)

portability and industrial standards


System Architecture

[Figure: SProUT system architecture. Diagram labels: grammar development environment; online processing; finite-state toolkit; regular compiler; shallow grammar; shallow grammar interpreter; JTFS; extended optimized FST representation; lexical resources; linguistic processing resources; input data (stream of text items); structured output data]


System Components

linguistic processing resources

tokenizer (easily adaptable for Indo-European languages)

gazetteer

morphology component (8 languages)

named entity recognition (6 languages)

core tools

JTFS

FSM toolkit

regular compiler

shallow grammar interpreter

tries for NLP processing


TFS and TFS-XML

TFS as data interchange format in SProUT

unification and subsumption check as basic operations for evaluation

compact XML encoding of typed feature structures (following TEI-SGML)

exchange format for linguistic resources:

grammars

feature structure tree banks

exchange format for visualization
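The two basic operations named above, unification and subsumption checking, can be illustrated with a minimal Python sketch. It is not the JTFS implementation: it assumes untyped feature structures modelled as nested dicts with atomic string values, and it ignores the type hierarchy and coreferences that real typed feature structures carry. The function names `unify` and `subsumes` are hypothetical.

```python
def unify(a, b):
    """Return the unification of two feature structures, or None on a clash.

    Simplifying assumption: feature structures are nested dicts and atomic
    values are plain strings (no types, no coreference).
    """
    if isinstance(a, dict) and isinstance(b, dict):
        result = dict(a)
        for feature, value in b.items():
            if feature in result:
                merged = unify(result[feature], value)
                if merged is None:
                    return None  # incompatible values below this feature
                result[feature] = merged
            else:
                result[feature] = value
        return result
    # Atomic values unify only if they are identical.
    return a if a == b else None


def subsumes(general, specific):
    """True if all information in `general` is also present in `specific`."""
    if isinstance(general, dict):
        if not isinstance(specific, dict):
            return False
        return all(
            feature in specific and subsumes(value, specific[feature])
            for feature, value in general.items()
        )
    return general == specific


city = {"LOCTYPE": "city"}
stuttgart = {"LOCNAME": "Stuttgart"}
unified = unify(city, stuttgart)
```

In an evaluation setting, a subsumption check of this kind would let a less specific gold annotation accept a more specific grammar output.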


TFS-XML: Example

<FS type="pred_argument">
  <F name="PRED"> <FS type="übernehmen"/> </F>
  <F name="AGENT">
    <FS coref="1" type="argument">
      <F name="NAME"> <FS type="Maria_Müller"/> </F>
    </FS>
  </F>
  <F name="THEME">
    <FS coref="2" type="argument">
      <F name="NOM"> <FS type="Vorsitz"/> </F>
    </FS>
  </F>
</FS>
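As an illustration of how such a TFS-XML document could be consumed, the following sketch reads the example above with Python's standard `xml.etree.ElementTree` and converts it to nested dicts. The function `fs_to_dict` and the choice of dict keys (`type`, `coref`) are assumptions made for this example, not part of SProUT; a feature literally named `type` or `coref` would collide with them.

```python
import xml.etree.ElementTree as ET

TFS_XML = """
<FS type="pred_argument">
  <F name="PRED"> <FS type="übernehmen"/> </F>
  <F name="AGENT">
    <FS coref="1" type="argument">
      <F name="NAME"> <FS type="Maria_Müller"/> </F>
    </FS>
  </F>
  <F name="THEME">
    <FS coref="2" type="argument">
      <F name="NOM"> <FS type="Vorsitz"/> </F>
    </FS>
  </F>
</FS>
"""


def fs_to_dict(fs):
    """Convert an <FS> element into a nested dict (hypothetical helper)."""
    node = {"type": fs.get("type")}
    if fs.get("coref") is not None:
        node["coref"] = fs.get("coref")
    # Each <F name="..."> child holds one feature whose value is an <FS>.
    for feature in fs.findall("F"):
        value = feature.find("FS")
        if value is not None:
            node[feature.get("name")] = fs_to_dict(value)
    return node


tree = fs_to_dict(ET.fromstring(TFS_XML.strip()))
```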


Morphological Resources

Indo-European language resources

English    200,000 entries   Mmorph (Multext)
German     830,000 entries   Mmorph (Multext)
French     225,000 entries   Mmorph (Multext)
Spanish    570,000 entries   Mmorph (Multext)
Italian    330,000 entries   Mmorph (Multext)
Czech      600,000 entries   Institute of Formal and Applied Linguistics in Prague

Asian language resources

Chinese    Shanxi-Tokenizer
Japanese   ChaSen


Architecture

Mmorph fullform lexica are stored as tries

External modules (Asian and Czech) are integrated via client/server

[Figure: architecture diagram with the components Parser, Tokeniser, Mmorph, ChaSen, Czech and Shanxi]
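To make the trie storage of fullform lexica concrete, here is a minimal sketch of a character trie that maps inflected word forms to their analyses. It is only an illustration under simple assumptions: the class `Trie`, its methods, and the sample analysis tuple are hypothetical and do not reflect the data layout actually used in SProUT.

```python
class Trie:
    """Character trie mapping full word forms to lists of analyses."""

    def __init__(self):
        self.children = {}   # next character -> child node
        self.entries = []    # analyses for the form ending at this node

    def insert(self, form, analysis):
        node = self
        for ch in form:
            node = node.children.setdefault(ch, Trie())
        node.entries.append(analysis)

    def lookup(self, form):
        node = self
        for ch in form:
            node = node.children.get(ch)
            if node is None:
                return []    # no entry shares this prefix
        return list(node.entries)


lexicon = Trie()
# Hypothetical entry: (lemma, morphological reading).
lexicon.insert("übernimmt", ("übernehmen", "3.Sg.Pres.Ind"))
```

Lookup time depends on the length of the word form rather than on the number of lexicon entries, which is one common reason for choosing a trie for large fullform lexica.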


A SProUT Grammar Rule (XTDL)



Unification

[Figure: matched input structure; extended rule structure after match; fully unified structure]


Multilingual Named Entity Grammars

Languages

English, French, German, Spanish

Chinese, Japanese

Grammar Style

MUC-7/MET-2 named entity classes with some variations

• ENAMEX: person, location, organisation

• TIMEX: time point, time span (instead of date, time)

• NUMEX: percentage, money

Named entity types with internal attribute-value structures, e.g.,

span := timex & [FROM point,
                 TO point].


Multilingual NE grammar development

Our approach

• Shared output structures

• Shared token classes

• Shared grammars


Shared Output Structures

The grammars for all six languages produce the same, semantically oriented output structures, defined in TDL

ne_type := sign & [DESCRIPTOR string].
enamex := ne_type.
ne-person := enamex & [TITLE list-of-strings,
                       GIVEN_NAME list-of-strings,
                       SURNAME list-of-strings,
                       P-POSITION list-of-strings,
                       NAME-SUFFIX string].
ne-location := enamex & [LOCTYPE loc-type,
                         LOCNAME string].
loc-type :< atom.
river := loc-type.
continent := loc-type.
country := loc-type.
province := loc-type.
city := loc-type.


Shared Token Classes

A single set of token classes is used for the European languages

Token class                 Example
NATURAL_NUMBER              12344
FLOATING_POINT_NUMBER       123,43
NUMBER_PERCENT_COMPOUND     34,4%
NUMBER_DOT_COMPOUND         234.345.545.
NUMBER_WORD_COMPOUND        2,4-fachen
DIGIT_SLASH_COMPOUND        12/01/1998
DIGIT_DASH_COMPOUND         12-01-1998
DIGIT_COLON_COMPOUND        15:13
ALL_CAPS_WORD               ABC
LOWERCASE_WORD              tokenization
FIRST_CAPITAL_WORD          Microsoft
MIXED_WORD_FIRST_CAPITAL    GmbH
MIXED_WORD_FIRST_LOWER      dKK
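A token classifier for these classes might look roughly like the sketch below. The class names and examples come from the slide, but the regular expressions and word tests are guesses reverse-engineered from the single example given for each class, so they should be read as assumptions rather than the SProUT tokenizer's actual definitions. The fallback class `OTHER` is hypothetical.

```python
import re

# Patterns inferred from one example per class (assumption, see lead-in).
NUMERIC_CLASSES = [
    ("NATURAL_NUMBER", r"\d+"),
    ("FLOATING_POINT_NUMBER", r"\d+,\d+"),
    ("NUMBER_PERCENT_COMPOUND", r"\d+(?:[.,]\d+)?%"),
    ("NUMBER_DOT_COMPOUND", r"\d+(?:\.\d+)+\.?"),
    ("NUMBER_WORD_COMPOUND", r"\d+(?:[.,]\d+)?-[^\W\d_]+"),
    ("DIGIT_SLASH_COMPOUND", r"\d+(?:/\d+)+"),
    ("DIGIT_DASH_COMPOUND", r"\d+(?:-\d+)+"),
    ("DIGIT_COLON_COMPOUND", r"\d+(?::\d+)+"),
]


def classify(token):
    """Return the token class name for a single token."""
    for name, pattern in NUMERIC_CLASSES:
        if re.fullmatch(pattern, token):
            return name
    if token.isalpha():
        if token.isupper():
            return "ALL_CAPS_WORD"
        if token.islower():
            return "LOWERCASE_WORD"
        if token[0].isupper():
            # Only the first letter is upper case: plain capitalised word.
            if token[1:].islower():
                return "FIRST_CAPITAL_WORD"
            return "MIXED_WORD_FIRST_CAPITAL"
        return "MIXED_WORD_FIRST_LOWER"
    return "OTHER"  # hypothetical fallback, not one of the listed classes
```

Because the same classes are produced for every European language, grammar rules that refer to them can be shared across languages without change.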


Shared Grammars

SProUT supports re-use and extension of grammars

This feature has been used for the development of multilingual parallel grammars for English, Spanish and French

Common parts of the grammar for different languages (e.g. date formats like "20.10.2003") are stored in one file, and combined with the language-specific parts of the grammars (for structures like "20 de octubre del 2003")

Common proper names such as "Amsterdam" are stored in a generic gazetteer, while language-specific names such as "Brussels", "Bruxelles", "Bruselas" are stored in language-specific lists
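The split between a generic gazetteer and language-specific lists can be pictured as a two-level lookup. The sketch below is an assumption about how such a lookup could be organised, not a description of the SProUT gazetteer component; the dictionaries, the language codes and the function `lookup` are hypothetical, and the value `city` is borrowed from the `loc-type` values shown earlier.

```python
# Names written identically in all languages (shared resource).
GENERIC = {"Amsterdam": "city"}

# Names whose spelling differs per language (language-specific resources).
LANGUAGE_SPECIFIC = {
    "en": {"Brussels": "city"},
    "fr": {"Bruxelles": "city"},
    "es": {"Bruselas": "city"},
}


def lookup(name, lang):
    """Check the language-specific list first, then the generic gazetteer."""
    specific = LANGUAGE_SPECIFIC.get(lang, {})
    if name in specific:
        return specific[name]
    return GENERIC.get(name)
```

With this layout, adding a new language only requires a new language-specific list; the generic entries are available immediately.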


Advantages of shared grammars

Grammars are more easily re-usable and extensible

Consistency is improved, as changes to shared structures need to be made in only one place

Grammar development is more efficient, less time-consuming and less error-prone

The same methodology has been applied for combining general-language grammars with domain-specific grammars


Re-use of corpora

We use NE-annotated corpora for grammar development and evaluation of grammars

Special-purpose annotation of corpora is only feasible for large-scale evaluations such as MUC, but exceeds the resources of most application-oriented projects

Corpora from other projects are re-used in order to save labour and have larger evaluation resources

There may be mismatches between corpus annotation and grammar output


Multilingual NE corpora

English corpora from the MUC-7 evaluations

Japanese and Chinese corpora annotated according to MUC-7 conventions

German corpora annotated in the COLLATE project with a superset of MUC-7 annotations

German, English, French and Spanish texts annotated with named entities, from the Joint Research Centre

Spanish data from the CoNLL-2002 Language-Independent NER task

English and French corpora from the business domain, annotated with named entities according to the MUC-7 guidelines within our project


Issues with re-use of corpora

The corpora contain differences in

• Annotation format

• Types of named entities annotated

• Attributes used to describe each NE

Superficial differences in annotation format are handled by conversion to XML

Differences in the content of the annotation are not handled by modification of the corpora, but rather by making our evaluation tool more flexible


Structure of Annotated Articles

<Firmenmeldung Annotator="…" ID="…" Status="…">
  <teiHeader>
    <fileDesc>
      <titleStmt>
        <author>…</author>
      </titleStmt>
      <publicationStmt>
        <publisher>…</publisher>
        <pubPlace>…</pubPlace>
        <date>…</date>
      </publicationStmt>
      <sourceDesc>
        <bibl>
          <agency>…</agency>
          <page>…</page>
          <topic>…</topic>
          <domain>…</domain>
        </bibl>
      </sourceDesc>
    </fileDesc>
  </teiHeader>
  <sourceText>…</sourceText>
  …
  <text>…</text>
</Firmenmeldung>

Semantic relations are annotated as elements between <sourceText> and <text>

Named entities and coreference are annotated inside <text>


Annotation of Semantic Relations

Relation types: acquisition, company, corporateStructure, dividends, newBusiness, offer, occupation, premiumIncome, profit, relocation, revenue, turnover

Robert Bosch GmbH, Stuttgart: Der Kfz-Zulieferkonzern übernimmt zum 1. Januar die van Doorne's Transmissie b. v., Tilburg. Das niederländische Unternehmen, das im letzten Jahr mit 220 Mitarbeitern einen Umsatz von 45 Millionen DM erzielte, entwickelt stufenlose automatische Automobilgetriebe (CVT = Continuously Variable Transmission) und produziert Komponenten für CVT.

<Firma Branche="Kfz-Zulieferkonzern" Firma="Robert Bosch" Rechtsform="GmbH" Sitz="Stuttgart"/>

<Firma Firma="van Doorne's Transmissie" Land="NL" Rechtsform="b. v." Sitz="Tilburg"/>

<Beschaeftigung Firma="van Doorne's Transmissie" Mitarbeiter="220"/>

<Umsatz Betrag="45 Mill." Firma="van Doorne's Transmissie" Waehrung="DEM"/>

<Uebernahme Kaeufer="Robert Bosch" Objekt="van Doorne's Transmissie"/>


Annotation of Named Entities

Named entity types: function, location, money, number, ordinalNumber, organization, percent, personName, productName, scaleUnit, time

Robert Bosch GmbH, Stuttgart: Der Kfz-Zulieferkonzern übernimmt zum 1. Januar die van Doorne's Transmissie b. v., Tilburg. Das niederländische Unternehmen, das im letzten Jahr mit 220 Mitarbeitern einen Umsatz von 45 Millionen DM erzielte, entwickelt stufenlose automatische Automobilgetriebe (CVT = Continuously Variable Transmission) und produziert Komponenten für CVT.

<NE Organisation="Robert Bosch GmbH">Robert Bosch GmbH</NE>, <NE Ort="Stuttgart">Stuttgart</NE>: Der Kfz-Zulieferkonzern übernimmt zum <NE Zeit="01.01.">1. Januar</NE> die <NE Organisation="van Doorne's Transmissie b. v.">van Doorne's Transmissie</NE>, <NE Ort="Tilburg">Tilburg</NE>.


Annotation of Coreference

3rd person personal pronouns

3rd person possessive pronouns and determiners

demonstrative pronouns and determiners

indefinite pronouns and determiners

anaphoric and cataphoric adverbs

elliptical nominal phrases

anaphoric and cataphoric nominal phrases

LM Ericsson AB, Stockholm: Der schwedische Elektronikkonzern hat …

<exp id="101">LM Ericsson AB</exp>, Stockholm: <exp id="102"><ptr src="101"/>Der schwedische Elektronikkonzern</exp> hat …


Cooperation: Annotation of FR

LLX: FrameNet annotation

<REQUEST><SPKR> SPD </SPKR> <FEE> fordert </FEE> <ADD> Koalition </ADD> <MSG> zu Gespräch über Reform </MSG> <FEE> auf </FEE>. </REQUEST>

<CONVERSATION>SPD fordert <INTLC-1> Koalition </INTLC-1> zu <FEE> Gespräch </FEE> <TOPIC> über Reform </TOPIC> auf. </CONVERSATION>

TIGER: syntactic annotation

<s id="s37">
  <graph root="s37_503">
    <terminals>
      <t id="s37_1" word="Ausgerechnet" pos="ADJD" morph="--" />
      <t id="s37_2" word="Iggy" pos="NE" morph="Masc.Nom.Sg" />
      <t id="s37_3" word="Pop" pos="NE" morph="*.Nom.Sg" />
      <t id="s37_4" word="verkörpert" pos="VVFIN" morph="3.Sg.Pres.Ind" />
      <t id="s37_5" word="gesanglich" pos="ADJD" morph="Pos" />
      ...
    </terminals>
    <nonterminals>
      <nt id="s37_500" cat="MPN">
        <edge label="PNC" idref="s37_2"/>
        <edge label="PNC" idref="s37_3"/>
      </nt>
      <nt id="s37_501" cat="NP">
        <edge label="NK" idref="s37_6"/>
        <edge label="NK" idref="s37_7"/>
      </nt>
      ...
    </nonterminals>
  </graph>
</s>


Cooperation: Multi-layer Annotation

TIGER: syntactic annotation

<s id="s37">
  <graph root="s37_503">
    <terminals>
      <t id="s37_1" word="Ausgerechnet" pos="ADJD" morph="--" />
      <t id="s37_2" word="Iggy" pos="NE" morph="Masc.Nom.Sg" />
      <t id="s37_3" word="Pop" pos="NE" morph="*.Nom.Sg" />
      <t id="s37_4" word="verkörpert" pos="VVFIN" morph="3.Sg.Pres.Ind" />
      <t id="s37_5" word="gesanglich" pos="ADJD" morph="Pos" />
      ...
    </terminals>
    <nonterminals>
      <nt id="s37_500" cat="MPN">
        <edge label="PNC" idref="s37_2"/>
        <edge label="PNC" idref="s37_3"/>
      </nt>
      <nt id="s37_501" cat="NP">
        <edge label="NK" idref="s37_6"/>
        <edge label="NK" idref="s37_7"/>
      </nt>
      ...
    </nonterminals>
  </graph>
</s>

LLX: FrameNet annotation

<REQUEST><SPKR> SPD </SPKR> <FEE> fordert </FEE> <ADD> Koalition </ADD> <MSG> zu Gespräch über Reform </MSG> <FEE> auf </FEE>. </REQUEST>

<CONVERSATION>SPD fordert <INTLC-1> Koalition </INTLC-1> zu <FEE> Gespräch </FEE> <TOPIC> über Reform </TOPIC> auf. </CONVERSATION>

COLLATE: semantic annotation

<Firmenmeldung Annotator="keku" ID="SZ_401" Status="1">
  <teiHeader>
    <fileDesc>
      <titleStmt>
        <author/>
      </titleStmt>
      <publicationStmt>
        <publisher>SZ</publisher>
        <date>1995-03-31</date>
      </publicationStmt>
      <sourceDesc>
        <bibl> <agency>vwd</agency> <page>22</page> <topic>Wirtschaft</topic> <domain>Firmenmeldungen</domain></bibl>
      </sourceDesc>
    </fileDesc>
  </teiHeader>
  <sourceText>Datev eG, Nürnberg: Der EDV-Dienstleister für Steuerberater hat 1994 den Umsatz laut vorläufigen Zahlen um 5% auf rund 980 Mill. DM gesteigert. Die Anzahl der Mitarbeiter ist auf 4605 (4474) Beschäftigte gestiegen, die Zahl der Genossenschaftsmitglieder zog auf 34246 (33551) an. Die Investitionen von 115 (93) Mill. DM haben sich in erster Linie auf die Modernisierung der Großrechner, den PC-Bereich sowie auf ein automatisches Versandlager konzentriert.</sourceText>
  <Firma Branche1="EDV-Dienstleister für Steuerberater" Firma="Datev eG" Sitz1="Nürnberg" Rechtsform="eG"/>
  <Umsatz Firma="Datev eG" Differenz="5%" Trend="plus" Betrag1="980 Mill." Waehrung1="DEM" Beschreibung1="rund" Zeit="1994"/>
  <Beschaeftigung Firma="Datev eG" Trend="plus" Mitarbeiter1_alt="4474" Mitarbeiter1_neu="4605" Zeit="1994"/>
  <text><NE Organisation="Datev eG">Datev eG</NE>, <NE Ort="Nürnberg">Nürnberg</NE>: Der EDV-Dienstleister für Steuerberater hat <NE Zeit="1994">1994</NE> den Umsatz laut vorläufigen Zahlen um <NE Prozentzahl="5%">5%</NE> auf <NE Geld="rund 980 Mill. DEM">rund 980 Mill. DM</NE> gesteigert. Die Anzahl der Mitarbeiter ist auf <NE Zahl="4605">4605</NE> (<NE Zahl="4474">4474</NE>) Beschäftigte gestiegen, die Zahl der Genossenschaftsmitglieder zog auf <NE Zahl="34246">34246</NE> (<NE Zahl="33551">33551</NE>) an. Die Investitionen von <NE Geld="115 (93) Mill. DEM">115 (93) Mill. DM</NE> haben sich in erster Linie auf die Modernisierung der Großrechner, den PC-Bereich sowie auf ein automatisches Versandlager konzentriert.</text>
</Firmenmeldung>

=> multi-layer annotated language resource


Evaluation Tool: jTaCo

Evaluates grammars with respect to an annotated corpus

Removes annotations from the corpus and feeds the unannotated text to the grammar

Compares grammar output with the original annotated texts

Produces detailed statistics, evaluation scores, and diagnostic output
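The evaluation cycle described above (strip annotations, run the grammar, compare, score) can be sketched in a few lines of Python. This is not jTaCo's code: the regular expression assumes the inline `<NE Class="...">...</NE>` markup shown in the COLLATE examples, entities are compared as exact `(start, end, class)` triples over character offsets, and the function names are hypothetical. The grammar output in the usage lines is invented for illustration.

```python
import re

# Assumes inline markup of the form <NE Class="value">surface</NE>.
NE_TAG = re.compile(r'<NE (\w+)="[^"]*">(.*?)</NE>')


def strip_annotations(annotated):
    """Return (plain_text, gold) with gold = set of (start, end, ne_class)."""
    pieces, gold = [], set()
    pos = 0      # read position in the annotated string
    offset = 0   # length of the plain text produced so far
    for match in NE_TAG.finditer(annotated):
        before = annotated[pos:match.start()]
        surface = match.group(2)
        pieces.append(before)
        offset += len(before)
        gold.add((offset, offset + len(surface), match.group(1)))
        pieces.append(surface)
        offset += len(surface)
        pos = match.end()
    pieces.append(annotated[pos:])
    return "".join(pieces), gold


def score(gold, predicted):
    """Precision and recall over exactly matching entities."""
    correct = len(gold & predicted)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall


annotated = ('<NE Organisation="Datev eG">Datev eG</NE>, '
             '<NE Ort="Nürnberg">Nürnberg</NE>: Der EDV-Dienstleister')
plain, gold = strip_annotations(annotated)
# Invented grammar output: one correct entity, one spurious one.
predicted = {(0, 8, "Organisation"), (20, 23, "Ort")}
precision, recall = score(gold, predicted)
```

Exact matching is the strictest possible comparison; the configuration options on the next slide relax it.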


Configuration of jTaCo

jTaCo can be configured to deal with various problems in evaluating grammars with respect to a corpus:

Use of different classes of NE, or different granularities (e.g. organization and subclasses company, university etc.)

• Declaration of class equivalence and subclass relationships

Extent of NE may be different (CEO Bill Gates vs. Bill Gates)

• Left or right boundary may be mismatched. Size of allowable mismatch can be specified for each NE class

Markup of corpus may be textually oriented (XML tags) while grammar output may be a different data structure (e.g. semantics encoded in feature structure)

• No general solution is possible. In the case of SProUT, feature structures are linked with input tokens, so that a correspondence can be established (under development)
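The first two configuration options (class equivalence and subclass declarations, and a per-class boundary tolerance) could be modelled as in the sketch below. The configuration tables, their contents and the function names are all hypothetical; jTaCo's real configuration format is not shown on the slides. Spans are given as token indices, and a grammar output counts as a match if its class equals the corpus class or is a declared subclass of it, which is one possible reading of the slide.

```python
# Hypothetical configuration tables (not jTaCo's actual format).
EQUIVALENT = {"Organisation": "organization", "Ort": "location"}
SUPERCLASS = {"company": "organization", "university": "organization"}
MAX_MISMATCH = {"person": 1}  # allowed boundary slack in tokens, per class


def normalise(ne_class):
    """Map a corpus-specific class name to its declared equivalent."""
    return EQUIVALENT.get(ne_class, ne_class)


def is_a(ne_class, target):
    """True if ne_class equals target or is a (transitive) subclass of it."""
    current, target = normalise(ne_class), normalise(target)
    while current is not None:
        if current == target:
            return True
        current = SUPERCLASS.get(current)
    return False


def matches(gold, predicted):
    """Compare two (start_token, end_token, ne_class) triples."""
    g_start, g_end, g_class = gold
    p_start, p_end, p_class = predicted
    if not is_a(p_class, g_class):
        return False
    slack = MAX_MISMATCH.get(normalise(g_class), 0)
    return abs(g_start - p_start) <= slack and abs(g_end - p_end) <= slack
```

For the slide's example, a corpus annotation covering "Bill Gates" (tokens 1 to 3 of "CEO Bill Gates") and a grammar output covering all three tokens would match under a tolerance of one token for the person class.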


Architecture of jTaCo


Conclusion

We discussed a fundamental problem in re-using heterogeneously annotated corpora for multilingual grammar development

With increasing availability of annotated corpora, re-use becomes attractive and cost-effective

We described methods and tools for re-using annotated corpora for development and evaluation of NE grammars
