Multilingual Corpora Workshop, 27 March 2003

Corpora and Evaluation Tools for Multilingual Named Entity Grammar Development

Christian Bering, Witold Drożdżyński, Gregor Erbach,

Clara Guasch, Petr Homola, Sabine Lehmann, Hong Li,

Hans-Ulrich Krieger, Jakub Piskorski, Ulrich Schäfer,

Atsuko Shimada, Melanie Siegel, Feiyu Xu, Dorothee Ziegler-Eisele

DFKI GmbH, LT Lab, Saarbrücken

Saarland University Computational Linguistics Dept, Saarbrücken

Acrolinx GmbH, Berlin


Outline

Motivation

SProUT – shallow processing toolkit

Multilingual NE grammar development

  Shared output structures

  Shared token classes

  Shared grammars

Multilingual NE corpora

Evaluation tool


Motivation

Named Entity Recognition is fundamental to a number of information management applications (search engines, question answering, text mining …)

Many of these applications deal with different languages

Development of multilingual named entity grammars, supported by BMBF in the projects WHITEBOARD and COLLATE, and by the EU in the project AIRFORCE


Challenges in multilingual NER

Different alphabets, character sets and character encodings

Different tokenization conventions

Different time and currency formats

Different representations of proper names

  Identical (New York, George Bush, IBM)

  Different for some languages (London vs. Londres, Firenze vs. Florence vs. Florenz, NATO vs. OTAN, München vs. Munich vs. Monaco)


SProUT - Objectives

platform for the development of multilingual and domain adaptive shallow text processing and information extraction systems

trade-off between efficiency and expressiveness

modularity (fine-grained modeling of linguistic components into clear-cut modules)

portability and industrial standards


System Architecture

[Figure: system architecture diagram. Labels: Grammar Development Environment; Online Processing; Finite-State Toolkit; Regular Compiler; Shallow Grammar; Shallow Grammar Interpreter; JTFS; Extended Optimized FST Representation; Lexical Resources; Linguistic Processing Resources; Input Data; Stream of Text Items; Structured Output Data]


System Components

linguistic processing resources

tokenizer (easily adaptable for Indo-European languages)

gazetteer

morphology component (8 languages)

named entity recognition (6 languages)

core tools

JTFS

FSM toolkit

regular compiler

shallow grammar interpreter

tries for NLP processing


TFS and TFS-XML

TFS as data interchange format in SProUT

unification and subsumption check as basic operations for evaluation

compact XML encoding of typed feature structures (following TEI-SGML)

exchange format for linguistic resources:

grammars

feature structure tree banks

exchange format for visualization


TFS-XML: Example

<FS type="pred_argument">
  <F name="PRED"> <FS type="übernehmen"/> </F>
  <F name="AGENT">
    <FS coref="1" type="argument">
      <F name="NAME"> <FS type="Maria_Müller"/> </F>
    </FS>
  </F>
  <F name="THEME">
    <FS coref="2" type="argument">
      <F name="NOM"> <FS type="Vorsitz"/> </F>
    </FS>
  </F>
</FS>
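
To make the encoding concrete, here is a minimal Python sketch that reads the TFS-XML example into nested dictionaries. The dictionary layout and the function name `fs_to_dict` are assumptions made for illustration; they are not part of SProUT or of the TFS-XML format itself.

```python
import xml.etree.ElementTree as ET

TFS_XML = """
<FS type="pred_argument">
  <F name="PRED"><FS type="übernehmen"/></F>
  <F name="AGENT">
    <FS coref="1" type="argument">
      <F name="NAME"><FS type="Maria_Müller"/></F>
    </FS>
  </F>
  <F name="THEME">
    <FS coref="2" type="argument">
      <F name="NOM"><FS type="Vorsitz"/></F>
    </FS>
  </F>
</FS>
""".strip()


def fs_to_dict(fs):
    """Convert an <FS> element into a nested dict (illustrative layout only)."""
    node = {"type": fs.get("type")}
    if fs.get("coref") is not None:
        node["coref"] = fs.get("coref")
    # Each <F name="..."> child carries one feature whose value is an <FS>.
    for feature in fs.findall("F"):
        value = feature.find("FS")
        if value is not None:
            node[feature.get("name")] = fs_to_dict(value)
    return node


tfs = fs_to_dict(ET.fromstring(TFS_XML))
print(tfs["AGENT"]["NAME"]["type"])
```

The sketch keeps the `coref` index as a plain attribute and does not resolve shared substructures.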


Morphological Resources

Indo-European language resources

English 200,000 entries (Mmorph (Multext))

German 830,000 entries (Mmorph (Multext))

French 225,000 entries (Mmorph (Multext))

Spanish 570,000 entries (Mmorph (Multext))

Italian 330,000 entries (Mmorph (Multext))

Czech 600,000 entries (Institute of Formal and Applied Linguistics in Prague)

Asian language resources

Chinese Shanxi-Tokenizer

Japanese ChaSen


Architecture

Mmorph full-form lexica are stored as tries

External modules (Asian and Czech) are integrated via client/server

[Figure: architecture diagram. Labels: Parser; Tokeniser; Mmorph; ChaSen; Czech; Shanxi]


A SProUT Grammar Rule (XTDL)


Unification

[Figure panels: Matched input structure; Extended Rule Structure After Match; Fully Unified Structure]
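
As a rough illustration of what unifying a matched input structure with a rule structure involves, the following Python sketch unifies feature structures represented as nested dictionaries. It is a deliberate simplification and not the JTFS implementation: atomic values must be identical, there is no type hierarchy, and coreferences are ignored. The example values are hypothetical.

```python
def unify(a, b):
    """Unify two feature structures given as nested dicts; return None on a clash.

    Simplification: atoms must be equal, no type hierarchy, no coreference.
    """
    if isinstance(a, dict) and isinstance(b, dict):
        result = dict(a)
        for feature, b_value in b.items():
            if feature in result:
                merged = unify(result[feature], b_value)
                if merged is None:
                    return None  # incompatible values under the same feature
                result[feature] = merged
            else:
                result[feature] = b_value
        return result
    return a if a == b else None


def subsumes(general, specific):
    """True if every constraint in `general` also holds in `specific`."""
    if isinstance(general, dict) and isinstance(specific, dict):
        return all(
            feature in specific and subsumes(value, specific[feature])
            for feature, value in general.items()
        )
    return general == specific


# Hypothetical structures for illustration.
matched = {"type": "ne-person", "GIVEN_NAME": "Maria"}
rule = {"type": "ne-person", "SURNAME": "Müller"}
unified = unify(matched, rule)
print(unified)
```

The unified result is subsumed by both inputs, which is the property an evaluation based on subsumption checks can rely on.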



Multilingual Named Entity Grammars

Languages

English, French, German, Spanish

Chinese, Japanese

Grammar Style

MUC-7/MET-2 named entity classes with some variations

• ENAMEX: person, location, organisation

• TIMEX: time point, time span (instead of date, time)

• NUMEX: percentage, money

Named entity types with internal attribute-value structures, e.g.,

span := timex & [FROM point,
                 TO point].


Multilingual NE grammar development

Our approach

Shared output structures

Shared token classes

Shared grammars


Shared Output Structures

The grammars for all six languages produce the same, semantically oriented output structures, defined in TDL

ne_type := sign & [DESCRIPTOR string].
enamex := ne_type.
ne-person := enamex & [TITLE list-of-strings,
                       GIVEN_NAME list-of-strings,
                       SURNAME list-of-strings,
                       P-POSITION list-of-strings,
                       NAME-SUFFIX string].
ne-location := enamex & [LOCTYPE loc-type,
                         LOCNAME string].
loc-type :< atom.
river := loc-type.
continent := loc-type.
country := loc-type.
province := loc-type.
city := loc-type.


Shared Token Classes

A single set of token classes is used for the European languages

NATURAL_NUMBER 12344
FLOATING_POINT_NUMBER 123,43
NUMBER_PERCENT_COMPOUND 34,4%
NUMBER_DOT_COMPOUND 234.345.545.
NUMBER_WORD_COMPOUND 2,4-fachen
DIGIT_SLASH_COMPOUND 12/01/1998
DIGIT_DASH_COMPOUND 12-01-1998
DIGIT_COLON_COMPOUND 15:13
ALL_CAPS_WORD ABC
LOWERCASE_WORD tokenization
FIRST_CAPITAL_WORD Microsoft
MIXED_WORD_FIRST_CAPITAL GmbH
MIXED_WORD_FIRST_LOWER dKK
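
The token classes above can be approximated with a small classifier. The Python sketch below is an assumption-laden illustration: the regular expressions are inferred only from the example tokens on the slide, the `OTHER` fallback is hypothetical, and none of this reflects how the SProUT tokenizer is actually implemented.

```python
import re

# Number-like classes, tried in order. Patterns are guesses based on the
# example tokens shown on the slide, not the SProUT definitions.
NUMERIC_PATTERNS = [
    ("NATURAL_NUMBER", r"\d+"),
    ("FLOATING_POINT_NUMBER", r"\d+,\d+"),
    ("NUMBER_PERCENT_COMPOUND", r"\d+(?:,\d+)?%"),
    ("NUMBER_DOT_COMPOUND", r"\d+(?:\.\d+)+\.?"),
    ("NUMBER_WORD_COMPOUND", r"\d+(?:,\d+)?-[^\W\d_]+"),
    ("DIGIT_SLASH_COMPOUND", r"\d+(?:/\d+)+"),
    ("DIGIT_DASH_COMPOUND", r"\d+(?:-\d+)+"),
    ("DIGIT_COLON_COMPOUND", r"\d+(?::\d+)+"),
]


def classify_token(token):
    """Return the token class name for a single token."""
    for name, pattern in NUMERIC_PATTERNS:
        if re.fullmatch(pattern, token):
            return name
    if token.isalpha():
        if token.isupper():
            return "ALL_CAPS_WORD"
        if token.islower():
            return "LOWERCASE_WORD"
        if token[0].isupper():
            # Capital first letter: either "Microsoft" or mixed case like "GmbH".
            if token[1:].islower():
                return "FIRST_CAPITAL_WORD"
            return "MIXED_WORD_FIRST_CAPITAL"
        if token[0].islower():
            return "MIXED_WORD_FIRST_LOWER"
    return "OTHER"  # hypothetical fallback, not a class from the slide


for example in ["12344", "34,4%", "2,4-fachen", "GmbH", "dKK"]:
    print(example, classify_token(example))
```

Because the word classes rely on Unicode-aware string methods rather than ASCII ranges, accented capitals such as "München" fall into the same classes as unaccented ones.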


Shared Grammars

SProUT supports re-use and extension of grammars

This feature has been used for the development of multilingual parallel grammars for English, Spanish and French

Common parts of the grammar for different languages (e.g. date formats like "20.10.2003") are stored in one file, and combined with the language-specific parts of the grammars (for structures like "20 de octubre del 2003")

Common proper names such as "Amsterdam" are stored in a generic gazetteer, while language-specific names such as "Brussels", "Bruxelles", "Bruselas" are stored in language-specific lists
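
The split between a generic gazetteer and language-specific lists can be pictured as a two-level lookup. The Python sketch below is only an illustration: the dictionary names, the language codes and the lookup order are assumptions, and the `city` value merely borrows one of the location types from the shared output structures.

```python
# Shared entries, valid for every language (example name from the slide).
GENERIC_GAZETTEER = {"Amsterdam": "city"}

# Language-specific entries; language codes are assumed for illustration.
LANGUAGE_GAZETTEERS = {
    "en": {"Brussels": "city"},
    "fr": {"Bruxelles": "city"},
    "es": {"Bruselas": "city"},
}


def lookup(name, language):
    """Look a name up in the language-specific list first, then the generic one."""
    specific = LANGUAGE_GAZETTEERS.get(language, {})
    if name in specific:
        return specific[name]
    return GENERIC_GAZETTEER.get(name)


print(lookup("Amsterdam", "fr"))
print(lookup("Bruselas", "es"))
print(lookup("Bruxelles", "en"))
```

A change to a shared entry is made once in the generic table and takes effect for every language, which is the consistency benefit of sharing.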


Advantages of shared grammars

Grammars are more easily re-usable and extendible

Consistency is improved, as changes to shared structures need to be made in only one place

Grammar development is more efficient, less time-consuming and less error-prone

The same methodology has been applied for combining general-language grammars with domain-specific grammars


Re-use of corpora

We use NE-annotated corpora for grammar development and evaluation of grammars

Special-purpose annotation of corpora is only feasible for large-scale evaluations such as MUC, but exceeds the resources of most application-oriented projects

Corpora from other projects are re-used in order to save labour and have larger evaluation resources

There may be mismatches between corpus annotation and grammar output


Multilingual NE corpora

English corpora from the MUC7 evaluations

Japanese and Chinese corpora annotated according to MUC7 conventions

German corpora annotated in the COLLATE project with a superset of MUC7 annotations

German, English, French and Spanish texts annotated with Named Entities, from Joint Research Centre

Spanish data from the CoNLL-2002 Language-Independent NER task

English and French corpora from the business domain annotated with named entities according to the MUC7 guidelines within our project


Issues with re-use of corpora

The corpora contain differences in

  Annotation format

  Types of named entities annotated

  Attributes used to describe each NE

Superficial differences in annotation format are handled by conversion to XML

Differences in the content of the annotation are not handled by modification of the corpora, but rather by making our evaluation tool more flexible


Structure of Annotated Articles

<Firmenmeldung Annotator="…" ID="…" Status="…">
  <teiHeader>
    <fileDesc>
      <titleStmt>
        <author>…</author>
      </titleStmt>
      <publicationStmt>
        <publisher>…</publisher>
        <pubPlace>…</pubPlace>
        <date>…</date>
      </publicationStmt>
      <sourceDesc>
        <bibl>
          <agency>…</agency>
          <page>…</page>
          <topic>…</topic>
          <domain>…</domain>
        </bibl>
      </sourceDesc>
    </fileDesc>
  </teiHeader>
  <sourceText>…</sourceText>
  …
  <text>…</text>
</Firmenmeldung>

Annotation layers: semantic relations; named entities + coreference


Annotation of Semantic Relations

acquisition, company, corporateStructure, dividends, newBusiness, offer, occupation, premiumIncome, profit, relocation, revenue, turnover

Robert Bosch GmbH, Stuttgart: Der Kfz-Zulieferkonzern übernimmt zum 1. Januar die van Doorne's Transmissie b. v., Tilburg. Das niederländische Unternehmen, das im letzten Jahr mit 220 Mitarbeitern einen Umsatz von 45 Millionen DM erzielte, entwickelt stufenlose automatische Automobilgetriebe (CVT = Continuously Variable Transmission) und produziert Komponenten für CVT.

<Firma Branche="Kfz-Zulieferkonzern" Firma="Robert Bosch" Rechtsform="GmbH" Sitz="Stuttgart"/>

<Firma Firma="van Doorne's Transmissie" Land="NL" Rechtsform="b. v." Sitz="Tilburg"/>

<Beschaeftigung Firma="van Doorne's Transmissie" Mitarbeiter="220"/>

<Umsatz Betrag="45 Mill." Firma="van Doorne's Transmissie" Waehrung="DEM"/>

<Uebernahme Kaeufer="Robert Bosch" Objekt="van Doorne's Transmissie"/>


Annotation of Named Entities

function, location, money, number, ordinalNumber, organization, percent, personName, productName, scaleUnit, time

Robert Bosch GmbH, Stuttgart: Der Kfz-Zulieferkonzern übernimmt zum 1. Januar die van Doorne's Transmissie b. v., Tilburg. Das niederländische Unternehmen, das im letzten Jahr mit 220 Mitarbeitern einen Umsatz von 45 Millionen DM erzielte, entwickelt stufenlose automatische Automobilgetriebe (CVT = Continuously Variable Transmission) und produziert Komponenten für CVT.

<NE Organisation="Robert Bosch GmbH">Robert Bosch GmbH</NE>, <NE Ort="Stuttgart">Stuttgart</NE>: Der Kfz-Zulieferkonzern übernimmt zum <NE Zeit="01.01.">1. Januar</NE> die <NE Organisation="van Doorne's Transmissie b. v.">van Doorne's Transmissie</NE>, <NE Ort="Tilburg">Tilburg</NE>.


Annotation of Coreference

3rd person personal pronouns

3rd person possessive pronouns and determiners

demonstrative pronouns and determiners

indefinite pronouns and determiners

anaphoric and cataphoric adverbs

elliptical nominal phrases

anaphoric and cataphoric nominal phrases

LM Ericsson AB, Stockholm: Der schwedische Elektronikkonzern hat …

<exp id="101">LM Ericsson AB</exp>, Stockholm: <exp id="102"><ptr src="101"/>Der schwedische Elektronikkonzern</exp> hat …
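
To show how the `exp`/`ptr` markup can be read, here is a small Python sketch that collects the marked expressions and the links from each anaphor to its antecedent. The wrapping `<s>` element and the function name are assumptions added so that the fragment parses as a single XML document.

```python
import xml.etree.ElementTree as ET

# The <s> wrapper is hypothetical; the annotated fragment is taken from the slide.
SNIPPET = (
    '<s><exp id="101">LM Ericsson AB</exp>, Stockholm: '
    '<exp id="102"><ptr src="101"/>Der schwedische Elektronikkonzern</exp> hat …</s>'
)


def coreference_links(root):
    """Return (mentions, links): exp id -> text, and anaphor id -> antecedent id."""
    mentions, links = {}, {}
    for exp in root.iter("exp"):
        # itertext() also picks up text that follows the empty <ptr/> element.
        mentions[exp.get("id")] = "".join(exp.itertext()).strip()
        ptr = exp.find("ptr")
        if ptr is not None:
            links[exp.get("id")] = ptr.get("src")
    return mentions, links


mentions, links = coreference_links(ET.fromstring(SNIPPET))
for anaphor, antecedent in links.items():
    print(mentions[anaphor], "->", mentions[antecedent])
```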


Cooperation: Annotation of FR

LLX: FrameNet annotation

<REQUEST><SPKR> SPD </SPKR> <FEE> fordert </FEE> <ADD> Koalition </ADD> <MSG> zu Gespräch über Reform </MSG> <FEE> auf </FEE>. </REQUEST>

<CONVERSATION>SPD fordert <INTLC-1> Koalition </INTLC-1> zu <FEE> Gespräch </FEE> <TOPIC> über Reform </TOPIC> auf. </CONVERSATION>

TIGER: syntactic annotation

<s id="s37">
  <graph root="s37_503">
    <terminals>
      <t id="s37_1" word="Ausgerechnet" pos="ADJD" morph="--" />
      <t id="s37_2" word="Iggy" pos="NE" morph="Masc.Nom.Sg" />
      <t id="s37_3" word="Pop" pos="NE" morph="*.Nom.Sg" />
      <t id="s37_4" word="verk&#x00f6;rpert" pos="VVFIN" morph="3.Sg.Pres.Ind" />
      <t id="s37_5" word="gesanglich" pos="ADJD" morph="Pos" />
      ...
    </terminals>
    <nonterminals>
      <nt id="s37_500" cat="MPN">
        <edge label="PNC" idref="s37_2"/>
        <edge label="PNC" idref="s37_3"/>
      </nt>
      <nt id="s37_501" cat="NP">
        <edge label="NK" idref="s37_6"/>
        <edge label="NK" idref="s37_7"/>
      </nt>
      ...
    </nonterminals>
  </graph>
</s>


Cooperation: Multi-layer Annotation


LLX: FrameNet annotation


TIGER: syntactic annotation

<Firmenmeldung Annotator="keku" ID="SZ_401" Status="1">
  <teiHeader>
    <fileDesc>
      <titleStmt>
        <author/>
      </titleStmt>
      <publicationStmt>
        <publisher>SZ</publisher>
        <date>1995-03-31</date>
      </publicationStmt>
      <sourceDesc>
        <bibl>
          <agency>vwd</agency>
          <page>22</page>
          <topic>Wirtschaft</topic>
          <domain>Firmenmeldungen</domain>
        </bibl>
      </sourceDesc>
    </fileDesc>
  </teiHeader>
  <sourceText>Datev eG, Nürnberg: Der EDV-Dienstleister für Steuerberater hat 1994 den Umsatz laut vorläufigen Zahlen um 5% auf rund 980 Mill. DM gesteigert. Die Anzahl der Mitarbeiter ist auf 4605 (4474) Beschäftigte gestiegen, die Zahl der Genossenschaftsmitglieder zog auf 34246 (33551) an. Die Investitionen von 115 (93) Mill. DM haben sich in erster Linie auf die Modernisierung der Großrechner, den PC-Bereich sowie auf ein automatisches Versandlager konzentriert.</sourceText>
  <Firma Branche1="EDV-Dienstleister für Steuerberater" Firma="Datev eG" Sitz1="Nürnberg" Rechtsform="eG"/>
  <Umsatz Firma="Datev eG" Differenz="5%" Trend="plus" Betrag1="980 Mill." Waehrung1="DEM" Beschreibung1="rund" Zeit="1994"/>
  <Beschaeftigung Firma="Datev eG" Trend="plus" Mitarbeiter1_alt="4474" Mitarbeiter1_neu="4605" Zeit="1994"/>
  <text><NE Organisation="Datev eG">Datev eG</NE>, <NE Ort="Nürnberg">Nürnberg</NE>: Der EDV-Dienstleister für Steuerberater hat <NE Zeit="1994">1994</NE> den Umsatz laut vorläufigen Zahlen um <NE Prozentzahl="5%">5%</NE> auf <NE Geld="rund 980 Mill. DEM">rund 980 Mill. DM</NE> gesteigert. Die Anzahl der Mitarbeiter ist auf <NE Zahl="4605">4605</NE> (<NE Zahl="4474">4474</NE>) Beschäftigte gestiegen, die Zahl der Genossenschaftsmitglieder zog auf <NE Zahl="34246">34246</NE> (<NE Zahl="33551">33551</NE>) an. Die Investitionen von <NE Geld="115 (93) Mill. DEM">115 (93) Mill. DM</NE> haben sich in erster Linie auf die Modernisierung der Großrechner, den PC-Bereich sowie auf ein automatisches Versandlager konzentriert.</text>
</Firmenmeldung>

COLLATE: semantic annotation

=> multi-layer annotated language resource


Evaluation Tool: jTaCo

Evaluates grammars with respect to an annotated corpus

Removes annotations from the corpus, and feeds the unannotated text to the grammar

Compares grammar output with the original annotated texts

Produces detailed statistics, evaluation scores, and diagnostic output
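
The evaluation cycle described above can be sketched in a few lines of Python. This is not jTaCo code: the function names, the span representation `(start, end, class)` and the exact-match scoring are assumptions for illustration, and the sketch only handles `NE` elements that are direct, non-nested children of the text element.

```python
import xml.etree.ElementTree as ET

# Shortened annotated text, based on the COLLATE example from the slides.
ANNOTATED = (
    '<text><NE Organisation="Robert Bosch GmbH">Robert Bosch GmbH</NE>, '
    '<NE Ort="Stuttgart">Stuttgart</NE>: Der Kfz-Zulieferkonzern …</text>'
)


def strip_annotations(xml_text):
    """Return (plain_text, gold), where gold is a set of (start, end, class)."""
    root = ET.fromstring(xml_text)
    pieces, gold = [], set()
    offset = 0

    def emit(chunk):
        nonlocal offset
        if chunk:
            pieces.append(chunk)
            offset += len(chunk)

    emit(root.text)
    for ne in root:
        start = offset
        emit("".join(ne.itertext()))
        ne_class = next(iter(ne.attrib))  # the attribute name encodes the class
        gold.add((start, offset, ne_class))
        emit(ne.tail)
    return "".join(pieces), gold


def score(gold, predicted):
    """Exact-match precision, recall and F1 over sets of spans."""
    correct = len(gold & predicted)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


plain, gold = strip_annotations(ANNOTATED)
# Hypothetical grammar output: the second entity gets the wrong class.
predicted = {(0, 17, "Organisation"), (19, 28, "Organisation")}
print(score(gold, predicted))
```

In a real run, `predicted` would come from applying the grammar to `plain`; the hand-written set here only stands in for that step.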


Configuration of jTaCo

jTaCo can be configured to deal with various problems in evaluating grammars with respect to a corpus:

Use of different classes of NE, or different granularities (e.g. organization and subclasses company, university etc.)

  Declaration of class equivalence and subclass relationships.

Extent of NE may be different (CEO Bill Gates vs. Bill Gates)

  Left or right boundary may be mismatched. Size of allowable mismatch can be specified for each NE class.

Markup of corpus may be textually oriented (XML tags) while grammar output may be a different data structure (e.g. semantics encoded in feature structure)

  No general solution is possible. In case of SProUT, feature structures are linked with input tokens, so that a correspondence can be established (under development).
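
The first two configuration options can be pictured as lookup tables consulted when a gold annotation is compared with a grammar result. The Python sketch below is hypothetical throughout: the table names, the class labels, the character-based tolerance and the matching rule are assumptions for illustration and do not reproduce jTaCo's configuration format.

```python
# Hypothetical configuration tables.
CLASS_EQUIVALENT = {"Organisation": "organization"}  # corpus label -> grammar label
CLASS_PARENT = {"company": "organization", "university": "organization"}
# Allowable (left, right) boundary mismatch per class, counted in characters.
BOUNDARY_TOLERANCE = {"person": (4, 0)}


def normalize(ne_class):
    """Map a class label to its declared equivalent, if any."""
    return CLASS_EQUIVALENT.get(ne_class, ne_class)


def class_compatible(gold_class, predicted_class):
    """True if the predicted class equals the gold class or is a subclass of it."""
    gold_class = normalize(gold_class)
    current = normalize(predicted_class)
    while current is not None:
        if current == gold_class:
            return True
        current = CLASS_PARENT.get(current)
    return False


def spans_match(gold, predicted):
    """Compare two (start, end, class) spans under the configured tolerances."""
    g_start, g_end, g_class = gold
    p_start, p_end, p_class = predicted
    if not class_compatible(g_class, p_class):
        return False
    left, right = BOUNDARY_TOLERANCE.get(normalize(g_class), (0, 0))
    return abs(g_start - p_start) <= left and abs(g_end - p_end) <= right


# "CEO Bill Gates" annotated as 0..14 in the corpus, "Bill Gates" found as 4..14.
print(spans_match((0, 14, "person"), (4, 14, "person")))
```

In this sketch a predicted subclass is accepted for a more general gold class but not the other way round; whether the reverse should also count is a design decision left open here.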


Architecture of jTaCo


Conclusion

We discussed a fundamental problem in re-using heterogeneously annotated corpora for multilingual grammar development

With increasing availability of annotated corpora, re-use becomes attractive and cost-effective

We described methods and tools for re-using annotated corpora for development and evaluation of NE grammars