Using Corpora and Evaluation Tools

Diana Maynard

Kalina Bontcheva

http://gate.ac.uk/ http://nlp.shef.ac.uk/

March 2004

Corpus structure

• Located in gatecorpora in CVS

• Each directory under gatecorpora has a corpus, e.g. gatecorpora/ace

• Each corpus can have sub-parts, e.g. ace/bnews

• Each (sub-)corpus has a clean and a marked directory; these are important

• clean holds the unannotated version, while marked holds the human-marked ones

• There may also be a processed subdirectory – this is a datastore (unlike the other two)

• Corresponding files in each subdirectory must have the same name
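
The same-name requirement above can be checked mechanically. The following is a minimal sketch, not part of gatecorpora/utilities; the function name unmatched_files is hypothetical, and it assumes clean and marked are plain directories of files (the processed datastore is not inspected).

```python
from pathlib import Path


def unmatched_files(corpus_dir):
    """Return names found in only one of the clean/ and marked/ directories.

    Hypothetical helper: assumes both subdirectories exist and hold plain files.
    """
    corpus = Path(corpus_dir)
    clean = {p.name for p in (corpus / "clean").iterdir() if p.is_file()}
    marked = {p.name for p in (corpus / "marked").iterdir() if p.is_file()}
    # Symmetric difference: files that lack a same-named counterpart.
    return sorted(clean ^ marked)
```

For example, unmatched_files("gatecorpora/ace/bnews") would return an empty list when every clean file has a marked counterpart with the same name.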

Tools for corpus manipulation

• There are lots of tools available in gatecorpora/utilities and in subdirectories of each corpus

• Many of the corpora, e.g. MUC, ACE come in different formats (e.g. inline vs standoff markup) and have been converted to GATE-style annotations

• There are also tools for, e.g., counting things and changing annotation names (mostly JAPE grammars)

Corpora available

• MUC7 (newswires)

• MUSE (news texts from the web)

• ACE

• ACE Chinese

• ACE Arabic

• Romanian (news texts; 1984)

• CMU seminars

• Jobs

• CONLL’03 – part of Reuters with NEs

• Bulgarian – news

MUC 7 corpus

• Newswires used in the official MUC 7 evaluation

• Data available in MUC format and GATE format

• Annotation types: Person, Location, Organization, Money, Percent, Date, Time

• Division into training and test sets

MUSE corpus

• News texts from various websites (BBC, Guardian, etc.)

• Annotation types: Person, Organisation, Location, Date, Time, Money, Percent, Address

• Slight differences in annotation guidelines from MUC, e.g. people’s titles are included in names

• Available from gatecorpora/news in various subdirectories

ACE corpus

• 3 types of text: newswire, broadcast news and newspaper

• Broadcast news and newspaper available as ground truth and original (degraded) texts

• Annotation types: Person, Organisation, Location, GPE, Facility

• Some annotations have roles to indicate metonymous usage

• Guidelines are different from MUC and MUSE

• Available from gatecorpora/ace in various subdirectories

Multilingual ACE

• As for ACE, but in Chinese and Arabic

• Texts are in UTF-8

• No degraded versions of these texts

• Available from gatecorpora/ace/ace03/Chinese/ and gatecorpora/ace/ace03/Arabic/

CMU Seminars & Jobs

• Corpora frequently used to evaluate relation extraction and wrapper induction systems

• gatecorpora/jobs-corpus and gatecorpora/cmu-seminars

• Converted into GATE XML, ready for use

CONLL’03 shared task

• Corpus used in the CONLL’03 shared task for evaluating NE recognition

• In English, part of the Reuters corpus

• Markup uses tags such as <I-LOC>, which have not been converted to MUSE tags

• Use reuterstogate.jape to convert them to MUSE tags

• gatecorpora/ReutersWithNamedEntities
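
To illustrate the kind of tag conversion described above, here is a rough sketch in Python. It is not the contents of reuterstogate.jape: the mapping table and the function name conll_to_muse are assumptions made for illustration, and the MISC tag is left unmapped because the MUSE annotation types listed earlier have no obvious equivalent.

```python
# Hypothetical mapping for illustration only; the real conversion is done
# by the JAPE grammar reuterstogate.jape, whose rules are not shown here.
CONLL_TO_MUSE = {
    "PER": "Person",
    "ORG": "Organisation",
    "LOC": "Location",
}


def conll_to_muse(tag):
    """Map a CoNLL'03 tag such as 'I-LOC' to a MUSE-style type, or None."""
    if "-" not in tag:
        # Covers the outside tag 'O' and anything without a prefix.
        return None
    _prefix, entity = tag.split("-", 1)
    # Entities without a counterpart (e.g. MISC) map to None.
    return CONLL_TO_MUSE.get(entity)
```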

Annotation Diff: per-document evaluation
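
Per-document evaluation of this kind compares human-marked (key) annotations with system (response) annotations and reports precision, recall and F-measure. The sketch below shows only the strict case, where an annotation counts as correct when its span and type match exactly; it is not GATE's implementation, and the (start, end, type) tuple representation and the name strict_prf are assumptions.

```python
def strict_prf(key, response):
    """Compute strict precision, recall and F1.

    key and response are sets of (start, end, type) tuples; an annotation
    is correct only if the identical tuple appears in both sets.
    """
    correct = len(key & response)
    precision = correct / len(response) if response else 0.0
    recall = correct / len(key) if key else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

With a key of {(0, 5, "Person"), (10, 16, "Location")} and a response of {(0, 5, "Person"), (20, 25, "Date")}, one of two response annotations is correct and one of two key annotations is found, so precision, recall and F1 are all 0.5.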

Regression Test

At corpus level – corpus benchmark tool – tracking system’s performance over time

How it works

• Clean, marked, and processed

• corpus_tool.properties – must be in the directory from where GATE is executed

• Specifies configuration information about

– What annotation types are to be evaluated

– Threshold below which to print out debug info

– Input set name and key set name

• Modes

– Default – regression testing

– Human marked against already stored, processed

– Human marked against current processing results
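
The idea behind the threshold and the regression-testing mode can be sketched as follows. This is an illustration under stated assumptions, not the corpus benchmark tool itself: per-document F-measures are assumed to be available as plain dictionaries, and the name report_regressions is hypothetical.

```python
def report_regressions(stored, current, threshold):
    """Compare current per-document F-measures with stored ones.

    stored and current map document names to F-measures. Returns two
    sorted lists: documents whose current score is below the threshold
    (candidates for debug output), and documents whose score dropped
    relative to the stored results.
    """
    below = sorted(doc for doc, f in current.items() if f < threshold)
    regressed = sorted(
        doc for doc, f in current.items()
        if doc in stored and f < stored[doc]
    )
    return below, regressed
```

For instance, with stored scores {"a.xml": 0.9, "b.xml": 0.8}, current scores {"a.xml": 0.85, "b.xml": 0.6, "c.xml": 0.95} and a threshold of 0.7, only b.xml falls below the threshold, while both a.xml and b.xml have regressed.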

Conclusion

This talk: http://gate.ac.uk/sale/talks/corpora-tutorial.ppt

More information: http://gate.ac.uk/