GATE: A Unicode-based Infrastructure Supporting Multilingual Information Extraction

Kalina Bontcheva, Diana Maynard, Valentin Tablan, Hamish Cunningham

Department of Computer Science, University of Sheffield

http://gate.ac.uk/

Structure of the talk:

• A brief introduction to GATE

• Multilingual infrastructure in GATE

• Simple multilingual IE components


GATE is...

• An architecture: a macro-level organisational picture for LE software systems.
• A framework: for programmers, GATE is an object-oriented class library that implements the architecture.
• A development environment: for language engineers, computational linguists et al., a graphical development environment.

GATE comes with...

• Some free components...
• ...and wrappers for other people's components
• Tools for: evaluation; visualisation/editing; persistence; IR; IE; dialogue; ontologies; etc.
• Free software (LGPL). Download at http://gate.ac.uk/download/

Architectural principles

• Non-prescriptive, theory neutral (strength and weakness)
• Re-use, interoperation, not reimplementation (e.g. diverse XML support, integration of Protégé, Jena, Weka...)
• (Almost) everything is a component, and component sets are user-extendable
• (Almost) all operations are available both from API and GUI

Component-based development

CREOLE – Collection of REusable Objects for Language Engineering:
• Java Beans: an OO way of chunking software
• GATE components: modified Java Beans with XML configuration
• The minimal component = 10 lines of Java, 10 lines of XML, 1 URL
• Three types: Language Resources, Processing Resources, Visual Resources

Why bother?
• Allows the system to load arbitrary language processing components

Language Resources (LRs)

• LRs are documents, ontologies, corpora, lexicons, ...
• LRs can be associated with DataStores (Oracle, PostgreSQL, XML, Java Serialisation)
• Documents / corpora:
– Diverse document formats: text, HTML, XML, email, RTF, SGML
– Optional format-preserving markup analyse / save
• Standoff annotation model (start, end, type, features), derivative of TIPSTER, compatible with ATLAS and XCES
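The standoff annotation model above can be illustrated with a minimal sketch. This is not the GATE API: the `Annotation` class, the `covered_text` helper and the sample features are hypothetical names chosen for illustration. The point is that annotations live beside the text and refer to it only by character offsets, so the text itself is never modified.

```python
from dataclasses import dataclass, field


@dataclass
class Annotation:
    """A standoff annotation: offsets into the text plus a type and features."""
    start: int                 # offset of the first covered character
    end: int                   # offset one past the last covered character
    type: str                  # e.g. "Person", "Location", "Token"
    features: dict = field(default_factory=dict)


def covered_text(text: str, ann: Annotation) -> str:
    """Return the span of the original text that the annotation points at."""
    return text[ann.start:ann.end]


# The text is stored untouched; the annotations are kept separately.
text = "Kalina works in Sheffield."
annotations = [
    Annotation(0, 6, "Person", {"source": "example"}),
    Annotation(16, 25, "Location", {"source": "example"}),
]

for ann in annotations:
    print(ann.type, repr(covered_text(text, ann)), ann.features)
```

Because Python strings are indexed by code point, the same offsets work unchanged for text in any script.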

Coping with diverse character encodings:
• New internationalised versions of the JVM support >100 different encodings.
• Other encodings: developing a system for user entry of mapping tables (removes programming from the process)
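As a rough illustration of what a user-entered mapping table could do, the sketch below decodes a single-byte encoding from a plain table of byte values to Unicode characters. The table contents, the function name `decode_with_table` and the fallback policy are all assumptions for this example; the slides do not describe the actual table format.

```python
# Hypothetical mapping table: byte value -> Unicode character.
# In the scenario on the slide such a table would be typed in by a user
# rather than hard-coded by a programmer.
MAPPING_TABLE = {
    0xC0: "\u0410",  # CYRILLIC CAPITAL LETTER A
    0xC1: "\u0411",  # CYRILLIC CAPITAL LETTER BE
    0xC2: "\u0412",  # CYRILLIC CAPITAL LETTER VE
}


def decode_with_table(data: bytes, table: dict, fallback: str = "\ufffd") -> str:
    """Decode bytes to a Unicode string using a byte-to-character table.

    Bytes below 0x80 are treated as ASCII (an assumption of this sketch);
    unmapped high bytes become the Unicode replacement character.
    """
    chars = []
    for byte in data:
        if byte < 0x80:
            chars.append(chr(byte))
        else:
            chars.append(table.get(byte, fallback))
    return "".join(chars)


print(decode_with_table(b"\xc0\xc1\xc2 ok", MAPPING_TABLE))
```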

Processing Resources (PRs)

• Algorithmic components known as PRs – beans with execute methods.
• All PRs can handle Unicode data by default.
• Clear distinction between code and data (simple repurposing).
• 20-30 freebies with GATE
• Controllers: execute a set of PRs
– SerialController: sequential run of arbitrary PR set
– SerialAnalyserController: analyser PRs over corpus
– Conditional controllers: execution depends on features
– Parallel controller?
• PRs + Controller = Applications
• Application parameterisation state can be saved and restored, and used for embedding / batching
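The "PRs + Controller = Applications" idea can be sketched in a few lines. The class names below echo the slide, but this is a toy model written for illustration, not the GATE class library: the `Document` class, the two example PRs and the `execute_corpus` method are assumptions.

```python
import re


class Document:
    """Toy document: text plus standoff annotations and document features."""
    def __init__(self, text):
        self.text = text
        self.annotations = []   # tuples of (start, end, type, features)
        self.features = {}


class WhitespaceTokeniser:
    """Toy PR: adds a Token annotation for every run of non-space characters."""
    def execute(self, doc):
        for match in re.finditer(r"\S+", doc.text):
            doc.annotations.append((match.start(), match.end(), "Token", {}))


class TokenCounter:
    """Toy PR: stores the number of Token annotations as a document feature."""
    def execute(self, doc):
        count = sum(1 for ann in doc.annotations if ann[2] == "Token")
        doc.features["tokenCount"] = count


class SerialController:
    """Runs an arbitrary list of PRs one after another on a document."""
    def __init__(self, prs):
        self.prs = list(prs)

    def execute(self, doc):
        for pr in self.prs:
            pr.execute(doc)


class SerialAnalyserController(SerialController):
    """Runs the same PR sequence over every document in a corpus."""
    def execute_corpus(self, corpus):
        for doc in corpus:
            self.execute(doc)


corpus = [Document("GATE handles Unicode"), Document("два слова")]
application = SerialAnalyserController([WhitespaceTokeniser(), TokenCounter()])
application.execute_corpus(corpus)
print([doc.features["tokenCount"] for doc in corpus])
```

Keeping each PR behind a single `execute` method is what lets the controller stay ignorant of what the components actually do.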

Visual Resources (VRs)

VRs (2): Coreference

VRs (3): Syntax

Displaying Multilingual Data

GATE uses the standard (& imperfect) Java rendering engine for displaying text.

GATE Unicode Kit (GUK)

Complements Java’s facilities:
• Support for defining Input Methods (IMs)
• Currently 30 IMs for 17 languages
• Pluggable in other applications (e.g. JEdit, EUDICO)
• Can use virtual keyboard or standard layouts over QWERTY
• IMs defined in plain text files
• GUK comes with a standalone Unicode editor

Editing Multilingual Data

Processing Multilingual Data

All processing, visualisation and editing tools use GUK

Multilingual IE Components

The ANNIE system – a reusable and easily extendable set of components

The Unicode Tokeniser

A very portable component for multiple languages:
• splits text into typed tokens based on FSM
• dynamically constructed from rules based on character categories defined by the Unicode standard, e.g.:
UPPERCASE_LETTER(LOWERCASE_LETTER|DASH_PUNCTUATION)* > Token;orth=upperInitial;kind=word;
• output generally localised by a later module (e.g. “don’t” … “do” “n’t”)
• 23 rules seem able to handle Indo-European languages without changes.
• the English tokeniser: Unicode tokeniser + pattern grammar FST
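The example rule above can be approximated with Python's `unicodedata` module, whose two-letter general-category codes correspond to the category names in the rule (`Lu`, `Ll`, `Pd`). This is a hand-written matcher for that single rule, not the dynamically constructed FSM the slide describes; the function names and the dictionary layout of a token are assumptions, and characters that do not start a match are simply skipped.

```python
import unicodedata

# Unicode general-category codes for the class names used in the rule.
UPPERCASE_LETTER = "Lu"
LOWERCASE_LETTER = "Ll"
DASH_PUNCTUATION = "Pd"


def match_upper_initial(text, pos):
    """Match UPPERCASE_LETTER (LOWERCASE_LETTER|DASH_PUNCTUATION)* at pos.

    Returns the end offset of the match, or None if the rule does not apply.
    """
    if pos >= len(text) or unicodedata.category(text[pos]) != UPPERCASE_LETTER:
        return None
    end = pos + 1
    while end < len(text) and unicodedata.category(text[end]) in (
        LOWERCASE_LETTER,
        DASH_PUNCTUATION,
    ):
        end += 1
    return end


def tokenise(text):
    """Emit a Token for every match of the rule; skip everything else."""
    tokens = []
    pos = 0
    while pos < len(text):
        end = match_upper_initial(text, pos)
        if end is None:
            pos += 1
            continue
        tokens.append({
            "start": pos,
            "end": end,
            "string": text[pos:end],
            "kind": "word",
            "orth": "upperInitial",
        })
        pos = end
    return tokens


for token in tokenise("in Шеффилд and Cebu"):
    print(token)
```

Since the rule refers only to Unicode character categories, the same code matches capitalised words in Cyrillic and Latin script without any language-specific changes.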

POS tagging in new languages

• TIDES Surprise Language: Hepple tagger but substituted Cebuano/Hindi lexicon for English

• Used empty ruleset since no training data available

• Used default heuristics (e.g. return NNP for capitalised words)

• Very experimental, but reasonable results
• 67% correctness for Hindi and 75% for Cebuano
• Adaptation time per language - 2 days
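A minimal sketch of the "lexicon plus default heuristics, empty ruleset" setup follows. It is not the Hepple tagger: the lexicon entries are toy examples with guessed tags, and apart from the NNP-for-capitalised-words heuristic mentioned above, the remaining defaults (CD for digits, NN otherwise) are assumptions made for illustration.

```python
# Toy lexicon standing in for the substituted Cebuano lexicon.
# Entries and tags are illustrative only.
TOY_LEXICON = {
    "ang": "DT",
    "sa": "IN",
    "balay": "NN",
}


def default_tag(token):
    """Fallback heuristics used when a word is not in the lexicon."""
    if token[:1].isupper():
        return "NNP"   # heuristic named in the slides: capitalised -> NNP
    if token.isdigit():
        return "CD"    # assumed extra heuristic
    return "NN"        # assumed catch-all


def tag_tokens(tokens, lexicon):
    """Tag by lexicon lookup first, then by heuristics; no contextual rules."""
    tagged = []
    for token in tokens:
        tag = lexicon.get(token) or lexicon.get(token.lower())
        if tag is None:
            tag = default_tag(token)
        tagged.append((token, tag))
    return tagged


print(tag_tokens(["Ang", "balay", "sa", "Cebu", "2003"], TOY_LEXICON))
```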

Porting NE grammars

• Most English JAPE rules based on POS tags and gazetteer lookup

• Grammars can be reused for languages with similar word order, orthography etc.

• No time to make detailed study of Cebuano, but very similar in structure to English

• Most of the rules left as for English, but some adjustments to handle especially dates

• Used both English and Cebuano grammars and gazetteers, because NEs appear in both languages
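The idea of running English and Cebuano gazetteers side by side can be sketched as below. This is plain Python, not JAPE: the gazetteer entries are a handful of illustrative words, and the single date rule (a month lookup followed by a number) is a made-up stand-in for the real grammar rules.

```python
# Illustrative gazetteer fragments: lower-cased entry -> major type.
ENGLISH_GAZETTEER = {"january": "month", "february": "month"}
CEBUANO_GAZETTEER = {"enero": "month", "pebrero": "month"}


def lookup(tokens, *gazetteers):
    """Return the gazetteer type of each token, or None if it is not listed."""
    merged = {}
    for gazetteer in gazetteers:
        merged.update(gazetteer)
    return [merged.get(token.lower()) for token in tokens]


def find_dates(tokens, *gazetteers):
    """Toy rule: a month lookup followed by a number is annotated as a Date."""
    types = lookup(tokens, *gazetteers)
    dates = []
    for i in range(len(tokens) - 1):
        if types[i] == "month" and tokens[i + 1].isdigit():
            dates.append(" ".join(tokens[i:i + 2]))
    return dates


tokens = "Enero 2003 and January 2004".split()
print(find_dates(tokens, ENGLISH_GAZETTEER))                     # English lists only
print(find_dates(tokens, ENGLISH_GAZETTEER, CEBUANO_GAZETTEER))  # both languages
```

With the English lists alone the rule misses the Cebuano month name, which is the motivation for loading both sets of resources when entities appear in both languages.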

TIDES Evaluation Results

            Cebuano              English Baseline
Entity      P     R     F        P     R     F
Person      71    65    68       36    36    36
Org         75    71    73       31    47    38
Location    73    78    76       65    7     12
Date        83    100   92       42    58    49
Total       76    79    77.5     45    41.7  43
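Assuming F in the table is the balanced F-measure (the harmonic mean of precision and recall), the figures can be checked with a short function. The `f_measure` name and the `beta` parameter are choices made for this sketch. Because the printed P and R values are themselves rounded, some rows recomputed this way can differ from the printed F by a point.

```python
def f_measure(precision, recall, beta=1.0):
    """F-measure of precision and recall; beta=1 gives the harmonic mean."""
    if precision == 0 and recall == 0:
        return 0.0
    beta_sq = beta * beta
    return (1 + beta_sq) * precision * recall / (beta_sq * precision + recall)


# Cebuano system, Person row: P=71, R=65
print(round(f_measure(71, 65)))
# Cebuano system, Total row: P=76, R=79
print(round(f_measure(76, 79), 1))
```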

Conclusion

• GATE – a Unicode-based NLP infrastructure, particularly suitable for multilingual adaptation of IE systems

• Requires little involvement of native speakers and very little annotated data for a basic job

• Future work
– Improving multilingual support, e.g., morphology support, automatic language and encoding identification
– Learning gazetteer lists from annotated corpora
