What’s the difference between Tony Blair and Mother Theresa? (Human Language Technology for Preservation return on investment) http://gate.ac.uk/ http://nlp.shef.ac.uk/ Hamish Cunningham, Dept. Computer Science, University of Sheffield. Alghero, March 2004


Page 1: What’s the difference between Tony Blair and Mother Theresa? (Human Language Technology for Preservation return on investment) //gate.ac.uk

What’s the difference between Tony Blair and Mother Theresa?
(Human Language Technology for Preservation return on investment)

http://gate.ac.uk/ http://nlp.shef.ac.uk/

Hamish Cunningham
Dept. Computer Science, University of Sheffield

Alghero, March 2004

Page 2

20th Century Rot

• 20th Century audio-visual media is rapidly disappearing
• Preservation and restoration are high cost
• The costs must be justified by increased access
• “Metadata”: descriptive information about content
• Therefore the rest of the talk will cover:
  – rich metadata and semantic access
  – cross-lingual access
  – syndicated delivery
  – repurposable content

Page 3

IT context: the Knowledge Economy and Human Language

Gartner, December 2002:
• taxonomic and hierarchical knowledge mapping and indexing will be prevalent in almost all information-rich applications
• through 2012 more than 95% of human-to-computer information input will involve textual language

A contradiction:
• to deal with the information deluge we need formal knowledge in semantics-based systems
• our archived history is in informal and ambiguous natural language

The challenge: to reconcile these two phenomena

Page 4

HLT: Closing the Loop

[Diagram: Human Language and Formal Knowledge (ontologies and instance bases), linked by (A)IE, CLIE, OIE, (M)NLG and Controlled Language. Further label: Semantic Web; Semantic Grid; Semantic Web Services.]

KEY
MNLG: Multilingual Natural Language Generation
OIE: Ontology-aware Information Extraction
AIE: Adaptive IE
CLIE: Controlled Language IE

Page 5

Information Extraction

• Information Extraction (IE) pulls facts and structured information from the content of large text collections.
• Contrast IE and Information Retrieval
• NLP history: from NLU to IE
• Progress driven by quantitative measures
• MUC: Message Understanding Conferences
• ACE: Automatic Content Extraction

Page 6

IE Example

The shiny red rocket was fired on Tuesday. It is the brainchild of Dr. Big Head. Dr. Head is a staff scientist at We Build Rockets Inc.

• Named entities (NE): "rocket", "Tuesday", "Dr. Head" and "We Build Rockets"
• Co-reference resolution (CO): "it" refers to the rocket; "Dr. Head" and "Dr. Big Head" are the same
• Template Elements (TE): the rocket is "shiny red" and Head's "brainchild".
• Template Relations (TR): Dr. Head works for We Build Rockets Inc.
• Scenario Templates (ST): a rocket launching event occurred with the various participants.
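The first two levels above (NE and CO) can be sketched on the rocket text with a few hand-written rules. This is only a toy illustration: the pattern table, the type names and the `same_person` rule are hypothetical and are not taken from GATE or from any MUC system.

```python
import re

TEXT = ("The shiny red rocket was fired on Tuesday. "
        "It is the brainchild of Dr. Big Head. "
        "Dr. Head is a staff scientist at We Build Rockets Inc.")

# Hypothetical hand-written patterns, one per entity type.
NE_PATTERNS = {
    "Artifact": r"\brocket\b",
    "Date": r"\b(?:Mon|Tues|Wednes|Thurs|Fri|Satur|Sun)day\b",
    "Person": r"\bDr\. [A-Z][a-z]+(?: [A-Z][a-z]+)*",
    "Organization": r"\b(?:[A-Z][a-z]+ )+Inc\.",
}


def find_entities(text):
    """Return (start, end, type, string) tuples sorted by start offset."""
    found = []
    for ne_type, pattern in NE_PATTERNS.items():
        for match in re.finditer(pattern, text):
            found.append((match.start(), match.end(), ne_type, match.group()))
    return sorted(found)


def same_person(name_a, name_b):
    """Toy co-reference rule: same title and same final (surname) token."""
    tokens_a, tokens_b = name_a.split(), name_b.split()
    return tokens_a[0] == tokens_b[0] and tokens_a[-1] == tokens_b[-1]


entities = find_entities(TEXT)
people = [string for _, _, ne_type, string in entities if ne_type == "Person"]

for start, end, ne_type, string in entities:
    print(f"{start:3d}-{end:3d}  {ne_type:12s}  {string}")
print("co-referent:", same_person(people[0], people[1]))
```

Rules this simple break quickly on real text, which is one reason the higher levels (TE, TR, ST) are harder to get right.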

Page 7

Performance levels

(Extensive quantitative evaluation since early ’90s; mainly on text, ASR; now also video OCR)

• Vary according to text type, domain, scenario, language
• NE: up to 97% (tested in English, Spanish, Japanese, Chinese, others)
• CO: 60-70% resolution
• TE: 80%
• TR: 75-80%
• ST: 60% (but: human level may be only 80%)
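The slide does not say which measure lies behind these percentages. MUC-style evaluations conventionally report precision, recall and their harmonic mean (F-measure), so the sketch below assumes that convention. The gold and predicted annotations are invented for illustration and are not results from any evaluation.

```python
def precision_recall_f1(predicted, gold):
    """Score predicted annotations against a gold standard by exact match."""
    predicted, gold = set(predicted), set(gold)
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Invented (start, end, type) annotations for one document.
gold = {(14, 20, "Artifact"), (34, 41, "Date"), (67, 79, "Person")}
predicted = {
    (14, 20, "Artifact"),   # correct
    (34, 41, "Date"),       # correct
    (67, 70, "Person"),     # wrong span, counts as an error
    (0, 3, "Person"),       # spurious
}

p, r, f = precision_recall_f1(predicted, gold)
print(f"precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```

Exact-match scoring is the strictest choice; real evaluations often also give partial credit for overlapping spans.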

Page 8

Ontology-based IE

XYZ was established on 03 November 1978 in London. It opened a plant in Bulgaria in …

[Diagram: Ontology & KB. Classes: Company, Location, City, Country. Instances and values: XYZ, London, UK, Bulgaria, “03/11/1978”. Relations shown: type, HQ, establOn, partOf.]
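One way to picture the result of ontology-based IE on the XYZ sentence is as a set of subject-predicate-object triples. The sketch below is one reading of the diagram labels, not a reproduction of it: the exact edges, and the `subClassOf` links from City and Country to Location, are assumptions.

```python
# Triples assumed from the diagram labels (type, HQ, establOn, partOf).
TRIPLES = {
    ("City", "subClassOf", "Location"),      # assumed class hierarchy
    ("Country", "subClassOf", "Location"),   # assumed class hierarchy
    ("XYZ", "type", "Company"),
    ("XYZ", "establOn", "03/11/1978"),
    ("XYZ", "HQ", "London"),
    ("London", "type", "City"),
    ("London", "partOf", "UK"),
    ("UK", "type", "Country"),
    ("Bulgaria", "type", "Country"),
}


def objects(subject, predicate):
    """All objects o such that (subject, predicate, o) is in the store."""
    return {o for s, p, o in TRIPLES if s == subject and p == predicate}


def containing_places(place):
    """Follow partOf links transitively from a place."""
    seen, frontier = set(), [place]
    while frontier:
        for parent in objects(frontier.pop(), "partOf"):
            if parent not in seen:
                seen.add(parent)
                frontier.append(parent)
    return seen


def hq_countries(company):
    """Countries that contain the company's headquarters city."""
    return {
        region
        for city in objects(company, "HQ")
        for region in containing_places(city)
        if "Country" in objects(region, "type")
    }


print(hq_countries("XYZ"))
```

The text never says that XYZ is based in the UK; that answer only becomes available once the extracted facts are joined with the knowledge base.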

Page 9

Classes, instances & metadata

“Gordon Brown met George Bush during his two day visit.”

[Diagram: classes and instances before and after extraction. Labels: Entity, Person, Job-title, president, chancellor, minister, G.Brown, Bush.]

<metadata>
  <DOC-ID>http://… 1.html</DOC-ID>
  <Annotation>
    <s_offset> 0 </s_offset>
    <e_offset> 12 </e_offset>
    <string>Gordon Brown</string>
    <class>…#Person</class>
    <inst>…#Person12345</inst>
  </Annotation>
  <Annotation>
    <s_offset> 18 </s_offset>
    <e_offset> 32 </e_offset>
    <string>George Bush</string>
    <class>…#Person</class>
    <inst>…#Person67890</inst>
  </Annotation>
</metadata>
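The stand-off metadata shown on this slide can be generated mechanically once mentions have been linked to instances. The sketch below is a minimal illustration using Python's standard `xml.etree.ElementTree`. The namespace and document URL are hypothetical stand-ins for the elided "…" parts of the slide, and the offsets are computed from the sentence rather than copied from the slide.

```python
import xml.etree.ElementTree as ET

TEXT = "Gordon Brown met George Bush during his two day visit."
ONTO = "http://example.org/onto"          # hypothetical ontology namespace
DOC_ID = "http://example.org/docs/1.html"  # hypothetical document URL


def annotate(text, mentions, doc_id):
    """Build a <metadata> element with one <Annotation> per found mention."""
    root = ET.Element("metadata")
    ET.SubElement(root, "DOC-ID").text = doc_id
    for string, cls, inst in mentions:
        start = text.find(string)
        if start < 0:
            continue  # mention not present in this text
        ann = ET.SubElement(root, "Annotation")
        ET.SubElement(ann, "s_offset").text = str(start)
        ET.SubElement(ann, "e_offset").text = str(start + len(string))
        ET.SubElement(ann, "string").text = string
        ET.SubElement(ann, "class").text = f"{ONTO}#{cls}"
        ET.SubElement(ann, "inst").text = f"{ONTO}#{inst}"
    return root


def offsets_match(text, root):
    """Check that every annotation's offsets select its recorded string."""
    return all(
        text[int(a.findtext("s_offset")):int(a.findtext("e_offset"))]
        == a.findtext("string")
        for a in root.iter("Annotation")
    )


metadata = annotate(
    TEXT,
    [("Gordon Brown", "Person", "Person12345"),
     ("George Bush", "Person", "Person67890")],
    DOC_ID,
)
print(ET.tostring(metadata, encoding="unicode"))
print("offsets consistent:", offsets_match(TEXT, metadata))
```

Keeping the annotations separate from the text (stand-off) means the original document is never modified, which matters for archival material.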

Page 10

An example: the MUMIS project

• Multimedia Indexing and Searching Environment
• Composite index of a multimedia programme from multiple sources in different languages
• ASR, video processing, Information Extraction (Dutch, English, German), merging, user interface
• University of Twente/CTIT, University of Sheffield, University of Nijmegen, DFKI, MPI, ESTEAM AB, VDA
• An important experimental result: multiple sources for the same events can improve extraction quality
  – PrestoSpace applications in news and sports archiving
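The intuition behind the experimental result above, that several sources describing the same event can correct one another, can be shown with a field-by-field majority vote. The slides do not describe how MUMIS actually merges extractions, so this is a stand-in technique, and the source names and event records are invented.

```python
from collections import Counter

# Invented extraction results for one event, from three hypothetical sources.
source_events = [
    {"source": "en_report", "minute": 53, "type": "goal", "player": "Beckham"},
    {"source": "de_ticker", "minute": 53, "type": "goal", "player": "Beckham"},
    {"source": "nl_asr", "minute": 53, "type": "goal", "player": "Beckam"},  # ASR error
]


def merge_events(events, ignore=("source",)):
    """Merge records of the same event by majority vote on each field."""
    fields = sorted({f for e in events for f in e if f not in ignore})
    merged = {}
    for field in fields:
        votes = Counter(e[field] for e in events if field in e)
        merged[field] = votes.most_common(1)[0][0]
    return merged


merged = merge_events(source_events)
print(merged)
```

Here the misspelled player name from the speech-recognition source is outvoted by the two text sources.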

Page 11

Semantic Query

Not “goal Beckham” (includes e.g. missed goals, or “this was not a goal”)

Instead: “goal events with scorer David Beckham”
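The contrast between the two queries can be made concrete with a toy index. The segments and their event annotations below are invented, and the field names (`type`, `scorer`) are hypothetical, not the MUMIS schema.

```python
# Invented video segments: transcript text plus an extracted event record.
segments = [
    {"text": "Beckham scores a superb goal from the free kick",
     "event": {"type": "goal", "scorer": "David Beckham"}},
    {"text": "Beckham shoots but this was not a goal",
     "event": {"type": "miss", "player": "David Beckham"}},
    {"text": "Owen heads in a goal after a cross from Beckham",
     "event": {"type": "goal", "scorer": "Michael Owen"}},
]


def keyword_search(query):
    """Indices of segments whose text contains every query word."""
    terms = query.lower().split()
    return [
        i for i, seg in enumerate(segments)
        if all(term in seg["text"].lower().split() for term in terms)
    ]


def semantic_search(**constraints):
    """Indices of segments whose event record matches every constraint."""
    return [
        i for i, seg in enumerate(segments)
        if all(seg["event"].get(k) == v for k, v in constraints.items())
    ]


print(keyword_search("goal Beckham"))
print(semantic_search(type="goal", scorer="David Beckham"))
```

The keyword query returns all three segments, including the miss and the goal by another player; the structured query returns only the segment in which Beckham is the scorer.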

Page 12

The results: England win!

Page 13

PSpace: good news and bad news

• The good news: PrestoSpace has some of the world leaders on AI and metadata
• The bad news: AI always fails
• How does the machine tell the difference between “Mother Theresa is a saint” and “Tony Blair is a saint”? (Or, who tells Google which statement is important?)
• Other web users do, by linking (also cf. Amazon)
• Two solutions to the AI problem:
  – allow archivists and users to build their own (simple specific models can succeed, but the cost may be too high)
  – use recommender systems to make the user an archivist’s assistant (researchers and students may barter for access)
• Any route to searchable content!
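The remark that other web users signal importance by linking can be illustrated with the crudest possible version of the idea: counting how many distinct pages link to each item. This is plain in-link counting, not Google's actual ranking algorithm, and the link graph is invented.

```python
from collections import Counter

# Invented link graph: each user page lists the items it links to.
links = {
    "user_page_1": ["statement_A"],
    "user_page_2": ["statement_A", "statement_B"],
    "user_page_3": ["statement_A", "statement_A"],  # duplicate link, counted once
}


def rank_by_inlinks(link_graph):
    """Rank items by the number of distinct pages linking to them."""
    counts = Counter(
        target
        for targets in link_graph.values()
        for target in set(targets)
    )
    return counts.most_common()


ranking = rank_by_inlinks(links)
print(ranking)
```

The machine never decides which statement matters; it only aggregates choices that people have already made, which is the same principle a recommender system applies to archive users.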

Page 14

Syndication and Merging

• The web promotes diversity, but also fragmentation
• Original web: separate content and presentation (“this is a header”, not “set in 20 point bold font”)
• Now: many incompatible/inaccessible interfaces
• Archives need to:
  – pool their impact: syndication in networked communities
  – support repurposable content
• Therefore data must be presentation independent
• Candidate technologies: XML, RSS, RDF, OWL (“semantic web”)
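Presentation independence means that one record can be rendered in several ways without touching the record itself. The sketch below renders an invented archive record both as a minimal RSS 2.0 feed and as plain text, using Python's standard `xml.etree.ElementTree`. The archive name, URLs and record fields are hypothetical.

```python
import xml.etree.ElementTree as ET

# Invented, presentation-independent archive record.
record = {
    "title": "Football highlights, reel 12",
    "link": "http://archive.example.org/items/12",
    "description": "Restored broadcast footage with extracted goal events.",
}


def to_rss(channel_title, channel_link, channel_description, records):
    """Render records as a minimal RSS 2.0 document (returned as a string)."""
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = channel_title
    ET.SubElement(channel, "link").text = channel_link
    ET.SubElement(channel, "description").text = channel_description
    for rec in records:
        item = ET.SubElement(channel, "item")
        for tag in ("title", "link", "description"):
            ET.SubElement(item, tag).text = rec[tag]
    return ET.tostring(rss, encoding="unicode")


def to_plain_text(records):
    """Render the same records as a plain-text listing."""
    return "\n".join(f"{r['title']} <{r['link']}>: {r['description']}" for r in records)


feed = to_rss(
    "Example Archive: new items",
    "http://archive.example.org/",
    "Recently catalogued material",
    [record],
)
print(feed)
print(to_plain_text([record]))
```

Because the feed is generated from the record rather than from a web page layout, other sites can aggregate it alongside feeds from other archives.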

Page 15

GATE, a General Architecture for Text Engineering is...

• An architecture: a macro-level organisational picture for LE (language engineering) software systems.
• A framework: for programmers, GATE is an object-oriented class library that implements the architecture.
• A development environment: for language engineers, a graphical development environment.

GATE comes with...

• Free components, and wrappers for other people's components
• Tools for: evaluation; visualise/edit; persistence; IR; IE; dialogue; ontologies; etc.
• Free software (LGPL) at http://gate.ac.uk/download/
• Used by thousands of people at hundreds of sites

Page 16

A bit of a nuisance (GATE users)

GATE team projects.

Past:
• Conceptual indexing: MUMIS: automatic semantic indices for sports video
• MUSE, cross-genre entity finder
• HSL, Health-and-safety IE
• Old Bailey: collaboration with HRI on 17th century court reports
• Multiflora: plant taxonomy text analysis for biodiversity research e-science

Present:
• Advanced Knowledge Technologies: €12m UK five site collaborative project
• EMILLE: S. Asian languages corpus
• ACE / TIDES: Arabic, Chinese NE
• JHU summer workshop on semantic tagging

Future:
• Five new projects inc. PrestoSpace

Thousands of users at hundreds of sites. A representative sample:
• the American National Corpus project
• the Perseus Digital Library project, Tufts University, US
• Longman Pearson publishing, UK
• Merck KGaA, Germany
• Canon Europe, UK
• Knight Ridder, US
• BBN (leading HLT research lab), US
• SMEs inc. Sirma AI Ltd., Bulgaria
• Imperial College, London, the University of Manchester, UMIST, the University of Karlsruhe, Vassar College, the University of Southern California and a large number of other UK, US and EU Universities
• UK and EU projects inc. MyGrid, CLEF, dotkom, AMITIES, Cub Reporter, EMILLE, Poesia...

Page 17

GATE – infrastructure for semantic metadata extraction

• Combines learning and rule-based methods (new work on mixed-initiative learning)
• Allows combination of IE and IR
• Enables use of large-scale linguistic resources for IE, such as WordNet
• Supports ontologies as part of IE applications - Ontology-Based IE
• Supports languages from Hindi to Chinese, Italian to German

Page 18

(Not the) MAD Semantics Architecture

[Diagram. Labels: AV Signals; ASR, etc.; Signal md, Transcriptions; Sources (stacks of Formal Text, per language: EN, IT, ...); IE (three instances); Annotations; Merging; Final Annotations; Multilingual Conceptual Q & A; Ontology-Based Metadata.]

Page 19

Archiving is not a luxury

• C21st: all the C20th mistakes but bigger & better?
• If you don’t know where you’ve been, how can you know where you’re going?
• Archives: ammunition in the war on ignorance
• Ammunition is useless if you can’t find it: new technology must make our history accessible to all, for all our futures

More information: http://gate.ac.uk/ http://www.prestospace.org/