The Knowledge Acquisition Bottleneck Revisited:

How can we build large KBs?

Illustrations of different approaches

Peter Clark and John Thompson

Boeing Research

2004

Premise

• Intelligent machines need lots of knowledge, for
  – question-answering
  – intelligent search
  – information integration
  – natural language understanding
  – decision support
  – modeling
  – etc.
• Much of this knowledge can be drawn from some general repository of reusable knowledge
  – e.g., WordNet
• How does one build such a repository?

“No-one considers hand-building a large KB to be a realistic proposition these days” [paraphrase of Daphne Koller, 2004]

1. Build it by Hand

• “Let’s roll up our sleeves and get on with it!”
• But: It’s a daunting task
  – Our own work
• Cyc
  + Lots in it, (relatively) well designed ontology
  - 650 person-years effort so far
  - Still patchy coverage (why?)
  - Difficult to use outside Cycorp

1. Build it by Hand (cont)

- WordNet
  + Easy to use
  + Comprehensive
  - Little inference-supporting knowledge in it
  - Ad hoc ontology

1. Build it by Hand (cont)

• The Component Library
  Claim: can bound the required knowledge by working at a coarse-grained level
  + Large, more doable
  - Hard to use, still very incomplete

2. Extract from Dictionaries

- MindNet
  + Automatically built
  - Unusable?
- Extended WordNet
  + Won TREC competition
  - Still somewhat incoherent
  - Lot of manual labor

3. Corpus-based Text/Web Mining

- Schubert’s system
  + Automatic
  + Lots of knowledge
  - Noisy
  - No word senses
  - Only grabs certain kinds of knowledge

30M entries…

3. Corpus-based Text/Web Mining (cont)

- KnowIt (Etzioni)
  + Automatic
  - Only factoids

4. Community-Based Acquisition

• Knowledge entry by the masses
• OpenMind
  + Large
  - Full of junk, unusable (?)
  - Would this work with better acquisition tools?

(see next slide for illustration)

5. Use Existing Resources

• e.g.,
  – databases
  – CIA World Fact Book
  – Web data/services
• e.g., SRI/ISI’s ARDA QA system
  + Syntactically simple
  + Available
  - Largely limited to factoids
  - Information integration is a major challenge
    - different ontologies, contradictory data

Where to?

• Can we bound the knowledge needed
  – for a particular application
  – for a useful, sharable, general resource?
• Which of these approaches seems most realistic?
  – build by hand
  – extract from dictionaries
  – mine text corpora
  – community knowledge entry
  – use existing resources
