Querying Text Databases for Efficient Information Extraction
Eugene Agichtein, Luis Gravano
Columbia University
Extracting Structured Information “Buried” in Text Documents
Apple's programmers "think different" on a "campus" in Cupertino, Cal. Nike employees "just do it" at what the company refers to as its "World Campus," near Portland, Ore. Microsoft's central headquarters in Redmond is home to almost every product group and division.
Organization      Location
Microsoft         Redmond
Apple Computer    Cupertino
Nike              Portland
Brent Barlow, 27, a software analyst and beta-tester at Apple Computer’s headquarters in Cupertino, was fired Monday for "thinking a little too different."
Information Extraction Applications
• Over a corporation’s customer report or email complaint database: enabling sophisticated querying and analysis
• Over biomedical literature: identifying drug/condition interactions
• Over newspaper archives: tracking disease outbreaks, terrorist attacks; intelligence
Significant progress over the last decade [MUC]
Information Extraction Example: Organizations’ Headquarters
[Figure: the information extraction pipeline. Input: Documents -> Named-Entity Tagging -> Pattern Matching -> Output: Tuples]

Example (doc4):
<PERSON>Brent Barlow</PERSON>, a software analyst and beta-tester at <ORGANIZATION>Apple Computer</ORGANIZATION>'s headquarters in <LOCATION>Cupertino</LOCATION>, was fired Monday for "thinking a little too different."

Extraction patterns:
  p1: <ORGANIZATION>'s headquarters in <LOCATION>
  p2: <ORGANIZATION>, based in <LOCATION>

Extracted tuples (W = weight):
  tid  Organization    Location   W    Source
  1    Apple Computer  Cupertino  0.9  doc4 (pattern p1)
  2    Eastman Kodak   Rochester  0.8  doc2
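To make the pattern-matching step concrete, here is a minimal sketch (not the authors' code) that applies patterns like p1 and p2 to entity-tagged text using regular expressions; the tag format follows the example above.

```python
import re

# Entity-tagged input, as produced by the named-entity tagging step above.
tagged = ("<PERSON>Brent Barlow</PERSON>, a software analyst and beta-tester at "
          "<ORGANIZATION>Apple Computer</ORGANIZATION>'s headquarters in "
          "<LOCATION>Cupertino</LOCATION>, was fired Monday.")

ORG = r"<ORGANIZATION>(?P<org>.+?)</ORGANIZATION>"
LOC = r"<LOCATION>(?P<loc>.+?)</LOCATION>"

# Patterns p1 and p2 from the figure, rendered as regular expressions.
PATTERNS = [
    re.compile(ORG + r"'s headquarters in " + LOC),  # p1
    re.compile(ORG + r", based in " + LOC),          # p2
]

def extract(text):
    """Return the (Organization, Location) tuples matched by any pattern."""
    return [(m.group("org"), m.group("loc"))
            for p in PATTERNS for m in p.finditer(text)]

print(extract(tagged))  # [('Apple Computer', 'Cupertino')]
```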
Goal: Extract All Tuples of a Relation from a Document Database
[Figure: Text Database -> Information Extraction System -> Extracted Tuples]

• One approach: feed every document to the information extraction system
• Problem: efficiency!
Information Extraction is Expensive
• Efficiency is a problem even after training the information extraction system
  – Example: NYU's Proteus extraction system takes around 9 seconds per document
  – Over 15 days to process 135,000 news articles
• “Filtering” before further processing a document might help
• Can't afford to "scan the web" to process each page!
• "Hidden-Web" databases don't allow crawling
Information Extraction Without Processing All Documents
• Observation: Often only a small fraction of the database is relevant for an extraction task
• Our approach: Exploit database search engine to retrieve and process only “promising” documents
Architecture of our QXtract System

[Figure: User-Provided Seed Tuples -> Query Generation -> Queries -> Search Engine over the Text Database -> Promising Documents -> Information Extraction -> Extracted Relation]

User-provided seed tuples:
  Microsoft  Redmond
  Apple      Cupertino

Extracted relation:
  Microsoft  Redmond
  Apple      Cupertino
  Exxon      Irving
  IBM        Armonk
  Intel      Santa Clara
Key problem: Learn queries to retrieve “promising” documents
Generating Queries to Retrieve Promising Documents
1. Get a document sample with "likely negative" and "likely positive" examples.
2. Label sample documents using the information extraction system as an "oracle."
3. Train classifiers to "recognize" useful documents.
4. Generate queries from the classifier model/rules.
[Figure: overview of the four steps. Seed Sampling uses the user-provided seed tuples and the search engine to retrieve an unlabeled document sample from the text database; Information Extraction labels sample documents as positive or negative; Classifier Training learns models over the labeled sample; Query Generation turns the models into queries.]
Getting a Training Document Sample

Step 1: Get a document sample with "likely negative" and "likely positive" examples, as sketched below.

[Figure: the user-provided seed tuples yield queries such as "Microsoft AND Redmond" and "Apple AND Cupertino"; together with "random" queries, these are sent to the search engine over the text database to retrieve an unlabeled document sample.]
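A minimal sketch of this sampling step under toy assumptions: TinySearchEngine is an invented in-memory stand-in for the database's boolean search interface, and the "random" queries here are simply words drawn from the corpus vocabulary.

```python
import random

class TinySearchEngine:
    """Toy in-memory stand-in for the database's boolean search interface."""
    def __init__(self, docs):
        self.docs = docs
    def search(self, query, limit=100):
        terms = [t.lower() for t in query.split(" AND ")]
        return [d for d in self.docs if all(t in d.lower() for t in terms)][:limit]

def get_training_sample(seed_tuples, engine, n_random_queries=2):
    """Step 1: retrieve 'likely positive' documents (matching a seed tuple)
    and 'likely negative' documents (matching 'random' one-word queries)."""
    sample = set()
    for tup in seed_tuples:                    # e.g., "Microsoft AND Redmond"
        sample.update(engine.search(" AND ".join(tup)))
    vocabulary = {w for d in engine.docs for w in d.split()}
    for word in random.sample(sorted(vocabulary), n_random_queries):
        sample.update(engine.search(word))
    return sample

docs = ["Microsoft is based in Redmond, near Seattle.",
        "Apple Computer's headquarters in Cupertino grew.",
        "The sponsored homerun event drew a crowd."]
engine = TinySearchEngine(docs)
print(get_training_sample([("Microsoft", "Redmond"), ("Apple", "Cupertino")], engine))
```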
Labeling the Training Document Sample

Step 2: Use the information extraction system as an "oracle" to label sample documents as "true positive" and "true negative."

[Figure: the information extraction system processes the unlabeled sample; documents that yield tuples (e.g., Microsoft/Redmond, Apple/Cupertino, IBM/Armonk) become positive examples, the rest negative.]
Training Classifiers to Recognize "Useful" Documents

Step 3: Train classifiers to "recognize" useful documents. Document features: words.

Example labeled documents (word features):
  +  is based in near city
  +  spokesperson reported news earnings release
  -  products made used exported far
  -  past old homerun sponsored event

Three classifiers are trained:
  Ripper:     rules, e.g., based AND near => Useful
  SVM:        term weights, e.g., based: 3, spokesperson: 2, sponsored: -1
  Okapi (IR): ranked terms, e.g., is, based, near, spokesperson, earnings, sponsored, event, far, homerun
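A minimal sketch of this step using scikit-learn: the slide's Ripper, SVM, and Okapi classifiers are not reproduced here, only a linear SVM over bag-of-words features, with the slide's example word lists as toy documents.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

# Toy labeled sample: 1 = the oracle extracted a tuple, 0 = it did not.
# The word lists mirror the examples on this slide.
docs = ["is based in near city",
        "spokesperson reported news earnings release",
        "products made used exported far",
        "past old homerun sponsored event"]
labels = [1, 1, 0, 0]

# Bag-of-words document features, as on the slide.
vectorizer = CountVectorizer(binary=True)
X = vectorizer.fit_transform(docs)

# Linear SVM: each word gets a weight; positive weights mark words
# that suggest a "useful" document.
svm = LinearSVC().fit(X, labels)
for word, weight in zip(vectorizer.get_feature_names_out(), svm.coef_[0]):
    print(f"{word}: {weight:+.2f}")
```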
Generating Queries from Classifiers

Step 4: Generate queries from the classifier models/rules (see the sketch below).

  Ripper (based AND near => Useful)              -> query: based AND near
  SVM (based: 3, spokesperson: 2, sponsored: -1) -> queries: based; spokesperson
  Okapi (IR) (top-ranked terms)                  -> queries: spokesperson; earnings
  QCombined: the union of the queries from all classifiers
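Continuing the training sketch from the previous step, a hedged illustration of turning a linear model's weights into single-word queries (Ripper rules and Okapi term rankings would translate analogously):

```python
import numpy as np

def queries_from_svm(vectorizer, svm, k=2):
    """Turn a trained linear model into single-word queries by taking
    the k terms with the largest positive weights."""
    weights = svm.coef_[0]
    terms = vectorizer.get_feature_names_out()
    top = np.argsort(weights)[::-1][:k]
    return [terms[i] for i in top if weights[i] > 0]

# Reuses vectorizer and svm from the training sketch above.
print(queries_from_svm(vectorizer, svm))  # e.g., ['based', 'spokesperson']
```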
Architecture of our QXtract System

[Figure, repeated from above: User-Provided Seed Tuples -> Query Generation -> Queries -> Search Engine over the Text Database -> Promising Documents -> Information Extraction -> Extracted Relation; seed tuples Microsoft/Redmond and Apple/Cupertino yield the extracted relation Microsoft/Redmond, Apple/Cupertino, Exxon/Irving, IBM/Armonk, Intel/Santa Clara.]
Experimental Evaluation: Data
• Training set: 1996 New York Times archive of 137,000 newspaper articles
  – Used to tune QXtract parameters
• Test set: 1995 New York Times archive of 135,000 newspaper articles
Final Configuration of QXtract, from Training
Experimental Evaluation: Information Extraction Systems and Associated Relations

• DIPRE [Brin 1998]: Headquarters(Organization, Location)
• Snowball [Agichtein and Gravano 2000]: Headquarters(Organization, Location)
• Proteus [Grishman et al. 2002]: DiseaseOutbreaks(DiseaseName, Location, Country, Date, …)
Experimental Evaluation: Seed Tuples
Headquarters:
  Organization  Location
  Microsoft     Redmond
  Exxon         Irving
  Boeing        Seattle
  IBM           Armonk
  Intel         Santa Clara

DiseaseOutbreaks:
  DiseaseName      Location
  Malaria          Ethiopia
  Typhus           Bergen-Belsen
  Flu              The Midwest
  Mad Cow Disease  The U.K.
  Pneumonia        The U.S.
Experimental Evaluation: Metrics
• Gold standard: relation Rall, obtained by running the information extraction system over every document in the database Dall
• Recall: % of Rall captured in the approximation extracted from the retrieved documents
• Precision: % of retrieved documents that are "useful" (i.e., produced tuples)
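In symbols (my rendering of the definitions above, writing D_r for the retrieved documents and R_r for the tuples extracted from them):

```latex
\mathrm{recall} = \frac{|R_r \cap R_{all}|}{|R_{all}|}
\qquad
\mathrm{precision} = \frac{|\{\, d \in D_r : d \text{ produced a tuple} \,\}|}{|D_r|}
```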
Experimental Evaluation: Relation Statistics
Relation and Extraction System   |Dall|    % Useful   |Rall|
Headquarters: Snowball           135,000   23         24,536
Headquarters: DIPRE              135,000   22         20,952
DiseaseOutbreaks: Proteus        135,000    4          8,859
Alternative Query Generation Strategies
• QXtract, with the final configuration from training
• Tuples: keep deriving queries from extracted tuples (see the sketch below)
  – Problem: "disconnected" databases
• Patterns: derive queries from the extraction patterns of the information extraction system
  – "<ORGANIZATION>, based in <LOCATION>" => "based in"
  – Problems: pattern features often not suitable for querying, or not visible from a "black-box" extraction system
• Manual: construct queries manually [MUC]
  – Obtained for Proteus from its developers
  – Not available for DIPRE and Snowball

Plus a simple additional baseline: retrieve a random document sample of the appropriate size.
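A minimal sketch of the Tuples strategy; engine and extract are assumed interfaces (e.g., the toy search engine from the sampling sketch and any extraction routine), not part of the slides.

```python
def tuples_strategy(seed_tuples, engine, extract, max_docs=1000):
    """Keep deriving queries from newly extracted tuples. On a
    'disconnected' database the frontier empties early and coverage
    stalls, which is the problem noted above."""
    table = set(seed_tuples)
    frontier = list(seed_tuples)
    seen = set()
    while frontier and len(seen) < max_docs:
        tup = frontier.pop()
        for doc in engine.search(" AND ".join(tup)):  # tuple -> query
            if doc in seen:
                continue
            seen.add(doc)
            for new_tup in extract(doc):              # document -> new tuples
                if new_tup not in table:
                    table.add(new_tup)
                    frontier.append(new_tup)          # new tuple -> new query
    return table
```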
Recall and Precision: Headquarters Relation; Snowball Extraction System

[Figure: (a) recall (%) and (b) precision (%) vs. MaxFractionRetrieved (5%-25% of |Dall|) for QXtract, Patterns, Tuples, and Baseline]
Recall and Precision: Headquarters Relation; DIPRE Extraction System

[Figure: (a) recall (%) and (b) precision (%) vs. MaxFractionRetrieved (5%-25% of |Dall|) for QXtract, Patterns, Tuples, and Baseline]
Extraction Efficiency and Recall: DiseaseOutbreaks Relation; Proteus Extraction System

[Figure: recall (%) vs. MaxFractionRetrieved (5%, 10%, 25%) for QXtract, Manual, Tuples, and Baseline; running time: 1.4 days for QXtract at 10% vs. 15.5 days for a full Scan at 100%]

60% of the relation extracted from just 10% of the documents of the 135,000-newspaper-article database.
Snowball/Headquarters Queries
DIPRE/Headquarters Queries
Proteus/DiseaseOutbreaks Queries
Current Work: Characterizing Databases for an Extraction Task

[Figure: a decision tree over the text database and its search interface]
  Sparse?
    no  -> Scan
    yes -> Connected?
             yes -> Tuples
             no  -> QXtract
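Read as a decision rule, the tree amounts to the following sketch (my reading of the figure, not released code):

```python
def choose_strategy(sparse: bool, connected: bool) -> str:
    """Pick a document-retrieval strategy for an extraction task."""
    if not sparse:       # most documents are useful: scanning is not wasteful
        return "Scan"
    if connected:        # extracted tuples lead, via queries, to new useful docs
        return "Tuples"
    return "QXtract"     # sparse and disconnected: learn queries from a sample
```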
Related Work
• Information Extraction: focus on quality of extracted relations [MUC]; most relevant sub-task: text filtering
  – Filters derived from extraction patterns, or consisting of words (manually created or from supervised learning)
  – Grishman et al.'s manual pattern-based filters for disease outbreaks
  – Related to the Manual and Patterns strategies in our experiments
  – Focus not on querying using a simple search interface
• Information Retrieval: focus on relevant documents for queries
  – In our scenario, relevance is determined by the "extraction task" and the associated information extraction system
• Automatic Query Generation: several efforts for different tasks:
  – Minority-language corpora construction [Ghani et al. 2001]
  – Topic-specific document search (e.g., [Cohen & Singer 1996])
Contributions: An Unsupervised Query-Based Technique for Efficient Information Extraction

• Adapts to "arbitrary" underlying information extraction systems and document databases
• Can work over non-crawlable "Hidden-Web" databases
• Minimal user input required: a handful of example tuples
• Can trade off relation completeness and extraction efficiency
• Particularly interesting in conjunction with unsupervised/bootstrapping-based information extraction systems (e.g., DIPRE, Snowball)
Questions?
Overflow Slides
Related Work (II)
• Focused Crawling (e.g., [Chakrabarti et al. 2002]): uses link and page classification to crawl pages on a topic
• Hidden-Web Crawling [Raghavan & Garcia-Molina 2001]: retrieves pages from non-crawlable Hidden-Web databases
  – Needs a rich query interface with distinguishable attributes
  – Related to the Tuples strategy, but "tuples" derived from pull-down menus, etc., of search interfaces as found
  – Our goal: retrieve as few documents as possible from one database to extract the relation
• Question-Answering Systems
Related Work (III)
• [Mitchell, Riloff, et al. 1998] use "linguistic phrases" derived from information extraction patterns as features for text categorization
  – Related to the Patterns strategy; requires document parsing, so can't directly generate simple queries
• [Gaizauskas & Robertson 1997] use 9 manually generated keywords to search for documents relevant to a MUC extraction task
Recall and Precision: DiseaseOutbreaks Relation; Proteus Extraction System

[Figure: (a) recall (%) and (b) precision (%) vs. MaxFractionRetrieved (5%-25% of |Dall|) for QXtract, Manual, Manual+QXtract, Tuples, and Baseline]
Running Times

[Figure: running time vs. MaxFractionRetrieved (5%, 10%, 100% of |Dall|) for the three extraction systems: Snowball and DIPRE in minutes (FullScan, QuickScan, and QXtract, with extraction and training time broken out) and Proteus in days (FullScan vs. QXtract)]
Extracting Relations from Text: Snowball
• Exploit redundancy on the web to focus on "easy" instances
• Require only minimal training (a handful of seed tuples)

[Figure: the Snowball bootstrapping loop. Initial Seed Tuples -> Occurrences of Seed Tuples -> Tag Entities -> Generate Extraction Patterns -> Generate New Seed Tuples -> Augment Table]

  ORGANIZATION  LOCATION
  MICROSOFT     REDMOND
  IBM           ARMONK
  BOEING        SEATTLE
  INTEL         SANTA CLARA

[ACM DL'00]
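A minimal skeleton of the loop in the figure; tag_entities, gen_patterns, and match are placeholders for Snowball's actual components (including its pattern and tuple confidence evaluation, described in the ACM DL'00 paper):

```python
def snowball(seed_tuples, corpus, tag_entities, gen_patterns, match, n_iter=3):
    """Illustrative skeleton of the Snowball bootstrapping loop:
    seed tuples -> occurrences -> patterns -> new seeds -> augmented table."""
    table = set(seed_tuples)
    tagged = [tag_entities(doc) for doc in corpus]      # Tag Entities
    for _ in range(n_iter):
        # Find occurrences of known tuples in the tagged documents.
        occurrences = [(doc, tup) for doc in tagged for tup in table
                       if all(attr in doc for attr in tup)]
        # Generalize the contexts of those occurrences into patterns.
        patterns = gen_patterns(occurrences)            # Generate Extraction Patterns
        # Apply the patterns; confident new tuples become new seeds.
        table |= {tup for p in patterns for tup in match(p, tagged)}  # Augment Table
    return table
```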