1
Discovering and Utilizing Structure in Large Unstructured Text Datasets
Eugene Agichtein
Math and Computer Science Department
2
Information Extraction Example
Information extraction systems represent text in structured form.
May 19 1995, Atlanta -- The Centers for Disease Control and Prevention, which is in the front line of the world's response to the deadly Ebola epidemic in Zaire , is finding itself hard pressed to cope with the crisis…
Date Disease Name Location
Jan. 1995 Malaria Ethiopia
July 1995 Mad Cow Disease U.K.
Feb. 1995 Pneumonia U.S.
May 1995 Ebola Zaire
Disease Outbreaks in The New York Times
Information Extraction System
3
How can information extraction help?
… allow precise and efficient querying
… allow returning answers instead of documents
… support powerful query constructs
… allow data integration with (structured) RDBMS
… provide input to data mining & statistics analysis
Large Text Collection Structured Relation
4
Goal: Detect, Monitor, Predict Outbreaks
Current Patient Records: Diagnosis, physician’s notes, lab results/analysis, …
911 Calls: traffic accidents, …
Historical news, breaking news stories, wire, alerts, …
Hospital Records
IESys 4
IESys 3
IESys 2
IESys 1
Data Integration, Data Mining, Trend Analysis
Detection, Monitoring, Prediction
5
Challenges in Information Extraction
Portability: reduce the effort to tune for new domains and tasks (MUC systems: experts would take 8-12 weeks to tune)
Scalability, Efficiency, Access: enable information extraction over large collections (1 sec/document × 5 billion docs ≈ 158 CPU years)
Approach: learn from data ("Bootstrapping"): Snowball, Partially Supervised Information Extraction; Querying Large Text Databases for Efficient Information Extraction
6
Outline
Information extraction overview
Partially supervised information extraction: adaptivity; confidence estimation
Text retrieval for scalable extraction: query-based information extraction; implicit connections/graphs in text databases
Current and future work: inferring and analyzing social networks; utility-based extraction tuning; multi-modal information extraction and data mining; authority/trust/confidence estimation
7
What is “Information Extraction”
As a task: filling slots in a database from sub-segments of text.
October 14, 2002, 4:00 a.m. PT
For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation.
Today, Microsoft claims to "love" the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers.
"We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access.“
Richard Stallman, founder of the Free Software Foundation, countered saying…
NAME TITLE ORGANIZATION
8
What is “Information Extraction”
As a task: filling slots in a database from sub-segments of text.
NAME TITLE ORGANIZATION
Bill Gates CEO Microsoft
Bill Veghte VP Microsoft
Richard Stallman founder Free Soft..
IE
9
What is "Information Extraction"
Information Extraction = segmentation + classification + clustering + association
As a familyof techniques:
Segmented entities: Microsoft Corporation, CEO, Bill Gates, Microsoft, Gates, Microsoft, Bill Veghte, Microsoft, VP, Richard Stallman, founder, Free Software Foundation
12
What is "Information Extraction"
Information Extraction = segmentation + classification + association + clustering
As a familyof techniques:
Segmented entities: Microsoft Corporation, CEO, Bill Gates, Microsoft, Gates, Microsoft, Bill Veghte, Microsoft, VP, Richard Stallman, founder, Free Software Foundation

NAME TITLE ORGANIZATION
Bill Gates CEO Microsoft
Bill Veghte VP Microsoft
Richard Stallman founder Free Soft..
13
IE in Context
Create ontology
SegmentClassifyAssociateCluster
Load DB
Spider
Query,Search
Data mine
IE
Documentcollection
Database
Filter by relevance
Label training data
Train extraction models
14
Information Extraction Tasks
Extracting entities and relations:
Entities: named (e.g., Person) or generic (e.g., disease name)
Relations: entities related in a predefined way (e.g., location of a disease outbreak), or discovered automatically
Common information extraction steps:
Preprocessing: sentence chunking, parsing, morphological analysis
Rules/extraction patterns: manual, machine learning, and hybrid
Applying extraction patterns to extract new information
Postprocessing and complex extraction (not covered): co-reference resolution; combining relations into events, rules, …
15
Two kinds of IE approaches
Knowledge Engineering: rule based; developed by experienced language engineers; makes use of human intuition; requires only a small amount of training data; development can be very time consuming; some changes may be hard to accommodate
Machine Learning: uses statistics or other machine learning; developers do not need LE expertise; requires large amounts of annotated training data; some changes may require re-annotation of the entire training corpus; annotators are cheap (but you get what you pay for!)
16
Extracting Entities from Text
Any of these models can be used to capture words, formatting or both.
Running example: Abraham Lincoln was born in Kentucky.
Lexicons (Alabama, Alaska, …, Wisconsin, Wyoming): member?
Sliding Window: classifier asks "which class?"; try alternate window sizes
Classify Pre-segmented Candidates: classifier asks "which class?" for each candidate segment
Boundary Models: classify BEGIN and END boundaries
Finite State Machines: most likely state sequence?
Context Free Grammars: most likely parse? (with NP, PP, VP, S over the tagged sentence)
…and beyond
17
Hidden Markov Models
[Graphical model: state chain S_{t-1} → S_t → S_{t+1}, each state emitting an observation O_t]
Finite state model; graphical model
Parameters, for all states S = {s1, s2, …}:
Start state probabilities: P(s_t)
Transition probabilities: P(s_t | s_{t-1})
Observation (emission) probabilities: P(o_t | s_t)
Training: maximize probability of training observations (with prior)

P(s, o) = ∏_{t=1}^{|o|} P(s_t | s_{t-1}) · P(o_t | s_t)

Generates: a state sequence and an observation sequence (o1 o2 o3 o4 o5 o6 o7 o8)
Emissions are usually a multinomial over an atomic, fixed alphabet
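As a minimal sketch of this factorization (not code from the talk; all probability tables below are made-up toy values, and the state names are illustrative):

```python
# Sketch of the HMM factorization P(s, o) = prod_t P(s_t | s_{t-1}) * P(o_t | s_t).
# All probability tables are made-up toy values, not trained parameters.

START = {"Other": 0.8, "Person": 0.2}                 # P(s_1)
TRANS = {"Other": {"Other": 0.7, "Person": 0.3},      # P(s_t | s_{t-1})
         "Person": {"Other": 0.6, "Person": 0.4}}
EMIT = {"Other": {"spoke": 0.1, "yesterday": 0.2},    # P(o_t | s_t)
        "Person": {"lawrence": 0.3, "saul": 0.3}}

def joint_prob(states, words):
    """P(s, o) for aligned state and observation sequences."""
    p = START[states[0]] * EMIT[states[0]].get(words[0], 1e-6)
    for prev, cur, w in zip(states, states[1:], words[1:]):
        p *= TRANS[prev][cur] * EMIT[cur].get(w, 1e-6)
    return p
```

Training then amounts to choosing the three tables to maximize this product over the training observations.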
18
IE with Hidden Markov Models
Given a sequence of observations: Yesterday Lawrence Saul spoke this example sentence.
and a trained HMM, find the most likely state sequence (Viterbi):

s* = argmax_s P(s, o)

Any words said to be generated by the designated "person name" state are extracted as a person name. Person name: Lawrence Saul
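A minimal Viterbi decoder for this kind of tagging can be sketched as follows; the states, transition table, and emission table are invented toy parameters, not those of any system in the talk:

```python
# A minimal Viterbi decoder (a sketch with made-up toy parameters).

def viterbi(words, states, start, trans, emit):
    # delta[s]: probability of the best state path ending in state s.
    delta = {s: start[s] * emit[s].get(words[0], 1e-6) for s in states}
    back = []  # back-pointers, one dict per position after the first
    for w in words[1:]:
        prev_delta, delta, ptr = delta, {}, {}
        for s in states:
            best = max(states, key=lambda p: prev_delta[p] * trans[p][s])
            delta[s] = prev_delta[best] * trans[best][s] * emit[s].get(w, 1e-6)
            ptr[s] = best
        back.append(ptr)
    # Follow back-pointers from the best final state.
    path = [max(states, key=lambda s: delta[s])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return list(reversed(path))

states = ["Other", "Person"]
start = {"Other": 0.8, "Person": 0.2}
trans = {"Other": {"Other": 0.6, "Person": 0.4},
         "Person": {"Other": 0.5, "Person": 0.5}}
emit = {"Other": {"yesterday": 0.3, "spoke": 0.3},
        "Person": {"lawrence": 0.4, "saul": 0.4}}

tags = viterbi(["yesterday", "lawrence", "saul", "spoke"],
               states, start, trans, emit)
```

With these toy tables, the words tagged "Person" are exactly lawrence and saul, which would be extracted as the person name.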
19
HMM Example: "Nymble" [Bikel, et al 1998], [BBN "IdentiFinder"]
Task: Named Entity Extraction. Train on 450k words of news wire text.
States: Person, Org, (five other name classes), Other, plus start-of-sentence and end-of-sentence
Transition probabilities: P(s_t | s_{t-1}, o_{t-1}), backing off to P(s_t | s_{t-1}), then P(s_t)
Observation probabilities: P(o_t | s_t, s_{t-1}) or P(o_t | s_t, o_{t-1}), backing off to P(o_t | s_t), then P(o_t)
Results:
Case Language F1
Mixed English 93%
Upper English 91%
Mixed Spanish 90%
Other examples of shrinkage for HMMs in IE: [Freitag and McCallum '99]
20
Relation Extraction
Extract structured relations from text
May 19 1995, Atlanta -- The Centers for Disease Control and Prevention, which is in the front line of the world's response to the deadly Ebola epidemic in Zaire , is finding itself hard pressed to cope with the crisis…
Date Disease Name Location
Jan. 1995 Malaria Ethiopia
July 1995 Mad Cow Disease U.K.
Feb. 1995 Pneumonia U.S.
May 1995 Ebola Zaire
Information Extraction System
Disease Outbreaks in The New York Times
21
Relation Extraction typically requires entity tagging as preprocessing.
Knowledge Engineering: rules defined over lexical items ("<company> located in <location>") or over parsed text ("((Obj <company>) (Verb located) (*) (Subj <location>))"); Proteus, GATE, …
Machine Learning-based: learn rules/patterns from examples (Dan Roth 2005, Cardie 2006, Mooney 2005, …); partially supervised: bootstrap from "seed" examples (Agichtein & Gravano 2000, Etzioni et al. 2004, …)
Recently, hybrid models [Feldman 2004, 2006]
22
Comparison of Approaches
Significant effort: use "language-engineering" environments to help experts create extraction patterns (GATE [2002], Proteus [1998])
Substantial effort: train system over manually labeled data (Soderland et al. [1997], Muslea et al. [2000], Riloff et al. [1996])
Minimal effort: exploit large amounts of unlabeled data (DIPRE [Brin 1998], Snowball [Agichtein & Gravano 2000]; Etzioni et al. ['04]: KnowItAll, extracting unary relations; Yangarber et al. ['00, '02]: pattern refinement, generalized names detection)
23
The Snowball System: Overview
Snowball
Text Database
Organization Location Conf
Microsoft Redmond 1
IBM Armonk 1
Intel Santa Clara 1
AG Edwards St Louis 0.9
Air Canada Montreal 0.8
7th Level Richardson 0.8
3Com Corp Santa Clara 0.8
3DO Redwood City 0.7
3M Minneapolis 0.7
MacWorld San Francisco 0.7
157th Street Manhattan 0.52
15th Party Congress China 0.3
15th Century Europe Dark Ages 0.1
...
24
Snowball: Getting User Input
User input: a handful of example instances, plus integrity constraints on the relation (e.g., Organization is a "key", Age > 0, etc.)
Snowball loop: Get Examples → Find Example Occurrences in Text → Tag Entities → Generate Extraction Patterns → Extract Tuples → Evaluate Tuples (and iterate)
ACM DL 2000
Organization Headquarters
Microsoft Redmond
IBM Armonk
Intel Santa Clara
25
Can use any full-text search engine.
Snowball: Finding Example Occurrences
Search Engine
Text Database
Organization Headquarters
Microsoft Redmond
IBM Armonk
Intel Santa Clara
Computer servers at Microsoft’s headquarters in Redmond…
In mid-afternoon trading, shares of Redmond, WA-based Microsoft Corp
The Armonk-based IBM introduced a new line…
Change of guard at IBM Corporation’s headquarters near Armonk, NY ...
26
Named entity taggers can recognize Dates, People, Locations, Organizations, …: MITRE's Alembic, IBM's Talent, LingPipe, …
Snowball: Tagging Entities
Computer servers at Microsoft ’s headquarters in Redmond…
In mid-afternoon trading, shares of Redmond, WA -based Microsoft Corp
The Armonk -based IBM introduced a new line…
Change of guard at IBM Corporation‘s headquarters near Armonk, NY ...
27
Snowball: Extraction Patterns
General extraction pattern model: acceptor0, Entity, acceptor1, Entity, acceptor2
Acceptor instantiations:
String Match (accepts string "'s headquarters in")
Vector-Space (~ vector [('s, 0.5), (headquarters, 0.5), (in, 0.5)])
Sequence Classifier (Prob(T=valid | 's, headquarters, in)): HMMs, sparse sequences, Conditional Random Fields, …
Example: Computer servers at Microsoft's headquarters in Redmond…
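The vector-space acceptor can be sketched as cosine similarity between term-weight vectors; the pattern and context weights below are illustrative, not values from the system:

```python
# Sketch of the vector-space acceptor: the match between a pattern and a
# candidate context is the cosine similarity of their term-weight vectors.
import math

def cosine(a, b):
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

pattern = {"'s": 0.71, "headquarters": 0.71}
context = {"'s": 0.5, "new": 0.5, "headquarters": 0.5, "in": 0.5}
match = cosine(pattern, context)  # shared terms give a high match score
```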
28
Snowball: Generating Patterns
1. Represent occurrences as vectors of tags and terms:
ORGANIZATION {<'s 0.57>, <headquarters 0.57>, <in 0.57>} LOCATION
LOCATION {<- 0.71>, <based 0.71>} ORGANIZATION
ORGANIZATION {<'s 0.57>, <headquarters 0.57>, <near 0.57>} LOCATION
LOCATION {<- 0.71>, <based 0.71>} ORGANIZATION
2. Cluster similar occurrences.
29
Snowball: Generating Patterns
1. Represent occurrences as vectors of tags and terms.
2. Cluster similar occurrences.
3. Create patterns as filtered cluster centroids:
ORGANIZATION {<'s 0.71>, <headquarters 0.71>} LOCATION
LOCATION {<- 0.71>, <based 0.71>} ORGANIZATION
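These three steps can be sketched with a simple single-pass threshold clustering; the similarity threshold and occurrence vectors below are illustrative, and the real system's clustering may differ:

```python
# Sketch of steps 1-3 with a simple single-pass threshold clustering.
import math

def cosine(a, b):
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def centroid(vectors):
    terms = {t for v in vectors for t in v}
    return {t: sum(v.get(t, 0.0) for v in vectors) / len(vectors) for t in terms}

def make_patterns(occurrences, threshold=0.6):
    clusters = []
    for occ in occurrences:                 # step 2: single-pass clustering
        for c in clusters:
            if cosine(occ, centroid(c)) >= threshold:
                c.append(occ)
                break
        else:
            clusters.append([occ])
    return [centroid(c) for c in clusters]  # step 3: centroids as patterns

occurrences = [                             # step 1: occurrence vectors
    {"'s": 0.57, "headquarters": 0.57, "in": 0.57},
    {"'s": 0.57, "headquarters": 0.57, "near": 0.57},
    {"-": 0.71, "based": 0.71},
]
patterns = make_patterns(occurrences)       # the two "headquarters" contexts merge
```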
30
Vector Space Clustering
31
Google 's new headquarters in Mountain View are …
Snowball: Extracting New Tuples
Match tagged text fragments against patterns:
P1: ORGANIZATION {<'s 0.71>, <headquarters 0.71>} LOCATION: Match = 0.8
P2: ORGANIZATION {<located 0.71>, <in 0.71>} LOCATION: Match = 0.4
P3: LOCATION {<- 0.71>, <based 0.71>} ORGANIZATION: Match = 0
Candidate V: ORGANIZATION {<'s 0.5>, <new 0.5>, <headquarters 0.5>, <in 0.5>} LOCATION {<are 1>}
32
Snowball: Evaluating Patterns
Automatically estimate pattern confidence: Conf(P4) = Positive / Total = 2/3 = 0.66
IBM, Armonk, reported… Positive
Intel, Santa Clara, introduced... Positive
"Bet on Microsoft", New York-based analyst Jane Smith said... Negative
P4: ORGANIZATION {<, 1>} LOCATION
Current seed tuples:
Organization Headquarters
IBM Armonk
Intel Santa Clara
Microsoft Redmond
33
Snowball: Evaluating Tuples
Automatically evaluate tuple confidence: a tuple has high confidence if it was generated by high-confidence patterns.

Conf(T) = 1 - ∏_i (1 - Conf(P_i) · Match(P_i))

Example tuple: <3COM, Santa Clara>
P4: ORGANIZATION {<, 1>} LOCATION, Conf = 0.66, Match = 0.8
P3: LOCATION {<- 0.75>, <based 0.75>} ORGANIZATION, Conf = 0.95, Match = 0.4
Conf(T): 0.83
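The combination rule can be sketched directly; the pattern confidences and match scores below are invented for illustration and are not the slide's example values:

```python
# Sketch of Conf(T) = 1 - prod_i (1 - Conf(P_i) * Match(P_i)):
# each supporting pattern independently lowers the chance the tuple is wrong.

def tuple_conf(evidence):
    """evidence: list of (pattern_confidence, match_score) pairs for one tuple."""
    prob_all_wrong = 1.0
    for conf_p, match in evidence:
        prob_all_wrong *= 1.0 - conf_p * match
    return 1.0 - prob_all_wrong

# One strongly matching high-confidence pattern plus one weaker one.
conf = tuple_conf([(0.9, 0.9), (0.7, 0.5)])
```

Note the noisy-or behavior: adding any supporting pattern can only raise a tuple's confidence.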
34
Snowball: Evaluating Tuples
Organization Headquarters Conf
Microsoft Redmond 1
IBM Armonk 1
Intel Santa Clara 1
AG Edwards St Louis 0.9
Air Canada Montreal 0.8
7th Level Richardson 0.8
3Com Corp Santa Clara 0.8
3DO Redwood City 0.7
3M Minneapolis 0.7
MacWorld San Francisco 0.7
157th Street Manhattan 0.52
15th Party Congress China 0.3
15th Century Europe Dark Ages 0.1
...
Keep only high-confidence tuples for the next iteration.
35
Snowball: Evaluating Tuples
Organization Headquarters Conf
Microsoft Redmond 1
IBM Armonk 1
Intel Santa Clara 1
AG Edwards St Louis 0.9
Air Canada Montreal 0.8
7th Level Richardson 0.8
3Com Corp Santa Clara 0.8
3DO Redwood City 0.7
3M Minneapolis 0.7
MacWorld San Francisco 0.7
Start a new iteration with the expanded example set. Iterate until no new tuples are extracted.
36
Pattern-Tuple Duality
A "good" tuple: extracted by "good" patterns; tuple weight ~ goodness
A "good" pattern: generated by "good" tuples, and extracts "good" new tuples; pattern weight ~ goodness
Edge weight: match/similarity of tuple context to pattern
37
How to Set Node Weights
Constraint violation (from before): Conf(P) = Log(Pos) · Pos/(Pos+Neg); Conf(T) = 1 - ∏_i (1 - Conf(P_i) · Match(P_i))
HITS [Hassan et al., EMNLP 2006]: Conf(P) = ∑ Conf(T); Conf(T) = ∑ Conf(P)
URNS [Downey et al., IJCAI 2005]
EM-Spy [Agichtein, SDM 2006]: treat unknown tuples as Neg, compute Conf(P) and Conf(T), iterate
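The HITS-style scheme can be sketched as mutual reinforcement over the pattern-tuple graph; the tiny edge list and the max-normalization below are illustrative assumptions, not the cited paper's exact setup:

```python
# Sketch of HITS-style mutual reinforcement: pattern and tuple
# confidences repeatedly recompute each other over extraction edges.

def hits_conf(edges, n_iter=20):
    """edges: set of (pattern, tuple) pairs, meaning the pattern extracted the tuple."""
    patterns = {p for p, _ in edges}
    tuples_ = {t for _, t in edges}
    conf_p = {p: 1.0 for p in patterns}
    conf_t = {t: 1.0 for t in tuples_}
    for _ in range(n_iter):
        conf_t = {t: sum(conf_p[p] for p, t2 in edges if t2 == t) for t in tuples_}
        conf_p = {p: sum(conf_t[t] for p2, t in edges if p2 == p) for p in patterns}
        # Normalize so scores stay in [0, 1].
        zt, zp = max(conf_t.values()), max(conf_p.values())
        conf_t = {t: v / zt for t, v in conf_t.items()}
        conf_p = {p: v / zp for p, v in conf_p.items()}
    return conf_p, conf_t

# A tuple supported by two patterns outranks one seen by a single weak pattern.
edges = {("P1", "IBM/Armonk"), ("P1", "Intel/SantaClara"),
         ("P2", "IBM/Armonk"), ("P3", "157thSt/Manhattan")}
conf_p, conf_t = hits_conf(edges)
```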
38
Evaluating Patterns and Tuples: Expectation Maximization
EM-Spy Algorithm:
1. "Hide" labels for some seed tuples
2. Iterate the EM algorithm to convergence on tuple/pattern confidence values
3. Set threshold t such that t > 90% of spy tuples
4. Re-initialize Snowball using the new seed tuples
Organization Headquarters Initial Final
Microsoft Redmond 1 1
IBM Armonk 1 0.8
Intel Santa Clara 1 0.9
AG Edwards St Louis 0 0.9
Air Canada Montreal 0 0.8
7th Level Richardson 0 0.8
3Com Corp Santa Clara 0 0.8
3DO Redwood City 0 0.7
3M Minneapolis 0 0.7
MacWorld San Francisco 0 0.7
157th Street Manhattan 0 0.52
15th Party Congress China 0 0.3
15th Century Europe Dark Ages 0 0.1
…..
39
Adapting Snowball for New Relations
Large parameter space: initial seed tuples (randomly chosen, multiple runs); acceptor features (words, stems, n-grams, phrases, punctuation, POS); feature selection techniques (OR, NB, Freq, "support", combinations); feature weights (TF*IDF, TF, TF*NB, NB); pattern evaluation strategies (NN, constraint violation, EM, EM-Spy)
Automatically estimate parameter values: estimate operating parameters based on occurrences of seed tuples; run cross-validation on hold-out sets of seed tuples for optimal performance; seed occurrences that do not have close "neighbors" are discarded
40
Example Task: DiseaseOutbreaks
Proteus: 0.409, Snowball: 0.415
SDM 2006
41
Snowball Used in Various Domains
News: NYT, WSJ, AP [DL'00, SDM'06]: CompanyHeadquarters, MergersAcquisitions, DiseaseOutbreaks
Medical literature: PDR, Micromedex, … [Thesis]: AdverseEffects, DrugInteractions, RecommendedTreatments
Biological literature: GeneWays corpus [ISMB'03]: Gene and Protein Synonyms
42
Outline
Information extraction overview
Partially supervised information extraction: adaptivity; confidence estimation
Text retrieval for scalable extraction: query-based information extraction; implicit connections/graphs in text databases
Current and future work: inferring and analyzing social networks; utility-based extraction tuning; multi-modal information extraction and data mining; authority/trust/confidence estimation
43
Extracting A Relation From a Large Text Database
Brute force approach: feed all docs to the information extraction system. Expensive for large collections!
Often only a tiny fraction of documents are useful; many databases are not crawlable; often a search interface is available, with an existing keyword index. How to identify "useful" documents?
Text Database → Information Extraction System → Structured Relation
44
An Abstract View of Text-Centric Tasks
[Diagram: Text Database → Extraction System → Output tuples]
1. Retrieve documents from database → 2. Process documents → 3. Extract output tuples
Task: tuple
Information Extraction: Relation Tuple
Database Selection: Word (+Frequency)
Focused Crawling: Web Page about a Topic
[Ipeirotis, Agichtein, Jain, Gravano, SIGMOD 2006]
45
Executing a Text-Centric Task
1. Retrieve documents from database → 2. Process documents → 3. Extract output tuples
Similar to the relational world, there are two major execution paradigms:
Scan-based: retrieve and process documents sequentially
Index-based: query the database (e.g., [case fatality rate]), then retrieve and process the documents in the results
Unlike the relational world, indexes are only "approximate" (the index is on keywords, not on tuples of interest), and the choice of execution plan affects output completeness, not only speed
→ the underlying data distribution dictates what is best
46
Scan
1. Retrieve docs from database → 2. Process documents → 3. Extract output tuples
Scan retrieves and processes documents sequentially (until reaching target recall).
Execution time = |Retrieved Docs| · (R + P), where R is the time to retrieve a document and P the time to process a document.
Question: how many documents does Scan retrieve to reach target recall?
Filtered Scan uses a classifier to identify and process only promising documents (details in paper).
47
Iterative Query Expansion
1. Query database with seed tuples (e.g., [Ebola AND Zaire])
2. Process retrieved documents
3. Extract tuples from docs
4. Augment seed tuples with new tuples (e.g., <Malaria, Ethiopia>) and generate new queries
Execution time = |Retrieved Docs| · (R + P) + |Queries| · Q, where R is the time to retrieve a document, P the time to process a document, and Q the time to answer a query.
Question: how many queries and how many documents does Iterative Set Expansion need to reach target recall?
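The two cost models can be compared with a back-of-the-envelope sketch; the per-document costs R and P, the per-query cost Q, and the document counts below are made-up illustrative values:

```python
# Back-of-the-envelope sketch of the two execution plans' cost models.

def scan_time(n_docs, R, P):
    """Scan: retrieve and process every document sequentially."""
    return n_docs * (R + P)

def iterative_expansion_time(n_docs, n_queries, R, P, Q):
    """Iterative Set Expansion: per-document cost plus per-query cost."""
    return n_docs * (R + P) + n_queries * Q

R, P, Q = 0.1, 1.0, 0.5  # seconds; illustrative values only
full_scan = scan_time(1_000_000, R, P)                   # touch every document
targeted = iterative_expansion_time(50_000, 2_000, R, P, Q)
```

Querying touches far fewer documents, so it can be orders of magnitude cheaper, provided the queries reach enough of the relation.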
48
QXtract: Querying Text Databases for Robust Scalable Information EXtraction
[Diagram: User-Provided Seed Tuples → Query Generation → Queries → Search Engine over Text Database → Promising Documents → Information Extraction System → Extracted Relation]
DiseaseName Location Date
Malaria Ethiopia Jan. 1995
Ebola Zaire May 1995
Mad Cow Disease The U.K. July 1995
Pneumonia The U.S. Feb. 1995
DiseaseName Location Date
Malaria Ethiopia Jan. 1995
Ebola Zaire May 1995
Problem: Learn keyword queries to retrieve “promising” documents
49
Learning Queries to Retrieve Promising Documents
1. Get document sample with “likely negative” and “likely positive” examples.
2. Label sample documents using information extraction system as “oracle.”
3. Train classifiers to “recognize” useful documents.
4. Generate queries from classifier model/rules.
[Diagram: User-Provided Seed Tuples → Seed Sampling (via Search Engine over Text Database) → sample documents labeled +/- by the Information Extraction System → Classifier Training → Query Generation → Queries]
50
Training Classifiers to Recognize "Useful" Documents
Document features: words
D1 (+): disease reported epidemic expected area
D2 (+): virus reported expected infected patients
D3 (-): products made used exported far
D4 (-): past old homerun sponsored event
Ripper: disease AND reported => USEFUL
SVM: virus 3, infected 2, sponsored -1
Okapi (IR): disease, infected, reported, virus, epidemic, products, used, far, exported
51
Generating Queries from Classifiers
Ripper: disease AND reported => USEFUL → query [disease AND reported]
SVM: virus 3, infected 2, sponsored -1 → queries [virus], [infected]
Okapi (IR): disease, infected, reported, virus, epidemic, products, used, far, exported → queries [epidemic], [virus]
QCombined: [disease AND reported], [epidemic], [virus]
52
SIGMOD 2003 Demonstration
53
An Even Simpler Querying Strategy: "Tuples"
1. Convert given tuples into queries (e.g., <Ebola, Zaire> → ["Ebola" AND "Zaire"])
2. Retrieve matching documents from the search engine
3. Extract new tuples from the documents (e.g., <Malaria, Ethiopia>, <hemorrhagic fever, Africa>) and iterate
DiseaseName Location Date
Ebola Zaire May 1995
Malaria Ethiopia Jan. 1995
hemorrhagic fever Africa May 1995
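The three steps can be sketched as a worklist loop; search() and extract() below are hypothetical stand-ins for a real search engine and extraction system, and the toy corpus is invented:

```python
# Sketch of the "Tuples" strategy as a worklist loop.

def tuples_strategy(seeds, search, extract):
    known, frontier = set(seeds), list(seeds)
    while frontier:
        tup = frontier.pop()
        for doc in search(" AND ".join(tup)):  # e.g. [Ebola AND Zaire]
            for new in extract(doc):
                if new not in known:
                    known.add(new)
                    frontier.append(new)       # new tuple becomes a new query
    return known

# Toy corpus: each "document" simply lists the tuples it mentions.
corpus = {
    ("Ebola", "Zaire"): [[("Ebola", "Zaire"), ("Malaria", "Ethiopia")]],
    ("Malaria", "Ethiopia"): [[("Malaria", "Ethiopia")]],
}
search = lambda q: corpus.get(tuple(q.split(" AND ")), [])
extract = lambda doc: doc
found = tuples_strategy([("Ebola", "Zaire")], search, extract)
```

The loop terminates once no query surfaces an unseen tuple, which is exactly the failure mode the following slides analyze.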
54
[Chart: recall (%) vs. MaxFractionRetrieved (5%, 10%, 25%) for the QXtract, Manual, Tuples, and Baseline strategies]
Comparison of Document Access Methods
QXtract: 60% of relation extracted from 10% of documents of 135,000 newspaper article database
Tuples strategy: Recall at most 46%
55
Predicting Recall of Tuples Strategy
[Diagram: starting from a Seed Tuple, iterative querying either reaches most of the relation (SUCCESS!) or stalls early (FAILURE)]
Can we predict if Tuples will succeed?
WebDB 2003
56
Using Querying Graph for Analysis
We need to compute:
the number of documents retrieved after sending Q tuples as queries (estimates time)
the number of tuples that appear in the retrieved documents (estimates recall)
To estimate these we need to compute:
the degree distribution of the tuples discovered by retrieving documents
the degree distribution of the documents retrieved by the tuples
(Not the same as the degree distribution of a randomly chosen tuple or document: it is easier to discover documents and tuples with high degrees)
Tuples: t1 <SARS, China>, t2 <Ebola, Zaire>, t3 <Malaria, Ethiopia>, t4 <Cholera, Sudan>, t5 <H5N1, Vietnam>; Documents: d1-d5
57
Information Reachability Graph
t1 retrieves document d1, which contains t2: so t2, t3, and t4 are "reachable" from t1.
[Graph: tuples t1-t5 linked through documents d1-d5]
58
Connected Components
In → Core (strongly connected) → Out
In: tuples that retrieve other tuples but are not reachable themselves
Core: tuples that retrieve other tuples and themselves
Out: reachable tuples that do not retrieve tuples in the Core
59
Sizes of Connected Components
How many tuples are in the largest Core + Out?
Conjecture: the degree distribution in reachability graphs follows a "power law"; then the reachability graph has at most one giant component.
Define Reachability as the fraction of tuples in the largest Core + Out.
60
NYT Reachability Graph: Outdegree Distribution
[Plots for MaxResults=10 and MaxResults=50]
Matches the power-law distribution
61
NYT: Component Size Distribution
MaxResults=10: CG / |T| = 0.297 (not "reachable")
MaxResults=50: CG / |T| = 0.620 ("reachable")
62
Connected Components Visualization
DiseaseOutbreaks, New York Times 1995
63
Estimating Reachability
In a power-law random graph G, a giant component CG emerges* if d (the average outdegree) > 1.
Estimate: Reachability ~ CG / |T|, which depends only on d (the average outdegree).
* For power-law exponent β < 3.457; Chung and Lu, Annals of Combinatorics, 2002
64
Estimating Reachability Algorithm
1. Pick some random tuples
2. Use the tuples to query the database
3. Extract tuples from matching documents to compute reachability graph edges
4. Estimate the average outdegree
5. Estimate reachability using the results of Chung and Lu, Annals of Combinatorics, 2002
[Example graph: tuples t1-t4 and documents d1-d4 yield average outdegree d = 1.5]
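Steps 3-4 can be sketched as follows; the edge list stands in for the edges observed while querying a real database, and the specific edges are invented to reproduce the slide's d = 1.5:

```python
# Sketch of steps 3-4: compute the average outdegree d from sampled
# reachability-graph edges (source tuple -> tuple extracted from its results).
from collections import defaultdict

def average_outdegree(edges, tuples_queried):
    """edges: (source_tuple, extracted_tuple) pairs observed while querying."""
    out = defaultdict(set)
    for src, dst in edges:
        out[src].add(dst)
    return sum(len(out[t]) for t in tuples_queried) / len(tuples_queried)

edges = [("t1", "t1"), ("t1", "t3"), ("t2", "t2"),
         ("t3", "t2"), ("t3", "t4"), ("t4", "t1")]
d = average_outdegree(edges, ["t1", "t2", "t3", "t4"])
# d > 1 suggests a giant reachable component (under the power-law
# conjecture), i.e. the Tuples strategy has a chance of high recall.
```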
65
Estimating Reachability of NYT
[Chart: estimated reachability vs. MaxResults (MR=1 to MR=1000) for sample sizes S=10, 50, 100, 200, against the real graph; actual reachability ≈ 0.46]
Approximate reachability is estimated after ~50 queries.
Can be used to predict success (or failure) of a Tuples querying strategy.
66
Outline
Information extraction overview
Partially supervised information extraction: adaptivity; confidence estimation
Text retrieval for scalable extraction: query-based information extraction; implicit connections/graphs in text databases
Current and future work: adaptive information extraction and tuning; authority/trust/confidence estimation; inferring and analyzing social networks; multi-modal information extraction and data mining
67
Goal: Detect, Monitor, Predict Outbreaks
Current Patient Records: Diagnosis, physician’s notes, lab results/analysis, …
911 Calls: traffic accidents, …
Historical news, breaking news stories, wire, alerts, …
Hospital Records
IESys 4
IESys 3
IESys 2
IESys 1
Data Integration, Data Mining, Trend Analysis
Detection, Monitoring, Prediction
68
Adaptive, Utility-Driven Extraction
Extract relevant symptoms and modifiers from text: physician notes, patient narrative, call transcripts.
Call transcripts are a difficult extraction problem: not grammatical, dialogue, speech-to-text unreliable, … Use partially supervised techniques to learn extraction patterns.
One approach: link together (when possible) call transcript and patient record (e.g., by time, address, and patient name), and correlate patterns in the transcript with diagnosis/symptoms. Fine-grained learning: can automatically train for each symptom, group of patients, etc.
69
Authority, Trust, Confidence
How reliable are signals emitted by information extraction?
Dimensions of trust/confidence: source reliability (diagnosis vs. notes vs. 911 calls); tuple extraction confidence; source extraction difficulty
70
Source Confidence Estimation
The task is "easy" when context term distributions diverge from the background.
Quantify this as relative entropy (Kullback-Leibler divergence):

KL(LM_C || LM_BG) = ∑_{w ∈ V} LM_C(w) · log( LM_C(w) / LM_BG(w) )

After calibration, the metric predicts whether a task is "easy" or "hard".
[Chart: background frequencies of terms such as "the", "to", "and", "said", "'s", "company", "mrs", "won", "president"]
CIKM 2005
President George W Bush’s three-day visit to India
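A sketch of the divergence computation with toy distributions (both term distributions below are invented, and the floor for unseen terms is an assumption):

```python
# Sketch of the metric: KL divergence between a unigram model of
# extraction contexts (LM_C) and a background model (LM_BG).
import math

def kl_divergence(lm_c, lm_bg, floor=1e-9):
    """KL(LM_C || LM_BG) over the terms of LM_C; unseen terms get a tiny floor."""
    return sum(p * math.log(p / lm_bg.get(w, floor))
               for w, p in lm_c.items() if p > 0)

background = {"the": 0.5, "company": 0.2, "said": 0.2, "president": 0.1}
easy_task = {"headquarters": 0.6, "based": 0.3, "the": 0.1}  # distinctive contexts
hard_task = {"the": 0.45, "company": 0.25, "said": 0.3}      # near background
# A larger divergence predicts an "easier" extraction task.
```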
71
Inferring Social Networks
Explicit networks: patient records contain family and geographical entities in structured and unstructured portions.
Implicit connections: extract events (e.g., "went to restaurant X yesterday"); extract relationships (e.g., "I work in Kroeger's in Toco Hills").
72
Modeling Social Networks for Epidemiology, security, …
Email exchange mapped onto cubicle locations.
73
Improve Prediction Accuracy
Suppose we managed to automatically identify people currently sick or about to get sick, and to automatically infer (part of) their social network. Can we improve prediction for the dynamics of an outbreak?
74
Multimodal Information Extraction and Data Mining
Develop joint models over structured data, e.g., lab results and symptoms extracted from text.
One approach: mutual reinforcement. Co-training: train classifiers on redundant views of the data (e.g., structured & unstructured), and bootstrap on examples proposed by both views. More generally: graphical models.
75
Summary
Information extraction overview
Partially supervised information extraction: adaptivity; confidence estimation
Text retrieval for scalable extraction: query-based information extraction; implicit connections/graphs in text databases
Current and future work: adaptive information extraction and tuning; authority/trust/confidence estimation; inferring and analyzing social networks; multi-modal information extraction and data mining
76
Thank You
Details: papers, other talk slides:http://www.mathcs.emory.edu/~eugene/