automatic hypernym classification: towards the induction of

Click here to load reader

Upload: butest

Post on 11-May-2015

554 views

Category:

Documents


2 download

TRANSCRIPT

  • 1.CS 124/LINGUIST 180: From Languages to Information
    Dan Jurafsky
    Lecture 14: Information Extraction and Semantic Relation learning
    Lots of slides from manypeople, including Rion Snow, Jim Martin, Chris Manning, and William Cohen,

2. 2
Background: Information Extraction
Extract information from text
Sometimes called text analyticscommercially
Extract entities (the people, organizations, locations, times, dates, genes, diseases, medicines, etc. in a text)
Extract the relations between entities
Figure out the larger events that are taking place
3. 3
Information Extraction
Creating knowledge bases and ontologies
Implications for cognitive modeling
Digital Libaries
Google scholar, Citeseer need to extract the title, author and references
Bioinformatics
Patent analysis
Specific market segments for stock analysis
SEC filings
Intelligence analysis
4. Outline
Reminder: Named Entity Tagging
Relation Extraction
Hand-built patterns
Seed (bootstrap) methods
Supervised classification
Distant supervision
5. What is Information Extraction
As a task:
Filling slots in a database from sub-segments of text.
October 14, 2002, 4:00 a.m. PT
For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation.
Today, Microsoft claims to "love" the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers.
"We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access.
Richard Stallman, founder of the Free Software Foundation, countered saying
NAMETITLE ORGANIZATION
Slide from William Cohen
6. What is Information Extraction
As a task:
Filling slots in a database from sub-segments of text.
October 14, 2002, 4:00 a.m. PT
For years, Microsoft CorporationCEOBill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation.
Today, Microsoft claims to "love" the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers.
"We can be open source. We love the concept of shared source," said Bill Veghte, a MicrosoftVP. "That's a super-important shift for us in terms of code access.
Richard Stallman, founder of the Free Software Foundation, countered saying
IE
NAMETITLE ORGANIZATION
Bill GatesCEOMicrosoft
Bill VeghteVPMicrosoft
Richard StallmanfounderFree Soft..
Slide from William Cohen
7. What is Information Extraction
As a familyof techniques:
Information Extraction =
segmentation + classification + clustering + association
October 14, 2002, 4:00 a.m. PT
For years, Microsoft CorporationCEOBill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation.
Today, Microsoft claims to "love" the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers.
"We can be open source. We love the concept of shared source," said Bill Veghte, a MicrosoftVP. "That's a super-important shift for us in terms of code access.
Richard Stallman, founder of the Free Software Foundation, countered saying
Microsoft Corporation
CEO
Bill Gates
Microsoft
Gates
Microsoft
Bill Veghte
Microsoft
VP
Richard Stallman
founder
Free Software Foundation
named entity extraction
Slide from William Cohen
8. What is Information Extraction
As a familyof techniques:
Information Extraction =
segmentation + classification + association + clustering
October 14, 2002, 4:00 a.m. PT
For years, Microsoft CorporationCEOBill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation.
Today, Microsoft claims to "love" the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers.
"We can be open source. We love the concept of shared source," said Bill Veghte, a MicrosoftVP. "That's a super-important shift for us in terms of code access.
Richard Stallman, founder of the Free Software Foundation, countered saying
Microsoft Corporation
CEO
Bill Gates
Microsoft
Gates
Microsoft
Bill Veghte
Microsoft
VP
Richard Stallman
founder
Free Software Foundation
Slide from William Cohen
9. What is Information Extraction
As a familyof techniques:
Information Extraction =
segmentation + classification+ association + clustering
October 14, 2002, 4:00 a.m. PT
For years, Microsoft CorporationCEOBill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation.
Today, Microsoft claims to "love" the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers.
"We can be open source. We love the concept of shared source," said Bill Veghte, a MicrosoftVP. "That's a super-important shift for us in terms of code access.
Richard Stallman, founder of the Free Software Foundation, countered saying
Microsoft Corporation
CEO
Bill Gates
Microsoft
Gates
Microsoft
Bill Veghte
Microsoft
VP
Richard Stallman
founder
Free Software Foundation
Slide from William Cohen
10. What is Information Extraction
TITLE ORGANIZATION
NAME
Bill Gates
CEO
Microsoft
Bill
Veghte
VP
Microsoft
Free Soft..
Stallman
founder
Richard
As a familyof techniques:
Information Extraction =
segmentation + classification+ association+ clustering
October 14, 2002, 4:00 a.m. PT
For years, Microsoft CorporationCEOBill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation.
Today, Microsoft claims to "love" the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers.
"We can be open source. We love the concept of shared source," said Bill Veghte, a MicrosoftVP. "That's a super-important shift for us in terms of code access.
Richard Stallman, founder of the Free Software Foundation, countered saying
Microsoft Corporation
CEO
Bill Gates
Microsoft
Gates
Microsoft
Bill Veghte
Microsoft
VP
Richard Stallman
founder
Free Software Foundation
*
*
*
*
Slide from William Cohen
11. Extracting Structured Knowledge
Each article can contain hundreds or thousands of items of knowledge...
The Lawrence Livermore National Laboratory (LLNL) in Livermore, California is a scientific research laboratory founded by the University of California in 1952.
LLNL EQ Lawrence Livermore National Laboratory
LLNL LOC-IN California
Livermore LOC-IN California
LLNL IS-A scientific research laboratory
LLNL FOUNDED-BY University of California
LLNL FOUNDED-IN 1952
12. 12
Goal:Machine-readable summaries
Structured knowledge extraction:Summary for machine
Textual abstract:
Summary for human
13. From Unstructured Text to Structured Knowledge
Unstructured Text
News articles...
slide from Rion Snow
14. From Unstructured Text to Structured Knowledge
Unstructured Text
Blog posts....
slide from Rion Snow
15. From Unstructured Text to Structured Knowledge
Unstructured Text
Scientific journal articles...
slide from Rion Snow
16. From Unstructured Text to Structured Knowledge
Unstructured Text
Tweets, instant messages, chat logs...
slide from Rion Snow
17. From Unstructured Text to Structured Knowledge
Unstructured Text
slide from Rion Snow
18. From Unstructured Text to Structured Knowledge
Structured Knowledge
Unstructured Text
slide from Rion Snow
19. From Unstructured Text to Structured Knowledge
Structured Knowledge
Unstructured Text
slide from Rion Snow
20. From Unstructured Text to Structured Knowledge
Structured Knowledge
Unstructured Text
slide from Rion Snow
21. From Unstructured Text to Structured Knowledge
Structured Knowledge
Unstructured Text
slide from Rion Snow
22. From Unstructured Text to Structured Knowledge
Structured Knowledge
Unstructured Text
slide from Rion Snow
23. From Unstructured Text to Structured Knowledge
Structured Knowledge
Unstructured Text
slide from Rion Snow
24. Reminder: Task 1: Named Entity Tagging

  • General NER or Biomedical NER

John Hennessy is a professor at Stanford University , in Palo Alto .
TAR independent transactivation by Tat in cells derived from the CNS - a novel mechanism of HIV-1 gene regulation.
Slide from Chris Manning
25. Reminder:Maximum Entropy Markov Model
DNA
O
DNA
O
regulation
HIV1
gene
of
Slide from Chris Manning
26. Task II: Relation Extraction
27. Relations between words
Language Understanding Applications needs word meaning!
Question answering
Conversational agents
Summarization
One key meaning component: word relations
Hierarchical (ontological) relations
San Francisco ISA city
Other relations between words
alternator is a part of a car
28. Relation Prediction
...works by such authors as Herrick, Goldsmith, and Shakespeare.
If you consider authors like Shakespeare...
Some authors (including Shakespeare)...
Shakespeare was the author of several...
Shakespeare, author of The Tempest...
ShakespeareIS-A author(0.87)
How can we capture the variability of expression of a relation in natural text from a large, unannotated corpus?
29. Hyponymy
One sense is a hyponym of another if the first sense is more specific, denoting a subclass of the other
car is a hyponym of vehicle
dog is a hyponym of animal
mango is a hyponym of fruit
Conversely
vehicle is a hypernym/superordinateof car
animal is a hypernym of dog
fruit is a hypernym of mango
30. 30
WordNet relations
X is-a-kind-of Y
(hyponym / hypernym)
X is-a-part-of Y
(meronym / holonym)
slide from Rion Snow
31. WordNet is incomplete; ontological relations are missing for many words
This is especially true for specific domains (restaurants, auto parts, finance)
32. Other kinds of Relations: Disease Outbreaks
Extract structured information from text
Slide from Eugene Agichtein
May 19 1995, Atlanta -- The Centers for Disease Control and Prevention, which is in the front line of the world's response to the deadly Ebola epidemic in Zaire , is finding itself hard pressed to cope with the crisis
Disease Outbreaks in The New York Times
Information Extraction System (e.g., NYUs Proteus)
33. More relations: Protein Interactions
interact
complex
CBF-ACBF-C
CBF-BCBF-A-CBF-C complex
associates
We show that CBF-A and CBF-C interact
with each other to form a CBF-A-CBF-C complex
and that CBF-B does not interact with CBF-A or
CBF-C individually but that it associates with the
CBF-A-CBF-C complex.
Slide from Rosario and Hearst
34. Yet More Relations
CHICAGO (AP) Citing high fuel prices, United Airlines said Friday it has increased fares by $6 per round trip on flights to some cities also served by lower-cost carriers. American Airlines, a unit AMR, immediately matched the move, spokesman Tim Wagner said. United, a unit of UAL, said the increase took effect Thursday night and applies to most routes where it competes against discount carriers, such as Chicago to Dallas and Atlanta and Denver to San Francisco, Los Angeles and New York
Slide from Jim Martin
35. Relation Types
For generic news texts...
Slide from Jim Martin
36. More relations: UMLS
Unified Medical Language System
integrates linguistic, terminological and semantic information
Semantic Network consists of 134 semantic types and 54 relations between types
Pharmacologic Substance affects Pathologic Function
Pharmacologic Substance causes Pathologic Function
Pharmacologic Substance complicatesPathologic Function
Pharmacologic Substance diagnoses Pathologic Function
Pharmacologic Substance prevents Pathologic Function
Pharmacologic Substance treats Pathologic Function
Slide from Paul Buitelaar
37. Relations in Ontologies: GO (Gene Ontology)
GO (Gene Ontology)
Aligns descriptions of gene products in different databases, including plant, animal and microbial genomes
Organizing principles are molecular function, biological process and cellular component
Accession:GO:0009292
Ontology:biological process
Synonyms:broad: genetic exchange
Definition:In the absence of a sexual life cycle, the processes involved in the introduction of genetic information to create a genetically different individual.
Term Lineageall : all (164142)
GO:0008150 : biological process (115947)
GO:0007275 : development (11892)
GO:0009292 : genetic transfer (69)
Slide from Paul Buitelaar
38. Relations in Ontologies: geographical
Ontology
F-Logic
similar
Geographical Entity (GE)
is-a
flow_through
Inhabited GE
Natural GE
capital_of
city
river
mountain
country
instance_of
located_in
Neckar
Zugspitze
Germany
capital_of
length (km)
height (m)
flow_through
located_in
Berlin
Stuttgart
367
2962
flow_through
Design: Philipp Cimiano
Slide from Paul Buitelaar
39. 39
MeSH (Medical Subject Headings) Thesaurus
Definition
MeSH Descriptor
Synonym set
Slide from Illhoi Yoo, Xiaohua (Tony) Hu,andIl-Yeol Song
40. MeSH Tree
MeSH Ontology
Hierarchically arranged from most general to most specific.
Actually a graph rather than a tree
normally appear in more than one place in the tree
MeSH Tree
Slide from Illhoi Yoo, Xiaohua (Tony) Hu,andIl-Yeol Song
41. Slide from Doug Appelt
Types of ACE Relations, 2003
ROLE - relates a person to an organization or a geopolitical entity
Subtypes: member, owner, affiliate, client, citizen
PART - generalized containment
Subtypes: subsidiary, physical part-of, set membership
AT - permanent and transient locations
Subtypes: located, based-in, residence
SOCIAL- social relations among persons
Subtypes: parent, sibling, spouse, grandparent, associate
42. Frequent Freebase Relations
a
43. Predicting the is-a relation
...works by such authors as Herrick, Goldsmith, and Shakespeare.
If you consider authors like Shakespeare...
Some authors (including Shakespeare)...
Shakespeare was the author of several...
Shakespeare, author of The Tempest...
ShakespeareIS-A author(0.87)
How can we capture the variability of expression of a relation in natural text from a large, unannotated corpus?
44. Treatment
Disease
Why this is hard: Ambiguity!Which relations hold between 2 entities?
Cure?
Prevent?
Side Effect?
45. Different relations between Disease (Hepatitis) and Treatment
Cure
These results suggest that con A-induced hepatitis was ameliorated by pretreatment with TJ-135.
Prevent
A two-dose combined hepatitis A and Bvaccine would facilitate immunization programs
Vague
Effect of interferon on hepatitis B
Slide from Barbara Rosario and Marti Hearst
46. 5 easy methods for relation extraction
Hand-built patterns
Supervised methods
Bootstrapping (seed) methods
Unsupervised methods
Distant supervision
47. 5 easy methods for relation extraction
Hand-built patterns
Bootstrapping (seed) methods
Supervised methods
Unsupervised methods
Distant supervision
48. A complex hand-built extraction rule [NYU Proteus]
49. Goal: Add hyponyms to WordNet directly from text.
Intuition from Hearst (1992)
Agar is a substance prepared from a mixture of red algae, such as Gelidium, for laboratory or industrial use
What does Gelidium mean?
How do you know?`
50. Goal: Add hyponyms to WordNet directly from text.
Intuition from Hearst (1992)
Agar is a substance prepared from a mixture of red algae, such as Gelidium, for laboratory or industrial use
What does Gelidium mean?
How do you know?`
51. Hearsts Hand-Designed Lexico-Syntactic Patterns
(Hearst, 1992): Automatic Acquisition of Hyponyms
Y such as X ((, X)* (, and/or) X)
such Y as X
X or other Y
X and other Y
Y including X
Y, especially X
52. Hearsts hand-built patterns for Relation Extraction
53. Problem with hand-built patterns
Requires that we hand-build patterns for each relation!
dont want to have to do this for all possible relations!
wed like better accuracy
54. 5 easy methods for relation extraction
Hand-built patterns
Supervised methods
Bootstrapping (seed) methods
Unsupervised methods
Distant supervision
55. 2. Supervised Relation Extraction
Sometimes done in 3 steps
Find all pairs of named entities
Decide if 2 entities are related
If yes, classifying the relation
Why the extra step?
Cuts down on training time for classification by eliminating most pairs
Producing separate feature-sets that are appropriate for each task.
55
56. Relation Analysis
Usually just run on named entities within the same sentence
Slide from Jim Martin
57. Slide from Jing Jiang
Relation Extraction
Task definition: to label the semantic relation between a pair of entities in a sentence (fragment)
[leaderarg-1] of a minority [governmentarg-2]
PHYS
PER-SOC
EMP-ORG
NIL
PHYS: Physical
PER-SOC: Personal / Social
EMP-ORG: Employment / Membership / Subsidiary
58. Supervised Learning
Supervised machine learning (e.g. [Zhou et al. 2005], [Bunescu & Mooney 2005], [Zhang et al. 2006], [Surdeanu & Ciaramita 2007])
Training data is needed for each relation type
[leaderarg-1] of a minority [governmentarg-2]
arg-1 word: leader
arg-2 type: ORG
dependency:
arg-1 of arg-2
EMP-ORG
PHYS
PER-SOC
NIL
Slide from Jing Jiang
59. ACE 2008 Six Relations
60. Features: Words
Headwords of M1 and M2, and combination
George Washington Bridge
Bag of words and bigrams in M1 and M2
Words or bigrams in particular positions to the left and right of the M1 and M2
+/- 1, 2, 3
Bag of words or bigrams between the two entities
61. Features: Named Entity Type and Mention Level
Named-entity types (ORG, LOC, etc)
Concatenation of the types
Entity Level of M1 and M2
(NAME, NOMINAL, PRONOUN)
62. Features: Parse Tree and Base Phrases
Syntactic environment
Constituent path through the tree from one to the other
Base syntactic chunk sequence from one to the other
Dependency path
Slide from Jim Martin
63. Features: Gazeteers and trigger words
Personal relative trigger list
from wordnet: parent, wife, husabnd, grandparent, etc
Country name list
64. American Airlines, a unit of AMR, immediately matched the move, spokesman Tim Wagnersaid.
65. Classifiers for supervised methods
Now you can use any classifier you like
SVM
Logistic regression
Nave Bayes
etc
66. Summary
Can get high accuracies with enough hand-labeled training data
If test data looks exactly like the training data
But
labeling 5000 relations (and named entities) is expensive
the approach doesnt generalize to different genres
67. 5 easy methods for relation extraction
Hand-built patterns
Supervised methods
Bootstrapping (seed) methods
Unsupervised methods
Distant supervision
68. Slide from Jim Martin
Bootstrapping Approaches
What if you dont have enough annotated text to train on.
But you might have some seed tuples
Or you might have some patterns that work pretty well
Can you use those seeds to do something useful?
Co-training and active learning use the seeds to train classifiers to tag more data to train better classifiers...
Bootstrapping tries to learn directly (populate a relation) through direct use of the seeds
69. Slide from Jim Martin
Bootstrapping Example: Seed Tuple
Seed tuple
Grep (google)
Mark Twain is buried in Elmira, NY.
X is buried in Y
The grave of Mark Twain is in Elmira
The grave of X is in Y
Elmira is Mark Twains final resting place
Y is Xs final resting place.
Use those patterns to grep for new tuples that you dont already know
70. Hearst (1992) proposal for bootstrapping
Choose lexical relation R.
Gather a set of pairs that have this relation
Find places in the corpus where these expressions occur near each other and record the environment.
Find the commonalities among these environments and hypothesize that common ones yield patterns that indicate the relation of interest.
71. Slide from Jim Martin
Bootstrapping Relations
72. Dipre (Brin 1998)
Extract pairs.
Start with these 5 seeds
Learn these patterns:
Now iterate, using these patterns to get more instances and patterns
73. Snowball (Agichtein and Gravano 2000)
New idea: require that X and Y be named entities of particular types
{ }
LOCATION
ORGANIZATION
{ }
LOCATION
ORGANIZATION
74. 5 easy methods for relation extraction
Hand-built patterns
Supervised methods
Bootstrapping (seed) methods
Unsupervised methods
Distant supervision
75. Distant supervision paradigm
Snow, Jurafsky, Ng. 2005. Learning syntactic patterns for automatic hypernym discovery. NIPS 17
Mintz, Bills, Snow, Jurafsky (2009)Distant supervision for relation extraction without labeled data.ACL-2009.
Instead of hand-creating 5 seed examples
Use a large database to get our seed examples
lots of examples
supervision from a database, not a corpus!
Not genre-dependent!
Create lots and lots of noisy features from all these examples
Combine in a classifier
76. Distant supervision paradigm
Has advantages of supervised classification:
use of rich of hand-created knowledge
Has advantages of unsupervised classification:
infinite amounts of data
allows for very large number of weak features
not sensitive to training corpus
77. 77
Relation Classification with Distant Supervision
We construct a noisy training set consisting of occurrences from our corpus that contain an IS-A pair according to WordNet.
Slide from Rion Snow
78. 78
Relation Classification with Distant Supervision
We construct a noisy training set consisting of occurrences from our corpus that contain an IS-A pair according to WordNet.
Slide from Rion Snow
79. 79
Relation Classification with Distant Supervision
We construct a noisy training set consisting of occurrences from our corpus that contain an IS-A pair according to WordNet.
Slide from Rion Snow
80. 80
Relation Classification with Distant Supervision
We construct a noisy training set consisting of occurrences from our corpus that contain an IS-A pair according to WordNet.
Slide from Rion Snow
81. 81
Relation Classification with Distant Supervision
We construct a noisy training set consisting of occurrences from our corpus that contain an IS-A pair according to WordNet.
This leads to high-signal examples like:
...consider authors like Shakespeare...
Some authors (including Shakespeare)...
Shakespeare was the author of several...
Shakespeare, author of The Tempest...
Slide from Rion Snow
82. 82
Relation Classification with Distant Supervision
We construct a noisy training set consisting of occurrences from our corpus that contain an IS-A pair according to WordNet.
This leads to high-signal examples like:
...consider authors like Shakespeare...
Some authors (including Shakespeare)...
Shakespeare was the author of several...
Shakespeare, author of The Tempest...
But noisy examples like:
The author of Shakespeare in Love...
...authors at the ShakespeareFestival...
Training set (TREC and Wikipedia):
14,000 hypernym pairs, ~600,000 total pairs
Slide from Rion Snow
83. How to learn patterns
Snow, Jurafsky, Ng. 2005. Learning syntactic patterns for automatic hypernym discovery. NIPS 17
doubly heavy hydrogen atomcalleddeuterium
Take corpus sentences
Collect noun pairs
752,311 pairs from 6M words of newswire
Is pair an IS-A in WordNet?
14,387 yes, 737,924 no
Parse the sentences
Extract patterns
69,592 dependency paths >5 pairs)
Train classifier on these patterns
Logistic regression with 70K features(actually converted to 974,288 bucketed binary features)
1
(Atom, deuterium)
2
YES
3
4
5
6
84. One of 70,000 patterns
called
Learned from cases such as:
sarcoma / cancer:an uncommon bone cancercalled osteogenicsarcoma and to
deuterium / atom.heavy water rich in the doubly heavy hydrogen atomcalleddeuterium.
New pairs discovered:
efflorescence / condition:and a conditioncalledefflorescence are other reasons for
neal_inc / company The company, now called O'Neal Inc., was sole distributor of E-Ferol
hat_creek_outfit / ranch run a small ranch called the Hat Creek Outfit.
hiv-1 / aids_virus infected by the AIDS virus, called HIV-1.
bateau_mouche / attraction local sightseeing attraction called the Bateau Mouche...
kibbutz_malkiyya / collective_farman Israeli collective farm called Kibbutz Malkiyya
85. Hypernym Precision / Recall for all Features
Slide from Rion Snow
86. Hypernym Precision / Recall for all Features
Slide from Rion Snow
87. Hypernym Precision / Recall for all Features
Slide from Rion Snow
88. Hypernym Precision / Recall for all Features
Slide from Rion Snow
89. Idea: use each pattern as a feature!!!!Precision/Recall for Hypernym Classification:
logistic regression
10-fold Cross Validation on 14,000 WordNet-Labeled Pairs
slide from Rion Snow
90. Outline
Reminder: Named Entity Tagging
Relation Extraction
Hand-built patterns
Seed (bootstrap) methods
Supervised classification
Distant supervision