Entity Linking
Pawan Goyal
CSE, IIT Kharagpur
August 17th, 2016
Pawan Goyal (IIT Kharagpur) Entity Linking August 17th, 2016 1 / 23
What is Entity Linking?
Entity Linking: Common Steps
Determine “linkable” phrases
mention detection - MD
Rank/Select candidate entity links
link generation - LG
Use “context” to disambiguate/filter/improve
disambiguation - DA
Mention Detection (MD)
Link Generation (LG)
Disambiguation (DA)
Preliminaries: Wikipedia
Preliminaries: Disambiguation Pages
Some Statistics
WordNet: 80k entity definitions
142k senses (entity - surface forms)
Wikipedia: 4M entity definitions
24M senses
Wikipedia based methods
What can be a good measure for MD?
keyphraseness(w): the number of Wikipedia articles that use w as an anchor, divided by the number of articles that mention it at all.
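As a rough illustration of this ratio, here is a minimal sketch in Python. The corpus layout (pairs of plain text and anchor texts), the substring matching, and the toy articles are all assumptions made for the example, not part of the original method description.

```python
def keyphraseness(phrase, articles):
    """Fraction of articles mentioning `phrase` that also use it as an anchor.

    `articles` is an iterable of (plain_text, anchor_texts) pairs; this layout
    is a hypothetical stand-in for a parsed Wikipedia dump.
    """
    phrase = phrase.lower()
    mentioned = 0  # articles that contain the phrase at all
    linked = 0     # articles that also use the phrase as link anchor text
    for text, anchors in articles:
        if phrase in text.lower():  # crude substring match, enough for a sketch
            mentioned += 1
            if phrase in {a.lower() for a in anchors}:
                linked += 1
    return linked / mentioned if mentioned else 0.0


# Toy corpus, invented for illustration only.
ARTICLES = [
    ("Kharagpur is a city in India.", {"India"}),
    ("India won the match.", set()),
    ("Cricket is popular in India.", {"India", "Cricket"}),
    ("The city is large.", set()),
]

print(keyphraseness("India", ARTICLES))  # linked in 2 of the 3 articles that mention it
print(keyphraseness("city", ARTICLES))   # mentioned twice, never linked
```

A phrase with high keyphraseness is a good candidate mention; a phrase that is frequently written but rarely linked is not.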
Wikipedia based methods
What can be a good measure for DA?
commonness(w,c): the fraction of times a particular sense c is used as the link destination for w in Wikipedia.
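The definition above can be sketched by counting (anchor, destination) pairs. The function names, the input format, and the toy link counts below are assumptions for illustration only.

```python
from collections import Counter, defaultdict


def build_anchor_stats(links):
    """Count, for each anchor text, how often it links to each destination."""
    stats = defaultdict(Counter)
    for anchor, target in links:
        stats[anchor.lower()][target] += 1
    return stats


def commonness(anchor, sense, stats):
    """Fraction of links with this anchor text that point to `sense`."""
    counts = stats.get(anchor.lower())
    if not counts:
        return 0.0
    return counts[sense] / sum(counts.values())


# Toy (anchor, destination) pairs, invented for illustration only.
LINKS = (
    [("jaguar", "Jaguar (animal)")] * 3
    + [("jaguar", "Jaguar Cars")] * 1
)
STATS = build_anchor_stats(LINKS)

print(commonness("jaguar", "Jaguar (animal)", STATS))
print(commonness("jaguar", "Jaguar Cars", STATS))
```

Picking the sense with the highest commonness gives a context-free baseline for disambiguation.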
Commonness and keyphraseness
Always the best decision?
Using Relatedness: Basic Idea
In a sufficiently long text, one finds terms that do not require disambiguation at all.
Use every unambiguous link in the document as context to disambiguate ambiguous ones.
Finding Relatedness: A link-based measure
relatedness(c,c′): Using the intersection among incoming as well as outgoing links of two Wikipedia pages
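One way to turn link overlap into a score is sketched below. It assumes a distance in the style of the Normalized Google Distance computed over the link sets of the two pages, converted to a similarity as 1 minus the distance and clipped at 0. The conversion, the clipping, and the handling of empty overlap are assumptions of this sketch; the slide only states that the measure uses the intersection of links.

```python
import math


def relatedness(links_a, links_b, total_articles):
    """Link-overlap relatedness in [0, 1] between two pages.

    `links_a` and `links_b` are collections of article ids linked with each
    page (for example its incoming links); `total_articles` is the size of
    the whole collection. All three are assumed inputs for this sketch.
    """
    a, b = set(links_a), set(links_b)
    overlap = len(a & b)
    if overlap == 0:
        return 0.0  # no shared links: treat the pages as unrelated
    big, small = max(len(a), len(b)), min(len(a), len(b))
    denom = math.log(total_articles) - math.log(small)
    if denom <= 0:
        return 1.0  # degenerate case: the link sets cover the whole collection
    distance = (math.log(big) - math.log(overlap)) / denom
    return max(0.0, 1.0 - distance)


print(relatedness({1, 2, 3, 4}, {3, 4, 5, 6}, total_articles=1000))
```

The same function could be applied separately to incoming and outgoing link sets and the two scores combined; how to combine them is left open here.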
Computing Relatedness
Each candidate sense and context term is represented by a single Wikipedia article.
Thus the problem is reduced to selecting the sense article that has most in common with all of the context articles.
Comparison of articles is facilitated by the Wikipedia Link-based measure, which measures the semantic similarity of two Wikipedia pages by comparing their incoming and outgoing links.
The relatedness of a candidate sense is the weighted average of its relatedness to each context article.
How to give different weights to the context terms?
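The weighted average described above can be sketched as follows, leaving the choice of weights open. The pairwise relatedness table, the weights, and the article titles are invented toy values; `score_sense` and `best_sense` are hypothetical helper names.

```python
def score_sense(candidate, context, weights, rel):
    """Weighted average of the candidate's relatedness to each context article."""
    total_weight = sum(weights[c] for c in context)
    if total_weight == 0:
        return 0.0
    return sum(weights[c] * rel(candidate, c) for c in context) / total_weight


def best_sense(candidates, context, weights, rel):
    """Pick the candidate sense with the highest weighted relatedness."""
    return max(candidates, key=lambda s: score_sense(s, context, weights, rel))


# Toy pairwise relatedness values, invented for illustration only.
REL = {
    ("Jaguar (animal)", "Rainforest"): 0.8,
    ("Jaguar (animal)", "Predator"): 0.7,
    ("Jaguar Cars", "Rainforest"): 0.1,
    ("Jaguar Cars", "Predator"): 0.05,
}


def toy_rel(a, b):
    return REL.get((a, b), 0.0)


CONTEXT = ["Rainforest", "Predator"]
WEIGHTS = {"Rainforest": 1.0, "Predator": 0.5}
CANDIDATES = ["Jaguar (animal)", "Jaguar Cars"]

print(best_sense(CANDIDATES, CONTEXT, WEIGHTS, toy_rel))
```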
Weighting the Context Terms
link probability: Use the ones that are almost always used as a link within the articles where they are found, and always link to the same destination
relatedness: We can determine how closely a term relates to the central document by calculating its average semantic relatedness to all other context terms
These two variables - link probability and relatedness - are averaged to provide a weight for each context term.
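A minimal sketch of this weighting, assuming link probabilities and a pairwise relatedness function are already available. The term names and all numeric values are placeholders invented for the example.

```python
def context_weights(context, link_prob, rel):
    """Weight each context term by averaging its link probability with its
    mean relatedness to all other context terms."""
    weights = {}
    for term in context:
        others = [o for o in context if o != term]
        avg_rel = sum(rel(term, o) for o in others) / len(others) if others else 0.0
        weights[term] = (link_prob[term] + avg_rel) / 2
    return weights


# Toy values, invented for illustration only.
LINK_PROB = {"A": 0.9, "B": 0.5, "C": 0.1}
PAIR_REL = {
    frozenset(("A", "B")): 0.8,
    frozenset(("A", "C")): 0.2,
    frozenset(("B", "C")): 0.4,
}


def toy_rel(a, b):
    return PAIR_REL.get(frozenset((a, b)), 0.0)


WEIGHTS = context_weights(["A", "B", "C"], LINK_PROB, toy_rel)
print(WEIGHTS)
```

Terms that are both reliably linked and close to the rest of the context end up with the largest say in disambiguation.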
Can we improve mention detection with this approach?
The link detection process starts by gathering all n-grams in the document, and retaining those whose link probability exceeds a very low threshold.
Is it the best method?
All the remaining phrases are disambiguated using the approach mentioned earlier.
This results in a set of associations between terms in the document and the Wikipedia articles that describe them.
Can you use this to learn which concepts should be linked?
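The first step, gathering n-grams and filtering by a low probability threshold, might look like the sketch below. The maximum n-gram length, the threshold value, and the probability table are assumptions chosen for the example.

```python
def candidate_mentions(tokens, link_prob, max_n=3, threshold=0.05):
    """Return (start, length, phrase, probability) for every n-gram whose
    link probability exceeds `threshold`."""
    found = []
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            phrase = " ".join(tokens[i:i + n]).lower()
            p = link_prob.get(phrase, 0.0)
            if p > threshold:
                found.append((i, n, phrase, p))
    return found


# Toy link probabilities, invented for illustration only.
LINK_PROB = {"hillary clinton": 0.9, "clinton": 0.3, "the": 0.0001, "city": 0.02}
TOKENS = "Hillary Clinton visited the city".split()

for mention in candidate_mentions(TOKENS, LINK_PROB):
    print(mention)
```

Note that overlapping candidates such as "Clinton" and "Hillary Clinton" both survive this step; resolving them is left to the later stages.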
Example
The Learning Problem: Which topics should be linked?
The automatically identified Wikipedia articles provide training instances for a classifier.
Positive examples are the articles that were manually linked to, while negative ones are those that were not.
Features of these articles – and the places where they were mentioned – are used to inform the classifier about which topics should and should not be linked.
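The labelling scheme can be sketched as below. It only builds the feature matrix and labels, and the choice of classifier is left open. The topic names, the feature vectors, and the helper name `training_instances` are hypothetical.

```python
def training_instances(detected_topics, manual_links, features):
    """Build (X, y) for a link/no-link classifier.

    A detected topic is a positive example (1) if the article's authors
    linked to it manually, and a negative example (0) otherwise.
    """
    X, y = [], []
    for topic in sorted(detected_topics):
        X.append(features[topic])
        y.append(1 if topic in manual_links else 0)
    return X, y


# Toy inputs, invented for illustration only.
DETECTED = {"Rainforest", "Predator", "Year"}
MANUAL = {"Rainforest", "Predator"}
FEATURES = {
    "Rainforest": [0.6, 0.8],
    "Predator": [0.4, 0.7],
    "Year": [0.01, 0.1],
}

X, y = training_instances(DETECTED, MANUAL, FEATURES)
print(X, y)
```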
What are the features?
Link Probability: Average as well as maximum of link probability of the link locations – (e.g. Hillary Clinton and Clinton)
Relatedness: Topics which relate to the central thread of the document are more likely to be linked
Disambiguation Confidence: The confidence score of the classifier for disambiguation
Generality: Defined as the minimum depth at which it is located in Wikipedia’s category tree. More useful for the readers to provide links for specific topics.
Location and Spread: Where are these mentioned? First occurrence, last occurrence and the spread.
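Two of these feature groups, link probability and location/spread, are simple enough to sketch directly. Normalising positions by document length is an assumption of this example, as are the feature names and the toy numbers.

```python
def mention_features(positions, link_probs, doc_length):
    """Compute link-probability and location features for one topic.

    `positions` are token offsets of the topic's mentions, `link_probs` the
    link probability of each mention's surface form, and `doc_length` the
    number of tokens in the document.
    """
    first = min(positions) / doc_length
    last = max(positions) / doc_length
    return {
        "avg_link_prob": sum(link_probs) / len(link_probs),
        "max_link_prob": max(link_probs),
        "first_occurrence": first,
        "last_occurrence": last,
        "spread": last - first,
    }


# Toy topic mentioned three times, once as a full name and twice by surname.
FEATS = mention_features(positions=[10, 50, 90],
                         link_probs=[0.9, 0.3, 0.3],
                         doc_length=100)
print(FEATS)
```

Keeping both the average and the maximum link probability lets the classifier credit a topic for its most link-worthy surface form even when it is mostly mentioned by a weaker one.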
References
Mihalcea, Rada, and Andras Csomai. “Wikify!: Linking documents to encyclopedic knowledge.” Proceedings of the 16th ACM Conference on Information and Knowledge Management. ACM, 2007.
Milne, David, and Ian H. Witten. “Learning to link with Wikipedia.” Proceedings of the 17th ACM Conference on Information and Knowledge Management. ACM, 2008.