Effective Named Entity Recognition for Idiosyncratic Web Collections
Presentation at WWW 2014
Effective Named Entity Recognition for Idiosyncratic Web Collections
Roman Prokofyev, Gianluca Demartini, Philippe Cudre-Mauroux
eXascale Infolab, University of Fribourg, Switzerland
WWW 2014
April 10, 2014
Outline
• Introduction
• Problem definition
• Existing approaches and applicability
• Overview
• Candidate Named Entities Selection
• Dataset description
• Features description
• Experimental setup & Evaluation
Problem Definition
• search engine
• web search engine
• navigational query
• user intent
• information need
• web content
• …
Entity type: scientific concept
Traditional NER
Types:
• Maximum Entropy (Mallet, NLTK)
• Conditional Random Fields (Stanford NER, Mallet)
Properties:
• Require extensive training
• Usually domain-specific: each collection requires training on its own domain
• Very good at detecting types such as Location, Person, and Organization
Proposed Approach
Our problem is defined as a classification task.
Two-step classification:
• Extract candidate named entities using a frequency filtering algorithm.
• Classify the candidate named entities using a supervised classifier.
Candidate selection should greatly reduce the number of n-grams to classify, ideally without a significant loss in recall.
Pipeline
Candidate Selection: Part I
Consider all bigrams with frequency > k (k=2):
• candidate named: 5
• entity are: 4
• entity candidate: 3
• entity in: 18
• entity recognition: 12
• named entity: 101
• of named: 10
• that named: 3
• the named: 4
After the NLTK stop word filter:
• candidate named: 5
• entity candidate: 3
• entity recognition: 12
• named entity: 101
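The filtering step above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, and the small hard-coded stop word set is an assumed stand-in for the NLTK stop word list used on the slide.

```python
from collections import Counter

# Assumption: a tiny stand-in for the NLTK English stop word list.
STOP_WORDS = {"are", "in", "of", "that", "the"}


def count_bigrams(tokens):
    """Count adjacent word pairs in a token list."""
    return Counter(" ".join(pair) for pair in zip(tokens, tokens[1:]))


def select_candidates(bigram_counts, k=2, stop_words=STOP_WORDS):
    """Keep bigrams with frequency > k that contain no stop word."""
    return {
        bigram: freq
        for bigram, freq in bigram_counts.items()
        if freq > k and not (set(bigram.split()) & stop_words)
    }


# Bigram frequencies copied from the slide.
slide_counts = {
    "candidate named": 5, "entity are": 4, "entity candidate": 3,
    "entity in": 18, "entity recognition": 12, "named entity": 101,
    "of named": 10, "that named": 3, "the named": 4,
}
candidates = select_candidates(slide_counts)
print(candidates)
```

With the slide's counts, only the four bigrams free of stop words remain as candidates.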
Candidate Selection: Part II
Trigram frequency is looked up from the n-gram index.
Bigrams after Part I:
• candidate named: 5
• entity candidate: 3
• entity recognition: 12
• named entity: 101
Overlapping bigrams are merged into trigrams:
• candidate named entity: 5
• named entity candidate: 3
• named entity recognition: 12
• named entity: 101
• candidate named: 5
• entity candidate: 3
• entity recognition: 12
Trigram frequencies are then subtracted from their component bigrams:
• candidate named entity: 5
• named entity candidate: 3
• named entity recognition: 12
• named entity: 81
• candidate named: 0
• entity candidate: 0
• entity recognition: 0
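One reading of the numbers on this slide is that each trigram's frequency is subtracted from both bigrams it was built from (101 - 5 - 3 - 12 = 81 for "named entity"). The sketch below follows that reading; the function name and the dictionary-based n-gram index are assumptions made for illustration.

```python
def merge_bigrams(bigrams, trigram_index):
    """Merge overlapping bigrams into trigrams and discount the bigram counts."""
    result = dict(bigrams)
    for left in bigrams:
        for right in bigrams:
            # Two bigrams overlap when the last word of one starts the other.
            if left.split()[1] != right.split()[0]:
                continue
            trigram = f"{left} {right.split()[1]}"
            freq = trigram_index.get(trigram, 0)
            if freq == 0:
                continue  # the trigram never occurs in the collection
            result[trigram] = freq
            result[left] -= freq
            result[right] -= freq
    return result


# Values copied from the slide; the dict stands in for the n-gram index.
bigrams = {
    "candidate named": 5, "entity candidate": 3,
    "entity recognition": 12, "named entity": 101,
}
trigram_index = {
    "candidate named entity": 5,
    "named entity candidate": 3,
    "named entity recognition": 12,
}
merged = merge_bigrams(bigrams, trigram_index)
print(merged)
```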
Candidate Selection: Discussion
It is possible to extract n-grams (n > 2) with frequency ≤ k, because the threshold k is applied only to bigrams.
After Candidate Selection
TwiNER: Named Entity Recognition in Targeted Twitter Stream, SIGIR 2012
Classifier: Overview
Machine learning algorithm: Decision Trees from the scikit-learn package.
Feature types:
• POS tags and their derivatives
• External knowledge bases (DBLP, DBPedia)
• DBPedia relation graphs
• Syntactic features
Datasets
Two collections:
• CS Collection (SIGIR 2012 Research Track): 100 papers
• Physics Collection: 100 papers randomly selected from the arXiv.org High Energy Physics category

| | CS Collection | Physics Collection |
|---|---|---|
| # Candidate n-grams | 21,531 | 18,129 |
| # Judged n-grams | 15,057 | 11,421 |
| # Valid entities | 8,145 | 5,747 |
| # Invalid n-grams | 6,912 | 5,674 |

Available at: github.com/XI-lab/scientific_NER_dataset
Features: POS Tags, part I
100+ different tag patterns
Features: POS Tags, part II
Two feature schemes:
• Raw POS tag patterns, each tag is a binary feature
• Regex POS tag patterns:
  • First tag match, for example: JJ NNS, JJ NN NN, JJ NN, ... → JJ*
  • Last tag match: NN VB, NN NN VB, JJ NN VB, ... → *VB
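To make the regex scheme concrete, the sketch below collapses a raw tag pattern into first-tag and last-tag binary features. The feature names mirror those on the feature importance slide ("NN STARTS", "VB ENDS", and so on); the function name, the choice of tags, and the use of exact tag equality are assumptions made for this illustration.

```python
def regex_pos_features(tags, first_tags=("JJ", "NN"), last_tags=("NN", "VB")):
    """Collapse a POS tag pattern into first-tag / last-tag binary features."""
    features = {}
    for tag in first_tags:
        # Corresponds to a pattern such as JJ*: only the first tag matters.
        features[f"{tag} STARTS"] = tags[0] == tag
    for tag in last_tags:
        # Corresponds to a pattern such as *VB: only the last tag matters.
        features[f"{tag} ENDS"] = tags[-1] == tag
    return features


example = regex_pos_features("JJ NN NN".split())
print(example)

# The three raw patterns from the slide all map to the same JJ* feature.
raw_patterns = ["JJ NNS", "JJ NN NN", "JJ NN"]
all_start_with_jj = all(
    regex_pos_features(p.split())["JJ STARTS"] for p in raw_patterns
)
print(all_start_with_jj)
```

This is how many distinct raw patterns can shrink to a handful of features, at the cost of ignoring the tags in the middle of the n-gram.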
Features: External Knowledge Bases
Domain-specific knowledge bases:
• DBLP (Computer Science): contains author-assigned keywords for papers
• ScienceWISE: high-quality scientific concepts (mostly for the Physics domain), http://sciencewise.info
We perform exact string matching against these KBs.
Features: DBPedia, part I
DBPedia pages essentially represent valid entities.
But there are a few problems when:
• The n-gram is not an entity
• The n-gram is not a scientific concept (“Tom Cruise” in an IR paper)

| | CS Collection: Precision | CS Collection: Recall | Physics Collection: Precision | Physics Collection: Recall |
|---|---|---|---|---|
| Exact string matching | 0.9045 | 0.2394 | 0.7063 | 0.0155 |
| Matching with redirects | 0.8457 | 0.4229 | 0.7768 | 0.5843 |

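
The difference between the two matching strategies can be illustrated with a small sketch. Everything here is hypothetical toy data (the titles, redirects, candidates, and valid set are invented for the example), and the helper names are assumptions; the point is only to show how following redirects can raise recall.

```python
def match_kb(ngram, titles, redirects=None):
    """Return True if the n-gram is a KB title, optionally via a redirect."""
    if ngram in titles:
        return True
    return redirects is not None and redirects.get(ngram) in titles


def precision_recall(candidates, valid, matcher):
    """Precision and recall of the n-grams accepted by `matcher`."""
    matched = {c for c in candidates if matcher(c)}
    true_pos = matched & valid
    precision = len(true_pos) / len(matched) if matched else 0.0
    recall = len(true_pos) / len(valid) if valid else 0.0
    return precision, recall


# Hypothetical toy knowledge base and judgments.
titles = {"information retrieval", "web search engine", "tom cruise"}
redirects = {"web search": "web search engine", "ir system": "information retrieval"}
candidates = ["information retrieval", "web search", "ir system", "tom cruise", "user intent"]
valid = {"information retrieval", "web search", "ir system", "user intent"}

exact = precision_recall(candidates, valid, lambda n: match_kb(n, titles))
with_redirects = precision_recall(
    candidates, valid, lambda n: match_kb(n, titles, redirects)
)
print(exact, with_redirects)
```

In this toy setup "tom cruise" is matched under both strategies although it is not a valid scientific concept, which is the kind of false positive the slide warns about.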
Features: DBPedia, part II
Without redirects With redirects
Features: Syntactic
Set of common syntactic features:
• N-gram length in words
• Whether the n-gram is uppercased
• The number of other n-grams that the given n-gram is part of
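The three syntactic features listed above could be computed roughly as follows. This is a sketch under assumptions: the function and key names are hypothetical, "uppercased" is read as "all cased characters are uppercase", and containment is checked on whole words.

```python
def contains(haystack, needle):
    """True if the word list `needle` occurs contiguously in `haystack`."""
    n = len(needle)
    return any(haystack[i:i + n] == needle for i in range(len(haystack) - n + 1))


def syntactic_features(ngram, all_ngrams):
    """Length in words, uppercase flag, and count of containing n-grams."""
    words = ngram.split()
    participation = sum(
        1
        for other in all_ngrams
        if other != ngram and contains(other.split(), words)
    )
    return {
        "length": len(words),
        "is_uppercase": ngram.isupper(),
        "participation_count": participation,
    }


candidates = ["named entity", "named entity recognition", "candidate named entity", "NER"]
features_bigram = syntactic_features("named entity", candidates)
features_acronym = syntactic_features("NER", candidates)
print(features_bigram, features_acronym)
```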
Experiments: Overview
1. Regex POS patterns vs. normal POS tags
2. Redirects vs. non-redirects
3. Feature importance scores
4. MaxEntropy comparison
All results are averages over 10-fold cross-validation.
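For readers who want to reproduce the evaluation protocol, the sketch below runs a scikit-learn decision tree under 10-fold cross-validation and averages the fold scores. The data is synthetic and purely illustrative (random binary features with a noisy label), and the hyperparameters are library defaults rather than the settings used in the experiments.

```python
import random

from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

random.seed(0)

# Synthetic binary feature vectors (hypothetical stand-ins for features such as
# "NN STARTS" or a knowledge-base match); the label loosely follows feature 0.
X = [[random.randint(0, 1) for _ in range(5)] for _ in range(200)]
y = [row[0] if random.random() < 0.9 else 1 - row[0] for row in X]

clf = DecisionTreeClassifier(random_state=0)
# cv=10 splits the data into 10 folds; each fold is used once for testing.
scores = cross_val_score(clf, X, y, cv=10, scoring="accuracy")
mean_accuracy = scores.mean()
print(f"mean accuracy over {len(scores)} folds: {mean_accuracy:.3f}")
```

Passing a different `scoring` value (for example `"precision"`, `"recall"`, or `"f1"`) yields the other metrics reported in the comparison tables.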
Experiments: Comparison I

| CS Collection | Precision | Recall | F1 score | Accuracy | # features |
|---|---|---|---|---|---|
| Normal POS + Components | **0.8794** | 0.8058* | 0.8409* | 0.8429* | 54 |
| Regex POS + Components | 0.8475* | 0.8524* | 0.8499* | 0.8448* | 9 |
| Normal POS + Components-Redirects | 0.8678* | 0.8305* | 0.8487* | 0.8473 | 50 |
| Regex POS + Components-Redirects | 0.8406* | **0.8769** | **0.8584** | **0.8509** | 7 |

The symbol * indicates a statistically significant difference compared to the approach in bold.
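
As a reminder of how the F1 column relates to the other two, F1 is the harmonic mean of precision and recall. The short sketch below recomputes it for the first row of the CS Collection table; a small deviation from the reported value is plausible if the reported scores are averaged per fold, which is an assumption here.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# First row of the CS Collection table: precision 0.8794, recall 0.8058.
f1_first_row = f1_score(0.8794, 0.8058)
print(round(f1_first_row, 4))
```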
Experiments: Comparison II

| Physics Collection | Precision | Recall | F1 score | Accuracy | # features |
|---|---|---|---|---|---|
| Normal POS + Components | 0.8253* | 0.6567* | 0.7311* | 0.7567 | 53 |
| Regex POS + Components | 0.7941* | **0.6781** | 0.7315* | 0.7492* | 4 |
| Normal POS + Components-Redirects | 0.8339 | 0.6674* | **0.7412** | **0.7653** | 50 |
| Regex POS + Components-Redirects | **0.8375** | 0.6479* | 0.7305* | 0.7592* | 6 |

The symbol * indicates a statistically significant difference compared to the approach in bold.
Experiments: Feature Importance

CS Collection, 7 features

| Feature | Importance |
|---|---|
| NN STARTS | 0.3091 |
| DBLP | 0.1442 |
| Components + DBLP | 0.1125 |
| Components | 0.0789 |
| VB ENDS | 0.0386 |
| NN ENDS | 0.0380 |
| JJ STARTS | 0.0364 |

Physics Collection, 6 features

| Feature | Importance |
|---|---|
| ScienceWISE | 0.2870 |
| Component + ScienceWISE | 0.1948 |
| Wikipedia redirect | 0.1104 |
| Components | 0.1093 |
| Wikilinks | 0.0439 |
| Participation count | 0.0370 |

Experiments: MaxEntropy

| | Precision | Recall | F1 score |
|---|---|---|---|
| Maximum Entropy | 0.6566 | 0.7196 | 0.6867 |
| Decision Trees | 0.8121 | 0.8742 | 0.8420 |

The MaxEnt classifier receives the full text as input (we used a classifier from the NLTK package).
Comparison experiment: 80% of the CS Collection is used as training data and 20% as the test set.
Lessons Learned
Classic NER approaches are not good enough for idiosyncratic Web collections.
Leveraging the graph of scientific concepts is a key feature.
Domain-specific KBs and POS patterns work well.
Experimental results show up to 85% accuracy over different scientific collections.
http://iner.exascale.info/
eXascale Infolab, http://exascale.info