effective named entity recognition for idiosyncratic web collections

1

Effective Named Entity Recognition for Idiosyncratic Web Collections

Roman Prokofyev, Gianluca Demartini, Philippe Cudre-Mauroux
eXascale Infolab, University of Fribourg, Switzerland

WWW 2014, April 10, 2014

2

Outline
• Introduction
• Problem definition
• Existing approaches and applicability
• Overview
• Candidate Named Entities Selection
• Dataset description
• Features description
• Experimental setup & Evaluation

3

Problem Definition

• search engine
• web search engine
• navigational query
• user intent
• information need
• web content
• …

Entity type: scientific concept

4

Traditional NER

Types:
• Maximum Entropy (Mallet, NLTK)
• Conditional Random Fields (Stanford NER, Mallet)

Properties:
• Require extensive training
• Usually domain-specific: different collections require training on their own domain
• Very good at detecting types such as Location, Person, Organization

5

Proposed Approach

Our problem is defined as a classification task.

Two-step classification:
• Extract candidate named entities using a frequency filtration algorithm.
• Classify candidate named entities using a supervised classifier.

Candidate selection should allow us to greatly reduce the number of n-grams to classify, possibly without significant loss in recall.

6

Pipeline

7

Candidate Selection: Part I

Consider all bigrams with frequency > k (k=2):

candidate named: 5
entity are: 4
entity candidate: 3
entity in: 18
entity recognition: 12
named entity: 101
of named: 10
that named: 3
the named: 4

After the NLTK stop word filter:

candidate named: 5
entity candidate: 3
entity recognition: 12
named entity: 101
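The frequency and stop word filtering on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, the small `STOP_WORDS` set is a stand-in for NLTK's English stop word list (so the example runs without downloading NLTK data), and dropping any bigram that contains a stop word is an assumption consistent with the example counts.

```python
from collections import Counter

# Stand-in for NLTK's English stop word list (assumption: a tiny subset,
# just large enough to reproduce the slide's example).
STOP_WORDS = {"are", "in", "of", "that", "the"}


def count_bigrams(tokens):
    """Count adjacent word pairs in a token sequence."""
    return Counter(zip(tokens, tokens[1:]))


def select_candidate_bigrams(bigram_counts, k=2, stop_words=STOP_WORDS):
    """Keep bigrams with frequency > k that contain no stop word."""
    return {
        bigram: freq
        for bigram, freq in bigram_counts.items()
        if freq > k and not any(word in stop_words for word in bigram)
    }


# Bigram frequencies taken from the slide.
counts = {
    ("candidate", "named"): 5,
    ("entity", "are"): 4,
    ("entity", "candidate"): 3,
    ("entity", "in"): 18,
    ("entity", "recognition"): 12,
    ("named", "entity"): 101,
    ("of", "named"): 10,
    ("that", "named"): 3,
    ("the", "named"): 4,
}
candidates = select_candidate_bigrams(counts)
```

With the slide's counts, `candidates` keeps exactly the four bigrams that survive the stop word filter.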

8

Candidate Selection: Part II

Trigram frequency is looked up from the n-gram index.

Bigrams from Part I:

candidate named: 5
entity candidate: 3
entity recognition: 12
named entity: 101

Trigrams found in the index, listed with the bigrams:

candidate named entity: 5
named entity candidate: 3
named entity recognition: 12
named entity: 101
candidate named: 5
entity candidate: 3
entity recognition: 12

Counts after subtracting each trigram's frequency from the bigrams it contains:

candidate named entity: 5
named entity candidate: 3
named entity recognition: 12
named entity: 81
candidate named: 0
entity candidate: 0
entity recognition: 0

9

Candidate Selection: Discussion

Possible to extract n-grams (n>2) with frequency ≤k

10

After Candidate Selection

TwiNER: named entity recognition in targeted twitter stream, SIGIR 2012

11

Classifier: Overview

Machine Learning algorithm: Decision Trees from the scikit-learn package.

Feature types:
• POS Tags and their derivatives
• External Knowledge Bases (DBLP, DBPedia)
• DBPedia relation graphs
• Syntactic features

12

Datasets

Two collections:
• CS Collection (SIGIR 2012 Research Track): 100 papers
• Physics collection: 100 papers randomly selected from arXiv.org High Energy Physics category

                       CS Collection   Physics Collection
N# Candidate N-grams   21 531          18 129
N# Judged N-grams      15 057          11 421
N# Valid Entities       8 145           5 747
N# Invalid N-grams      6 912           5 674

Available at: https://github.com/XI-lab/scientific_NER_dataset



13

Features: POS Tags, part I

100+ different tag patterns

14

Features: POS Tags, part II

Two feature schemes:
• Raw POS tag patterns, each tag is a binary feature
• Regex POS tag patterns:
  • First tag match, for example: JJ NNS, JJ NN NN, JJ NN, ... → JJ*
  • Last tag match: NN VB, NN NN VB, JJ NN VB, ... → *VB
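One way to turn the JJ* and *VB patterns into binary features is sketched below. The feature names mirror labels that appear in the feature importance slide (NN STARTS, VB ENDS, and so on), but the exact pattern set and the choice to match whole tags (so JJR or NNS do not count as JJ or NN) are assumptions made for illustration.

```python
import re

# Hypothetical pattern set: one regex per binary feature, applied to the
# space-joined POS tag sequence of an n-gram.
REGEX_POS_PATTERNS = {
    "JJ_STARTS": re.compile(r"^JJ(\s|$)"),  # JJ*
    "NN_STARTS": re.compile(r"^NN(\s|$)"),  # NN*
    "NN_ENDS": re.compile(r"(^|\s)NN$"),    # *NN
    "VB_ENDS": re.compile(r"(^|\s)VB$"),    # *VB
}


def regex_pos_features(tags):
    """Map a list of POS tags to a dict of 0/1 regex pattern features."""
    pattern = " ".join(tags)
    return {
        name: int(regex.search(pattern) is not None)
        for name, regex in REGEX_POS_PATTERNS.items()
    }


features = regex_pos_features(["JJ", "NN", "NN"])
```

Collapsing the raw patterns this way is what shrinks the feature space: every tag sequence starting with JJ sets the same feature, regardless of what follows.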

15

Features: External Knowledge Bases

Domain-specific knowledge bases:
• DBLP (Computer Science): contains author-assigned keywords for the papers
• ScienceWISE: high-quality scientific concepts (mostly for the Physics domain), http://sciencewise.info

We perform exact string matching with these KBs.

16

Features: DBPedia, part I

DBPedia pages essentially represent valid entities.

But there are a few problems when:
• N-gram is not an entity
• N-gram is not a scientific concept (“Tom Cruise” in IR paper)

                          CS Collection        Physics Collection
                          Precision  Recall    Precision  Recall
Exact string matching     0.9045     0.2394    0.7063     0.0155
Matching with redirects   0.8457     0.4229    0.7768     0.5843

http://dbpedia.org/
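The difference between exact matching and matching with redirects can be illustrated with a small lookup sketch. Everything below is hypothetical: the function names, the representation of the knowledge base as a set of titles plus a redirect dictionary, and the toy data, which is invented purely to show why redirects raise recall and can change precision.

```python
def kb_match(ngram, titles, redirects=None):
    """True if the n-gram equals a KB title, or redirects to one."""
    if ngram in titles:
        return True
    return redirects is not None and redirects.get(ngram) in titles


def precision_recall(ngrams, valid_entities, titles, redirects=None):
    """Precision and recall of 'matched in KB' as a predictor of validity."""
    matched = {n for n in ngrams if kb_match(n, titles, redirects)}
    true_positives = len(matched & valid_entities)
    precision = true_positives / len(matched) if matched else 0.0
    recall = true_positives / len(valid_entities) if valid_entities else 0.0
    return precision, recall


# Toy data (invented for illustration, not from the paper's collections).
titles = {"information retrieval", "tom cruise"}
redirects = {"ir": "information retrieval"}
ngrams = {"information retrieval", "ir", "tom cruise", "user intent"}
valid = {"information retrieval", "ir", "user intent"}

exact = precision_recall(ngrams, valid, titles)
with_redirects = precision_recall(ngrams, valid, titles, redirects)
```

In the toy data, "tom cruise" has a page but is not a scientific concept, so it costs precision in both settings, while the redirect for "ir" recovers a valid entity that exact matching misses.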

17

Features: DBPedia, part II

[Figure panels: Without redirects | With redirects]

18

Features: Syntactic

Set of common syntactic features:
• N-gram length in words
• Whether n-gram is uppercased
• The number of other n-grams a given n-gram is part of
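The three syntactic features can be computed directly from the candidate list, as in the sketch below. The function and key names are hypothetical, and two readings are assumed: "uppercased" is taken to mean all letters are uppercase (`str.isupper`), and "part of" is taken to mean occurring as a contiguous word sequence inside another candidate.

```python
def contains(longer, shorter):
    """True if `shorter` occurs as a contiguous word sequence inside `longer`."""
    n = len(shorter)
    return any(longer[i:i + n] == shorter for i in range(len(longer) - n + 1))


def syntactic_features(ngram, all_ngrams):
    """Length in words, uppercase flag, and number of other n-grams containing it."""
    words = ngram.split()
    participation = sum(
        1
        for other in all_ngrams
        if other != ngram and contains(other.split(), words)
    )
    return {
        "length": len(words),
        "is_uppercased": int(ngram.isupper()),
        "participation_count": participation,
    }


candidates = [
    "named entity",
    "named entity recognition",
    "candidate named entity",
    "NER",
]
features = syntactic_features("named entity", candidates)
```

Here "named entity" is part of two other candidates, so its participation count is 2.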

19

Experiments: Overview

1. Regex POS Patterns vs Normal POS tags
2. Redirects vs Non-redirects
3. Feature importance scores
4. MaxEntropy comparison

All results are averaged over 10-fold cross-validation.
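For readers unfamiliar with the protocol, 10-fold cross-validation splits the judged n-grams into ten folds, holds each fold out once for testing, and averages the resulting scores. The sketch below shows only the splitting and averaging logic; the function names are hypothetical, the folds are contiguous and unshuffled for simplicity, and `evaluate` stands for whatever train-and-score routine is being measured.

```python
def k_fold_indices(n_samples, k=10):
    """Yield (train_indices, test_indices) for k contiguous folds."""
    # Spread the remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size


def cross_validated_mean(n_samples, evaluate, k=10):
    """Average the score returned by evaluate(train, test) over k folds."""
    scores = [evaluate(train, test) for train, test in k_fold_indices(n_samples, k)]
    return sum(scores) / len(scores)


folds = list(k_fold_indices(25, k=10))
```

In practice the samples would normally be shuffled before splitting.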

20

Experiments: Comparison I

CS Collection                       Precision  Recall   F1 score  Accuracy  N# features
Normal POS + Components             0.8794     0.8058*  0.8409*   0.8429*   54
Regex POS + Components              0.8475*    0.8524*  0.8499*   0.8448*   9
Normal POS + Components-Redirects   0.8678*    0.8305*  0.8487*   0.8473    50
Regex POS + Components-Redirects    0.8406*    0.8769   0.8584    0.8509    7

The symbol * indicates a statistically significant difference as compared to the approach in bold.

21

Experiments: Comparison II

Physics Collection                  Precision  Recall   F1 score  Accuracy  N# features
Normal POS + Components             0.8253*    0.6567*  0.7311*   0.7567    53
Regex POS + Components              0.7941*    0.6781   0.7315*   0.7492*   4
Normal POS + Components-Redirects   0.8339     0.6674*  0.7412    0.7653    50
Regex POS + Components-Redirects    0.8375     0.6479*  0.7305*   0.7592*   6

The symbol * indicates a statistically significant difference as compared to the approach in bold.

22

Experiments: Feature Importance

CS Collection, 7 features

Feature                   Importance
NN STARTS                 0.3091
DBLP                      0.1442
Components + DBLP         0.1125
Components                0.0789
VB ENDS                   0.0386
NN ENDS                   0.0380
JJ STARTS                 0.0364

Physics Collection, 6 features

Feature                   Importance
ScienceWISE               0.2870
Component + ScienceWISE   0.1948
Wikipedia redirect        0.1104
Components                0.1093
Wikilinks                 0.0439
Participation count       0.0370

23

Experiments: MaxEntropy

                  Precision  Recall  F1 score
Maximum Entropy   0.6566     0.7196  0.6867
Decision Trees    0.8121     0.8742  0.8420

The MaxEnt classifier receives full text as input (we used a classifier from the NLTK package).

Comparison experiment: 80% of the CS Collection as training data, 20% as a test set.
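The F1 scores in this comparison follow from the reported precision and recall as their harmonic mean. The short check below recomputes them; the function name is illustrative.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported on the slide.
maxent_f1 = f1_score(0.6566, 0.7196)
tree_f1 = f1_score(0.8121, 0.8742)
```

Rounded to four decimals, these give 0.6867 for Maximum Entropy and 0.8420 for Decision Trees, matching the table.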

24

Lessons Learned

Classic NER approaches are not good enough for Idiosyncratic Web Collections

Leveraging the graph of scientific concepts is a key feature

Domain-specific KBs and POS patterns work well

Experimental results show up to 85% accuracy over different scientific collections

http://iner.exascale.info/

eXascale Infolab, http://exascale.info
