NERD: Evaluating Named Entity Recognition Tools in the Web of Data
Giuseppe Rizzo <[email protected]>, Raphaël Troncy <[email protected]>


DESCRIPTION

Talk "NERD: Evaluating Named Entity Recognition Tools in the Web of Data" event during WEKEX'11 workshop (ISWC'11), Bonn, Germany

TRANSCRIPT

Page 1

NERD: Evaluating Named Entity Recognition Tools in the Web of Data

Giuseppe Rizzo <[email protected]>, Raphaël Troncy <[email protected]>

Page 2

What is a Named Entity recognition task?

A task that aims to locate and classify, within a textual document, the names of people, organizations, locations, brands and products, as well as numeric expressions such as times, dates, monetary amounts and percentages
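To make the task concrete, here is a minimal illustration with an off-the-shelf library (spaCy), which is not one of the web extractors compared in this talk; the example sentence is for illustration only.

```python
# Minimal NER illustration with spaCy (not one of the evaluated extractors).
# Requires: pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Google's self-driving cars were tested on public roads in Nevada in October 2011.")

# Each detected entity has a surface form, a type and character offsets
for ent in doc.ents:
    print(ent.text, ent.label_, ent.start_char, ent.end_char)
# Typical output: Google ORG, Nevada GPE, October 2011 DATE (exact labels depend on the model)
```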

Page 3

Named Entity recognition tools

Page 4

Differences among these NER extractors:

- Granularity: extraction of NEs from single sentences vs. from the entire document
- Technologies used:
  - algorithms used to extract NEs
  - supported languages
  - taxonomy of NE types recognized
  - disambiguation (dataset used to provide links)
  - content request size
  - response format

Page 5

And ...

What about precision and recall? Which extractor best fits my needs?

Page 6

What is NERD?

NERD seeks to find the pros and cons of those extractors. It provides:
- a REST API [1]
- a web UI [2]
- an ontology [3]

[1] http://nerd.eurecom.fr/api/application.wadl
[2] http://nerd.eurecom.fr/
[3] http://nerd.eurecom.fr/ontology
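As a rough sketch of how such a REST API can be called from a script, the snippet below uses Python's requests library; the endpoint path and parameter names ("/api/annotate", "text", "extractor") are assumptions for illustration only, and the actual interface is described by the WADL linked above.

```python
# Hypothetical sketch of calling a NERD-style REST endpoint; the path and the
# parameters below are illustrative assumptions, see the WADL for the real API.
import requests

def annotate(text: str, extractor: str = "alchemyapi") -> list:
    response = requests.post(
        "http://nerd.eurecom.fr/api/annotate",        # assumed endpoint
        data={"text": text, "extractor": extractor},  # assumed parameters
        timeout=30,
    )
    response.raise_for_status()
    return response.json()  # expected to contain (NE, type, URI) annotations

if __name__ == "__main__":
    for annotation in annotate("Google Cars Drive Themselves"):
        print(annotation)
```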

Page 7

Showcase

http://nerd.eurecom.fr

Science: "Google Cars Drive Themselves", http://bit.ly/oTj8md (part of the original resource found at http://nyti.ms/9p19i8)

Page 8

Evaluation

Controlled experiment:
- 4 human raters
- 10 English news articles (5 from BBC and 5 from The New York Times)
- each rater evaluated each article for all the extractors
- 200 evaluations in total

Uncontrolled experiment:
- 17 human raters
- 53 English news articles (sources: CNN, BBC, The New York Times and Yahoo! News)
- free selection of articles

5 extractors used with their default configurations

Each human rater received training [1]

[1] http://nerd.eurecom.fr/help

Page 9

Evaluation output

Each assessment is a tuple t = (NE, type, URI, relevant)

The assessment consists of rating each of these criteria with a Boolean value; if the extractor provides no type or no disambiguation URI, that criterion is considered false by default
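As a sketch (not the authors' implementation) of how such assessments could be represented and aggregated into the precision figures reported later:

```python
from dataclasses import dataclass
from typing import Iterable

@dataclass
class Assessment:
    """One rater judgment t = (NE, type, URI, relevant), each rated as a Boolean.
    A missing type or disambiguation URI is recorded as False by default."""
    ne: bool        # is the detected surface form a correct named entity?
    type: bool      # is the assigned type correct?
    uri: bool       # does the URI correctly disambiguate the entity?
    relevant: bool  # is the entity relevant in the context of the article?

def precision(assessments: Iterable[Assessment], criterion: str = "ne") -> float:
    """Fraction of assessments rated True for the given criterion."""
    values = [getattr(a, criterion) for a in assessments]
    return sum(values) / len(values) if values else 0.0

# Two judgments for one extractor: precision on the "type" criterion is 0.5
ratings = [Assessment(True, True, False, True), Assessment(True, False, False, True)]
print(precision(ratings, "type"))
```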

Page 10

Controlled experiment – dataset [1]

Categories: World, Business, Sport, Science, Health

1 BBC article and 1 NYT article for each category

Average number of words per article: 981

The final number of unique entities detected is 4641, with an average of 23.2 named entities per article

Some of the extractors (e.g. DBpedia Spotlight and Extractiv) return duplicate NEs; we removed all duplicates so as not to bias the statistics (a deduplication sketch follows below)

[1] http://nerd.eurecom.fr/ui/evaluation/wekex2011-goldenset.tar.gz
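The slides do not say which key is used to detect duplicates; a minimal sketch, assuming duplicates are identified by their (surface form, type) pair, could be:

```python
def deduplicate(entities):
    """Keep the first occurrence of each (surface form, type) pair, dropping
    the duplicate NEs returned by some extractors."""
    seen, unique = set(), []
    for surface, ne_type in entities:
        key = (surface.lower(), ne_type)
        if key not in seen:
            seen.add(key)
            unique.append((surface, ne_type))
    return unique

print(deduplicate([("Google", "Company"), ("google", "Company"), ("Nevada", "Place")]))
# [('Google', 'Company'), ('Nevada', 'Place')]
```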

Page 11

Controlled experiment – agreement score

Fleiss' kappa scores [1], grouped by extractor, by source and by category

[Tables of agreement scores omitted]

[1] Joseph L. Fleiss. Measuring nominal scale agreement among many raters. Psychological Bulletin, 76(5):378–382, 1971

Page 12

Controlled experiment – statistic result

[Tables: overall statistics, grouped by extractor and grouped by category]

Different behavior is observed for different sources

Page 13

Uncontrolled experiment - dataset

17 raters were free to select English news articles from CNN, BBC, The New York Times and Yahoo! News

53 news articles selected

94 assessments in total, with an average of 5.2 assessments per rater

Each article was assessed with at least 2 different tools

The final number of unique entities detected is 1616, with an average of 34 named entities per article

Some of the extractors (e.g. DBpedia Spotlight and Extractiv) return duplicate NEs; in order not to bias the statistics, we removed all duplicates

Page 14

Uncontrolled experiment – statistic result (I)

[Tables: overall precision, grouped by extractor]

Page 15

Uncontrolled experiment – statistic result (II)

[Table: precision grouped by category]

Page 16

Conclusion

Q. Which are the best NER tools?
A. They are ...

AlchemyAPI obtained the best results in NE extraction and categorization

DBpedia Spotlight and Zemanta showed the ability to disambiguate NEs in the LOD cloud

Experiments across categories of articles did not show significant differences in the analysis.

Published the WEKEX'11 ground truth: http://nerd.eurecom.fr/ui/evaluation/wekex2011-goldenset.tar.gz

Page 17

Future Work (NERD Timeline)

From the beginning to today:
- core application
- controlled experiment
- uncontrolled experiment
- REST API, release of the WEKEX'11 ground truth

Coming next:
- release of the ISWC'11 ground truth
- NERD "smart" service: combining the best of all NER tools

Page 18

ISWC'11 golden-set

Do you believe it's easy to reach agreement among all raters?

We would like to invite you to create a new golden set during the ISWC'11 poster and demo session. We will kindly ask each rater to evaluate two short excerpts from two English news articles with all the extractors supported by NERD

Page 19

http://nerd.eurecom.fr

http://www.slideshare.net/giusepperizzo

Thanks for your time and your attention

@giusepperizzo @rtroncy #nerd

Page 20

Fleiss' Kappa

κ = (P̄ − P̄e) / (1 − P̄e), where P̄ is the mean observed agreement across subjects and P̄e is the chance agreement

κ = 1: full agreement among all raters
κ = 0 (or less): poor agreement
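For concreteness, here is a minimal sketch (not the authors' code) of the computation above, where ratings[i][j] counts the raters who assigned subject i to category j:

```python
def fleiss_kappa(ratings):
    """Fleiss' kappa for a matrix where ratings[i][j] is the number of raters
    who assigned subject i to category j (same number of raters per subject)."""
    N = len(ratings)     # number of subjects
    n = sum(ratings[0])  # raters per subject
    k = len(ratings[0])  # number of categories

    # Observed agreement per subject, then its mean P_bar
    P = [(sum(c * c for c in row) - n) / (n * (n - 1)) for row in ratings]
    P_bar = sum(P) / N

    # Chance agreement P_e from the marginal category proportions
    p = [sum(row[j] for row in ratings) / (N * n) for j in range(k)]
    P_e = sum(pj * pj for pj in p)

    return (P_bar - P_e) / (1 - P_e)

# Example: 4 raters judging 3 entities as correct (column 0) or not (column 1)
print(round(fleiss_kappa([[4, 0], [2, 2], [3, 1]]), 3))  # close to 0: poor agreement
```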

Page 21

Fleiss' kappa interpretation

Kappa        Interpretation
< 0          Poor agreement
0.01 – 0.20  Slight agreement
0.21 – 0.40  Fair agreement
0.41 – 0.60  Moderate agreement
0.61 – 0.80  Substantial agreement
0.81 – 1.00  Almost perfect agreement