generating pseudo-ground truth for detecting new concepts in social streams

78
Generating Pseudo-ground Truth for Predicting New Concepts in Social Streams David Graus, Manos Tsagkias, Lars Buitinck, Maarten de Rijke

Upload: david-graus

Post on 27-Jan-2015

Category:

Science


DESCRIPTION

The manual curation of knowledge bases is a bottleneck in fast-paced domains where new concepts constantly emerge. Identification of nascent concepts is important for improving early entity linking, content interpretation, and recommendation of new content in real-time applications. We present an unsupervised method for generating pseudo-ground truth for training a named entity recognizer to specifically identify entities that will become concepts in a knowledge base in the setting of social streams. We show that our method is able to deal with missing labels, justifying the use of pseudo-ground truth generation in this task. Finally, we show that our method significantly outperforms a lexical-matching baseline by leveraging strategies for sampling pseudo-ground truth based on entity confidence scores and textual quality of input documents.

TRANSCRIPT

Page 1: Generating Pseudo-ground Truth for Detecting New Concepts in Social Streams

Generating Pseudo-ground Truth for Predicting New Concepts in Social Streams

David Graus, Manos Tsagkias, Lars Buitinck, Maarten de Rijke

Page 2: Generating Pseudo-ground Truth for Detecting New Concepts in Social Streams

What is "anema"?

[email protected] | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media

Page 3: Generating Pseudo-ground Truth for Detecting New Concepts in Social Streams

Page 4: Generating Pseudo-ground Truth for Detecting New Concepts in Social Streams

Page 5: Generating Pseudo-ground Truth for Detecting New Concepts in Social Streams

Page 6: Generating Pseudo-ground Truth for Detecting New Concepts in Social Streams

Page 7: Generating Pseudo-ground Truth for Detecting New Concepts in Social Streams

For content interpretation and complex filtering tasks we want to know who/what people talk about.

Page 8: Generating Pseudo-ground Truth for Detecting New Concepts in Social Streams

Entity Linking

[Diagram: the query q ("TANK") from document r is linked against knowledge base (KB) entries such as TANK (VEHICLE) and TANK JOHNSON; which entry it refers to is open (?)]

Page 9: Generating Pseudo-ground Truth for Detecting New Concepts in Social Streams

Page 10: Generating Pseudo-ground Truth for Detecting New Concepts in Social Streams

Page 11: Generating Pseudo-ground Truth for Detecting New Concepts in Social Streams

Page 12: Generating Pseudo-ground Truth for Detecting New Concepts in Social Streams

Named Entity Recognition

Page 13: Generating Pseudo-ground Truth for Detecting New Concepts in Social Streams

Named Entity Recognition

Page 14: Generating Pseudo-ground Truth for Detecting New Concepts in Social Streams

Challenges

1. Entity "importance"

2. Noisy & short text (Twitter), updates in the KB

Page 15: Generating Pseudo-ground Truth for Detecting New Concepts in Social Streams

Challenge 1: Entity Importance

Q: When should an entity exist in Wikipedia?

A: When it is important or has impact

Page 16: Generating Pseudo-ground Truth for Detecting New Concepts in Social Streams

Challenge 1: Entity Importance

Q: When should an entity exist in Wikipedia?

A: When it is important or has impact

Q: How do you know an entity is important or has impact?

A: If it is in Wikipedia, it is important or has impact

Page 17: Generating Pseudo-ground Truth for Detecting New Concepts in Social Streams

Challenge 1: Entity Importance

Can we leverage today's entities to learn to predict tomorrow's entities?

Page 18: Generating Pseudo-ground Truth for Detecting New Concepts in Social Streams

Challenge 1: Entity Importance

Can we leverage today's entities to learn to predict tomorrow's entities?

74

Page 19: Generating Pseudo-ground Truth for Detecting New Concepts in Social Streams

Challenge 1: Entity Importance

Can we leverage today's entities to learn to predict tomorrow's entities?

74/140

Page 20: Generating Pseudo-ground Truth for Detecting New Concepts in Social Streams

Challenge 2: Noisy data & changing KB

Unsupervised method for generating pseudo-ground truth (for training NER)

Page 21: Generating Pseudo-ground Truth for Detecting New Concepts in Social Streams

Assumption

A named-entity recognizer trained only on KB entities will learn to recognize KB entities

Page 22: Generating Pseudo-ground Truth for Detecting New Concepts in Social Streams

[Diagram components: tweet corpus]

Page 23: Generating Pseudo-ground Truth for Detecting New Concepts in Social Streams

[Diagram components: tweet corpus; tweet]

Page 24: Generating Pseudo-ground Truth for Detecting New Concepts in Social Streams

[Diagram components: tweet corpus; tweet; entity linker]

Page 25: Generating Pseudo-ground Truth for Detecting New Concepts in Social Streams

[Diagram components: tweet corpus; tweet; entity linker; mention-entity pairs (m1, e1), (m2, e2)]

Page 26: Generating Pseudo-ground Truth for Detecting New Concepts in Social Streams

[Diagram components: tweet corpus; tweet; entity linker; mention-entity pairs (m1, e1), (m2, e2)]

Example tweet: "This is like IBM buying Apple after the Homebrew Computing Club demo of the Apple I."

Page 27: Generating Pseudo-ground Truth for Detecting New Concepts in Social Streams

Page 28: Generating Pseudo-ground Truth for Detecting New Concepts in Social Streams

Page 29: Generating Pseudo-ground Truth for Detecting New Concepts in Social Streams

Page 30: Generating Pseudo-ground Truth for Detecting New Concepts in Social Streams

[Diagram components: tweet corpus; tweet; entity linker; mention-entity pairs (m1, e1), (m2, e2); unlabeled tweet (?)]

Page 31: Generating Pseudo-ground Truth for Detecting New Concepts in Social Streams

Unlabeled tweet: "Hahaha! Are we sure Jillert Anema isn't Canadian? RT @rzbh: Dutch Coach's Anti-America Rant http://on.cc.com/1htk9Wo"

Page 32: Generating Pseudo-ground Truth for Detecting New Concepts in Social Streams

Page 33: Generating Pseudo-ground Truth for Detecting New Concepts in Social Streams

[Diagram components: tweet corpus; tweet; entity linker; mention-entity pairs (m1, e1), (m2, e2); unlabeled tweet (?); sample corpus]

Page 34: Generating Pseudo-ground Truth for Detecting New Concepts in Social Streams

[Diagram components: tweet corpus; tweet; entity linker; mention-entity pairs (m1, e1), (m2, e2); unlabeled tweet (?); sample corpus; training data as mention-class pairs (m1, c1), (m2, c2)]

Page 35: Generating Pseudo-ground Truth for Detecting New Concepts in Social Streams

Training example: "This is like IBM buying Apple after the Homebrew Computing Club demo of the Apple I."

Page 36: Generating Pseudo-ground Truth for Detecting New Concepts in Social Streams

Training example: "This is like IBM buying Apple after the Homebrew Computing Club demo of the Apple I."

Class labels shown: product, organization, organization

Page 37: Generating Pseudo-ground Truth for Detecting New Concepts in Social Streams

[Diagram components: tweet corpus; tweet; entity linker; mention-entity pairs (m1, e1), (m2, e2); unlabeled tweet (?); sample corpus; training data (m1, c1), (m2, c2); NERC]

Page 38: Generating Pseudo-ground Truth for Detecting New Concepts in Social Streams

[Diagram components: tweet corpus; tweet; entity linker; mention-entity pairs (m1, e1), (m2, e2); unlabeled tweet (?); sample corpus; training data (m1, c1), (m2, c2); NERC; NERC model]

Page 39: Generating Pseudo-ground Truth for Detecting New Concepts in Social Streams

Page 40: Generating Pseudo-ground Truth for Detecting New Concepts in Social Streams

[Diagram components: tweet corpus; tweet; entity linker; mention-entity pairs (m1, e1), (m2, e2); unlabeled tweet (?); sample corpus; training data (m1, c1), (m2, c2); NERC; NERC model; predictions (m1, c1), (m2, c2)]

Page 41: Generating Pseudo-ground Truth for Detecting New Concepts in Social Streams

[Diagram components: tweet corpus; tweet; entity linker; mention-entity pairs (m1, e1), (m2, e2); unlabeled tweet (?); sample corpus; training data (m1, c1), (m2, c2); NERC; NERC model; predictions (m1, c1), (m2, c2); today's KB; small KB]

Page 42: Generating Pseudo-ground Truth for Detecting New Concepts in Social Streams

[Diagram components: as on Page 41, plus future KB]

Page 43: Generating Pseudo-ground Truth for Detecting New Concepts in Social Streams

[Diagram components: as on Page 42, plus full KB]

Page 44: Generating Pseudo-ground Truth for Detecting New Concepts in Social Streams

[Diagram components: tweet corpus; tweet; entity linker; mention-entity pairs (m1, e1), (m2, e2); unlabeled tweet (?); sample corpus; training data (m1, c1), (m2, c2); NERC; NERC model; predictions (m1, c1), (m2, c2); today's KB; future KB]

Page 45: Generating Pseudo-ground Truth for Detecting New Concepts in Social Streams

[Diagram components: tweet corpus; tweet; entity linker; unlabeled tweet (?); NERC model; predictions (m1, c1), (m2, c2); today's KB; future KB; ground truth (m1, c1), (m2, c2)]

Page 46: Generating Pseudo-ground Truth for Detecting New Concepts in Social Streams

[Diagram components: tweet corpus; tweet; entity linker; unlabeled tweet (?); NERC model; predictions (m1, c1), (m2, c2); today's KB; future KB; ground truth (m1, c1), (m2, c2); evaluate]
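
The distinction between today's KB and the future KB can be illustrated with a small sketch. It assumes, as a simplification not stated on the slides, that a smaller "today" KB is simulated by randomly holding out a fraction of the entities in the full KB; the function name `split_kb` and the entity list are hypothetical.

```python
import random

def split_kb(entities, known_fraction, seed=0):
    """Split KB entity titles into a 'today' subset and held-out 'future' entities.

    Assumption: random hold-out; the slides do not specify the sampling scheme.
    """
    rng = random.Random(seed)
    shuffled = sorted(entities)   # sort first so the split is reproducible
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * known_fraction)
    return set(shuffled[:cut]), set(shuffled[cut:])

# Hypothetical full KB built from entity names that appear in the example tweets.
full_kb = {"IBM", "Apple", "Apple I", "ASAP Rocky", "Kendrick Lamar", "Jillert Anema"}
todays_kb, unseen_entities = split_kb(full_kb, known_fraction=0.5)

# Mentions of `unseen_entities` would then act as the targets that a recognizer
# trained only on `todays_kb` annotations should still detect.
print(sorted(todays_kb), sorted(unseen_entities))
```

Under this reading, ground truth for evaluation comes from linking against the full KB, while training data only uses the smaller one.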

Page 47: Generating Pseudo-ground Truth for Detecting New Concepts in Social Streams

Evaluation

• Mention level (NER style)

• Entity level

Page 48: Generating Pseudo-ground Truth for Detecting New Concepts in Social Streams

Evaluation

• Mention level (NER style)

• Entity level

Prediction: "This is like IBM buying Apple after the Homebrew Computing Club demo of the Apple I."

Page 49: Generating Pseudo-ground Truth for Detecting New Concepts in Social Streams

Evaluation

• Mention level (NER style)

• Entity level

Prediction: "This is like IBM buying Apple after the Homebrew Computing Club demo of the Apple I."

Ground Truth: "This is like IBM buying Apple after the Homebrew Computing Club demo of the Apple I."

Page 50: Generating Pseudo-ground Truth for Detecting New Concepts in Social Streams

Evaluation

• Mention level (NER style)

• Entity level

Prediction: "This is like IBM buying Apple after the Homebrew Computing Club demo of the Apple I."

Ground Truth: "This is like IBM buying Apple after the Homebrew Computing Club demo of the Apple I."
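
The two evaluation levels can be contrasted with a minimal sketch. It assumes that mention-level evaluation compares individual (tweet, span) predictions, while entity-level evaluation compares the sets of distinct entity names; this interpretation, the toy data, and the helper `precision_recall` are illustrative assumptions, not the exact definitions used in the talk.

```python
def precision_recall(predicted, gold):
    """Set-based precision and recall; returns 0.0 when a denominator is empty."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall

# Mention level: each item is (tweet_id, start_token, end_token). Toy values.
predicted_mentions = {(1, 3, 4), (1, 5, 6), (2, 0, 2)}
gold_mentions = {(1, 3, 4), (2, 0, 2), (3, 1, 2)}

# Entity level: each item is a distinct entity name, regardless of how often it occurs.
predicted_entities = {"ibm", "apple"}
gold_entities = {"ibm", "apple", "apple i"}

mention_p, mention_r = precision_recall(predicted_mentions, gold_mentions)
entity_p, entity_r = precision_recall(predicted_entities, gold_entities)
print(mention_p, mention_r, entity_p, entity_r)
```

An entity that is mentioned many times counts once at the entity level but once per occurrence at the mention level, which is why the two scores can diverge.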

Page 51: Generating Pseudo-ground Truth for Detecting New Concepts in Social Streams

[Figure: tweets and entities under Prediction compared with tweets and entities under Ground Truth]

Page 52: Generating Pseudo-ground Truth for Detecting New Concepts in Social Streams

Page 53: Generating Pseudo-ground Truth for Detecting New Concepts in Social Streams

Experimental setup

Data:

Corpus: Twitter (TREC'11 MB: 4,832,838 tweets)

KB: Wikipedia (Jan 4th, 2012)

Components:

EL: Semanticizer

NERC:

Page 54: Generating Pseudo-ground Truth for Detecting New Concepts in Social Streams

"links": [
  {
    "text": "ASAP",
    "linkProbability": 0.17446043165467626,
    "id": "30864663",
    "senseProbability": 0.11690647482014388,
    "title": "ASAP (variety show)",
    "url": "http://en.wikipedia.org/wiki/ASAP%20%28variety%20show%29",
    "label": "ASAP",
    "priorProbability": 0.631578947368421
  },
  {
    "text": "ASAP Rocky",
    "linkProbability": 0.9333333333333333,
    "id": "33754098",
    "senseProbability": 0.9333333333333333,
    "title": "ASAP Rocky",
    "url": "http://en.wikipedia.org/wiki/ASAP%20Rocky",
    "label": "ASAP Rocky",
    "priorProbability": 1.0
  },
  {
    "text": "Kendrick Lamar",
    "linkProbability": 0.9533333333333334,
    "id": "29909823",
    "senseProbability": 0.9533333333333334,
    "title": "Kendrick Lamar",
    "url": "http://en.wikipedia.org/wiki/Kendrick%20Lamar",
    "label": "Kendrick Lamar",
    "priorProbability": 1.0
  },

"ASAP Rocky and Kendrick Lamar, that's when I started listening again"

Page 55: Generating Pseudo-ground Truth for Detecting New Concepts in Social Streams

NERC

Two-stage approach [1]

1. Recognition

• Predict entity span

• For each token predict B, I, or O tag.

• Structured perceptron

2. Classification

• Given entity span, predict entity class (PER/LOC/ORG)

• SVMs

[1] L. Buitinck and M. Marx. Two-stage named-entity recognition using averaged perceptrons. In NLDB'12, 2012.

Page 56: Generating Pseudo-ground Truth for Detecting New Concepts in Social Streams

NERC

Two-stage approach [1]

"ASAP Rocky and Kendrick Lamar, that's when I started listening again"

Page 57: Generating Pseudo-ground Truth for Detecting New Concepts in Social Streams

NERC

1. Recognition: for each token predict B, I, or O tag.

ASAP/B Rocky/I and/O Kendrick/B Lamar,/I that's/O when/O I/O started/O listening/O again/O

Page 58: Generating Pseudo-ground Truth for Detecting New Concepts in Social Streams

NERC

2. Classification: given entity span, predict entity class (PER/LOC/ORG).

"ASAP Rocky and Kendrick Lamar, that's when I started listening again"

ASAP Rocky: person; Kendrick Lamar: person
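
The B/I/O encoding used in the recognition stage can be sketched as follows. This only illustrates the label scheme on the example tweet; it is not the structured perceptron itself, and the whitespace tokenization and the helper name `bio_tags` are assumptions made for the illustration.

```python
def bio_tags(tokens, spans):
    """Encode entity spans as B/I/O tags.

    spans: list of (start, end) token indices, with end exclusive.
    """
    tags = ["O"] * len(tokens)
    for start, end in spans:
        tags[start] = "B"                 # first token of the entity
        for i in range(start + 1, end):
            tags[i] = "I"                 # remaining tokens inside the entity
    return tags

tweet = "ASAP Rocky and Kendrick Lamar, that's when I started listening again"
tokens = tweet.split()                    # assumption: plain whitespace tokenization
tags = bio_tags(tokens, [(0, 2), (3, 5)]) # spans for "ASAP Rocky" and "Kendrick Lamar,"

for token, tag in zip(tokens, tags):
    print(tag, token)
```

The second stage would then take each B/I span and assign it a class such as PER, LOC, or ORG.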

Page 59: Generating Pseudo-ground Truth for Detecting New Concepts in Social Streams

[Diagram components: tweet corpus; tweet; entity linker; mention-entity pairs (m1, e1), (m2, e2); sample corpus; training data (m1, c1), (m2, c2); NERC; NERC model; unlabeled tweet (?); predictions (m1, c1), (m2, c2); future KB]

Page 60: Generating Pseudo-ground Truth for Detecting New Concepts in Social Streams

From tweet to training sample

1. Convert EL output (Wikipedia concepts) to NERC labels;

• Label entity span (B-I-O) & class (PER/LOC/ORG)

2. Pick "good" samples

• entity linker's confidence score

• textual quality
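
The first step, turning entity linker output into NERC labels, might look like the sketch below. The combined tag format (`B-PER`, `I-PER`), the punctuation stripping, and the function name `label_tweet` are assumptions for illustration; the `text` and `title` fields follow the linker output shown on the slides.

```python
def label_tweet(tweet, links, entity_class):
    """Return (token, label) pairs for one tweet.

    links: entity linker results with "text" (mention) and "title" (Wikipedia page).
    entity_class: maps a page title to PER/LOC/ORG; unmapped titles are skipped.
    """
    tokens = tweet.split()
    normalized = [t.strip(",.!?") for t in tokens]   # ignore trailing punctuation
    labels = ["O"] * len(tokens)
    for link in links:
        cls = entity_class.get(link["title"])
        if cls is None:
            continue
        mention = link["text"].split()
        n = len(mention)
        for i in range(len(tokens) - n + 1):
            if normalized[i:i + n] == mention:       # first occurrence only
                labels[i] = "B-" + cls
                for j in range(i + 1, i + n):
                    labels[j] = "I-" + cls
                break
    return list(zip(tokens, labels))

tweet = "ASAP Rocky and Kendrick Lamar, that's when I started listening again"
links = [
    {"text": "ASAP Rocky", "title": "ASAP Rocky"},
    {"text": "Kendrick Lamar", "title": "Kendrick Lamar"},
]
entity_class = {"ASAP Rocky": "PER", "Kendrick Lamar": "PER"}  # hypothetical lookup
sample = label_tweet(tweet, links, entity_class)
print(sample)
```

The second step, picking "good" samples, would filter these labeled tweets before they are used as training data.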

Page 61: Generating Pseudo-ground Truth for Detecting New Concepts in Social Streams

"links": [
  {
    "text": "ASAP Rocky",
    "linkProbability": 0.9333333333333333,
    "id": "33754098",
    "senseProbability": 0.9333333333333333,
    "title": "ASAP Rocky",
    "url": "http://en.wikipedia.org/wiki/ASAP%20Rocky",
    "label": "ASAP Rocky",
    "priorProbability": 1.0
  },
  {
    "text": "Kendrick Lamar",
    "linkProbability": 0.9533333333333334,
    "id": "29909823",
    "senseProbability": 0.9533333333333334,
    "title": "Kendrick Lamar",
    "url": "http://en.wikipedia.org/wiki/Kendrick%20Lamar",
    "label": "Kendrick Lamar",
    "priorProbability": 1.0
  },

"ASAP Rocky and Kendrick Lamar, that's when I started listening again"

Page 62: Generating Pseudo-ground Truth for Detecting New Concepts in Social Streams

Entity Class

1. Map Wikipedia entity to DBpedia entity

2. Retrieve entity class (ontology);

• if Person: PER

• if Organisation, Company, or Non-ProfitOrganisation: ORG

• if Place, PopulatedPlace, City, Country: LOC

• …?
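
The class mapping listed above can be written as a simple lookup. The mapping entries come from the slide; returning `None` for any other ontology class (the open "…?" case) and the function name `nerc_class` are assumptions made for this sketch.

```python
# DBpedia ontology class -> NERC class, as listed on the slide.
DBPEDIA_TO_NERC = {
    "Person": "PER",
    "Organisation": "ORG",
    "Company": "ORG",
    "Non-ProfitOrganisation": "ORG",
    "Place": "LOC",
    "PopulatedPlace": "LOC",
    "City": "LOC",
    "Country": "LOC",
}

def nerc_class(dbpedia_types):
    """Return the first matching NERC class for an entity's ontology types, else None."""
    for ontology_type in dbpedia_types:
        if ontology_type in DBPEDIA_TO_NERC:
            return DBPEDIA_TO_NERC[ontology_type]
    return None  # assumption: entities outside PER/LOC/ORG are left unlabeled

print(nerc_class(["Agent", "Person"]))   # PER
print(nerc_class(["Company"]))           # ORG
print(nerc_class(["Film"]))              # None
```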

Page 63: Generating Pseudo-ground Truth for Detecting New Concepts in Social Streams

Sampling Methods

1. Entity linker confidence score

2. Textual quality

Page 64: Generating Pseudo-ground Truth for Detecting New Concepts in Social Streams

Sampling 1: Confidence Score

• Extract mappings from anchor text (a) to Wikipedia page (W)

• Confidence score combines two signals:

1. How commonly is a used as a link?

2. How commonly is a used as a link to W?
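
The two signals can be sketched from raw anchor-text counts. The slide does not say how they are combined, so the product used in `confidence` is an assumption, as are the function names and the example counts.

```python
def link_probability(n_as_link, n_in_text):
    """Signal 1: how often the anchor text a is used as a link at all."""
    return n_as_link / n_in_text if n_in_text else 0.0

def commonness(n_link_to_page, n_as_link):
    """Signal 2: how often a, when it is a link, points to page W."""
    return n_link_to_page / n_as_link if n_as_link else 0.0

def confidence(n_link_to_page, n_as_link, n_in_text):
    # Assumption: combine the two signals by multiplying them.
    return link_probability(n_as_link, n_in_text) * commonness(n_link_to_page, n_as_link)

# Hypothetical counts: a occurs 200 times, 50 times as a link, 40 of those to W.
score = confidence(n_link_to_page=40, n_as_link=50, n_in_text=200)
print(link_probability(50, 200), commonness(40, 50), score)
```

With a product, the score reduces to the fraction of all occurrences of a that link to W, so an anchor that is rarely linked or is highly ambiguous receives a low score.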

Page 65: Generating Pseudo-ground Truth for Detecting New Concepts in Social Streams

Sampling 1: Confidence Score

• Higher threshold = fewer entities, less noise

• Lower threshold = more entities, more noise
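
The effect of the threshold can be seen on the linker output for the example tweet. The scores below are copied from the output shown earlier in the deck; using `senseProbability` as the confidence score, and the helper name `filter_links`, are assumptions for this sketch.

```python
links = [
    {"text": "ASAP", "title": "ASAP (variety show)", "senseProbability": 0.11690647482014388},
    {"text": "ASAP Rocky", "title": "ASAP Rocky", "senseProbability": 0.9333333333333333},
    {"text": "Kendrick Lamar", "title": "Kendrick Lamar", "senseProbability": 0.9533333333333334},
]

def filter_links(links, threshold, key="senseProbability"):
    """Keep only the links whose score reaches the threshold."""
    return [link for link in links if link[key] >= threshold]

for threshold in (0.1, 0.5, 0.95):
    kept = [link["title"] for link in filter_links(links, threshold)]
    print(threshold, len(kept), kept)
```

Raising the threshold first drops the spurious "ASAP (variety show)" link, and at very high values it starts to discard correct links as well, which is the trade-off between noise and coverage.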

Page 66: Generating Pseudo-ground Truth for Detecting New Concepts in Social Streams

Sampling 2: Textual quality

Page 67: Generating Pseudo-ground Truth for Detecting New Concepts in Social Streams

Sampling 2: Textual quality

Highest scoring tweets

1. Watching the History channel, Hitler's Family. Hitler hid his true family heritage, while others had to measure up to Aryan purity.

2. When you sense yourself becoming negative, stop and consider what it would mean to apply that negative energy in the opposite direction.

3. So. After school tomorrow, french revision class. Tuesday, Drama rehearsal and then at 8, cricket training. Wednesday, Drama. Thursday … (c)

Lowest scoring tweets

1. Toni Braxton ~ He Wasn't Man Enough for Me _HASHTAG_ _HASHTAG_? _URL_ RT _Mention_

2. tell me what u think The GetMore Girls, Part One _URL_

3. this girl better not go off on me rt

Page 68: Generating Pseudo-ground Truth for Detecting New Concepts in Social Streams

Sampling 2: Textual quality

• Compare different sampling strategies:

• top tweets

• medium tweets

• medium+top tweets

• low+medium+top tweets (no sampling)
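
The four strategies can be sketched as subsets of tweets ranked by a quality score. How the score is computed and where the bucket boundaries lie are not given here, so the equal-sized thirds, the toy scores, and the function names are assumptions.

```python
def quality_buckets(scored_tweets):
    """Split (tweet, score) pairs into low, medium, and top thirds by score."""
    ranked = sorted(scored_tweets, key=lambda pair: pair[1])
    n = len(ranked)
    low = ranked[: n // 3]
    medium = ranked[n // 3 : 2 * n // 3]
    top = ranked[2 * n // 3 :]
    return low, medium, top

def strategies(scored_tweets):
    """Build the four training samples compared on the slide."""
    low, medium, top = quality_buckets(scored_tweets)
    return {
        "top": top,
        "medium": medium,
        "medium+top": medium + top,
        "low+medium+top": low + medium + top,   # no sampling: every tweet is used
    }

# Hypothetical tweets with hypothetical quality scores.
scored = [("t1", 0.1), ("t2", 0.9), ("t3", 0.5), ("t4", 0.2), ("t5", 0.8), ("t6", 0.6)]
samples = strategies(scored)
for name, subset in samples.items():
    print(name, [tweet for tweet, _ in subset])
```

Each sample would be used to train a separate recognizer, so the effect of textual quality on the resulting model can be compared directly.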

Page 69: Generating Pseudo-ground Truth for Detecting New Concepts in Social Streams

Results

Page 70: Generating Pseudo-ground Truth for Detecting New Concepts in Social Streams

RQ1: What is the impact of our sampling methods for generating pseudo-ground truth?

[Diagram components: tweet corpus; tweet; entity linker; mention-entity pairs (m1, e1), (m2, e2); sample corpus; training data (m1, c1), (m2, c2); NERC; NERC model; unlabeled tweet (?); predictions (m1, c1), (m2, c2); future KB]

Page 71: Generating Pseudo-ground Truth for Detecting New Concepts in Social Streams

Findings: EL confidence score threshold

1. Higher threshold, higher accuracy

[Plot: precision (solid) and recall (dotted); axis ticks 0.1 to 0.9 and 0 to 35]

Page 72: Generating Pseudo-ground Truth for Detecting New Concepts in Social Streams

Findings: EL confidence score threshold

2. Higher threshold, more predictions

Page 73: Generating Pseudo-ground Truth for Detecting New Concepts in Social Streams

Findings: Textual Quality Sampling

Page 74: Generating Pseudo-ground Truth for Detecting New Concepts in Social Streams

RQ2: What is the impact of the size of prior knowledge on detecting unknown entities?

[Diagram components: tweet corpus; tweet; entity linker; today's KB; mention-entity pairs (m1, e1), (m2, e2); sample corpus; training data (m1, c1), (m2, c2); NERC; NERC model; unlabeled tweet (?); predictions (m1, c1), (m2, c2); future KB]

Page 75: Generating Pseudo-ground Truth for Detecting New Concepts in Social Streams

Results: RQ2

Sampling 2: KB size (mentions)

[Plot: our method (blue) vs. baseline (red); axis ticks 20% to 90% and 0 to 50]

Page 76: Generating Pseudo-ground Truth for Detecting New Concepts in Social Streams

Conclusions

Recall increases as the amount of prior knowledge grows:

1. Able to deal with missing labels, justifying approach

2. Rate of unknown entity detection increases as KB grows

Page 77: Generating Pseudo-ground Truth for Detecting New Concepts in Social Streams

Future Work

• Next step: Closing the loop

• Feed back to KB (entity normalization)

• From PER/LOC/ORG entities to other classes:

• Books, buildings, drugs, artists, …?

• Apply to other domains, languages

• From random sampling to time-based sampling

Page 78: Generating Pseudo-ground Truth for Detecting New Concepts in Social Streams

Fin

Questions?
