annotating streams of heterogeneous data for topic generation

Annotating streams of heterogeneous data for topic generation

Giuseppe Rizzo

@giusepperizzo


http://twitter.com/giusepperizzo

February 6, 2013, VU University Amsterdam, NL

Spotting entities while reading a document

➢ Names of people, locations, organizations, etc.

➢ Named entities are fundamental keys for topic understanding

➢ But the same location name can refer to different places

source: http://goo.gl/kVzlK


http://www.eurecom.fr/


A Web of Linked Entities

➢ GGG (Giant Global Graph) http://goo.gl/fH3h

➢ Nodes are Web entities

➢ Entities provide disambiguation pointers

➢ Entities can be referred to unambiguously (disambiguated)

➢ Entities as centroids for topic generation and understanding

source: http://wole2013.eurecom.fr

source: http://wole2012.eurecom.fr




Entity extractors

Web API

Disambiguation URI



Diversity

Extractor         | Language                       | Granularity | Entity position | Classification schema         | Number of classes | Response format             | Quota (calls/day)
AlchemyAPI        | EN, FR, DE, IT, PT, RU, SP, SW | OEN         | N/A             | Alchemy                       | 324               | JSON, MicroFormat, XML, RDF | 30000
DBpedia Spotlight | EN                             | OEN         | char offset     | DBpedia, FreeBase, Schema.org | 320               | HTML, JSON, RDF, XML        | unlimited
Extractiv         | EN                             | OEN         | word offset     | Extractiv                     | 34                | HTML, JSON, RDF, XML        | 3000
Lupedia           | EN, FR, IT                     | OEN         | range of chars  | DBpedia, LinkedMDB            | 319               | HTML, JSON, RDFa, XML       | unlimited
OpenCalais        | EN, FR, SP                     | OEN         | char offset     | OpenCalais                    | 95                | JSON, MicroFormat           | 50000
Saplo             | EN, SW                         | OED         | N/A             | Saplo                         | 5                 | JSON                        | 1333
SemiTags          | DE, NL                         | OED         | char offset     | CoNLL-3                       | 4                 | XML                         | unlimited
Wikimeta          | EN, FR, SP                     | OEN         | POS offset      | ESTER                         | 7                 | JSON, XML                   | unlimited
Yahoo!            | EN                             | OEN         | range of chars  | Yahoo                         | 13                | JSON, XML                   | 5000
Zemanta           | EN                             | OED         | N/A             | FreeBase                      | 81                | XML, JSON, RDF              | 10000



Harmonizing annotations

http://nerd.eurecom.fr

ontology [1]

REST API [2]

UI [3]

[1] http://nerd.eurecom.fr/ontology
[2] http://nerd.eurecom.fr/api/application.wadl
[3] http://nerd.eurecom.fr



NERD Ontology

NERD type       Occurrence
Person          10
Organization    10
Country         6
Company         6
Location        6
Continent       5
City            5
RadioStation    5
Album           5
Product         5
...             ...

The NERD ontology has been integrated into the NIF project, developed in the context of LOD2 (Creating Knowledge out of Interlinked Data), an EU FP7 project.
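A minimal sketch of what harmonizing annotations could look like: each extractor-specific type label is mapped onto a shared NERD type. The alignment table, the native labels, and the `nerd:Thing` fallback below are illustrative assumptions; the actual alignments are the ones published with the NERD ontology.

```python
# Hypothetical alignment table: (extractor, native type label) -> NERD type.
# The native labels are illustrative assumptions, not the real extractor schemas.
ALIGNMENT = {
    ("alchemyapi", "City"): "nerd:City",
    ("opencalais", "Company"): "nerd:Company",
    ("wikimeta", "PERS"): "nerd:Person",
    ("lupedia", "Country"): "nerd:Country",
}

# Assumed catch-all for labels that have no alignment in the table.
FALLBACK = "nerd:Thing"


def to_nerd_type(extractor: str, native_type: str) -> str:
    """Return the shared NERD type for an extractor-specific label."""
    return ALIGNMENT.get((extractor.lower(), native_type), FALLBACK)
```

With such a table, annotations coming from different extractors can be compared on the same set of types before any fusion step.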



ETAPE2012

➢ DGA (French radio transcripts)
– Train: 7h 50m
– Dev: 3h
– Eval: 3h

➢ ELDA (French TV transcripts)
– Train: 18h 10m
– Dev: 7h 55m
– Eval: 7h 55m

➢ Annotation schema: Quaero, 32 classes



We can do better: combined

(e_A1, t_A1, URI_A1, si_A1, ei_A1) ... (e_A2, t_A2, URI_A2, si_A2, ei_A2)

(e_A3, t_A3, URI_A3, si_A3, ei_A3)

(e_N2, t_N2, URI_N2, si_N2, ei_N2)

(e_N1, t_N1, URI_N1, si_N1, ei_N1)

extraction → cleaning → fusion

When at least 2 extractors classify the same entity with a different type, we apply a preferred selection order (learning rules): Wikimeta, AlchemyAPI, OpenCalais, Lupedia

ETAPE2012
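The preferred selection order can be sketched as follows. This is only an illustration under stated assumptions: annotations are modelled as the (e, t, URI, si, ei) tuples of the slide, `si` and `ei` are assumed to be start and end offsets, two annotations are assumed to describe the same entity when their offsets coincide, and the names `Annotation`, `PRIORITY` and `fuse` are hypothetical.

```python
from collections import defaultdict
from typing import NamedTuple


class Annotation(NamedTuple):
    extractor: str
    entity: str  # e: surface form
    type: str    # t: entity type
    uri: str     # URI: disambiguation link
    start: int   # si (assumed to be the start offset)
    end: int     # ei (assumed to be the end offset)


# Preferred selection order given on the slide, highest priority first.
PRIORITY = ["wikimeta", "alchemyapi", "opencalais", "lupedia"]


def fuse(annotations):
    """Keep one annotation per text span, preferring extractors earlier in PRIORITY.

    Raises ValueError if an annotation comes from an extractor not in PRIORITY.
    """
    by_span = defaultdict(list)
    for ann in annotations:
        by_span[(ann.start, ann.end)].append(ann)
    fused = []
    for span in sorted(by_span):
        candidates = by_span[span]
        fused.append(min(candidates, key=lambda a: PRIORITY.index(a.extractor)))
    return fused
```

For example, if Lupedia types a span as a location and Wikimeta types the same span as an organization, `fuse` keeps the Wikimeta annotation.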



… but it introduced systematic errors

                  SER (Slot Error Rate)   prec      recall    F1        %correct
alchemyapi        37.71%                  47.95%    5.45%     9.68%     5.45%
lupedia           39.49%                  22.87%    1.56%     2.91%     1.56%
opencalais        37.47%                  41.69%    3.53%     6.49%     3.53%
wikimeta          36.67%                  19.40%    4.25%     6.95%     4.25%
combined (nerd)   86.85%                  35.31%    17.69%    23.44%    17.69%

ETAPE2012
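As a reminder of how such measures relate to raw counts, here is a small sketch. It uses the textbook definitions of precision, recall and F1, and the unweighted form of the slot error rate, (substitutions + deletions + insertions) / reference slots. The ETAPE evaluation may weight error types differently, so this is not meant to reproduce the figures above; the function names are hypothetical.

```python
def precision_recall_f1(correct: int, predicted: int, reference: int):
    """Return (precision, recall, F1) from slot counts."""
    precision = correct / predicted if predicted else 0.0
    recall = correct / reference if reference else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def slot_error_rate(substitutions: int, deletions: int, insertions: int,
                    reference: int) -> float:
    """Unweighted slot error rate: (S + D + I) / N.

    Because insertions are counted, the value can exceed 1.0 (100%).
    """
    return (substitutions + deletions + insertions) / reference
```

The second function shows why a system that adds many spurious entities can end up with a slot error rate above 100% even when its recall improves.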



Gazetteers: combined+

(e_A1, t_A1, URI_A1, si_A1, ei_A1)

(e_A2, t_A2, URI_A2, si_A2, ei_A2)

(e_N1, t_N1, URI_N1, si_N1, ei_N1)

...

Learned model

Created static rules

fusion

Conflicts handled by priority selection: own, Wikimeta, AlchemyAPI, OpenCalais, Lupedia

POS tagger

Apply rules

(e_1, t_1, URI_1, si_1, ei_1)

ETAPE2012



Over-estimated training model

             SER (Slot Error Rate)   prec      recall    F1        %correct
combined     86.85%                  35.31%    17.69%    23.44%    17.69%
combined+    188.81%                 15.13%    28.40%    19.45%    28.40%

ETAPE2012



General NER limitations

➢ Performance drops
– with common settings using off-the-shelf models, when annotating corpora that differ from the one used to train the model (empirically, recall drops by ~20%)
– with noisy texts such as transcripts and microposts

➢ Lack of knowledge for particular categories, in particular Event



Participation at the #MSM2013 challenge

➢ English Twitter posts
– Train: 2815 posts
– Eval: 1526 posts

➢ Annotation schema: 4 classes

➢ Objective: perform better than the Stanford CRF, properly trained with the challenge settings

        prec      recall    F1
LOC     80.12%    57.76%    67.63%
MISC    68.18%    31.51%    43.10%
ORG     83.28%    50.71%    63.04%
PER     79.93%    70.72%    75.04%

4-fold cross-validation over the training set: provisional results of the Stanford CRF (ongoing)
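For readers unfamiliar with the protocol, a 4-fold cross-validation split can be sketched as below. The helper `k_fold_indices` is hypothetical and splits the posts in their original order, without shuffling or stratification; how the folds were actually built for the challenge is not stated here.

```python
def k_fold_indices(n_items: int, k: int = 4):
    """Yield (train_indices, test_indices) pairs for k contiguous folds.

    Fold sizes differ by at most one item; no shuffling is applied.
    """
    base, extra = divmod(n_items, k)
    start = 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_items))
        yield train, test
        start += size
```

Each post is used for testing exactly once, and a model is trained on the remaining three folds each time; the per-class scores are then averaged over the four runs.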



Poor performance in spotting events

➢ Exploit the large domain knowledge represented by the Eventmedia dataset [1]

➢ EventSpotter
– Entities classified according to the LODE ontology
– Spotting according to the event name, agents, temporal and geospatial information
– Confidence computed according to the similarity between the text surrounding the spotted entity and the event description
– Disambiguation provided by the event URIs (nodes of the Eventmedia graph)

[1] http://eventmedia.eurecom.fr/sparql
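The confidence step can be illustrated with a simple text similarity. The slide does not say which measure EventSpotter uses, so the bag-of-words cosine similarity below is an assumption chosen for illustration, and the function names are hypothetical.

```python
import math
import re
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Lowercase the text and count its word tokens."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine_similarity(context: str, description: str) -> float:
    """Cosine similarity between the word-count vectors of two texts."""
    va, vb = bag_of_words(context), bag_of_words(description)
    dot = sum(va[w] * vb[w] for w in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    norm = norm_a * norm_b
    return dot / norm if norm else 0.0
```

Here `context` would be the text surrounding the spotted entity and `description` the description of a candidate event; a higher score would translate into a higher confidence for that candidate.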



Entities for concept mining

➢ Used to annotate textual data
– news articles, and ...

➢ Video transcripts:
– video segmentation (MediaFragment)
– MediaFragment annotation
– indexing
– topic model generation

➢ Microposts:
– text understanding
– topic model generation



Media Fragment Enricher

source: http://goo.gl/BMZK3

Joint work between University of Southampton and EURECOM



Annotating social streams

➢ Live and fresh breaking news: microposts

➢ Media items, such as pictures and videos, are usually attached to microposts

➢ Grouping microposts:
– Entity labels
– Entity classes
– Latent Dirichlet allocation (LDA)
– Density-based micropost proximity (text similarity, entity similarity, temporal distance)

➢ Create textual storyboards from vox populi

➢ Visually describe the created storyboards
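A micropost proximity that combines text similarity, entity similarity and temporal distance could be sketched as follows. Everything specific here is an assumption made for illustration: Jaccard overlap for the two similarities, a one-hour time scale, equal weights, and the names `Micropost`, `jaccard` and `proximity`. A density-based clustering algorithm could then group posts using such a score.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Micropost:
    tokens: frozenset    # words of the post
    entities: frozenset  # entity labels or URIs found in the post
    timestamp: float     # publication time in seconds


def jaccard(a: frozenset, b: frozenset) -> float:
    """Size of the intersection divided by the size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def proximity(p: Micropost, q: Micropost, time_scale: float = 3600.0,
              weights=(1 / 3, 1 / 3, 1 / 3)) -> float:
    """Weighted mix of text, entity and temporal closeness, in [0, 1]."""
    text_sim = jaccard(p.tokens, q.tokens)
    entity_sim = jaccard(p.entities, q.entities)
    # Temporal closeness decays from 1.0 as the time gap grows.
    time_sim = 1.0 / (1.0 + abs(p.timestamp - q.timestamp) / time_scale)
    w_text, w_entity, w_time = weights
    return w_text * text_sim + w_entity * entity_sim + w_time * time_sim
```

Two posts that share words and entities and were published close in time score near 1, while unrelated posts published hours apart score near 0.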



Centroids for topic generation

➢ Each cloud represents a topic

➢ A topic is depicted by an entity

➢ Leaves are media items, which visually represent the microposts

➢ Each leaf can belong to many topics
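The structure described above, with an entity at the centre of each topic and media items as leaves, can be sketched as a mapping from entities to sets of media items. The input format (dictionaries with `entities` and `media` keys) and the function name `build_topic_clouds` are hypothetical.

```python
from collections import defaultdict


def build_topic_clouds(microposts):
    """Map each entity (topic) to the media items of the posts that mention it."""
    clouds = defaultdict(set)
    for post in microposts:
        for entity in post["entities"]:
            clouds[entity].update(post["media"])
    return dict(clouds)
```

Because a post may mention several entities, the same media item can appear under several topics, which matches the last bullet above.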



Topic storyboard

➢ Visual summary of the topic

➢ Topic is labelled with an entity

➢ A poster picture is displayed according to the relevance of the micropost in the generated topic

➢ If the entity points to a LOD resource, a textual description is displayed



Outlook

➢ Modelling heterogeneous data with entities

➢ Linking entities according to the topics extracted from the text

➢ Enhancing topic modelling with the GGG

➢ Providing visual storyboards tailored to the extracted topics



Thanks for your time and attention

http://www.slideshare.net/giusepperizzo

Agenda:
– Web of Linked Entities
– Aligning annotations
– Combining performances of 3rd-party entity extractors
– Spotting events
– Annotating MFs and microposts for topic generation
– Topic storyboard generation

