annotating streams of heterogeneous data for topic generation

Giuseppe Rizzo (@giusepperizzo)

Upload: giuseppe-rizzo

Post on 10-May-2015


DESCRIPTION

Talk given at the VU University Amsterdam, NL, February 6, 2013.

Abstract: Since the advent of Linked Data, we have observed a dramatic increase in structured data sources published on the Web. They mainly provide entity-to-entity interconnections, resulting in a Web of Linked Entities, disambiguated through URIs, spanning structured and unstructured data. Several efforts have been made to exploit this mine of information to enhance text understanding by connecting pieces of text to real-world objects, i.e. entities, that are easily discoverable by intelligent agents. The result is a proliferation of different systems for text annotation through "Web" entities.

In this perspective, we have developed a framework for harmonizing access to such systems and their output. The NERD ontology [1] aligns the differences in the annotations and provides a definition for a set of axioms taken from the long-tail distribution of common classes among the extractors used. On top of the NERD ontology, we have developed NERD [2], which implements a combined logic that aims to minimize the annotation error by taking the best, when possible, from these extractors. We have observed that well-known entity classes such as Person, Location and Organization are well covered by these extractors, while Event is less so, mainly due to a lack of definition and knowledge about what events are. As a follow-up to the Eventmedia project [3], we are defining an event spotter that takes advantage of the large event graph described in the Eventmedia dataset [4].

Social platforms are also sources of structured and unstructured data. They constantly record streams of heterogeneous data about people's activities, feelings, emotions and conversations, opening a window onto the world in real time. Making sense of these streams is extremely challenging. We are currently investigating the role of named entities as centroids for micropost topic generation, presenting them through visual galleries.

[1] http://nerd.eurecom.fr/ontology
[2] http://nerd.eurecom.fr
[3] http://eventmedia.eurecom.fr
[4] http://eventmedia.eurecom.fr/sparql

TRANSCRIPT

Page 1: Annotating streams of heterogeneous data for topic generation

Annotating streams of heterogeneous data for topic generation

Giuseppe Rizzo

@giusepperizzo

Page 2: Annotating streams of heterogeneous data for topic generation

February 6, 2013, VU University Amsterdam, NL

Spotting entities while reading a document

➢ Names of people, locations, organizations, etc.

➢ Named entities are fundamental keys for topic understanding

➢ But the same location name can refer to different places

source: http://goo.gl/kVzlK

Page 3: Annotating streams of heterogeneous data for topic generation


A Web of Linked Entities

➢ GGG (Giant Global Graph) http://goo.gl/fH3h

➢ Nodes are Web entities

➢ Entities provide disambiguation pointers

➢ Entities can be referred to unambiguously (disambiguated)

➢ Entities as centroids for topic generation and understanding

source: http://wole2013.eurecom.fr

source: http://wole2012.eurecom.fr

Page 4: Annotating streams of heterogeneous data for topic generation


Entity extractors

[Figure labels: Web API, Disambiguation, URI]

Page 5: Annotating streams of heterogeneous data for topic generation


Diversity

Extractor | Language | Granularity | Entity position | Classification schema | Number of classes | Response format | Quota (calls/day)
AlchemyAPI | EN, FR, DE, IT, PT, RU, SP, SW | OEN | N/A | Alchemy | 324 | JSON, MicroFormat, XML, RDF | 30000
DBpedia Spotlight | EN | OEN | char offset | DBpedia, FreeBase, Schema.org | 320 | HTML, JSON, RDF, XML | unlimited
Extractiv | EN | OEN | word offset | Extractiv | 34 | HTML, JSON, RDF, XML | 3000
Lupedia | EN, FR, IT | OEN | range of chars | DBpedia, LinkedMDB | 319 | HTML, JSON, RDFa, XML | unlimited
OpenCalais | EN, FR, SP | OEN | char offset | OpenCalais | 95 | JSON, MicroFormat | 50000
Saplo | EN, SW | OED | N/A | Saplo | 5 | JSON | 1333
SemiTags | DE, NL | OED | char offset | CoNLL-3 | 4 | XML | unlimited
Wikimeta | EN, FR, SP | OEN | POS offset | ESTER | 7 | JSON, XML | unlimited
Yahoo! | EN | OEN | range of chars | Yahoo | 13 | JSON, XML | 5000
Zemanta | EN | OED | N/A | FreeBase | 81 | XML, JSON, RDF | 10000

Page 6: Annotating streams of heterogeneous data for topic generation


Harmonizing annotations

http://nerd.eurecom.fr

➢ Ontology: http://nerd.eurecom.fr/ontology

➢ REST API: http://nerd.eurecom.fr/api/application.wadl

➢ UI: http://nerd.eurecom.fr

Page 7: Annotating streams of heterogeneous data for topic generation


NERD Ontology

NERD type | Occurrence
Person | 10
Organization | 10
Country | 6
Company | 6
Location | 6
Continent | 5
City | 5
RadioStation | 5
Album | 5
Product | 5
... | ...

The NERD ontology has been integrated into the NIF project, an EU FP7 effort in the context of LOD2: Creating Knowledge out of Interlinked Data.
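The alignment idea behind the ontology can be pictured as a lookup from each extractor's native type to a shared NERD type. The sketch below is only an illustration: the table entries, the function name `to_nerd_type` and the generic fallback class are assumptions made for this example and are not taken from the actual NERD axioms.

```python
# Hypothetical alignment table: (extractor, native type) -> NERD type.
# The entries are illustrative only, not the real NERD ontology mappings.
ALIGNMENT = {
    ("alchemyapi", "Person"): "Person",
    ("opencalais", "Company"): "Company",
    ("dbpedia spotlight", "Country"): "Country",
    ("wikimeta", "LOC"): "Location",
}

# Assumed generic class used when no mapping is known.
FALLBACK_TYPE = "Thing"


def to_nerd_type(extractor: str, native_type: str) -> str:
    """Return the shared type for an extractor-specific type label."""
    return ALIGNMENT.get((extractor.lower(), native_type), FALLBACK_TYPE)


if __name__ == "__main__":
    print(to_nerd_type("OpenCalais", "Company"))
    print(to_nerd_type("Saplo", "UnknownType"))
```

Keeping the mapping in data rather than in code makes it easy to extend when a new extractor is added.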

Page 8: Annotating streams of heterogeneous data for topic generation


ETAPE2012

➢ DGA (French radio transcripts)
  – Train: 7h 50m
  – Dev: 3h
  – Eval: 3h

➢ ELDA (French TV transcripts)
  – Train: 18h 10m
  – Dev: 7h 55m
  – Eval: 7h 55m

➢ Annotation schema: Quaero, 32 classes

Page 9: Annotating streams of heterogeneous data for topic generation


We can do better: combined

[Figure: annotation tuples (e_A1, t_A1, URI_A1, si_A1, ei_A1), (e_A2, t_A2, URI_A2, si_A2, ei_A2), (e_A3, t_A3, URI_A3, si_A3, ei_A3), ..., (e_N1, t_N1, URI_N1, si_N1, ei_N1), (e_N2, t_N2, URI_N2, si_N2, ei_N2) flow through three stages: extraction, cleaning, fusion]

Fusion: when at least 2 extractors classify the same entity with different types, we apply a preferred selection order (learned rules): Wikimeta, AlchemyAPI, OpenCalais, Lupedia

ETAPE2012
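The preferred selection order can be sketched as a small fusion step. This is a minimal illustration, not the NERD implementation. It assumes each annotation is a dict whose keys mirror the tuple (e, t, URI, si, ei), reading si and ei as start and end offsets, which the slide does not spell out. The function name `fuse` and the dict keys are hypothetical.

```python
from collections import defaultdict

# Preferred selection order taken from the slide (highest priority first).
PRIORITY = ["wikimeta", "alchemyapi", "opencalais", "lupedia"]


def fuse(annotations):
    """Keep one annotation per text span, resolving conflicts by priority.

    annotations: iterable of dicts with the assumed keys
    'extractor', 'entity', 'type', 'uri', 'start', 'end'.
    """
    rank = {name: position for position, name in enumerate(PRIORITY)}
    by_span = defaultdict(list)
    for annotation in annotations:
        by_span[(annotation["start"], annotation["end"])].append(annotation)

    fused = []
    for span in sorted(by_span):
        # Extractors missing from PRIORITY are ranked last.
        best = min(
            by_span[span],
            key=lambda a: rank.get(a["extractor"], len(PRIORITY)),
        )
        fused.append(best)
    return fused


if __name__ == "__main__":
    sample = [
        {"extractor": "lupedia", "entity": "Paris", "type": "Location",
         "uri": "http://example.org/a", "start": 0, "end": 5},
        {"extractor": "wikimeta", "entity": "Paris", "type": "Person",
         "uri": "http://example.org/b", "start": 0, "end": 5},
    ]
    for annotation in fuse(sample):
        print(annotation["extractor"], annotation["type"])
```

Grouping by exact span is the simplest possible matching rule; overlapping but unequal spans would need an additional cleaning step before fusion.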

Page 10: Annotating streams of heterogeneous data for topic generation


… but it introduced systematic errors

Extractor | SLR (Slot Error Rate) | prec | recall | F1 | %correct
alchemyapi | 37.71% | 47.95% | 5.45% | 9.68% | 5.45%
lupedia | 39.49% | 22.87% | 1.56% | 2.91% | 1.56%
opencalais | 37.47% | 41.69% | 3.53% | 6.49% | 3.53%
wikimeta | 36.67% | 19.40% | 4.25% | 6.95% | 4.25%
combined (nerd) | 86.85% | 35.31% | 17.69% | 23.44% | 17.69%

ETAPE2012

Page 11: Annotating streams of heterogeneous data for topic generation


Gazetteers: combined+

[Figure: inputs (e_A1, t_A1, URI_A1, si_A1, ei_A1), (e_A2, t_A2, URI_A2, si_A2, ei_A2), ..., (e_N1, t_N1, URI_N1, si_N1, ei_N1); components: learned model, created static rules, POS tagger, apply rules, fusion; output (e_1, t_1, URI_1, si_1, ei_1)]

Conflicts handled by priority selection: own, Wikimeta, AlchemyAPI, OpenCalais, Lupedia

ETAPE2012

Page 12: Annotating streams of heterogeneous data for topic generation


Over-estimated training model

System | SLR (Slot Error Rate) | prec | recall | F1 | %correct
combined | 86.85% | 35.31% | 17.69% | 23.44% | 17.69%
combined+ | 188.81% | 15.13% | 28.40% | 19.45% | 28.40%

ETAPE2012
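A slot error rate above 100%, as in the combined+ row, is possible because inserted (spurious) slots count as errors while the denominator only counts reference slots. The sketch below assumes the common unweighted definition, (substitutions + deletions + insertions) / reference slots; the scorer used for the evaluation may weight error types differently, and the counts in the example are made up for illustration.

```python
def slot_error_rate(substitutions, deletions, insertions, reference_slots):
    """Unweighted slot error rate: all errors divided by reference slots."""
    if reference_slots <= 0:
        raise ValueError("reference_slots must be positive")
    return (substitutions + deletions + insertions) / reference_slots


if __name__ == "__main__":
    # Made-up counts: many spurious annotations push the rate past 1.0.
    rate = slot_error_rate(substitutions=10, deletions=20,
                           insertions=40, reference_slots=50)
    print(f"{rate:.0%}")
```

Under this definition a system that over-generates annotations can gain recall while its error rate grows without bound, which is consistent with the slide title.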

Page 13: Annotating streams of heterogeneous data for topic generation


General NER limitations

➢ Performance drops
  – with common settings using off-the-shelf models, when annotating corpora that differ from the data the model was trained on (empirically, recall drops by ~20%)
  – with noisy texts such as transcripts and microposts

➢ Lack of knowledge for particular categories, in particular Event

Page 14: Annotating streams of heterogeneous data for topic generation


Participation at the #MSM2013 challenge

➢ English Twitter posts
  – Train: 2815 posts
  – Eval: 1526 posts

➢ Annotation schema: 4 classes

➢ Objective: perform better than the Stanford CRF, properly trained with the challenge settings

Provisional results of the Stanford CRF, 4-fold cross validation over the training set (ongoing):

Class | prec | recall | F1
LOC | 80.12% | 57.76% | 67.63%
MISC | 68.18% | 31.51% | 43.10%
ORG | 83.28% | 50.71% | 63.04%
PER | 79.93% | 70.72% | 75.04%
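For reference, F1 is the harmonic mean of precision and recall. The helper below is a generic sketch (the name `f1_score` is chosen for this example). Plugging in the LOC row gives roughly 67.13, slightly below the reported 67.63%; one possible explanation, not stated on the slide, is that the reported figures are averaged across the four folds.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (same unit in, same unit out)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # LOC row from the table above, in percent.
    print(round(f1_score(80.12, 57.76), 2))
```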

Page 15: Annotating streams of heterogeneous data for topic generation


Poor performance in spotting events

➢ Exploit the large domain knowledge represented by the Eventmedia dataset (http://eventmedia.eurecom.fr/sparql)

➢ EventSpotter
  – Entities classified according to the LODE ontology
  – Spotting according to the event name, agents, temporal and geospatial information
  – Confidence computed according to the similarity between the text surrounding the spotted entity and the event description
  – Disambiguation provided by the event URIs (nodes of the Eventmedia graph)
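The confidence step compares the text around a spotted entity with the event description. The slide does not say which similarity measure is used; the sketch below assumes a plain bag-of-words cosine similarity purely for illustration, and the function names are hypothetical.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lowercased word counts for a piece of text."""
    return Counter(re.findall(r"\w+", text.lower()))


def context_similarity(surrounding_text, event_description):
    """Cosine similarity between two texts, in the range [0, 1]."""
    a = bag_of_words(surrounding_text)
    b = bag_of_words(event_description)
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    context = "great jazz festival tonight in Amsterdam"
    description = "annual jazz festival held in Amsterdam"
    print(round(context_similarity(context, description), 3))
```

A score near 1 would suggest the spotted mention and the candidate event talk about the same thing; how such a score is thresholded is left open here.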

Page 16: Annotating streams of heterogeneous data for topic generation


Entities for concept mining

➢ Used to annotate textual data
  – news articles, and ...

➢ Video transcripts:
  – video segmentation (MediaFragment)
  – MediaFragment annotation
  – indexing
  – topic model generation

➢ Microposts:
  – text understanding
  – topic model generation

Page 17: Annotating streams of heterogeneous data for topic generation


Media Fragment Enricher

source: http://goo.gl/BMZK3

Joint work between University of Southampton and EURECOM
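A video segment can be addressed with a temporal media fragment, written as `#t=start,end` in seconds after the video URL. The helper below is a small sketch of building such an address for a transcript segment; the function name and the example URL are hypothetical, and it does not describe how the Media Fragment Enricher itself works.

```python
def _seconds(value):
    """Format seconds with up to millisecond precision, without trailing zeros."""
    return f"{value:.3f}".rstrip("0").rstrip(".")


def media_fragment_uri(video_url, start_s, end_s):
    """Return video_url with a temporal fragment covering [start_s, end_s)."""
    if start_s < 0 or end_s <= start_s:
        raise ValueError("expected 0 <= start_s < end_s")
    return f"{video_url}#t={_seconds(start_s)},{_seconds(end_s)}"


if __name__ == "__main__":
    print(media_fragment_uri("http://example.org/video.mp4", 10, 20.5))
```

Annotations extracted from a transcript segment can then be attached to the fragment address rather than to the whole video.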

Page 18: Annotating streams of heterogeneous data for topic generation


Annotating social streams

➢ Live and fresh breaking news: microposts

➢ Media items, such as pictures and videos, usually are attached to microposts

➢ Grouping microposts:
  – Entity labels
  – Entity classes
  – Latent Dirichlet allocation (LDA)
  – Density-based micropost proximity (text similarity, entity similarity, temporal distance)

➢ Create textual storyboards from vox populi

➢ Describe visually the created storyboards
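The density-based proximity combines three signals: text similarity, entity similarity and temporal distance. How they are combined is not given on the slide; the sketch below assumes Jaccard overlap for text and entities, a capped linear time distance and a weighted sum. The weights, the one-hour time scale and the dict keys are all assumptions made for this example.

```python
def jaccard(a, b):
    """Jaccard overlap of two collections; 0.0 when both are empty."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


def micropost_distance(p, q, w_text=0.4, w_entity=0.4, w_time=0.2,
                       time_scale_s=3600.0):
    """Distance in [0, 1] between two microposts.

    p, q: dicts with assumed keys 'text' (str), 'entities' (list of str)
    and 'timestamp' (seconds). Weights are illustrative and sum to 1.
    """
    text_d = 1.0 - jaccard(p["text"].lower().split(), q["text"].lower().split())
    entity_d = 1.0 - jaccard(p["entities"], q["entities"])
    time_d = min(abs(p["timestamp"] - q["timestamp"]) / time_scale_s, 1.0)
    return w_text * text_d + w_entity * entity_d + w_time * time_d


if __name__ == "__main__":
    a = {"text": "fire near Amsterdam station", "entities": ["Amsterdam"],
         "timestamp": 0}
    b = {"text": "big fire in Amsterdam", "entities": ["Amsterdam"],
         "timestamp": 600}
    print(round(micropost_distance(a, b), 3))
```

A distance of this shape could be handed to a density-based clustering algorithm, which groups posts that are mutually close without fixing the number of topics in advance.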

Page 19: Annotating streams of heterogeneous data for topic generation


Centroids for topic generation

➢ Each cloud represents a topic

➢ A topic is depicted by an entity

➢ Leaves are media items, which visually represent the microposts

➢ Each leaf can belong to many topics
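The structure described above, with an entity at the centre of each topic and media items as leaves that may appear under several topics, can be pictured as an index from entity to media items. This is a minimal sketch under assumed input: each micropost is a dict with hypothetical keys 'entities' and 'media'.

```python
from collections import defaultdict


def build_topics(microposts):
    """Map each entity (topic centroid) to the media items of its microposts.

    A media item is listed under every entity mentioned in its micropost,
    so the same leaf can belong to many topics.
    """
    topics = defaultdict(list)
    for post in microposts:
        for entity in post["entities"]:
            for media in post["media"]:
                if media not in topics[entity]:
                    topics[entity].append(media)
    return dict(topics)


if __name__ == "__main__":
    posts = [
        {"entities": ["Amsterdam", "VU University"], "media": ["img1.jpg"]},
        {"entities": ["Amsterdam"], "media": ["img2.jpg"]},
    ]
    for entity, leaves in build_topics(posts).items():
        print(entity, leaves)
```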

Page 20: Annotating streams of heterogeneous data for topic generation


Topic storyboard

➢ Visual summary of the topic

➢ Topic is labelled with an entity

➢ A poster picture is displayed according to the relevance of the micropost in the generated topic

➢ If the entity points to a LOD resource, a textual description is displayed

Page 21: Annotating streams of heterogeneous data for topic generation


Outlook

➢ Modelling heterogeneous data with entities

➢ Linking entities according to the topics extracted from the text

➢ Enhancing topic modelling with the GGG

➢ Providing visual storyboards tailored to the extracted topics

Page 22: Annotating streams of heterogeneous data for topic generation


Thanks for your time and attention

http://www.slideshare.net/giusepperizzo

Agenda:
  – Web of Linked Entities (sl. 3)
  – Aligning annotations (sl. 6)
  – Combining performances of 3rd-party entity extractors (sl. 9)
  – Spotting events (sl. 15)
  – Annotating MFs and microposts for topic generation (sl. 16)
  – Topic storyboard generation (sl. 19)