Annotating streams of heterogeneous data for topic generation
DESCRIPTION
Talk given at the VU University Amsterdam, NL - February 6, 2013. Abstract: Since the advent of Linked Data, we have observed a dramatic increase in structured data sources published on the Web. They mainly provide entity-to-entity interconnections, resulting in a Web of Linked Entities, disambiguated through URIs, spanning structured and unstructured data. Several efforts have been made to exploit this mine of information to enhance text understanding by connecting pieces of text to real-world objects, i.e. entities, that are easily discoverable by intelligent agents. This has resulted in a proliferation of systems for annotating text with "Web" entities. In this perspective, we have developed a framework for harmonizing access to such systems and their output. The NERD ontology [1] aligns the differences in the annotations and provides a definition for a set of axioms taken from the long-tail distribution of the classes common to the extractors used. On top of the NERD ontology, we have developed NERD [2], which implements a combined logic that aims to minimize the annotation error by taking the best, when possible, from these extractors. We have observed that well-known entity classes such as Person, Location and Organization are well covered by these extractors, while Event is less so, mainly due to the lack of a definition of, and knowledge about, what events are. As a follow-up to the Eventmedia project [3], we are defining an event spotter that takes advantage of the large event knowledge graph described in the Eventmedia dataset [4]. Social platforms are also sources of structured and unstructured data. They constantly record streams of heterogeneous data about people's activities, feelings, emotions and conversations, opening a window on the world in real time. Making sense of these streams is extremely challenging. We are currently investigating the role of named entities as centroids for micropost topic generation, presenting them through visual galleries.
[1] - http://nerd.eurecom.fr/ontology
[2] - http://nerd.eurecom.fr
[3] - http://eventmedia.eurecom.fr
[4] - http://eventmedia.eurecom.fr/sparql
TRANSCRIPT
Annotating streams of heterogeneous data for topic generation
Giuseppe Rizzo
@giusepperizzo
February 6, 2013 - VU University Amsterdam, NL
Slide 2/22
Spotting entities while reading a document
➢ Names of people, locations, organizations, etc.
➢ Named entities are fundamental keys for topic understanding
➢ But the same location name can refer to different places
source: http://goo.gl/kVzlK
Slide 3/22
A Web of Linked Entities
➢ GGG (global giant graph) http://goo.gl/fH3h
➢ Nodes are Web entities
➢ Entities provide disambiguation pointers
➢ Entities can be referred to unambiguously (disambiguated)
➢ Entities as centroids for topic generation and understanding
source: http://wole2013.eurecom.fr
source: http://wole2012.eurecom.fr
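A minimal sketch of what a disambiguation pointer adds: one surface form can map to several candidate URIs, and the annotation has to pin one of them. The candidate table, the cue words and the overlap heuristic below are illustrative assumptions, not the logic of any extractor mentioned in this talk; the DBpedia URIs are used only as examples.

```python
# Hypothetical candidate table: one surface form, several Web entities.
# Each candidate carries a few cue words chosen by hand for this example.
CANDIDATES = {
    "Paris": [
        ("http://dbpedia.org/resource/Paris", {"france", "seine", "eiffel"}),
        ("http://dbpedia.org/resource/Paris,_Texas", {"texas", "usa", "dallas"}),
    ],
}


def disambiguate(surface_form, context):
    """Return the candidate URI whose cue words overlap most with the context."""
    words = {w.strip(".,;:!?").lower() for w in context.split()}
    candidates = CANDIDATES.get(surface_form, [])
    if not candidates:
        return None
    # max() keeps the first candidate on ties; this tie-break is arbitrary.
    uri, _cues = max(candidates, key=lambda c: len(c[1] & words))
    return uri


print(disambiguate("Paris", "A road trip from Dallas to Paris in Texas."))
```

Real systems rank candidates with far richer evidence; the point here is only that the output is a URI rather than a bare string.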
Slide 4/22
Entity extractors
(figure labels: Web API, Disambiguation URI)
Slide 5/22
Diversity
Extractor: AlchemyAPI | DBpedia Spotlight | Extractiv | Lupedia | OpenCalais | Saplo | SemiTags | Wikimeta | Yahoo! | Zemanta
Language: EN,FR,DE,IT,PT,RU,SP,SW | EN | EN | EN,FR,IT | EN,FR,SP | EN,SW | DE,NL | EN,FR,SP | EN | EN
Granularity: OEN | OEN | OEN | OEN | OEN | OED | OED | OEN | OEN | OED
Entity position: N/A | char offset | word offset | range of chars | char offset | N/A | char offset | POS offset | range of chars | N/A
Classification schema: Alchemy | DBpedia, FreeBase, Schema.org | Extractiv | DBpedia, LinkedMDB | OpenCalais | Saplo | CoNLL-3 | ESTER | Yahoo | FreeBase
Number of classes: 324 | 320 | 34 | 319 | 95 | 5 | 4 | 7 | 13 | 81
Response format: JSON, MicroFormat, XML, RDF | HTML, JSON, RDF, XML | HTML, JSON, RDF, XML | HTML, JSON, RDFa, XML | JSON, MicroFormat | JSON | XML | JSON, XML | JSON, XML | XML, JSON, RDF
Quota (calls/day): 30000 | unl | 3000 | unl | 50000 | 1333 | unl | unl | 5000 | 10000
Slide 6/22
Harmonizing annotations
http://nerd.eurecom.fr
ontology [1]
REST API [2]
UI [3]
[1] - http://nerd.eurecom.fr/ontology
[2] - http://nerd.eurecom.fr/api/application.wadl
[3] - http://nerd.eurecom.fr
Slide 7/22
NERD Ontology
NERD type Occurrence
Person 10
Organization 10
Country 6
Company 6
Location 6
Continent 5
City 5
RadioStation 5
Album 5
Product 5
... ...
The NERD ontology has been integrated into the NIF project, an EU FP7 activity in the context of LOD2: Creating Knowledge out of Interlinked Data
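The alignment idea can be sketched as a lookup from extractor-specific types to a shared vocabulary. Everything in the table below is a made-up illustration: the keys, the `nerd:` prefixed names and the `nerd:Thing` fallback are assumptions for this example, and the authoritative mappings are the ones published at http://nerd.eurecom.fr/ontology.

```python
# Hypothetical alignment table: (extractor, extractor-specific type) -> shared type.
# Entries are invented for illustration and do not reproduce the real ontology.
ALIGNMENT = {
    ("alchemyapi", "City"): "nerd:City",
    ("opencalais", "Company"): "nerd:Company",
    ("dbpedia-spotlight", "DBpedia:Person"): "nerd:Person",
}

# Assumed catch-all for types without a mapping.
FALLBACK = "nerd:Thing"


def to_shared_type(extractor, raw_type):
    """Translate an extractor-specific type into the shared vocabulary."""
    return ALIGNMENT.get((extractor.lower(), raw_type), FALLBACK)


print(to_shared_type("OpenCalais", "Company"))
print(to_shared_type("saplo", "SomeUnmappedType"))
```

Once every extractor's output passes through such a table, annotations from different services can be compared type by type.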
Slide 8/22
ETAPE2012
➢ DGA (French radio transcripts)
– Train: 7h 50m
– Dev: 3h
– Eval: 3h
➢ ELDA (French TV transcripts)
– Train: 18h 10m
– Dev: 7h 55m
– Eval: 7h 55m
➢ Annotation schema Quaero: 32 classes
Slide 9/22
We can do better: combined
Diagram: annotation tuples (e_A1, t_A1, URI_A1, si_A1, ei_A1), (e_A2, t_A2, URI_A2, si_A2, ei_A2), (e_A3, t_A3, URI_A3, si_A3, ei_A3), ... and (e_N1, t_N1, URI_N1, si_N1, ei_N1), (e_N2, t_N2, URI_N2, si_N2, ei_N2)
Steps: extraction, cleaning, fusion
When at least 2 extractors classify the same entity with different types, we apply a preferred selection order (learning rules): Wikimeta, AlchemyAPI, OpenCalais, Lupedia
ETAPE2012
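The preferred selection order can be sketched as a priority-based fusion over annotation tuples. This is a simplified illustration, not the NERD implementation: the `extractor` field, the reading of si/ei as start and end offsets, the matching of annotations by exact span, and all sample values are assumptions.

```python
from collections import namedtuple

# Tuple layout follows the slide (entity, type, URI, start, end);
# the extra "extractor" field is an assumption added for this sketch.
Annotation = namedtuple("Annotation", "extractor entity type uri start end")

# Preferred selection order as listed on the slide.
PRIORITY = ["wikimeta", "alchemyapi", "opencalais", "lupedia"]
RANK = {name: position for position, name in enumerate(PRIORITY)}


def fuse(annotations):
    """Keep one annotation per text span, preferring higher-priority extractors."""
    best = {}
    for ann in annotations:
        span = (ann.start, ann.end)
        kept = best.get(span)
        # Extractors missing from PRIORITY rank last.
        if kept is None or RANK.get(ann.extractor, len(PRIORITY)) < RANK.get(kept.extractor, len(PRIORITY)):
            best[span] = ann
    return sorted(best.values(), key=lambda ann: ann.start)


sample = [
    Annotation("opencalais", "Amsterdam", "Organization", None, 10, 19),
    Annotation("wikimeta", "Amsterdam", "Location", "http://dbpedia.org/resource/Amsterdam", 10, 19),
    Annotation("lupedia", "VU", "Organization", None, 0, 2),
]
for ann in fuse(sample):
    print(ann.extractor, ann.entity, ann.type)
```

Resolving conflicts per span is the simplest choice; overlapping but non-identical spans would need an extra alignment step that this sketch leaves out.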
Slide 10/22
… but it introduced systematic errors
SER = Slot Error Rate
extractor SER prec recall F1 %correct
alchemyapi 37.71% 47.95% 5.45% 9.68% 5.45%
lupedia 39.49% 22.87% 1.56% 2.91% 1.56%
opencalais 37.47% 41.69% 3.53% 6.49% 3.53%
wikimeta 36.67% 19.40% 4.25% 6.95% 4.25%
combined (nerd) 86.85% 35.31% 17.69% 23.44% 17.69%
ETAPE2012
Slide 11/22
Gazetteers: combined+
Diagram: annotation tuples (e_A1, t_A1, URI_A1, si_A1, ei_A1), (e_A2, t_A2, URI_A2, si_A2, ei_A2), ..., (e_N1, t_N1, URI_N1, si_N1, ei_N1) and (e_1, t_1, URI_1, si_1, ei_1)
Diagram labels: POS tagger, created static rules, apply rules, learned model, fusion
Conflicts handled by priority selection: own, Wikimeta, AlchemyAPI, OpenCalais, Lupedia
ETAPE2012
Slide 12/22
Over-estimated training model
SER = Slot Error Rate
approach SER prec recall F1 %correct
combined 86.85% 35.31% 17.69% 23.44% 17.69%
combined+ 188.81% 15.13% 28.40% 19.45% 28.40%
ETAPE2012
Slide 13/22
General NER limitations
➢ Performance drops
– with common settings using off-the-shelf models, when annotating corpora that differ from the training data (empirically, recall drops by ~20%)
– with noisy texts such as transcripts and microposts
➢ Lack of knowledge for particular categories, in particular Event
Slide 14/22
Participation at the #MSM2013 challenge
➢ English Twitter posts
– Train: 2815 posts
– Eval: 1526 posts
➢ Annotation schema: 4 classes
➢ Objective: perform better than the Stanford CRF, properly trained with the challenge settings
class prec recall F1
LOC 80.12% 57.76% 67.63%
MISC 68.18% 31.51% 43.10%
ORG 83.28% 50.71% 63.04%
PER 79.93% 70.72% 75.04%
4-fold cross-validation over the training set - provisional results of the Stanford CRF (ongoing)
Slide 15/22
Poor performance in spotting events
➢ Exploit the large domain knowledge represented by the Eventmedia dataset [1]
➢ EventSpotter
– Entities classified according to the LODE ontology
– Spotting according to the event name, agents, temporal and geospatial information
– Confidence computed according to the similarity between the text surrounding the spotted entity and the event description
– Disambiguation provided by the event URIs (nodes of the Eventmedia graph)
[1] - http://eventmedia.eurecom.fr/sparql
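One way to read the confidence step is as a text similarity between the context of a spotted entity and the description of a candidate event. The slides do not say which similarity measure is used, so the bag-of-words cosine below is an assumption, and the function names and sample strings are hypothetical.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lowercased word counts; a deliberately crude text representation."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def event_confidence(surrounding_text, event_description):
    """Score in [0, 1]: how closely the context resembles the event description."""
    return cosine(bag_of_words(surrounding_text), bag_of_words(event_description))


context = "Great concert last night at the jazz festival in Nice"
description = "Annual jazz festival held in Nice with concerts every night"
print(round(event_confidence(context, description), 3))
```

A threshold on this score could then decide whether the spotted mention is linked to the event URI; where to set it would have to be tuned on annotated data.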
Slide 16/22
Entities for concept mining
➢ Used to annotate textual data
– news articles, and ...
➢ Video transcripts:
– video segmentation (MediaFragment)
– MediaFragment annotation
– indexing
– topic model generation
➢ Microposts:
– text understanding
– topic model generation
Slide 17/22
Media Fragment Enricher
source: http://goo.gl/BMZK3
joint work between University of Southampton and EURECOM
Slide 18/22
Annotating social streams
➢ Live and fresh breaking news: microposts
➢ Media items, such as pictures and videos, are usually attached to microposts
➢ Grouping microposts:
– Entity labels
– Entity classes
– Latent Dirichlet allocation (LDA)
– Density-based micropost proximity (text similarity, entity similarity, temporal distance)
➢ Create textual storyboards from vox populi
➢ Describe visually the created storyboards
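The three ingredients of micropost proximity (text similarity, entity similarity, temporal distance) can be combined into a single score. The sketch below is one possible reading: the Jaccard measure, the weights, the one-hour horizon and the `Micropost` structure are all assumptions, not values from the talk. A score of this kind could then be fed to a density-based clustering algorithm.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Micropost:
    text: str
    entities: frozenset  # entity labels or URIs attached by an annotator
    created_at: datetime


def jaccard(a, b):
    """Set overlap in [0, 1]; two empty sets count as no overlap."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def proximity(p, q, weights=(0.4, 0.4, 0.2), horizon_s=3600.0):
    """Weighted mix of text, entity and temporal closeness (weights are assumed)."""
    text_sim = jaccard(set(p.text.lower().split()), set(q.text.lower().split()))
    entity_sim = jaccard(p.entities, q.entities)
    gap_s = abs((p.created_at - q.created_at).total_seconds())
    time_sim = max(0.0, 1.0 - gap_s / horizon_s)  # 0 once posts are a horizon apart
    w_text, w_entity, w_time = weights
    return w_text * text_sim + w_entity * entity_sim + w_time * time_sim


a = Micropost("Obama speaks at the White House", frozenset({"Obama", "White House"}),
              datetime(2013, 2, 6, 12, 0, 0))
b = Micropost("Obama speech live now", frozenset({"Obama"}),
              datetime(2013, 2, 6, 12, 10, 0))
print(round(proximity(a, b), 3))
```

Keeping the three components separate makes it easy to reweight them, for example to favour entity overlap when the text is very noisy.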
Slide 19/22
Centroids for topic generation
➢ Each cloud represents a topic
➢ A topic is depicted by an entity
➢ Leaves are media items, which visually represent the microposts
➢ Each leaf can belong to many topics
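The structure described above, with entities as topic centroids and media items as leaves that may hang under several topics, amounts to an index from entity to media items. A minimal sketch, assuming each micropost is given as a (media URL, entity labels) pair; the file names and labels are invented.

```python
from collections import defaultdict


def group_by_entity(microposts):
    """Map each entity (topic centroid) to the media items of the posts mentioning it."""
    topics = defaultdict(list)
    for media_url, entities in microposts:
        # set() avoids listing the same media item twice under one topic.
        for entity in set(entities):
            topics[entity].append(media_url)
    return dict(topics)


posts = [
    ("img1.jpg", ["Obama", "Washington"]),
    ("img2.jpg", ["Obama"]),
    ("vid1.mp4", ["Washington", "Washington"]),
]
topics = group_by_entity(posts)
for entity in sorted(topics):
    print(entity, topics[entity])
```

Here `img1.jpg` appears under both topics, which is the "each leaf can belong to many topics" property from the slide.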
Slide 20/22
Topic storyboard
➢ Visual summary of the topic
➢ Topic is labelled with an entity
➢ A poster picture is displayed according to the relevance of the micropost in the generated topic
➢ If the entity points to a LOD resource, a textual description is displayed
Slide 21/22
Outlook
➢ Modelling heterogeneous data with entities
➢ Linking entities according to the topics extracted from the text
➢ Enhancing topic modelling with the GGG
➢ Providing visual storyboards tailored to the extracted topics
Slide 22/22
Thanks for your time and attention
http://www.slideshare.net/giusepperizzo
Agenda:
– Web of Linked Entities (sl. 3)
– Aligning annotations (sl. 6)
– Combining performances of 3rd-party entity extractors (sl. 9)
– Spotting events (sl. 15)
– Annotating MFs and microposts for topic generation (sl. 16)
– Topic storyboard generation (sl. 19)