1
Why and how is this a “related document”?
Semantics-based analysis of and navigation through heterogeneous text corpora
Bettina Berendt & Daniel Trümper (KU Leuven / HU Berlin)
Blaž Fortuna, Marko Grobelnik & Dunja Mladenič (JSI Ljubljana)
www.cs.kuleuven.be/~berendt
3
Application motivation: Beyond dedicated search engines
1. News and blogs
(Lloyd et al., Proc. CAAW 2006; Berendt et al., Kommunikation, Partizipation und Wirkungen im Social Web, 2008; Berendt, Fortuna et al., in prep.)
2. Multilingual sources: good results in semi-automatic ontology learning based on simple machine translation
4
PASCAL motivation: Re-use Textgarden's bread-and-butter and advanced tools
Text to bag-of-words
Ontogen
http://www.textmining.net
http://ontogen.ijs.si/
6
Solution approach: Architecture & states overview
[Architecture diagram; items marked * were created in this project]
Legend: Data / tool; External Textgarden tool; User action; Created in this project *
Global analysis
Local analysis
Specify sources & filters *
Web
Retrieval & Preprocessing *
Source docs database *
Ontology learning (Ontogen)
Import ontology *
Build ontology
Search
Select document *
Construct composite-similarity neighbourhood *
Aspect-based similarity search *
Select neighbourhood *
Refocus *
7
Retrieval and preprocessing
• Crawler / wrapper * (uses Blogdigger)
• Translator * (uses Babelfish)
• Preprocessing (Txt2Bow)
• NER (GATE)
• Similarity Computation *
[Diagram: Web, Retrieval & Preprocessing, Source docs database]
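The preprocessing and similarity steps listed above can be illustrated with a minimal sketch. It is not the Txt2Bow or project implementation: the tokenizer, the raw term counts, and the use of cosine similarity are assumptions chosen for illustration, and the function names and sample texts are hypothetical.

```python
import math
import re
from collections import Counter


def to_bag_of_words(text):
    """Turn a text into a bag of words: lowercase letter-only tokens with counts.

    Assumption: a simple regex tokenizer; the real Txt2Bow tool may also
    apply stop-word removal, stemming, or TF-IDF weighting.
    """
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    return Counter(tokens)


def cosine_similarity(bow_a, bow_b):
    """Cosine similarity between two bags of words (0.0 if either is empty)."""
    shared_terms = set(bow_a) & set(bow_b)
    dot = sum(bow_a[term] * bow_b[term] for term in shared_terms)
    norm_a = math.sqrt(sum(count * count for count in bow_a.values()))
    norm_b = math.sqrt(sum(count * count for count in bow_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical documents, standing in for crawled (and translated) sources.
news_doc = to_bag_of_words("Blogs and news report on the election")
blog_doc = to_bag_of_words("My blog post on the election news")
unrelated_doc = to_bag_of_words("Sailing with porpoises")

similarity_related = cosine_similarity(news_doc, blog_doc)
similarity_unrelated = cosine_similarity(news_doc, unrelated_doc)
```

In this sketch, documents that share vocabulary score above zero, and documents with no common terms score exactly zero.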
17
Constructing the similarity measure & neighbourhood (III)
A news source: most neighbours are blogs
A German-language blog: most neighbours are English-language blogs
Legend: English blog, German blog, English news
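A composite-similarity neighbourhood of the kind shown on this slide could be sketched as follows. The slides in this transcript do not define the measure, so the weighted average of per-aspect similarities, the aspect names ("text", "entities"), the document identifiers, and all scores below are assumptions made for illustration only.

```python
def composite_similarity(aspect_sims, weights):
    """Weighted average of per-aspect similarities.

    Assumption: the composite measure is a weighted average; an aspect
    missing for a document contributes a similarity of 0.0.
    """
    total_weight = sum(weights.values())
    if total_weight == 0:
        raise ValueError("at least one aspect weight must be non-zero")
    weighted = sum(w * aspect_sims.get(aspect, 0.0) for aspect, w in weights.items())
    return weighted / total_weight


def neighbourhood(candidates, weights, k):
    """Return the ids of the k candidates most similar to the selected document.

    candidates maps a document id to its per-aspect similarities with the
    selected document. Ties are broken by document id for reproducibility.
    """
    scored = [
        (composite_similarity(sims, weights), doc_id)
        for doc_id, sims in candidates.items()
    ]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for _, doc_id in scored[:k]]


# Hypothetical per-aspect similarities to one selected source document.
candidates = {
    "en_blog_1": {"text": 0.8, "entities": 0.6},
    "de_blog_1": {"text": 0.4, "entities": 0.9},
    "en_news_1": {"text": 0.2, "entities": 0.1},
}

balanced = neighbourhood(candidates, {"text": 0.5, "entities": 0.5}, k=2)
entities_only = neighbourhood(candidates, {"text": 0.0, "entities": 1.0}, k=2)
```

Changing the weights changes which documents enter the neighbourhood and in what order, which is one plausible way to read "aspect-based" similarity search.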
23
“Pump-priming”: PORPOISE as catalyst
Using PASCAL software for analyzing social-media docs
Using PASCAL software for analyzing multilingual social-media docs
Analyzing blogs and news
PORPOISE
PORPOISE+: More fine-grained sailing
STORYGROWTH: Tracking concept and community evolution
Supporting constructive search
DM4E: “More constructive search”