Sailing the Corpus Sea: Tools for Visual
Discovery of Stories in Blogs and News
Bettina Berendt
www.cs.kuleuven.be/~berendt
About me
Thanks to ...
Ilija Subašić
Tool forthcoming;
all beta testers and experiment participants welcome!
Daniel Trümper
Tool at http://www.cs.kuleuven.be/~berendt/PORPOISE/
PASCAL
for funding PORPOISE
References
Subašić, I. & Berendt, B. (2008). Web Mining for Understanding Stories through Graph Visualisation. In Proceedings of ICDM 2008. IEEE Press.
Berendt, B. & Trümper, D. (2009). Semantics-based analysis and navigation of heterogeneous text corpora: the Porpoise news and blogs engine. In I.-H. Ting & H.-J. Wu (Eds.), Web Mining Applications in E-commerce and E-services (pp. 45-64). Berlin etc.: Springer, Studies in Computational Intelligence, Vol. 172.
Berendt, B. & Subašić, I. (2009). Measuring graph topology for interactive temporal event detection. Künstliche Intelligenz, 02/09, 11-17.
Subašić, I. & Berendt, B. (in press). Discovery of interactive graphs for understanding and searching time-indexed corpora. Knowledge and Information Systems.
Please see http://www.cs.kuleuven.be/~berendt/ for these papers.
Motivation 1: What's the story?
Motivation 2: Global+local interaction; beyond “similar documents”
with respect to what?
Solution vision: Sailing the Internet
Global analysis
Search
Local analysis
Solution approach: Architecture & states overview (version 1)
[Architecture diagram. Pipeline components: Web; Specify sources & filters; Retrieval & Preprocessing; Import; Source documents database; Story/ontology learning. States: Global analysis; Search; Story understanding; Search + select document(s); Construct composite-similarity neighbourhood; Select neighbourhood; Aspect-based similarity search; Local analysis; Refocus. Spaces: Story space; Document space.]
Retrieval and preprocessing
• Crawler / wrapper, using
  • Yahoo! Web services
  • Yahoo! / Google News
  • Blogdigger
• Translator (uses Babelfish)
• Preprocessing (uses Textgarden, Terrier)
• Named-entity recognition (uses GATE, OpenCalais)
• Similarity computation
[Diagram labels: Web; Retrieval & Preprocessing; Source documents database]
Story learning
[Architecture diagram, with the same components and states as in the architecture & states overview (version 1).]
Story learning: a focus on news-type stories
Story learning needs timelines
... But where's the evolution? Approach: temporal latent topics
Mei & Zhai, PKDD 2005
• no fine-grained relational information
• “themes” are fixed by the algorithm
• no “drill-down” possible; no combination of machine and human intelligence
The ETP3 problem
Evolutionary theme patterns discovery, summary & exploration
1. identify topical sub-structure in a set (generally, a time-indexed stream) of documents constrained by being about a common topic
2. show how these substructures emerge, change, and disappear (and maybe re-appear) over time
3. provide intuitive interfaces for interactively exploring the topics (story space) and the underlying documents (document space), and for users' own sense-making
– use machine-generated summarization only as a starting point!
So ...
Demo
Evaluations (so far ...)
1. Information retrieval quality
Challenge: What is the ground truth?
Build on Wikipedia or other timelines
Edges – events: up to 80% recall, ca. 30% precision
2. Search quality
Story subgraphs index coherent document clusters
3. Learning effectiveness
Document search with story graphs leads to averages of
67-75% accuracy on judgments of story fact truth
3.4 nodes/words per query – but this appears to depend on the corpus
How to spot events / eventfulness (1)
Search for properties that can be detected formally and visually
Tested so far: Graph topology
Global properties
– Graph size
– Number of connected components
– Size of connected components (avg, max)
Local properties
– Degree
– Degree centrality
– Sum of adjacent TR weights
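The global and local properties listed above can be computed directly from a story graph. The following is a minimal sketch, not the STORIES implementation: it assumes the graph is given as a dictionary mapping each undirected word pair (listed once) to its TR weight, it reports both node and edge counts because "graph size" is not specified further on the slide, and it uses the standard definition of degree centrality (degree divided by the number of other nodes). The function name `topology` and the example edges are hypothetical.

```python
from collections import defaultdict


def topology(edges):
    """Compute global and local topology measures of an undirected graph.

    edges: dict mapping (u, v) word pairs to a TR weight (assumed format,
    each undirected edge listed once).
    """
    adj = defaultdict(dict)
    for (u, v), weight in edges.items():
        adj[u][v] = weight
        adj[v][u] = weight
    nodes = list(adj)

    # Connected components via iterative depth-first search.
    seen, component_sizes = set(), []
    for start in nodes:
        if start in seen:
            continue
        seen.add(start)
        stack, size = [start], 0
        while stack:
            node = stack.pop()
            size += 1
            for neighbour in adj[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
        component_sizes.append(size)

    node_count = len(nodes)
    global_props = {
        "nodes": node_count,
        "edges": len(edges),
        "components": len(component_sizes),
        "avg_component": (sum(component_sizes) / len(component_sizes)
                          if component_sizes else 0.0),
        "max_component": max(component_sizes, default=0),
    }
    local_props = {
        node: {
            "degree": len(adj[node]),
            "degree_centrality": (len(adj[node]) / (node_count - 1)
                                  if node_count > 1 else 0.0),
            "tr_sum": sum(adj[node].values()),
        }
        for node in nodes
    }
    return global_props, local_props


# Hypothetical story graph for one time period.
example_edges = {
    ("mother", "suspect"): 2.5,
    ("suspect", "police"): 3.0,
    ("hotel", "resort"): 2.0,
}
glob, local = topology(example_edges)
```

Comparing these measures between consecutive time periods is one way to look for the changes in graph topology that the slide proposes as signals of eventfulness.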
Evidence 1 (a story begins)
Evidence 2 (a suspect is found)
Evidence 3 (An eventless time)
Spotting events (2):
Changes in story graphs mark the advent of an event
So what happens once you have found documents?
[Architecture diagram, with the same components and states as in the architecture & states overview (version 1).]
The neighbourhood of a document
Constructing the similarity measure & neighbourhood (I)
Constructing the similarity measure & neighbourhood (II)
Constructing the similarity measure & neighbourhood (III)
A news source
A German-language blog
Most neighbours are blogs
Most neighbours are English-language blogs
English blog
German blog
English news
Comparing documents
Comparing documents; utilizing multilingual sources
Refocusing
Structuring a neighbourhood
Ex.: Finding a “story”
Behind the scenes
Ingredients of a solution to ETP3 (= Evolutionary theme patterns discovery, summary and exploration)
Each ETP3 ingredient is followed by its realisation in STORIES:
Document / text pre-processing
• Template recognition
• Multi-document named entities
• Stopword removal, lemmatization
Document summarization strategy
• no topics, but salient concepts & relations
• time window; word-span window
Selection approach for concepts
• concepts = words or named entities
• salient concept = high TF & involved in a salient relation, time-indexed
Similarity measure to determine relations
• bursty co-occurrence
Burstiness measure
• time relevance, a “temporal co-occurrence lift”
Interaction approach
• Graphs (& layout)
• Comparative statics or morphing
• Drill-down: “uncovering” relations
• Links to documents (in progress)
Data collection and preprocessing
Articles from a news archive for search term identifying the top-level story
Only English-language articles
Only freely available articles
Preprocessing: (repeated from above)
HTML cleaning
tokenization
stopword removal
lemmatization
multi-document named-entity recognition
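As a rough illustration of the first three preprocessing steps (HTML cleaning, tokenization, stopword removal), here is a minimal sketch using only the Python standard library. It is not the pipeline used in the talk, which relies on external tools; lemmatization and multi-document named-entity recognition are deliberately left out because they need dedicated libraries. The names `preprocess` and `_TextExtractor`, the letters-only tokenizer, and the tiny stopword set are all assumptions made for the example.

```python
import re
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def preprocess(html, stopwords):
    """HTML cleaning -> lowercase tokenization -> stopword removal."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    text = " ".join(parser.parts)
    # Simplistic tokenizer (assumption): runs of ASCII letters only.
    tokens = re.findall(r"[a-z]+", text.lower())
    return [tok for tok in tokens if tok not in stopwords]


# Hypothetical input and stopword list.
sample_html = "<p>The <b>mother</b> was questioned.</p><script>var x;</script>"
sample_tokens = preprocess(sample_html, stopwords={"the", "was"})
```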
Story elements
content-bearing words
the 150 top-TF words without stopwords
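Selecting the content-bearing words could look like the sketch below. It assumes that TF means the term frequency summed over the whole corpus and that documents arrive as token lists; the function name `content_bearing_words` and the toy data are hypothetical.

```python
from collections import Counter


def content_bearing_words(docs, stopwords, k=150):
    """Return the k most frequent non-stopword tokens across all documents."""
    tf = Counter(
        tok for doc in docs for tok in doc if tok not in stopwords
    )
    return [word for word, _ in tf.most_common(k)]


# Toy corpus (hypothetical); k is reduced so the effect is visible.
toy_docs = [
    ["the", "police", "search", "the", "resort"],
    ["police", "question", "the", "mother"],
    ["police", "search", "continues"],
]
top_words = content_bearing_words(toy_docs, stopwords={"the"}, k=2)
```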
Story stages: co-occurrence in a window
“mother” and “suspect” co-occur
• in a window of size ≥ 6 (all words)
• in a window of size ≥ 2 (non-stopwords only)
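The window test can be sketched as follows: find the smallest run of consecutive tokens that contains both words, then compare it with the chosen window size. This is an illustrative sketch under the assumption that window size counts tokens inclusively; the sentence and the stopword set are made up so that they reproduce the sizes 6 and 2 from the slide, and the function name `min_window` is hypothetical.

```python
def min_window(tokens, w1, w2):
    """Size of the smallest run of consecutive tokens containing w1 and w2.

    Returns None if the two words never both occur in the token list.
    """
    best = None
    last1 = last2 = None
    for i, tok in enumerate(tokens):
        if tok == w1:
            last1 = i
        if tok == w2:
            last2 = i
        if last1 is not None and last2 is not None:
            size = abs(last1 - last2) + 1
            if best is None or size < best:
                best = size
    return best


# Made-up sentence and stopword list.
all_words = ["mother", "who", "was", "not", "a", "suspect"]
stop = {"who", "was", "not", "a"}
content_words = [t for t in all_words if t not in stop]

window_all = min_window(all_words, "mother", "suspect")          # all words
window_content = min_window(content_words, "mother", "suspect")  # non-stopwords
```

Two words co-occur for a given window size k whenever the returned value is not None and is at most k.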
Salient story elements
1. Split whole corpus T by week (or some other time interval)
2. For each week
Compute the weights for corpus t for this week
3. Weight =
Support of co-occurrence of 2 content-bearing words w1, w2 in t =
(# articles from t containing both w1, w2 in window) / (# all articles in t)
4. Threshold
Number of occurrences of co-occurrence(w1, w2) in t ≥ θ1 (e.g., 5)
Time-relevance TR of co-occurrence(w1, w2) =
support(co-occurrence(w1, w2)) in t / support(co-occurrence(w1, w2)) in T ≥ θ2 (e.g., 2) *
5. Rank by TR, for each week identify top 2
6. Story elements = peak words = all elements of these top 2 pairs (# = 38)
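Steps 1 to 6 can be sketched as below. This is a hedged illustration rather than the STORIES code: it assumes each article is already reduced to its week number and the set of word pairs that co-occur in a window, and it reads "number of occurrences" in step 4 as the number of articles in the week that contain the pair. The function name `peak_words`, the input format, and the toy data are hypothetical; `theta1` is lowered in the example because the toy corpus is tiny.

```python
from collections import Counter, defaultdict


def peak_words(articles, theta1=5, theta2=2.0, top_k=2):
    """articles: list of (week, set of sorted word-pair tuples).

    Returns (ranked, peaks): per-week top pairs with their TR, and the
    union of all words in those pairs.
    """
    by_week = defaultdict(list)
    for week, pairs in articles:
        by_week[week].append(pairs)

    total_articles = len(articles)
    corpus_counts = Counter(p for _, pairs in articles for p in pairs)

    ranked, peaks = {}, set()
    for week, docs in by_week.items():
        week_counts = Counter(p for pairs in docs for p in pairs)
        scored = []
        for pair, count in week_counts.items():
            if count < theta1:          # threshold on occurrences in t
                continue
            support_t = count / len(docs)
            support_T = corpus_counts[pair] / total_articles
            tr = support_t / support_T  # time relevance
            if tr >= theta2:
                scored.append((tr, pair))
        # Highest TR first; ties broken alphabetically for determinism.
        scored.sort(key=lambda item: (-item[0], item[1]))
        ranked[week] = scored[:top_k]
        for _, pair in ranked[week]:
            peaks.update(pair)
    return ranked, peaks


A = ("mother", "suspect")
B = ("police", "search")
C = ("arrest", "suspect")
toy_articles = [
    (1, {A, B}), (1, {A, B}), (1, {A}), (1, set()),
    (2, {B, C}), (2, {B, C}), (2, {C}), (2, {C}),
]
ranked, peaks = peak_words(toy_articles, theta1=2, theta2=2.0)
```

In the toy data, pair B is equally frequent in both weeks, so its TR is 1 and it is filtered out; A and C are each concentrated in one week and reach TR = 2.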
Salient story stages, and story evolution
7. Story stage = co-occurrences of peak words in t
For each week t: aggregate over t-2, t-1, t (moving average)
8. Story evolution = how story stages evolve over the t in T
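The three-week aggregation can be sketched as a simple moving average over per-week co-occurrence weights of peak-word pairs. The sketch assumes weeks are numbered with consecutive integers and that a pair missing from a week contributes weight zero; the function name `story_stage`, the input format, and the numbers are hypothetical.

```python
def story_stage(weekly_weights, t, span=3):
    """Average each pair's weight over weeks t-span+1 .. t.

    weekly_weights: dict mapping week -> {pair: weight}.
    Pairs absent from a week are treated as weight 0.0 (assumption).
    """
    weeks = range(t - span + 1, t + 1)
    pairs = set().union(*(weekly_weights.get(w, {}) for w in weeks))
    return {
        pair: sum(weekly_weights.get(w, {}).get(pair, 0.0) for w in weeks) / span
        for pair in pairs
    }


pair_a = ("mother", "suspect")
pair_b = ("arrest", "suspect")
toy_weights = {
    1: {pair_a: 0.25},
    2: {pair_a: 0.5, pair_b: 0.25},
    3: {pair_b: 0.5},
}
stage_3 = story_stage(toy_weights, t=3)
```

Story evolution then corresponds to the sequence of such stage dictionaries for successive values of t.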
Summary
Navigation in story space (story building)
+
Document search
+
Navigation in document space
lead to understandable, useful + intuitive interfaces
Datasets: Where does STORIES not work well?
the need for contiguity and a sizeable dataset
DUC dataset
the need for action and a storyline
Tsunami dataset
the benefit of hindsight
Enron
My questions to you – and: The Future
linguistic analysis
sparsity problem
which word classes, which relations?
semantic enhancements
linkage information
indexing and ranking for search
graph mining
business models for search
news/blogs
other texts
Thanks!
... also for:
your questions!