time.mk - igor trajkovski @ glocal: inside social media
Introduction Crawler Clustering Classification Scoring Results Conclusion
Computer generated news site
TIME.mk
Dr Igor Trajkovski
New York University Skopje
Skopje, Macedonia
16.10.2009
Trajkovski: Computer generated news site - TIME.mk 1 / 28
Outline
1 Introduction
2 Crawler
3 Clustering
4 Classification
5 Scoring
6 Results
7 Conclusion
Motivation
Traditionally, news readers first pick a medium and then look for headlines that interest them.
NEW: Pick a story, then read what the different media wrote about it.
Google Approach: news.google.com
Google Approach: www.time.mk
Crawling links
Many Macedonian news sites don’t have RSS feeds.
Crawling hubs:
- Macedonia http://www.a1.com.mk/vesti/kategorija.aspx?KatID=1
- Economy http://www.a1.com.mk/vesti/kategorija.aspx?KatID=2
- Culture http://novamakedonija.com.mk/DesktopDefault.aspx?tabid=2&CategID=26
- None www.bbc.co.uk/macedonian/
Many hubs per source
- fixed addresses of the hubs (A1, Makfax, etc.)
- dynamic addresses of the hubs (Dnevnik, Utrinski, Sitel, etc.)
Hubs already classified into predefined topics:
- Macedonia, Balkan, World, Economy, Skopje, Culture, Technology, Fun, LifeStyle, ShowBiz, Sport, Chronicle, None
At the moment, hubs of the sources are provided manually.
WWW “jungle”
Some of the issues:
News articles change their title (A1, BBC, Netpress, etc.)
News articles change their address (Sitel, Dnevnik, etc.)
Corrupted HTML pages (requiring the development of a customized UTF-8 decoder)
News articles from several topics appear on the same page
Crawler statistics
100 parallel crawlers (threads) running on a quad core machine
50 news sources (+3 per month), 350 hubs
4500 live news articles (archives not included)
1400 new news articles (10-15% duplicates)
Crawling of all hubs takes 1-2 minutes
Refresh rate: 10 minutes
Information extraction
Clustering problem
Keywords extraction
The weight vector for document $d$ is:
$$v_d = [w_{1,d}, w_{2,d}, \ldots, w_{K,d}]^T$$
where:
$$w_{t,d} = tf_t \cdot \log\frac{|D|}{|D_t|}$$
and:
$K$ is the number of distinct words occurring in all news articles,
$|D_t|$ is the number of news articles containing the term $t$,
$tf_t$ is the term frequency of term $t$ in document $d$ (a local parameter), and
$\log\frac{|D|}{|D_t|}$ is the inverse document frequency (a global parameter), where $|D|$ is the total number of news articles.
The model is known as the Term Frequency-Inverse Document Frequency (TF-IDF) model.
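The TF-IDF weighting above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: term frequency is taken as the raw count, the logarithm is natural, and articles arrive already tokenized. None of these choices is specified on the slides, and the function name `tfidf_vectors` is hypothetical.

```python
import math
from collections import Counter

def tfidf_vectors(documents):
    """Compute w_{t,d} = tf_t * log(|D| / |D_t|) for every document.

    documents: list of token lists (tokenization is assumed to be done already).
    Returns one dict per document mapping term -> weight.
    """
    n_docs = len(documents)
    # |D_t|: number of documents that contain term t at least once.
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))
    vectors = []
    for tokens in documents:
        tf = Counter(tokens)  # assumption: raw counts as term frequency
        vectors.append(
            {t: count * math.log(n_docs / doc_freq[t]) for t, count in tf.items()}
        )
    return vectors

# Toy example with two "articles".
docs = [["vlada", "budget", "budget"], ["vlada", "sport"]]
vectors = tfidf_vectors(docs)
```

A term that occurs in every article (here "vlada") gets weight zero, which is why such words never become keywords.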
Keywords extraction (cont.)
A news article is represented by the top $N_{kw}$ keywords that have the largest weight.
The weight vector $v_d$ is normalized, $\|v_d\| = 1$.
The similarity of two news articles is computed from the angle between their vectors:
$$\cos\theta = \frac{v_1 \cdot v_2}{\|v_1\| \|v_2\|}$$
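The keyword selection, normalization, and cosine similarity described above might look like the following sketch. Vectors are stored as sparse dictionaries (term to weight); the function names and the sparse representation are assumptions made for illustration, not details taken from TIME.mk.

```python
import math

def top_keywords(weights, n_kw):
    """Keep the n_kw highest-weighted terms and scale the result to unit length."""
    top = sorted(weights.items(), key=lambda kv: kv[1], reverse=True)[:n_kw]
    norm = math.sqrt(sum(w * w for _, w in top))
    if norm == 0:
        return {}
    return {t: w / norm for t, w in top}

def cosine_similarity(v1, v2):
    """cos(theta) = (v1 . v2) / (||v1|| ||v2||) for sparse term -> weight dicts."""
    dot = sum(w * v2.get(t, 0.0) for t, w in v1.items())
    n1 = math.sqrt(sum(w * w for w in v1.values()))
    n2 = math.sqrt(sum(w * w for w in v2.values()))
    if n1 == 0 or n2 == 0:
        return 0.0  # convention chosen here for empty vectors
    return dot / (n1 * n2)

# Toy example: keep the two strongest keywords of a three-term vector.
article = top_keywords({"x": 3.0, "y": 4.0, "z": 1.0}, n_kw=2)
```

Because the vectors are already normalized, the denominator equals 1 and the similarity reduces to a plain dot product.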
Hierarchical Agglomerative Clustering (HAC)
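The slide names the technique without giving its parameters, so the following is a generic sketch of HAC rather than the TIME.mk implementation. It assumes average linkage, a similarity threshold as the stopping rule, and unit-length sparse vectors; `hac`, `dot`, and `threshold` are hypothetical names.

```python
def dot(v1, v2):
    """Dot product of sparse term -> weight dicts (cosine for unit vectors)."""
    return sum(w * v2.get(t, 0.0) for t, w in v1.items())

def hac(vectors, similarity, threshold):
    """Naive agglomerative clustering; returns lists of article indices.

    Assumptions: average linkage, and merging stops once the most similar
    pair of clusters falls below `threshold`.
    """
    clusters = [[i] for i in range(len(vectors))]

    def linkage(a, b):
        # Average pairwise similarity between members of two clusters.
        total = sum(similarity(vectors[i], vectors[j]) for i in a for j in b)
        return total / (len(a) * len(b))

    while len(clusters) > 1:
        best_sim, (i, j) = max(
            (linkage(clusters[i], clusters[j]), (i, j))
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        if best_sim < threshold:
            break
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

# Two articles share a keyword, the third does not.
stories = hac([{"a": 1.0}, {"a": 1.0}, {"b": 1.0}], dot, threshold=0.5)
```

This naive version recomputes all pairwise linkages on every merge, which is fine for a toy example but would need caching for thousands of live articles.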
News story classification
Problem: Most news sources do not classify their news articles in as much detail as TIME.mk does.
Example:
- Balkan articles published in section World,
- Technology, ShowBiz, LifeStyle in Fun,
- Skopje in Macedonia, etc.
Also, they use topic names that differ from TIME.mk's topic names.
Example:
- Region for Balkan,
- Trends for Culture,
- Future for Technology, etc.
News story classification using ontology
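Let $N_1$ be the number of articles from the most frequent class $C_1$.
Let $N_2$ be the number of articles from the second most frequent class $C_2$.
$$\text{Class} = \begin{cases}
C_1, & \text{if } C_1 \text{ is more specific than, or has no relation with, } C_2 \\
C_2, & \text{if } C_2 \text{ is more specific than } C_1 \text{ and } N_1 \le 2 \cdot N_2 \\
C_1, & \text{if } C_2 \text{ is more specific than } C_1 \text{ and } N_1 > 2 \cdot N_2
\end{cases}$$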
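The voting rule above can be sketched as follows. The ontology is not shown on the slides, so the `PARENT` map below is a guess built from the examples given earlier (Skopje under Macedonia, Balkan under World, and so on); it and the function names are hypothetical.

```python
from collections import Counter

# Hypothetical "child -> parent" ontology, guessed from the slide examples.
# The real TIME.mk ontology is not given in the source.
PARENT = {
    "Skopje": "Macedonia",
    "Balkan": "World",
    "Technology": "Fun",
    "ShowBiz": "Fun",
    "LifeStyle": "Fun",
}

def is_more_specific(a, b):
    """True if topic a is a descendant of topic b in PARENT."""
    node = PARENT.get(a)
    while node is not None:
        if node == b:
            return True
        node = PARENT.get(node)
    return False

def classify_cluster(article_topics):
    """Pick the cluster topic from the hub topics of its articles."""
    ranked = Counter(article_topics).most_common(2)
    if not ranked:
        raise ValueError("cluster has no articles")
    c1, n1 = ranked[0]
    if len(ranked) == 1:
        return c1
    c2, n2 = ranked[1]
    # Prefer the more specific runner-up unless C1 has more than twice its votes.
    if is_more_specific(c2, c1) and n1 <= 2 * n2:
        return c2
    return c1

topic = classify_cluster(["Macedonia"] * 3 + ["Skopje"] * 2)
```

In the usage line, Skopje is more specific than Macedonia and 3 ≤ 2 · 2, so the cluster is labeled Skopje even though Macedonia has more votes.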
General properties for any news ranking algorithm
Any ranking algorithm for news stories (clusters) should have at least the following four properties:
Time awareness. The importance of a piece of news changes over time.
An important news story is covered by many news articles. The weighted size of the cluster is a measure of its importance.
Authority of the sources. BBC is more authoritative than Kirilica.
Diversity. A news story reported by three different news sources is more important than a news story of three articles published by one source.
TIME.mk ranking algorithm
Weight of the cluster $c$ at moment $t$:
$$W_C(c, t) = \mathrm{SourceEntropy}(c) \cdot \sum_{i=1}^{k} W_N(n_i, t)$$
where:
$k$ is the size of the cluster $c$.
$n_i$ are the news articles of cluster $c$.
$\mathrm{SourceEntropy}(c)$ represents the entropy of the news sources.
$W_N(n_i, t)$ is the weight of news article $n_i$ at time $t$, which has been published at time $t_i$:
$$W_N(n_i, t) = A(\mathrm{source}(n_i)) \cdot e^{-\alpha (t - t_i)}, \quad t > t_i$$
$A(s)$ accounts for the authority of the source. A source is more authoritative if it is cited more often than other sources.
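A minimal sketch of the cluster weight is given below. The slides do not define SourceEntropy, the authority values, or α, so this example assumes Shannon entropy (natural log) over the sources in the cluster, a caller-supplied authority table, and an arbitrary decay rate; all names are hypothetical. Under this particular entropy assumption a single-source cluster gets weight zero, which the actual system may handle differently.

```python
import math
from collections import Counter

def source_entropy(sources):
    """Shannon entropy of the source distribution (an assumed definition)."""
    total = len(sources)
    return sum(
        -(n / total) * math.log(n / total) for n in Counter(sources).values()
    )

def article_weight(authority, published_at, now, alpha):
    """W_N = A(source) * exp(-alpha * (t - t_i)).

    The formula is stated for t > t_i; t == t_i is tolerated here.
    """
    if now < published_at:
        raise ValueError("article is dated in the future")
    return authority * math.exp(-alpha * (now - published_at))

def cluster_weight(articles, now, authority, alpha=0.1):
    """W_C = SourceEntropy(c) * sum of article weights.

    articles: list of (source, published_at) pairs; times in arbitrary units.
    authority: dict source -> A(s); the values are illustrative only.
    """
    entropy = source_entropy([s for s, _ in articles])
    return entropy * sum(
        article_weight(authority[s], t_i, now, alpha) for s, t_i in articles
    )

authority = {"A": 1.0, "B": 1.0}
fresh = cluster_weight([("A", 9.0), ("B", 9.0)], 10.0, authority, alpha=0.5)
stale = cluster_weight([("A", 0.0), ("B", 0.0)], 10.0, authority, alpha=0.5)
```

The two usage lines illustrate time awareness: the same story scores lower once its articles are older.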
Dataset
Category        #news      Category       #news
Macedonia       58,596     Sport          43,498
Balkan          19,341     Chronicle      10,108
World           29,647     Culture        16,933
Economy         29,754     Technology      5,755
Skopje           7,325     Fun/Showbiz    37,798
Uncategorized   58,205     Total         316,960
Collected over a period of one year (from 01/07/08 to 30/06/09).
Classification of News Stories
Ontology majority voting algorithm for cluster classification.
Category     Precision  Recall    Category      Precision  Recall
Macedonia    95%        98%       Sport         98%        96%
Balkan       94%        96%       Chronicle     98%        98%
World        98%        98%       Culture       98%        96%
Economy      97%        96%       Technology    98%        98%
Skopje       98%        96%       Fun/Showbiz   98%        97%
100 manually labeled clusters per category were used as an evaluation set.
Visits statistics
Conclusion and future work
This work has been motivated by the wide use of news engines versus the lack of academic papers in this area.
The design and implementation details of a full-scale news engine were presented.
A model for scoring news articles and news stories was presented.
Extensive testing on more than 300,000 news articles, posted by 50 sources over one year, has been performed, showing very encouraging results.
Future work: Scoring of news articles and news stories. A new “semantic” model for representing news is needed to replace the current “bag of words” model.