time.mk - igor trajkovski @ glocal: inside social media


Computer generated news site

TIME.mk

Dr Igor Trajkovski

New York University Skopje, Skopje, Macedonia

16.10.2009

Trajkovski: Computer generated news site - TIME.mk


Outline

1 Introduction

2 Crawler

3 Clustering

4 Classification

5 Scoring

6 Results

7 Conclusion



Motivation

Traditionally, news readers first pick a medium and then look for headlines that interest them.

New: pick a story first, then read what the different media wrote about it.



Google Approach: news.google.com



Google Approach: www.time.mk






Crawling links

Many Macedonian news sites don’t have RSS feeds.

Crawling hubs:

- Macedonia http://www.a1.com.mk/vesti/kategorija.aspx?KatID=1

- Economy http://www.a1.com.mk/vesti/kategorija.aspx?KatID=2

- Culture http://novamakedonija.com.mk/DesktopDefault.aspx?tabid=2&CategID=26

- None www.bbc.co.uk/macedonian/

Many hubs per source

- fixed addresses of the hubs (A1, Makfax, etc.)

- dynamic addresses of the hubs (Dnevnik, Utrinski, Sitel, etc.)

Hubs already classified into predefined topics:

- Macedonia, Balkan, World, Economy, Skopje, Culture, Technology, Fun, LifeStyle, ShowBiz, Sport, Chronicle, None

At the moment, hubs of the sources are provided manually.




WWW “jungle”

Some of the issues:

News articles change their title (A1, BBC, Netpress, etc.)

News articles change their address (Sitel, Dnevnik, etc.)

Corrupted HTML pages (these required developing a customized UTF-8 decoder)

News articles from several topics appear on the same page



Crawler statistics

100 parallel crawlers (threads) running on a quad core machine

50 news sources (+3 per month), 350 hubs

4500 live news articles (archives not included)

1400 new news articles (10-15% duplicates)

Crawling of all hubs takes 1-2 minutes

Refresh rate: 10 minutes
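
The figures above suggest a pool of worker threads that fetches every hub once per refresh cycle. A minimal sketch of one such cycle follows. It assumes a caller-supplied `fetch` function (hypothetical; any HTTP client could fill that role), and the names `LinkExtractor` and `crawl_hubs` are illustrative, not taken from TIME.mk.

```python
from concurrent.futures import ThreadPoolExecutor
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects absolute URLs from the <a href="..."> tags of one hub page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(urljoin(self.base_url, href))


def crawl_hubs(hubs, fetch, max_workers=100):
    """Fetch every hub in parallel and return {article_url: topic}.

    hubs  -- list of (topic, hub_url) pairs, provided manually
    fetch -- hypothetical callable: hub_url -> HTML text
    """
    def crawl_one(hub):
        topic, url = hub
        parser = LinkExtractor(url)
        parser.feed(fetch(url))
        return topic, parser.links

    found = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in the order of the hubs list.
        for topic, links in pool.map(crawl_one, hubs):
            for link in links:
                # Keep the first topic seen for a link found on several hubs.
                found.setdefault(link, topic)
    return found
```

This sketch keeps every link on a hub page, including navigation links; a real crawler would need per-source rules to keep only article links, and would call `crawl_hubs` once per refresh interval.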



Information extraction






Clustering problem



Keywords extraction

The weight vector for document $d$ is:

$$v_d = [w_{1,d}, w_{2,d}, \ldots, w_{K,d}]^T$$

where:

$$w_{t,d} = tf_t \cdot \log\frac{|D|}{|D_t|}$$

and:

$K$ is the number of distinct words occurring in all news articles,

$|D_t|$ is the number of news articles containing the term $t$,

$tf_t$ is the term frequency of term $t$ in document $d$ (a local parameter), and

$\log\frac{|D|}{|D_t|}$ is the inverse document frequency (a global parameter), where $|D|$ is the total number of news articles.

The model is known as the Term Frequency-Inverse Document Frequency (TF-IDF) model.
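
The weighting above can be sketched in a few lines of Python. This is an illustration under assumptions, not the TIME.mk code: articles are taken to be already tokenized, the term frequency is the raw count, and the logarithm is natural. The function name `tfidf_vectors` is hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(documents):
    """Return one sparse {term: weight} dict per document.

    documents -- list of token lists (tokenization is assumed done elsewhere)
    Weight: w[t, d] = tf(t, d) * log(|D| / |D_t|), with raw counts as tf.
    """
    n_docs = len(documents)
    # |D_t|: number of documents that contain term t at least once.
    doc_freq = Counter(term for doc in documents for term in set(doc))
    vectors = []
    for doc in documents:
        tf = Counter(doc)
        vectors.append(
            {t: count * math.log(n_docs / doc_freq[t]) for t, count in tf.items()}
        )
    return vectors
```

A term that occurs in every article gets weight 0, since $\log(|D|/|D_t|) = \log 1 = 0$, so very common words drop out of the keyword list on their own.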



Keywords extraction (cont.)

Each news article is represented by the top $N_{kw}$ keywords, i.e. those with the largest weights.

The weight vector $v_d$ is normalized, $\|v_d\| = 1$.

The similarity of two news articles is computed from the angle between their vectors:

$$\cos\theta = \frac{v_1 \cdot v_2}{\|v_1\| \, \|v_2\|}$$
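
A minimal sketch of the two steps on this slide, assuming each article is held as a sparse `{term: weight}` dictionary. The function names are illustrative, and how ties among equal weights are broken is not specified in the slides.

```python
import math


def top_keywords(weights, n_kw):
    """Keep the n_kw heaviest terms and scale the vector to unit length."""
    top = sorted(weights.items(), key=lambda kv: kv[1], reverse=True)[:n_kw]
    norm = math.sqrt(sum(w * w for _, w in top))
    if norm == 0.0:
        return {}
    return {t: w / norm for t, w in top}


def cosine_similarity(v1, v2):
    """Cosine of the angle between two sparse {term: weight} vectors."""
    dot = sum(w * v2.get(t, 0.0) for t, w in v1.items())
    n1 = math.sqrt(sum(w * w for w in v1.values()))
    n2 = math.sqrt(sum(w * w for w in v2.values()))
    if n1 == 0.0 or n2 == 0.0:
        return 0.0
    return dot / (n1 * n2)
```

Because the vectors are already normalized, the denominator is 1 and the similarity reduces to the dot product over the keywords the two articles share.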



Hierarchical Agglomerative Clustering (HAC)






News story classification

Problem: Most news sources do not classify their news articles in as much detail as TIME.mk does. Example:

Balkan articles published in section World

Technology, ShowBiz, LifeStyle in Fun,

Skopje in Macedonia, etc.

They also use topic names that differ from TIME.mk's topic names. Example:

Region for Balkan

Trends for Culture,

Future for Technology, etc.



News story classification using ontology

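Let $N_1$ be the number of articles from the most frequent class $C_1$.

Let $N_2$ be the number of articles from the second most frequent class $C_2$.

$$\text{Class} = \begin{cases} C_1, & \text{if } C_1 \text{ is more specific than, or unrelated to, } C_2 \\ C_2, & \text{if } C_2 \text{ is more specific than } C_1 \text{ and } N_1 \le 2 \cdot N_2 \\ C_1, & \text{if } C_2 \text{ is more specific than } C_1 \text{ and } N_1 > 2 \cdot N_2 \end{cases}$$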
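
The voting rule can be sketched as follows. The ontology here is a hypothetical child-to-parent map with two relations suggested by the earlier examples (Skopje under Macedonia, Balkan under World); the actual TIME.mk ontology is not given in the slides, and the function names are illustrative.

```python
from collections import Counter

# Hypothetical ontology: child topic -> more general parent topic.
PARENT = {"Skopje": "Macedonia", "Balkan": "World"}


def is_more_specific(a, b, parent=PARENT):
    """True if topic a is a descendant of topic b (ontology assumed acyclic)."""
    node = parent.get(a)
    while node is not None:
        if node == b:
            return True
        node = parent.get(node)
    return False


def classify_cluster(hub_topics, parent=PARENT):
    """Pick a cluster's class from the hub topics of its articles."""
    ranked = Counter(hub_topics).most_common(2)
    if not ranked:
        return None
    if len(ranked) == 1:
        return ranked[0][0]
    (c1, n1), (c2, n2) = ranked
    # C2 wins only when it refines C1 and C1 has at most twice its votes.
    if is_more_specific(c2, c1, parent) and n1 <= 2 * n2:
        return c2
    return c1
```

For example, a cluster with four articles from Macedonia hubs and two from Skopje hubs is labeled Skopje ($4 \le 2 \cdot 2$), while five against two stays Macedonia.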






General properties for any news ranking algorithm

Any ranking algorithm for news stories (clusters) should have at least the following four properties:

Time awareness. The importance of a piece of news changes over time.

An important news story is covered by many news articles. The weighted size of the cluster is a measure of its importance.

Authority of the sources. BBC is more authoritative than Kirilica.

Diversity. A news story reported by three different news sources is more important than a news story of three articles published by one source.



TIME.mk ranking algorithm

Weight of the cluster $c$ at moment $t$:

$$WC(c, t) = \mathrm{SourceEntropy}(c) \cdot \sum_{i=1}^{k} WN(n_i, t)$$

where:

$k$ is the size of the cluster $c$.

$n_i$ are the news articles of cluster $c$.

$\mathrm{SourceEntropy}(c)$ represents the entropy of the news sources.

$WN(n_i, t)$ is the weight of news article $n_i$ at time $t$, which was published at time $t_i$:

$$WN(n_i, t) = A(\mathrm{source}(n_i)) \cdot e^{-\alpha (t - t_i)}, \quad t > t_i$$

$A(s)$ accounts for the authority of the source. A source is more authoritative if it is cited more often than other sources.
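
A sketch of the scoring formulas, under stated assumptions: the slides do not define SourceEntropy, so Shannon entropy of the source distribution within the cluster is assumed here; the authority table and the value of $\alpha$ are placeholders, and all function names are illustrative.

```python
import math
from collections import Counter


def source_entropy(sources):
    """Shannon entropy (in nats) of the source distribution in a cluster.

    Assumption: the slides name the term but do not give its formula.
    """
    total = len(sources)
    counts = Counter(sources)
    return sum((n / total) * math.log(total / n) for n in counts.values())


def article_weight(authority, published_at, now, alpha):
    """WN(n_i, t) = A(source(n_i)) * exp(-alpha * (t - t_i))."""
    return authority * math.exp(-alpha * (now - published_at))


def cluster_weight(articles, authority, now, alpha=0.1):
    """WC(c, t) for a cluster given as a list of (source, published_at) pairs.

    authority -- hypothetical {source: A(source)} table
    alpha     -- decay rate; 0.1 per time unit is an arbitrary placeholder
    """
    if not articles:
        return 0.0
    entropy = source_entropy([src for src, _ in articles])
    total = sum(
        article_weight(authority[src], t_i, now, alpha) for src, t_i in articles
    )
    return entropy * total
```

With plain Shannon entropy, a cluster whose articles all come from one source scores 0, which is a strong form of the diversity property; whether TIME.mk softens this is not stated in the slides.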






Dataset

Category        #news     Category       #news
Macedonia       58,596    Sport          43,498
Balkan          19,341    Chronicle      10,108
World           29,647    Culture        16,933
Economy         29,754    Technology      5,755
Skopje           7,325    Fun/Showbiz    37,798
Uncategorized   58,205    Total         316,960

Collected over a period of one year (from 01/07/08 to 30/06/09).



Classification of News Stories

Ontology majority voting algorithm for cluster classification.

Category     Precision  Recall    Category      Precision  Recall
Macedonia    95%        98%       Sport         98%        96%
Balkan       94%        96%       Chronicle     98%        98%
World        98%        98%       Culture       98%        96%
Economy      97%        96%       Technology    98%        98%
Skopje       98%        96%       Fun/Showbiz   98%        97%

100 manually labeled clusters per category were used as the evaluation set.
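
The slides do not spell out how precision and recall were computed. The sketch below assumes the standard per-category definitions, applied to parallel lists of manual and predicted cluster labels; the function name is illustrative.

```python
def precision_recall(true_labels, predicted_labels, category):
    """Precision and recall of one category over parallel label lists."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == category and p == category)
    fp = sum(1 for t, p in pairs if t != category and p == category)
    fn = sum(1 for t, p in pairs if t == category and p != category)
    # Precision: share of clusters predicted as `category` that truly are.
    precision = tp / (tp + fp) if tp + fp else 0.0
    # Recall: share of true `category` clusters that were predicted as such.
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```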



Visits statistics






Conclusion and future work

This work was motivated by the wide use of news engines and the lack of academic papers in this area.

The design and implementation details of a full-scale news engine were presented.

A model for scoring news articles and news stories was presented.

Extensive testing on more than 300,000 news articles, posted by 50 sources over one year, was performed and showed very encouraging results.

Future work: scoring of news articles and news stories. A new “semantic” model for representing news is needed to replace the current “bag of words” model.

