Time.Mk - Igor Trajkovski @ Glocal: Inside Social Media



Page 1: Time.Mk - Igor Trajkovski @ Glocal: Inside Social Media

Introduction Crawler Clustering Classification Scoring Results Conclusion

Computer generated news site

TIME.mk

Dr Igor Trajkovski

New York University Skopje, Skopje, Macedonia

16.10.2009

Trajkovski: Computer generated news site - TIME.mk 1 / 28


Outline

1 Introduction

2 Crawler

3 Clustering

4 Classification

5 Scoring

6 Results

7 Conclusion


Motivation

Traditionally, news readers first pick a medium and then look for headlines that interest them.

NEW: Pick a story, then read what different media wrote about it.


Google Approach: news.google.com


Google Approach: www.time.mk


Crawling links

Many Macedonian news sites don’t have RSS feeds.

Crawling hubs:

- Macedonia http://www.a1.com.mk/vesti/kategorija.aspx?KatID=1

- Economy http://www.a1.com.mk/vesti/kategorija.aspx?KatID=2

- Culture http://novamakedonija.com.mk/DesktopDefault.aspx?tabid=2&CategID=26

- None www.bbc.co.uk/macedonian/

Many hubs per source

- fixed addresses of the hubs (A1, Makfax, etc.)

- dynamic addresses of the hubs (Dnevnik, Utrinski, Sitel, etc.)

Hubs are already classified into predefined topics:

- Macedonia, Balkan, World, Economy, Skopje, Culture, Technology, Fun, LifeStyle, ShowBiz, Sport, Chronicle, None

At the moment, hubs of the sources are provided manually.
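
The hub setup described above could be represented roughly as follows. This is a minimal sketch, not the TIME.mk implementation: the names `Hub`, `LinkCollector` and `extract_links` are hypothetical, the source name "A1" for the two listed addresses is an assumption, and the function works on an HTML string that has already been downloaded.

```python
from dataclasses import dataclass
from html.parser import HTMLParser
from urllib.parse import urljoin


@dataclass(frozen=True)
class Hub:
    source: str  # name of the news source (assumed label)
    topic: str   # predefined topic assigned by hand
    url: str     # address of the hub page


# Two of the hubs listed on the slide; the list is maintained manually.
HUBS = [
    Hub("A1", "Macedonia", "http://www.a1.com.mk/vesti/kategorija.aspx?KatID=1"),
    Hub("A1", "Economy", "http://www.a1.com.mk/vesti/kategorija.aspx?KatID=2"),
]


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def extract_links(hub, html):
    """Return absolute, de-duplicated link URLs found in a hub page."""
    parser = LinkCollector()
    parser.feed(html)
    seen, links = set(), []
    for href in parser.hrefs:
        absolute = urljoin(hub.url, href)  # resolve relative links against the hub
        if absolute not in seen:
            seen.add(absolute)
            links.append(absolute)
    return links
```

Every link found on a hub inherits the hub's topic, which is why the hubs have to be labeled in advance.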


WWW “jungle”

Some of the issues:

News articles change their title (A1, BBC, Netpress, etc.)

News articles change their address (Sitel, Dnevnik, etc.)

Corrupted HTML pages (need for a customized UTF-8 decoder)

News articles from several topics appear on the same page


Crawler statistics

100 parallel crawlers (threads) running on a quad core machine

50 news sources (+3 per month), 350 hubs

4500 live news articles (archives not included)

1400 new news articles (10-15% duplicates)

Crawling of all hubs takes 1-2 minutes

Refresh rate: 10 minutes
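
A crawl pass with the figures above might be organized like this. It is only a sketch under assumptions: the slides do not say how the threads are managed, so a standard thread pool is used here, and `fetch` is a hypothetical callable that downloads one hub page.

```python
from concurrent.futures import ThreadPoolExecutor

NUM_WORKERS = 100          # "100 parallel crawlers (threads)" from the slide
REFRESH_SECONDS = 10 * 60  # "Refresh rate: 10 minutes" from the slide


def crawl_all(hub_urls, fetch, workers=NUM_WORKERS):
    """Fetch every hub in parallel; return {url: page text, or None on failure}.

    `fetch` is a hypothetical callable that takes a URL and returns the page text.
    """
    hub_urls = list(hub_urls)

    def safe_fetch(url):
        try:
            return fetch(url)
        except Exception:
            return None  # one broken hub should not stop the whole pass

    with ThreadPoolExecutor(max_workers=workers) as pool:
        pages = list(pool.map(safe_fetch, hub_urls))
    return dict(zip(hub_urls, pages))
```

A scheduler would call `crawl_all` once every `REFRESH_SECONDS`; since a full pass is reported to take 1-2 minutes, each pass finishes well before the next one starts.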


Information extraction


Clustering problem


Keywords extraction

The weight vector for document $d$ is:

$$v_d = [w_{1,d}, w_{2,d}, \ldots, w_{K,d}]^T$$

where:

$$w_{t,d} = \mathrm{tf}_t \cdot \log\frac{|D|}{|D_t|}$$

and:

$K$ is the number of distinct words occurring in all news articles,

$|D_t|$ is the number of news articles containing the term $t$,

$\mathrm{tf}_t$ is the term frequency of term $t$ in document $d$ (a local parameter), and

$\log\frac{|D|}{|D_t|}$ is the inverse document frequency (a global parameter), where $|D|$ is the total number of news articles.

The model is known as the Term Frequency-Inverse Document Frequency (TF-IDF) model.
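
The weighting above can be sketched in a few lines. The assumptions here are that documents arrive already tokenized, that the raw term count is used as the term frequency, and that the logarithm is natural; the slides do not fix any of these. The function name `tfidf_vectors` is hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n_docs = len(docs)  # |D|
    # |D_t|: number of documents that contain term t at least once
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))
    vectors = []
    for tokens in docs:
        tf = Counter(tokens)  # raw counts used as tf_t (an assumption)
        vectors.append({t: tf[t] * math.log(n_docs / doc_freq[t]) for t in tf})
    return vectors
```

A term that occurs in every article gets weight zero, because its inverse document frequency is $\log 1 = 0$.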


Keywords extraction (cont.)

A news article is represented by the top $N_{kw}$ keywords that have the largest weight.

The weight vector $v_d$ is normalized, $\|v_d\| = 1$.

The similarity of two news articles is computed from the angle between their vectors:

$$\cos\theta = \frac{v_1 \cdot v_2}{\|v_1\|\,\|v_2\|}$$
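
The keyword selection and the similarity measure can be illustrated as follows. This is a sketch with hypothetical function names; vectors are stored as sparse `{term: weight}` dictionaries, and the value of the keyword limit is left as a parameter because the slides do not state it.

```python
import math


def top_keywords(weights, n_kw):
    """Keep the n_kw heaviest terms and scale the vector to unit length."""
    top = sorted(weights.items(), key=lambda kv: kv[1], reverse=True)[:n_kw]
    norm = math.sqrt(sum(w * w for _, w in top))
    if norm == 0:
        return {}
    return {t: w / norm for t, w in top}


def cosine(v1, v2):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(w * v2.get(t, 0.0) for t, w in v1.items())
    n1 = math.sqrt(sum(w * w for w in v1.values()))
    n2 = math.sqrt(sum(w * w for w in v2.values()))
    if n1 == 0 or n2 == 0:
        return 0.0
    return dot / (n1 * n2)
```

Because the vectors are normalized to length 1, the denominator equals 1 and the cosine reduces to the dot product; the general form is kept here so that `cosine` also works on unnormalized input.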


Hierarchical Agglomerative Clustering (HAC)
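
The slide names the technique without giving details, so the following is only a generic illustration of hierarchical agglomerative clustering. Average linkage and a similarity threshold as the stopping rule are assumptions, as are the names `hac`, `similarity` and `threshold`; the linkage and stopping criterion actually used by TIME.mk are not stated.

```python
def hac(vectors, similarity, threshold):
    """Naive agglomerative clustering; returns clusters as lists of indices.

    Starts with one cluster per item and repeatedly merges the most similar
    pair of clusters until no pair reaches `threshold` (assumed stopping rule).
    """
    clusters = [[i] for i in range(len(vectors))]

    def avg_sim(a, b):
        # Average linkage: mean pairwise similarity between two clusters.
        total = sum(similarity(vectors[i], vectors[j]) for i in a for j in b)
        return total / (len(a) * len(b))

    while len(clusters) > 1:
        best, pair = threshold, None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                s = avg_sim(clusters[x], clusters[y])
                if s >= best:
                    best, pair = s, (x, y)
        if pair is None:
            break  # no pair is similar enough; stop merging
        x, y = pair
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return clusters
```

This naive version recomputes all pairwise similarities on every pass, which is fine for illustration but too slow for thousands of live articles.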


News story classification

Problem: Most news sources don’t classify their news articles in as much detail as TIME.mk does.

Example:

Balkan articles published in the section World,

Technology, ShowBiz and LifeStyle in Fun,

Skopje in Macedonia, etc.

They also use topic names that differ from TIME.mk’s topic names.

Example:

Region for Balkan,

Trends for Culture,

Future for Technology, etc.


News story classification using ontology

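Let $N_1$ be the number of articles from the most frequent class $C_1$.

Let $N_2$ be the number of articles from the second most frequent class $C_2$.

$$\mathrm{Class} = \begin{cases} C_1, & \text{if } C_1 \text{ is more specific than } C_2 \text{ or has no relation with } C_2 \\ C_2, & \text{if } C_2 \text{ is more specific than } C_1 \text{ and } N_1 \le 2 \cdot N_2 \\ C_1, & \text{if } C_2 \text{ is more specific than } C_1 \text{ and } N_1 > 2 \cdot N_2 \end{cases}$$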
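
The voting rule can be written out directly. In this sketch the ontology is a child-to-parent map inferred from the examples on the classification slide (Balkan under World, Skopje under Macedonia, and so on); that map is an assumption, since the full ontology is not shown, and the function names are hypothetical.

```python
from collections import Counter

# Child -> parent relations inferred from the slide examples
# (an assumption; the full ontology is not given in the slides).
PARENT = {
    "Balkan": "World",
    "Skopje": "Macedonia",
    "Technology": "Fun",
    "ShowBiz": "Fun",
    "LifeStyle": "Fun",
}


def more_specific(a, b):
    """True if topic a is a descendant of topic b in the ontology."""
    node = PARENT.get(a)
    while node is not None:
        if node == b:
            return True
        node = PARENT.get(node)
    return False


def classify_cluster(article_topics):
    """Apply the voting rule to the hub topics of a cluster's articles."""
    ranked = Counter(article_topics).most_common(2)
    if not ranked:
        return None
    c1, n1 = ranked[0]
    if len(ranked) == 1:
        return c1
    c2, n2 = ranked[1]
    # The runner-up wins only if it is more specific and not outvoted 2:1.
    if more_specific(c2, c1) and n1 <= 2 * n2:
        return c2
    return c1
```

For example, a cluster with four articles from World hubs and three from Balkan hubs is labeled Balkan, while seven World articles against three Balkan articles keep the label World.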


General properties for any news ranking algorithm

Any ranking algorithm for news stories (clusters) should have at least the following four properties:

Time awareness. The importance of a piece of news changes over time.

Important news stories are covered by many news articles. The weighted size of the cluster is a measure of its importance.

Authority of the sources. BBC is more authoritative than Kirilica.

Diversity. A news story reported by three different news sources is more important than a news story of three articles published by one source.


TIME.mk ranking algorithm

Weight of the cluster $c$ at moment $t$:

$$W_C(c, t) = \mathrm{SourceEntropy}(c) \cdot \sum_{i=1}^{k} W_N(n_i, t)$$

where:

$k$ is the size of the cluster $c$.

$n_i$ are the news articles of cluster $c$.

$\mathrm{SourceEntropy}(c)$ represents the entropy of the news sources.

$W_N(n_i, t)$ is the weight at time $t$ of news article $n_i$, which was published at time $t_i$:

$$W_N(n_i, t) = A(\mathrm{source}(n_i)) \cdot e^{-\alpha (t - t_i)}, \quad t > t_i$$

$A(s)$ accounts for the authority of the source. A source is more authoritative if it is cited more often than other sources.
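
The two formulas combine as in the sketch below. The slides do not define the source entropy precisely, so Shannon entropy with a natural logarithm over the share of articles per source is assumed; the authority values, the decay rate `alpha` and the function names are likewise placeholders.

```python
import math
from collections import Counter


def source_entropy(sources):
    """Shannon entropy (natural log) of the source distribution in a cluster.

    This definition is an assumption; the slides only name the quantity.
    """
    counts = Counter(sources)
    total = len(sources)
    return -sum((c / total) * math.log(c / total) for c in counts.values())


def article_weight(authority, alpha, t, t_i):
    """W_N(n_i, t) = A(source(n_i)) * exp(-alpha * (t - t_i))."""
    return authority * math.exp(-alpha * (t - t_i))


def cluster_weight(articles, authority, alpha, t):
    """W_C(c, t) for a cluster given as a list of (source, publish_time) pairs.

    `authority` maps a source name to its A(s) value.
    """
    sources = [s for s, _ in articles]
    total = sum(article_weight(authority[s], alpha, t, t_i) for s, t_i in articles)
    return source_entropy(sources) * total
```

Under this particular entropy definition a cluster whose articles all come from one source has entropy zero and therefore weight zero; the slides do not say how that case is treated in practice.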


Dataset

Category        #news      Category       #news

Macedonia       58,596     Sport          43,498
Balkan          19,341     Chronicle      10,108
World           29,647     Culture        16,933
Economy         29,754     Technology      5,755
Skopje           7,325     Fun/Showbiz    37,798
Uncategorized   58,205     Total         316,960

Collected over a period of one year (from 01/07/08 to 30/06/09).


Classification of News Stories

Ontology majority voting algorithm for cluster classification.

Category    Precision  Recall     Category     Precision  Recall

Macedonia   95%        98%        Sport        98%        96%
Balkan      94%        96%        Chronicle    98%        98%
World       98%        98%        Culture      98%        96%
Economy     97%        96%        Technology   98%        98%
Skopje      98%        96%        Fun/Showbiz  98%        97%

100 manually labeled clusters per category were used as an evaluation set.


Visits statistics


Conclusion and future work

This work was motivated by the wide use of news engines and the contrasting lack of academic papers in this area.

The design and implementation details of a full-scale news engine were presented.

A model for scoring news articles and news stories was presented.

Extensive testing on more than 300,000 news articles, posted by 50 sources over one year, was performed, with very encouraging results.

Future work: Scoring of news articles and news stories. A new “semantic” model for representing news is needed to replace the current “bag of words” model.
