

Research Article
An Effective News Recommendation Method for Microblog User

Wanrong Gu, Shoubin Dong, Zhizhao Zeng, and Jinchao He

School of Computer Science and Engineering, South China University of Technology, Guangzhou 510641, China

Correspondence should be addressed to Shoubin Dong; sbdong@scut.edu.cn

Received 4 December 2013; Accepted 19 February 2014; Published 2 April 2014

Academic Editors: Z. Chen and F. Yu

Copyright © 2014 Wanrong Gu et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Recommending news stories to users based on their preferences has long been a favourite domain for recommender systems research. Traditional systems try to satisfy users by tracing their reading history and choosing suitable candidate news articles to recommend. However, most news websites do not require users to register before reading news. Moreover, the latent relations between news and microblogs, the popularity of particular news, and the organization of news are not addressed or solved efficiently by previous approaches. To address these issues, we propose an effective personalized news recommendation method based on microblog user profile building and subclass popularity prediction. Specifically, we propose a news organization method that uses hybrid classification and clustering, implement a subclass popularity prediction method, and construct user profiles suited to our setting. We designed several experiments comparing our method with state-of-the-art approaches on a real-world dataset, and the experimental results demonstrate that our system significantly improves accuracy and diversity on massive text data.

1. Introduction

With the rapid development of the Internet, more and more people prefer reading news online or on a mobile phone rather than buying a newspaper. However, the massive volume of news and blogs online also brings users an information overload problem. With such a large number of news articles, a very important issue for online news services is how to help users find interesting news that matches their preferences as closely as possible; this is the problem of personalized news recommendation. Microblogging has become a popular network application over the past several years [1]. Therefore, how to use microblogs to recommend items (i.e., news, products, or advertisements) has become a hot research topic for website providers.

Despite some recent advances [1–4], personalized news recommendation faces at least three problems. First, the mass of news articles arriving every day requires fast, real-time processing; that is, the system must rapidly classify or cluster the large volume of articles that crawlers feed into it. Second, the reading context must be considered; for instance, popular news articles are likely to be more attractive to users. Third, the popularity and freshness of news change dramatically over time. These three problems also exist in recommender systems for other items, such as movies, music, and products. However, many critical issues of news recommendation have not been solved in previous studies.

In this paper, to address the issues mentioned above, we propose NEMAH, an effective personalized news recommendation system based on microblog user profile building and hot subclass popularity prediction. We explore the intrinsic relation between users and news through users' interests, a subclass popularity factor, and freshness. In summary, the three main contributions of our paper are as follows.

(i) A Novel Framework for News Partition (See Section 4). News classification and subclass clustering are important steps in news recommendation. We propose a 2-stage news partition framework. First, the news articles are divided into several categories using our proposed hybrid classification method (see Section 4.2). Then, we cluster the articles in a given class into several clusters that represent news subclasses (see Section 4.3). This representation helps the news recommendation system build and update its news database rapidly.

Hindawi Publishing Corporation, The Scientific World Journal, Volume 2014, Article ID 907515, 14 pages, http://dx.doi.org/10.1155/2014/907515


(ii) A Subclass Popularity Prediction Method for News Recommender System (See Section 5). Users like reading not only the news articles they are interested in but also hot news; we call this phenomenon the users' social preference. In general, it is difficult for a real-time news recommendation system to instantly obtain statistics on global users' attention to a specific piece of news or subclass. Therefore, we analyze historical data crawled from the web and propose a news subclass popularity prediction model based on spectral analysis of time series.

(iii) A Novel Application Using Microblog for User Profile Construction (See Section 6). Microblogging is the most mainstream form of grassroots media, where users can express their views and retweet information they agree with or are interested in. In this paper, we propose a user profile construction method based on microblog content and user behavior.

The rest of this paper is organized as follows. Section 2 covers related work relevant to personalized news recommendation. Section 3 describes the recommendation framework of NEMAH. Section 4 presents the classification and subclass clustering methods we design. Section 5 introduces the news subclass popularity prediction model. Section 6 reports the user profile construction method we put forward, and Section 7 introduces the recommending model. Extensive experimental results are reported in Section 8. Finally, Section 9 concludes this paper.

2. Related Work

News recommendation is an important application of recommender systems and has attracted more and more attention recently. Existing news recommendation methods can be roughly divided into three categories: content-based, collaborative filtering, and hybrid methods.

Content-Based. This method uses the content of the user's reading history to recommend similar items; Schafer [5] called it the Item-to-Item Correlation method. In news recommenders, a news article is generally represented by a vector space model (VSM) or by topic distributions. Reference [6] employed TF-IDF to construct the VSM and utilized the $K$-nearest neighbor method to recommend news to a specific user. Reference [7] employed the Naïve Bayesian classifier to classify web pages and construct user profiles. Liu et al. [4] (called ClickB in the Experimental Evaluation section) proposed a recommendation method that uses news content based on click behavior. In our work, we classify news articles using the VSM and represent the articles with a TF-IDF weight for each word. The content-based method is easy to express and implement. However, not all data are easy to express as a VSM, such as audio, image, and video news data [8]. Another problem is content similarity; for example, a user would not like to read similar news many times from a news recommender that uses a content-based method. In our work, we diversify news articles according to the distribution of crawled news articles and the preference of the given user.

Collaborative Filtering. This method utilizes users' behaviors on items for recommendation. In other words, collaborative filtering is content-free, and it can be roughly divided into two subcategories: heuristic-based and model-based. For the former, the recommendation process is inspired by real-world phenomena [9]. The latter trains a model for predicting the utility of item $j$ for the current user $u$, such as [10, 11] (called Goo in the Experimental Evaluation section). Purchasing and rating are the most important behaviors in collaborative filtering recommendation systems. In a news recommender system, however, the rating can be seen as binary: a click on a piece of news is represented as a rating of 1, and the rating is 0 otherwise [11]. The success of a collaborative filtering-based recommendation system relies on the availability of many users and items. However, many users have behaviors on only a few items, so the user-item matrix is sparse, which leads to poor recommendation [12]. One way to solve this problem is to use users' demographics, such as age, gender, education, area, or employment, to calculate the similarities between users. Another approach employs behaviors that reflect relationships among users, such as review, retweet, and favorite. In our work, we utilize microblog information to solve the issue discussed above.

Hybrid Method. This method combines collaborative filtering, content-based methods, and other factors [13]. Many news recommendation methods are hybrid, such as Bilinear [14], Bandit [15], and SCENE [3], which are discussed and analysed in the Experimental Evaluation section.

From the perspective of news recommendation, our work is similar to SCENE [3], EMM News Explorer [16], and Newsjunkie [17] in its use of news content and named entities for news recommendation. However, SCENE did not consider the subclass popularity period, EMM News Explorer did not provide personalized recommendation, and Newsjunkie did not address the issues of classification and user profile construction as we do.

3. Recommendation Framework

Figure 1 shows the framework of our proposed system, NEMAH. Recommendation is performed by the following four modules: the Classification and Clustering Module, the Subclass Popularity Prediction Module, the User Profile Module, and the Recommendation Module. These major components and the processing flow in our framework are described as follows.

(1) Classification and Clustering Module. News in this module is divided into 23 categories, as customized by the Press and Publication Administration of the People's Republic of China. Because key persons (named entities of type person) play an important part in news classification, we propose a hybrid classification method based on a classical classification method and the key persons. A large number of experiments show that adjusting the weight of the key persons in the hybrid classification helps achieve good news classification performance. After obtaining the rough classes, we cluster the subclasses using a clustering method that maximizes the average of neighborhood points.

Figure 1: The overview of NEMAH framework. [Diagram components: history news data; newly published news; microblog data; classification (Section 4.2); clustering (Section 4.3); subclass popularity prediction (Section 5); user profile building (Section 6); rough selection (Section 7.1); precise selection (Section 7.3); recency; recommendation list.]

(2) Subclass Popularity Prediction Module. Different periods have different popular subclasses. People would rather focus on the popular subclasses than spend much time searching for and selecting information. Sometimes even the users themselves have no idea what they want. Therefore, subclass popularity prediction helps users save time and improves their experience with the recommendation system. In our study of network news, we found that some subclasses exhibit significant periodicity over time. Popular subclasses can be assigned a higher weight in recommendation. In this paper, we use the time series spectral analysis method to predict the popularity of subclasses.

(3) User Profile Module. This module extracts the preference model of users. We use the microblogs that users tweeted or retweeted to establish a user profile model representing the users' interests. This procedure combines text analysis, text classification, and the use of some particular factors (i.e., key names and place names that appear in microblogs).

(4) Personalized News Recommendation. We first use the user profile and the subclasses to determine the candidate subclasses. We then calculate a user's utility for each news item with a greedy strategy and rank the recommended list by the popularity of the news article within its subclass and by the article's recency. Note that when recommending specific news items, our system uses the class and the subclass of the news articles. Moreover, other properties of news items, such as freshness (recency) and popularity (subclass popularity prediction), are incorporated into the final ranked recommendation list as adjustment factors.

4. Classification and Clustering Module

Classifying massive network news is conducive to the subsequent processing in news applications. Internet news recommendation must respond as quickly as possible when showing the recommended list to users. In NEMAH, given a set of news items $N = \{n_1, n_2, \ldots, n_M\}$, where $|N| = M$, our goal is to generate a classification result $C = \{C_1, C_2, \ldots, C_K\}$, where $K$ is a predefined number of classes (e.g., $K = 23$ in this paper). Class names are shown in Table 1. Besides, each class can be divided into several subclasses using our proposed clustering module: $C_i = \{SC_{i1}, SC_{i2}, \ldots, SC_{im}\}$. The storage structures of a piece of history news and of a user are shown in Figures 2 and 3, respectively.

    ⟨REC⟩
    ⟨NewsID⟩ = nf2012010121574
    ⟨Date⟩ = 20120101
    ⟨Title⟩ = Our province will implement the new law of registered residence
    ⟨Author⟩ = Zhangsheng Bo
    ⟨CSN⟩ = 0215
    ⟨Class⟩ = Law, Justice
    ⟨SubClass⟩ = household management
    ⟨Area⟩ = D440000, Guangdong Province
    ⟨Source⟩ = Nanfang Daily
    ⟨KP⟩ = Weifa Liang
    ⟨Text⟩ = News content
    ⟨CommentsUser⟩ = 1759918187, 2414113125, 1413475981, 1463256471, 871324394, 1657924191, 2001946341, 1356100372, 1549089713
    ⟨Tagged⟩ = True

Figure 2: The storage structure of a piece of history news. Remark: the elements used in this paper are ⟨NewsID⟩, ⟨Date⟩, ⟨Title⟩, ⟨CSN⟩, ⟨Area⟩, ⟨KP⟩, ⟨Text⟩, and ⟨CommentsUser⟩, in which ⟨CommentsUser⟩ lists the users who commented on this news article, ⟨CSN⟩ denotes the class and subclass ID of this news article (e.g., ⟨CSN⟩ = 0215 means that the class ID is 02 and the subclass ID is 15), and ⟨KP⟩ is the named entity of type person, which is discussed in Section 4.2.
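The record layout of Figure 2 can be read into a dictionary and the ⟨CSN⟩ field split into its class and subclass IDs. The following is only a minimal sketch: the paper does not state how records are serialized, so the one-field-per-line `⟨Tag⟩ = value` text format and the helper names `parse_record` and `split_csn` are assumptions made for illustration.

```python
def parse_record(lines):
    """Read '⟨Tag⟩ = value' lines into a dict (assumed serialization)."""
    record = {}
    for line in lines:
        if "=" not in line:
            continue  # skip marker lines such as ⟨REC⟩
        tag, value = line.split("=", 1)
        record[tag.strip().strip("⟨⟩")] = value.strip()
    return record


def split_csn(csn):
    """Split a 4-digit CSN code into (class ID, subclass ID)."""
    if len(csn) != 4 or not csn.isdigit():
        raise ValueError(f"unexpected CSN value: {csn!r}")
    return int(csn[:2]), int(csn[2:])


record = parse_record([
    "⟨REC⟩",
    "⟨NewsID⟩ = nf2012010121574",
    "⟨CSN⟩ = 0215",
])
class_id, subclass_id = split_csn(record["CSN"])  # class 2, subclass 15
```

Keeping the class and subclass IDs as integers makes it easy to join a record against the class list in Table 1.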

    ⟨USER⟩
    ⟨UserID⟩ = 2414113125
    ⟨MicroBlog⟩ = MicroBlog content 1, MicroBlog content 2, MicroBlog content 3
    ⟨CommentsOn⟩ = nf2012010121574, nf2011040722331, nf201012784512

Figure 3: The storage structure of a user. Remark: ⟨MicroBlog⟩ lists the messages tweeted or retweeted by the user; ⟨CommentsOn⟩ denotes the news articles commented on by this user.

Table 1: Name of each class.

    ID   Class name
    1    Political
    2    Law, Justice
    3    External Relations, International Relations
    4    Military
    5    Social, Labor, Disaster
    11   Economy
    12   Finance, Banking
    13   Infrastructure, Construction, Real estate
    14   Agriculture, Rural areas
    15   Mining, Industrial
    16   Energy, Water Conservancy
    17   Information industry
    18   Transport, Postal services, Logistics
    19   Commerce, Foreign trade, Customs
    21   Services, Tourism
    22   Environmental, Meteorological
    31   Education
    33   Science and Technology
    35   Culture, Recreation and Leisure
    36   Literature, Art
    37   Media Industry
    38   Medicine, Health
    39   Sports

4.1. Feature Selection. In text corpus processing, the dimension of each item can be very large (i.e., more than ten thousand in some cases), so the main features must be selected to represent a document. Generally, there are three classical feature selection methods in text processing: Mutual Information [18], Information Gain [19], and CHI Statistics [20]. These methods are inclined to choose rare words, which are not reliable for classification on some corpora. Therefore, to solve this problem and reduce the computational burden of news article classification, we must filter out sporadic low-frequency words. The two concrete filtering steps are described below.

(1) Rough Selection Using Document Frequency of Feature Words. In the training corpus, let $t_i$ be a word. We define $Df_i$ as the total relative document frequency, that is, the ratio of the number of documents containing $t_i$ to the total number of documents. When $Df_i$ is greater than a threshold $\alpha$, the word $t_i$ is a high-frequency word in the training corpus, and we add it to the set $Tem_1$. For a given class $C_k$, we define $Df_{ik}$ as the class relative document frequency in class $C_k$, that is, the ratio of the number of documents in class $C_k$ containing $t_i$ to the total number of documents in class $C_k$. When $Df_{ik}$ is greater than a given threshold $\beta$, $t_i$ is a high-frequency word in this class, and we add it to the set $Tem_2$. Based on our experiments and corpus, we roughly set $\alpha = 0.01$ and $\beta = 0.1$ to avoid wrong or omitted selections. This rough selection process selects the words that appear frequently both in the whole corpus and in the classes [21]. The result of rough feature selection is $Tem' = Tem_1 \cap Tem_2$.
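The rough selection step can be sketched as a pair of document-frequency filters. This is an illustrative reading, not the authors' implementation: the function name `rough_select` and the tokenized input format are hypothetical, and the sketch assumes that $Tem_2$ is the union, over all classes, of the words whose class relative document frequency exceeds $\beta$.

```python
from collections import defaultdict


def rough_select(docs, labels, alpha=0.01, beta=0.1):
    """Return Tem' = Tem1 ∩ Tem2 for tokenized docs with class labels."""
    n_docs = len(docs)
    df_total = defaultdict(int)                       # docs containing each word
    df_class = defaultdict(lambda: defaultdict(int))  # same count, per class
    class_size = defaultdict(int)
    for tokens, label in zip(docs, labels):
        class_size[label] += 1
        for word in set(tokens):  # count each word once per document
            df_total[word] += 1
            df_class[label][word] += 1
    # Tem1: words whose total relative document frequency exceeds alpha
    tem1 = {w for w, c in df_total.items() if c / n_docs > alpha}
    # Tem2 (assumption): union over classes of words with Df_ik > beta
    tem2 = {w
            for label, counts in df_class.items()
            for w, c in counts.items()
            if c / class_size[label] > beta}
    return tem1 & tem2


docs = [["tax", "law", "court"], ["law", "court"],
        ["match", "goal"], ["goal", "rare"]]
labels = ["law", "law", "sports", "sports"]
# Larger thresholds than the paper's, chosen only to suit this tiny corpus.
selected = rough_select(docs, labels, alpha=0.3, beta=0.6)
```

On this toy corpus the words that occur in only one document ("tax", "match", "rare") fall below both thresholds and are dropped.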

(2) Precise Selection Using Index of Discrimination between Word and Class. We employ the method of [22] to define the index of discrimination between word and class as follows:

\[ R(t_i, C_k) = \frac{P(t_i \in C_k)}{\max_{C_j \neq C_k} P(t_i \in C_j)}, \tag{1} \]

where $P(t_i \in C_k)$ denotes the probability of word $t_i$ in class $C_k$ and $\max_{C_j \neq C_k} P(t_i \in C_j)$ denotes the maximum probability of word $t_i$ in any class other than $C_k$. $P(t_i \in C_k)$ can be represented as follows:

\[ P(t_i \in C_k) = \frac{tf(t_i \in C_k) + 1}{\sum_{t'} tf(t' \in C_k) + 1}, \tag{2} \]

where $tf(t_i \in C_k)$ denotes the frequency of $t_i$ in class $C_k$, $t'$ denotes a word of $Tem'$ different from $t_i$, and $\sum_{t'} tf(t' \in C_k)$ denotes the total frequency of the words $t'$ in class $C_k$. The word $t_i$ is a representative word of class $C_k$ when the index of discrimination $R(t_i, C_k)$ is greater than a threshold $\gamma$. We can use a selection proportion threshold $T$ to decide the parameter $\gamma$, which is discussed in the Experimental Evaluation section. Applying this process to each class yields the set of representative words. The rough selection step saves computation time because it excludes in advance the words that are certainly not feature words.
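Equations (1) and (2) translate directly into a small scoring routine. The sketch below is a hedged illustration: the names `class_prob`, `discrimination`, and `representative_words` are hypothetical, the term frequencies are made-up toy values, and the denominator follows the text's reading that $t'$ ranges over the words other than $t_i$.

```python
def class_prob(tf, word, cls):
    """Eq. (2): smoothed score of `word` in class `cls`.

    tf maps class -> {word: term frequency}. Following the text, the
    denominator sums the frequencies of the words other than `word`.
    """
    others = sum(f for w, f in tf[cls].items() if w != word)
    return (tf[cls].get(word, 0) + 1) / (others + 1)


def discrimination(tf, word, cls):
    """Eq. (1): score in `cls` divided by the best score in any other class."""
    rival = max(class_prob(tf, word, c) for c in tf if c != cls)
    return class_prob(tf, word, cls) / rival


def representative_words(tf, cls, gamma):
    """Words of `cls` whose index of discrimination exceeds gamma."""
    return {w for w in tf[cls] if discrimination(tf, w, cls) > gamma}


tf = {
    "law": {"court": 8, "goal": 1},
    "sports": {"court": 1, "goal": 9},
}
law_words = representative_words(tf, "law", gamma=1.0)        # {"court"}
sports_words = representative_words(tf, "sports", gamma=1.0)  # {"goal"}
```

A word that is frequent in one class and rare elsewhere gets a large ratio, which is why a single threshold $\gamma$ is enough to separate class-specific words from shared ones.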

4.2. Classifying News Items. In the real Internet world, classification or clustering of massive news data requires a great deal of computational power. To tackle this issue in news recommendation, we employ the One-Versus-All method [23] (which is built on two-class classification) and take the key persons in news articles into account. In this paper, news article classification is treated as a collection of two-class classification problems. For a class $C_k$, if document $d_i$ belongs to class $C_k$, it is tagged with 1 as a positive sample for class $C_k$; otherwise it is tagged with $-1$ as a negative sample. The method constructs a projective vector $p_k$ between the text matrix $A$ and the class vector $y$, and we employ the ridge regression method [24]:

\[ p_k = \arg\min_{p_k} \left\| y - p_k^T A \right\|^2 + \theta \left\| p_k \right\|^2, \tag{3} \]

where $\theta$ is a positive parameter used to adjust the estimation error. To solve the minimization problem above, we take the partial derivative with respect to $p_k$ and set it to 0, which yields

\[ p_k = \left( A A^T + \theta I \right)^{-1} A y^T, \tag{4} \]

where $I$ is the identity matrix with the same dimension as $A A^T$. Because the training set is divided into $K$ categories, we obtain a group of projective vectors $P = \{p_1, p_2, \ldots, p_K\}$. We utilize a code matrix $M$ to describe the correlation between the different classes obtained from the two-class classifications. Assuming that class $C_k$ has $N_k$ training documents $D_{kj}$, where $j \in [1, N_k]$, the element of $M$ that denotes the correlation between two classes can be calculated by

\[ M_{kk'} = \frac{1}{N_k} \sum_{j=1}^{N_k} \operatorname{sgn}\left( \langle p_{k'}, D_{kj} \rangle \right), \tag{5} \]

where $p_{k'}$ denotes the projective vector of $C_{k'}$. If $\langle p_{k'}, D_{kj} \rangle$ is greater than 0, the function sgn returns 1, and otherwise 0. When a new article arrives, the similarity between the article and a class can be calculated by the following equation:

\[ \mathrm{Sim}(B, C_k) = \sum_{k'=1}^{K} M_{kk'} Q_{k'} = \sum_{k'=1}^{K} M_{kk'} \operatorname{sgn}\left( \langle p_{k'}, B \rangle \right), \tag{6} \]

where $B$ denotes a new article and $Q_{k'}$ is the output of the $k'$-th two-class classifier on $B$. Finally, we obtain the class of $B$ by maximizing the function $\mathrm{Sim}(\cdot, \cdot)$:

\[ C(B) = \arg\max_{C_k} \mathrm{Sim}(B, C_k). \tag{7} \]
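The classification pipeline of this section (one ridge-regression projective vector per class, the code matrix $M$, and the similarity-based decision) can be sketched in plain Python. This is a simplified illustration under stated assumptions: documents are small dense feature vectors forming the columns of $A$, the closed form used is $p_k = (A A^T + \theta I)^{-1} A y^T$, the classifier output on a new article is taken to be $\operatorname{sgn}(\langle p_{k'}, B \rangle)$, and all function names are hypothetical.

```python
def solve(mat, rhs):
    """Solve mat @ x = rhs by Gauss-Jordan elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [b] for row, b in zip(mat, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def ridge_vector(docs, y, theta):
    """Closed-form ridge solution; each document is one column of A."""
    d = len(docs[0])
    gram = [[sum(x[a] * x[b] for x in docs) + (theta if a == b else 0.0)
             for b in range(d)] for a in range(d)]           # A A^T + theta I
    rhs = [sum(x[a] * t for x, t in zip(docs, y)) for a in range(d)]  # A y^T
    return solve(gram, rhs)


def dot(p, x):
    return sum(a * b for a, b in zip(p, x))


def sgn(v):
    """1 for positive inputs and 0 otherwise, as defined after Eq. (5)."""
    return 1 if v > 0 else 0


def train(docs, labels, classes, theta=0.1):
    """Return the projective vectors P and the code matrix M of Eq. (5)."""
    P = {k: ridge_vector(docs, [1 if l == k else -1 for l in labels], theta)
         for k in classes}
    M = {}
    for k in classes:
        members = [x for x, l in zip(docs, labels) if l == k]
        for k2 in classes:
            M[k, k2] = sum(sgn(dot(P[k2], x)) for x in members) / len(members)
    return P, M


def classify(article, P, M, classes):
    """Eqs. (6)-(7): pick the class with the largest similarity."""
    q = {k2: sgn(dot(P[k2], article)) for k2 in classes}
    sim = {k: sum(M[k, k2] * q[k2] for k2 in classes) for k in classes}
    return max(classes, key=lambda k: sim[k])


docs = [[1.0, 0.0], [2.0, 0.0], [0.0, 1.0], [0.0, 2.0]]
labels = ["law", "law", "sports", "sports"]
classes = ["law", "sports"]
P, M = train(docs, labels, classes)
predicted = classify([3.0, 1.0], P, M, classes)  # "law"
```

For a realistic vocabulary, a numerical library would replace the hand-written solver; the toy version is used here only to keep the sketch self-contained.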

To further improve classification accuracy and make rational use of manual labor, we propose a method that takes the key person (named entity of type person) into account, which improves classification when key persons appear:

\[ P(C_i \mid B) = (1 - \alpha) \frac{\mathrm{Sim}(B, C_i)}{\sum_{j=1}^{K} \mathrm{Sim}(B, C_j)} + \alpha P(C_i \mid B_k) P(B_k \mid B), \tag{8} \]

where $\mathrm{Sim}(B, C_i) / \sum_{j=1}^{K} \mathrm{Sim}(B, C_j)$ denotes the probability score of article $B$ on class $C_i$ obtained by the method described above, $B_k$ denotes a key person appearing in article $B$, and $P(B_k \mid B) = 1$ when $B_k$ appears in $B$; otherwise $P(B_k \mid B) = 0$. In other words, if no key person appears in a new article, the key person factor cannot be applied to it. $\alpha$ is the parameter that balances the two methods. $P(C_i \mid B_k)$ is computed as

\[ P(C_i \mid B_k) = \frac{P(C_i) \, p(B_k \mid C_i)}{\sum_{j=1}^{K} P(C_j) \, p(B_k \mid C_j)}. \tag{9} \]
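The key-person adjustment can be sketched as a blend of two score dictionaries. The code below is an illustration only: the function names, the priors, and the likelihood values are made up, and the sums in Eqs. (8) and (9) are taken over the classes.

```python
def key_person_posterior(prior, likelihood):
    """Eq. (9): Bayes rule over classes for one key person B_k.

    prior[c] = P(C_c); likelihood[c] = p(B_k | C_c).
    """
    joint = {c: prior[c] * likelihood.get(c, 0.0) for c in prior}
    total = sum(joint.values())
    return {c: v / total for c, v in joint.items()}


def hybrid_scores(sim, alpha, posterior=None):
    """Eq. (8): blend normalized Sim scores with the key-person posterior.

    posterior=None means no key person appears in the article,
    i.e. P(B_k | B) = 0 and the second term vanishes.
    """
    total = sum(sim.values())
    scores = {}
    for c, s in sim.items():
        kp = posterior[c] if posterior is not None else 0.0
        scores[c] = (1 - alpha) * s / total + alpha * kp
    return scores


# A tie between two classes is broken by a key person seen mostly in law news.
sim = {"law": 1.0, "sports": 1.0}
posterior = key_person_posterior(
    prior={"law": 0.5, "sports": 0.5},
    likelihood={"law": 0.9, "sports": 0.1},
)
scores = hybrid_scores(sim, alpha=0.4, posterior=posterior)
best = max(scores, key=scores.get)  # "law"
```

When no key person is found, calling `hybrid_scores(sim, alpha)` without a posterior simply scales the normalized similarity scores by $1 - \alpha$, so the ranking of classes is unchanged.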

4.3. News Subclass Clustering. After obtaining the rough classification results, we need to separate every news class into subclasses $SC_{ix}$. A natural way to detect the subclasses of an Internet text corpus is clustering, for instance, $K$-means or hierarchical clustering. In NEMAH, we propose a subclass clustering method to obtain subclasses. Each subclass is represented as a subclass vector $T = \{\langle t_1, w_1 \rangle, \langle t_2, w_2 \rangle, \ldots\}$, where $t_i$ and $w_i$ denote a representative word and its corresponding weight, respectively. We call this clustering method the Maximizing Neighborhood method, after the main idea of the algorithm.

(1) Solving Subspace Projection Problem by Maximizing the Average of Neighborhood. For each document $x_i$ in a text space $X_0$, the neighbor documents can be divided into two subsets according to their distance to $x_i$: the similar neighborhood set $\Theta_i^o$ and the heterogeneous neighborhood set $\Theta_i^e$, where $\Theta_i^o$ contains the top $\xi$ nearest neighbors that belong to the same class as $x_i$ and $\Theta_i^e$ contains the top $\zeta$ nearest neighbors that do not belong to the same class as $x_i$. For each data point $x_i$ in the text corpus, the average out-of-class distance and the average within-class distance can be expressed as follows:

\[ P_i = \sum_{x_p \in \Theta_i^e} \frac{\left\| x_i - x_p \right\|^2}{\left| \Theta_i^e \right|}, \qquad Q_i = \sum_{x_q \in \Theta_i^o} \frac{\left\| x_i - x_q \right\|^2}{\left| \Theta_i^o \right|}. \tag{10} \]

Summing these quantities over all data points in the text corpus gives the overall out-of-class distance and the overall within-class distance:

\[ P = \sum_i P_i = \sum_i \sum_{x_p \in \Theta_i^e} \frac{\left\| x_i - x_p \right\|^2}{\left| \Theta_i^e \right|}, \qquad Q = \sum_i Q_i = \sum_i \sum_{x_q \in \Theta_i^o} \frac{\left\| x_i - x_q \right\|^2}{\left| \Theta_i^o \right|}. \tag{11} \]

The subclass clustering problem can be considered as a projection of the text space onto a subspace. For instance, let $y_i$ be the image of $x_i$ after projection; we can write $y_i = W^T x_i$. The principle of this projection is to maximize the average distance between different classes and minimize the average distance within each class [25], as shown in the following:

\[
\begin{aligned}
r &= \sum_i \left( \sum_{x_p \in \Theta_i^e} \frac{\left\| x_i - x_p \right\|^2}{\left| \Theta_i^e \right|} - \sum_{x_q \in \Theta_i^o} \frac{\left\| x_i - x_q \right\|^2}{\left| \Theta_i^o \right|} \right) \\
&= tr\left[ W^T \left( \sum_i \sum_{x_p \in \Theta_i^e} \frac{\left\| x_i - x_p \right\|^2}{\left| \Theta_i^e \right|} - \sum_i \sum_{x_q \in \Theta_i^o} \frac{\left\| x_i - x_q \right\|^2}{\left| \Theta_i^o \right|} \right) W \right] \\
&= tr\left[ W^T (P - Q) W \right],
\end{aligned}
\tag{12}
\]

where $tr(\cdot)$ denotes the trace of a matrix and the constraint of this equation is $W^T W = I$. We then maximize this quantity:

\[ \max \; tr\left[ W^T (P - Q) W \right]. \tag{13} \]
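The neighborhood quantities in (10) and (11) are easy to evaluate for a labelled point set. The sketch below computes $P$, $Q$, and the margin $P - Q$ for points given as plain lists; it does not perform the optimization over $W$ in (13). The function name `neighborhood_margin` and the toy coordinates are hypothetical, and squared Euclidean distance is assumed.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((u - v) ** 2 for u, v in zip(a, b))


def neighborhood_margin(points, labels, xi=2, zeta=2):
    """Return (P, Q, P - Q) with P = sum_i P_i and Q = sum_i Q_i."""
    P = Q = 0.0
    for i, x in enumerate(points):
        # xi nearest neighbors in the same class (similar neighborhood)
        same = sorted(sq_dist(x, y) for j, y in enumerate(points)
                      if j != i and labels[j] == labels[i])[:xi]
        # zeta nearest neighbors in other classes (heterogeneous neighborhood)
        other = sorted(sq_dist(x, y) for j, y in enumerate(points)
                       if labels[j] != labels[i])[:zeta]
        if same:
            Q += sum(same) / len(same)
        if other:
            P += sum(other) / len(other)
    return P, Q, P - Q


points = [[0, 0], [0, 1], [5, 0], [5, 1]]
labels = ["a", "a", "b", "b"]
P, Q, margin = neighborhood_margin(points, labels, xi=1, zeta=1)
# P = 100.0, Q = 4.0, margin = 96.0
```

Passing projected points $y_i = W^T x_i$ instead of the original $x_i$ gives the margin achieved by a candidate projection, which is a simple way to compare two choices of $W$.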


(2) The Quick Affinity Propagation Clustering on Subspace. After projecting the initial text vector space into the subspace through the projection matrix, we generate $K$ clusters by applying $K$-Affinity Propagation ($K$-AP) in the subspace; this method is well suited to text clustering because it yields more reasonable clusters than the traditional clustering methods [26]. Let the similarity of $y_i$ and $y_j$ in the subspace $Y = \{y_1, y_2, \ldots, y_n\}$ be $S = \{s_{ij}\}$. The target of $K$-AP is to find the $K$ real samples $E = \{e_1, e_2, \ldots, e_K\}$ that denote the $K$ classes $C = \{C_1, C_2, \ldots, C_K\}$ and to maximize the following objective function:

$$
\max F\left(\{C_j\}_{j=1}^{K}\right) = \sum_{j=1}^{K} \sum_{y_i \in C_j} s(y_i, e_j), \tag{14}
$$

where $e_j$ belongs to $C_j$. The objective function can be transformed into a 0-1 integer programming problem by introducing the binary parameters $B = \{b_{ij} \in \{0, 1\},\ i, j = 1, \ldots, n\}$, as shown in the following:

$$
\max F\left(\{b_{ij}\}\right) = \sum_{i=1}^{n} \sum_{j=1}^{n} b_{ij}\, s(y_i, y_j). \tag{15}
$$

Equation (15) has three constraints: (1) $b_{ii} = 1$ if $b_{ji} = 1$; (2) $\sum_{j=1}^{n} b_{ij} = 1$; and (3) $\sum_{i=1}^{n} b_{ii} = K$, where $b_{ij} = 1$ when $y_i$ considers $y_j$ as a sample and $b_{ii} = 1$ when $y_i$ is a sample itself. The first constraint states that $y_i$ is a sample when $y_j$ considers $y_i$ as a sample. The second means that each data point has only one sample point. The last means that the number of samples is $K$, which ensures that the $K$-AP method generates $K$ clusters.
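The 0-1 formulation in (15) and its three constraints can be checked mechanically for a candidate assignment. The sketch below is illustrative only: it does not implement $K$-AP message passing, and the helper names (`assignment_matrix`, `satisfies_constraints`, `objective`), the one-dimensional toy points, and the negative squared distance used as similarity are all assumptions made for the example.

```python
def assignment_matrix(assign):
    """Build b from a list where assign[i] is the index of the sample chosen by point i."""
    n = len(assign)
    return [[1 if assign[i] == j else 0 for j in range(n)] for i in range(n)]


def satisfies_constraints(b, K):
    """Check the three constraints stated for (15)."""
    n = len(b)
    # (1) if y_j considers y_i as a sample (b_ji = 1), then y_i is a sample itself (b_ii = 1)
    c1 = all(b[i][i] == 1 for j in range(n) for i in range(n) if b[j][i] == 1)
    # (2) every point has exactly one sample
    c2 = all(sum(row) == 1 for row in b)
    # (3) exactly K points are samples
    c3 = sum(b[i][i] for i in range(n)) == K
    return c1 and c2 and c3


def objective(b, s):
    """Value of (15): sum of b_ij * s(y_i, y_j)."""
    n = len(b)
    return sum(b[i][j] * s[i][j] for i in range(n) for j in range(n))


# Toy data (hypothetical): four 1-D points, similarity = negative squared distance.
points = [0.0, 1.0, 5.0, 7.0]
s = [[-(p - q) ** 2 for q in points] for p in points]

valid = assignment_matrix([0, 0, 2, 2])    # points 0 and 2 act as samples
invalid = assignment_matrix([0, 0, 2, 1])  # point 3 picks point 1, which is not a sample
value = objective(valid, s)
```

In this toy case the valid assignment passes all three checks with $K = 2$, while the second assignment violates constraint (1).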

(3) Hybrid Learning of Subspace Projection and Clustering on Adaptive Subspace. The class information updated during subspace clustering can be used as a priori knowledge in the next round of subspace projection; after several iterations, until convergence, we obtain the global optimal clustering result. The iteration proceeds as follows:

$$
\begin{aligned}
X_0 &\rightarrow K\text{-AP} \rightarrow L_0 \rightarrow \text{SubSpace} \rightarrow W_1,\ \text{Score}_1 \\
Y_1 = W_1^T X_0 &\rightarrow K\text{-AP} \rightarrow L_1 \rightarrow \text{SubSpace} \rightarrow W_2,\ \text{Score}_2 \\
&\;\;\vdots \\
Y_t = W_t^T X_0 &\rightarrow K\text{-AP} \rightarrow L_t \rightarrow \text{SubSpace} \rightarrow W_{t+1},\ \text{Score}_{t+1}
\end{aligned}
$$

The value of the convergence function must be computed in each iteration:

$$
\text{Score}_{t+1} = tr\left[ W_{t+1}^T \left( P(L_t) - Q(L_t) \right) W_{t+1} \right], \tag{16}
$$

where $P(L_t)$ and $Q(L_t)$ denote the average distances between classes and within classes, which are calculated by (11) according to the class membership indicator matrix $L_t$. The iteration finishes when it meets the convergence condition $\text{Score}_{t+1} - \text{Score}_t \le \epsilon$ or reaches the maximum number of iterations. The parameters of our clustering method are the number of nearest in-class points, $\eta$, and the number of nearest out-of-class points, $\zeta$. We used cross-validation to tune these parameters and found that selecting $\zeta = \eta = 13$ for all classes per 1,000 documents performs better.

Discussion. The motivation of this module (classification and clustering) is to find the user's preference (at the subclass level) and to track the hotness of newly published news in a given subclass.

5. Subclass Popularity Prediction Module

With today's explosion of information, the fast pace of life makes people focus their attention on popular subclasses rather than spend much time searching and selecting information. Sometimes even users themselves have no idea what they really want. Therefore, hot subclass prediction technology with a recommendation function has become very important. News subclass popularity prediction can improve the performance of a news recommender system. Besides, it can also improve the display of popular news modules on a website automatically, reduce the workload of website editors, and improve the users' browsing experience.

From our study of historical statistical data on news subclasses, we found that some subclasses are popular periodically. For instance, the subclass college entrance examination becomes highly popular around June every year in China, and a lot of news articles and comments focus on this subclass at that time, as shown in Figure 4(b), which shows the data of the college entrance examination subclass (Figure 4(a) shows the graduate entrance examination subclass). In this paper, we define a news subclass's degree of concern according to the number of news articles and their comments, as shown in the following:

$$
H_k = \lambda H_{ne}^{(k)} + (1 - \lambda) H_{re}^{(k)}
= \frac{N_{re}^{(k)}}{N_{ne} + N_{re}} \cdot \frac{N_{ne}^{(k)}}{N_{ne}}
+ \frac{N_{ne}^{(k)}}{N_{ne} + N_{re}} \cdot \frac{N_{re}^{(k)}}{N_{re}}, \tag{17}
$$

where $H_{ne}^{(k)}$ denotes the popularity degree of news articles in the $k$th subclass, $H_{re}^{(k)}$ denotes the popularity degree of comments in the $k$th subclass, and $\lambda$ is the weight of the popularity degree of news articles, with value $N_{re}^{(k)} / (N_{ne} + N_{re})$. $N_{re}^{(k)}$ denotes the number of reviews in the $k$th subclass, $N_{re}$ denotes the number of reviews in the whole corpus, $N_{ne}^{(k)}$ denotes the number of news articles in the $k$th subclass, and $N_{ne}$ denotes the number of news articles in the whole corpus. According to time series analysis experiments on our corpus, we found that most subclasses are suitable for the spectral analysis method [27].
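As a worked instance of (17), the sketch below evaluates the right-hand side exactly as printed, term by term. The function name `subclass_popularity` and the counts in the example are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def subclass_popularity(n_ne_k, n_re_k, n_ne, n_re):
    """Degree of concern H_k of subclass k, following the right-hand side of (17).

    n_ne_k: news articles in subclass k     n_ne: news articles in the whole corpus
    n_re_k: reviews in subclass k           n_re: reviews in the whole corpus
    """
    total = n_ne + n_re
    article_weight = n_re_k / total   # lambda, as defined in the text
    review_weight = n_ne_k / total    # weight of the review term as printed in (17)
    h_ne = n_ne_k / n_ne              # share of articles falling in subclass k
    h_re = n_re_k / n_re              # share of reviews falling in subclass k
    return article_weight * h_ne + review_weight * h_re


# Hypothetical counts: 10 of 100 articles and 50 of 400 reviews fall in subclass k.
h_k = subclass_popularity(n_ne_k=10, n_re_k=50, n_ne=100, n_re=400)
```

With these counts the first term is $0.1 \times 0.1$ and the second is $0.02 \times 0.125$.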

Any stationary series can be modeled as a combination of cosine waves with different frequencies, amplitudes, and phases. This approach is called frequency-domain (spectral) analysis. The linear combination of $m$ cosines with arbitrary amplitudes, frequencies, and phases is shown in the following:

$$
Y_t = A_0 + \sum_{j=1}^{m} \left[ A_j \cos(2\pi f_j t) + B_j \sin(2\pi f_j t) \right]. \tag{18}
$$


Figure 4: Periodic subclass news distribution. (a) Popularity degree of graduate entrance examination subclass. (b) Popularity degree of college entrance examination subclass. Remark: the $x$-axis (Days, 0 to 1000) denotes the date from August 1, 2009 to May 3, 2012; the $y$-axis (Proportion, 0 to 1) denotes the value of $H_k$ in (17).

The values of $A$ and $B$ can be obtained by ordinary least squares regression. When the frequencies have a special form, the calculation becomes very simple. If $n$ is an odd number, expressed as $n = 2k + 1$, then the frequencies of the form $1/n, 2/n, \ldots, k/n$ are called Fourier frequencies. The estimated parameters are as follows:

$$
A_0 = \bar{Y}, \qquad
A_j = \frac{2}{n} \sum_{t=1}^{n} Y_t \cos\left( \frac{2\pi t j}{n} \right), \qquad
B_j = \frac{2}{n} \sum_{t=1}^{n} Y_t \sin\left( \frac{2\pi t j}{n} \right). \tag{19}
$$

If the sample size is even, expressed as $n = 2k$, (19) still holds, except that when $f_k = k/n = 1/2$ the estimates change to the following:

$$
A_k = \frac{1}{n} \sum_{t=1}^{n} (-1)^t Y_t, \qquad B_k = 0. \tag{20}
$$

Definition 1. When the sample size is odd, namely, $n = 2k + 1$, we define the periodogram (cycle diagram) $I$ at frequency $f = j/n$ ($j = 1, 2, \ldots, k$) as shown in the following equation:

$$
I\left( \frac{j}{n} \right) = \frac{n}{2} \left( A_j^2 + B_j^2 \right). \tag{21}
$$

If the sample size is even, (19) still gives the values of $A$ and $B$, and the periodogram is the same as in the odd case. But at the extreme frequency, that is, when $f = k/n = 1/2$, the periodogram is given by the following equation:

$$
I\left( \frac{j}{n} \right) = n \left( A_j \right)^2. \tag{22}
$$
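The estimates in (19) and (20) and the periodogram in (21) and (22) can be computed directly from a series. The following is a minimal sketch using only the standard library; the function names and the synthetic test series (a constant plus one sine-cosine pair at Fourier index 2 with $n = 11$) are assumptions made for illustration, not data from the paper.

```python
import math


def fourier_coefficients(y, j):
    """Estimates A_j, B_j at Fourier frequency j/n; t runs from 1 to n as in (19)."""
    n = len(y)
    if 2 * j == n:  # extreme frequency 1/2 for even n, see (20)
        a = sum(((-1) ** t) * y[t - 1] for t in range(1, n + 1)) / n
        return a, 0.0
    a = 2.0 / n * sum(y[t - 1] * math.cos(2 * math.pi * t * j / n) for t in range(1, n + 1))
    b = 2.0 / n * sum(y[t - 1] * math.sin(2 * math.pi * t * j / n) for t in range(1, n + 1))
    return a, b


def periodogram(y):
    """Map each Fourier index j = 1..floor(n/2) to I(j/n) following (21) and (22)."""
    n = len(y)
    result = {}
    for j in range(1, n // 2 + 1):
        a, b = fourier_coefficients(y, j)
        result[j] = n * a * a if 2 * j == n else n / 2.0 * (a * a + b * b)
    return result


# Synthetic series (hypothetical): mean 2 plus one sine-cosine pair at index j0 = 2.
n, j0 = 11, 2
series = [2 + 3 * math.cos(2 * math.pi * j0 * t / n) + 1.5 * math.sin(2 * math.pi * j0 * t / n)
          for t in range(1, n + 1)]
a_hat, b_hat = fourier_coefficients(series, j0)
spectrum = periodogram(series)
peak_index = max(spectrum, key=spectrum.get)
```

Because the series contains a single component at a Fourier frequency, the estimates recover its amplitudes and the periodogram has its only significant peak at that index.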

The periodogram at frequency $f = j/n$ is proportional to the sum of squares of the corresponding regression coefficients. Therefore, the peaks of the periodogram show the relative intensity of the sine-cosine pairs at different frequencies, as shown in Figure 5.

In Figure 5, the periodogram has two peaks, at 0.004970179 and 0.002982107; namely, the subcycle $T = 1/f$ may be 201 days and 335 days. The other peaks are so low that they can be neglected. The two frequencies are selected for building the model, which means that the model contains two sine-cosine pairs, as shown in the following:

$$
Y_t = \beta_0 + \beta_1 \cos(2\pi f_1 t) + \beta_2 \sin(2\pi f_1 t) + \beta_3 \cos(2\pi f_2 t) + \beta_4 \sin(2\pi f_2 t) + e_t. \tag{23}
$$

Using the spectral analysis method for prediction involves several steps. First, we use the periodogram to obtain the number and values of the strong frequencies. Second, the model is generated from these strong frequencies. Finally, we predict future data values according to the model, which only requires a time parameter.
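The three steps above can be sketched as follows, assuming the strong frequencies are Fourier frequencies $j/n$ below $1/2$, so that the closed-form estimates of (19) coincide with the least squares fit. This is an illustration, not the authors' implementation; the function names and the synthetic series are hypothetical. As a side observation, the two peak frequencies reported for Figure 5 equal $5/1006$ and $3/1006$ to the printed precision, which would correspond to indices 5 and 3 on a series of 1006 days; that reading is an inference from the numbers, not a statement made in the paper.

```python
import math


def fit_harmonics(y, indices):
    """Fit a model of the form (23) at Fourier indices j (each strictly below n/2)."""
    n = len(y)
    mean = sum(y) / n
    pairs = []
    for j in indices:
        a = 2.0 / n * sum(y[t - 1] * math.cos(2 * math.pi * j * t / n) for t in range(1, n + 1))
        b = 2.0 / n * sum(y[t - 1] * math.sin(2 * math.pi * j * t / n) for t in range(1, n + 1))
        pairs.append((j / n, a, b))  # (frequency, cosine weight, sine weight)
    return mean, pairs


def predict(model, t):
    """Evaluate the fitted model at time t; only the time parameter is needed."""
    mean, pairs = model
    return mean + sum(a * math.cos(2 * math.pi * f * t) + b * math.sin(2 * math.pi * f * t)
                      for f, a, b in pairs)


def signal(t):
    """Synthetic popularity series (hypothetical) with two periodic components."""
    return 0.3 + 0.2 * math.cos(2 * math.pi * 3 * t / 100) + 0.1 * math.sin(2 * math.pi * 5 * t / 100)


history = [signal(t) for t in range(1, 101)]   # n = 100 observed days
model = fit_harmonics(history, indices=[3, 5])  # the two strong frequencies
forecast = predict(model, 130)                  # 30 days past the end of the history
```

For frequencies that are not of the form $j/n$, an ordinary least squares solver would be needed instead of the closed-form sums.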

Discussion. The motivation of this module is to obtain the hotness of each subclass. Some recent studies also take into account the popularity of newly published news articles. For example, SCENE [3] used a popularity degree computed as the ratio of the number of users accessing the article to the size of the user pool. However, for the newest popular news article $n_i$, the number of clicks would be smaller than that of news articles published several hours or days before.

Figure 5: Periodogram of the popularity degree of the college entrance examination subclass. (a) The complete periodogram, with frequency range from 0 to 0.5. (b) Locally amplified periodogram, with frequency range from 0 to 0.05. Remark: the $x$-axis denotes the possible frequency of the popularity degree; the $y$-axis represents the energy of the corresponding frequency.

6. User Profile Module

In order to capture a user's reading interest in news items, a personalized news recommendation system generally needs to construct the user's profile. Traditionally, the user profile can be captured by tracking the user's reading history. A survey of various user profile construction techniques is provided in [28, 29]. In this paper, we use the microblog to construct the user's profile. The reason is that a user who is interested in some subclasses will tend to tweet or retweet microblogs on these subclasses. For instance, if a user tweets or retweets many messages about basketball games, we can deduce that this user may like reading basketball news reports (i.e., NBA, CBA, etc.). Besides, many readers tend to glance at news articles and are interested in some key persons' names. Moreover, people from different areas tend to read news about the city they live in or their hometown. Based on the above analysis, we propose to construct users' profiles by exploring the factors discussed above: microblog content, place names, and preferred key persons. To reduce the computational complexity, the user's preference is represented in our model by a vector $U_{pf} = \{\tau, \rho, \kappa\}$. Consider the following.

(1) $\tau$ represents the distribution of key index words in the microblogs that the user tweeted or retweeted in the past; it can be expressed as a vector $\{\langle t_1, w_1 \rangle, \langle t_2, w_2 \rangle, \ldots\}$, where each element consists of an index word and its corresponding weight.

(2) $\rho$ represents the place names that appear in the microblog of a specific user; it can be expressed as $\{\langle p_1, w_1 \rangle, \langle p_2, w_2 \rangle, \ldots, \langle p_i, w_i \rangle\}$, where $p_i$ denotes a place name and $w_i$ denotes the number of times this place appears in the tweets of the given user. We collect all the city and province names in China. Some place names are subordinate to others; for instance, GuangZhou city is subordinate to GuangDong province. In this case, the system adds weight to GuangDong using $w_{GuangDong} \mathrel{+}= w_{GuangZhou}$ when GuangZhou appears.

(3) $\kappa$ represents the list of key persons' names extracted from the user's microblog, $\{\langle k_1, w_1 \rangle, \langle k_2, w_2 \rangle, \ldots\}$, where the name list is constructed from the NanFang Daily training corpus, in which the key persons' names have been tagged in each news article.
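The place-name component $\rho$ described in item (2) can be sketched as a counting pass over a user's tweets with weight propagation from a city to its province. The sketch below is illustrative: the gazetteer `PLACE_NAMES`, the `PARENT_PLACE` mapping, the function name, and the sample tweets are all hypothetical, and plain substring counting stands in for whatever place-name matching the authors actually used.

```python
from collections import Counter

# Hypothetical gazetteer and subordination table (a real one would list all cities and provinces).
PLACE_NAMES = ["GuangZhou", "GuangDong"]
PARENT_PLACE = {"GuangZhou": "GuangDong"}


def place_profile(tweets, place_names=PLACE_NAMES, parent=PARENT_PLACE):
    """Return {place: weight}, where weight counts appearances in the user's tweets.

    When a subordinate place appears, its count is also added to the parent place,
    mirroring w_GuangDong += w_GuangZhou.
    """
    weights = Counter()
    for text in tweets:
        for place in place_names:
            hits = text.count(place)
            if hits:
                weights[place] += hits
                if place in parent:
                    weights[parent[place]] += hits
    return dict(weights)


profile = place_profile([
    "GuangZhou traffic update",
    "GuangDong weather today",
    "best food in GuangZhou",
])
```

Here GuangZhou appears twice and GuangDong once directly, so GuangDong ends with weight 3 after propagation.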

7. Personalized News Recommendation Module

The recommendation module can be divided into two steps: Rough Selection (see Section 7.1) and Precise Selection (see Section 7.3). In the first step, some subclasses are matched according to the user's preference. Then we select news articles from these subclasses with our selection strategy.

7.1. Rough Selection: Subclass Selection for a User. Once we obtain the subclasses and the user's profile, we can calculate the similarity between each subclass and a given user. We use TF-IDF weights to represent the vector of a given subtopic, $\tau_s = \{\langle t_1, w_1 \rangle, \langle t_2, w_2 \rangle, \ldots\}$. The similarity between a subclass and a user (represented as $\tau_u = \{\langle t_1, w_1 \rangle, \langle t_2, w_2 \rangle, \ldots\}$ in $U_{pf}$; see Section 6) is computed by cosine similarity. In general, users tend to prefer some particular subclasses; that is, they are not interested in all subclasses. Therefore, we can roughly select some subclasses with a similarity threshold. This threshold is set equal to the top 30% of the ranking of all similarity scores with respect to a given user.
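The rough selection step can be sketched as cosine similarity over sparse word-weight vectors followed by a cut on the ranked scores. The sketch below assumes the threshold means "keep the top 30% of subclasses by similarity", which is one reading of the text; the function names, the rounding with `math.ceil`, and the toy vectors are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two sparse vectors given as {term: weight} dicts."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def rough_select(user_vec, subclass_vecs, fraction=0.3):
    """Rank subclasses by similarity to the user and keep the top `fraction` of them."""
    scored = sorted(((cosine(user_vec, vec), name) for name, vec in subclass_vecs.items()),
                    reverse=True)
    keep = max(1, math.ceil(fraction * len(scored)))  # assumption: always keep at least one
    return [name for _, name in scored[:keep]]


# Toy TF-IDF style vectors (hypothetical).
user = {"basketball": 2.0, "nba": 1.0}
subclasses = {
    "sports/basketball": {"basketball": 1.5, "nba": 1.0, "cba": 0.5},
    "sports/football": {"football": 2.0, "league": 1.0},
    "education/exam": {"exam": 1.0, "college": 1.0},
    "finance/stock": {"stock": 1.0, "nba": 0.1},
}
selected = rough_select(user, subclasses)
```

With four subclasses and a fraction of 0.3, two subclasses are kept: the basketball subclass and the one that shares a single weak term with the user.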

7.2. News Profile Construction. After obtaining the news clusters that the user might be interested in, the next step is to select specific news articles for the given user. As with users, we initially maintain a news profile for each news article and then model the recommendation as a budgeted maximum coverage problem and solve it by a greedy selection algorithm. A news profile contains many similar factors, for example, key person, place, the cluster it belongs to, recency, popularity, and so forth. For the popularity, as discussed above, we use $H_k$ to represent the popularity degree of cluster $k$. For the recency, the score is represented as follows:

$$
\text{Rec}(i) = \frac{i_c - i_p}{24 * 60}, \tag{24}
$$

where the function $\text{Rec}(i)$ returns the recency score of news article $i$, and $i_c$ and $i_p$ denote the current time and the published time, respectively.

In this paper, news profiles help to evaluate how well a news article can satisfy the user. Given a news profile $N_{pf} = \{\rho, \kappa, \upsilon\}$ and a user's profile $U_{pf} = \{\rho, \kappa, \upsilon\}$, the similarity between $N_{pf}$ and $U_{pf}$ is computed as

$$
\text{sim}(N_{pf}, U_{pf}) = \gamma_1\, \text{sim}(\rho_n, \rho_u) + \gamma_2\, \text{sim}(\kappa_n, \kappa_u) + \gamma_3\, \text{sim}(\upsilon_n, \upsilon_u), \tag{25}
$$

where $\gamma_1$, $\gamma_2$, and $\gamma_3$ are parameters that control how we trust or weigh the corresponding components and are set to 1 in our system. Each component is calculated by the cosine similarity.
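Equations (24) and (25) can be illustrated with a short sketch. It assumes that times are measured in minutes, so that dividing by $24 * 60$ expresses the age of an article in days, and that each profile component is a sparse `{name: weight}` vector; the text does not spell out either point, and what $\upsilon$ holds is not defined in this section, so it is treated here as one more sparse vector. All names and sample values are hypothetical.

```python
import math
from datetime import datetime


def recency(current, published):
    """Rec(i) from (24), assuming time differences are taken in minutes."""
    minutes = (current - published).total_seconds() / 60.0
    return minutes / (24 * 60)


def cosine(u, v):
    """Cosine similarity of two sparse {name: weight} vectors."""
    dot = sum(w * v.get(key, 0.0) for key, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def profile_similarity(news, user, gammas=(1.0, 1.0, 1.0)):
    """sim(N_pf, U_pf) from (25) with components rho, kappa, upsilon."""
    components = ("rho", "kappa", "upsilon")
    return sum(g * cosine(news[c], user[c]) for g, c in zip(gammas, components))


rec = recency(datetime(2012, 8, 31, 12, 0), datetime(2012, 8, 30, 0, 0))  # 36 hours old

news_profile = {"rho": {"GuangZhou": 1}, "kappa": {"Yao Ming": 1}, "upsilon": {"basketball": 1}}
user_profile = {"rho": {"GuangZhou": 3}, "kappa": {"Yao Ming": 2}, "upsilon": {"football": 1}}
score = profile_similarity(news_profile, user_profile)
```

In this example the place and key-person components match perfectly and the third component does not match at all, so with all weights set to 1 the score is the count of matching components.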

Let $E$ be a finite set and $f$ a real-valued nondecreasing function defined on the subsets of $E$ that satisfies

$$
f(T \cup \{\varsigma\}) - f(T) \le f(S \cup \{\varsigma\}) - f(S), \tag{26}
$$

where $S \subseteq T$, $S$ and $T$ are two subsets of $E$, and $\varsigma \in E \setminus T$. Such a function $f$ is called a submodular function [30]. That is, the increment in $f$ from adding an element to a larger set $T$ cannot be larger than that from adding the same element to a smaller set $S$. The budgeted maximum coverage problem can be described as follows: given a set of elements $E$, in which each element is associated with an influence and a cost defined over a domain of these elements, and a budget $B$, the goal is to find a subset of $E$ that has the largest influence while the total cost does not exceed the budget $B$. This problem is NP-hard [31]. However, [31] proposed a greedy algorithm that sequentially picks the element that yields the largest possible increase in influence within the cost limit. Submodularity resides in each pick-up step. By the result of [32], submodular functions are closed under nonnegative linear combinations.
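The greedy idea described above can be sketched on a small coverage instance, where the influence of a selection is the number of distinct items it covers (a standard submodular function). This is a simplified illustration that picks the affordable candidate with the largest marginal gain at each step; it is not claimed to reproduce the exact algorithm or approximation guarantee of [31]. The function name, the tie-breaking by sorted name, and the toy candidates are assumptions.

```python
def greedy_budgeted_coverage(candidates, budget):
    """Greedy selection for a coverage objective under a cost budget.

    candidates: {name: (cost, set_of_covered_items)}
    Returns the picked names in order and the set of covered items.
    """
    pool = dict(candidates)
    selected, covered, remaining = [], set(), budget
    while pool:
        affordable = [name for name in sorted(pool) if pool[name][0] <= remaining]
        if not affordable:
            break
        # Marginal gain = number of newly covered items; ties go to the first name in sorted order.
        best = max(affordable, key=lambda name: len(pool[name][1] - covered))
        if not pool[best][1] - covered:
            break  # nothing affordable adds any influence
        selected.append(best)
        covered |= pool[best][1]
        remaining -= pool[best][0]
        del pool[best]
    return selected, covered


# Toy instance (hypothetical): four candidates, budget of 3 cost units.
candidates = {
    "a": (2, {1, 2, 3}),
    "b": (1, {3, 4}),
    "c": (1, {5}),
    "d": (5, {1, 2, 3, 4, 5}),
}
picked, covered_items = greedy_budgeted_coverage(candidates, budget=3)
```

Candidate "d" covers everything but exceeds the budget, so the greedy pass takes "a" first and then spends the remaining unit on "b".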

7.3. Precise Selection: News Selection for Recommendation. In a given news subclass, we observe that most news articles concentrate on similar topics, with minor differences in the major aspects of the corresponding topic. Typically, a reader is interested in some aspects of the given subclass but not all of them. Based on this intuition, our news selection strategy can be described as follows.

Assume that $\mathcal{C}$ denotes the newly published news set, $\mathcal{S}$ represents the selected news set, and $\varsigma$ denotes the news article being selected. After selecting a piece of news $\varsigma$, we must ensure that

(i) the topic diversity should not deviate much in $\mathcal{S}$;
(ii) $\mathcal{S}$ should give more satisfaction to the given user;
(iii) $\mathcal{S}$ should be similar to the general topic in $\mathcal{C} \setminus \mathcal{S}$.

For the above strategies, similar to [3], we define a quality function $q(\mathcal{S})$ to evaluate the value of the currently selected news set $\mathcal{S}$ as follows:

$$
q(\mathcal{S}) = \frac{1}{\binom{|\mathcal{S}|}{2}} \sum_{n_1, n_2 \in \mathcal{S},\, n_1 \ne n_2} -\text{sim}(n_1, n_2)
+ \frac{1}{|\mathcal{S}|} \sum_{n_1 \in \mathcal{S}} \text{sim}(u, n_1)
+ \frac{1}{|\mathcal{C} \setminus \mathcal{S}| \cdot |\mathcal{S}|} \sum_{n_1 \in \mathcal{C} \setminus \mathcal{S}} \sum_{n_2 \in \mathcal{S}} \text{sim}(n_1, n_2), \tag{27}
$$

where $n_1$ and $n_2$ denote news items, $u$ denotes the given user, and the function $\text{sim}(\cdot, \cdot)$ returns the similarity of its two arguments. Equation (27) contains three components, corresponding to the news selection strategies listed above; $q(\mathcal{S})$ balances the contributions of the different components. Suppose $\varsigma$ is the candidate news document; the quality increase can be represented as

$$
I(\varsigma) = q(\mathcal{S} \cup \{\varsigma\}) - q(\mathcal{S}). \tag{28}
$$
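The quality function (27) and the increment (28) can be turned into a small greedy selector. The sketch below is illustrative: it treats the first sum as running over unordered pairs (matching the binomial normalizer), defines $q$ of the empty set as 0, and sets a term to 0 when its normalizer would be zero; these conventions, the function names, and the toy similarities are assumptions not stated in the paper.

```python
from itertools import combinations


def quality(S, C, user_sim, sim):
    """q(S) following (27): diversity within S, user satisfaction, similarity to C without S."""
    S = list(S)
    if not S:
        return 0.0
    rest = [n for n in C if n not in S]
    pairs = list(combinations(S, 2))
    diversity = -sum(sim(a, b) for a, b in pairs) / len(pairs) if pairs else 0.0
    satisfaction = sum(user_sim[n] for n in S) / len(S)
    coverage = (sum(sim(a, b) for a in rest for b in S) / (len(rest) * len(S))) if rest else 0.0
    return diversity + satisfaction + coverage


def greedy_select(C, user_sim, sim, budget):
    """Repeatedly add the candidate with the largest increment I from (28)."""
    S = []
    while len(S) < budget:
        candidates = [n for n in C if n not in S]
        if not candidates:
            break
        base = quality(S, C, user_sim, sim)
        best = max(candidates, key=lambda n: quality(S + [n], C, user_sim, sim) - base)
        S.append(best)
    return S


# Toy data (hypothetical): pairwise news similarities and user-to-news similarities.
PAIR_SIM = {
    frozenset(("n1", "n2")): 0.9,
    frozenset(("n1", "n3")): 0.1,
    frozenset(("n2", "n3")): 0.2,
}


def news_sim(a, b):
    return PAIR_SIM[frozenset((a, b))]


USER_SIM = {"n1": 0.8, "n2": 0.7, "n3": 0.3}
chosen = greedy_select(["n1", "n2", "n3"], USER_SIM, news_sim, budget=2)
```

In this toy case the second pick is "n3" rather than "n2": although the user likes "n2" more, it is nearly a duplicate of "n1", so the diversity term penalizes it.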

The goal is to select a list of recommended news documents that provides the largest possible value within the budget (i.e., the budget can be regarded as the maximum number of articles in the recommended list). We can obtain a list of news documents for each subclass by adopting the greedy selection algorithm. Taking into account the other characteristics of news documents, for example, popularity and recency, the ranking of the selected news articles needs to be adjusted in order to make the recommended results more reasonable. Formally, given a news article $n$, the popularity and the recency can be combined as

$$
n_\phi = \frac{H_{k_n} - H_{\min}}{H_{\max} - H_{\min}} - \frac{\text{Rec}(n) - \text{Rec}_{\min}}{\text{Rec}_{\max} - \text{Rec}_{\min}}, \tag{29}
$$

where $H_{k_n}$ denotes the popularity degree of the subclass to which news $n$ belongs and $\text{Rec}(n)$ can be obtained from (24). From the equation above, we note that the smaller the recency is, the higher the article ranks. Besides, the greater the popularity is, the higher the article ranks. After computing the $n_\phi$ value for the list of recommended articles, we sort these articles with a quicksort algorithm according to $n_\phi$. With this adjustment, the generated ranking emphasizes popularity and freshness while still concentrating on news documents that satisfy the user's preference.
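The ranking adjustment in (29) amounts to min-max normalizing popularity and recency over the recommended list and sorting by their difference. The following sketch uses Python's built-in `sorted` instead of a hand-written quicksort, which yields the same ordering; the function names, the handling of a zero range, and the sample articles are assumptions.

```python
def min_max(value, low, high):
    """Min-max normalization; returns 0.0 when all values are equal (assumption)."""
    return (value - low) / (high - low) if high > low else 0.0


def rerank(articles):
    """Sort articles by n_phi from (29), highest first.

    articles: list of (name, subclass_popularity, recency) tuples.
    """
    pops = [p for _, p, _ in articles]
    recs = [r for _, _, r in articles]

    def n_phi(item):
        _, popularity, rec = item
        return (min_max(popularity, min(pops), max(pops))
                - min_max(rec, min(recs), max(recs)))

    return [name for name, _, _ in sorted(articles, key=n_phi, reverse=True)]


# Hypothetical list: (article, H of its subclass, age in days from (24)).
order = rerank([("c", 0.5, 2.0), ("a", 0.9, 0.5), ("b", 0.1, 0.1)])
```

Article "a" belongs to the most popular subclass and is fairly fresh, so it ranks first; "c" is the oldest and falls to the end despite its mid-level popularity.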

8. Experimental Evaluation

In this section, we provide a comprehensive experimental evaluation to show the efficacy of our proposed news recommendation system. We start with an introduction to a real-world collection obtained from the news and microblog service website SINA. After that, we describe the experimental design and show the results based on the recommendation framework introduced in this paper.

Table 2: Recommendation Micro-F1 (Top 30) of different time periods for different classification based systems.

Range (YM)    #         NB      Cheng   Z. Guo  NEMAH
0908–0908     4,239     0.204   0.242   0.270   0.351
0910–0912     37,910    0.206   0.254   0.268   0.364
1001–1006     75,047    0.227   0.289   0.297   0.403
1007–1107     151,995   0.198   0.271   0.274   0.371
0908–1208     280,737   0.210   0.273   0.284   0.383

Remark: # denotes the number of news articles. Time range 0908 denotes August 2009.

8.1. Real-World Data Set. For the experiments, we gather news data from SINA (http://news.sina.com.cn), where the data collection ranges from August 1, 2009 to August 31, 2012. We also gather the users who comment on these articles and their microblogs from SINA (http://weibo.com), and we preprocess the data by removing microblog messages that are too short (i.e., fewer than 3 words) and the nonactive users (i.e., users who tweeted or retweeted fewer than 10 messages) in order to verify our recommendation performance. After preprocessing, 5,127 users are stored, with 124,301 messages and 280,737 news articles.

8.2. Experiments. Our system has four major components: (1) a module responsible for classifying and clustering news articles; (2) a component for constructing and updating user profiles; (3) hot news subclass prediction based on time-series analysis; and (4) a recommendation component using news clusters and user profiles, accompanied by the subclass popularity factor and recency. From the experimental perspective, we first verify our components. Then we verify our system against state-of-the-art approaches and design a user study.

8.2.1. Classification and Clustering Evaluation. In order to evaluate the performance of the classification and clustering component, we design two experiments.

(1) Classification Comparison. Many classification methods have been proposed in the field of text processing in the past decade. We implement the following three classification methods: the method of Cheng et al. [33], the method of Guo et al. [34], and the Naïve Bayesian (NB) method. Cheng proposed a text classification method based on a refined concept index, and Guo employed a genetic algorithm for classification. Before using the classification module, we must set $\alpha$ in (8) and decide the threshold of feature selection through an offline experiment, as shown in Figure 6, where T-10 denotes that threshold = 10 in feature selection and the F-score is Micro-F1. The performance is best at roughly $\alpha = 0.2$. From the result, we also observe that the thresholds 20, 30, and 40 produce similar results, so we use $T = 20$ in our processing.

Figure 6: $\alpha$ parameter selection via classification (series: T-10, T-20, T-30, T-40). Remark: the $x$-axis is $\alpha$ (0 to 0.7); the $y$-axis is the F1 measure score of our classification using different $\alpha$ (0.85 to 0.89).

Table 2 lists the recommendation evaluation results for the different classification methods. Based on the comparison, our proposed method outperforms the classical Naïve Bayesian method and the methods of Cheng and Guo in terms of the F1 measure. A straightforward explanation for the improvement is that our method uses fewer features, selected by the method we proposed, to represent news articles and implements a series of two-class classifications to mitigate the similarity problem between different classes; the most important reason may be that we incorporate the manually classified key persons into the method.

(2) Clustering Comparison. In reality, we need to cluster the news articles into subclasses every day, even every hour. From our spider software, we know that thousands of news articles arrive per day. $K$-means and hierarchical clustering are the most common clustering algorithms. In order to verify our proposed method, we design the experiment as follows: (1) use 500, 1000, and 1500 as the number of newly published articles for processing; (2) for each scale of dataset, perform classification on these data; (3) perform $K$-means, hierarchical clustering, and our proposed clustering method on these data; (4) perform Top 30 news recommendation; and (5) compute the F1 score for the different clustering-based systems. The comparison of recommendation with different subclass clustering methods is shown in Figure 7.

From the experimental result, we have the following observations: (1) NEMAH achieves a better result than the other methods in terms of F1 score; (2) NEMAH is more stable than the other methods. A straightforward explanation might be that $K$-means clustering needs an initial clustering center for each cluster. Besides, with fewer parameters, our proposed method has stronger generalization and learning ability, without requirements on the size and distribution of the text corpus.

Figure 7: Recommendation performance at different data scales for different clustering-based systems (K-means, Hierarchical, NEMAH). Remark: $n$ denotes the number of news articles for clustering ($n$ = 500, 1000, 1500); the $y$-axis is the F1 score.

8.2.2. User Profile and Subclass Popularity Prediction Evaluation. The user profile is an important factor in a recommendation system and can affect the recommendation result significantly. Our user profile construction includes the following factors: content, place name, and key person. Prior approaches often use the content or similar access patterns to construct the user profile. SCENE [3] used the content, similar access patterns, and entities extracted by GATE [35]. In reality, entities such as place names and key person names are relatively stable over a period. Figure 8 shows the results of using different user profile building methods and subclass popularity prediction methods.

Figure 8: Recommendation F1 score of different profile factors and subclass popularity prediction methods (series: C, C + GATE, C + P + K, C + P + K + E, C + P + K + S; $x$-axis: T5, T10, T20, T30). Remark: C: content; P: place name; K: key person name; GATE: entity names using the GATE tool; E: popularity prediction using three-time exponential smoothing; S: popularity prediction using spectral analysis (employed by NEMAH).

From this result, we observe the following. (1) Our method performs better than using GATE. (2) Recommendation using content only cannot perform well because microblog messages do not contain much content. (3) The spectral analysis employed in subclass popularity prediction performs better than the three-time exponential smoothing method. Although the average performance of spectral analysis is better than that of three-time exponential smoothing, in our other work on time series analysis we found that the periodograms of some subclasses have weaker frequency signals, which tends to cause overfitting with a large number of sine-cosine pairs and yields worse results in these subclasses. SCENE [3] also used a popularity degree, computed as the ratio of the number of users accessing the article to the size of the user pool. However, this method conflicts with freshness. The straightforward reason is that the freshest news may have received few clicks.

Table 3: Diversity evaluation on different recommendation lists.

Methods    T10      T20      T30
Goo        0.5104   0.4320   0.1215
ClickB     0.5231   0.4457   0.1587
Bilinear   0.5024   0.3547   0.1478
Bandit     0.6112   0.3874   0.2674
SCENE      0.6821   0.5747   0.5687
NEMAH      0.7425   0.6941   0.6637

Remark: T@$n$ denotes the recommended result with top-$n$.

8.2.3. Diversity Evaluation. The recommended news list of NEMAH shows great diversity at both the class and subclass levels. Let $R(u)$ be the recommended news list of a user $u$; the diversity for $u$ can be defined as follows:

$$
\text{Diversity} = 1 - \frac{\sum_{i, j \in R(u),\, i \ne j} \text{sim}(i, j)}{(1/2)\, |R(u)|\, (|R(u)| - 1)}, \tag{30}
$$

where $i$ and $j$ are two different news articles in the recommendation list for user $u$ and $\text{sim}(i, j)$ denotes the news profile similarity between news items $i$ and $j$. For this metric, we choose Goo [11] (a collaborative filtering based method), ClickB [4] (a content-based method), Bilinear [14], Bandit [15], and SCENE [3] (a hybrid method using LSH for clustering and a greedy algorithm for news selection) as the comparison baselines. Table 3 shows the diversity results with $|R(u)| = 10$, $|R(u)| = 20$, and $|R(u)| = 30$, in which we use T10 to represent $|R(u)| = 10$.

From Table 3, we can see that our system outperforms the others significantly; the straightforward reason is that we diversify the news not only according to the preference of the user but also according to the distribution of candidate news articles. As the recommendation list grows, the diversity decreases significantly for the baseline methods because they rely too much on the preference of the user.
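The diversity metric in (30) is one minus the average pairwise similarity of the recommended list. The sketch below assumes the sum runs over unordered pairs, which matches the normalizer $(1/2)|R(u)|(|R(u)| - 1)$; the function name, the convention for lists with fewer than two items, and the toy similarities are hypothetical.

```python
from itertools import combinations


def diversity(recommended, sim):
    """Diversity of a recommendation list following (30)."""
    pairs = list(combinations(recommended, 2))  # |R|(|R| - 1) / 2 unordered pairs
    if not pairs:
        return 0.0  # undefined for fewer than two items; 0.0 is a convention chosen here
    return 1.0 - sum(sim(i, j) for i, j in pairs) / len(pairs)


# Toy news-profile similarities (hypothetical).
PAIR_SIM = {
    frozenset(("n1", "n2")): 0.9,
    frozenset(("n1", "n3")): 0.1,
    frozenset(("n2", "n3")): 0.2,
}


def news_sim(a, b):
    return PAIR_SIM[frozenset((a, b))]


d = diversity(["n1", "n2", "n3"], news_sim)
```

Here the three pairwise similarities average to 0.4, so the diversity of the list is 0.6.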

8.2.4. System Accuracy Evaluation. In order to verify the effectiveness of our proposed NEMAH, we implement a recommender system that models the recommendation as a contextual bandit problem [15]. We also implement the SCENE [3] prototype system, which employs LSH (Locality Sensitive Hashing) for news clustering and uses greedy selection for user recommendation. For each method, we select 50 users and provide news recommendation results for them. Figure 9 shows the comparison results for the Top 10, Top 20, and Top 30 news items for each user.

Figure 9: Precision-recall plot of different recommendations. (a) Top 10. (b) Top 20. (c) Top 30. Remark: ∘ (green) denotes the bandit-based recommender, ◻ (blue) denotes the SCENE recommender, and ∗ (red) represents NEMAH.

In the above experiments, we observe that, besides the higher accuracy, the distribution of our system's results is more stable than that of the other approaches.

In reality, if users read only a few news articles every day, many news recommendation systems cannot produce good results for them. Our system can address this problem thanks to the microblog user profile building. Figure 10 shows the comparison results for different user groups over all users (5,127 users). Suppose a reader reads $N$ news articles per day. From this figure, we can see that our proposed system produces reasonable results for nonactive users. SCENE also performs reasonably well. The reason is that NEMAH and SCENE consider the named entities referred to by users. Besides, NEMAH takes into account the popularity degree of a news article.

Figure 10: Comparison of F1 score of different approaches (Goo, ClickB, Bandit, SCENE, NEMAH) for different user groups ($N < 10$, $10 \le N < 30$, $30 \le N < 50$, $N \ge 50$). Remark: $N$ denotes the number of news articles per day read by a user.

8.2.5. A User Study on NEMAH. In order to obtain other evaluation metrics to verify our proposed news recommendation system, we develop a prototype system of NEMAH and design a questionnaire that includes the following questions: (1) satisfaction with the news content; (2) ordering of the recommendation list; (3) preference for the news subclasses; (4) popularity of the news articles; and (5) novelty of the recommendation list. For each question, we define 5 levels for selection, where 1: So Bad; 2: Not Very Well; 3: Average; 4: Good but Needs to Improve; and 5: Excellent. We crawl news articles of the latest three days from several famous news websites as a candidate set for recommendation. Finally, we recruit 50 volunteers, who are required to have a microblog account on the SINA website, to help us complete the questionnaire. We send them the same questionnaire with different recommendation lists every week, three times. The average result of this user study is shown in Figure 11. From the result of the user study, we can see that NEMAH satisfies the requirements represented by our questions for most people.

Figure 11: User study on different metrics (Satisfaction, Ordering, Preference, Popular, Novelty) for ClickB and NEMAH. Remark: the $y$-axis is the average score (0 to 5).

9 Conclusion

In this paper, we proposed the NEMAH system architecture to tackle personalized news recommendation based on microblog and subclass popularity prediction. We explore the intrarelations among microblog content and news items and synthetically consider the subclass popularity factor, the similarity among users, and the place and key person factors. Our system supports effective classification and subclass clustering of newly published news articles with the help of a small history corpus. High-quality classification and clustering construct a better data structure for recommending. Experimental results compared with several state-of-the-art algorithms demonstrate the better performance of NEMAH. Besides, our work in Sections 4 and 5 can be utilized for automatic module layout and channel ranking.

For future work, to process massive numbers of network news articles, we plan to deploy some components (e.g., classification, clustering, and subclass popularity analysis) onto the MapReduce framework on our distributed system. Moreover, we also plan to integrate the subclass popularity prediction module into our news search engine, given its effectiveness in our work. Another notable direction is the evolution of users' interests (e.g., with time, place, and other factors), which can provide insights into news reading behaviors.

Disclosure

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee, provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers, or to redistribute to lists requires prior specific permission and/or a fee.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

References

[1] O. Phelan, K. McCarthy, and B. Smyth, "Using twitter to recommend real-time topical news," in Proceedings of the 3rd ACM Conference on Recommender Systems, pp. 385–388, ACM, October 2009.

[2] G. de Francisci Morales, A. Gionis, and C. Lucchese, "From chatter to headlines: Harnessing the real-time web for personalized news recommendation," in Proceedings of the 5th ACM International Conference on Web Search and Data Mining (WSDM '12), pp. 153–162, ACM, February 2012.

[3] L. Li, D. Wang, T. Li, D. Knox, and B. Padmanabhan, "SCENE: A scalable two-stage personalized news recommendation system," in Proceedings of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '11), pp. 125–134, July 2011.

[4] J. Liu, P. Dolan, and E. R. Pedersen, "Personalized news recommendation based on click behavior," in Proceedings of the 15th ACM International Conference on Intelligent User Interfaces (IUI '10), pp. 31–40, ACM, February 2010.

[5] J. Schafer, J. Konstan, and J. Riedi, "Recommender systems in e-commerce," in Proceedings of the 1st ACM Conference on Electronic Commerce, pp. 158–166, ACM, 1999.


[6] D. Billsus and M. J. Pazzani, "Personal news agent that talks, learns and explains," in Proceedings of the 3rd Annual Conference on Autonomous Agents, pp. 268–275, ACM, May 1999.

[7] M. Pazzani and D. Billsus, "The identification of interesting web sites," Machine Learning, vol. 27, no. 3, pp. 313–331, 1997.

[8] U. Shardanand and P. Maes, "Social information filtering: algorithms for automating 'word of mouth'," in Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pp. 210–217, ACM, May 1995.

[9] R. G. Cota, A. A. Ferreira, C. Nascimento, M. A. Goncalves, and A. H. F. Laender, "An unsupervised heuristic-based hierarchical method for name disambiguation in bibliographic citations," Journal of the American Society for Information Science and Technology, vol. 61, no. 9, pp. 1853–1870, 2010.

[10] J. Breese, D. Heckerman, and C. Kadie, "Empirical analysis of predictive algorithms for collaborative filtering," in Proceedings of the 14th Conference on Uncertainty in Artificial Intelligence, pp. 43–52, Morgan Kaufmann Publishers, 1998.

[11] A. S. Das, M. Datar, A. Garg, and S. Rajaram, "Google news personalization: Scalable online collaborative filtering," in Proceedings of the 16th International World Wide Web Conference (WWW '07), pp. 271–280, ACM, May 2007.

[12] M. Balabanovic and Y. Shoham, "Content-Based, Collaborative Recommendation," Communications of the ACM, vol. 40, no. 3, pp. 66–72, 1997.

[13] M. Claypool, A. Gokhale, T. Miranda, P. Murnikov, D. Netes, and M. Sartin, "Combining content-based and collaborative filters in an online newspaper," in Proceedings of the ACM SIGIR Workshop on Recommender Systems, vol. 60, Citeseer, 1999.

[14] W. Chu and S. T. Park, "Personalized recommendation on dynamic content using predictive bilinear models," in Proceedings of the 18th International Conference on World Wide Web, pp. 691–700, ACM, 2009.

[15] L. Li, W. Chu, J. Langford, and R. E. Schapire, "A contextual-bandit approach to personalized news article recommendation," in Proceedings of the 19th International Conference World Wide Web (WWW '10), pp. 661–670, ACM, April 2010.

[16] C. Best, E. van der Goot, M. de Paola, T. Garcia, and D. Horby, "Europe media monitor – emm," JRC Technical Note no. I 2, 2002.

[17] E. Gabrilovich, S. Dumais, and E. Horvitz, "Newsjunkie: providing personalized newsfeeds via analysis of information novelty," in Proceedings of the 13th International Conference World Wide Web (WWW '04), pp. 482–490, ACM, May 2004.

[18] G. Wang, F. H. Lochovsky, and Q. Yang, "Feature selection with conditional mutual information MaxiMin in text categorization," in Proceedings of the 13th ACM Conference on Information and Knowledge Management (CIKM '04), pp. 342–349, ACM, November 2004.

[19] C. Lee and G. G. Lee, "Information gain and divergence-based feature selection for machine learning-based text categorization," Information Processing and Management, vol. 42, no. 1, pp. 155–165, 2006.

[20] A. M. A. Mesleh, "Chi square feature extraction based svms arabic language text categorization system," Journal of Computer Science, vol. 3, no. 6, pp. 430–435, 2007.

[21] Z. Wei, D. Miao, J. H. Chauchat, and C. Zhong, "Feature selection on chinese text classification using character N-grams," in Rough Sets and Knowledge Technology, vol. 5009 of Lecture Notes in Computer Science, pp. 500–507, Springer, Berlin, Germany, 2008.

[22] Y. S. Lai and C. H. Wu, "Meaningful term extraction and discriminative term selection in text categorization via unknown-word methodology," ACM Transactions on Asian Language Information Processing (TALIP), vol. 1, no. 1, pp. 34–64, 2002.

[23] R. Rifkin and A. Klautau, "In defense of one-vs-all classification," The Journal of Machine Learning Research, vol. 5, pp. 101–141, 2004.

[24] A. E. Hoerl and R. W. Kennard, "Ridge regression: Biased estimation for nonorthogonal problems," Technometrics, vol. 42, no. 1, pp. 80–86, 2000.

[25] F. Wang and C. Zhang, "Feature extraction by maximizing the average neighborhood margin," in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR '07), June 2007.

[26] X. Zhang, W. Wang, K. Nørvag, and M. Sebag, "K-AP: generating specified K clusters by efficient affinity propagation," in Proceedings 10th IEEE International Conference on Data Mining (ICDM '10), pp. 1187–1192, IEEE, December 2010.

[27] J. D. Hamilton, Time Series Analysis, vol. 2, Cambridge University Press, Cambridge, UK, 1994.

[28] F. Abel, Q. Gao, G.-J. Houben, and K. Tao, "Analyzing user modeling on Twitter for personalized news recommendations," in User Modeling, Adaption and Personalization, vol. 6787 of Lecture Notes in Computer Science, pp. 1–12, Springer, Berlin, Germany, 2011.

[29] S. Gauch, M. Speretta, A. Chandramouli, and A. Micarelli, "User profiles for personalized information access," The Adaptive Web, Springer, Berlin, Germany, vol. 4321, pp. 54–89, 2007.

[30] G. L. Nemhauser, L. A. Wolsey, and M. L. Fisher, "An analysis of approximations for maximizing submodular set functions-I," Mathematical Programming, vol. 14, no. 1, pp. 265–294, 1978.

[31] S. Khuller, A. Moss, and J. Naor, "The budgeted maximum coverage problem," Information Processing Letters, vol. 70, no. 1, pp. 39–45, 1999.

[32] J. Leskovec, A. Krause, C. Guestrin, C. Faloutsos, J. Vanbriesen, and N. Glance, "Cost-effective outbreak detection in networks," in Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 420–429, ACM, August 2007.

[33] X. Cheng, S. Tan, and L. Tang, "Using dragpushing to refine concept index for text categorization," Journal of Computer Science and Technology, vol. 21, no. 4, pp. 592–596, 2006.

[34] Z. Guo, L. Lu, S. Xi, and F. Sun, "An effective dimension reduction approach to chinese document classification using genetic algorithm," in Advances in Neural Networks – ISNN 2009, vol. 5552 of Lecture Notes in Computer Science, pp. 480–489, Springer, Berlin, Germany, 2009.

[35] H. Cunningham, D. Maynard, K. Bontcheva, and V. Tablan, "Gate: an architecture for development of robust hlt applications," in Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, pp. 168–175, Association for Computational Linguistics, 2002.


(ii) A Subclass Popularity Prediction Method for News Recommender System (See Section 5). Users not only like reading the news articles they are interested in but also like hot news; we call this phenomenon the users' social preference. In general, it is difficult for a real-time news recommendation system to instantly obtain statistics on global user attention to a specific piece of news or subclass. Therefore, we synthetically analyze the historical data crawled from the web and propose a news subclass popularity prediction model based on spectral analysis of time series.

(iii) A Novel Application Using Microblog for User Profile Construction (See Section 6). Microblog is the most mainstream form of grassroots media, where users can express their views and retweet the information they agree with or are interested in. In this paper, we propose a user profile construction method based on microblog content and user behavior.

The rest of this paper is organized as follows. Section 2 covers related work relevant to personalized news recommendation. Section 3 describes the recommendation framework of NEMAH. Section 4 presents the classification and subclass clustering methods we design. Section 5 introduces the news subclass popularity prediction model. Section 6 reports the user profile construction method we put forward, and Section 7 introduces the recommending model. Extensive experimental results are reported in Section 8. Finally, Section 9 concludes this paper.

2. Related Work

The news recommender system is an important application of recommendation and has attracted more and more attention recently. Existing news recommendation methods can be roughly divided into three categories: content-based, collaborative filtering, and hybrid methods.

Content-Based. This method uses the content of the user's reading history to recommend similar items. Schafer et al. [5] called this the Item-to-Item Correlation method. In news recommenders, a news article is generally represented as a vector space model (VSM) or as topic distributions. Reference [6] employed TF-IDF to construct the VSM and utilized the $K$-nearest neighbor method to recommend news to a specific user. Reference [7] employed the Naïve Bayesian classifier to classify web pages and construct user profiles. Liu et al. [4] (called ClickB in the Experimental Evaluation section) proposed a recommendation method using news content based on click behavior. In our work, we classify news articles by VSM and express the articles with a TF-IDF weight for each word. The content-based method is easy to express and implement. But it should be noted that not all data are easy to express as VSM, such as audio, image, and video news data [8]. Another problem is content similarity; for example, a user would not like to read similar news many times from a news recommender using the content-based method. In our work, we diversify news articles according to the distribution of crawled news articles and the preference of the given user.
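The TF-IDF representation mentioned above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `tf_idf_vectors` and `cosine`, the toy documents, and the particular weighting (relative term frequency times $\log(N/df)$) are assumptions chosen for brevity, and other TF-IDF variants are equally common.

```python
import math
from collections import Counter


def tf_idf_vectors(docs):
    """Return one {term: weight} dict per tokenized document (hypothetical helper)."""
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Toy corpus: two law-related articles and one sports article.
docs = [
    ["court", "law", "residence", "law"],
    ["court", "law", "judge"],
    ["football", "match", "goal"],
]
vecs = tf_idf_vectors(docs)
sim_law = cosine(vecs[0], vecs[1])
sim_sports = cosine(vecs[0], vecs[2])
```

In this toy corpus the two law articles share weighted terms, so their similarity is positive, while the sports article shares no term with the first article and gets similarity zero.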

Collaborative Filtering. This method utilizes the behaviors of users on items for recommendation. In other words, the collaborative filtering method is content-free and can be roughly divided into two subcategories: heuristic-based and model-based. For the former, the recommendation process is inspired by real-world phenomena [9]. The latter trains a model for predicting the utility of the current user $u$ on item $j$, such as [10, 11] (called Goo in the Experimental Evaluation section). Purchasing and rating are the most important behaviors in collaborative filtering recommendation systems. But in a news recommender system, the rating can be seen as binary, where a click on a piece of news is represented as rating 1 and rating 0 otherwise [11]. The success of a collaborative filtering-based recommendation system relies on the availability of many users and items. But many users have behaviors on only a few items, so the user-item matrix is a sparse matrix, which leads to poor recommendation [12]. One way to solve this problem is to use the demographics of users, such as age, gender, education, area, or employment, to calculate the similarities between users. Another approach employs the behaviors arising from relationships among users, such as review, retweet, and favorite. In our work, we utilize the microblog information to solve the issue discussed above.

Hybrid Method. This method combines collaborative filtering, content-based methods, and other factors [13]. Many news recommendation methods are hybrid, such as Bilinear [14], Bandit [15], and SCENE [3], which will be discussed and analysed in the Experimental Evaluation section.

From the perspective of news recommendation, our work is similar to SCENE [3], EMM News Explorer [16], and Newsjunkie [17] in the use of news content and named entities for news recommendation. However, SCENE did not consider the subclass popularity period, EMM News Explorer did not provide personalized recommendation, and Newsjunkie did not address the classification and user profile construction issues as we do.

3. Recommendation Framework

Figure 1 shows the brief framework of our proposed system NEMAH. Recommendation is performed by the following four modules: Classification and Clustering Module, Subclass Popularity Prediction Module, User Profile Module, and Recommendation Module. These major components and the processing flow in our framework are described as follows.

(1) Classification and Clustering Module. News in this module is divided into 23 categories, as customized by the Press and Publication Administration of the People's Republic of China. As key persons (named entities of type person) play an important part in news classification, we propose a hybrid classification method based on a classical classification method and the key persons. A large


Figure 1: The overview of NEMAH framework. Labels in the diagram: history news data, newly published news, microblog data, classification (Section 4.2), clustering (Section 4.3), subclass popularity prediction (Section 5), user profile building (Section 6), rough selection (Section 7.1), precise selection (Section 7.3), recency, and recommendation list.

number of experiments show that adjusting the weight of the key persons in the hybrid classification helps to achieve good news classification performance. After obtaining the rough classes, we cluster the subclasses using a clustering method which maximizes the average of neighborhood points.

(2) Subclass Popularity Prediction Module. Different periods have different popular subclasses. People would like to focus on the popular subclasses rather than spend much time searching for and selecting information. Sometimes even the users themselves have no idea what they want. Therefore, subclass popularity prediction will help users save time and improve their experience of using the recommendation system. In our study of network news, we found that some subclasses exhibit significant periodicity over time. For the popular subclasses, we can assign a higher weight in recommending. In this paper, we use a time series spectral analysis method to predict the popularity of subclasses.

(3) User Profile Module. This module is used to extract the preference model of users. We use the microblogs tweeted or retweeted by users to establish the user profile model representing users' interests. This procedure combines text analysis, text classification, and the extraction of some particular factors (i.e., key person names and place names appearing in the microblogs).

(4) Personalized News Recommendation. We first use the user profile and the subclasses to determine the candidate subclasses. Then we calculate a user's utility on each news item by a greedy strategy and rank the recommended list by the popularity of the news article within its subclass and the news article's recency. Note that, when recommending specific news items with our system, the class and the subclass of the news articles are utilized. Moreover, the other properties of news items, such as freshness (recency) and popularity (subclass popularity prediction), are synthesized into the final recommended ranking list as adjustment factors.
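The time series spectral analysis mentioned for module (2) can be illustrated, under stated assumptions, with a simple periodogram. The sketch below is not the prediction model of Section 5; it only shows how a dominant period might be read off a subclass's daily attention counts. The function name `dominant_period` and the synthetic click series are hypothetical, and NumPy is assumed to be available.

```python
import numpy as np


def dominant_period(series):
    """Return the period (in samples) of the strongest non-constant frequency."""
    x = np.asarray(series, dtype=float)
    x = x - x.mean()                      # remove the constant (zero-frequency) part
    power = np.abs(np.fft.rfft(x)) ** 2   # periodogram up to a constant factor
    freqs = np.fft.rfftfreq(len(x), d=1.0)
    k = 1 + int(np.argmax(power[1:]))     # skip the zero-frequency bin
    return 1.0 / freqs[k]


# Hypothetical daily click counts for one subclass with a weekly rhythm.
days = np.arange(70)
clicks = 100 + 30 * np.sin(2 * np.pi * days / 7)
period = dominant_period(clicks)
```

For this synthetic series the strongest spectral peak corresponds to a period of about 7 days; a subclass with such a rhythm could then receive a higher weight on the days where its cycle peaks.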

4. Classification and Clustering Module

Classifying massive network news is conducive to the subsequent processing in news applications. Internet news recommendation requires a response as soon as possible to

⟨REC⟩
⟨NewsID⟩ = nf2012010121574
⟨Date⟩ = 20120101
⟨Title⟩ = Our province will implement the new law of registered residence
⟨Author⟩ = Zhangsheng Bo
⟨CSN⟩ = 0215
⟨Class⟩ = Law, Justice
⟨SubClass⟩ = household management
⟨Area⟩ = D440000 Guangdong Province
⟨Source⟩ = Nanfang Daily
⟨KP⟩ = Weifa Liang
⟨Text⟩ = News content
⟨CommentsUser⟩ = 1759918187, 2414113125, 1413475981, 1463256471, 871324394, 1657924191, 2001946341, 1356100372, 1549089713
⟨Tagged⟩ = True

Figure 2: The storage structure of a piece of history news. Remark: the useful elements in this paper are ⟨NewsID⟩, ⟨Date⟩, ⟨Title⟩, ⟨CSN⟩, ⟨Area⟩, ⟨KP⟩, ⟨Text⟩, and ⟨CommentsUser⟩, in which ⟨CommentsUser⟩ lists the users who commented on this news article, ⟨CSN⟩ denotes the class and subclass ID of this news article (e.g., ⟨CSN⟩ = 0215 means that the class ID is 02 and the subclass ID is 15), and ⟨KP⟩ is the named entity of type person, which will be discussed in Section 4.2.

show the recommended list to users. In NEMAH, given a set of news items $N = \{n_1, n_2, \ldots, n_M\}$, where $|N| = M$, our goal is to generate a classification result $C = \{C_1, C_2, \ldots, C_K\}$, where $K$ is a predefined number of classes (e.g., $K = 23$ in this paper). Class names are shown in Table 1. Besides, each class can be divided into several subclasses using our proposed clustering module: $C_i = \{SC_{i1}, SC_{i2}, \ldots, SC_{im}\}$. The storage structures of our history news and of a user are shown in Figures 2 and 3, respectively.
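The history news record of Figure 2 can be mirrored by a small in-memory structure. The following is only a sketch: the class name `NewsRecord`, its field names, and the assumption that ⟨CSN⟩ is always exactly four digits (two for the class, two for the subclass, as in the 0215 example) are illustrative choices, not part of the described system.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class NewsRecord:
    """Hypothetical container for the fields of Figure 2 used in this paper."""
    news_id: str
    date: str
    csn: str                  # assumed: four digits, class ID then subclass ID
    key_persons: List[str]
    comments_users: List[str]

    def class_and_subclass(self) -> Tuple[str, str]:
        """Split a four-digit CSN such as '0215' into ('02', '15')."""
        if len(self.csn) != 4 or not self.csn.isdigit():
            raise ValueError(f"unexpected CSN format: {self.csn!r}")
        return self.csn[:2], self.csn[2:]


record = NewsRecord(
    news_id="nf2012010121574",
    date="20120101",
    csn="0215",
    key_persons=["Weifa Liang"],
    comments_users=["1759918187", "2414113125"],
)
class_id, subclass_id = record.class_and_subclass()
```

Keeping the class and subclass IDs as strings preserves the leading zero of class 02.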

4.1. Feature Selection. In the processing of a text corpus, the dimension of each item can be very large (i.e., more than ten thousand in some cases), so the main features need to be selected to represent the document. Generally, there are three classical feature selection methods in text processing: Mutual Information [18], Information Gain [19],


⟨USER⟩
⟨UserID⟩ = 2414113125
⟨MicroBlog⟩ = MicroBlog content 1, MicroBlog content 2, MicroBlog content 3
⟨CommentsOn⟩ = nf2012010121574, nf2011040722331, nf201012784512

Figure 3: The storage structure of a user. Remark: ⟨MicroBlog⟩ lists the messages tweeted or retweeted by the user; ⟨CommentsOn⟩ denotes the news articles commented on by this user.

Table 1: Name of each class.

ID   Class name
1    Political
2    Law, Justice
3    External Relations, International Relations
4    Military
5    Social, Labor, Disaster
11   Economy
12   Finance, Banking
13   Infrastructure, Construction, Real estate
14   Agriculture, Rural areas
15   Mining, Industrial
16   Energy, Water Conservancy
17   Information industry
18   Transport, Postal services, Logistics
19   Commerce, Foreign trade, Customs
21   Services, Tourism
22   Environmental, Meteorological
31   Education
33   Science and Technology
35   Culture, Recreation and Leisure
36   Literature, Art
37   Media Industry
38   Medicine, Health
39   Sports

and CHI Statistics [20]. These methods are inclined to choose rare words, which are not reliable for classification on some corpora. Therefore, to solve this problem and reduce the computational burden of news article classification, we must filter out sporadic low-frequency words. The two concrete filtering steps are shown below.

(1) Rough Selection Using Document Frequency of Feature Words. In the training corpus, let $t_i$ be a word. We define $Df_i$ as the total relative document frequency, that is, the ratio of the number of documents containing $t_i$ to the total number of documents. When $Df_i$ is greater than a threshold $\alpha$, the word $t_i$ is a high-frequency word in the training corpus and we add it to the set $Tem_1$. For a given class $C_k$, we define $Df_{ik}$ as the class relative document frequency in class $C_k$, that is, the ratio of the number of documents in class $C_k$ containing $t_i$ to the total number of documents in class $C_k$. When $Df_{ik}$ is greater than a given threshold $\beta$, $t_i$ is a high-frequency word in this class and we add it to the set $Tem_2$. According to our experiments and corpus, we roughly set $\alpha = 0.01$ and $\beta = 0.1$ in order to avoid faulty or omitted selections. This rough selection process selects the words which appear frequently both in the whole corpus and in the classes [21]. The result of rough feature selection is $Tem' = Tem_1 \cap Tem_2$.

(2) Precise Selection Using Index of Discrimination between Word and Class. We employ the method of [22] to define the index of discrimination between word and class as follows:

$$R(t_i, C_k) = \frac{P(t_i \in C_k)}{\max_{C_j \neq C_k} P(t_i \in C_j)}, \tag{1}$$

where $P(t_i \in C_k)$ denotes the probability of word $t_i$ in class $C_k$ and $\max_{C_j \neq C_k} P(t_i \in C_j)$ denotes the maximum probability of word $t_i$ in the classes other than $C_k$. $P(t_i \in C_k)$ can be represented as follows:

$$P(t_i \in C_k) = \frac{tf(t_i \in C_k) + 1}{\sum_{t'} tf(t' \in C_k) + 1}, \tag{2}$$

where $tf(t_i \in C_k)$ denotes the frequency of $t_i$ appearing in class $C_k$, $t'$ denotes a word from $Tem'$ other than $t_i$, and $\sum_{t'} tf(t' \in C_k)$ denotes the total frequency of the words $t'$ appearing in class $C_k$. The word $t_i$ is a representative word of class $C_k$ when the index of discrimination $R(t_i, C_k)$ is greater than a threshold $\gamma$. We can use the selection proportion threshold $T$ to decide the parameter $\gamma$, which will be discussed later in the Experimental Evaluation section. We obtain the set of representative words once the process above has been done for each class. The rough selection step saves calculation time, since it excludes the words which are certainly not feature words.
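The two filtering steps above can be sketched as follows. This is an illustrative reading, not the authors' code: the function names `rough_selection` and `discrimination_index` and the toy corpus are hypothetical, and the denominator of Eq. (2) is interpreted as the summed frequency of the other words of $Tem'$ in the class, plus 1.

```python
from collections import Counter, defaultdict


def rough_selection(docs, labels, alpha=0.01, beta=0.1):
    """Tem' = Tem1 & Tem2, using document-frequency thresholds alpha and beta."""
    n_docs = len(docs)
    df_all = Counter()                    # documents containing each word
    df_class = defaultdict(Counter)       # same count, kept per class
    class_size = Counter(labels)
    for doc, label in zip(docs, labels):
        for word in set(doc):
            df_all[word] += 1
            df_class[label][word] += 1
    tem1 = {w for w, n in df_all.items() if n / n_docs > alpha}
    tem2 = {
        w
        for label, counts in df_class.items()
        for w, n in counts.items()
        if n / class_size[label] > beta
    }
    return tem1 & tem2


def discrimination_index(docs, labels, vocab):
    """R(t, C_k) from Eqs. (1)-(2), restricted to words in vocab."""
    tf = defaultdict(Counter)             # term frequency per class
    for doc, label in zip(docs, labels):
        tf[label].update(w for w in doc if w in vocab)

    def prob(word, label):
        # Eq. (2), read as (tf of word + 1) / (summed tf of the other words + 1).
        others = sum(n for w, n in tf[label].items() if w != word)
        return (tf[label][word] + 1) / (others + 1)

    classes = sorted(tf)
    return {
        (w, c): prob(w, c) / max(prob(w, o) for o in classes if o != c)
        for w in vocab
        for c in classes
    }


# Toy corpus with two classes (at least two classes are required).
docs = [
    ["court", "law", "law"],
    ["court", "judge", "law"],
    ["goal", "match", "goal"],
    ["match", "court", "goal"],
]
labels = ["law", "law", "sports", "sports"]
vocab = rough_selection(docs, labels)
R = discrimination_index(docs, labels, vocab)
```

In this toy corpus "law" occurs only in the law class and so receives a much larger index for that class than "court", which also occurs in a sports article. A threshold $\gamma$ between the two values would keep "law" and drop "court" as a representative word.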

4.2. Classifying News Items. In the real Internet world, classification or clustering of massive news data requires lots of computational power. To tackle this issue in news recommendation, we employ the One Versus All method [23] (One Versus All is a two-class classification method) and consider the key persons in news articles. In this paper, news article classification is treated as a set of two-class classification problems. For a class $C_k$, if document $d_i$ belongs to class $C_k$, it is tagged with 1 as a positive sample for class $C_k$; otherwise it is tagged with $-1$ as a negative sample. The method constructs the projective vector $p_k$ relating the text matrix $A$ to the class vector $y$, and we employ the ridge regression method [24] as shown in the following:

$$C = \arg\min_{p_k} \left\| y - p_k^T A \right\|^2 + \theta \left\| p_k \right\|^2, \tag{3}$$

where $\theta$ is a positive parameter used to adjust the estimation error. To solve the minimization problem above, we take the partial derivative with respect to $p_k$ and set it to 0, which yields the equation shown below:

$$p_k = \left( A A^T + \theta I \right)^{-1} A y^T, \tag{4}$$


where $I$ is the identity matrix whose size matches the number of rows of $A$. Because the training set is divided into $K$ categories, we can obtain a group of projective vectors $P = \{p_1, p_2, \ldots, p_K\}$. We utilize a code matrix $M$ to describe the correlation between different classes obtained from the two-class classifications. Assuming that class $C_k$ has $N_k$ training documents $D_{kj}$, where $j \in [1, N_k]$, the element of $M$ which denotes the correlation between two classes can be calculated by

$$M_{kk'} = \frac{1}{N_k} \sum_{j=1}^{N_k} \operatorname{sgn}\left( \langle p_{k'}, D_{kj} \rangle \right), \tag{5}$$

where $p_{k'}$ denotes the projective vector of $C_{k'}$. If $\langle p_{k'}, D_{kj} \rangle$ is greater than 0, the function sgn returns 1, and otherwise 0. When a new article arrives, the similarity between the article and a class can be calculated by the following equation:

$$\operatorname{Sim}(B, C_k) = \sum_{k'=1}^{K} M_{kk'} Q_{k'} = \sum_{k'=1}^{K} M_{kk'} \operatorname{sgn}\left( \langle p_{k'}, B \rangle \right), \tag{6}$$

where $B$ denotes a new article. Finally, we obtain the class of $B$ by maximizing the function $\operatorname{Sim}(\cdot, \cdot)$:

$$C(B) = \arg\max_{C_k} \operatorname{Sim}(B, C_k). \tag{7}$$
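The classification pipeline of Eqs. (3)-(7) can be sketched with NumPy. The sketch rests on several assumptions that are not stated in the text: $A$ is stored with one row per feature and one column per document, the closed-form ridge solution is computed with $A A^T$ in the regularized matrix, and the new article $B$ is represented by its feature vector `b`. The function names and the toy data are hypothetical.

```python
import numpy as np


def train_projections(A, labels, n_classes, theta=1.0):
    """One ridge-regression projection vector per class (one-versus-all).

    A has shape (n_features, n_docs); labels holds class indices 0..K-1.
    """
    d = A.shape[0]
    gram = A @ A.T + theta * np.eye(d)
    P = np.empty((n_classes, d))
    for k in range(n_classes):
        y = np.where(labels == k, 1.0, -1.0)   # +1 for class k, -1 otherwise
        P[k] = np.linalg.solve(gram, A @ y)    # solves (A A^T + theta I) p_k = A y^T
    return P


def code_matrix(A, labels, P):
    """M[k, k'] = fraction of class-k training documents with <p_k', D_kj> > 0."""
    K = P.shape[0]
    M = np.zeros((K, K))
    for k in range(K):
        docs_k = A[:, labels == k]             # columns are documents of class k
        M[k] = (P @ docs_k > 0).mean(axis=1)
    return M


def classify(b, P, M):
    """Return the index k maximizing Sim(b, C_k), following Eqs. (6)-(7)."""
    q = (P @ b > 0).astype(float)              # sgn mapped to {0, 1} as in the text
    sim = M @ q
    return int(np.argmax(sim))


# Toy data: 3 features, 6 documents, 3 classes (2 documents each).
A = np.array([
    [1.0, 0.9, 0.0, 0.1, 0.0, 0.0],
    [0.0, 0.1, 1.0, 0.8, 0.1, 0.0],
    [0.1, 0.0, 0.0, 0.1, 1.0, 0.9],
])
labels = np.array([0, 0, 1, 1, 2, 2])
P = train_projections(A, labels, n_classes=3, theta=0.1)
M = code_matrix(A, labels, P)
pred_first = classify(np.array([0.9, 0.05, 0.0]), P, M)
pred_last = classify(np.array([0.0, 0.1, 0.9]), P, M)
```

Using `np.linalg.solve` instead of forming the inverse explicitly is a numerical convenience; it computes the same projection vector as the closed-form expression.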

In order to further improve the classification accuracy and make rational use of manual labor, we propose a method that considers the key person (named entity of type person) to improve classification when key persons appear, as shown in the following:

$$P(C_i \mid B) = (1 - \alpha) \frac{\operatorname{Sim}(B, C_i)}{\sum_{j=1}^{K} \operatorname{Sim}(B, C_j)} + \alpha P(C_i \mid B_k) P(B_k \mid B), \tag{8}$$

where $\operatorname{Sim}(B, C_i) / \sum_{j=1}^{K} \operatorname{Sim}(B, C_j)$ denotes the probability score of article $B$ on class $C_i$ obtained by the method mentioned above, $B_k$ denotes a key person appearing in the article $B$, and $P(B_k \mid B) = 1$ when $B_k$ appears in $B$; otherwise $P(B_k \mid B) = 0$. In other words, if no key person appears in a new article, the key person factor cannot be applied to it. $\alpha$ is the parameter balancing these two methods. $P(C_i \mid B_k)$ is computed as

$$P(C_i \mid B_k) = \frac{P(C_i)\, p(B_k \mid C_i)}{\sum_{j=1}^{K} P(C_j)\, p(B_k \mid C_j)}. \tag{9}$$
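A small numeric sketch may help to see how Eqs. (8) and (9) interact. All numbers below are made up for illustration, the function names are hypothetical, and the normalization in Eq. (9) is assumed to run over the $K$ classes.

```python
def key_person_posterior(prior, likelihood):
    """Eq. (9): P(C_i | B_k) from class priors P(C_i) and p(B_k | C_i)."""
    joint = [p * l for p, l in zip(prior, likelihood)]
    total = sum(joint)
    return [j / total for j in joint]


def blended_scores(sim, posterior, alpha, has_key_person):
    """Eq. (8): mix the normalized similarity with the key-person posterior."""
    total = sum(sim)
    indicator = 1.0 if has_key_person else 0.0   # plays the role of P(B_k | B)
    return [
        (1 - alpha) * s / total + alpha * p * indicator
        for s, p in zip(sim, posterior)
    ]


# Hypothetical numbers for K = 3 classes.
sim = [2.0, 2.0, 1.0]                 # Sim(B, C_i): classes 0 and 1 are tied
prior = [0.5, 0.3, 0.2]               # P(C_i)
likelihood = [0.01, 0.20, 0.01]       # p(B_k | C_i): person mostly seen in class 1
posterior = key_person_posterior(prior, likelihood)
scores = blended_scores(sim, posterior, alpha=0.3, has_key_person=True)
best = max(range(len(scores)), key=scores.__getitem__)
```

Here the similarity alone cannot separate classes 0 and 1, and the key person term breaks the tie in favor of class 1. When no key person appears, the second term vanishes and the ranking falls back to the normalized similarity.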

4.3. News Subclass Clustering. After obtaining the rough classification results, we need to separate every news class into subclasses $SC_{ix}$. A natural way to detect subclasses of an Internet text corpus is to use clustering, for instance, $K$-means or hierarchical clustering. In NEMAH, we propose a subclass clustering method to obtain subclasses. Each subclass is represented as a subclass vector $T = \{\langle t_1, w_1 \rangle, \langle t_2, w_2 \rangle, \ldots\}$, where $t_i$ and $w_i$ denote a representative word and its corresponding weight, respectively. We call this clustering method the Maximizing Neighborhood method, after the main idea of the algorithm.

(1) Solving Subspace Projection Problem by Maximizing the Average of Neighborhood. For each document $x_i$ in a text space $X_0$, the neighbor documents can be divided into two subsets according to the distance to $x_i$: the similar neighborhood set $\Theta_i^o$ and the heterogeneous neighborhood set $\Theta_i^e$, where $\Theta_i^o$ contains the top $\xi$ nearest neighbors which belong to the same class as $x_i$ and $\Theta_i^e$ contains the top $\zeta$ nearest neighbors which do not belong to the same class as $x_i$. For each data point $x_i$, the average out-of-class distance and the average within-class distance can be expressed as follows:

$$P_i = \sum_{x_p \in \Theta_i^e} \frac{\left\| x_i - x_p \right\|^2}{\left| \Theta_i^e \right|}, \qquad Q_i = \sum_{x_q \in \Theta_i^o} \frac{\left\| x_i - x_q \right\|^2}{\left| \Theta_i^o \right|}. \tag{10}$$

Summing over all data points in the text corpus, the average out-of-class distance and the average within-class distance are expressed as follows:

$$P = \sum_i P_i = \sum_i \sum_{x_p \in \Theta_i^e} \frac{\left\| x_i - x_p \right\|^2}{\left| \Theta_i^e \right|}, \qquad Q = \sum_i Q_i = \sum_i \sum_{x_q \in \Theta_i^o} \frac{\left\| x_i - x_q \right\|^2}{\left| \Theta_i^o \right|}. \tag{11}$$

The subclass clustering problem can be considered as a projection of the text space onto a subspace. For instance, let $y_i$ be the projection of $x_i$; after projecting, we can express $y_i = W^T x_i$. The principle of this projection is to maximize the average distance between different classes and minimize the average distance within each class [25], as shown in the following:

$$\begin{aligned}
r &= \sum_i \left( \sum_{x_p \in \Theta^e_i} \frac{\|x_i - x_p\|^2}{|\Theta^e_i|} - \sum_{x_q \in \Theta^o_i} \frac{\|x_i - x_q\|^2}{|\Theta^o_i|} \right) \\
&= tr\left[ W^T \left( \sum_i \sum_{x_p \in \Theta^e_i} \frac{\|x_i - x_p\|^2}{|\Theta^e_i|} - \sum_i \sum_{x_q \in \Theta^o_i} \frac{\|x_i - x_q\|^2}{|\Theta^o_i|} \right) W \right] \\
&= tr\left[ W^T (P - Q) W \right],
\end{aligned} \tag{12}$$

where $tr(\cdot)$ denotes the trace of a matrix, and the constraint of this equation is $W^T W = I$. We then maximize this quantity as follows:

$$\max \; tr\left[ W^T (P - Q) W \right]. \tag{13}$$
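As a hedged illustration of (13): if $P - Q$ is taken to be a symmetric $d \times d$ matrix (an assumption made here, for example scatter matrices built from neighbor differences rather than the scalar sums written above), then maximizing $tr[W^T (P - Q) W]$ subject to $W^T W = I$ has a standard solution, namely the eigenvectors belonging to the largest eigenvalues. The function name `max_trace_projection` and the toy matrix are illustrative only.

```python
import numpy as np


def max_trace_projection(M: np.ndarray, k: int) -> np.ndarray:
    """Return W (d x k) with orthonormal columns that maximizes tr(W^T M W).

    M stands in for P - Q and is assumed to be a symmetric d x d matrix.
    """
    M_sym = (M + M.T) / 2.0                    # guard against tiny asymmetries
    eigvals, eigvecs = np.linalg.eigh(M_sym)   # eigenvalues in ascending order
    top = np.argsort(eigvals)[::-1][:k]        # indices of the k largest
    return eigvecs[:, top]


# Toy stand-in for P - Q with known eigenvalues 3, 1, and -2.
M = np.diag([3.0, 1.0, -2.0])
W = max_trace_projection(M, k=2)
score = float(np.trace(W.T @ M @ W))           # sum of the two largest eigenvalues
```

Keeping only directions with large eigenvalues matches the intent of (12): directions where out-of-class spread dominates within-class spread contribute most to the trace.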


(2) The Quick Affinity Propagation Clustering on Subspace. After projecting the initial text vector space into the subspace through the projection matrix, $K$ clusters can be generated by employing $K$-Affinity Propagation ($K$-AP) in the subspace (this method is more suitable for text clustering because it can achieve more reasonable clusters than the traditional clustering methods [26]). Let the similarity of $y_i$ and $y_j$ in the subspace $Y = \{y_1, y_2, \ldots, y_n\}$ be $S = \{s_{ij}\}$; the target of $K$-AP is to find the $K$ real samples (exemplars) $E = \{e_1, e_2, \ldots, e_K\}$ that represent the $K$ classes $C = \{C_1, C_2, \ldots, C_K\}$, and then to maximize the following objective function:

$$\max F\left(\{C_j\}_{j=1}^{K}\right) = \sum_{j=1}^{K} \sum_{y_i \in C_j} s(y_i, e_j), \tag{14}$$

where $e_j$ belongs to $C_j$. The objective function can be transformed into a 0-1 integer programming problem by introducing the binary parameters $B = \{b_{ij} \in \{0, 1\};\ i, j = 1, \ldots, n\}$, as shown in the following:

$$\max F(b_{ij}) = \sum_{i=1}^{n} \sum_{j=1}^{n} b_{ij}\, s(y_i, y_j). \tag{15}$$

Equation (15) has three constraints: (1) $b_{ii} = 1$ if $b_{ji} = 1$; (2) $\sum_{j=1}^{n} b_{ij} = 1$; and (3) $\sum_{i=1}^{n} b_{ii} = K$, where $b_{ij} = 1$ when $y_i$ considers $y_j$ as its sample (exemplar) and $b_{ii} = 1$ when $y_i$ is a sample itself. The first constraint states that $y_i$ is a sample whenever some $y_j$ considers $y_i$ as its sample. The second means that each data point has exactly one sample point. The last means that the number of samples is $K$, which ensures that the $K$-AP method generates $K$ clusters.
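The 0-1 formulation in (15) and its three constraints can be checked mechanically for a candidate assignment. The sketch below is not the $K$-AP message-passing algorithm itself; it only evaluates the objective for a given assignment. The helper name `kap_objective`, the one-dimensional toy points, and the use of negative squared distance as the similarity are assumptions for illustration.

```python
from typing import Sequence

import numpy as np


def kap_objective(S: np.ndarray, exemplar_of: Sequence[int], K: int) -> float:
    """Evaluate objective (15) after checking constraints (1)-(3).

    exemplar_of[i] = j means that point i considers point j as its sample.
    """
    n = S.shape[0]
    B = np.zeros((n, n), dtype=int)
    for i, j in enumerate(exemplar_of):
        B[i, j] = 1                      # constraint (2): exactly one 1 per row
    for i in range(n):
        # constraint (1): a point chosen by anyone must also choose itself
        if B[:, i].any() and B[i, i] != 1:
            raise ValueError(f"point {i} is used as a sample but not by itself")
    if int(np.trace(B)) != K:            # constraint (3): exactly K samples
        raise ValueError("number of samples differs from K")
    return float((B * S).sum())


# Toy data: four points on a line, similarity = negative squared distance.
points = np.array([0.0, 1.0, 5.0, 6.0])
S = -((points[:, None] - points[None, :]) ** 2)
value = kap_objective(S, exemplar_of=[0, 0, 2, 2], K=2)
```

Here points 0 and 2 act as the two samples, and each remaining point is attached to the nearer one.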

(3) Hybrid Learning of Subspace Projection and Clustering on Adaptive Subspace. The class information updated in the subspace clustering process can be utilized as a priori knowledge in the next round of subspace projection, and after several iterations, upon convergence, we can obtain the globally optimal clustering result. The iteration process is as follows:

$$\begin{aligned}
& X_0 \to K\text{-AP} \to L_0 \to \text{SubSpace} \to W_1,\ \text{Score}_1 \\
& Y_1 = W_1^T X_0 \to K\text{-AP} \to L_1 \to \text{SubSpace} \to W_2,\ \text{Score}_2 \\
& \cdots \to \cdots \to \cdots \to \cdots \to \cdots \\
& Y_t = W_t^T X_0 \to K\text{-AP} \to L_t \to \text{SubSpace} \to W_{t+1},\ \text{Score}_{t+1}
\end{aligned}$$

The value of the convergence function must be computed in each iteration:

$$\text{Score}_{t+1} = tr\left[ W_{t+1}^T \left( P(L_t) - Q(L_t) \right) W_{t+1} \right], \tag{16}$$

where $P(L_t)$ and $Q(L_t)$ denote the average distances between classes and within class, which are calculated by (11) according to the class membership indicator matrix $L_t$. The iteration finishes when it meets the convergence condition $\text{Score}_{t+1} - \text{Score}_t \le \epsilon$ or reaches the maximum number of iterations. The parameters of our clustering method are the number of nearest in-class points $\eta$ (the size of the similar neighborhood set, written $\xi$ in step (1)) and the number of nearest out-of-class points $\zeta$. We used cross-validation to tune these parameters, and we found that selecting $\zeta = \eta = 13$ for all classes per 1,000 documents performs better.

Discussion. The motivation of this module (classification and clustering) is to find the user's preference (at the subclass level) and to track the hotness of newly published news in a given subclass.

5. Subclass Popularity Prediction Module

With today's explosion of information, the fast pace of life makes people focus their attention on popular subclasses rather than spend much time searching for and selecting information. Sometimes even users themselves have no idea what they really want. Therefore, hot subclass prediction with a recommendation function has become very important. News subclass popularity prediction can improve the performance of a news recommender system. Besides, it can also improve the display of popular news modules on a website automatically, reduce the workload of website editors, and improve the users' browsing experience.

From the study of historical statistical data on news subclasses, we found that some subclasses are popular periodically. For instance, the subclass college entrance examination becomes highly popular around June every year in China, and a lot of news articles and comments focus on this subclass at that time, as shown in Figure 4 (panel (b) shows the data of the college entrance examination subclass). In this paper, we define a news subclass's degree of concern according to the number of news articles and their comments, as shown in the following:

$$H_k = \lambda H^{(k)}_{ne} + (1 - \lambda) H^{(k)}_{re} = \frac{N^{(k)}_{re}}{N_{ne} + N_{re}} \cdot \frac{N^{(k)}_{ne}}{N_{ne}} + \frac{N^{(k)}_{ne}}{N_{ne} + N_{re}} \cdot \frac{N^{(k)}_{re}}{N_{re}}, \tag{17}$$

where $H^{(k)}_{ne}$ denotes the popularity degree of news articles on the $k$th subclass, $H^{(k)}_{re}$ denotes the popularity degree of comments on the $k$th subclass, $\lambda$ is the weight of the popularity degree of news articles, with value $N^{(k)}_{re}/(N_{ne} + N_{re})$, $N^{(k)}_{re}$ denotes the number of reviews on the $k$th subclass, $N_{re}$ denotes the number of reviews in the whole corpus, $N^{(k)}_{ne}$ denotes the number of news articles on the $k$th subclass, and $N_{ne}$ denotes the number of news articles in the whole corpus. According to time series analysis experiments on our corpus, we found that most subclasses are suitable for the spectral analysis method [27].
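A minimal sketch of the weighted form $H_k = \lambda H^{(k)}_{ne} + (1 - \lambda) H^{(k)}_{re}$ follows. It assumes $H^{(k)}_{ne} = N^{(k)}_{ne}/N_{ne}$ and $H^{(k)}_{re} = N^{(k)}_{re}/N_{re}$, and, as a simplification, it treats $\lambda$ as a free parameter in $[0, 1]$ instead of the corpus-derived value given above. The function name and the counts are hypothetical.

```python
def subclass_popularity(n_ne_k: int, n_ne: int, n_re_k: int, n_re: int,
                        lam: float) -> float:
    """Weighted popularity of subclass k from article and comment shares.

    lam is treated as a free weight in [0, 1] (an assumption of this sketch).
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    h_ne = n_ne_k / n_ne      # share of all news articles in subclass k
    h_re = n_re_k / n_re      # share of all comments in subclass k
    return lam * h_ne + (1.0 - lam) * h_re


# Hypothetical counts: 50 of 1000 articles and 400 of 4000 comments.
h_k = subclass_popularity(n_ne_k=50, n_ne=1000, n_re_k=400, n_re=4000, lam=0.5)
```

Computing $H_k$ once per day for a subclass yields the kind of time series plotted in Figure 4.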

Any stationary sequence can be modeled as a combination of cosine waves with different frequencies, amplitudes, and phases. This analysis method is called frequency-domain analysis. The linear combination of $m$ cosines with arbitrary amplitudes, frequencies, and phases is shown in the following:

$$Y_t = A_0 + \sum_{j=1}^{m} \left[ A_j \cos(2\pi f_j t) + B_j \sin(2\pi f_j t) \right]. \tag{18}$$


(a) Popularity degree of graduate entrance examination subclass ($x$-axis: Days; $y$-axis: Proportion)

(b) Popularity degree of college entrance examination subclass ($x$-axis: Days; $y$-axis: Proportion)

Figure 4: Periodic subclass news distribution. Remark: the $x$-axis denotes the date from August 1, 2009 to May 3, 2012; the $y$-axis denotes the value of $H_k$ in (17).

The values of $A$ and $B$ can be obtained by ordinary least squares regression. When the frequencies take a special form, the calculation becomes very simple. If $n$ is an odd number, which can be expressed as $n = 2k + 1$, then the frequencies of the form $1/n, 2/n, \ldots, k/n$ are called Fourier frequencies. The estimated parameters are as follows:

$$A_0 = \bar{Y}, \qquad A_j = \frac{2}{n} \sum_{t=1}^{n} Y_t \cos\left(\frac{2\pi t j}{n}\right), \qquad B_j = \frac{2}{n} \sum_{t=1}^{n} Y_t \sin\left(\frac{2\pi t j}{n}\right). \tag{19}$$

If the sample size is even, which can be expressed as $n = 2k$, (19) still holds. But the equations change to the following when $f_k = k/n = 1/2$:

$$A_k = \frac{1}{n} \sum_{t=1}^{n} (-1)^t Y_t, \qquad B_k = 0. \tag{20}$$

Definition 1. When the sample size is odd, namely, $n = 2k + 1$, we define the periodogram (cycle diagram) $I$ at frequency $f = j/n$ ($j = 1, 2, \ldots, k$) as shown in the following equation:

$$I\left(\frac{j}{n}\right) = \frac{n}{2} \left( A_j^2 + B_j^2 \right). \tag{21}$$

If the sample size is even, (19) still gives the $A$ and $B$ values, and the periodogram is the same as in the odd case. But at the extreme frequency, that is, when $f = k/n = 1/2$, the periodogram is given by the following equation:

$$I\left(\frac{j}{n}\right) = n \left( A_j \right)^2, \qquad j = k. \tag{22}$$
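Equations (19) to (22) can be computed directly at the Fourier frequencies. The sketch below is a plain NumPy illustration under the assumption that time is indexed $t = 1, \ldots, n$; the function name `periodogram` and the synthetic series are not from the paper.

```python
import numpy as np


def periodogram(y):
    """Return (frequencies, periodogram values) at the Fourier frequencies j/n."""
    y = np.asarray(y, dtype=float)
    n = y.size
    t = np.arange(1, n + 1)
    k = n // 2
    freqs, power = [], []
    for j in range(1, k + 1):
        if n % 2 == 0 and j == k:
            # extreme frequency f = 1/2 for even n: equations (20) and (22)
            a = np.sum((-1.0) ** t * y) / n
            p = n * a ** 2
        else:
            # regular Fourier frequency: equations (19) and (21)
            a = 2.0 / n * np.sum(y * np.cos(2 * np.pi * t * j / n))
            b = 2.0 / n * np.sum(y * np.sin(2 * np.pi * t * j / n))
            p = n / 2.0 * (a ** 2 + b ** 2)
        freqs.append(j / n)
        power.append(p)
    return np.array(freqs), np.array(power)


# Synthetic series with a single cosine of amplitude 3 at frequency 5/100.
t = np.arange(1, 101)
y = 2.0 + 3.0 * np.cos(2 * np.pi * 5 * t / 100)
freqs, power = periodogram(y)
peak_frequency = freqs[np.argmax(power)]
```

For this synthetic input the only nonzero coefficient is $A_5 = 3$, so the periodogram has a single peak at $f = 0.05$.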

The periodogram at frequency $f = j/n$ is proportional to the sum of squares of the corresponding regression coefficients. Therefore, the peaks of the periodogram show the relative strength of the sine-cosine pairs at different frequencies, as shown in Figure 5.

In Figure 5, the periodogram has two peaks, at 0.004970179 and 0.002982107; namely, the subcycle $T = 1/f$ may be 201 days and 335 days. The other peaks are so low that they can be neglected. The two frequencies are selected for building the model, which means that the model has two sine-cosine pairs in it, as shown in the following:

$$Y_t = \beta_0 + \beta_1 \cos(2\pi f_1 t) + \beta_2 \sin(2\pi f_1 t) + \beta_3 \cos(2\pi f_2 t) + \beta_4 \sin(2\pi f_2 t) + e_t. \tag{23}$$

Prediction with the spectral analysis method takes several steps. First, we use the periodogram to obtain the values and the number of strong frequencies. Second, the model is generated from the values and the number of strong frequencies. Finally, we predict future data values according to the model, which requires only a time parameter.
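The second and third steps can be sketched as an ordinary least squares fit of (23) followed by evaluation at future time points. This is an illustrative sketch, not the authors' implementation: the helper names are hypothetical, the series is synthetic and noise-free, and only the two frequencies are taken from the peaks reported for Figure 5.

```python
import numpy as np


def design_matrix(t, freqs):
    """Columns: intercept, then a cosine and a sine column per frequency."""
    t = np.asarray(t, dtype=float)
    cols = [np.ones_like(t)]
    for f in freqs:
        cols.append(np.cos(2 * np.pi * f * t))
        cols.append(np.sin(2 * np.pi * f * t))
    return np.column_stack(cols)


def fit_harmonic_model(y, freqs):
    """Least squares estimate of the beta coefficients in (23), t = 1..n."""
    y = np.asarray(y, dtype=float)
    t = np.arange(1, y.size + 1)
    beta, *_ = np.linalg.lstsq(design_matrix(t, freqs), y, rcond=None)
    return beta


def predict(beta, freqs, t_future):
    """Evaluate the fitted model at future time indices."""
    return design_matrix(t_future, freqs) @ beta


freqs = [0.004970179, 0.002982107]          # strong frequencies from Figure 5
t = np.arange(1, 1001)
# Synthetic daily popularity series built from known coefficients.
y = (0.3 + 0.2 * np.cos(2 * np.pi * freqs[0] * t)
     + 0.1 * np.sin(2 * np.pi * freqs[1] * t))
beta = fit_harmonic_model(y, freqs)
forecast = predict(beta, freqs, np.arange(1001, 1006))
```

Because the fitted model depends only on $t$, forecasting a future day needs no further input once the coefficients are estimated.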

Discussion. The motivation of this module is to obtain the hotness of each subclass. Some recent studies also take into account the popularity of newly published news articles. For example, SCENE [3] used a popularity degree computed as the ratio of the number of users accessing the article to the size of the users' pool. However, for the newest popular news article $n_i$, the number of clicks would be less than that of news articles published several hours or days before.

(a) The complete periodogram ($x$-axis: Frequency, ticks from 0.0 to 0.5; $y$-axis: Energy)

(b) Local amplification of the periodogram ($x$-axis: Frequency, ticks from 0.00 to 0.05; $y$-axis: Energy)

Figure 5: Periodogram of the popularity degree of the college entrance examination subclass. Remark: the $x$-axis denotes the possible frequency of the popularity degree; the $y$-axis represents the energy of the corresponding frequency.

6. User Profile Module

In order to capture a user's reading interest in news items, a personalized news recommendation system generally needs to construct the user's profile. Traditionally, the user profile can be captured by tracking the user's reading history. A survey of various user profile construction techniques is provided in [28, 29]. In this paper, we use the microblog to construct the user's profile. The reason is that a user who is interested in some subclasses will tend to tweet or retweet microblog messages on these subclasses. For instance, if a user tweets or retweets many messages about basketball games, we can deduce that this user may like reading basketball news reports (e.g., NBA, CBA, etc.). Besides, many readers tend to glance at news articles and are interested in some key persons' names. Moreover, people from different areas tend to read the news from the city they live in or from their hometown. Based on the above analysis, we propose to construct users' profiles by exploring the factors discussed above: microblog content, place name, and preferred key persons. To reduce the computational complexity, the user's preference is represented in our model by a vector $U_{pf} = \{\tau, \rho, \kappa\}$. Consider the following:

(1) $\tau$ represents the distribution of key index words in the microblogs that the user tweeted or retweeted in the past, and it can be expressed as a vector $\{\langle t_1, w_1\rangle, \langle t_2, w_2\rangle, \ldots\}$, where each element consists of an index word and its corresponding weight.

(2) $\rho$ represents the place names that appeared in the microblogs of a specific user, and it can be expressed as $\{\langle p_1, w_1\rangle, \langle p_2, w_2\rangle, \ldots, \langle p_i, w_i\rangle\}$, where $p_i$ denotes a place name and $w_i$ denotes the number of times this place appears in the tweets of the given user. We collect all the city and province names in China. Some place names are subordinate to others; for instance, GuangZhou city is subordinate to GuangDong province. In this case, the system adds weight to GuangDong using $w_{\text{GuangDong}} \mathrel{+}= w_{\text{GuangZhou}}$ when GuangZhou appears.

(3) $\kappa$ represents the list of key persons' names extracted from the user's microblogs, $\{\langle k_1, w_1\rangle, \langle k_2, w_2\rangle, \ldots\}$, where the name list is constructed from the NanFang Daily training corpus, in which the key persons' names have been tagged in each news article.
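The place-name component $\rho$ in item (2) can be sketched as a counter with one level of upward propagation from city to province. The function name `place_profile`, the substring matching, and the sample tweets are assumptions for illustration; a real system would use a proper gazetteer and tokenizer.

```python
from collections import Counter


def place_profile(tweets, places, parent_of):
    """Count place mentions, then add each city's weight to its province.

    places: known place names (cities and provinces).
    parent_of: maps a city to its province; one hierarchy level is assumed.
    """
    weights = Counter()
    for text in tweets:
        for place in places:
            weights[place] += text.count(place)
    # snapshot city counts first so propagation uses direct mentions only
    city_counts = {city: weights[city] for city in parent_of}
    for city, province in parent_of.items():
        weights[province] += city_counts[city]
    return {p: w for p, w in weights.items() if w > 0}


tweets = [
    "GuangZhou is hot today",
    "Visiting GuangZhou and the GuangDong museum",
    "ShenZhen tech fair",
]
places = {"GuangZhou", "ShenZhen", "GuangDong", "BeiJing"}
parent_of = {"GuangZhou": "GuangDong", "ShenZhen": "GuangDong"}
rho = place_profile(tweets, places, parent_of)
```

In this toy input GuangDong is mentioned once directly and inherits three more mentions from its two cities.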

7. Personalized News Recommendation Module

The recommendation module can be divided into two steps: Rough Selection (see Section 7.1) and Precise Selection (see Section 7.3). In the first step, some subclasses are matched according to the user's preference. Then we select news articles from these subclasses using our selection strategy.

7.1. Rough Selection: Subclass Selection for a User. Once we obtain the subclasses and the user's profile, we can calculate the similarity between each subclass and a given user. We use TF-IDF weights to represent the vector of a given subclass, $\tau_s = \{\langle t_1, w_1\rangle, \langle t_2, w_2\rangle, \ldots\}$. The similarity between a subclass and a user (represented as $\tau_u = \{\langle t_1, w_1\rangle, \langle t_2, w_2\rangle, \ldots\}$ in $U_{pf}$; see Section 6) is computed by cosine similarity. In general, users tend to prefer some particular subclasses; that is, they are not interested in all subclasses. Therefore, we can roughly select some subclasses with a similarity threshold. This threshold is set at the 30% point of the ranking of all similarity scores with respect to a given user.
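The rough selection step can be sketched with sparse term-weight dictionaries and cosine similarity. Reading the threshold as "keep the top 30% of subclasses by similarity" is an assumption of this sketch, as are the function names and the toy vectors.

```python
import math


def cosine(a, b):
    """Cosine similarity of two sparse vectors stored as {term: weight}."""
    dot = sum(w * b.get(term, 0.0) for term, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rough_select(user_vec, subclass_vecs, keep_fraction=0.3):
    """Return the names of the best-matching subclasses for one user."""
    scored = sorted(
        ((cosine(user_vec, vec), name) for name, vec in subclass_vecs.items()),
        reverse=True,
    )
    keep = max(1, math.ceil(keep_fraction * len(scored)))
    return [name for _, name in scored[:keep]]


user_vec = {"basketball": 2.0, "nba": 1.0}
subclass_vecs = {
    "sports": {"basketball": 1.0, "nba": 1.0},
    "cba": {"basketball": 1.0, "cba": 1.0},
    "finance": {"stock": 1.0},
    "exam": {"exam": 1.0, "college": 1.0},
}
selected = rough_select(user_vec, subclass_vecs)
```

With four candidate subclasses and a 30% cut, two subclasses are retained for the precise selection stage.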

7.2. News Profile Construction. After obtaining the news clusters that a user might be interested in, the next step is to select specific news articles for the given user. Similar to the user, we initially maintain a news profile for each news article and then model the recommendation as a budgeted maximum coverage problem and solve it by a greedy selection algorithm. A news profile contains many similar factors, for example, key person, place, the cluster it belongs to, recency, popularity, and so forth. For the popularity, as discussed above, we use $H_k$ to represent the popularity degree of cluster $k$. For the recency, the score is represented as follows:

$$\text{Rec}(i) = \frac{i_c - i_p}{24 \times 60}, \tag{24}$$

where the function $\text{Rec}(i)$ returns the recency score of news article $i$, and $i_c$ and $i_p$ denote the current time and the published time, respectively.

In this paper, news profiles help to evaluate how well a news article can satisfy the user. Given a news profile $N_{pf} = \{\rho, \kappa, \upsilon\}$ and a user's profile $U_{pf} = \{\rho, \kappa, \upsilon\}$, the similarity between $N_{pf}$ and $U_{pf}$ is computed as

$$\text{sim}(N_{pf}, U_{pf}) = \gamma_1\, \text{sim}(\rho_n, \rho_u) + \gamma_2\, \text{sim}(\kappa_n, \kappa_u) + \gamma_3\, \text{sim}(\upsilon_n, \upsilon_u), \tag{25}$$

where $\gamma_1$, $\gamma_2$, and $\gamma_3$ are parameters that control how much we trust or weigh the corresponding components and are set to 1 in our system. Each component is calculated by the cosine similarity.
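Equations (24) and (25) can be sketched together. The sketch assumes that times in (24) are measured in minutes, so that dividing by $24 \times 60$ yields days; the dictionary keys `rho`, `kappa`, and `upsilon` simply mirror the symbols in (25), and all names and sample values are hypothetical.

```python
import math
from datetime import datetime


def recency(current: datetime, published: datetime) -> float:
    """Recency score of (24), assuming times are measured in minutes."""
    minutes = (current - published).total_seconds() / 60.0
    return minutes / (24 * 60)


def cosine(a, b):
    """Cosine similarity of two sparse vectors stored as {key: weight}."""
    dot = sum(w * b.get(key, 0.0) for key, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def profile_similarity(news_pf, user_pf, gammas=(1.0, 1.0, 1.0)):
    """Weighted sum of component-wise cosine similarities, as in (25)."""
    components = ("rho", "kappa", "upsilon")
    return sum(g * cosine(news_pf[c], user_pf[c])
               for g, c in zip(gammas, components))


news_pf = {"rho": {"GuangZhou": 1}, "kappa": {"Yao Ming": 1}, "upsilon": {"nba": 1}}
user_pf = {"rho": {"GuangZhou": 3},
           "kappa": {"Yao Ming": 1, "Liu Xiang": 1},
           "upsilon": {"stock": 1}}
similarity = profile_similarity(news_pf, user_pf)
age_days = recency(datetime(2012, 8, 31, 12, 0), datetime(2012, 8, 30, 0, 0))
```

Setting all three weights to 1, as in the text, makes each profile component contribute equally.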

Let $E$ be a finite set and $f$ a real-valued nondecreasing function defined on the subsets of $E$ that satisfies

$$f(T \cup \{\varsigma\}) - f(T) \le f(S \cup \{\varsigma\}) - f(S), \tag{26}$$

where $S \subseteq T$, $S$ and $T$ are two subsets of $E$, and $\varsigma \in E \setminus T$. Such a function $f$ is called a submodular function [30]. That is, adding an element to a larger set $T$ cannot increase the value of $f$ by more than adding the same element to a smaller set $S$. The budgeted maximum coverage problem can be described as follows: given a set of elements $E$, in which each element is associated with an influence and a cost defined over a domain of these elements, and a budget $B$, the goal is to find a subset of $E$ that has the largest influence while the total cost does not exceed the budget $B$. This problem is NP-hard [31]. However, [31] proposed a greedy algorithm that sequentially picks the element yielding the largest possible increase in influence within the cost limit. Submodularity resides in each pick-up step. By the result of [32], submodular functions are closed under nonnegative linear combinations.

7.3. Precise Selection: News Selection for Recommendation. In a given news subclass, we observe that most news articles concentrate on a similar topic, with minor differences in the major aspects of that topic. Typically, a reader is interested in some aspects of the given subclass but not all of them. Based on this intuition, our news selection strategy can be described as follows.

Assume that $\mathcal{C}$ denotes the newly published news set, $\mathcal{S}$ represents the selected news set, and $\varsigma$ denotes the news article being selected. After selecting a piece of news $\varsigma$, we must ensure that

(i) the topic diversity should not deviate much in $\mathcal{S}$;
(ii) $\mathcal{S}$ should give more satisfaction to the given user;
(iii) $\mathcal{S}$ should be similar to the general topic in $\mathcal{C} \setminus \mathcal{S}$.

For each of the above strategies, similar to [3], we define a quality function $q(\mathcal{S})$ to evaluate the value of the currently selected news set $\mathcal{S}$ as follows:

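$$q(\mathcal{S}) = \frac{1}{\binom{|\mathcal{S}|}{2}} \sum_{n_1, n_2 \in \mathcal{S},\, n_1 \ne n_2} -\text{sim}(n_1, n_2) + \frac{1}{|\mathcal{S}|} \sum_{n_1 \in \mathcal{S}} \text{sim}(u, n_1) + \frac{1}{|\mathcal{C} \setminus \mathcal{S}| \cdot |\mathcal{S}|} \sum_{n_1 \in \mathcal{C} \setminus \mathcal{S}} \sum_{n_2 \in \mathcal{S}} \text{sim}(n_1, n_2), \tag{27}$$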

where $n_1$ and $n_2$ denote news items, $u$ denotes the given user, and the function $\text{sim}(\cdot, \cdot)$ returns the similarity of its two parameters. Equation (27) contains three components corresponding to the news selection strategies listed above; $q(\mathcal{S})$ balances the contributions of the different components. Suppose $\varsigma$ is the candidate news document; the quality increase can be represented as

$$I(\varsigma) = q(\mathcal{S} \cup \{\varsigma\}) - q(\mathcal{S}). \tag{28}$$
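The quality function (27) and the increment (28) can drive a simple greedy loop. The following is a minimal sketch under several assumptions: every article has unit cost, so the budget is just a count; the first sum in (27) is taken over unordered pairs; empty terms (a single selected article, or nothing left outside the selection) are treated as 0; and Jaccard similarity over toy word sets stands in for the profile similarity. All names are hypothetical.

```python
from itertools import combinations


def quality(S, C, user, sim):
    """Quality q(S) of (27); terms with an empty sum are treated as 0."""
    if not S:
        return 0.0
    rest = [n for n in C if n not in S]
    pairs = list(combinations(S, 2))
    diversity = -sum(sim(a, b) for a, b in pairs) / len(pairs) if pairs else 0.0
    satisfaction = sum(sim(user, n) for n in S) / len(S)
    coverage = (sum(sim(a, b) for a in rest for b in S) / (len(rest) * len(S))
                if rest else 0.0)
    return diversity + satisfaction + coverage


def greedy_select(C, user, sim, budget):
    """Repeatedly add the candidate with the largest increment (28)."""
    S = []
    while len(S) < budget:
        candidates = [n for n in C if n not in S]
        if not candidates:
            break
        base = quality(S, C, user, sim)
        best = max(candidates, key=lambda n: quality(S + [n], C, user, sim) - base)
        S.append(best)
    return S


# Toy profiles as word sets; "u" is the user, the rest are articles.
words = {
    "u": {"nba", "finals", "lakers"},
    "a": {"nba", "finals", "lakers"},
    "b": {"nba", "finals", "lakers", "game"},
    "c": {"stock", "market"},
    "d": {"nba", "draft"},
}


def jaccard(x, y):
    return len(words[x] & words[y]) / len(words[x] | words[y])


picked = greedy_select(["a", "b", "c", "d"], "u", jaccard, budget=2)
```

In this toy run the second pick is not the near-duplicate of the first, which illustrates how the diversity term in (27) works against redundant selections.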

The goal is to select a list of recommended news documents that provides the largest possible value within the budget (i.e., the budget can be regarded as the maximum number of articles in the recommended list). We can obtain a list of news documents for each subclass by adopting the greedy selection algorithm. Taking into account the other characteristics of news documents, for example, the popularity and the recency, the ranking of the selected news articles needs to be adjusted in order to make the recommended results more reasonable. Formally, given a news article $n$, the popularity and the recency can be combined as

$$n_\phi = \frac{H_{k_n} - H_{\min}}{H_{\max} - H_{\min}} - \frac{\text{Rec}(n) - \text{Rec}_{\min}}{\text{Rec}_{\max} - \text{Rec}_{\min}}, \tag{29}$$

where $H_{k_n}$ denotes the popularity degree of the subclass to which news article $n$ belongs, and $\text{Rec}(n)$ can be obtained from (24). From the equation above, we note that the smaller the recency score is, the higher the article ranks. Besides, the greater the popularity is, the higher the article ranks. After computing the $n_\phi$ value for each article in the recommended list, we apply a quicksort algorithm to these articles according to $n_\phi$. With this adjustment, the generated ranking emphasizes popularity and freshness while still concentrating on news documents that satisfy the user's preference.
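The re-ranking by (29) can be sketched as min-max normalization followed by a descending sort. Python's built-in sort is used here in place of an explicit quicksort, which yields the same ordering for distinct scores; treating a zero range as a normalized value of 0 is an added assumption, and the tuple layout and sample values are hypothetical.

```python
def rank_by_popularity_and_recency(articles):
    """Order articles by n_phi of (29), highest first.

    articles: list of (article_id, subclass_popularity, recency_score).
    """
    pops = [h for _, h, _ in articles]
    recs = [r for _, _, r in articles]

    def normalize(x, lo, hi):
        # a zero range is mapped to 0 (assumption, avoids division by zero)
        return 0.0 if hi == lo else (x - lo) / (hi - lo)

    scored = [
        (normalize(h, min(pops), max(pops)) - normalize(r, min(recs), max(recs)), aid)
        for aid, h, r in articles
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [aid for _, aid in scored]


ranking = rank_by_popularity_and_recency([
    ("n3", 0.6, 3.0),   # moderately popular subclass, three days old
    ("n2", 0.3, 0.1),   # least popular subclass, very fresh
    ("n1", 0.9, 0.5),   # most popular subclass, half a day old
])
```

An article from a popular subclass that is also fresh ends up first, while an old article is pushed down even if its subclass is moderately popular.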

8. Experimental Evaluation

In this section, we provide a comprehensive experimental evaluation to show the efficacy of our proposed news recommendation system. We start with an introduction to a real-world collection obtained from a news and microblog service website, SINA. After that, we describe the experimental design and show the results based on the recommendation framework introduced in this paper.


Table 2: Recommendation Micro-F1 (Top-30) over different time periods for different classification-based systems.

Range (YM)    #         NB       Cheng    Guo      NEMAH
0908-0908     4239      0.204    0.242    0.270    0.351
0910-0912     37910     0.206    0.254    0.268    0.364
1001-1006     75047     0.227    0.289    0.297    0.403
1007-1107     151995    0.198    0.271    0.274    0.371
0908-1208     280737    0.210    0.273    0.284    0.383

Remark: # denotes the number of news articles. Time range 0908 denotes August 2009.

8.1. Real-World Data Set. For the experiments, we gather news data from SINA (http://news.sina.com.cn), where the data collection ranges from August 1, 2009 to August 31, 2012. We also gather the users who commented on these articles and their microblogs from SINA (http://weibo.com), and we preprocess the data by removing microblog messages that are too short (i.e., fewer than 3 words) and nonactive users (i.e., users who tweeted or retweeted fewer than 10 messages) in order to verify our recommendation performance. After preprocessing, 5,127 users are stored, with 124,301 messages and 280,737 news articles.

8.2. Experiments. Our system has four major components: (1) a module responsible for classifying and clustering news articles; (2) a component for constructing and updating user profiles; (3) hot news subclass prediction based on time-series analysis; and (4) a recommendation component using news clusters and user profiles, accompanied by the subclass popularity factor and recency. From the experimental perspective, we first verify our components. Then we verify our system against the state-of-the-art approaches and design a user study.

8.2.1. Classification and Clustering Evaluation. In order to evaluate the performance of the classification and clustering component, we design two experiments.

(1) Classification Comparison. Many classification methods have been proposed in the field of text processing over the past decade. We implement the following three classification methods: the method of Cheng et al. [33], the method of Guo et al. [34], and the Naive Bayesian (NB) method. Cheng proposed a text classification method based on refining the concept index, and Guo employed a genetic algorithm for classification. Before using the classification module, we must set $\alpha$ in (8) and decide the threshold of feature selection through an offline experiment, as shown in Figure 6, where T-10 denotes that threshold = 10 in feature selection and the F-score is Micro-F1. The performance is best roughly when $\alpha = 0.2$. From the result, we also observe that the thresholds 20, 30, and 40 produce similar results, so we use $T = 20$ in our processing.

Table 2 lists the recommendation evaluation results for the different classification methods. Based on the comparison, we see that our proposed method outperforms the classical Naive Bayesian method and the methods of Cheng and Guo in terms of the F1 measure. A straightforward explanation for the improvement is that our method uses fewer features, selected by the method we proposed, to represent news articles and implements a series of two-class classifications to alleviate the similarity problem between different classes. The most important reason may be that we incorporate the manually classified key persons into the method.

Figure 6: $\alpha$ parameter selection via classification (series: T-10, T-20, T-30, T-40; $x$-axis: $\alpha$, from 0 to 0.7; $y$-axis: F1 score, from 0.85 to 0.89). Remark: the $y$-axis is the F1 measure score of our classification using different $\alpha$.

(2) Clustering Comparison. In reality, we need to cluster the news articles into subclasses every day, or even every hour. From our spider software, we know that thousands of news articles arrive per day. $K$-means and hierarchical clustering are the most common clustering algorithms. In order to verify our proposed method, we design the experiment as follows: (1) use 500, 1000, and 1500 as the number of newly published articles for processing; (2) for each dataset scale, perform classification on these data; (3) perform $K$-means, hierarchical clustering, and our proposed clustering method on these data; (4) perform Top-30 news recommendation; and (5) compute the F1 score for the different clustering-based systems. The comparison of recommendation with different subclass clustering methods is shown in Figure 7.

From the experimental result, we have the following observations: (1) NEMAH achieves a better result than the other methods in terms of F1 score; (2) NEMAH is more stable than the other methods. A straightforward explanation might be that $K$-means clustering needs an initial clustering center for each cluster. Besides, with fewer parameters, our proposed method has stronger generalization and learning ability, without imposing requirements on the size and distribution of the text corpus.

Figure 7: Recommendation performance at different data scales for different clustering-based systems ($K$-means, Hierarchical, NEMAH; series: $n = 500$, $n = 1000$, $n = 1500$; $y$-axis: F1 score). Remark: $n$ denotes the number of news articles for clustering.

8.2.2. User Profile and Subclass Popularity Prediction Evaluation. The user profile is an important factor in a recommendation system and can affect the recommendation result significantly. Our user profile construction includes the following factors: content, place name, and key person. Prior approaches often use the content or similar access patterns to construct the user profile. SCENE [3] used the content, similar access patterns, and entities extracted by GATE [35]. In reality, entities such as place names and key person names are relatively stable over a period. Figure 8 shows the results of using different user profile building methods and subclass popularity prediction methods.

Figure 8: Recommendation F1 score of different profile factors and subclass popularity prediction methods (groups: T5, T10, T20, T30; series: C, C + GATE, C + P + K, C + P + K + E, C + P + K + S; $y$-axis: F1 score). Remark: C: content; P: place name; K: key person name; GATE: entity names extracted using the GATE tool; E: popularity prediction using three-time exponential smoothing; S: popularity prediction using spectral analysis (employed by NEMAH).

From this result, we observe the following. (1) Our method performs better than using GATE. (2) Recommendation using content only cannot perform well, because microblog messages do not contain much content. (3) Spectral Analysis, as employed in subclass popularity prediction, performs better than the Three-Time Exponential Smoothing method. Although the average performance of Spectral Analysis is better than that of Three-Time Exponential Smoothing, in our other work on time series analysis we found that the periodograms of some subclasses have weaker frequency signals, which tends to cause overfitting with a large number of sine-cosine pairs and gives worse results for these subclasses. SCENE [3] also used a popularity degree, computed as the ratio of the number of users accessing the article to the size of the users' pool. However, this method conflicts with freshness. The straightforward reason is that the freshest news may have received only a few clicks.

Table 3: Diversity evaluation on different recommendation lists.

Methods     T10       T20       T30
Goo         0.5104    0.4320    0.1215
ClickB      0.5231    0.4457    0.1587
Bilinear    0.5024    0.3547    0.1478
Bandit      0.6112    0.3874    0.2674
SCENE       0.6821    0.5747    0.5687
NEMAH       0.7425    0.6941    0.6637

Remark: T at $n$: recommended result with top-$n$.

8.2.3. Diversity Evaluation. The recommended news list of NEMAH shows great diversity in both class and subclass aspects. Let $R(u)$ be the recommended news list of a user $u$; the diversity for $u$ can be defined as follows:

$$\text{Diversity} = 1 - \frac{\sum_{i, j \in R(u),\, i \ne j} \text{sim}(i, j)}{(1/2)\, |R(u)|\, (|R(u)| - 1)}, \tag{30}$$

where $i$ and $j$ are two different news articles in the recommendation list for user $u$ and $\text{sim}(i, j)$ denotes the news profile similarity between news items $i$ and $j$. For this metric, we choose Goo [11] (a collaborative filtering based method), ClickB [4] (a content-based method), Bilinear [14], Bandit [15], and SCENE [3] (a hybrid method using LSH for clustering and a greedy algorithm for news selection) as the comparison baselines. Table 3 shows the diversity results with $|R(u)| = 10$, $|R(u)| = 20$, and $|R(u)| = 30$, in which we use T10 to represent $|R(u)| = 10$.
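The diversity metric (30) can be sketched in a few lines. The sum is taken over unordered pairs of distinct articles, an assumption consistent with the denominator $|R(u)|(|R(u)| - 1)/2$; the function name and the toy similarity table are hypothetical.

```python
from itertools import combinations


def diversity(rec_list, sim):
    """Diversity of a recommendation list as in (30): 1 - mean pairwise similarity."""
    n = len(rec_list)
    if n < 2:
        raise ValueError("diversity needs at least two recommended items")
    total = sum(sim(i, j) for i, j in combinations(rec_list, 2))
    return 1.0 - total / (0.5 * n * (n - 1))


# Toy symmetric similarity table for three articles.
table = {
    frozenset(("a", "b")): 0.75,
    frozenset(("a", "c")): 0.0,
    frozenset(("b", "c")): 0.0,
}


def sim(i, j):
    return table[frozenset((i, j))]


score = diversity(["a", "b", "c"], sim)
```

A list of near-identical articles drives the mean pairwise similarity toward 1 and the diversity toward 0.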

From Table 3, we can see that our system outperforms the others significantly. The straightforward reason is that we diversify the news not only according to the preference of the user but also according to the distribution of candidate news articles. As the recommendation list grows, the diversity of the baseline methods decreases significantly, because they rely too much on the preference of the user.

8.2.4. System Accuracy Evaluation. In order to verify the effectiveness of our proposed NEMAH, we implement a recommender system that models the recommendation as a contextual bandit problem [15]. We also implement the SCENE [3] prototype system, which employs LSH (Locality Sensitive Hashing) for news clustering and greedy selection for user recommendation. For each method, we select 50 users and provide news recommendation results for them. Figure 9 shows the comparison results for the Top 10, Top 20, and Top 30 news items for each user.

Figure 9: Precision-recall plots of different recommendations: (a) Top 10, (b) Top 20, (c) Top 30 (axes: Precision and Recall). Remark: ∘ (green) denotes the bandit-based recommender, ◻ (blue) denotes the SCENE recommender, and ∗ (red) represents NEMAH.

In the above experiments, we can observe that, besides achieving higher accuracy, our system produces a more stable distribution of results than the other approaches.

In reality, if users read only a few news articles every day, many news recommendation systems cannot produce good results for them. Our system can address this problem owing to the microblog user profile building. Figure 10 shows the comparison results for different user groups over all users (5,127 users). Suppose a reader reads $N$ news articles per day. From this figure, we can see that our proposed system achieves reasonable results for nonactive users. SCENE also achieves fairly good results. The reason is that NEMAH and SCENE consider the named entities referred to by users. Besides, NEMAH takes into account the popularity of a news article.
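The per-group comparison can be illustrated with a minimal sketch. It assumes the standard definitions of precision, recall, and F1 over one user's recommendation list; the function names and the toy article IDs are hypothetical, and only the four group boundaries are taken from Figure 10.

```python
def precision_recall_f1(recommended, relevant):
    """Return (precision, recall, F1) for one user's recommendation list.

    `recommended` is the list shown to the user and `relevant` is the set of
    articles the user actually read (standard definitions, assumed here).
    """
    recommended = list(recommended)
    relevant = set(relevant)
    hits = sum(1 for item in recommended if item in relevant)
    precision = hits / len(recommended) if recommended else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


def activity_group(n_per_day):
    """Map the number of articles read per day to the groups of Figure 10."""
    if n_per_day < 10:
        return "N < 10"
    if n_per_day < 30:
        return "10 <= N < 30"
    if n_per_day < 50:
        return "30 <= N < 50"
    return "N >= 50"


# Toy data: two of the four recommended articles were actually read.
p, r, f1 = precision_recall_f1(["n1", "n2", "n3", "n4"], {"n2", "n4", "n9"})
group = activity_group(7)
```

Averaging the F1 values of the users that fall into the same group would give one bar per method and group.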

8.2.5. A User Study on NEMAH. In order to obtain further evaluation metrics to verify our proposed news recommendation system, we develop a prototype system of NEMAH and design a questionnaire which includes the following questions: (1) satisfaction with the news content, (2) ordering of the recommendation list, (3) preference of the news subclasses, (4) popularity of the news articles, and (5) novelty of the recommendation list. For each question, we define 5 levels for selection: 1, So Bad; 2, Not Very Well; 3, Average; 4, Good but Needs to Improve; and 5, Excellent. We crawl news articles of the latest three days from several famous news websites as a candidate set for recommendation. Finally, we recruit 50 volunteers, who are required to have a microblog account on the SINA website, to help us complete the questionnaire. We send them the same questionnaire with different recommendation lists every week, three times in total. The average result of this user study is shown in Figure 11. From the result of the user study, we can see that NEMAH satisfies the requirements represented by our questions for most people.

Figure 10: Comparison of F1 score of different approaches (Goo, ClickB, Bandit, SCENE, and NEMAH) for different user groups ($N < 10$, $10 \le N < 30$, $30 \le N < 50$, and $N \ge 50$). Remark: $N$ denotes the number of news articles read by a user per day.

Figure 11: User study on different metrics (Satisfaction, Ordering, Preference, Popular, and Novelty), showing the average score of ClickB and NEMAH.

9. Conclusion

In this paper, we proposed the NEMAH system architecture to tackle personalized news recommendation based on microblog and subclass popularity prediction. We explore the intrarelations among microblog content and news items and synthetically consider the subclass popularity factor, the similarity among users, and the place and key person factors. Our system supports effective classification and subclass clustering on newly published news articles along with a small history corpus. High-quality classification and clustering construct a better data structure for recommendation. Experimental results compared with some state-of-the-art algorithms have demonstrated the better performance of NEMAH. Besides, our work in Sections 4 and 5 can be utilized for automatic module layout and channel ranking.

For future work, to process massive network news articles, we plan to deploy some components (e.g., classification, clustering, and subclass popularity analysis) onto the Map-Reduce framework on our distributed system. Moreover, we also plan to integrate the subclass popularity prediction module into our news search engine, given its effectiveness in our work. Another remarkable point is the interest evolution of users (e.g., with time, place, and other factors), which is able to provide insights into the exploration of news reading behaviors.

Disclosure

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee, provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers, or to redistribute to lists requires prior specific permission and/or a fee.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

References

[1] O. Phelan, K. McCarthy, and B. Smyth, “Using twitter to recommend real-time topical news,” in Proceedings of the 3rd ACM Conference on Recommender Systems, pp. 385–388, ACM, October 2009.

[2] G. de Francisci Morales, A. Gionis, and C. Lucchese, “From chatter to headlines: Harnessing the real-time web for personalized news recommendation,” in Proceedings of the 5th ACM International Conference on Web Search and Data Mining (WSDM ’12), pp. 153–162, ACM, February 2012.

[3] L. Li, D. Wang, T. Li, D. Knox, and B. Padmanabhan, “SCENE: A scalable two-stage personalized news recommendation system,” in Proceedings of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR ’11), pp. 125–134, July 2011.

[4] J. Liu, P. Dolan, and E. R. Pedersen, “Personalized news recommendation based on click behavior,” in Proceedings of the 15th ACM International Conference on Intelligent User Interfaces (IUI ’10), pp. 31–40, ACM, February 2010.

[5] J. Schafer, J. Konstan, and J. Riedi, “Recommender systems in e-commerce,” in Proceedings of the 1st ACM Conference on Electronic Commerce, pp. 158–166, ACM, 1999.

[6] D. Billsus and M. J. Pazzani, “Personal news agent that talks, learns and explains,” in Proceedings of the 3rd Annual Conference on Autonomous Agents, pp. 268–275, ACM, May 1999.

[7] M. Pazzani and D. Billsus, “The identification of interesting websites,” Machine Learning, vol. 27, no. 3, pp. 313–331, 1997.

[8] U. Shardanand and P. Maes, “Social information filtering: algorithms for automating ‘word of mouth’,” in Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pp. 210–217, ACM, May 1995.

[9] R. G. Cota, A. A. Ferreira, C. Nascimento, M. A. Goncalves, and A. H. F. Laender, “An unsupervised heuristic-based hierarchical method for name disambiguation in bibliographic citations,” Journal of the American Society for Information Science and Technology, vol. 61, no. 9, pp. 1853–1870, 2010.

[10] J. Breese, D. Heckerman, and C. Kadie, “Empirical analysis of predictive algorithms for collaborative filtering,” in Proceedings of the 14th Conference on Uncertainty in Artificial Intelligence, pp. 43–52, Morgan Kaufmann Publishers, 1998.

[11] A. S. Das, M. Datar, A. Garg, and S. Rajaram, “Google news personalization: Scalable online collaborative filtering,” in Proceedings of the 16th International World Wide Web Conference (WWW ’07), pp. 271–280, ACM, May 2007.

[12] M. Balabanovic and Y. Shoham, “Content-Based Collaborative Recommendation,” Communications of the ACM, vol. 40, no. 3, pp. 66–72, 1997.

[13] M. Claypool, A. Gokhale, T. Miranda, P. Murnikov, D. Netes, and M. Sartin, “Combining content-based and collaborative filters in an online newspaper,” in Proceedings of the ACM SIGIR Workshop on Recommender Systems, vol. 60, Citeseer, 1999.

[14] W. Chu and S. T. Park, “Personalized recommendation on dynamic content using predictive bilinear models,” in Proceedings of the 18th International Conference on World Wide Web, pp. 691–700, ACM, 2009.

[15] L. Li, W. Chu, J. Langford, and R. E. Schapire, “A contextual-bandit approach to personalized news article recommendation,” in Proceedings of the 19th International Conference World Wide Web (WWW ’10), pp. 661–670, ACM, April 2010.

[16] C. Best, E. van der Goot, M. de Paola, T. Garcia, and D. Horby, “Europe media monitor – emm,” JRC Technical Note, no. I 2, 2002.

[17] E. Gabrilovich, S. Dumais, and E. Horvitz, “Newsjunkie: providing personalized newsfeeds via analysis of information novelty,” in Proceedings of the 13th International Conference World Wide Web (WWW ’04), pp. 482–490, ACM, May 2004.

[18] G. Wang, F. H. Lochovsky, and Q. Yang, “Feature selection with conditional mutual information MaxiMin in text categorization,” in Proceedings of the 13th ACM Conference on Information and Knowledge Management (CIKM ’04), pp. 342–349, ACM, November 2004.

[19] C. Lee and G. G. Lee, “Information gain and divergence-based feature selection for machine learning-based text categorization,” Information Processing and Management, vol. 42, no. 1, pp. 155–165, 2006.

[20] A. M. A. Mesleh, “Chi square feature extraction based svms arabic language text categorization system,” Journal of Computer Science, vol. 3, no. 6, pp. 430–435, 2007.

[21] Z. Wei, D. Miao, J. H. Chauchat, and C. Zhong, “Feature selection on chinese text classification using character N-grams,” in Rough Sets and Knowledge Technology, vol. 5009 of Lecture Notes in Computer Science, pp. 500–507, Springer, Berlin, Germany, 2008.

[22] Y. S. Lai and C. H. Wu, “Meaningful term extraction and discriminative term selection in text categorization via unknown-word methodology,” ACM Transactions on Asian Language Information Processing (TALIP), vol. 1, no. 1, pp. 34–64, 2002.

[23] R. Rifkin and A. Klautau, “In defense of one-vs-all classification,” The Journal of Machine Learning Research, vol. 5, pp. 101–141, 2004.

[24] A. E. Hoerl and R. W. Kennard, “Ridge regression: Biased estimation for nonorthogonal problems,” Technometrics, vol. 42, no. 1, pp. 80–86, 2000.

[25] F. Wang and C. Zhang, “Feature extraction by maximizing the average neighborhood margin,” in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR ’07), June 2007.

[26] X. Zhang, W. Wang, K. Nørvag, and M. Sebag, “K-AP: generating specified K clusters by efficient affinity propagation,” in Proceedings 10th IEEE International Conference on Data Mining (ICDM ’10), pp. 1187–1192, IEEE, December 2010.

[27] J. D. Hamilton, Time Series Analysis, vol. 2, Cambridge University Press, Cambridge, UK, 1994.

[28] F. Abel, Q. Gao, G.-J. Houben, and K. Tao, “Analyzing user modeling on Twitter for personalized news recommendations,” in User Modeling, Adaption and Personalization, vol. 6787 of Lecture Notes in Computer Science, pp. 1–12, Springer, Berlin, Germany, 2011.

[29] S. Gauch, M. Speretta, A. Chandramouli, and A. Micarelli, “User profiles for personalized information access,” The Adaptive Web, Springer, Berlin, Germany, vol. 4321, pp. 54–89, 2007.

[30] G. L. Nemhauser, L. A. Wolsey, and M. L. Fisher, “An analysis of approximations for maximizing submodular set functions-I,” Mathematical Programming, vol. 14, no. 1, pp. 265–294, 1978.

[31] S. Khuller, A. Moss, and J. Naor, “The budgeted maximum coverage problem,” Information Processing Letters, vol. 70, no. 1, pp. 39–45, 1999.

[32] J. Leskovec, A. Krause, C. Guestrin, C. Faloutsos, J. Vanbriesen, and N. Glance, “Cost-effective outbreak detection in networks,” in Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 420–429, ACM, August 2007.

[33] X. Cheng, S. Tan, and L. Tang, “Using dragpushing to refine concept index for text categorization,” Journal of Computer Science and Technology, vol. 21, no. 4, pp. 592–596, 2006.

[34] Z. Guo, L. Lu, S. Xi, and F. Sun, “An effective dimension reduction approach to chinese document classification using genetic algorithm,” in Advances in Neural Networks – ISNN 2009, vol. 5552 of Lecture Notes in Computer Science, pp. 480–489, Springer, Berlin, Germany, 2009.

[35] H. Cunningham, D. Maynard, K. Bontcheva, and V. Tablan, “Gate: an architecture for development of robust hlt applications,” in Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, pp. 168–175, Association for Computational Linguistics, 2002.


Figure 1: The overview of NEMAH framework. Labeled components: history news data, newly published news, micro blog data, classification (Section 4.2), clustering (Section 4.3), subclass popularity prediction (Section 5), user profile building (Section 6), rough selection (Section 7.1), precise selection (Section 7.3), recency, and recommendation list.

number of experiments show that adjusting the weight of the key persons in the hybrid classification helps to achieve good news classification performance. After obtaining the rough classes, we cluster the subclasses using a clustering method which maximizes the average of neighborhood points.

(2) Subclass Popularity Prediction Module. Different periods have different popular subclasses. People would like to focus on the popular subclasses rather than spend much time searching for and selecting information. Sometimes even the users themselves have no idea what they want. Therefore, subclass popularity prediction helps users save time and improves their experience of using the recommendation system. In our research on network news, we found that the popularity of some subclasses is significantly periodic over time. For the popular subclasses, we can assign a higher weight in recommendation. In this paper, we use a time series spectral analysis method to predict the popularity of subclasses.

(3) User Profile Module. This module is used to extract the preference model of users. We use the microblogs that users tweeted or retweeted to establish the user profile model representing users’ interests. This procedure combines text analysis, text classification, and the extraction of some particular factors (i.e., key person names and place names appearing in microblogs).

(4) Personalized News Recommendation. We first use the user profile and the subclasses to determine the candidate subclasses. Then we calculate a user’s utility on each news item by a greedy strategy and rank the recommended list by the popularity of the news article within its subclass and the news article’s recency. Note that, when recommending specific news items using our system, the class and the subclass of the news articles are utilized. Moreover, the other properties of news items, such as freshness (recency) and popularity (subclass popularity prediction), are synthesized into the final recommended ranking list as adjustment factors.

4. Classification and Clustering Module

Classifying massive network news is conducive to the subsequent processing in news applications. Internet news recommendation must respond as quickly as possible to show the recommended list to users. In NEMAH, given a set of news items $N = \{n_1, n_2, \ldots, n_M\}$, where $|N| = M$, our goal is to generate a classification result $C = \{C_1, C_2, \ldots, C_K\}$, where $K$ is a predefined number of classes (e.g., $K = 23$ in this paper). Class names are shown in Table 1. Besides, each class can be divided into several subclasses using our proposed clustering module: $C_i = \{SC_{i1}, SC_{i2}, \ldots, SC_{im}\}$. The storage structures of a history news article and of a user are shown in Figures 2 and 3, respectively.

⟨REC⟩
⟨NewsID⟩ = nf2012010121574
⟨Date⟩ = 20120101
⟨Title⟩ = Our province will implement the new law of registered residence
⟨Author⟩ = Zhangsheng Bo
⟨CSN⟩ = 0215
⟨Class⟩ = Law, Justice
⟨SubClass⟩ = household management
⟨Area⟩ = D440000 Guangdong Province
⟨Source⟩ = Nanfang Daily
⟨KP⟩ = Weifa Liang
⟨Text⟩ = News content
⟨CommentsUser⟩ = 1759918187, 2414113125, 1413475981, 1463256471, 871324394, 1657924191, 2001946341, 1356100372, 1549089713
⟨Tagged⟩ = True

Figure 2: The storage structure of a piece of history news. Remark: the elements used in this paper are ⟨NewsID⟩, ⟨Date⟩, ⟨Title⟩, ⟨CSN⟩, ⟨Area⟩, ⟨KP⟩, ⟨Text⟩, and ⟨CommentsUser⟩, in which ⟨CommentsUser⟩ lists the users who commented on this news article, ⟨CSN⟩ denotes the class and subclass ID of this news article (e.g., ⟨CSN⟩ = 0215 means that the class ID is 02 and the subclass ID is 15), and ⟨KP⟩ is the named entity of type person, which will be discussed in Section 4.2.
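As a small illustration of the ⟨CSN⟩ convention described for Figure 2, the following sketch splits the field into its class and subclass IDs. It assumes that every ⟨CSN⟩ value is a four-digit string with the class ID in the first two digits, which is what the single example 0215 suggests; the function name and the dictionary layout are hypothetical.

```python
def split_csn(csn):
    """Split a four-digit CSN string into (class_id, subclass_id).

    Assumption: two digits for the class followed by two digits for the
    subclass, as in the example CSN = 0215 (class 02, subclass 15).
    """
    if len(csn) != 4 or not csn.isdigit():
        raise ValueError(f"unexpected CSN value: {csn!r}")
    return int(csn[:2]), int(csn[2:])


# A history news record reduced to a few of the fields shown in Figure 2.
record = {
    "NewsID": "nf2012010121574",
    "Date": "20120101",
    "CSN": "0215",
    "KP": "Weifa Liang",
}
class_id, subclass_id = split_csn(record["CSN"])
```

Here `class_id` is 2, which corresponds to the class "Law, Justice" in Table 1, and `subclass_id` is 15.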

⟨USER⟩
⟨UserID⟩ = 2414113125
⟨MicroBlog⟩ = MicroBlog content 1, MicroBlog content 2, MicroBlog content 3
⟨CommentsOn⟩ = nf2012010121574, nf2011040722331, nf201012784512

Figure 3: The storage structure of a user. Remark: ⟨MicroBlog⟩ lists the messages tweeted or retweeted by the user; ⟨CommentsOn⟩ denotes the news articles which are commented on by this user.

Table 1: Name of each class.

ID   Class name
1    Political
2    Law, Justice
3    External Relations, International Relations
4    Military
5    Social, Labor, Disaster
11   Economy
12   Finance, Banking
13   Infrastructure, Construction, Real estate
14   Agriculture, Rural areas
15   Mining, Industrial
16   Energy, Water Conservancy
17   Information industry
18   Transport, Postal services, Logistics
19   Commerce, Foreign trade, Customs
21   Services, Tourism
22   Environmental, Meteorological
31   Education
33   Science and Technology
35   Culture, Recreation and Leisure
36   Literature, Art
37   Media Industry
38   Medicine, Health
39   Sports

4.1. Feature Selection. In the processing of a text corpus, the dimension of each item is very large (i.e., more than ten thousand in some cases), so the main features must be selected to represent the document. Generally, there are three classical feature selection methods in text processing: Mutual Information [18], Information Gain [19], and CHI Statistics [20]. These methods are inclined to choose rare words, which are not reliable for classification on some corpora. Therefore, in order to solve this problem and reduce the computational burden in the process of news article classification, we must filter out sporadic low-frequency words. The two concrete filtering steps are shown below.

(1) Rough Selection Using Document Frequency of Feature Words. In the training corpus, let $t_i$ be a word. We define $Df_i$ as the total relative document frequency, which denotes the ratio of the number of documents containing $t_i$ to the total number of documents. When $Df_i$ is greater than a threshold $\alpha$, the word $t_i$ is a high-frequency word in the training corpus and we add it to the set $Tem_1$. For a given class $C_k$, we define $Df_{ik}$ as the class relative document frequency in class $C_k$, which denotes the ratio of the number of documents in class $C_k$ that contain $t_i$ to the total number of documents in class $C_k$. When $Df_{ik}$ is greater than a given threshold $\beta$, $t_i$ is a high-frequency word in this class and we add it to the set $Tem_2$. According to our experiments and corpus, we roughly set $\alpha = 0.01$ and $\beta = 0.1$ in order to avoid wrong or omitted selections. This rough selection process selects the words which appear frequently both in the whole corpus and in the classes [21]. The result of rough feature selection is $Tem' = Tem_1 \cap Tem_2$.
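The rough selection step can be sketched in a few lines. This is an illustrative reading, not the authors' implementation: it assumes tokenized documents with one class label each, and it assumes that $Tem_2$ collects every word that passes the threshold $\beta$ in at least one class. The function name and the toy corpus are hypothetical.

```python
from collections import Counter


def rough_selection(docs, labels, alpha=0.01, beta=0.1):
    """Return Tem' = Tem1 & Tem2 for tokenized `docs` with class `labels`.

    Tem1: words whose corpus-wide relative document frequency exceeds alpha.
    Tem2: words whose relative document frequency inside at least one class
          exceeds beta (assumption: union over the classes).
    """
    n_docs = len(docs)
    class_sizes = Counter(labels)
    overall_df = Counter()
    class_df = {}
    for tokens, label in zip(docs, labels):
        unique = set(tokens)  # count each word once per document
        overall_df.update(unique)
        class_df.setdefault(label, Counter()).update(unique)

    tem1 = {t for t, c in overall_df.items() if c / n_docs > alpha}
    tem2 = set()
    for label, counts in class_df.items():
        size = class_sizes[label]
        tem2.update(t for t, c in counts.items() if c / size > beta)
    return tem1 & tem2


# Toy corpus with thresholds chosen so that only "law" and "goal" survive.
docs = [["law", "court"], ["law", "judge"], ["goal", "match"], ["goal", "team"]]
labels = ["law", "law", "sport", "sport"]
candidates = rough_selection(docs, labels, alpha=0.3, beta=0.6)
```

With the values $\alpha = 0.01$ and $\beta = 0.1$ reported in the text, the same function would keep far more words; the larger thresholds are used here only to make the toy example selective.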

(2) Precise Selection Using Index of Discrimination between Word and Class. We employ the method of [22] to define the index of discrimination between word and class as follows:

$$R(t_i, C_k) = \frac{P(t_i \in C_k)}{\max_{C_j \neq C_k} P(t_i \in C_j)}, \tag{1}$$

where $P(t_i \in C_k)$ denotes the probability of word $t_i$ in class $C_k$ and $\max_{C_j \neq C_k} P(t_i \in C_j)$ denotes the maximum probability of word $t_i$ in the classes other than $C_k$. $P(t_i \in C_k)$ can be represented as follows:

$$P(t_i \in C_k) = \frac{tf(t_i \in C_k) + 1}{\sum_{t'} tf(t' \in C_k) + 1}, \tag{2}$$

where $tf(t_i \in C_k)$ denotes the frequency of $t_i$ appearing in class $C_k$, $t'$ denotes a word from $Tem'$ different from $t_i$, and $\sum_{t'} tf(t' \in C_k)$ denotes the total frequency of the words $t'$ appearing in class $C_k$. The word $t_i$ is a representative word of class $C_k$ when the index of discrimination $R(t_i, C_k)$ is greater than a threshold $\gamma$. We can use the selection proportion threshold $T$ to decide the parameter $\gamma$, which will be discussed in the Experimental Evaluation section. We obtain the representative word set once the above process has been carried out for each class. The rough selection step saves calculation time, since it excludes the words which are certainly not feature words.
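Equations (1) and (2) can be tried out on a toy example. The sketch below is hedged in one respect: it assumes that the sum in the denominator of (2) runs over all candidate words of the class, including $t_i$, so that the ratio stays below one; the text can also be read as excluding $t_i$. All names and term frequencies are hypothetical.

```python
from collections import Counter


def class_probability(word, class_tf):
    """Equation (2): smoothed share of `word` among the candidate words of a class."""
    return (class_tf.get(word, 0) + 1) / (sum(class_tf.values()) + 1)


def discrimination_index(word, target, tf_by_class):
    """Equation (1): probability in the target class over the best other class."""
    p_target = class_probability(word, tf_by_class[target])
    p_other = max(
        class_probability(word, tf)
        for name, tf in tf_by_class.items()
        if name != target
    )
    return p_target / p_other


# Hypothetical term frequencies of three candidate words in two classes.
tf_by_class = {
    "law": Counter({"court": 8, "goal": 1, "team": 1}),
    "sport": Counter({"court": 1, "goal": 6, "team": 4}),
}
r_court_law = discrimination_index("court", "law", tf_by_class)
r_court_sport = discrimination_index("court", "sport", tf_by_class)
```

In this toy example "court" scores well above 1 for the class "law" and below 1 for "sport", so a threshold $\gamma > 1$ would keep it as a representative word of "law" only.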

4.2. Classifying News Items. In the real Internet world, classification or clustering of massive news data requires a lot of computational power. To tackle this issue in news recommendation, we employ the One Versus All method [23], which reduces the multiclass problem to a set of two-class classification problems, and consider the key persons in news articles. In this paper, news article classification is therefore treated as a collection of two-class classification problems. For a class $C_k$, if document $d_i$ belongs to class $C_k$, it is tagged with 1 as a positive sample for class $C_k$; otherwise, it is tagged with $-1$ as a negative sample. The method constructs the projective vector $p_k$ between the text matrix $A$ and the class vector $y$, and we employ the ridge regression method [24] shown in the following:

$$p_k = \arg\min_{p_k} \left\| y - p_k^T A \right\|^2 + \theta \left\| p_k \right\|^2, \tag{3}$$

where $\theta$ is a positive parameter used to adjust the estimation error. To solve the minimization problem above, we take the partial derivative with respect to $p_k$ and set it to 0, which gives

$$p_k = \left( A A^T + \theta I \right)^{-1} A y^T, \tag{4}$$

where $I$ is the identity matrix of matching dimension. Because the training set is divided into $K$ categories, we obtain a group of projective vectors $P = \{p_1, p_2, \ldots, p_K\}$. We utilize a code matrix $M$ to describe the correlation between different classes obtained from the two-class classifications. Assuming that class $C_k$ has $N_k$ training documents $D_{kj}$, where $j \in [1, N_k]$, the element of $M$ which denotes the correlation between two classes can be calculated by

$$M_{kk'} = \frac{1}{N_k} \sum_{j=1}^{N_k} \mathrm{sgn}\left( \langle p_{k'}, D_{kj} \rangle \right), \tag{5}$$

where $p_{k'}$ denotes the projective vector of $C_{k'}$. If $\langle p_{k'}, D_{kj} \rangle$ is greater than 0, the function sgn returns 1, and otherwise 0. When a new article arrives, the similarity between the article and a class can be calculated by the following equation:

$$\mathrm{Sim}(B, C_k) = \sum_{k'=1}^{K} M_{kk'} Q_{k'} = \sum_{k'=1}^{K} M_{kk'} \, \mathrm{sgn}\left( \langle p_{k'}, B \rangle \right), \tag{6}$$

where $B$ denotes a new article and $Q_{k'}$ is the output of the $k'$th two-class classifier on $B$. Finally, we obtain the class of $B$ by maximizing the function $\mathrm{Sim}(\cdot, \cdot)$:

$$C(B) = \arg\max_{C_k} \mathrm{Sim}(B, C_k). \tag{7}$$
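The decision rule of (5)–(7) can be sketched as follows. The sketch assumes that the projective vectors have already been trained (for example, with the ridge regression step above) and that documents are dense feature vectors; the helper names and the two-dimensional toy data are hypothetical.

```python
def dot(u, v):
    """Inner product of two equally long feature vectors."""
    return sum(a * b for a, b in zip(u, v))


def step(score):
    """The sgn of the text: 1 for a positive score, otherwise 0."""
    return 1 if score > 0 else 0


def code_matrix(projections, train_docs):
    """Equation (5): M[k][k2] is the share of class-k training documents
    that classifier k2 accepts."""
    n_classes = len(projections)
    return [
        [
            sum(step(dot(projections[k2], d)) for d in train_docs[k]) / len(train_docs[k])
            for k2 in range(n_classes)
        ]
        for k in range(n_classes)
    ]


def classify(article, projections, matrix):
    """Equations (6) and (7): similarity to every class and the best class index."""
    q = [step(dot(p, article)) for p in projections]
    sims = [dot(row, q) for row in matrix]
    best = max(range(len(sims)), key=sims.__getitem__)
    return best, sims


# Two toy classes in a two-dimensional feature space.
projections = [[1.0, -1.0], [-1.0, 1.0]]
train_docs = [
    [[2.0, 0.0], [3.0, 1.0]],  # documents of class 0
    [[0.0, 2.0], [1.0, 3.0]],  # documents of class 1
]
M = code_matrix(projections, train_docs)
best_class, similarities = classify([0.5, 2.0], projections, M)
```

In this toy setting the classifiers are perfectly separated, so the code matrix is diagonal and the new article is assigned to class 1. With overlapping classes the off-diagonal entries of the matrix let a related classifier contribute to the similarity.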

In order to further improve the classification accuracy and utilize manual labor rationally, we propose a method that takes the key person (a named entity of type person) into account to improve the classification when key persons appear, as shown in the following:

$$P(C_i \mid B) = (1 - \alpha) \frac{\mathrm{Sim}(B, C_i)}{\sum_{i'=1}^{K} \mathrm{Sim}(B, C_{i'})} + \alpha P(C_i \mid B_k) P(B_k \mid B), \tag{8}$$

where $\mathrm{Sim}(B, C_i) / \sum_{i'=1}^{K} \mathrm{Sim}(B, C_{i'})$ denotes the probability score of article $B$ on class $C_i$ obtained by the method mentioned above, $B_k$ denotes a key person that appears in the article $B$, and $P(B_k \mid B) = 1$ when $B_k$ appears in $B$; otherwise, $P(B_k \mid B) = 0$. In other words, if no key person appears in a new article, the key person factor is not applied to it. Here $\alpha$ is the parameter balancing these two methods (it is distinct from the document frequency threshold of Section 4.1). $P(C_i \mid B_k)$ is computed as

$$P(C_i \mid B_k) = \frac{P(C_i) \, p(B_k \mid C_i)}{\sum_{i'=1}^{K} P(C_{i'}) \, p(B_k \mid C_{i'})}. \tag{9}$$
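A minimal sketch of how (8) and (9) combine the two sources of evidence is given below. The class priors, the key person likelihoods, and the value of $\alpha$ are hypothetical numbers chosen for illustration, and the function names are not from the paper.

```python
def key_person_posterior(priors, likelihoods):
    """Equation (9): Bayes rule for P(C_i | B_k) over all classes."""
    joint = [p * l for p, l in zip(priors, likelihoods)]
    total = sum(joint)
    return [j / total for j in joint] if total > 0 else [0.0] * len(joint)


def blended_scores(sims, priors, likelihoods, alpha, person_present):
    """Equation (8): mix the normalized similarity with the key person evidence."""
    total = sum(sims)
    content = [s / total for s in sims] if total > 0 else [0.0] * len(sims)
    indicator = 1.0 if person_present else 0.0  # P(B_k | B)
    posterior = key_person_posterior(priors, likelihoods)
    return [
        (1 - alpha) * c + alpha * post * indicator
        for c, post in zip(content, posterior)
    ]


# Two classes: the text similarity favors class 1, the key person favors class 0.
sims = [1.0, 3.0]
priors = [0.5, 0.5]        # hypothetical P(C_i)
likelihoods = [0.9, 0.1]   # hypothetical p(B_k | C_i)
with_person = blended_scores(sims, priors, likelihoods, alpha=0.4, person_present=True)
without_person = blended_scores(sims, priors, likelihoods, alpha=0.4, person_present=False)
```

In this example the key person moves the decision from class 1 to class 0. When no key person appears, the second term vanishes and the ranking of the classes is determined by the similarity alone.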

4.3. News Subclass Clustering. After obtaining the rough classification results, we need to separate every news class into subclasses $SC_{ix}$. Subclasses of an Internet text corpus are typically detected with clustering methods such as $K$-means or hierarchical clustering. In NEMAH, we propose a subclass clustering method to obtain subclasses. Each subclass is represented as a subclass vector $T = \{\langle t_1, w_1 \rangle, \langle t_2, w_2 \rangle, \ldots\}$, where $t_i$ and $w_i$ denote a representative word and its corresponding weight, respectively. We call this clustering method the Maximizing Neighborhood method, after the main idea of the algorithm.

(1) Solving Subspace Projection Problem by Maximizing the Average of Neighborhood. For each document $x_i$ in a text space $X_0$, the neighbor documents can be divided into two subsets according to their distance to $x_i$: the similar neighborhood set $\Theta^o_i$ and the heterogeneous neighborhood set $\Theta^e_i$, where $\Theta^o_i$ contains the top $\xi$ nearest neighbors which belong to the same class as $x_i$ and $\Theta^e_i$ contains the top $\zeta$ nearest neighbors which do not belong to the same class as $x_i$. For a data point $x_i$, the average out-of-class distance and the average within-class distance can be expressed as follows:

$$P_i = \sum_{x_p \in \Theta^e_i} \frac{\left\| x_i - x_p \right\|^2}{\left| \Theta^e_i \right|}, \qquad Q_i = \sum_{x_q \in \Theta^o_i} \frac{\left\| x_i - x_q \right\|^2}{\left| \Theta^o_i \right|}. \tag{10}$$

Over all data points in the text corpus, the average out-of-class distance and the average within-class distance are expressed as follows:

$$P = \sum_i P_i = \sum_i \sum_{x_p \in \Theta^e_i} \frac{\left\| x_i - x_p \right\|^2}{\left| \Theta^e_i \right|}, \qquad Q = \sum_i Q_i = \sum_i \sum_{x_q \in \Theta^o_i} \frac{\left\| x_i - x_q \right\|^2}{\left| \Theta^o_i \right|}. \tag{11}$$

The subclass clustering problem can be considered as a projection of the text space onto a subspace. For instance, let $y_i$ be the projection of $x_i$; we can express $y_i = W^T x_i$. The principle of this projection is to maximize the average distance between different classes and to minimize the average distance within each class [25], as shown in the following:

$$\begin{aligned}
r &= \sum_i \left( \sum_{x_p \in \Theta^e_i} \frac{\left\| x_i - x_p \right\|^2}{\left| \Theta^e_i \right|} - \sum_{x_q \in \Theta^o_i} \frac{\left\| x_i - x_q \right\|^2}{\left| \Theta^o_i \right|} \right) \\
&= tr\left[ W^T \left( \sum_i \sum_{x_p \in \Theta^e_i} \frac{\left\| x_i - x_p \right\|^2}{\left| \Theta^e_i \right|} - \sum_i \sum_{x_q \in \Theta^o_i} \frac{\left\| x_i - x_q \right\|^2}{\left| \Theta^o_i \right|} \right) W \right] \\
&= tr\left[ W^T (P - Q) W \right],
\end{aligned} \tag{12}$$

where $tr(\cdot)$ denotes the trace of a matrix and the constraint of this equation is $W^T W = I$. Inside the trace, the distances are measured after projection, so each squared norm is to be read as the outer product $(x_i - x_p)(x_i - x_p)^T$, and $P$ and $Q$ act as scatter matrices, as in [25]. We then maximize the following objective:

$$\max \; tr\left[ W^T (P - Q) W \right]. \tag{13}$$
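As a hedged note on how (13) is usually solved (a standard linear algebra result, not a derivation taken from the paper): if $P - Q$ is read as a symmetric $d \times d$ matrix and the subspace has dimension $m$ (a symbol introduced here for illustration), the trace is maximized under $W^T W = I$ by the eigenvectors that belong to the $m$ largest eigenvalues.

```latex
(P - Q)\, w_j = \lambda_j w_j, \qquad \lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_d,
\qquad
W^{*} = [\, w_1, \dots, w_m \,],
\qquad
\max_{W^{T} W = I} \operatorname{tr}\!\left[ W^{T} (P - Q)\, W \right] = \sum_{j=1}^{m} \lambda_j .
```

Because no matrix inverse is involved, this solution remains well defined even when the number of documents is smaller than the dimension of the text space.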

6 The Scientific World Journal

(2) The Quick Affinity Propagation Clustering on SubspaceAfter projecting the initial text vector space into subspacethrough projective matrix it can generate 119870 clusters withemploying 119870-Affinity Propagation (119870-AP) (this method willbe more suitable for text clustering because it can achievemore reasonable clusters than the traditional clusteringmeth-ods [26]) implemented in subspace Let the similarity of 119910

119894

and 119910119895in subspace119884 = 1199101 1199102 119910

119899 be 119878 = 119904

119894119895 the target

of 119870-AP is to find the 119870 real samples 119864 = 1198901 1198902 119890

119870

which denotes the 119870 classes 119862 = 1198621 1198622 119862

119870 And then

maximize the following objective function

max119865 (119862119895119870

119895=1) =

119870

sum

119895=1

sum

119910119894isin119862119895

119904 (119910119894 119890119895) (14)

where $e_j$ belongs to $C_j$. The objective function can be transformed into a 0-1 integer programming problem by introducing the binary parameters $B = \{b_{ij} \in \{0, 1\},\ i, j = 1, \ldots, n\}$, as shown in the following:

$$\max F\left(\{b_{ij}\}\right) = \sum_{i=1}^{n} \sum_{j=1}^{n} b_{ij}\, s(y_i, y_j) \tag{15}$$

Equation (15) has three constraints: (1) $b_{ii} = 1$ if $b_{ji} = 1$; (2) $\sum_{j=1}^{n} b_{ij} = 1$; and (3) $\sum_{i=1}^{n} b_{ii} = K$, where $b_{ij} = 1$ when $y_i$ considers $y_j$ as a sample and $b_{ii} = 1$ when $y_i$ is a sample itself. The first constraint means that $y_i$ is a sample when $y_j$ considers $y_i$ as a sample. The second means that each data point has only one sample point. The last means that the number of samples is $K$, which ensures that the $K$-AP method generates $K$ clusters.
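The objective (15) and its three constraints can be checked for a candidate assignment with a few lines of Python. This is only a minimal sketch: the function name `kap_objective` and the list-based encoding (`exemplar_of[i] = j` standing for $b_{ij} = 1$) are assumptions made for illustration, and the search for the best assignment, which is what $K$-AP actually performs, is not shown.

```python
def kap_objective(sim, exemplar_of, k):
    """Evaluate objective (15) for one candidate assignment.

    sim[i][j]      -- similarity s(y_i, y_j)
    exemplar_of[i] -- index j of the sample chosen by point i (b_ij = 1)
    k              -- required number of samples, i.e. clusters
    """
    exemplars = set(exemplar_of)
    # Constraint (2) holds by construction: each point names exactly one sample.
    # Constraint (1): every point chosen as a sample must choose itself.
    if any(exemplar_of[j] != j for j in exemplars):
        raise ValueError("a chosen sample does not select itself")
    # Constraint (3): exactly k samples.
    if len(exemplars) != k:
        raise ValueError("expected %d samples, got %d" % (k, len(exemplars)))
    return sum(sim[i][exemplar_of[i]] for i in range(len(sim)))


# Toy similarity matrix (negative distances): points 0,1 are close, as are 2,3.
toy_sim = [[0, -1, -4, -5],
           [-1, 0, -3, -4],
           [-4, -3, 0, -1],
           [-5, -4, -1, 0]]
toy_value = kap_objective(toy_sim, [0, 0, 2, 2], k=2)
```

Encoding the assignment as one index per point makes constraint (2) impossible to violate, so only constraints (1) and (3) need explicit checks.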

(3) Hybrid Learning of Subspace Projection and Clustering on Adaptive Subspace. The class information updated in the subspace clustering process can be used as a priori knowledge in the next round of subspace projection, and after several iterations, until convergence, we can obtain the global optimal clustering result. The iteration proceeds as follows:

$$X_0 \rightarrow K\text{-AP} \rightarrow L_0 \rightarrow \text{SubSpace} \rightarrow W_1,\ \mathrm{Score}_1$$

$$Y_1 = W_1^T X_0 \rightarrow K\text{-AP} \rightarrow L_1 \rightarrow \text{SubSpace} \rightarrow W_2,\ \mathrm{Score}_2$$

$$\cdots \rightarrow \cdots \rightarrow \cdots \rightarrow \cdots \rightarrow \cdots$$

$$Y_t = W_t^T X_0 \rightarrow K\text{-AP} \rightarrow L_t \rightarrow \text{SubSpace} \rightarrow W_{t+1},\ \mathrm{Score}_{t+1}$$

The convergence function value must be computed in each iteration:

$$\mathrm{Score}_{t+1} = \operatorname{tr}\left[W_{t+1}^T \left(P(L_t) - Q(L_t)\right) W_{t+1}\right] \tag{16}$$

where $P(L_t)$ and $Q(L_t)$ denote the average distances between classes and within class, which are calculated by (11) according to the class membership matrix $L_t$. The iteration finishes when it meets the convergence condition $\mathrm{Score}_{t+1} - \mathrm{Score}_t \le \epsilon$ or reaches the maximum number of iterations.
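The stopping rule of this iteration can be sketched as a small driver loop. The callable `step` is hypothetical: it stands for one full round ($K$-AP on the current subspace, then a new projection) and is assumed to return $\mathrm{Score}_{t+1}$; the default values of `eps` and `max_iter` are placeholders, not values from the paper.

```python
def iterate_until_converged(step, eps=1e-4, max_iter=50):
    """Call step(t) repeatedly until the score gain is at most eps.

    step     -- hypothetical callable running one round and returning its score
    eps      -- convergence tolerance (the epsilon of the stopping condition)
    max_iter -- upper bound on the number of rounds
    Returns (rounds_run, last_score).
    """
    previous = None
    for t in range(max_iter):
        score = step(t)
        # Stop as soon as Score_{t+1} - Score_t <= eps.
        if previous is not None and score - previous <= eps:
            return t + 1, score
        previous = score
    return max_iter, previous


# Toy score sequence that flattens out after the third round.
toy_scores = [1.0, 1.5, 1.7, 1.70001]
toy_result = iterate_until_converged(lambda t: toy_scores[min(t, 3)])
```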

The parameters of our clustering method are the number $\eta$ of nearest in-class points and the number $\zeta$ of nearest out-of-class points. We used cross-validation to tune these parameters and found that selecting $\zeta = \eta = 13$ for all classes per 1,000 documents performs better.

Discussion. The motivation of this module (classification and clustering) is to find the user’s preference (at the subclass level) and to track the hotness of newly published news in a given subclass.

5. Subclass Popularity Prediction Module

In today’s explosion of information, the fast pace of life makes people focus their attention on popular subclasses rather than spend much time searching for and selecting information. Sometimes even users themselves have no idea what they really want. Therefore, hot subclass prediction technology with a recommendation function has become very important. News subclass popularity prediction can improve the performance of a news recommender system. Besides, it can also improve the display of popular news modules on a website automatically, reduce the workload of website editors, and improve the users’ browsing experience.

From a study of historical statistical data on news subclasses, we found that some subclasses are popular periodically. For instance, the college entrance examination subclass becomes highly popular around June every year in China, and many news articles and comments focus on this subclass at that time, as shown in Figure 4(b), which shows the data of the college entrance examination subclass (Figure 4(a) shows the graduate entrance examination subclass). In this paper, we define the degree of concern of a news subclass according to the number of news articles and their comments, as shown in the following:

$$H_k = \lambda H^{(k)}_{ne} + (1 - \lambda) H^{(k)}_{re} = \frac{N^{(k)}_{re}}{N_{ne} + N_{re}} \cdot \frac{N^{(k)}_{ne}}{N_{ne}} + \frac{N^{(k)}_{ne}}{N_{ne} + N_{re}} \cdot \frac{N^{(k)}_{re}}{N_{re}} \tag{17}$$

where $H^{(k)}_{ne}$ denotes the popularity degree of news articles on the $k$th subclass, $H^{(k)}_{re}$ denotes the popularity degree of comments on the $k$th subclass, $\lambda$ is the weight of the popularity degree of news articles and its value is $N^{(k)}_{re}/(N_{ne} + N_{re})$, $N^{(k)}_{re}$ denotes the number of reviews on the $k$th subclass, $N_{re}$ denotes the number of reviews in the whole corpus, $N^{(k)}_{ne}$ denotes the number of news articles on the $k$th subclass, and $N_{ne}$ denotes the number of news articles in the whole corpus. According to time series analysis experiments on our corpus, we found that most subclasses are suitable for the spectral analysis method [27].
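As a small illustration, the weighted form $H_k = \lambda H^{(k)}_{ne} + (1 - \lambda) H^{(k)}_{re}$ can be computed directly from the four counts. The sketch below assumes that $H^{(k)}_{ne}$ and $H^{(k)}_{re}$ are the subclass shares $N^{(k)}_{ne}/N_{ne}$ and $N^{(k)}_{re}/N_{re}$, as the expanded form of (17) suggests, and it leaves $\lambda$ as an explicit argument; the function and parameter names are hypothetical.

```python
def popularity_degree(n_ne_k, n_ne, n_re_k, n_re, lam):
    """Weighted popularity degree H_k of subclass k.

    n_ne_k, n_ne -- news articles in subclass k and in the whole corpus
    n_re_k, n_re -- comments (reviews) in subclass k and in the whole corpus
    lam          -- weight of the news-article component, between 0 and 1
    """
    h_ne = n_ne_k / n_ne  # assumed share of articles falling in subclass k
    h_re = n_re_k / n_re  # assumed share of comments falling in subclass k
    return lam * h_ne + (1.0 - lam) * h_re


# Toy counts: 10 of 100 articles and 50 of 200 comments belong to subclass k.
toy_h = popularity_degree(10, 100, 50, 200, lam=0.5)
```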

Any stationary sequence can be modeled as a combination of many cosine waves with different frequencies, amplitudes, and phases. This analysis method is called the frequency domain based analysis method. The linear combination of $m$ cosines with arbitrary amplitudes, frequencies, and phases is shown in the following:

$$Y_t = A_0 + \sum_{j=1}^{m} \left[A_j \cos(2\pi f_j t) + B_j \sin(2\pi f_j t)\right] \tag{18}$$


Figure 4: Periodic subclass news distribution. (a) Popularity degree of graduate entrance examination subclass. (b) Popularity degree of college entrance examination subclass. Remark: the $x$-axis (Days, 0 to 1000) denotes the date from August 1, 2009 to May 3, 2012; the $y$-axis (Proportion, 0 to 1) denotes the value of $H_k$ in (17).

The values of $A$ and $B$ can be obtained by ordinary least squares regression. When the frequencies take a special form, the calculation becomes very simple. If $n$ is an odd number, which can be expressed as $n = 2k + 1$, then the frequencies of the form $1/n, 2/n, \ldots, k/n$ are called Fourier frequencies. The estimated parameters are as follows:

$$A_0 = \bar{Y}, \qquad A_j = \frac{2}{n} \sum_{t=1}^{n} Y_t \cos\left(\frac{2\pi t j}{n}\right), \qquad B_j = \frac{2}{n} \sum_{t=1}^{n} Y_t \sin\left(\frac{2\pi t j}{n}\right) \tag{19}$$

If the sample size is even, which can be expressed as $n = 2k$, (19) still holds. But the equation changes to the following when $f_k = k/n = 1/2$:

$$A_k = \frac{1}{n} \sum_{t=1}^{n} (-1)^t Y_t, \qquad B_k = 0 \tag{20}$$

Definition 1. When the sample size is odd, namely, $n = 2k + 1$, we define the cycle diagram (periodogram) at frequency $f = j/n$ ($j = 1, 2, \ldots, k$) as $I$, as shown in the following equation:

$$I\left(\frac{j}{n}\right) = \frac{n}{2}\left(A_j^2 + B_j^2\right) \tag{21}$$

If the sample size is even, (19) still gives the $A$ and $B$ values and the cycle diagram is the same as in the odd case. But in the extreme frequency case, that is, when $f = k/n = 1/2$, the cycle diagram is shown in the following equation:

$$I\left(\frac{j}{n}\right) = n (A_j)^2 \tag{22}$$

The periodogram at frequency $f = j/n$ is proportional to the sum of squares of the corresponding regression coefficients. Therefore, the peaks of the periodogram show the relative intensity of the sine-cosine pairs at different frequencies, as shown in Figure 5.
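The estimates in (19) and the periodogram in (21) can be sketched in plain Python as follows. This is an illustrative implementation under stated assumptions rather than the authors' code: the function names are hypothetical, only Fourier frequencies strictly below 1/2 are covered (so the special case (22) is not needed), and the toy series is synthetic.

```python
import math


def fourier_coefficients(y, j):
    """Return (A_j, B_j) as in (19) for the Fourier frequency j/n."""
    n = len(y)
    a = 2.0 / n * sum(y[t - 1] * math.cos(2 * math.pi * t * j / n)
                      for t in range(1, n + 1))
    b = 2.0 / n * sum(y[t - 1] * math.sin(2 * math.pi * t * j / n)
                      for t in range(1, n + 1))
    return a, b


def periodogram(y):
    """Return [(f, I(f))] as in (21) for frequencies f = j/n below 1/2."""
    n = len(y)
    result = []
    for j in range(1, (n - 1) // 2 + 1):
        a, b = fourier_coefficients(y, j)
        result.append((j / n, n / 2.0 * (a * a + b * b)))
    return result


# Synthetic series: mean 3 plus one cosine with amplitude 2 at frequency 3/21.
toy_n = 21
toy_series = [3.0 + 2.0 * math.cos(2 * math.pi * 3 * t / toy_n)
              for t in range(1, toy_n + 1)]
peak_frequency, peak_energy = max(periodogram(toy_series), key=lambda p: p[1])
```

On this toy series the only peak sits at the frequency of the embedded cosine, which is how strong frequencies would be read off a plot such as Figure 5.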

In Figure 5, the periodogram has two peaks, at 0.004970179 and 0.002982107; namely, the subcycle $T = 1/f$ may be 201 days and 335 days. The other peaks are so low that they can be neglected. The two frequencies are selected for building the model, which means that the model has two sine-cosine pairs in it, as shown in the following:

$$Y_t = \beta + \beta_1 \cos(2\pi f_1 t) + \beta_2 \sin(2\pi f_1 t) + \beta_3 \cos(2\pi f_2 t) + \beta_4 \sin(2\pi f_2 t) + e_t \tag{23}$$

Using the spectral analysis method for prediction involves several steps. First, we use the periodogram to obtain the values and the number of strong frequencies. Second, the model is generated from these strong frequencies. Finally, we predict future data values according to the model, which only requires a time parameter.
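The last step, prediction from a fitted model of the form (23), can be illustrated as follows. The two frequencies are the peaks reported for Figure 5; the coefficient values are made-up placeholders rather than fitted values, and `predict` is a hypothetical helper that evaluates the model without the noise term $e_t$.

```python
import math


def predict(t, beta0, pairs):
    """Evaluate a model of the form (23) at time t, omitting the noise term.

    pairs -- one (frequency, cosine_coefficient, sine_coefficient) per peak
    """
    return beta0 + sum(bc * math.cos(2 * math.pi * f * t)
                       + bs * math.sin(2 * math.pi * f * t)
                       for f, bc, bs in pairs)


F1, F2 = 0.004970179, 0.002982107  # peak frequencies reported for Figure 5
subcycles = [round(1 / F1), round(1 / F2)]  # T = 1/f, in days

# Placeholder coefficients for illustration only; real values come from the
# least squares fit of (23) to the observed popularity series.
toy_model = [(F1, 0.20, 0.10), (F2, 0.05, 0.02)]
forecast_day_0 = predict(0, 0.10, toy_model)
```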

Discussion. The motivation of this module is to obtain the hotness of each subclass. Some recent studies also take into account the popularity of newly published news articles. For example, SCENE [3] used the popularity degree, which is computed as the ratio of the number of users accessing the article to the size of the users’ pool. However, for the newest popular news article $n_i$, its number of clicks would be less than that of news articles published several hours or days before.

Figure 5: Periodogram of the popularity degree of college entrance examination subclass. (a) The complete periodogram, with frequency range from 0 to 0.5. (b) Locally amplified periodogram, with frequency range from 0 to 0.05. Remark: the $x$-axis denotes the possible frequency of the popularity degree; the $y$-axis represents the energy of the corresponding frequency.

6. User Profile Module

In order to capture a user’s reading interest in news items, a personalized news recommendation system generally needs to construct the user’s profile. Traditionally, the user profile can be captured by tracking the user’s reading history. A survey of various user profile construction techniques is provided in [28, 29]. In this paper, we use the microblog to construct the user’s profile. The reason is that a user who is interested in some subclasses will tend to tweet or retweet microblogs on these subclasses. For instance, if a user tweets or retweets many messages about basketball games, we can deduce that this user may like reading basketball news reports (i.e., NBA, CBA, etc.). Besides, many readers tend to glance at news articles and are interested in some key persons’ names. Moreover, people from different areas tend to read the news from the city they live in or from their hometown. Based on the above analysis, we propose to construct users’ profiles by exploring the factors discussed above: microblog content, place name, and preferred key persons. In order to reduce the computational complexity, the user’s preference is also taken into account in our model and is represented by a vector $U_{pf} = \{\tau, \rho, \kappa\}$. Consider the following.

(1) $\tau$ represents the distribution of key index words in the microblogs which the user tweeted or retweeted in the past, and it can be expressed as a vector $\{\langle t_1, w_1\rangle, \langle t_2, w_2\rangle, \ldots\}$, where each element consists of an index word and its corresponding weight.

(2) $\rho$ represents the place names which appeared in the microblogs of a specific user, and it can be expressed as $\{\langle p_1, w_1\rangle, \langle p_2, w_2\rangle, \ldots, \langle p_i, w_i\rangle\}$, where $p_i$ denotes a place name and $w_i$ denotes the number of times this place appears in the tweets of the given user. We collect all the city and province names in China. Some place names are subordinate to others; for instance, GuangZhou city is subordinate to GuangDong province. In this case, the system adds weight to GuangDong using $w_{\text{GuangDong}} \mathrel{+}= w_{\text{GuangZhou}}$ when GuangZhou appears.

(3) $\kappa$ represents the list of key persons’ names extracted from the user’s microblogs, $\{\langle k_1, w_1\rangle, \langle k_2, w_2\rangle, \ldots\}$, where the name list is constructed from the NanFang Daily training corpus, in which the key persons’ names have been tagged in each news article.
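The place-name component $\rho$ and its province roll-up can be sketched as below. The city-to-province table `PARENT` is a tiny hypothetical stand-in for the full list of Chinese cities and provinces, and adding one unit to the province per city mention is one possible reading of the update rule for $w_{\text{GuangDong}}$, not necessarily the authors' exact bookkeeping.

```python
from collections import Counter

# Hypothetical excerpt of the city -> province table.
PARENT = {"GuangZhou": "GuangDong", "ShenZhen": "GuangDong"}


def place_weights(mentions):
    """Count place mentions and credit each city mention to its province too.

    mentions -- place names found in one user's tweets, in order of appearance
    Returns a Counter mapping place name -> weight.
    """
    weights = Counter()
    for place in mentions:
        weights[place] += 1
        parent = PARENT.get(place)
        if parent is not None:
            # A subordinate place also adds weight to its province.
            weights[parent] += 1
    return weights


toy_rho = place_weights(["GuangZhou", "GuangZhou", "GuangDong", "ShenZhen"])
```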

7. Personalized News Recommendation Module

The recommendation module can be divided into two steps: Rough Selection (see Section 7.1) and Precise Selection (see Section 7.3). In the first step, some subclasses are matched according to the user’s preference. Then we select the news articles from these subclasses using our selection strategy.

7.1. Rough Selection: Subclass Selection for a User. Once we obtain the subclasses and the user’s profile, we can calculate the similarity between each subclass and a given user. We use TF-IDF weights to represent the vector of a given subtopic, $\tau_s = \{\langle t_1, w_1\rangle, \langle t_2, w_2\rangle, \ldots\}$. The similarity between a subclass and a user (represented as $\tau_u = \{\langle t_1, w_1\rangle, \langle t_2, w_2\rangle, \ldots\}$ in $U_{pf}$; see Section 6) is computed by cosine similarity. In general, users tend to prefer some particular subclasses; that is, they are not interested in all subclasses. Therefore, we can roughly select some subclasses with a similarity threshold. This threshold is set equal to the top 30% of the similarity score ranking with respect to a given user.
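Rough selection can be sketched as a cosine ranking over sparse term-weight vectors followed by a cut at the top 30%. The names `cosine` and `rough_select` are hypothetical, the vectors are plain dictionaries standing in for the TF-IDF vectors $\tau_u$ and $\tau_s$, and rounding the cut up to at least one subclass is an assumption of this sketch.

```python
import math


def cosine(u, v):
    """Cosine similarity of two sparse vectors given as {term: weight} dicts."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def rough_select(user_vec, subclass_vecs, keep=0.3):
    """Return the names of the top `keep` fraction of subclasses by similarity."""
    ranked = sorted(subclass_vecs,
                    key=lambda name: cosine(user_vec, subclass_vecs[name]),
                    reverse=True)
    n_keep = max(1, math.ceil(keep * len(ranked)))  # assumed rounding rule
    return ranked[:n_keep]


# Toy vectors; the terms and weights are invented for illustration.
toy_user = {"nba": 1.0, "game": 1.0}
toy_subclasses = {
    "basketball": {"nba": 2.0, "game": 1.0},
    "exam": {"exam": 1.0},
    "finance": {"stock": 1.0, "game": 0.1},
}
toy_selected = rough_select(toy_user, toy_subclasses)
```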

7.2. News Profile Construction. After obtaining the news clusters that the user might be interested in, the next step is to select specific news articles for the given user. As with users, we initially maintain a news profile for each news article, and then model the recommendation as a budgeted maximum coverage problem and solve it by a greedy selection algorithm.


A news profile contains many similar factors, for example, key person, place, the cluster it belongs to, recency, popularity, and so forth. For the popularity, as we discussed above, we use $H_k$ to represent the popularity degree of cluster $k$. For the recency, the score is represented as follows:

$$\mathrm{Rec}(i) = \frac{i_c - i_p}{24 \times 60} \tag{24}$$

where the $\mathrm{Rec}(i)$ function returns the recency score of news article $i$, and $i_c$ and $i_p$ denote the current time and the published time, respectively.

In this paper, news profiles help to evaluate how well a news article can satisfy the user. Given a news profile $N_{pf} = \{\rho, \kappa, \upsilon\}$ and a user’s profile $U_{pf} = \{\rho, \kappa, \upsilon\}$, the similarity between $N_{pf}$ and $U_{pf}$ is computed as

$$\mathrm{sim}(N_{pf}, U_{pf}) = \gamma_1\, \mathrm{sim}(\rho_n, \rho_u) + \gamma_2\, \mathrm{sim}(\kappa_n, \kappa_u) + \gamma_3\, \mathrm{sim}(\upsilon_n, \upsilon_u) \tag{25}$$

where $\gamma_1$, $\gamma_2$, and $\gamma_3$ are parameters that control how much we trust or weigh the corresponding components and are set to 1 in our system. Each component is calculated by the cosine similarity.

Let $E$ be a finite set and $f$ a real valued nondecreasing function defined on the subsets of $E$ that satisfies

$$f(T \cup \{\varsigma\}) - f(T) \le f(S \cup \{\varsigma\}) - f(S) \tag{26}$$

where $S \subseteq T$, $S$ and $T$ are two subsets of $E$, and $\varsigma \in E \setminus T$. Such a function $f$ is called a submodular function [30]. By adding an element to a larger set $T$, the value increment of $f$ cannot be larger than that of adding the element to a smaller set $S$. The budgeted maximum coverage problem can be described as follows: given a set of elements $E$, in which each element is associated with an influence and a cost defined over a domain of these elements, and a budget $B$, the goal is to find a subset of $E$ which has the largest influence while the total cost does not exceed the budget $B$. This problem is NP-hard [31]. However, [31] proposed a greedy algorithm which sequentially picks the element that gives the largest possible increase in influence within the cost limit. Submodularity resides in each pick-up step. According to the result of [32], submodular functions are closed under nonnegative linear combinations.

7.3. Precise Selection: News Selection for Recommendation. In a given news subclass, we observe that most news articles concentrate on a similar topic, with minor differences on major aspects of the corresponding topic. Typically, a reader is interested in some aspects of the given subclass but not all of them. Based on this intuition, our news selection strategy can be described as follows.

Assume that $\mathcal{C}$ denotes the newly published news set, $\mathcal{S}$ represents the selected news set, and $\varsigma$ denotes the news article being selected. After selecting a piece of news $\varsigma$, we must ensure that

(i) the topic diversity should not deviate much in $\mathcal{S}$;
(ii) $\mathcal{S}$ should give more satisfaction to the given user;
(iii) $\mathcal{S}$ should be similar to the general topic in $\mathcal{C} \setminus \mathcal{S}$.

For each of the above strategies, similar to [3], we define a quality function $q(\mathcal{S})$ to evaluate the value of the currently selected news set $\mathcal{S}$ as follows:

$$q(\mathcal{S}) = \frac{1}{\binom{|\mathcal{S}|}{2}} \sum_{n_1, n_2 \in \mathcal{S},\, n_1 \ne n_2} -\mathrm{sim}(n_1, n_2) + \frac{1}{|\mathcal{S}|} \sum_{n_1 \in \mathcal{S}} \mathrm{sim}(u, n_1) + \frac{1}{|\mathcal{C} \setminus \mathcal{S}| \cdot |\mathcal{S}|} \sum_{n_1 \in \mathcal{C} \setminus \mathcal{S}} \sum_{n_2 \in \mathcal{S}} \mathrm{sim}(n_1, n_2) \tag{27}$$

where $n_1$ and $n_2$ denote news items, $u$ denotes the given user, and the $\mathrm{sim}(\cdot, \cdot)$ function returns the similarity of its two parameters. Equation (27) contains three components corresponding to the news selection strategy listed above; $q(\mathcal{S})$ balances the contribution of the different components. Suppose $\varsigma$ is the candidate news document; the quality increase can be represented as

$$I(\varsigma) = q(\mathcal{S} \cup \{\varsigma\}) - q(\mathcal{S}) \tag{28}$$
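The quality function (27) and the increment (28) can be combined into a greedy picker, sketched below under several assumptions: the budget is the number of articles, the first sum in (27) runs over unordered pairs (matching its binomial normalizer), empty sums contribute zero, and similarities are supplied by the caller. The names `quality` and `greedy_select` and the toy similarity values are hypothetical.

```python
from itertools import combinations


def quality(selected, candidates, user_sim, pair_sim):
    """q(S) as in (27): diversity + user satisfaction + coverage of the rest."""
    if not selected:
        return 0.0
    rest = [c for c in candidates if c not in selected]
    pairs = list(combinations(selected, 2))
    diversity = -sum(pair_sim(a, b) for a, b in pairs) / len(pairs) if pairs else 0.0
    satisfaction = sum(user_sim[a] for a in selected) / len(selected)
    coverage = (sum(pair_sim(a, b) for a in rest for b in selected)
                / (len(rest) * len(selected))) if rest else 0.0
    return diversity + satisfaction + coverage


def greedy_select(candidates, user_sim, pair_sim, budget):
    """Repeatedly add the article with the largest quality increase (28)."""
    selected = []
    while len(selected) < min(budget, len(candidates)):
        base = quality(selected, candidates, user_sim, pair_sim)
        best = max((c for c in candidates if c not in selected),
                   key=lambda c: quality(selected + [c], candidates,
                                         user_sim, pair_sim) - base)
        selected.append(best)
    return selected


# Toy data: articles a and b are near-duplicates, c covers a different aspect.
TOY_PAIR = {frozenset("ab"): 0.9, frozenset("ac"): 0.1, frozenset("bc"): 0.2}
TOY_USER = {"a": 0.8, "b": 0.7, "c": 0.3}
toy_picked = greedy_select(["a", "b", "c"], TOY_USER,
                           lambda x, y: TOY_PAIR[frozenset((x, y))], budget=2)
```

In the toy run the second pick is c rather than b, although b matches the user better, because adding a near-duplicate of a is penalized by the diversity term.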

The goal is to select a list of recommended news documents which provides the largest possible value within the budget (i.e., the budget can be regarded as the maximum number of articles in the recommended list). We can obtain a list of news documents for each subclass by adopting the greedy selection algorithm. Taking into account the other characteristics of news documents, for example, the popularity and the recency, the ranking of the selected news articles needs to be adjusted in order to make the recommended results more reasonable. Formally, given a news article $n$, the popularity and the recency can be combined as

$$n_\phi = \frac{H_{k_n} - H_{\min}}{H_{\max} - H_{\min}} - \frac{\mathrm{Rec}(n) - \mathrm{Rec}_{\min}}{\mathrm{Rec}_{\max} - \mathrm{Rec}_{\min}} \tag{29}$$

where $H_{k_n}$ denotes the popularity degree of the subclass to which the news article $n$ belongs and $\mathrm{Rec}(n)$ can be obtained from (24). From the equation above, we note that the smaller the recency score is, the higher the article ranks. Besides, the greater the popularity is, the higher the article ranks. After computing the $n_\phi$ value of each article in the recommended list, we apply a quicksort algorithm to these articles according to $n_\phi$. With this adjustment, the generated ranking emphasizes popularity and freshness while still concentrating on news documents that satisfy the user’s preference.
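The re-ranking by $n_\phi$ can be sketched as follows. The helper names are hypothetical, times are assumed to be given in minutes (so that (24) yields an age in days), a zero range is mapped to 0 to avoid division by zero, and Python's built-in `sorted` is used in place of a hand-written quicksort.

```python
def recency(now_minutes, published_minutes):
    """Rec(i) as in (24); yields days when both times are in minutes (assumed)."""
    return (now_minutes - published_minutes) / (24 * 60)


def _scale(value, lo, hi):
    """Min-max scaling; a zero range is mapped to 0.0 (assumption)."""
    return (value - lo) / (hi - lo) if hi > lo else 0.0


def rank_articles(articles):
    """Sort (article_id, subclass_popularity, recency_score) tuples by n_phi."""
    pops = [a[1] for a in articles]
    recs = [a[2] for a in articles]

    def n_phi(a):
        # Scaled popularity minus scaled recency, as in (29).
        return (_scale(a[1], min(pops), max(pops))
                - _scale(a[2], min(recs), max(recs)))

    return sorted(articles, key=n_phi, reverse=True)


# Toy articles: y is both the most popular and the freshest.
toy_articles = [("x", 0.2, 0.5), ("y", 0.8, 0.1), ("z", 0.5, 2.0)]
toy_order = [a[0] for a in rank_articles(toy_articles)]
```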

8. Experimental Evaluation

In this section, we provide a comprehensive experimental evaluation to show the efficacy of our proposed news recommendation system. We start with an introduction to a real-world collection obtained from the news and microblog service website SINA. After that, we describe the experimental design and show the results based on the recommendation framework introduced in this paper.


Table 2: Recommendation Micro-F1 (Top 30) of different time periods for different classification based systems.

| Range (Y/M) | # | NB | Cheng | Z. Guo | NEMAH |
|---|---|---|---|---|---|
| 0908–0908 | 4,239 | 0.204 | 0.242 | 0.270 | 0.351 |
| 0910–0912 | 37,910 | 0.206 | 0.254 | 0.268 | 0.364 |
| 1001–1006 | 75,047 | 0.227 | 0.289 | 0.297 | 0.403 |
| 1007–1107 | 151,995 | 0.198 | 0.271 | 0.274 | 0.371 |
| 0908–1208 | 280,737 | 0.210 | 0.273 | 0.284 | 0.383 |

Remark: # denotes the number of news articles. Time range 0908 denotes August 2009.

8.1. Real-World Data Set. For the experiments, we gather the news data from SINA (http://news.sina.com.cn), where the data collection ranges from August 1, 2009 to August 31, 2012. We also gather the users who comment on these articles and their microblogs from SINA (http://weibo.com) and preprocess the data by removing microblog messages that are too short (i.e., fewer than 3 words) and the nonactive users (i.e., the users who tweeted or retweeted fewer than 10 messages) for verifying our recommendation performance. After preprocessing, 5,127 users are stored with 124,301 messages and 280,737 news articles.

8.2. Experiments. Our system has four major components: (1) a module responsible for classifying and clustering news articles; (2) a component for constructing and updating user profiles; (3) hot news subclass prediction based on time-series analysis; and (4) a recommendation component using news clusters and user profiles, accompanied by the subclass popularity factor and recency. From the experimental perspective, we first verify our components. We then verify our system against the state-of-the-art approaches and design a user study.

8.2.1. Classification and Clustering Evaluation. In order to evaluate the performance of the classification and clustering component, we design two experiments.

(1) Classification Comparison. Many classification methods have been proposed in the past decade in the field of text processing. We implement the following three classification methods: the method of Cheng et al. [33], the method of Guo et al. [34], and the Naïve Bayesian (NB) method. Cheng proposed a text classification method based on a refined concept index, and Guo employed a genetic algorithm for classification. Before using the classification module, we must set the $\alpha$ in (8) and decide the threshold of feature selection through an offline experiment, as shown in Figure 6, where T-10 denotes that threshold = 10 in feature selection and F-score is Micro-F1. The performance is best roughly when $\alpha = 0.2$. From the result, we also observe that the thresholds 20, 30, and 40 produce similar results, so we use $T = 20$ in our processing.

Table 2 lists the recommendation evaluation results for the different classification methods. Based on the comparison, our proposed method outperforms the classical Naïve Bayesian method and the methods of Cheng and Guo in terms of the F1 measure. A straightforward explanation for the improvement is that our method uses fewer features, selected by the method we proposed, to represent news articles and implements a series of two-class classifications to mitigate the similarity problem between different classes; the most important reason may be that we incorporate the manually classified key persons into the method.

Figure 6: $\alpha$ parameter selection via classification (series: T-10, T-20, T-30, T-40; $x$-axis: $\alpha$ from 0 to 0.7). Remark: the $y$-axis is the F1 measure score of our classification using different $\alpha$.

(2) Clustering Comparison. In reality, we need to cluster the news articles into subclasses every day, even every hour. From our spider software, we know that thousands of news articles arrive per day. $K$-means and hierarchical clustering are the most common clustering algorithms. In order to verify our proposed method, we design the experiment as follows: (1) use 500, 1000, and 1500 as the number of newly published articles for processing; (2) for each scale of dataset, perform classification on these data; (3) perform $K$-means, hierarchical clustering, and our proposed clustering method on these data; (4) perform Top 30 news recommendation; and (5) compute the F1 score for the different clustering based systems. The comparison of recommendation on different subclass clustering methods is shown in Figure 7.

From the experimental result, we have the following observations: (1) NEMAH achieves better results than the other methods in terms of F1 score; (2) NEMAH is more stable than the other methods. A straightforward explanation might be that $K$-means clustering needs an initial clustering center for each cluster. Besides, with fewer parameters, our proposed method has stronger generalization and learning ability without requiring knowledge of the size and distribution of the text corpus.

8.2.2. User Profile and Subclass Popularity Prediction Evaluation. The user profile is an important factor in a recommendation system and can affect the recommendation result significantly. Our user profile construction includes the following factors: content, place name, and key person. Prior approaches often use the content or similar access patterns to construct the user profile. SCENE [3] used the content, similar access patterns, and entities extracted by GATE [35]. In reality, entities such as place names and key person names are relatively stable over a period. Figure 8 shows the results of using different user profile building methods and subclass popularity prediction methods.

Figure 7: Recommendation performance at different data scales for different clustering based systems ($K$-means, Hierarchical, NEMAH; $y$-axis: F1 score). Remark: $n$ denotes the number of news articles for clustering ($n = 500$, $n = 1000$, $n = 1500$).

Figure 8: Recommendation F1 score of different profile factors and subclass popularity prediction methods ($x$-axis: T5, T10, T20, T30; series: C, C + GATE, C + P + K, C + P + K + E, C + P + K + S). Remark: C: content; P: place name; K: key person name; GATE: entity names extracted using the GATE tool; E: popularity prediction using three-time exponential smoothing; S: popularity prediction using spectral analysis (employed by NEMAH).

From this result, we observe the following. (1) Our method performs better than using GATE. (2) Recommendation using content only cannot perform well because microblog messages do not contain much content. (3) The spectral analysis employed in subclass popularity prediction performs better than the three-time exponential smoothing method. Although the average performance of spectral analysis is better than that of three-time exponential smoothing, in our other work on time series analysis we found that the cycle diagrams of some subclasses have weaker frequency signals, which tends to cause overfitting with a large number of sine-cosine pairs and yields worse results in these subclasses. SCENE [3] also used the popularity degree, which is computed as the ratio of the number of users accessing the article to the size of the users’ pool. However, this method conflicts with freshness. The straightforward reason is that the freshest news may have received few clicks.

Table 3: Diversity evaluation on different recommendation lists.

| Methods | T10 | T20 | T30 |
|---|---|---|---|
| Goo | 0.5104 | 0.4320 | 0.1215 |
| ClickB | 0.5231 | 0.4457 | 0.1587 |
| Bilinear | 0.5024 | 0.3547 | 0.1478 |
| Bandit | 0.6112 | 0.3874 | 0.2674 |
| SCENE | 0.6821 | 0.5747 | 0.5687 |
| NEMAH | 0.7425 | 0.6941 | 0.6637 |

Remark: T$n$ denotes the recommended result with top-$n$.

8.2.3. Diversity Evaluation. The recommended news list of NEMAH shows great diversity at both the class and subclass levels. Let $R(u)$ be the recommended news list of a user $u$; the diversity of $R(u)$ can be defined as follows:

$$\mathrm{Diversity} = 1 - \frac{\sum_{i, j \in R(u),\, i \ne j} \mathrm{sim}(i, j)}{(1/2)\, |R(u)|\, (|R(u)| - 1)} \tag{30}$$

where $i$ and $j$ are two different news articles in the recommendation list for user $u$ and $\mathrm{sim}(i, j)$ denotes the news profile similarity between the news items $i$ and $j$. For this metric, we choose Goo [11] (a collaborative filtering based method), ClickB [4] (a content-based method), Bilinear [14], Bandit [15], and SCENE [3] (a hybrid method using LSH for clustering and a greedy algorithm for news selection) as the comparison baselines. Table 3 shows the diversity results with $|R(u)| = 10$, $|R(u)| = 20$, and $|R(u)| = 30$, in which we use T10 to represent $|R(u)| = 10$.
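The diversity measure (30) amounts to one minus the mean pairwise similarity of the recommended list. The sketch below assumes that the sum runs over unordered pairs, which matches the normalizer $(1/2)|R(u)|(|R(u)| - 1)$, and returns 0 for lists with fewer than two items; the function name and the toy similarities are hypothetical.

```python
from itertools import combinations


def diversity(rec_list, sim):
    """Diversity as in (30): one minus the mean pairwise similarity.

    rec_list -- recommended article ids for one user
    sim      -- callable returning the news profile similarity of two ids
    """
    n = len(rec_list)
    if n < 2:
        return 0.0  # assumption: no pairs means no measurable diversity
    total = sum(sim(i, j) for i, j in combinations(rec_list, 2))
    return 1.0 - total / (0.5 * n * (n - 1))


# Toy similarities between three articles, invented for illustration.
TOY_SIM = {frozenset("ab"): 0.9, frozenset("ac"): 0.1, frozenset("bc"): 0.2}
toy_diversity = diversity(["a", "b", "c"],
                          lambda x, y: TOY_SIM[frozenset((x, y))])
```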

From Table 3, we can see that our system outperforms the others significantly; the straightforward reason is that we diversify the news not only according to the preference of the user but also according to the distribution of candidate news articles. As the recommendation list grows, the diversity of the baseline methods decreases significantly because they rely too much on the preference of the user.

8.2.4. System Accuracy Evaluation. In order to verify the effectiveness of our proposed NEMAH, we implement a recommender system that models the recommendation as a contextual bandit problem [15]. We also implement the SCENE [3] prototype system, which employs LSH (Locality Sensitive Hashing) for news clustering and greedy selection for user recommendation. For each method, we select 50 users and provide news recommendation results for them. Figure 9 shows the comparison results as Top 10, Top 20, and Top 30 news items for each user.

Figure 9: Precision-recall plots of different recommendations: (a) Top 10, (b) Top 20, and (c) Top 30. Remark: ∘ (green) denotes the bandit-based recommender, ◻ (blue) denotes the SCENE recommender, and ∗ (red) represents NEMAH.

In the above experiments, we can observe that, besides the higher accuracy, the distribution of our system’s results is more stable than that of the other approaches.

In reality, if users read only a few news articles every day, many news recommendation systems cannot produce good results for these users. Our system can address this problem thanks to the microblog user profile building. Figure 10 shows the comparison results for different user groups over all users (5,127 users). Suppose a reader reads $N$ news articles per day. From this figure, we can see that our proposed system produces reasonable results for nonactive users. SCENE also produces fairly good results. The reason is that NEMAH and SCENE consider the named entities referred to by users. Besides, NEMAH takes into account the popularity degree of a news article.

8.2.5. A User Study on NEMAH. In order to obtain further evaluation metrics to verify our proposed news recommendation system, we develop a prototype system of NEMAH and design a questionnaire which includes the following questions: (1) satisfaction with the news content; (2) ordering of the recommendation list; (3) preference of the news subclasses; (4) popularity of the news articles; and (5) novelty of the recommendation list. For each question, we define 5 levels for selection, where 1: So Bad, 2: Not Very Well, 3: Average, 4: Good but Needs to Improve, and 5: Excellent. We crawl news articles of the latest three days from several famous news websites as a candidate set for recommendation. Finally, we recruit 50 volunteers, who are required to have a microblog account on the SINA website, to help us complete the questionnaire. We send them the same questionnaire with different recommendation lists every week, three times in total. The average result of this user study is shown in Figure 11. From the result of the user study, we can see that NEMAH satisfies the requirements represented by our questions for most people.

Figure 10: Comparison of F1 score of different approaches (Goo, ClickB, Bandit, SCENE, NEMAH) for different user groups ($N < 10$, $10 \le N < 30$, $30 \le N < 50$, $N \ge 50$). Remark: $N$ denotes the number of news articles per day read by a user.

Figure 11: User study on different metrics (Satisfaction, Ordering, Preference, Popular, Novelty; $y$-axis: average score; series: ClickB, NEMAH).

9. Conclusion

In this paper, we proposed the NEMAH system architecture to tackle personalized news recommendation based on microblog and subclass popularity prediction. We explore the intrarelations among microblog content and news items and consider the subclass popularity factor, similarity among users, place, and key person factors synthetically. Our system supports effective classification and subclass clustering of newly published news articles along with a small historical corpus. High quality classification and clustering construct a better data structure for recommendation. Experimental results compared with some state-of-the-art algorithms have demonstrated the better performance of NEMAH. Besides, our work in Sections 4 and 5 can be used for automatic module layout and channel ranking.

For future work, to process massive numbers of network news articles, we plan to deploy some components (e.g., classification, clustering, and subclass popularity analysis) onto the Map-Reduce framework on our distributed system. Moreover, we also plan to integrate the subclass popularity prediction module into our news search engine, given its effectiveness in our work. Another notable point is the interest evolution of users (e.g., time, place, and other factors), which can provide insights into news reading behaviors.

Disclosure

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee, provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers, or to redistribute to lists requires prior specific permission and/or a fee.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

References

[1] O. Phelan, K. McCarthy, and B. Smyth, “Using twitter to recommend real-time topical news,” in Proceedings of the 3rd ACM Conference on Recommender Systems, pp. 385–388, ACM, October 2009.

[2] G. de Francisci Morales, A. Gionis, and C. Lucchese, “From chatter to headlines: Harnessing the real-time web for personalized news recommendation,” in Proceedings of the 5th ACM International Conference on Web Search and Data Mining (WSDM ’12), pp. 153–162, ACM, February 2012.

[3] L. Li, D. Wang, T. Li, D. Knox, and B. Padmanabhan, “SCENE: A scalable two-stage personalized news recommendation system,” in Proceedings of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR ’11), pp. 125–134, July 2011.

[4] J. Liu, P. Dolan, and E. R. Pedersen, “Personalized news recommendation based on click behavior,” in Proceedings of the 15th ACM International Conference on Intelligent User Interfaces (IUI ’10), pp. 31–40, ACM, February 2010.

[5] J. Schafer, J. Konstan, and J. Riedi, “Recommender systems in e-commerce,” in Proceedings of the 1st ACM Conference on Electronic Commerce, pp. 158–166, ACM, 1999.

[6] D. Billsus and M. J. Pazzani, “Personal news agent that talks, learns and explains,” in Proceedings of the 3rd Annual Conference on Autonomous Agents, pp. 268–275, ACM, May 1999.

[7] M. Pazzani and D. Billsus, “The identification of interesting web sites,” Machine Learning, vol. 27, no. 3, pp. 313–331, 1997.

[8] U. Shardanand and P. Maes, “Social information filtering: algorithms for automating ‘word of mouth’,” in Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pp. 210–217, ACM, May 1995.

[9] R. G. Cota, A. A. Ferreira, C. Nascimento, M. A. Goncalves, and A. H. F. Laender, “An unsupervised heuristic-based hierarchical method for name disambiguation in bibliographic citations,” Journal of the American Society for Information Science and Technology, vol. 61, no. 9, pp. 1853–1870, 2010.

[10] J. Breese, D. Heckerman, and C. Kadie, “Empirical analysis of predictive algorithms for collaborative filtering,” in Proceedings of the 14th Conference on Uncertainty in Artificial Intelligence, pp. 43–52, Morgan Kaufmann Publishers, 1998.

[11] A. S. Das, M. Datar, A. Garg, and S. Rajaram, “Google news personalization: Scalable online collaborative filtering,” in Proceedings of the 16th International World Wide Web Conference (WWW ’07), pp. 271–280, ACM, May 2007.

[12] M. Balabanovic and Y. Shoham, “Content-Based Collaborative Recommendation,” Communications of the ACM, vol. 40, no. 3, pp. 66–72, 1997.

[13] M. Claypool, A. Gokhale, T. Miranda, P. Murnikov, D. Netes, and M. Sartin, “Combining content-based and collaborative filters in an online newspaper,” in Proceedings of the ACM SIGIR Workshop on Recommender Systems, vol. 60, Citeseer, 1999.

[14] W. Chu and S. T. Park, “Personalized recommendation on dynamic content using predictive bilinear models,” in Proceedings of the 18th International Conference on World Wide Web, pp. 691–700, ACM, 2009.

[15] L. Li, W. Chu, J. Langford, and R. E. Schapire, “A contextual-bandit approach to personalized news article recommendation,” in Proceedings of the 19th International Conference World Wide Web (WWW ’10), pp. 661–670, ACM, April 2010.

[16] C. Best, E. van der Goot, M. de Paola, T. Garcia, and D. Horby, “Europe media monitor–emm,” JRC Technical Note no. I 2, 2002.

[17] E. Gabrilovich, S. Dumais, and E. Horvitz, “Newsjunkie: providing personalized newsfeeds via analysis of information novelty,” in Proceedings of the 13th International Conference World Wide Web (WWW ’04), pp. 482–490, ACM, May 2004.

[18] G. Wang, F. H. Lochovsky, and Q. Yang, “Feature selection with conditional mutual information MaxiMin in text categorization,” in Proceedings of the 13th ACM Conference on Information and Knowledge Management (CIKM ’04), pp. 342–349, ACM, November 2004.

[19] C. Lee and G. G. Lee, “Information gain and divergence-based feature selection for machine learning-based text categorization,” Information Processing and Management, vol. 42, no. 1, pp. 155–165, 2006.

[20] A. M. A. Mesleh, “Chi square feature extraction based svms arabic language text categorization system,” Journal of Computer Science, vol. 3, no. 6, pp. 430–435, 2007.

[21] Z. Wei, D. Miao, J. H. Chauchat, and C. Zhong, “Feature selection on chinese text classification using character N-grams,” in Rough Sets and Knowledge Technology, vol. 5009 of Lecture Notes in Computer Science, pp. 500–507, Springer, Berlin, Germany, 2008.

[22] Y. S. Lai and C. H. Wu, “Meaningful term extraction and discriminative term selection in text categorization via unknown-word methodology,” ACM Transactions on Asian Language Information Processing (TALIP), vol. 1, no. 1, pp. 34–64, 2002.

[23] R. Rifkin and A. Klautau, “In defense of one-vs-all classification,” The Journal of Machine Learning Research, vol. 5, pp. 101–141, 2004.

[24] A. E. Hoerl and R. W. Kennard, “Ridge regression: Biased estimation for nonorthogonal problems,” Technometrics, vol. 42, no. 1, pp. 80–86, 2000.

[25] F. Wang and C. Zhang, “Feature extraction by maximizing the average neighborhood margin,” in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR ’07), June 2007.

[26] X. Zhang, W. Wang, K. Nørvag, and M. Sebag, “K-AP: generating specified K clusters by efficient affinity propagation,” in Proceedings 10th IEEE International Conference on Data Mining (ICDM ’10), pp. 1187–1192, IEEE, December 2010.

[27] J. D. Hamilton, Time Series Analysis, vol. 2, Cambridge University Press, Cambridge, UK, 1994.

[28] F. Abel, Q. Gao, G.-J. Houben, and K. Tao, “Analyzing user modeling on Twitter for personalized news recommendations,” in User Modeling, Adaption and Personalization, vol. 6787 of Lecture Notes in Computer Science, pp. 1–12, Springer, Berlin, Germany, 2011.

[29] S. Gauch, M. Speretta, A. Chandramouli, and A. Micarelli, “User profiles for personalized information access,” The Adaptive Web, Springer, Berlin, Germany, vol. 4321, pp. 54–89, 2007.

[30] G. L. Nemhauser, L. A. Wolsey, and M. L. Fisher, “An analysis of approximations for maximizing submodular set functions-I,” Mathematical Programming, vol. 14, no. 1, pp. 265–294, 1978.

[31] S. Khuller, A. Moss, and J. Naor, “The budgeted maximum coverage problem,” Information Processing Letters, vol. 70, no. 1, pp. 39–45, 1999.

[32] J. Leskovec, A. Krause, C. Guestrin, C. Faloutsos, J. Vanbriesen, and N. Glance, “Cost-effective outbreak detection in networks,” in Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 420–429, ACM, August 2007.

[33] X. Cheng, S. Tan, and L. Tang, “Using dragpushing to refine concept index for text categorization,” Journal of Computer Science and Technology, vol. 21, no. 4, pp. 592–596, 2006.

[34] Z. Guo, L. Lu, S. Xi, and F. Sun, “An effective dimension reduction approach to chinese document classification using genetic algorithm,” in Advances in Neural Networks–ISNN 2009, vol. 5552 of Lecture Notes in Computer Science, pp. 480–489, Springer, Berlin, Germany, 2009.

[35] H. Cunningham, D. Maynard, K. Bontcheva, and V. Tablan, “Gate: an architecture for development of robust hlt applications,” in Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, pp. 168–175, Association for Computational Linguistics, 2002.


⟨USER⟩

⟨UserID⟩ = 2414113125

⟨MicroBlog⟩ = MicroBlog content 1, MicroBlog content 2, MicroBlog content 3

⟨CommentsOn⟩ = nf2012010121574, nf2011040722331, nf201012784512

Figure 3: The storage structure of a user. Remark: ⟨MicroBlog⟩ lists the messages tweeted or retweeted by the user; ⟨CommentsOn⟩ denotes the news articles which are commented on by this user.

Table 1: Name of each class.

ID    Class name
1     Political
2     Law, Justice
3     External Relations, International Relations
4     Military
5     Social, Labor, Disaster
11    Economy
12    Finance, Banking
13    Infrastructure, Construction, Real estate
14    Agriculture, Rural areas
15    Mining, Industrial
16    Energy, Water Conservancy
17    Information industry
18    Transport, Postal services, Logistics
19    Commerce, Foreign trade, Customs
21    Services, Tourism
22    Environmental, Meteorological
31    Education
33    Science and Technology
35    Culture, Recreation and Leisure
36    Literature, Art
37    Media Industry
38    Medicine, Health
39    Sports

and CHI Statistics [20]. These methods are inclined to choose rare words, which are not reliable for classification on some corpora. Therefore, in order to solve this and reduce the computational burden in the process of news article classification, we must filter out some sporadic low-frequency words. The two concrete filtering steps are shown below.

(1) Rough Selection Using Document Frequency of Feature Words. In the training corpus, let $t_i$ be a word. We define $Df_i$ as the total relative document frequency, which denotes the ratio of the number of documents containing $t_i$ to the whole number of documents. When $Df_i$ is greater than a threshold $\alpha$, the word $t_i$ is a high-frequency word in the training corpus and we add it into the set $Tem_1$. For a given class $C_k$, we define $Df_{ik}$ as the class relative document frequency in class $C_k$, which denotes the ratio of the number of documents in class $C_k$ containing $t_i$ to the whole number of documents in class $C_k$. When $Df_{ik}$ is greater than a given threshold $\beta$, $t_i$ is a high-frequency word in this class and we add it into the set $Tem_2$. According to our experiment and corpus, we roughly set $\alpha = 0.01$ and $\beta = 0.1$ in order to avoid faulty or omitted selections. This rough selection process selects the words which appear frequently both in the whole corpus and in classes [21]. The result of rough feature selection is $Tem' = Tem_1 \cap Tem_2$.

(2) Precise Selection Using Index of Discrimination between Word and Class. We employ the method of [22] to define the index of discrimination between word and class as follows:

$$ R(t_i, C_k) = \frac{P(t_i \in C_k)}{\max_{C_j \neq C_k} P(t_i \in C_j)} \tag{1} $$

where $P(t_i \in C_k)$ denotes the probability of word $t_i$ in class $C_k$ and $\max_{C_j \neq C_k} P(t_i \in C_j)$ denotes the maximum probability of word $t_i$ in the classes other than $C_k$. $P(t_i \in C_k)$ can be represented as follows:

$$ P(t_i \in C_k) = \frac{tf(t_i \in C_k) + 1}{\sum_{t'} tf(t' \in C_k) + 1} \tag{2} $$

where $tf(t_i \in C_k)$ denotes the frequency of $t_i$ appearing in class $C_k$, $t'$ denotes a word from $Tem'$ different from $t_i$, and $\sum_{t'} tf(t' \in C_k)$ denotes the total frequency of the words $t'$ appearing in class $C_k$. The word $t_i$ is a representative word of class $C_k$ when the index of discrimination $R(t_i, C_k)$ is greater than a threshold $\gamma$. We can use a selection proportion threshold $T$ to decide the parameter $\gamma$, which will be discussed in our Experimental Evaluation section later. We obtain the representative word set when the process above is done for each class. The rough selection step saves calculation time because it excludes the words which are certainly not feature words.
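The two filtering steps can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: documents are assumed to be pre-tokenized word lists, $Tem_2$ is taken as the union over classes of the words that pass the threshold $\beta$, and the sum in the denominator of (2) is taken over all words of $Tem'$. The function names (`rough_selection`, `discrimination_index`, `representative_words`) are hypothetical.

```python
from collections import Counter, defaultdict


def rough_selection(docs, labels, alpha=0.01, beta=0.1):
    """Step (1): return Tem' = Tem1 & Tem2 from document frequencies."""
    n_docs = len(docs)
    df = Counter()                    # number of documents containing each word
    df_class = defaultdict(Counter)   # the same count, kept per class
    class_size = Counter(labels)
    for words, label in zip(docs, labels):
        for w in set(words):
            df[w] += 1
            df_class[label][w] += 1
    tem1 = {w for w, c in df.items() if c / n_docs > alpha}
    # Assumption: a word enters Tem2 if it is frequent in at least one class.
    tem2 = {w for k, counts in df_class.items()
            for w, c in counts.items() if c / class_size[k] > beta}
    return tem1 & tem2


def discrimination_index(docs, labels, vocab):
    """Step (2): R(t, C_k) of Eq. (1) with the smoothed P of Eq. (2).

    Requires at least two classes.
    """
    tf = defaultdict(Counter)
    for words, label in zip(docs, labels):
        tf[label].update(w for w in words if w in vocab)
    classes = sorted(set(labels))
    prob = {k: {w: (tf[k][w] + 1) / (sum(tf[k].values()) + 1) for w in vocab}
            for k in classes}
    R = {}
    for k in classes:
        for w in vocab:
            best_other = max(prob[j][w] for j in classes if j != k)
            R[(w, k)] = prob[k][w] / best_other
    return R


def representative_words(R, gamma):
    """Keep the words whose index of discrimination exceeds gamma."""
    reps = defaultdict(set)
    for (w, k), r in R.items():
        if r > gamma:
            reps[k].add(w)
    return dict(reps)
```

On a real corpus the thresholds would be the paper's $\alpha = 0.01$ and $\beta = 0.1$; the toy values used to exercise the sketch are larger only because the toy corpus is tiny.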

4.2. Classifying News Items. In the real Internet world, classification or clustering on massive news data requires a lot of computational power. To tackle this issue in news recommendation, we employ the One Versus All method [23] (One Versus All is a two-class classification method) and consider the key persons in news articles. In this paper, news article classification is considered as a set of two-class classification problems. For a class $C_k$, if document $d_i$ belongs to class $C_k$, it is tagged with 1 as a positive sample for class $C_k$, and otherwise it is tagged with $-1$ as a negative sample. This method constructs the projective vector $p_k$ between the text matrix $A$ and the class vector $y$, and we employ the ridge regression method [24], shown in the following:

$$ C = \arg\min_{p_k} \left\| y - p_k^T A \right\|^2 + \theta \left\| p_k \right\|^2 \tag{3} $$

where $\theta$ is a positive parameter used to adjust the estimation error. To solve the minimization problem above, we take the partial derivative with respect to $p_k$ and set it to 0, which gives the equation shown below:

$$ p_k = \left( A A^T + \theta I \right)^{-1} A y^T \tag{4} $$


where $I$ is the identity matrix of matching dimension. Because the training set is divided into $K$ categories, we can obtain a group of projective vectors $P = \{p_1, p_2, \ldots, p_K\}$. We utilize a code matrix $M$ to describe the correlation between different classes obtained from the two-class classification. Assuming that class $C_k$ has $N_k$ training documents $D_{kj}$, where $j \in [1, N_k]$, the element of $M$ which denotes the correlation between two classes can be calculated by

$$ M_{kk'} = \frac{1}{N_k} \sum_{j=1}^{N_k} \operatorname{sgn}\left( \langle p_{k'}, D_{kj} \rangle \right) \tag{5} $$

where $p_{k'}$ denotes the projective vector of $C_{k'}$. If $\langle p_{k'}, D_{kj} \rangle$ is greater than 0, the return value of the function sgn is 1, and otherwise it is 0. When a new article arrives, the similarity between the article and a class can be calculated by the following equation:

$$ \mathrm{Sim}(B, C_k) = \sum_{k'=1}^{K} M_{kk'} Q_{k'} = \sum_{k'=1}^{K} M_{kk'} \operatorname{sgn}\left( \langle p_{k'}, B \rangle \right) \tag{6} $$

where $B$ denotes a new article. Finally, we obtain the class of $B$ through the maximum of the function $\mathrm{Sim}(\cdot, \cdot)$:

$$ C(B) = \arg\max_{C_k} \mathrm{Sim}(B, C_k) \tag{7} $$
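The classification pipeline of this subsection (ridge projections, code matrix, and similarity-based decision) can be sketched with NumPy as below. This is an illustrative sketch under assumptions rather than the NEMAH code: documents are the columns of a dense feature matrix `A`, the ridge solution uses the standard closed form $(AA^T + \theta I)^{-1} A y^T$, and the binary response of each classifier on a new article is computed from the article's own feature vector. All function names are hypothetical.

```python
import numpy as np


def train_projections(A, labels, classes, theta=1.0):
    """Closed-form ridge solution for every one-vs-all problem.

    A is a d x n matrix whose columns are documents. Returns a d x K
    matrix whose k-th column is the projective vector p_k.
    """
    d = A.shape[0]
    gram = A @ A.T + theta * np.eye(d)
    # Row k of Y is the +1 / -1 class vector y for class k.
    Y = np.array([[1.0 if l == c else -1.0 for l in labels] for c in classes])
    return np.linalg.solve(gram, A @ Y.T)


def code_matrix(A, labels, classes, P):
    """M[k, k'] = share of class-k training documents scored positive by p_k'."""
    labels = np.asarray(labels)
    M = np.zeros((len(classes), len(classes)))
    for k, c in enumerate(classes):
        D_k = A[:, labels == c]
        M[k] = ((P.T @ D_k) > 0).mean(axis=1)
    return M


def classify(b, P, M, classes):
    """Pick the class with the largest similarity to article vector b."""
    q = (P.T @ b > 0).astype(float)   # 0/1 response of each binary classifier
    sim = M @ q
    return classes[int(np.argmax(sim))], sim
```

For example, with two well-separated toy classes the code matrix becomes the identity and a new article is assigned to the class whose classifier fires. On sparse, high-dimensional text features a sparse solver would be preferable to the dense `np.linalg.solve` call used here.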

In order to further improve the classification accuracy and utilize manual labor rationally, we propose a method that considers key persons (named entities of type person) to improve classification when key persons appear, as shown in the following:

$$ P(C_i \mid B) = (1 - \alpha) \frac{\mathrm{Sim}(B, C_i)}{\sum_{j=1}^{K} \mathrm{Sim}(B, C_j)} + \alpha P(C_i \mid B_k) P(B_k \mid B) \tag{8} $$

where $\mathrm{Sim}(B, C_i) / \sum_{j=1}^{K} \mathrm{Sim}(B, C_j)$ denotes the probability score of article $B$ on class $C_i$ obtained by the method mentioned above, $B_k$ denotes a key person appearing in the article $B$, and $P(B_k \mid B) = 1$ when $B_k$ appears in $B$; otherwise $P(B_k \mid B) = 0$. In other words, if no key person appears in a new article, we cannot apply the key person factor to it. $\alpha$ is the balance parameter between these two methods. $P(C_i \mid B_k)$ is computed as

$$ P(C_i \mid B_k) = \frac{P(C_i) \, p(B_k \mid C_i)}{\sum_{j=1}^{M} P(C_j) \, p(B_k \mid C_j)} \tag{9} $$
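A small Python sketch of the fusion in (8) and (9) follows. It is only an illustration: the class priors and the per-class likelihoods of a key person are assumed to be given as plain lists, at least one similarity score is assumed to be nonzero, and the function names are hypothetical.

```python
def key_person_posterior(prior, likelihood):
    """Eq. (9): P(C_i | B_k) from P(C_i) and p(B_k | C_i) by Bayes' rule."""
    joint = [p * l for p, l in zip(prior, likelihood)]
    total = sum(joint)
    return [j / total for j in joint]


def fused_scores(sim, alpha, posterior=None):
    """Eq. (8): mix normalized similarity scores with the key person term.

    posterior is None when no key person appears in the article, in which
    case P(B_k | B) = 0 and only the similarity term remains.
    """
    total = sum(sim)
    base = [s / total for s in sim]
    if posterior is None:
        return [(1 - alpha) * b for b in base]
    return [(1 - alpha) * b + alpha * p for b, p in zip(base, posterior)]
```

With equal similarity to two classes, a key person who appears nine times more often in the first class tips the decision toward that class, which is the intended effect of the key person factor.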

4.3. News Subclass Clustering. After obtaining the rough classification results, we need to separate every news class into subclasses $SC_{ix}$. A natural way to detect subclasses of an Internet text corpus is clustering, for instance, $K$-means or hierarchical clustering. In NEMAH, we propose a subclass clustering method to obtain subclasses. Each subclass is represented as a subclass vector $T = \{\langle t_1, w_1 \rangle, \langle t_2, w_2 \rangle, \ldots\}$, where $t_i$ and $w_i$ denote the representative word and its corresponding weight, respectively. We call this clustering method the Maximizing Neighborhood method because of the main idea of the algorithm.

(1) Solving Subspace Projection Problem by Maximizing the Average of Neighborhood. For each document $x_i$ in a text space $X_0$, the neighbor documents can be divided into two subsets according to the distance to $x_i$: the similar neighborhood set $\Theta_i^o$ and the heterogeneous neighborhood set $\Theta_i^e$, where $\Theta_i^o$ contains the top $\xi$ nearest neighbors which belong to the same class as $x_i$ and $\Theta_i^e$ contains the top $\zeta$ nearest neighbors which do not belong to the same class as $x_i$. For each data point, the average distance out of class and within class can be expressed as follows:

$$ P_i = \sum_{x_p \in \Theta_i^e} \frac{\left\| x_i - x_p \right\|^2}{\left| \Theta_i^e \right|}, \qquad Q_i = \sum_{x_q \in \Theta_i^o} \frac{\left\| x_i - x_q \right\|^2}{\left| \Theta_i^o \right|} \tag{10} $$

Over all data points in the text corpus, the average out-of-class distance and the average within-class distance are expressed as follows:

$$ P = \sum_i P_i = \sum_i \sum_{x_p \in \Theta_i^e} \frac{\left\| x_i - x_p \right\|^2}{\left| \Theta_i^e \right|}, \qquad Q = \sum_i Q_i = \sum_i \sum_{x_q \in \Theta_i^o} \frac{\left\| x_i - x_q \right\|^2}{\left| \Theta_i^o \right|} \tag{11} $$

The subclass clustering problem can be considered as a projection of the text space to a subspace. For instance, let $y_i$ be the image of $x_i$ after projecting; we can express $y_i = W^T x_i$. The principle of this projection is maximizing the average distance between different classes and minimizing the average distance within each class [25], as shown in the following:

$$ r = \sum_i \left( \sum_{x_p \in \Theta_i^e} \frac{\left\| x_i - x_p \right\|^2}{\left| \Theta_i^e \right|} - \sum_{x_q \in \Theta_i^o} \frac{\left\| x_i - x_q \right\|^2}{\left| \Theta_i^o \right|} \right) = tr\left[ W^T \left( \sum_i \sum_{x_p \in \Theta_i^e} \frac{\left\| x_i - x_p \right\|^2}{\left| \Theta_i^e \right|} - \sum_i \sum_{x_q \in \Theta_i^o} \frac{\left\| x_i - x_q \right\|^2}{\left| \Theta_i^o \right|} \right) W \right] = tr\left[ W^T (P - Q) W \right] \tag{12} $$

where $tr(\cdot)$ denotes the trace of a matrix and the constraint of this equation is $W^T W = I$. We then maximize the objective shown as follows:

$$ \max \; tr\left[ W^T (P - Q) W \right] \tag{13} $$
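One way to solve a trace maximization of this kind under the constraint $W^T W = I$ is an eigendecomposition, sketched below with NumPy. The sketch makes an explicit assumption that goes beyond the printed formulas: following the average neighborhood margin formulation of [25], $P$ and $Q$ are treated as $d \times d$ scatter matrices built from outer products of the difference vectors, so that the top eigenvectors of $P - Q$ give the columns of $W$. The function name and its parameters are hypothetical.

```python
import numpy as np


def neighborhood_margin_projection(X, labels, n_same, n_diff, dim):
    """Return W (d x dim) maximizing tr[W^T (P - Q) W] with W^T W = I.

    X is an n x d array (rows are documents). P accumulates scatter to the
    n_diff nearest out-of-class neighbors and Q to the n_same nearest
    in-class neighbors; both are matrices here (an assumption, see lead-in).
    """
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    n, d = X.shape
    dist = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    P = np.zeros((d, d))
    Q = np.zeros((d, d))
    for i in range(n):
        order = np.argsort(dist[i])
        order = order[order != i]
        same = [j for j in order if labels[j] == labels[i]][:n_same]
        diff = [j for j in order if labels[j] != labels[i]][:n_diff]
        for idx, target in ((diff, P), (same, Q)):
            if idx:
                delta = X[idx] - X[i]
                target += delta.T @ delta / len(idx)   # in-place update
    # P - Q is symmetric, so eigh applies; keep the largest eigenvalues.
    eigvals, eigvecs = np.linalg.eigh(P - Q)
    top = np.argsort(eigvals)[::-1][:dim]
    return eigvecs[:, top]
```

Projected documents are then obtained as `X @ W`, which corresponds to $y_i = W^T x_i$ applied row by row.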


(2) The Quick Affinity Propagation Clustering on Subspace. After projecting the initial text vector space into the subspace through the projective matrix, $K$ clusters can be generated by employing $K$-Affinity Propagation ($K$-AP) in the subspace (this method is more suitable for text clustering because it can achieve more reasonable clusters than the traditional clustering methods [26]). Let the similarity of $y_i$ and $y_j$ in the subspace $Y = \{y_1, y_2, \ldots, y_n\}$ be $S = \{s_{ij}\}$. The target of $K$-AP is to find the $K$ real samples (exemplars) $E = \{e_1, e_2, \ldots, e_K\}$ which represent the $K$ classes $C = \{C_1, C_2, \ldots, C_K\}$, and then to maximize the following objective function:

$$ \max F\left( \{C_j\}_{j=1}^{K} \right) = \sum_{j=1}^{K} \sum_{y_i \in C_j} s(y_i, e_j) \tag{14} $$

where $e_j$ belongs to $C_j$. The objective function can be transformed into a 0-1 integer programming problem by introducing the binary parameters $B = \{b_{ij} \in \{0, 1\},\ i, j = 1, \ldots, n\}$, as shown in the following:

$$ \max F\left( \{b_{ij}\} \right) = \sum_{i=1}^{n} \sum_{j=1}^{n} b_{ij} \, s(y_i, y_j) \tag{15} $$

Equation (15) has three constraints: (1) $b_{ii} = 1$ if $b_{ji} = 1$; (2) $\sum_{j=1}^{n} b_{ij} = 1$; and (3) $\sum_{i=1}^{n} b_{ii} = K$, where $b_{ij} = 1$ when $y_i$ considers $y_j$ as a sample and $b_{ii} = 1$ when $y_i$ is a sample itself. The first constraint means that $y_i$ is a sample when $y_j$ considers $y_i$ as a sample. The second means that each data point has only one sample point. The last means that the number of samples is $K$, which ensures that the $K$-AP method generates $K$ clusters.
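To make the integer program (15) and its three constraints concrete, the following Python sketch checks feasibility of a binary assignment matrix and solves tiny instances by exhaustive search. This is deliberately not the message-passing procedure of $K$-AP [26]; brute force is used here only as an assumption-free way to illustrate what the objective rewards, and it is practical only for a handful of points. The function names are hypothetical.

```python
from itertools import combinations

import numpy as np


def is_feasible(b, K):
    """Check the three constraints of the 0-1 program for matrix b."""
    b = np.asarray(b)
    one_exemplar_each = np.all(b.sum(axis=1) == 1)      # constraint (2)
    k_exemplars = np.trace(b) == K                      # constraint (3)
    chosen = b.any(axis=0)                              # columns picked by someone
    consistent = np.all(np.diag(b)[chosen] == 1)        # constraint (1)
    return bool(one_exemplar_each and k_exemplars and consistent)


def objective(b, S):
    """Eq. (15): total similarity between points and their exemplars."""
    return float((np.asarray(b) * np.asarray(S)).sum())


def brute_force_exemplars(S, K):
    """Exhaustively find the best K exemplars for a small similarity matrix."""
    S = np.asarray(S, dtype=float)
    n = S.shape[0]
    best = None
    for exemplars in combinations(range(n), K):
        ex = list(exemplars)
        # Each point picks its most similar exemplar ...
        assign = np.array(ex)[np.argmax(S[:, ex], axis=1)]
        assign[ex] = ex            # ... and every exemplar points to itself.
        b = np.zeros((n, n), dtype=int)
        b[np.arange(n), assign] = 1
        score = objective(b, S)
        if best is None or score > best[0]:
            best = (score, exemplars, b)
    return best
```

With negative squared distances as similarities, the search picks the central point of each group as its exemplar, which matches the intuition behind the objective.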

(3) Hybrid Learning of Subspace Projection and Clustering on Adaptive Subspace. The class information updated in the subspace clustering process can be utilized as a priori knowledge in the next round of subspace projection, and after several iterations, until convergence, we can obtain the global optimal clustering result. The iteration process is as follows:

$X_0 \to K\text{-AP} \to L_0 \to \text{SubSpace} \to W_1,\ \text{Score}_1$

$Y_1 = W_1^T X_0 \to K\text{-AP} \to L_1 \to \text{SubSpace} \to W_2,\ \text{Score}_2$

$\cdots \to \cdots \to \cdots \to \cdots \to \cdots$

$Y_t = W_t^T X_0 \to K\text{-AP} \to L_t \to \text{SubSpace} \to W_{t+1},\ \text{Score}_{t+1}$

The convergence function value must be computed in each iteration:

$$ \text{Score}_{t+1} = tr\left[ W_{t+1}^T \left( P(L_t) - Q(L_t) \right) W_{t+1} \right] \tag{16} $$

where $P(L_t)$ and $Q(L_t)$ denote the average distances between classes and within class, which are calculated by (11) according to the class membership indicator matrix $L_t$. The iteration finishes when it meets the convergence condition $\text{Score}_{t+1} - \text{Score}_t \le \epsilon$ or reaches the maximum number of iterations. The parameters of our clustering method are the number of nearest in-class points $\eta$ and the number of nearest out-of-class points $\zeta$. We used cross-fold validation to train these parameters, and we found that selecting $\zeta = \eta = 13$ for all classes per 1,000 documents performs better.

Discussion. The motivation of this module (classification and clustering) is to find the user's preference (at the subclass level) and to track the hotness of a newly published news article in a given subclass.

5. Subclass Popularity Prediction Module

With today's explosion of information, the fast pace of life makes people focus their attention on popular subclasses rather than spend much time searching and selecting information. Sometimes even users themselves have no idea what they really want. Therefore, hot subclass prediction technology with a recommendation function has become very important. News subclass popularity prediction can improve the performance of a news recommender system. Besides, it can also improve the display function of popular news modules on a website automatically, reduce the workload of website editors, and improve the users' browsing experience.

From the study of historical statistical data on news subclasses, we found that some subclasses are popular periodically. For instance, the subclass college entrance examination becomes highly popular around June every year in China, and a lot of news articles and comments focus on this subclass at that time, as shown in Figures 4(a) and 4(b), which show the data of the entrance examination subclasses. In this paper, we define the news subclass' degree of concern according to the number of news articles and their comments, as shown in the following:

$$ H_k = \lambda H_{ne}^{(k)} + (1 - \lambda) H_{re}^{(k)} = \frac{N_{re}^{(k)}}{N_{ne} + N_{re}} \cdot \frac{N_{ne}^{(k)}}{N_{ne}} + \frac{N_{ne}^{(k)}}{N_{ne} + N_{re}} \cdot \frac{N_{re}^{(k)}}{N_{re}} \tag{17} $$

where $H_{ne}^{(k)}$ denotes the popular degree of news articles on the $k$th subclass, $H_{re}^{(k)}$ denotes the popular degree of comments on the $k$th subclass, $\lambda$ is the weight of the popular degree of news articles and its value is $N_{re}^{(k)} / (N_{ne} + N_{re})$, $N_{re}^{(k)}$ denotes the number of reviews on the $k$th subclass, $N_{re}$ denotes the number of reviews in the whole corpus, $N_{ne}^{(k)}$ denotes the number of news articles on the $k$th subclass, and $N_{ne}$ denotes the number of news articles in the whole corpus. According to the experiments of time series analysis on our corpus, we found that most subclasses are suitable for the spectral analysis method [27].
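The degree of concern can be computed directly from the four counts. The short Python sketch below follows the weighted form $H_k = \lambda H_{ne}^{(k)} + (1 - \lambda) H_{re}^{(k)}$, with $H_{ne}^{(k)} = N_{ne}^{(k)} / N_{ne}$ and $H_{re}^{(k)} = N_{re}^{(k)} / N_{re}$; the weight is left as a parameter and, as an assumption, defaults to the value stated in the text. The function name is hypothetical.

```python
def degree_of_concern(n_ne_k, n_re_k, n_ne, n_re, lam=None):
    """Degree of concern H_k of subclass k from article and review counts.

    n_ne_k, n_re_k: articles / reviews in subclass k.
    n_ne, n_re:     articles / reviews in the whole corpus.
    lam:            weight of the article term; defaults to the value
                    given in the text, N_re^(k) / (N_ne + N_re).
    """
    h_ne = n_ne_k / n_ne      # popular degree of news articles
    h_re = n_re_k / n_re      # popular degree of comments
    if lam is None:
        lam = n_re_k / (n_ne + n_re)
    return lam * h_ne + (1 - lam) * h_re
```

Evaluating this once per day for a subclass yields the time series that the spectral analysis of this section operates on.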

Any stationary sequence can be modeled as a combination of many cosine waves with different frequencies, amplitudes, and phases. This analysis method is called the frequency domain based analysis method. The linear combination of $m$ cosines with arbitrary amplitudes, frequencies, and phases is shown in the following:

$$ Y_t = A_0 + \sum_{j=1}^{m} \left[ A_j \cos(2 \pi f_j t) + B_j \sin(2 \pi f_j t) \right] \tag{18} $$

Figure 4: Periodic subclass news distribution. (a) Popularity degree of graduate entrance examination subclass. (b) Popularity degree of college entrance examination subclass. Remark: the $x$-axis (Days) denotes the date from August 1, 2009 to May 3, 2012; the $y$-axis (Proportion) denotes the value of $H_k$ in (17).

The values of $A$ and $B$ can be obtained by ordinary least squares regression. When the frequencies take a special form, the calculation becomes very simple. If $n$ is an odd number, which can be expressed as $n = 2k + 1$, then the frequencies of the form $1/n, 2/n, \ldots, k/n$ are called Fourier frequencies. The estimated parameters are as follows:

$$ A_0 = \bar{Y}, \qquad A_j = \frac{2}{n} \sum_{t=1}^{n} Y_t \cos\left( \frac{2 \pi t j}{n} \right), \qquad B_j = \frac{2}{n} \sum_{t=1}^{n} Y_t \sin\left( \frac{2 \pi t j}{n} \right) \tag{19} $$

If the sample size is even, which can be expressed as $n = 2k$, (19) still holds. But the estimates change to the following when $f_k = k/n = 1/2$:

$$ A_k = \frac{1}{n} \sum_{t=1}^{n} (-1)^t Y_t, \qquad B_k = 0 \tag{20} $$

Definition 1. When the sample size is odd, namely, $n = 2k + 1$, we define the cycle diagram (periodogram) $I$ at frequency $f = j/n$ ($j = 1, 2, \ldots, k$) as shown in the following equation:

$$ I\left( \frac{j}{n} \right) = \frac{n}{2} \left( A_j^2 + B_j^2 \right) \tag{21} $$

If the sample size is even, (19) still gives the $A$ and $B$ values and the periodogram is the same as in the odd case. But in the extreme frequency case, that is, when $f = k/n = 1/2$, the periodogram is shown in the following equation:

$$ I\left( \frac{1}{2} \right) = n \left( A_k \right)^2 \tag{22} $$

The periodogram at frequency $f = j/n$ is proportional to the sum of squares of the corresponding regression coefficients. Therefore, the peaks of the periodogram show the relative intensity of the sine-cosine pairs at different frequencies, as shown in Figure 5.

In Figure 5, the periodogram has two peaks, at frequencies 0.004970179 and 0.002982107; namely, the subcycle $T = 1/f$ may be 201 days and 335 days. The other peaks are so low that they can be neglected. The two frequencies are selected for building the model, which means that the model has two sine-cosine pairs in it, as shown in the following:

$$ Y_t = \beta_0 + \beta_1 \cos(2 \pi f_1 t) + \beta_2 \sin(2 \pi f_1 t) + \beta_3 \cos(2 \pi f_2 t) + \beta_4 \sin(2 \pi f_2 t) + e_t \tag{23} $$

Using the spectral analysis method for prediction takes several steps. First, we use the periodogram to obtain the number and values of the strong frequencies. Second, the model is built from these strong frequencies. Finally, we predict future data values according to the model, which only requires a time parameter.
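The three steps can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the sample size is assumed odd, and the model coefficients are taken directly from (19) at the selected Fourier frequencies rather than refitted by a separate regression.

```python
import math

def fourier_coefficients(y, j):
    """A_j and B_j of equation (19) for the Fourier frequency j/n (t runs from 1 to n)."""
    n = len(y)
    a = 2.0 / n * sum(y[t - 1] * math.cos(2 * math.pi * t * j / n) for t in range(1, n + 1))
    b = 2.0 / n * sum(y[t - 1] * math.sin(2 * math.pi * t * j / n) for t in range(1, n + 1))
    return a, b

def periodogram(y):
    """Map each j below n/2 to I(j/n) from equation (21); n is assumed odd."""
    n = len(y)
    k = (n - 1) // 2
    result = {}
    for j in range(1, k + 1):
        a, b = fourier_coefficients(y, j)
        result[j] = n / 2.0 * (a * a + b * b)
    return result

def fit_and_predict(y, num_peaks, t_future):
    """Keep the num_peaks strongest frequencies and evaluate the model at t_future."""
    n = len(y)
    spectrum = periodogram(y)
    # Step 1: the strong frequencies are the highest periodogram peaks.
    strongest = sorted(spectrum, key=spectrum.get, reverse=True)[:num_peaks]
    # Steps 2 and 3: build the sine-cosine model and evaluate it at one time point.
    value = sum(y) / n  # A_0, the sample mean
    for j in strongest:
        a, b = fourier_coefficients(y, j)
        value += a * math.cos(2 * math.pi * t_future * j / n)
        value += b * math.sin(2 * math.pi * t_future * j / n)
    return strongest, value
```

Calling `fit_and_predict(series, 2, len(series) + 1)` would mirror the two-frequency model of (23) and return the selected frequency indices together with a one-step-ahead prediction.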

Discussion. The motivation of this module is to obtain the hotness of each subclass. Some recent studies also take into account the popularity of newly published news articles. For example, SCENE [3] used the popular degree, which is computed as the ratio of the number of users accessing the article to the size of the users' pool. However, for the newest popular news article $n_i$, its number of clicks would be less than that of a news article published several hours or days before.

Figure 5: Periodogram of the popularity degree of the college entrance examination subclass. (a) The complete periodogram (frequency axis from 0 to 0.5). (b) Local amplification of the periodogram (frequency axis from 0 to 0.05). Remark: the $x$-axis denotes the possible frequency of the popularity degree; the $y$-axis represents the energy of the corresponding frequency.

6. User Profile Module

In order to capture a user's reading interest in news items, a personalized news recommendation system generally needs to construct the user's profile. Traditionally, the user profile can be captured by tracking the user's reading history. A survey of various user profile construction techniques is provided in [28, 29]. In this paper, we use the microblog to construct the user's profile. The reason is that a user who is interested in some subclasses will tend to tweet or retweet microblogs on these subclasses. For instance, if a user tweets or retweets many messages about basketball games, we can deduce that this user may like reading basketball news reports (i.e., NBA, CBA, etc.). Besides, many readers tend to glance at news articles and are interested in some key persons' names. Moreover, people from different areas tend to read the news from the city they live in or their hometown. Based on the above analysis, we propose to construct users' profiles by exploring the factors discussed above: microblog content, place names, and preferred key persons. In order to reduce the computational complexity, the preference taken into account in our model is represented by a vector $U_{pf} = \{\tau, \rho, \kappa\}$. Consider the following.

(1) $\tau$ represents the distribution of key index words in the microblogs that the user tweeted or retweeted in the past. It can be expressed as a vector $\langle t_1, w_1\rangle, \langle t_2, w_2\rangle, \ldots$, where each element consists of an index word and its corresponding weight.

(2) $\rho$ represents the place names that appeared in the microblogs of a specific user. It can be expressed as $\langle p_1, w_1\rangle, \langle p_2, w_2\rangle, \ldots, \langle p_i, w_i\rangle, \ldots$, where $p_i$ denotes a place name and $w_i$ denotes the number of times this place appears in the tweets of the given user. We collect the names of all the cities and provinces in China. Some place names are subordinate to others; for instance, GuangZhou city is subordinate to GuangDong province. In this case, the system adds weight to GuangDong using $w_{\text{GuangDong}} \mathrel{+}= w_{\text{GuangZhou}}$ when GuangZhou appears.

(3) $\kappa$ represents the list of key persons' names extracted from the user's microblogs, $\langle k_1, w_1\rangle, \langle k_2, w_2\rangle, \ldots$, where the name list is constructed from the NanFang Daily training corpus, in which the key persons' names have been tagged in each news article.
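The three components of $U_{pf}$ can be illustrated with a small sketch. It assumes messages are already tokenized, uses raw counts as weights, and reads the GuangDong rule as "each appearance of a city also adds one to its province". The lookup tables and the function name `build_user_profile` are hypothetical placeholders, not data or code from the paper.

```python
from collections import Counter

# Hypothetical lookup tables. The paper uses the full list of Chinese cities and
# provinces and a key-person list tagged in the NanFang Daily corpus.
CITY_TO_PROVINCE = {"GuangZhou": "GuangDong", "ShenZhen": "GuangDong"}
PLACES = set(CITY_TO_PROVINCE) | set(CITY_TO_PROVINCE.values())
KEY_PERSONS = {"PersonA"}

def build_user_profile(tokenized_messages):
    """Return (tau, rho, kappa) as term -> weight counters for one user.

    tokenized_messages: list of token lists, one per tweet or retweet.
    """
    tau, rho, kappa = Counter(), Counter(), Counter()
    for tokens in tokenized_messages:
        for token in tokens:
            if token in PLACES:
                rho[token] += 1
                # A city also adds weight to the province it is subordinate to.
                province = CITY_TO_PROVINCE.get(token)
                if province is not None:
                    rho[province] += 1
            elif token in KEY_PERSONS:
                kappa[token] += 1
            else:
                tau[token] += 1  # raw count; a TF-IDF weight could be used instead
    return tau, rho, kappa
```

Keeping the three components in separate counters means each one can later be compared with the matching component of a news profile on its own.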

7. Personalized News Recommendation Module

The recommendation module can be divided into two steps: Rough Selection (see Section 7.1) and Precise Selection (see Section 7.3). In the first step, some subclasses are matched according to the user's preference. Then we select the news articles from these subclasses by our selection strategy.

7.1. Rough Selection: Subclass Selection for a User. Once we obtain the subclasses and the user's profile, we can calculate the similarity between each subclass and a given user. We use TF-IDF weights to represent the vector of a given subtopic, $\tau_s = \langle t_1, w_1\rangle, \langle t_2, w_2\rangle, \ldots$. The similarity between a subclass and a user (represented as $\tau_u = \langle t_1, w_1\rangle, \langle t_2, w_2\rangle, \ldots$ in $U_{pf}$; see Section 6) is computed by cosine similarity. In general, users tend to prefer some particular subclasses; that is, they are not interested in all subclasses. Therefore, we can roughly select some subclasses with a similarity threshold. This threshold is set to the 30% point of the ranking of all similarity scores with respect to a given user.
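A minimal sketch of the rough selection step follows. It assumes the sparse vectors are stored as term-to-weight dictionaries and reads the threshold as "keep the top 30% of the ranked subclasses"; both the reading of the threshold and the names `cosine` and `rough_select` are assumptions made for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as term -> weight dicts."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def rough_select(user_vector, subclass_vectors, keep_ratio=0.3):
    """Return the names of the best-matching subclasses, most similar first.

    keep_ratio=0.3 interprets the threshold as the top 30% of the ranking.
    """
    ranked = sorted(subclass_vectors,
                    key=lambda name: cosine(user_vector, subclass_vectors[name]),
                    reverse=True)
    keep = max(1, math.ceil(keep_ratio * len(ranked)))
    return ranked[:keep]
```

With four subclasses and `keep_ratio=0.3`, the two most similar subclasses are kept, since the cut-off is rounded up.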

7.2. News Profile Construction. After obtaining the news clusters that the user might be interested in, the next step is to select specific news articles for the given user. Similar to the user profile, we initially maintain a news profile for each news article, then model the recommendation as a budgeted maximum coverage problem and solve it by a greedy selection algorithm.

A news profile contains many similar factors, for example, key persons, places, the cluster it belongs to, recency, popularity, and so forth. For the popularity, as discussed above, we use $H_k$ to represent the popularity degree of cluster $k$. For the recency, the score is computed as follows:

$$\mathrm{Rec}(i) = \frac{i_c - i_p}{24 \times 60}, \tag{24}$$

where the function $\mathrm{Rec}(i)$ returns the recency score of news article $i$, and $i_c$ and $i_p$ denote the current time and the published time, respectively.

In this paper, news profiles help to evaluate how well a news article can satisfy the user. Given a news profile $N_{pf} = \{\rho, \kappa, \upsilon\}$ and a user's profile $U_{pf} = \{\rho, \kappa, \upsilon\}$, the similarity between $N_{pf}$ and $U_{pf}$ is computed as

$$\mathrm{sim}(N_{pf}, U_{pf}) = \gamma_1\, \mathrm{sim}(\rho_n, \rho_u) + \gamma_2\, \mathrm{sim}(\kappa_n, \kappa_u) + \gamma_3\, \mathrm{sim}(\upsilon_n, \upsilon_u), \tag{25}$$

where $\gamma_1$, $\gamma_2$, and $\gamma_3$ are parameters that control how much we trust or weigh the corresponding components; they are set to 1 in our system. Each component is calculated by the cosine similarity.
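Equations (24) and (25) translate almost directly into code. The sketch below assumes that times are measured in minutes, so that dividing by 24 × 60 gives a recency in days, and that each profile is a tuple of three term-to-weight dictionaries in the order $(\rho, \kappa, \upsilon)$; these representations and the function names are assumptions, not details given in the paper.

```python
import math
from datetime import datetime

def recency(current_time, published_time):
    """Equation (24), assuming the time difference is expressed in minutes."""
    minutes = (current_time - published_time).total_seconds() / 60.0
    return minutes / (24 * 60)

def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as term -> weight dicts."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def profile_similarity(news_profile, user_profile, gammas=(1.0, 1.0, 1.0)):
    """Equation (25): weighted sum of the per-component cosine similarities."""
    return sum(g * cosine(n, u)
               for g, n, u in zip(gammas, news_profile, user_profile))
```

With all three weights set to 1, as in the paper, the score ranges from 0 (no component overlaps) to 3 (all components point in the same direction).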

Let $E$ be a finite set and $f$ a real-valued nondecreasing function defined on the subsets of $E$ that satisfies

$$f(T \cup \{\varsigma\}) - f(T) \le f(S \cup \{\varsigma\}) - f(S), \tag{26}$$

where $S \subseteq T$, $S$ and $T$ are two subsets of $E$, and $\varsigma \in E \setminus T$. Such a function $f$ is called a submodular function [30]. When an element is added to a larger set $T$, the increment in the value of $f$ cannot be larger than when the same element is added to a smaller set $S$. The budgeted maximum coverage problem can be described as follows: given a set of elements $E$, in which each element is associated with an influence and a cost defined over a domain of these elements, and a budget $B$, the goal is to find a subset of $E$ that has the largest influence while the total cost does not exceed the budget $B$. This problem is NP-hard [31]. However, [31] proposed a greedy algorithm which sequentially picks the element that gives the largest possible increase in influence within the cost limit. Submodularity resides in each picking step. By the result of [32], submodular functions are closed under nonnegative linear combinations.

7.3. Precise Selection: News Selection for Recommendation. In a given news subclass, we observe that most of the news concentrates on a similar topic, with minor differences in the major aspects of that topic. Typically, a reader is interested in some aspects of the given subclass, but not all of them. Based on this intuition, our news selection strategy can be described as follows.

Assume that $\mathcal{C}$ denotes the newly published news set, $\mathcal{S}$ represents the selected news set, and $\varsigma$ denotes the news article being selected. After selecting a piece of news $\varsigma$, we must ensure that

(i) the topic diversity should not deviate much in $\mathcal{S}$;
(ii) $\mathcal{S}$ should give more satisfaction to the given user;
(iii) $\mathcal{S}$ should be similar to the general topic in $\mathcal{C} \setminus \mathcal{S}$.

For each of the above strategies, similar to [3], we define a quality function $q(\mathcal{S})$ to evaluate the value of the currently selected news set $\mathcal{S}$ as follows:

$$q(\mathcal{S}) = \frac{1}{\binom{|\mathcal{S}|}{2}} \sum_{n_1, n_2 \in \mathcal{S},\, n_1 \neq n_2} -\mathrm{sim}(n_1, n_2) + \frac{1}{|\mathcal{S}|} \sum_{n_1 \in \mathcal{S}} \mathrm{sim}(u, n_1) + \frac{1}{|\mathcal{C} \setminus \mathcal{S}| \cdot |\mathcal{S}|} \sum_{n_1 \in \mathcal{C} \setminus \mathcal{S}} \sum_{n_2 \in \mathcal{S}} \mathrm{sim}(n_1, n_2), \tag{27}$$

where $n_1$ and $n_2$ denote news items, $u$ denotes the given user, and the function $\mathrm{sim}(\cdot, \cdot)$ returns the similarity of its two arguments. Equation (27) contains three components corresponding to the news selection strategies listed above; $q(\mathcal{S})$ balances the contributions of the different components. Suppose $\varsigma$ is the candidate news document; the quality increase can be represented as

$$I(\varsigma) = q(\mathcal{S} \cup \{\varsigma\}) - q(\mathcal{S}). \tag{28}$$
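The quality function (27) and the marginal gain (28) can be combined into a greedy loop, sketched below. The sketch makes several assumptions that the text does not fix: every article has unit cost (so the budget is simply the list length), the first sum in (27) runs over unordered pairs, terms whose normaliser would be zero are skipped, and the similarity functions are supplied by the caller. The names `quality` and `greedy_select` are illustrative.

```python
from itertools import combinations

def quality(selected, candidates, user_sim, news_sim):
    """q(S) from equation (27); selected and candidates are lists of news ids.

    user_sim(n) plays the role of sim(u, n); news_sim(a, b) that of sim(a, b).
    Terms whose normaliser would be zero are skipped (an assumption).
    """
    rest = [n for n in candidates if n not in selected]
    score = 0.0
    pairs = list(combinations(selected, 2))
    if pairs:  # diversity: penalise similar pairs inside S
        score -= sum(news_sim(a, b) for a, b in pairs) / len(pairs)
    if selected:  # satisfaction: S should match the user
        score += sum(user_sim(n) for n in selected) / len(selected)
    if selected and rest:  # coverage: S should resemble the unselected news
        total = sum(news_sim(a, b) for a in rest for b in selected)
        score += total / (len(rest) * len(selected))
    return score

def greedy_select(candidates, budget, user_sim, news_sim):
    """Repeatedly add the article with the largest gain I(n) = q(S + n) - q(S)."""
    selected = []
    while len(selected) < budget and len(selected) < len(candidates):
        base = quality(selected, candidates, user_sim, news_sim)
        remaining = [n for n in candidates if n not in selected]
        best = max(remaining,
                   key=lambda n: quality(selected + [n], candidates,
                                         user_sim, news_sim) - base)
        selected.append(best)
    return selected
```

In a toy case with two near-duplicate articles that both match the user and a third, different article, the loop picks one of the duplicates first and then the different article, because adding the second duplicate is penalised by the diversity term.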

The goal is to select a list of recommended news documents that provides the largest possible value within the budget (i.e., the budget can be regarded as the maximum number of articles in the recommended list). We can obtain a list of news documents for each subclass by adopting the greedy selection algorithm. Taking into account the other characteristics of news documents, for example, the popularity and the recency, the ranking of the selected news articles needs to be adjusted in order to make the recommended results more reasonable. Formally, given a news article $n$, the popularity and the recency can be combined as

$$n_\phi = \frac{H_{k_n} - H_{\min}}{H_{\max} - H_{\min}} - \frac{\mathrm{Rec}(n) - \mathrm{Rec}_{\min}}{\mathrm{Rec}_{\max} - \mathrm{Rec}_{\min}}, \tag{29}$$

where $H_{k_n}$ denotes the popularity degree of the subclass to which the news $n$ belongs, and $\mathrm{Rec}(n)$ can be obtained from (24). From the equation above, we note that the smaller the recency is, the higher the article ranks; likewise, the greater the popularity is, the higher the article ranks. After computing the $n_\phi$ values of the recommended articles, we run a quicksort algorithm on these articles according to $n_\phi$. With this adjustment, the generated ranking puts more emphasis on popularity and freshness while still concentrating on news documents that satisfy the user's preference.
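The rank adjustment of (29) can be sketched as follows. The minimum and maximum values are assumed to be taken over the list being ranked, a constant column is mapped to 0 to avoid division by zero, and Python's built-in `sorted` is used in place of a hand-written quicksort; all three are choices made for this illustration only.

```python
def adjust_ranking(articles):
    """Sort articles by n_phi from equation (29), highest first.

    articles: list of (article_id, popularity, recency) tuples, where popularity
    is H_{k_n} of the article's subclass and recency is Rec(n) from (24).
    """
    pops = [p for _, p, _ in articles]
    recs = [r for _, _, r in articles]

    def scale(value, low, high):
        # Min-max normalisation; a constant column maps to 0 (an assumption).
        return 0.0 if high == low else (value - low) / (high - low)

    def n_phi(item):
        _, pop, rec = item
        return scale(pop, min(pops), max(pops)) - scale(rec, min(recs), max(recs))

    return [item[0] for item in sorted(articles, key=n_phi, reverse=True)]
```

A fresh article from a popular subclass therefore moves to the top, while an old article from the same subclass drops below it.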

8. Experimental Evaluation

In this section, we provide a comprehensive experimental evaluation to show the efficacy of our proposed news recommendation system. We start with an introduction to a real-world collection obtained from SINA, a news and microblog service website. After that, we describe the experimental design and show the results based on the recommendation framework introduced in this paper.

Table 2: Recommendation Micro-F1 (Top 30) of different time periods for different classification-based systems.

| Range (YM) | # | NB | Cheng | Z. Guo | NEMAH |
|---|---|---|---|---|---|
| 0908–0908 | 4,239 | 0.204 | 0.242 | 0.270 | 0.351 |
| 0910–0912 | 37,910 | 0.206 | 0.254 | 0.268 | 0.364 |
| 1001–1006 | 75,047 | 0.227 | 0.289 | 0.297 | 0.403 |
| 1007–1107 | 151,995 | 0.198 | 0.271 | 0.274 | 0.371 |
| 0908–1208 | 280,737 | 0.210 | 0.273 | 0.284 | 0.383 |

Remark: # denotes the number of news articles. Time range 0908 denotes August 2009.

8.1. Real-World Data Set. For the experiments, we gather the news data from SINA (http://news.sina.com.cn), where the data collection ranges from August 1, 2009, to August 31, 2012. We also gather the users who comment on these articles, together with their microblogs, from SINA (http://weibo.com). To verify our recommendation performance, we preprocess the data by removing microblog messages that are too short (i.e., fewer than 3 words) and nonactive users (i.e., users who tweeted or retweeted fewer than 10 messages). After preprocessing, 5,127 users are stored with 124,301 messages and 280,737 news articles.
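The two preprocessing filters can be written as a short function. The sketch assumes that messages are already tokenized into words and that the activity threshold is applied after short messages have been removed; the text does not say which order was used, and the function name `preprocess` is hypothetical.

```python
def preprocess(messages_by_user, min_words=3, min_messages=10):
    """Drop short messages, then drop users left with too few messages.

    messages_by_user: dict mapping a user id to a list of tokenized messages.
    """
    kept = {}
    for user, messages in messages_by_user.items():
        # Remove messages with fewer than min_words words.
        long_enough = [m for m in messages if len(m) >= min_words]
        # Keep only active users (at least min_messages remaining messages).
        if len(long_enough) >= min_messages:
            kept[user] = long_enough
    return kept
```

Applying the activity threshold after the length filter is the stricter of the two possible orders, since a user with many very short messages is then treated as nonactive.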

8.2. Experiments. Our system has four major components: (1) a module responsible for classifying and clustering news articles; (2) a component for constructing and updating user profiles; (3) hot news subclass prediction based on time-series analysis; and (4) a recommendation component using news clusters and user profiles, accompanied by the subclass popularity factor and recency. From the experimental perspective, we first verify our components. Then we compare our system with state-of-the-art approaches and design a user study.

8.2.1. Classification and Clustering Evaluation. In order to evaluate the performance of the classification and clustering component, we design two experiments.

(1) Classification Comparison. Many classification methods have been proposed in the field of text processing over the past decade. We implement the following three classification methods: the method of Cheng et al. [33], the method of Guo et al. [34], and the Naïve Bayesian (NB) method. Cheng et al. proposed a text classification method based on refining the concept index, and Guo et al. employed a genetic algorithm for classification. Before using the classification module, we must set $\alpha$ in (8) and decide the threshold of feature selection through an offline experiment, as shown in Figure 6, where T-10 denotes that threshold = 10 in feature selection and the F-score is Micro-F1. The performance is roughly the best when $\alpha = 0.2$. From the result, we also observe that the thresholds 20, 30, and 40 produce similar results, so we use $T = 20$ in our processing.

Table 2 lists the recommendation evaluation results for the different classification methods. Based on the comparison, our proposed method outperforms the classical Naïve Bayesian method and the methods of Cheng and Guo in terms of the F1 measure. A straightforward explanation for the improvement is that our method uses fewer features, selected by the method we proposed, to represent news articles and implements a series of two-class classifications to mitigate the similarity problem between different classes. The most important reason may be that we incorporate the key persons, which are classified manually, into the method.

Figure 6: $\alpha$ parameter selection via classification. Remark: the $y$-axis is the F1 measure score of our classification using different $\alpha$ ($x$-axis: $\alpha$ from 0 to 0.7; curves: T-10, T-20, T-30, T-40).

(2) Clustering Comparison. In reality, we need to cluster the news articles into subclasses every day, even every hour. From our spider software, we know that thousands of news articles arrive per day. $K$-means and hierarchical clustering are the most common clustering algorithms. In order to verify our proposed method, we design the experiment as follows: (1) use 500, 1000, and 1500 as the number of newly published articles for processing; (2) for each scale of dataset, perform classification on the data; (3) perform $K$-means, hierarchical clustering, and our proposed clustering method on the data; (4) perform Top 30 news recommendation; and (5) compute the F1 score for the different clustering-based systems. The comparison of recommendation with the different subclass clustering methods is shown in Figure 7.

From the experimental result, we have the following observations: (1) NEMAH achieves a better result than the other methods in terms of F1 score; (2) NEMAH is more stable than the other methods. A straightforward explanation might be that $K$-means clustering needs an initial clustering center for each cluster. Besides, with fewer parameters, our proposed method has stronger generalization and learning ability, without needing to know the size and distribution of the text corpus.

8.2.2. User Profile and Subclass Popularity Prediction Evaluation. The user profile is an important factor in a recommendation system and can affect the recommendation result significantly. Our user profile construction includes the following factors: content, place name, and key person. Prior approaches often use the content or similar access patterns to construct the user profile. SCENE [3] used the content, similar access patterns, and entities extracted by GATE [35]. In reality, entities such as place names and key person names are relatively stable over a period. Figure 8 shows the results of using different user profile building methods and subclass popularity prediction methods.

Figure 7: Recommendation performance (F1 score) at different data scales for different clustering-based systems (K-means, hierarchical, NEMAH). Remark: $n$ denotes the number of news articles for clustering ($n$ = 500, 1000, 1500).

Figure 8: Recommendation F1 score (at T5, T10, T20, T30) of different profile factors and subclass popularity prediction methods (C, C + GATE, C + P + K, C + P + K + E, C + P + K + S). Remark: C: content; P: place name; K: key person name; GATE: entity names extracted using the GATE tool; E: popularity prediction using three-time exponential smoothing; S: popularity prediction using spectral analysis (employed by NEMAH).

From this result, we observe the following. (1) Our method performs better than using GATE. (2) Recommendation using content only cannot perform well, because microblog messages do not contain much content. (3) Spectral analysis, as employed in subclass popularity prediction, performs better than the three-time exponential smoothing method. Although the average performance of spectral analysis is better than that of three-time exponential smoothing, in our other work on time series analysis we found that the cycle diagrams of some subclasses have weaker frequency signals, which tends to cause overfitting with a large number of sine-cosine pairs and yields worse results for these subclasses. SCENE [3] also used the popular degree, which is computed as the ratio of the number of users accessing the article to the size of the users' pool. However, this method conflicts with freshness. The straightforward reason is that the freshest news may have received few clicks.

Table 3: Diversity evaluation on different recommendation lists.

| Methods | T10 | T20 | T30 |
|---|---|---|---|
| Goo | 0.5104 | 0.4320 | 0.1215 |
| ClickB | 0.5231 | 0.4457 | 0.1587 |
| Bilinear | 0.5024 | 0.3547 | 0.1478 |
| Bandit | 0.6112 | 0.3874 | 0.2674 |
| SCENE | 0.6821 | 0.5747 | 0.5687 |
| NEMAH | 0.7425 | 0.6941 | 0.6637 |

Remark: T$n$ denotes the recommended result with top-$n$.

8.2.3. Diversity Evaluation. The news recommendation list of NEMAH shows great diversity in both class and subclass aspects. Let $R(u)$ be the recommended news list of a user $u$; the diversity of $u$ can be defined as follows:

$$\mathrm{Diversity} = 1 - \frac{\sum_{i, j \in R(u),\, i \neq j} \mathrm{sim}(i, j)}{\frac{1}{2} |R(u)| \left(|R(u)| - 1\right)}, \tag{30}$$

where $i$ and $j$ are two different news articles in the recommendation list for user $u$, and $\mathrm{sim}(i, j)$ denotes the news profile similarity between news items $i$ and $j$. For this metric, we choose Goo [11] (a collaborative filtering based method), ClickB [4] (a content-based method), Bilinear [14], Bandit [15], and SCENE [3] (a hybrid method using LSH for clustering and a greedy algorithm for news selection) as the comparison baselines. Table 3 shows the diversity results with $|R(u)| = 10$, $|R(u)| = 20$, and $|R(u)| = 30$, in which we use T10 to represent $|R(u)| = 10$.

From Table 3, we can see that our system outperforms the others significantly. The straightforward reason is that we diversify the news not only according to the preference of the user but also according to the distribution of the candidate news articles. As the recommendation list grows, the diversity decreases significantly for the baseline methods, because they rely too much on the preference of the user.
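The diversity metric in (30) is one minus the mean pairwise similarity of the recommended list. A minimal sketch follows, assuming the sum runs over unordered pairs (which matches the $\frac{1}{2}|R(u)|(|R(u)| - 1)$ normaliser) and that the similarity function is supplied by the caller; the function name is illustrative.

```python
from itertools import combinations

def diversity(recommended, news_sim):
    """Equation (30): one minus the mean pairwise similarity of the list."""
    pairs = list(combinations(recommended, 2))
    if not pairs:
        return 0.0  # undefined for fewer than two items; 0 is an arbitrary choice
    return 1.0 - sum(news_sim(i, j) for i, j in pairs) / len(pairs)
```

For a list of three articles with pairwise similarities 0.2, 0.4, and 0.6, the mean similarity is 0.4 and the diversity is 0.6.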

8.2.4. System Accuracy Evaluation. In order to verify the effectiveness of our proposed NEMAH, we implement a recommender system that models the recommendation as a contextual bandit problem [15]. We also implement the SCENE [3] prototype system, which employs LSH (Locality Sensitive Hashing) for news clustering and uses greedy selection for user recommendation. For each method, we select 50 users and provide news recommendation results for them. Figure 9 shows the comparison results for the Top 10, Top 20, and Top 30 news items for each user.

Figure 9: Precision-recall plots of different recommenders: (a) Top 10; (b) Top 20; (c) Top 30. Remark: ∘ (green) denotes the bandit-based recommender, ◻ (blue) denotes the SCENE recommender, and ∗ (red) represents NEMAH.

In the above experiments, we can observe that, besides the higher accuracy, the distribution of the results of our system is more stable than that of the other approaches.

In reality, if users read only a few news articles every day, many news recommendation systems cannot produce good results for them. Our system can address this problem thanks to the microblog user profile building. Figure 10 shows the comparison results for different user groups over all users (5,127 users). Suppose a reader reads $N$ news articles per day. From this figure, we can see that our proposed system still produces reasonable results for nonactive users. SCENE also produces fairly good results. The reason is that NEMAH and SCENE consider the named entities referred to by users. Besides, NEMAH takes into account the popular degree of a news article.

8.2.5. A User Study on NEMAH. In order to evaluate our proposed news recommendation system with other metrics, we develop a prototype system of NEMAH and design a questionnaire that includes the following questions: (1) satisfaction with the news content; (2) ordering of the recommendation list; (3) preference for the news subclasses; (4) popularity of the news articles; and (5) novelty of the recommendation list. For each question, we define 5 levels for selection, where 1: So Bad; 2: Not Very Well; 3: Average; 4: Good but Needs to Improve; and 5: Excellent. We crawl the news articles of the latest three days from several famous news websites as a candidate set for recommendation. Finally, we recruit 50 volunteers, who are required to have a microblog account on the SINA website, to help us complete the questionnaire. We send them the same questionnaire with different recommendation lists every week, three times in total. The average result of this user study is shown in Figure 11. From the result of the user study, we can see that NEMAH can satisfy the requirements represented by our questions for most people.

Figure 10: Comparison of the F1 score of different approaches (Goo, ClickB, Bandit, SCENE, NEMAH) for different user groups ($N < 10$, $10 \le N < 30$, $30 \le N < 50$, $N \ge 50$). Remark: $N$ denotes the number of news articles read by a user per day.

Figure 11: User study on different metrics (satisfaction, ordering, preference, popularity, novelty; average score for ClickB and NEMAH).

9. Conclusion

In this paper, we proposed the NEMAH system architecture to tackle personalized news recommendation based on microblogs and subclass popularity prediction. We explore the intrarelations between microblog content and news items and consider the subclass popularity factor, the similarity among users, and the place and key person factors together. Our system supports effective classification and subclass clustering of newly published news articles with only a small history corpus. High-quality classification and clustering construct a better data structure for recommendation. Experimental results compared with some state-of-the-art algorithms have demonstrated the better performance of NEMAH. Besides, our work in Sections 4 and 5 can be utilized for automatic module layout and channel ranking.

For future work, to process massive numbers of network news articles, we plan to deploy some components (e.g., classification, clustering, and subclass popularity analysis) onto the MapReduce framework on our distributed system. Moreover, we also plan to integrate the subclass popularity prediction module into our news search engine, given its effectiveness in our work. Another notable point is the evolution of users' interests (e.g., with time, place, and other factors), which can provide insights into news reading behaviors.

Disclosure

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee, provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers, or to redistribute to lists requires prior specific permission and/or a fee.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

References

[1] O. Phelan, K. McCarthy, and B. Smyth, "Using twitter to recommend real-time topical news," in Proceedings of the 3rd ACM Conference on Recommender Systems, pp. 385–388, ACM, October 2009.

[2] G. de Francisci Morales, A. Gionis, and C. Lucchese, "From chatter to headlines: harnessing the real-time web for personalized news recommendation," in Proceedings of the 5th ACM International Conference on Web Search and Data Mining (WSDM '12), pp. 153–162, ACM, February 2012.

[3] L. Li, D. Wang, T. Li, D. Knox, and B. Padmanabhan, "SCENE: a scalable two-stage personalized news recommendation system," in Proceedings of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '11), pp. 125–134, July 2011.

[4] J. Liu, P. Dolan, and E. R. Pedersen, "Personalized news recommendation based on click behavior," in Proceedings of the 15th ACM International Conference on Intelligent User Interfaces (IUI '10), pp. 31–40, ACM, February 2010.

[5] J. Schafer, J. Konstan, and J. Riedl, "Recommender systems in e-commerce," in Proceedings of the 1st ACM Conference on Electronic Commerce, pp. 158–166, ACM, 1999.

[6] D. Billsus and M. J. Pazzani, "Personal news agent that talks, learns and explains," in Proceedings of the 3rd Annual Conference on Autonomous Agents, pp. 268–275, ACM, May 1999.

[7] M. Pazzani and D. Billsus, "The identification of interesting web sites," Machine Learning, vol. 27, no. 3, pp. 313–331, 1997.

[8] U. Shardanand and P. Maes, "Social information filtering: algorithms for automating 'word of mouth'," in Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pp. 210–217, ACM, May 1995.

[9] R. G. Cota, A. A. Ferreira, C. Nascimento, M. A. Goncalves, and A. H. F. Laender, "An unsupervised heuristic-based hierarchical method for name disambiguation in bibliographic citations," Journal of the American Society for Information Science and Technology, vol. 61, no. 9, pp. 1853–1870, 2010.

[10] J. Breese, D. Heckerman, and C. Kadie, "Empirical analysis of predictive algorithms for collaborative filtering," in Proceedings of the 14th Conference on Uncertainty in Artificial Intelligence, pp. 43–52, Morgan Kaufmann Publishers, 1998.

[11] A. S. Das, M. Datar, A. Garg, and S. Rajaram, "Google news personalization: scalable online collaborative filtering," in Proceedings of the 16th International World Wide Web Conference (WWW '07), pp. 271–280, ACM, May 2007.

[12] M. Balabanovic and Y. Shoham, "Content-based collaborative recommendation," Communications of the ACM, vol. 40, no. 3, pp. 66–72, 1997.

[13] M. Claypool, A. Gokhale, T. Miranda, P. Murnikov, D. Netes, and M. Sartin, "Combining content-based and collaborative filters in an online newspaper," in Proceedings of the ACM SIGIR Workshop on Recommender Systems, vol. 60, Citeseer, 1999.

[14] W. Chu and S. T. Park, "Personalized recommendation on dynamic content using predictive bilinear models," in Proceedings of the 18th International Conference on World Wide Web, pp. 691–700, ACM, 2009.

[15] L. Li, W. Chu, J. Langford, and R. E. Schapire, "A contextual-bandit approach to personalized news article recommendation," in Proceedings of the 19th International Conference World Wide Web (WWW '10), pp. 661–670, ACM, April 2010.

[16] C. Best, E. van der Goot, M. de Paola, T. Garcia, and D. Horby, "Europe media monitor: EMM," JRC Technical Note no. I 2, 2002.

[17] E. Gabrilovich, S. Dumais, and E. Horvitz, "Newsjunkie: providing personalized newsfeeds via analysis of information novelty," in Proceedings of the 13th International Conference World Wide Web (WWW '04), pp. 482–490, ACM, May 2004.

[18] G. Wang, F. H. Lochovsky, and Q. Yang, "Feature selection with conditional mutual information MaxiMin in text categorization," in Proceedings of the 13th ACM Conference on Information and Knowledge Management (CIKM '04), pp. 342–349, ACM, November 2004.

[19] C. Lee and G. G. Lee, "Information gain and divergence-based feature selection for machine learning-based text categorization," Information Processing and Management, vol. 42, no. 1, pp. 155–165, 2006.

[20] A. M. A. Mesleh, "Chi square feature extraction based svms arabic language text categorization system," Journal of Computer Science, vol. 3, no. 6, pp. 430–435, 2007.

[21] Z. Wei, D. Miao, J. H. Chauchat, and C. Zhong, "Feature selection on chinese text classification using character N-grams," in Rough Sets and Knowledge Technology, vol. 5009 of Lecture Notes in Computer Science, pp. 500–507, Springer, Berlin, Germany, 2008.

[22] Y. S. Lai and C. H. Wu, "Meaningful term extraction and discriminative term selection in text categorization via unknown-word methodology," ACM Transactions on Asian Language Information Processing (TALIP), vol. 1, no. 1, pp. 34–64, 2002.

[23] R. Rifkin and A. Klautau, "In defense of one-vs-all classification," The Journal of Machine Learning Research, vol. 5, pp. 101–141, 2004.

[24] A. E. Hoerl and R. W. Kennard, "Ridge regression: biased estimation for nonorthogonal problems," Technometrics, vol. 42, no. 1, pp. 80–86, 2000.

[25] F. Wang and C. Zhang, "Feature extraction by maximizing the average neighborhood margin," in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR '07), June 2007.

[26] X. Zhang, W. Wang, K. Nørvag, and M. Sebag, "K-AP: generating specified K clusters by efficient affinity propagation," in Proceedings of the 10th IEEE International Conference on Data Mining (ICDM '10), pp. 1187–1192, IEEE, December 2010.

[27] J. D. Hamilton, Time Series Analysis, vol. 2, Cambridge University Press, Cambridge, UK, 1994.

[28] F. Abel, Q. Gao, G.-J. Houben, and K. Tao, "Analyzing user modeling on Twitter for personalized news recommendations," in User Modeling, Adaption and Personalization, vol. 6787 of Lecture Notes in Computer Science, pp. 1–12, Springer, Berlin, Germany, 2011.

[29] S. Gauch, M. Speretta, A. Chandramouli, and A. Micarelli, "User profiles for personalized information access," The Adaptive Web, Springer, Berlin, Germany, vol. 4321, pp. 54–89, 2007.

[30] G. L. Nemhauser, L. A. Wolsey, and M. L. Fisher, "An analysis of approximations for maximizing submodular set functions-I," Mathematical Programming, vol. 14, no. 1, pp. 265–294, 1978.

[31] S. Khuller, A. Moss, and J. Naor, "The budgeted maximum coverage problem," Information Processing Letters, vol. 70, no. 1, pp. 39–45, 1999.

[32] J. Leskovec, A. Krause, C. Guestrin, C. Faloutsos, J. Vanbriesen, and N. Glance, "Cost-effective outbreak detection in networks," in Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 420–429, ACM, August 2007.

[33] X. Cheng, S. Tan, and L. Tang, "Using dragpushing to refine concept index for text categorization," Journal of Computer Science and Technology, vol. 21, no. 4, pp. 592–596, 2006.

[34] Z. Guo, L. Lu, S. Xi, and F. Sun, "An effective dimension reduction approach to chinese document classification using genetic algorithm," in Advances in Neural Networks - ISNN 2009, vol. 5552 of Lecture Notes in Computer Science, pp. 480–489, Springer, Berlin, Germany, 2009.

[35] H. Cunningham, D. Maynard, K. Bontcheva, and V. Tablan, "Gate: an architecture for development of robust hlt applications," in Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, pp. 168–175, Association for Computational Linguistics, 2002.


where $I$ is the identity matrix with the same dimensions as $A$. Because the training set is divided into $K$ categories, we can obtain a group of projective vectors $P = \{p_1, p_2, \ldots, p_K\}$. We utilize a code matrix $M$ to describe the correlation between different classes obtained from the two-class classifications. Assuming that class $C_k$ has $N_k$ training documents $D_{kj}$, where $j \in [1, N_k]$, the element of $M$ that denotes the correlation between two classes can be calculated by

$$M_{kk'} = \frac{1}{N_k} \sum_{j=1}^{N_k} \operatorname{sgn}\left(\langle p_{k'}, D_{kj} \rangle\right), \tag{5}$$

where $p_{k'}$ denotes the projective vector of $C_{k'}$. If $\langle p_{k'}, D_{kj} \rangle$ is greater than 0, the function sgn returns 1, and otherwise 0. When new articles arrive, the similarity between an article and a class can be calculated by the following equation:

$$\operatorname{Sim}(B, C_k) = \sum_{k'=1}^{K} M_{kk'} Q_{k'} = \sum_{k'=1}^{K} M_{kk'} \operatorname{sgn}\left(\langle p_{k'}, D_{kj} \rangle\right), \tag{6}$$

where $B$ denotes a new article. At last, we can obtain the class of $B$ as the maximizer of the function $\operatorname{Sim}(\cdot, \cdot)$:

$$C(B) = \arg\max_{C_k} \operatorname{Sim}(B, C_k). \tag{7}$$
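Equations (5) to (7) can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the projective vectors are stacked as rows of a matrix `P`, documents are dense vectors, and the sign term $Q_{k'}$ in (6) is evaluated on the new article's own vector $B$. The function names `code_matrix` and `classify` are hypothetical and do not come from the paper.

```python
import numpy as np

def code_matrix(P, docs_by_class):
    """Eq. (5): M[k, k2] is the fraction of class-k training documents
    whose inner product with projective vector p_k2 is positive."""
    K = P.shape[0]
    M = np.zeros((K, K))
    for k, D in enumerate(docs_by_class):    # D has shape (N_k, d)
        M[k] = (D @ P.T > 0).mean(axis=0)    # sgn returns 1 if > 0, else 0
    return M

def classify(P, M, b):
    """Eqs. (6)-(7): score every class and return the best one.
    Assumption: Q_k2 = sgn(<p_k2, b>) is computed on the new article b."""
    q = (P @ b > 0).astype(float)
    sim = M @ q                              # Sim(B, C_k) for every k
    return int(np.argmax(sim)), sim

# Toy data (hypothetical): two classes in a 2-dimensional feature space.
P = np.eye(2)
docs = [np.array([[1.0, -1.0], [2.0, -0.5]]),
        np.array([[-1.0, 1.0], [1.0, 1.0]])]
M = code_matrix(P, docs)
label, scores = classify(P, M, np.array([3.0, -1.0]))
```

With the toy data, `M` is `[[1, 0], [0.5, 1]]`, so an article that only activates $p_1$ is assigned to the first class.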

In order to further improve the classification accuracy and use manual labor rationally, we propose a method that considers key persons (named entities of type person) to improve classification when key persons appear, as shown in the following:

$$P(C_i \mid B) = (1 - \alpha) \frac{\operatorname{Sim}(B, C_i)}{\sum_{i'=1}^{K} \operatorname{Sim}(B, C_{i'})} + \alpha P(C_i \mid B_k) P(B_k \mid B), \tag{8}$$

where $\operatorname{Sim}(B, C_i) / \sum_{i'=1}^{K} \operatorname{Sim}(B, C_{i'})$ denotes the probability score of article $B$ on class $C_i$ obtained by the method mentioned above, $B_k$ denotes the key person that appears in the article $B$, and $P(B_k \mid B) = 1$ when $B_k$ appears in $B$; otherwise $P(B_k \mid B) = 0$. In other words, if no key person appears in a new article, the key person factor is not applied to it. $\alpha$ is the balance parameter between these two methods. $P(C_i \mid B_k)$ is computed as

$$P(C_i \mid B_k) = \frac{P(C_i)\, p(B_k \mid C_i)}{\sum_{i'=1}^{K} P(C_{i'})\, p(B_k \mid C_{i'})}. \tag{9}$$
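A small numerical sketch of (8) and (9) may help. It assumes the class priors $P(C_i)$ and the key-person likelihoods $p(B_k \mid C_i)$ are already estimated; the function names and the toy numbers are illustrative only.

```python
import numpy as np

def key_person_posterior(prior, likelihood):
    """Eq. (9): P(C_i | B_k) from priors P(C_i) and likelihoods p(B_k | C_i)."""
    joint = np.asarray(prior, float) * np.asarray(likelihood, float)
    return joint / joint.sum()

def blended_score(sim, alpha, person_posterior=None):
    """Eq. (8): mix the normalized similarity with the key-person posterior.
    person_posterior=None models P(B_k | B) = 0 (no key person in the article)."""
    sim = np.asarray(sim, float)
    base = sim / sim.sum()
    if person_posterior is None:
        return (1 - alpha) * base
    return (1 - alpha) * base + alpha * np.asarray(person_posterior, float)

# Hypothetical example: the similarity favors class 1, the key person favors class 0.
posterior = key_person_posterior([0.5, 0.5], [0.9, 0.1])
scores = blended_score([1.0, 3.0], alpha=0.2, person_posterior=posterior)
```

Here `posterior` is `[0.9, 0.1]` and `scores` is `[0.38, 0.62]`, so with a small $\alpha$ the similarity term still decides the class.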

4.3. News Subclass Clustering. After obtaining the rough classification results, we need to separate every news class into subclasses $SC_{ix}$. Subclasses of an Internet text corpus are typically detected by clustering, for instance $K$-means or hierarchical clustering. In NEMAH we propose a subclass clustering method to obtain subclasses. Each subclass is represented as a subclass vector $T = \{\langle t_1, w_1 \rangle, \langle t_2, w_2 \rangle, \ldots\}$, where $t_i$ and $w_i$ denote a representative word and its corresponding weight, respectively. We call this clustering method the Maximizing Neighborhood method, after the main idea of the algorithm.

(1) Solving Subspace Projection Problem by Maximizing the Average of Neighborhood. For each document $x_i$ in a text space $X_0$, the neighbor documents can be divided into two subsets according to their distance to $x_i$: the similar neighborhood set $\Theta_i^o$ and the heterogeneous neighborhood set $\Theta_i^e$, where $\Theta_i^o$ contains the top $\xi$ nearest neighbors which belong to the same class as $x_i$ and $\Theta_i^e$ contains the top $\zeta$ nearest neighbors which do not belong to the same class as $x_i$. For each data point $x_i$, the average out-of-class distance and the average within-class distance can be expressed as follows:

$$P_i = \sum_{x_p \in \Theta_i^e} \frac{\left\| x_i - x_p \right\|^2}{\left| \Theta_i^e \right|}, \qquad Q_i = \sum_{x_q \in \Theta_i^o} \frac{\left\| x_i - x_q \right\|^2}{\left| \Theta_i^o \right|}. \tag{10}$$

Summed over all data points in the text corpus, the average out-of-class distance and the average within-class distance are as follows:

$$P = \sum_i P_i = \sum_i \sum_{x_p \in \Theta_i^e} \frac{\left\| x_i - x_p \right\|^2}{\left| \Theta_i^e \right|}, \qquad Q = \sum_i Q_i = \sum_i \sum_{x_q \in \Theta_i^o} \frac{\left\| x_i - x_q \right\|^2}{\left| \Theta_i^o \right|}. \tag{11}$$

The subclass clustering problem can be considered as a projection of the text space to a subspace. For instance, let $y_i$ be the projection of $x_i$; after projecting, we can express $y_i = W^T x_i$. The principle of this projection is maximizing the average distance between different classes and minimizing the average distance within each class [25], as shown in the following:

$$\begin{aligned} r &= \sum_i \left( \sum_{x_p \in \Theta_i^e} \frac{\left\| x_i - x_p \right\|^2}{\left| \Theta_i^e \right|} - \sum_{x_q \in \Theta_i^o} \frac{\left\| x_i - x_q \right\|^2}{\left| \Theta_i^o \right|} \right) \\ &= \operatorname{tr}\left[ W^T \left( \sum_i \sum_{x_p \in \Theta_i^e} \frac{\left\| x_i - x_p \right\|^2}{\left| \Theta_i^e \right|} - \sum_i \sum_{x_q \in \Theta_i^o} \frac{\left\| x_i - x_q \right\|^2}{\left| \Theta_i^o \right|} \right) W \right] \\ &= \operatorname{tr}\left[ W^T (P - Q) W \right], \end{aligned} \tag{12}$$

where $\operatorname{tr}(\cdot)$ denotes the trace of a matrix and the constraint of this equation is $W^T W = I$. We then maximize the objective as follows:

$$\max \operatorname{tr}\left[ W^T (P - Q) W \right]. \tag{13}$$
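The projection step in (10) to (13) can be sketched as an eigenvalue problem. This is a minimal illustration, assuming that $P$ and $Q$ in (12) are taken as the scatter matrices $\sum (x_i - x_p)(x_i - x_p)^T / |\Theta|$, whose traces equal the distance sums in (11). Under the constraint $W^T W = I$, the trace is maximized by the leading eigenvectors of $P - Q$. The function name `anmm_projection` and its parameters are hypothetical.

```python
import numpy as np

def anmm_projection(X, labels, n_same, n_diff, dim):
    """Return W (d x dim) maximizing tr[W^T (P - Q) W] subject to W^T W = I."""
    X = np.asarray(X, float)
    labels = np.asarray(labels)
    n, d = X.shape
    dist = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)   # squared distances
    P = np.zeros((d, d))   # out-of-class scatter
    Q = np.zeros((d, d))   # within-class scatter
    for i in range(n):
        order = np.argsort(dist[i])
        order = order[order != i]
        same = [j for j in order if labels[j] == labels[i]][:n_same]
        diff = [j for j in order if labels[j] != labels[i]][:n_diff]
        for idx, S in ((diff, P), (same, Q)):
            if idx:
                D = X[idx] - X[i]
                S += D.T @ D / len(idx)      # in-place update of P or Q
    vals, vecs = np.linalg.eigh(P - Q)       # symmetric input, ascending eigenvalues
    return vecs[:, ::-1][:, :dim]            # eigenvectors of the largest eigenvalues

# Toy corpus (hypothetical): classes differ along axis 0, noise along axis 1.
X = [[0, 0], [0, 1], [0, 2], [5, 0], [5, 1], [5, 2]]
W = anmm_projection(X, [0, 0, 0, 1, 1, 1], n_same=2, n_diff=2, dim=1)
```

On the toy corpus the single projection direction is almost exactly the first axis, which is the direction that separates the two classes.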


(2) The Quick Affinity Propagation Clustering on Subspace. After projecting the initial text vector space into the subspace through the projective matrix, $K$ clusters can be generated by running $K$-Affinity Propagation ($K$-AP) in the subspace; this method is more suitable for text clustering because it can achieve more reasonable clusters than the traditional clustering methods [26]. Let the similarity of $y_i$ and $y_j$ in subspace $Y = \{y_1, y_2, \ldots, y_n\}$ be $S = \{s_{ij}\}$. The target of $K$-AP is to find the $K$ real samples $E = \{e_1, e_2, \ldots, e_K\}$ which represent the $K$ classes $C = \{C_1, C_2, \ldots, C_K\}$ and maximize the following objective function:

$$\max F\left( \{C_j\}_{j=1}^{K} \right) = \sum_{j=1}^{K} \sum_{y_i \in C_j} s(y_i, e_j), \tag{14}$$

where $e_j$ belongs to $C_j$. The objective function can be transformed into a 0-1 integer programming problem by introducing the binary parameters $B = \{b_{ij} \in \{0, 1\},\ i, j = 1, \ldots, n\}$, as shown in the following:

$$\max F(b_{ij}) = \sum_{i=1}^{n} \sum_{j=1}^{n} b_{ij}\, s(y_i, y_j). \tag{15}$$

Equation (15) has three constraints: (1) $b_{ii} = 1$ if $b_{ji} = 1$, (2) $\sum_{j=1}^{n} b_{ij} = 1$, and (3) $\sum_{i=1}^{n} b_{ii} = K$, where $b_{ij} = 1$ when $y_i$ considers $y_j$ as a sample and $b_{ii} = 1$ when $y_i$ is a sample itself. The first constraint means that $y_i$ is a sample whenever some $y_j$ considers $y_i$ as a sample. The second means that each data point has only one sample point. The last means that the number of samples is $K$, which ensures that the $K$-AP method generates $K$ clusters.
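To make the integer program concrete, the sketch below evaluates the objective of (15) for a given 0-1 assignment matrix and checks the three constraints. It does not implement the message passing of $K$-AP [26]; it only verifies a candidate solution. The function names are illustrative, and the similarity matrix is a made-up example.

```python
import numpy as np

def kap_objective(b, s):
    """Eq. (15): F(b) = sum over i, j of b_ij * s(y_i, y_j)."""
    return float((np.asarray(b) * np.asarray(s)).sum())

def kap_feasible(b, K):
    """Check the three constraints on a 0-1 assignment matrix b."""
    b = np.asarray(b)
    diag = np.diag(b)
    # (1) any point chosen as a sample by someone must choose itself
    chosen_ok = all(diag[j] == 1 for j in np.nonzero(b.sum(axis=0))[0])
    # (2) every point has exactly one sample
    one_each = bool((b.sum(axis=1) == 1).all())
    # (3) exactly K points are samples
    k_total = int(diag.sum()) == K
    return chosen_ok and one_each and k_total

# Hypothetical similarities (negative squared distances) for three points.
s = [[0, -1, -9], [-1, 0, -4], [-9, -4, 0]]
b_good = [[1, 0, 0], [1, 0, 0], [0, 0, 1]]   # points 0 and 1 share sample 0
b_bad = [[0, 1, 0], [1, 0, 0], [0, 0, 1]]    # samples that do not choose themselves
value = kap_objective(b_good, s)
```

For `b_good` the objective is `-1.0` and all constraints hold with $K = 2$, while `b_bad` violates constraint (1).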

(3) Hybrid Learning of Subspace Projection and Clustering on Adaptive Subspace. The class information updated in the subspace clustering process can be used as prior knowledge in the next round of subspace projection; after several iterations, until convergence, we can obtain the global optimal clustering result. The iteration proceeds as follows:

$$\begin{aligned} X_0 &\rightarrow K\text{-AP} \rightarrow L_0 \rightarrow \text{SubSpace} \rightarrow W_1,\ \text{Score}_1 \\ Y_1 = W_1^T X_0 &\rightarrow K\text{-AP} \rightarrow L_1 \rightarrow \text{SubSpace} \rightarrow W_2,\ \text{Score}_2 \\ \cdots &\rightarrow \cdots \rightarrow \cdots \rightarrow \cdots \rightarrow \cdots \\ Y_t = W_t^T X_0 &\rightarrow K\text{-AP} \rightarrow L_t \rightarrow \text{SubSpace} \rightarrow W_{t+1},\ \text{Score}_{t+1} \end{aligned}$$

The convergence function value must be computed in each iteration:

$$\text{Score}_{t+1} = \operatorname{tr}\left[ W_{t+1}^T \left( P(L_t) - Q(L_t) \right) W_{t+1} \right], \tag{16}$$

where $P(L_t)$ and $Q(L_t)$ denote the average distances between classes and within class, calculated by (11) according to the class membership indicator matrix $L_t$. The iteration finishes when it meets the convergence condition $\text{Score}_{t+1} - \text{Score}_t \le \epsilon$ or reaches the maximum number of iterations. The parameters of our clustering method are the number $\xi$ of nearest in-class points and the number $\zeta$ of nearest out-of-class points. We used cross-validation to tune these parameters and found that selecting $\zeta = \xi = 13$ for all classes per 1,000 documents performs better.

Discussion. The motivation of this module (classification and clustering) is to find the user's preference at the subclass level and to track the hotness of newly published news in a given subclass.

5. Subclass Popularity Prediction Module

With today's explosion of information, the fast pace of life makes people focus their attention on popular subclasses rather than spend much time searching for and selecting information. Sometimes even users themselves have no idea what they really want. Therefore hot subclass prediction with a recommendation function has become very important. News subclass popularity prediction can improve the performance of a news recommender system. Besides, it can also automatically improve the display of popular news modules on a website, reduce the workload of website editors, and improve the users' browsing experience.

From the study of historical statistical data on news subclasses, we found that some subclasses are popular periodically. For instance, the subclass college entrance examination becomes highly popular around June every year in China, and a lot of news articles and comments focus on this subclass at that time, as shown in Figure 4, whose panels show the popularity data of the graduate entrance examination and college entrance examination subclasses. In this paper we define a news subclass's degree of concern according to the number of news articles and their comments, as shown in the following:

$$H_k = \lambda H_{ne}^{(k)} + (1 - \lambda) H_{re}^{(k)} = \frac{N_{re}^{(k)}}{N_{ne} + N_{re}} \cdot \frac{N_{ne}^{(k)}}{N_{ne}} + \frac{N_{ne}^{(k)}}{N_{ne} + N_{re}} \cdot \frac{N_{re}^{(k)}}{N_{re}}, \tag{17}$$

where $H_{ne}^{(k)}$ denotes the popularity degree of news articles on the $k$th subclass, $H_{re}^{(k)}$ denotes the popularity degree of comments on the $k$th subclass, $\lambda$ is the weight of the popularity degree of news articles and its value is $N_{re}^{(k)} / (N_{ne} + N_{re})$, $N_{re}^{(k)}$ denotes the number of reviews on the $k$th subclass, $N_{re}$ denotes the number of reviews in the whole corpus, $N_{ne}^{(k)}$ denotes the number of news articles on the $k$th subclass, and $N_{ne}$ denotes the number of news articles in the whole corpus. According to time series analysis experiments on our corpus, we found that most subclasses are suitable for the spectral analysis method [27].
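Equation (17) can be computed directly from four counts. The sketch below follows the expanded right-hand side of (17) as printed; the function name and the example counts are hypothetical.

```python
def popularity_degree(n_ne_k, n_re_k, n_ne, n_re):
    """Eq. (17), expanded form: degree of concern H_k of subclass k.

    n_ne_k, n_re_k: articles and reviews in subclass k
    n_ne, n_re:     articles and reviews in the whole corpus
    """
    total = n_ne + n_re
    h_ne = n_ne_k / n_ne      # share of all articles that fall in subclass k
    h_re = n_re_k / n_re      # share of all reviews that fall in subclass k
    return (n_re_k / total) * h_ne + (n_ne_k / total) * h_re

# Hypothetical counts: 10 of 100 articles and 100 of 900 reviews are in subclass k.
h_k = popularity_degree(10, 100, 100, 900)
```

Evaluating $H_k$ once per day for a subclass gives the kind of time series that the spectral analysis in this section takes as input.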

Any stationary sequence can be modeled as a combination of many cosine waves with different frequencies, amplitudes, and phases. This analysis method is called spectral analysis, a frequency-domain method. The linear combination of $m$ cosines with arbitrary amplitudes, frequencies, and phases is shown in the following:

$$Y_t = A_0 + \sum_{j=1}^{m} \left[ A_j \cos(2\pi f_j t) + B_j \sin(2\pi f_j t) \right]. \tag{18}$$

Figure 4: Periodic subclass news distribution. (a) Popularity degree of graduate entrance examination subclass. (b) Popularity degree of college entrance examination subclass. Remark: the $x$-axis (Days, 0 to 1000) denotes the date from August 1, 2009, to May 3, 2012; the $y$-axis (Proportion, 0 to 1) denotes the value of $H_k$ in (17).

The values of $A_j$ and $B_j$ can be obtained by ordinary least squares regression. When the frequencies take a special form, the calculation becomes very simple. If $n$ is an odd number, expressed as $n = 2k + 1$, then frequencies of the form $1/n, 2/n, \ldots, k/n$ are called Fourier frequencies. The estimated parameters are as follows:

$$A_0 = \bar{Y}, \qquad A_j = \frac{2}{n} \sum_{t=1}^{n} Y_t \cos\left( \frac{2\pi t j}{n} \right), \qquad B_j = \frac{2}{n} \sum_{t=1}^{n} Y_t \sin\left( \frac{2\pi t j}{n} \right). \tag{19}$$

If the sample size is even, expressed as $n = 2k$, (19) still holds. But the equation changes to the following when $f_k = k/n = 1/2$:

$$A_k = \frac{1}{n} \sum_{t=1}^{n} (-1)^t Y_t, \qquad B_k = 0. \tag{20}$$

Definition 1. When the sample size is odd, namely, $n = 2k + 1$, we define the periodogram (cycle diagram) $I$ at frequency $f = j/n$ ($j = 1, 2, \ldots, k$) as shown in the following equation:

$$I\left( \frac{j}{n} \right) = \frac{n}{2} \left( A_j^2 + B_j^2 \right). \tag{21}$$

If the sample size is even, (19) still gives the values of $A$ and $B$, and the periodogram is the same as in the odd case. But in the extreme frequency case, that is, when $f = k/n = 1/2$, the periodogram is shown in the following equation:

$$I\left( \frac{k}{n} \right) = n \left( A_k \right)^2. \tag{22}$$

By (21), the periodogram at frequency $f = j/n$ is proportional to the sum of squares of the corresponding regression coefficients. Therefore the peaks of the periodogram show the relative intensity of the sine-cosine pairs at different frequencies, as shown in Figure 5.

In Figure 5 the periodogram has two peaks, at frequencies 0.004970179 and 0.002982107; namely, the subcycle $T = 1/f$ may be 201 days and 335 days. The other peaks are so low that they can be neglected. The two frequencies are selected for building the model, which means that the model has two sine-cosine pairs in it, as shown in the following:

$$Y_t = \beta_0 + \beta_1 \cos(2\pi f_1 t) + \beta_2 \sin(2\pi f_1 t) + \beta_3 \cos(2\pi f_2 t) + \beta_4 \sin(2\pi f_2 t) + e_t. \tag{23}$$

Prediction with the spectral analysis method has several steps. First, we use the periodogram to obtain the number and values of the strong frequencies. Second, the model is generated from these strong frequencies. Finally, we predict future data values according to the model, which requires only a time parameter.
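The three prediction steps can be sketched with NumPy as follows. This is an illustration under assumptions, not the authors' implementation: `periodogram` evaluates (19) to (22) at the Fourier frequencies, `fit_harmonics` fits a model of the form (23) by ordinary least squares via `np.linalg.lstsq`, and the synthetic series stands in for a real $H_k$ series. Both function names are hypothetical.

```python
import numpy as np

def periodogram(y):
    """Return Fourier frequencies j/n (j = 1..k) and I(j/n), Eqs. (19)-(22)."""
    y = np.asarray(y, float)
    n = len(y)
    t = np.arange(1, n + 1)
    k = n // 2
    freqs = np.arange(1, k + 1) / n
    I = np.empty(k)
    for idx, f in enumerate(freqs):
        if n % 2 == 0 and idx == k - 1:          # f = 1/2: Eqs. (20) and (22)
            a = ((-1.0) ** t * y).sum() / n
            I[idx] = n * a ** 2
        else:                                    # Eqs. (19) and (21)
            a = 2.0 / n * (y * np.cos(2 * np.pi * f * t)).sum()
            b = 2.0 / n * (y * np.sin(2 * np.pi * f * t)).sum()
            I[idx] = n / 2.0 * (a ** 2 + b ** 2)
    return freqs, I

def fit_harmonics(y, freqs):
    """Least-squares fit of Eq. (23); returns a predictor over time indices."""
    t = np.arange(1, len(y) + 1)

    def design(tt):
        cols = [np.ones(len(tt))]
        for f in freqs:
            cols += [np.cos(2 * np.pi * f * tt), np.sin(2 * np.pi * f * tt)]
        return np.column_stack(cols)

    beta, *_ = np.linalg.lstsq(design(t), np.asarray(y, float), rcond=None)
    return lambda tt: design(np.asarray(tt, float)) @ beta

# Synthetic series (hypothetical) with two hidden cycles.
n = 101
t = np.arange(1, n + 1)
y = 0.5 + 0.3 * np.cos(2 * np.pi * 5 / n * t) + 0.1 * np.sin(2 * np.pi * 12 / n * t)
freqs, I = periodogram(y)
strong = np.sort(freqs[np.argsort(I)[-2:]])      # step 1: two strongest frequencies
predict = fit_harmonics(y, strong)               # step 2: build the model
future = predict([n + 1, n + 2])                 # step 3: predict from time only
```

On the synthetic series the two strongest frequencies are recovered as $5/n$ and $12/n$, and the fitted model extrapolates the series to later time indices.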

Discussion. The motivation of this module is to obtain the hotness of each subclass. Some recent studies also take into account the popularity of newly published news articles. For example, SCENE [3] used a popularity degree computed as a ratio based on the number of users accessing the article. However, for the newest popular news article $n_i$, the number of clicks would be less than that of news articles published several hours or days before.

6. User Profile Module

Figure 5: Periodogram of the popularity degree of college entrance examination subclass. (a) The complete periodogram, with frequency axis from 0.0 to 0.5. (b) Local amplification of the periodogram, with frequency axis from 0.00 to 0.05. Remark: the $x$-axis denotes the possible frequency of the popularity degree; the $y$-axis represents the energy of the corresponding frequency.

In order to capture a user's reading interest in news items, a personalized news recommendation system generally needs to construct the user's profile. Traditionally, the user profile is captured by tracking the user's reading history. A survey of various user profile construction techniques is provided in [28, 29]. In this paper we use the microblog to construct the user's profile. The reason is that a user who is interested in some subclasses will tend to tweet or retweet microblogs on these subclasses. For instance, if a user tweets or retweets many messages about basketball games, we can deduce that this user may like reading basketball news reports (i.e., NBA, CBA, etc.). Besides, many readers tend to glance at news articles and are interested in some key persons' names. Moreover, people from different areas tend to read the news from the city they live in or their hometown. Based on the above analysis, we propose to construct users' profiles by exploring the three factors discussed above: microblog content, place names, and preferred key persons. In order to reduce the computational complexity, the preference taken into account in our model is represented by a vector $U_{pf} = \{\tau, \rho, \kappa\}$. Consider the following.

(1) $\tau$ represents the distribution of key index words in the microblogs which the user tweeted or retweeted in the past; it can be expressed as a vector $\{\langle t_1, w_1 \rangle, \langle t_2, w_2 \rangle, \ldots\}$, where each element consists of an index word and its corresponding weight.

(2) $\rho$ represents the place names which appeared in the microblogs of a specific user; it can be expressed as $\{\langle p_1, w_1 \rangle, \langle p_2, w_2 \rangle, \ldots, \langle p_i, w_i \rangle, \ldots\}$, where $p_i$ denotes a place name and $w_i$ denotes the number of times this place appears in the tweets of the given user. We collect all the city and province names in China. Some place names are subordinate to others; for instance, GuangZhou city is subordinate to GuangDong province. In this case the system adds weight to GuangDong using $w_{\text{GuangDong}} \mathrel{+}= w_{\text{GuangZhou}}$ when GuangZhou appears.

(3) $\kappa$ represents the list of key persons' names extracted from the user's microblogs, $\{\langle k_1, w_1 \rangle, \langle k_2, w_2 \rangle, \ldots\}$, where the name list is constructed from the NanFang Daily training corpus, in which the key persons' names have been tagged in each news article.
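The place-name component $\rho$ and its roll-up from city to province can be sketched as below. The mapping `parent_of` and the list of extracted mentions are hypothetical inputs; how place names are extracted from tweets is outside this sketch.

```python
from collections import Counter

def place_weights(mentions, parent_of):
    """Count place mentions; each city mention also adds to its province."""
    weights = Counter()
    for place in mentions:
        weights[place] += 1
        parent = parent_of.get(place)
        if parent is not None:
            weights[parent] += 1      # w_GuangDong += w_GuangZhou, one mention at a time
    return weights

# Hypothetical mentions extracted from one user's tweets.
rho = place_weights(
    ["GuangZhou", "GuangZhou", "GuangDong", "BeiJing"],
    parent_of={"GuangZhou": "GuangDong"},
)
```

In this example GuangDong ends with weight 3: one direct mention plus two inherited from GuangZhou.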

7. Personalized News Recommendation Module

The recommendation module can be divided into two steps: Rough Selection (see Section 7.1) and Precise Selection (see Section 7.3). In the first step, some subclasses are matched according to the user's preference. Then we select news articles from these subclasses by our selection strategy.

7.1. Rough Selection: Subclass Selection for a User. Once we obtain the subclasses and the user's profile, we can calculate the similarity between each subclass and a given user. We use TF-IDF weights to represent the vector of a given subclass, $\tau_s = \{\langle t_1, w_1 \rangle, \langle t_2, w_2 \rangle, \ldots\}$. The similarity between a subclass and a user (represented as $\tau_u = \{\langle t_1, w_1 \rangle, \langle t_2, w_2 \rangle, \ldots\}$ in $U_{pf}$; see Section 6) is computed by cosine similarity. In general, users prefer some particular subclasses; that is, they are not interested in all subclasses. Therefore we can roughly select some subclasses with a similarity threshold. This threshold is set at the top 30% of the similarity score ranking with respect to a given user.
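A minimal sketch of the rough selection step follows, assuming term vectors are stored as sparse dictionaries from term to weight and that the threshold keeps a fixed fraction of the ranked subclasses. The parameter `keep_fraction` and both function names are illustrative assumptions.

```python
import math

def cosine(a, b):
    """Cosine similarity of two sparse term -> weight dictionaries."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def rough_select(user_terms, subclass_terms, keep_fraction=0.3):
    """Keep the top keep_fraction of subclasses ranked by similarity to the user."""
    ranked = sorted(subclass_terms,
                    key=lambda s: cosine(user_terms, subclass_terms[s]),
                    reverse=True)
    keep = max(1, math.ceil(keep_fraction * len(ranked)))
    return ranked[:keep]

# Hypothetical user vector (tau_u) and subclass vectors (tau_s).
user = {"nba": 1.0, "basketball": 2.0}
subclasses = {
    "basketball": {"nba": 1.0, "basketball": 1.0},
    "finance": {"stock": 1.0},
    "exam": {"exam": 1.0, "basketball": 0.1},
    "tech": {"phone": 1.0},
}
selected = rough_select(user, subclasses)
```

With four subclasses and `keep_fraction=0.3`, two subclasses are kept, here `basketball` followed by `exam`.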

7.2. News Profile Construction. After obtaining the news clusters that the user might be interested in, the next step is to select specific news articles for the given user. As for the user, we initially maintain a news profile for each news article, and then we model the recommendation as a budgeted maximum coverage problem and solve it by a greedy selection algorithm. A news profile contains many similar factors, for example, key person, place, the cluster it belongs to, recency, popularity, and so forth. For the popularity, as discussed above, we use $H_k$ to represent the popularity degree of cluster $k$. For the recency, the score is represented as the following:

$$\operatorname{Rec}(i) = \frac{i_c - i_p}{24 \times 60}, \tag{24}$$

where the function $\operatorname{Rec}(i)$ returns the recency score of news article $i$, and $i_c$ and $i_p$ denote the current time and the published time, respectively.

In this paper, news profiles help to evaluate how well a news article can satisfy the user. Given a news profile $N_{pf} = \{\rho, \kappa, \upsilon\}$ and a user's profile $U_{pf} = \{\rho, \kappa, \upsilon\}$, the similarity between $N_{pf}$ and $U_{pf}$ is computed as

$$\operatorname{sim}(N_{pf}, U_{pf}) = \gamma_1 \operatorname{sim}(\rho_n, \rho_u) + \gamma_2 \operatorname{sim}(\kappa_n, \kappa_u) + \gamma_3 \operatorname{sim}(\upsilon_n, \upsilon_u), \tag{25}$$

where $\gamma_1$, $\gamma_2$, and $\gamma_3$ are parameters that control how much we trust or weigh the corresponding components; they are set to 1 in our system. Each component is calculated by the cosine similarity.
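Equations (24) and (25) are simple enough to sketch directly. Two assumptions are made loudly here: the time difference in (24) is measured in minutes, so that dividing by $24 \times 60$ yields days, and the component similarity is passed in as a function rather than fixed to a particular cosine implementation. The dictionary keys `rho`, `kappa`, and `upsilon` are illustrative names for the profile components.

```python
from datetime import datetime

def recency(current, published):
    """Eq. (24): article age, assuming times are measured in minutes."""
    minutes = (current - published).total_seconds() / 60.0
    return minutes / (24 * 60)

def profile_similarity(news, user, sim, gammas=(1.0, 1.0, 1.0)):
    """Eq. (25): weighted sum of per-component similarities."""
    keys = ("rho", "kappa", "upsilon")
    return sum(g * sim(news[key], user[key]) for g, key in zip(gammas, keys))

# Hypothetical usage with a trivial exact-match similarity.
age = recency(datetime(2012, 1, 2, 12, 0), datetime(2012, 1, 1, 0, 0))
exact = lambda a, b: 1.0 if a == b else 0.0
score = profile_similarity(
    {"rho": "GuangZhou", "kappa": "A", "upsilon": "x"},
    {"rho": "GuangZhou", "kappa": "B", "upsilon": "x"},
    sim=exact,
)
```

Here `age` is 1.5 (days) and `score` is 2.0, since two of the three components match and all weights are 1.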

Let $E$ be a finite set and $f$ a real-valued nondecreasing function defined on the subsets of $E$ that satisfies

$$f(T \cup \{\varsigma\}) - f(T) \le f(S \cup \{\varsigma\}) - f(S), \tag{26}$$

where $S \subseteq T$, $S$ and $T$ are two subsets of $E$, and $\varsigma \in E \setminus T$. Such a function $f$ is called a submodular function [30]. When an element is added to a larger set $T$, the value increment of $f$ cannot be larger than when the same element is added to a smaller set $S$. The budgeted maximum coverage problem can be described as follows: given a set of elements $E$, in which each element is associated with an influence and a cost defined over a domain of these elements, and a budget $B$, the goal is to find a subset of $E$ which has the largest influence while the total cost does not exceed the budget $B$. This problem is NP-hard [31]. However, [31] proposed a greedy algorithm which sequentially picks the element that yields the largest possible increase in influence within the cost limit. Submodularity resides in each pick-up step. By the result of [32], submodular functions are closed under nonnegative linear combinations.

7.3. Precise Selection: News Selection for Recommendation. In a given news subclass, we observe that most news articles concentrate on a similar topic with minor differences in the major aspects of that topic. Typically, a reader is interested in some aspects of the given subclass but not all of them. Based on this intuition, our news selection strategy can be described as follows.

Assume that $\mathcal{C}$ denotes the newly published news set, $\mathcal{S}$ represents the selected news set, and $\varsigma$ denotes the news article being selected. After selecting a piece of news $\varsigma$, we must ensure that

(i) the topic diversity should not deviate much in $\mathcal{S}$;
(ii) $\mathcal{S}$ should give more satisfaction to the given user;
(iii) $\mathcal{S}$ should be similar to the general topic in $\mathcal{C} \setminus \mathcal{S}$.

For each of the above strategies, similar to [3], we define a quality function $q(\mathcal{S})$ to evaluate the value of the currently selected news set $\mathcal{S}$ as follows:

$$q(\mathcal{S}) = \frac{1}{\binom{|\mathcal{S}|}{2}} \sum_{n_1, n_2 \in \mathcal{S},\, n_1 \ne n_2} -\operatorname{sim}(n_1, n_2) + \frac{1}{|\mathcal{S}|} \sum_{n_1 \in \mathcal{S}} \operatorname{sim}(u, n_1) + \frac{1}{|\mathcal{C} \setminus \mathcal{S}| \cdot |\mathcal{S}|} \sum_{n_1 \in \mathcal{C} \setminus \mathcal{S}} \sum_{n_2 \in \mathcal{S}} \operatorname{sim}(n_1, n_2), \tag{27}$$

where $n_1$ and $n_2$ denote news items, $u$ denotes the given user, and the function $\operatorname{sim}(\cdot, \cdot)$ returns the similarity of its two parameters. Equation (27) contains three components corresponding to the news selection strategies listed above; $q(\mathcal{S})$ balances the contributions of the different components. Suppose $\varsigma$ is the candidate news document; the quality increase can be represented as

$$I(\varsigma) = q(\mathcal{S} \cup \{\varsigma\}) - q(\mathcal{S}). \tag{28}$$
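The quality function (27) and the increment (28) suggest a simple greedy loop, sketched below. Several conventions are assumptions of this sketch rather than statements from the paper: the first sum runs over unordered pairs, empty terms (no pairs, or nothing left outside the selection) count as 0, $q(\emptyset) = 0$, and every article has unit cost so the budget is a maximum list length. The names `quality` and `greedy_select` are hypothetical.

```python
from itertools import combinations

def quality(S, C, user_sim, sim):
    """q(S) of Eq. (27); S is the selected list and C the full candidate list."""
    if not S:
        return 0.0
    rest = [n for n in C if n not in S]
    pairs = list(combinations(S, 2))
    diversity = sum(-sim(a, b) for a, b in pairs) / len(pairs) if pairs else 0.0
    satisfaction = sum(user_sim(n) for n in S) / len(S)
    coverage = (sum(sim(a, b) for a in rest for b in S) / (len(rest) * len(S))
                if rest else 0.0)
    return diversity + satisfaction + coverage

def greedy_select(C, budget, user_sim, sim):
    """Repeatedly add the article with the largest increase of Eq. (28)."""
    S = []
    while len(S) < min(budget, len(C)):
        base = quality(S, C, user_sim, sim)
        best = max((n for n in C if n not in S),
                   key=lambda n: quality(S + [n], C, user_sim, sim) - base)
        S.append(best)
    return S

# Hypothetical candidates: "a" and "b" are near duplicates, "c" is different.
pair_sim = {frozenset("ab"): 0.95, frozenset("ac"): 0.1, frozenset("bc"): 0.1}
to_user = {"a": 0.9, "b": 0.85, "c": 0.5}
chosen = greedy_select(["a", "b", "c"], budget=2,
                       user_sim=to_user.get,
                       sim=lambda x, y: pair_sim[frozenset((x, y))])
```

In the toy example the loop picks `a` first and then prefers `c` over the near-duplicate `b`, because the diversity term penalizes the highly similar pair.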

The goal is to select a list of recommended news documents which provides the largest possible value within the budget (i.e., the budget can be regarded as the maximum number of articles in the recommended list). We can obtain a list of news documents for each subclass by adopting the greedy selection algorithm. Taking into account the other characteristics of news documents, for example, the popularity and the recency, the ranking of the selected news articles needs to be adjusted in order to make the recommended results more reasonable. Formally, given a news article $n$, the popularity and the recency can be combined as

$$n_\phi = \frac{H_{k_n} - H_{\min}}{H_{\max} - H_{\min}} - \frac{\operatorname{Rec}(n) - \operatorname{Rec}_{\min}}{\operatorname{Rec}_{\max} - \operatorname{Rec}_{\min}}, \tag{29}$$

where $H_{k_n}$ denotes the popularity degree of the subclass to which the news article $n$ belongs and $\operatorname{Rec}(n)$ can be obtained from (24). From the equation above, we note that the smaller the recency score is, the higher the article ranks. Besides, the greater the popularity is, the higher the article ranks. After computing the $n_\phi$ value of each article in the recommended list, we apply a quicksort algorithm to these articles according to $n_\phi$. With this adjustment, the generated ranking emphasizes popularity and freshness while still concentrating on news documents that satisfy the user's preference.
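The re-ranking by (29) can be sketched as follows. The tuple layout of `articles` is an assumption, a degenerate range (all values equal) is mapped to 0 by convention, and Python's built-in `sorted` is used in place of a hand-written quicksort since only the resulting order matters.

```python
def rank_articles(articles):
    """articles: list of (article_id, H_k, Rec) tuples.
    Returns ids ordered by n_phi of Eq. (29), highest first, and the scores."""
    hs = [h for _, h, _ in articles]
    rs = [r for _, _, r in articles]

    def norm(x, lo, hi):
        # Min-max normalization; 0.0 when all values are equal (assumed convention).
        return (x - lo) / (hi - lo) if hi > lo else 0.0

    score = {a: norm(h, min(hs), max(hs)) - norm(r, min(rs), max(rs))
             for a, h, r in articles}
    return sorted(score, key=score.get, reverse=True), score

# Hypothetical list: (id, subclass popularity H_k, recency in days).
order, score = rank_articles([("x", 0.8, 0.5), ("y", 0.2, 0.1), ("z", 0.5, 3.0)])
```

Article `x` ranks first because it is both popular and fairly fresh, while the old article `z` drops to the end despite its moderate popularity.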

8. Experimental Evaluation

In this section we provide a comprehensive experimental evaluation to show the efficacy of our proposed news recommendation system. We start with an introduction to a real-world collection obtained from a news and microblog service website, SINA. After that, we describe the experimental design and show the results based on the recommendation framework introduced in this paper.


Table 2: Recommendation Micro-F1 (Top30) of different time periods for different classification-based systems.

Range (YM)   Articles   NB      Cheng   ZGuo    NEMAH
0908-0908    4239       0.204   0.242   0.270   0.351
0910–0912    37910      0.206   0.254   0.268   0.364
1001–1006    75047      0.227   0.289   0.297   0.403
1007–1107    151995     0.198   0.271   0.274   0.371
0908–1208    280737     0.210   0.273   0.284   0.383

Remark: the second column gives the number of news articles. Time range 0908 denotes August 2009.

8.1. Real-World Data Set. For the experiments, we gather news data from SINA (http://news.sina.com.cn), where the data collection ranges from August 1, 2009, to August 31, 2012. We also gather the users who comment on these articles and their microblogs from SINA (http://weibo.com), and we preprocess the data by removing microblog messages that are too short (i.e., less than 3 words) and nonactive users (i.e., users who tweeted or retweeted less than 10 messages) in order to verify our recommendation performance. After preprocessing, 5,127 users are stored with 124,301 messages and 280,737 news articles.

8.2. Experiments. Our system has four major components: (1) a module responsible for classifying and clustering news articles, (2) a component for constructing and updating user profiles, (3) hot news subclass prediction based on time-series analysis, and (4) a recommendation component using news clusters and user profiles, accompanied by the subclass popularity factor and recency. From the experimental perspective, we first verify our components. Then we compare our system to the state-of-the-art approaches and design a user study.

8.2.1. Classification and Clustering Evaluation. In order to evaluate the performance of the classification and clustering component, we design two experiments.

(1) Classification Comparison. Many classification methods have been proposed in the field of text processing in the past decade. We implement the following three classification methods: the method of Cheng et al. [33], the method of Guo et al. [34], and the Naïve Bayesian (NB) method. Cheng proposed a text classification method based on refining the concept index, and Guo employed a genetic algorithm for classification. Before using the classification module, we must set $\alpha$ in (8) and decide the threshold of feature selection through an offline experiment, as shown in Figure 6, where T-10 denotes that threshold = 10 in feature selection and the F-score is Micro-F1. The performance is roughly best when $\alpha = 0.2$. From the result, we also observe that the thresholds 20, 30, and 40 produce similar results, so we use $T = 20$ in our processing.

Table 2 lists the recommendation evaluation results for the different classification methods. Based on the comparison, our proposed method outperforms the classical Naïve Bayesian method and the methods of Cheng et al. and Guo et al. in terms of the F1 measure. A straightforward explanation for the improvement is that our method represents news articles with fewer features, selected by the method we proposed, and implements a series of two-class classifications to alleviate the similarity problem between different classes. The most important reason may be that we incorporate the manually classified key persons into the method.

Figure 6: α parameter selection via classification (curves T-10, T-20, T-30, and T-40; x-axis: α; y-axis: F1 score). Remark: y-axis is the F1 measure score of our classification using different α.

(2) Clustering Comparison. In reality, we need to cluster the news articles into subclasses every day, or even every hour. From our spider software, we know that thousands of news articles arrive per day. K-means and hierarchical clustering are the most common clustering algorithms. In order to verify our proposed method, we design the experiment as follows: (1) use 500, 1,000, and 1,500 as the number of newly published articles for processing; (2) for each scale of dataset, implement classification on these data; (3) perform K-means, hierarchical clustering, and our proposed clustering method on these data; (4) perform Top-30 news recommendation; and (5) compute the F1 score for the different clustering-based systems. The comparison of recommendation on the different subclass clustering methods is shown in Figure 7.
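Steps (4) and (5) of the procedure above score a Top-N recommendation list by its F1 value. A minimal sketch of that computation is given below. The function name `f1_at_n` is hypothetical, and we assume the relevant set is the set of articles a user actually read, since the text does not state how relevance is determined.

```python
def f1_at_n(recommended, relevant, n=30):
    """F1 score of the first n recommended items against a relevant set.

    `recommended` is a ranked list of article ids and `relevant` is a set
    of article ids (assumed here to be the articles the user read).
    """
    top_n = recommended[:n]
    hits = sum(1 for item in top_n if item in relevant)
    if hits == 0:
        return 0.0
    precision = hits / len(top_n)
    recall = hits / len(relevant)
    return 2 * precision * recall / (precision + recall)
```

Averaging this value over all test users would give one bar of a plot such as Figure 7.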

From the experimental result, we have the following observations: (1) NEMAH achieves a better result than the other methods in terms of F1 score; (2) NEMAH is more stable than the other methods. A straightforward explanation might be that K-means clustering needs an initial clustering center for each cluster. Besides, with fewer parameters, our proposed method has stronger generalization and learning ability, without requiring prior knowledge of the size and distribution of the text corpus.

8.2.2. User Profile and Subclass Popularity Prediction Evaluation. The user profile is an important factor in a recommendation system and can affect the recommendation result significantly. Our user profile construction includes the following factors: content, place name, and key person. Prior approaches often use the content or similar access patterns to construct the user profile. SCENE [3] used the content, similar access patterns, and entities, which are extracted by GATE [35]. In reality, entities such as place names and key person names are relatively stable over a period. Figure 8 shows the results of using different user profile building methods and subclass popularity prediction methods.

Figure 7: Recommendation performance of different data scales for different clustering-based systems (K-means, hierarchical, and NEMAH; y-axis: F1 score; series n = 500, n = 1,000, and n = 1,500). Remark: n denotes the number of news articles for clustering.

Figure 8: Recommendation F1 score of different profile factors and subclass popularity prediction methods (x-axis: T5, T10, T20, and T30; y-axis: F1 score; series C, C + GATE, C + P + K, C + P + K + E, and C + P + K + S). Remark: C: content; P: place name; K: key person name; GATE: entity names using the GATE tool; E: popularity prediction using three-time exponential smoothing; and S: popularity prediction using spectral analysis (employed by NEMAH).

From this result, we observe the following. (1) Our method performs better than using GATE. (2) Recommendation using content only cannot perform well, because microblog messages do not contain much content. (3) The spectral analysis employed in subclass popularity prediction can be better than the three-time exponential smoothing method. Although the average performance of spectral analysis is better than that of three-time exponential smoothing, in our other work on time series analysis we found that the cycle diagrams of some subclasses have weaker frequency signals, which tends to cause overfitting with a large number of sine-cosine pairs and yields worse results in these subclasses. SCENE [3] also used the popular degree, which is computed as the ratio of the number of users accessing the article to the size of the users' pool. However, this method contradicts freshness. The straightforward reason is that the freshest news may have received few clicks.

Table 3: Diversity evaluation on different recommendation lists.

Methods     T10      T20      T30
Goo         0.5104   0.4320   0.1215
ClickB      0.5231   0.4457   0.1587
Bilinear    0.5024   0.3547   0.1478
Bandit      0.6112   0.3874   0.2674
SCENE       0.6821   0.5747   0.5687
NEMAH       0.7425   0.6941   0.6637

Remark: Tn denotes the recommended result with top-n.

8.2.3. Diversity Evaluation. The recommended news list of NEMAH exhibits great diversity at both the class and subclass levels. Let R(u) be the recommended news list of a user u; the diversity of R(u) can be defined as follows:

\[
\text{Diversity} = 1 - \frac{\sum_{i,j \in R(u),\, i \neq j} \operatorname{sim}(i, j)}{\frac{1}{2}\,|R(u)|\,\left(|R(u)| - 1\right)}, \tag{30}
\]

where i and j are two different news articles in the recommendation list for user u and sim(i, j) denotes the news profile similarity between news items i and j. For this metric evaluation, we choose Goo [11] (a collaborative filtering based method), ClickB [4] (a content-based method), Bilinear [14], Bandit [15], and SCENE [3] (a hybrid method using LSH for clustering and a greedy algorithm for news selection) as the comparison baselines. Table 3 shows the diversity results with |R(u)| = 10, |R(u)| = 20, and |R(u)| = 30, in which we use T10 to represent |R(u)| = 10.
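The diversity measure in (30) can be sketched in a few lines. This is an illustrative reading rather than the authors' code: the function name `diversity` is hypothetical, `sim` is any symmetric similarity function supplied by the caller, and we assume the sum runs over unordered pairs, so that the denominator (the number of unordered pairs) turns the sum into an average similarity.

```python
from itertools import combinations


def diversity(rec_list, sim):
    """1 minus the average pairwise similarity of a recommendation list."""
    n = len(rec_list)
    if n < 2:
        return 0.0  # assumption: diversity is undefined for fewer than 2 items
    # Sum sim(i, j) over unordered pairs of distinct items.
    total = sum(sim(i, j) for i, j in combinations(rec_list, 2))
    return 1.0 - total / (0.5 * n * (n - 1))
```

For instance, with a toy similarity that returns 1 for two articles of the same class and 0 otherwise, a list of three articles in which two share a class has diversity 1 - 1/3.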

From Table 3, we can see that our system outperforms the others significantly. The straightforward reason is that we diversify the news not only according to the preference of the user but also according to the distribution of candidate news articles. As the recommendation list is enlarged, the diversity of the baseline methods decreases significantly, because they rely too much on the preference of the user.

8.2.4. System Accuracy Evaluation. In order to verify the effectiveness of our proposed NEMAH, we implement a recommender system that models the recommendation as a contextual bandit problem [15]. We also implement the SCENE [3] prototype system, which employs LSH (Locality Sensitive Hashing) for news clustering and greedy selection for user recommendation. For each method, we select 50 users and provide news recommendation results for them. Figure 9 shows the comparison results for the Top 10, Top 20, and Top 30 news items for each user.

Figure 9: Precision-recall plot of different recommendations: (a) Top 10, (b) Top 20, and (c) Top 30 (axes: precision and recall). Remark: ∘ (green) denotes the bandit-based recommender, ◻ (blue) denotes the SCENE recommender, and ∗ (red) represents NEMAH.

In the above experiments, we can observe that, besides achieving higher accuracy, the results of our system are more stably distributed than those of the other approaches.

In reality, if users read only a few news articles every day, many news recommendation systems cannot produce good results for these users. Our system can address this problem thanks to the microblog user profile building. Figure 10 shows the comparison results for different user groups over all users (5,127 users). Suppose a reader reads N news articles per day. From this figure, we can see that our proposed system achieves a reasonable result for nonactive users. SCENE also achieves a fairly good result. The reason is that NEMAH and SCENE consider the named entities referred to by users. Besides, NEMAH takes into account the popular degree of a news article.

8.2.5. A User Study on NEMAH. In order to obtain other evaluation metrics to verify our proposed news recommendation system, we develop a prototype system of NEMAH and design a questionnaire which includes the following questions: (1) satisfaction with news content; (2) ordering of the recommendation list; (3) preference of the news subclasses; (4) popularity of news articles; and (5) novelty of the recommendation list. For each question, we define 5 indexes for selection, where 1: So Bad; 2: Not Very Well; 3: Average; 4: Good but Needs to Improve; and 5: Excellent. We crawl news articles of the latest three days from several famous news websites as a candidate set for recommendation. Finally, we hire 50 volunteers, who are required to have a microblog account on the SINA website, to help us complete the questionnaire. We send them the same questionnaire with different recommendation lists every week, three times in total. The average result of this user study is shown in Figure 11. From the result of the user study, we can see that NEMAH can satisfy the requirements represented by our questions for most people.

Figure 10: Comparison of F1 score of different approaches (Goo, ClickB, Bandit, SCENE, and NEMAH) for different user groups (N < 10, 10 ≤ N < 30, 30 ≤ N < 50, and N ≥ 50; y-axis: F1 score). Remark: N denotes the number of news articles per day which is read by a user.

Figure 11: User study on different metrics (Satisfaction, Ordering, Preference, Popular, and Novelty; y-axis: average score; series ClickB and NEMAH).

9. Conclusion

In this paper, we proposed the NEMAH system architecture to tackle personalized news recommendation based on microblog and subclass popularity prediction. We explore the intrarelations among microblog content and news items and consider the subclass popularity factor, similarity among users, place, and key person factors synthetically. Our system supports effective classification and subclass clustering on newly published news articles along with a small history corpus. High-quality classification and clustering can construct a better data structure for recommending. Experimental results compared with some state-of-the-art algorithms have demonstrated the better performance of NEMAH. Besides, our work in Sections 4 and 5 can be utilized for automatic module layout and channel ranking.

For future work, to process massive network news articles, we plan to deploy some components (e.g., classification, clustering, and subclass popularity analysis) onto the Map-Reduce framework on our distributed system. Moreover, we also plan to integrate the subclass popularity prediction module into our news search engine, owing to its effectiveness in our work. Another noteworthy point is the interest evolution of users (e.g., time, place, and other factors), which may provide insights into news reading behaviors.

Disclosure

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee, provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers, or to redistribute to lists requires prior specific permission and/or a fee.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

References

[1] O. Phelan, K. McCarthy, and B. Smyth, “Using twitter to recommend real-time topical news,” in Proceedings of the 3rd ACM Conference on Recommender Systems, pp. 385–388, ACM, October 2009.

[2] G. de Francisci Morales, A. Gionis, and C. Lucchese, “From chatter to headlines: Harnessing the real-time web for personalized news recommendation,” in Proceedings of the 5th ACM International Conference on Web Search and Data Mining (WSDM ’12), pp. 153–162, ACM, February 2012.

[3] L. Li, D. Wang, T. Li, D. Knox, and B. Padmanabhan, “SCENE: A scalable two-stage personalized news recommendation system,” in Proceedings of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR ’11), pp. 125–134, July 2011.

[4] J. Liu, P. Dolan, and E. R. Pedersen, “Personalized news recommendation based on click behavior,” in Proceedings of the 15th ACM International Conference on Intelligent User Interfaces (IUI ’10), pp. 31–40, ACM, February 2010.

[5] J. Schafer, J. Konstan, and J. Riedl, “Recommender systems in e-commerce,” in Proceedings of the 1st ACM Conference on Electronic Commerce, pp. 158–166, ACM, 1999.

[6] D. Billsus and M. J. Pazzani, “Personal news agent that talks, learns and explains,” in Proceedings of the 3rd Annual Conference on Autonomous Agents, pp. 268–275, ACM, May 1999.

[7] M. Pazzani and D. Billsus, “The identification of interesting web sites,” Machine Learning, vol. 27, no. 3, pp. 313–331, 1997.

[8] U. Shardanand and P. Maes, “Social information filtering: algorithms for automating ‘word of mouth’,” in Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pp. 210–217, ACM, May 1995.

[9] R. G. Cota, A. A. Ferreira, C. Nascimento, M. A. Goncalves, and A. H. F. Laender, “An unsupervised heuristic-based hierarchical method for name disambiguation in bibliographic citations,” Journal of the American Society for Information Science and Technology, vol. 61, no. 9, pp. 1853–1870, 2010.

[10] J. Breese, D. Heckerman, and C. Kadie, “Empirical analysis of predictive algorithms for collaborative filtering,” in Proceedings of the 14th Conference on Uncertainty in Artificial Intelligence, pp. 43–52, Morgan Kaufmann Publishers, 1998.

[11] A. S. Das, M. Datar, A. Garg, and S. Rajaram, “Google news personalization: Scalable online collaborative filtering,” in Proceedings of the 16th International World Wide Web Conference (WWW ’07), pp. 271–280, ACM, May 2007.

[12] M. Balabanovic and Y. Shoham, “Content-Based, Collaborative Recommendation,” Communications of the ACM, vol. 40, no. 3, pp. 66–72, 1997.

[13] M. Claypool, A. Gokhale, T. Miranda, P. Murnikov, D. Netes, and M. Sartin, “Combining content-based and collaborative filters in an online newspaper,” in Proceedings of the ACM SIGIR Workshop on Recommender Systems, vol. 60, Citeseer, 1999.

[14] W. Chu and S. T. Park, “Personalized recommendation on dynamic content using predictive bilinear models,” in Proceedings of the 18th International Conference on World Wide Web, pp. 691–700, ACM, 2009.

[15] L. Li, W. Chu, J. Langford, and R. E. Schapire, “A contextual-bandit approach to personalized news article recommendation,” in Proceedings of the 19th International Conference World Wide Web (WWW ’10), pp. 661–670, ACM, April 2010.

[16] C. Best, E. van der Goot, M. de Paola, T. Garcia, and D. Horby, “Europe media monitor-emm,” JRC Technical Note, no. I 2, 2002.

[17] E. Gabrilovich, S. Dumais, and E. Horvitz, “Newsjunkie: providing personalized newsfeeds via analysis of information novelty,” in Proceedings of the 13th International Conference World Wide Web (WWW ’04), pp. 482–490, ACM, May 2004.

[18] G. Wang, F. H. Lochovsky, and Q. Yang, “Feature selection with conditional mutual information MaxiMin in text categorization,” in Proceedings of the 13th ACM Conference on Information and Knowledge Management (CIKM ’04), pp. 342–349, ACM, November 2004.

[19] C. Lee and G. G. Lee, “Information gain and divergence-based feature selection for machine learning-based text categorization,” Information Processing and Management, vol. 42, no. 1, pp. 155–165, 2006.

[20] A. M. A. Mesleh, “Chi square feature extraction based svms arabic language text categorization system,” Journal of Computer Science, vol. 3, no. 6, pp. 430–435, 2007.

[21] Z. Wei, D. Miao, J. H. Chauchat, and C. Zhong, “Feature selection on chinese text classification using character N-grams,” in Rough Sets and Knowledge Technology, vol. 5009 of Lecture Notes in Computer Science, pp. 500–507, Springer, Berlin, Germany, 2008.

[22] Y. S. Lai and C. H. Wu, “Meaningful term extraction and discriminative term selection in text categorization via unknown-word methodology,” ACM Transactions on Asian Language Information Processing (TALIP), vol. 1, no. 1, pp. 34–64, 2002.

[23] R. Rifkin and A. Klautau, “In defense of one-vs-all classification,” The Journal of Machine Learning Research, vol. 5, pp. 101–141, 2004.

[24] A. E. Hoerl and R. W. Kennard, “Ridge regression: Biased estimation for nonorthogonal problems,” Technometrics, vol. 42, no. 1, pp. 80–86, 2000.

[25] F. Wang and C. Zhang, “Feature extraction by maximizing the average neighborhood margin,” in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR ’07), June 2007.

[26] X. Zhang, W. Wang, K. Nørvag, and M. Sebag, “K-AP: generating specified K clusters by efficient affinity propagation,” in Proceedings of the 10th IEEE International Conference on Data Mining (ICDM ’10), pp. 1187–1192, IEEE, December 2010.

[27] J. D. Hamilton, Time Series Analysis, vol. 2, Cambridge University Press, Cambridge, UK, 1994.

[28] F. Abel, Q. Gao, G.-J. Houben, and K. Tao, “Analyzing user modeling on Twitter for personalized news recommendations,” in User Modeling, Adaption and Personalization, vol. 6787 of Lecture Notes in Computer Science, pp. 1–12, Springer, Berlin, Germany, 2011.

[29] S. Gauch, M. Speretta, A. Chandramouli, and A. Micarelli, “User profiles for personalized information access,” in The Adaptive Web, vol. 4321, pp. 54–89, Springer, Berlin, Germany, 2007.

[30] G. L. Nemhauser, L. A. Wolsey, and M. L. Fisher, “An analysis of approximations for maximizing submodular set functions-I,” Mathematical Programming, vol. 14, no. 1, pp. 265–294, 1978.

[31] S. Khuller, A. Moss, and J. Naor, “The budgeted maximum coverage problem,” Information Processing Letters, vol. 70, no. 1, pp. 39–45, 1999.

[32] J. Leskovec, A. Krause, C. Guestrin, C. Faloutsos, J. Vanbriesen, and N. Glance, “Cost-effective outbreak detection in networks,” in Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 420–429, ACM, August 2007.

[33] X. Cheng, S. Tan, and L. Tang, “Using dragpushing to refine concept index for text categorization,” Journal of Computer Science and Technology, vol. 21, no. 4, pp. 592–596, 2006.

[34] Z. Guo, L. Lu, S. Xi, and F. Sun, “An effective dimension reduction approach to chinese document classification using genetic algorithm,” in Advances in Neural Networks - ISNN 2009, vol. 5552 of Lecture Notes in Computer Science, pp. 480–489, Springer, Berlin, Germany, 2009.

[35] H. Cunningham, D. Maynard, K. Bontcheva, and V. Tablan, “Gate: an architecture for development of robust hlt applications,” in Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, pp. 168–175, Association for Computational Linguistics, 2002.


(2) The Quick Affinity Propagation Clustering on Subspace. After projecting the initial text vector space into a subspace through the projective matrix, we can generate $K$ clusters by employing $K$-Affinity Propagation ($K$-AP) in the subspace (this method is more suitable for text clustering because it can achieve more reasonable clusters than the traditional clustering methods [26]). Let the similarity of $y_i$ and $y_j$ in the subspace $Y = \{y_1, y_2, \ldots, y_n\}$ be $S = \{s_{ij}\}$. The target of $K$-AP is to find the $K$ real samples $E = \{e_1, e_2, \ldots, e_K\}$ which denote the $K$ classes $C = \{C_1, C_2, \ldots, C_K\}$ and then maximize the following objective function:

\[
\max F\left(\{C_j\}_{j=1}^{K}\right) = \sum_{j=1}^{K} \sum_{y_i \in C_j} s(y_i, e_j), \tag{14}
\]

where $e_j$ belongs to $C_j$. The objective function can be transformed into a 0-1 integer programming problem by introducing the binary parameters $B = \{b_{ij} \in \{0, 1\},\ i, j = 1, \ldots, n\}$, as shown in the following:

\[
\max F\left(\{b_{ij}\}\right) = \sum_{i=1}^{n} \sum_{j=1}^{n} b_{ij}\, s(y_i, y_j). \tag{15}
\]

Equation (15) has three constraints: (1) $b_{ii} = 1$ if $b_{ji} = 1$; (2) $\sum_{j=1}^{n} b_{ij} = 1$; and (3) $\sum_{i=1}^{n} b_{ii} = K$, where $b_{ij} = 1$ when $y_i$ considers $y_j$ as a sample and $b_{ii} = 1$ when $y_i$ is a sample itself. The first constraint means that $y_i$ is a sample when $y_j$ considers $y_i$ as a sample. The second one means that each data point has only one sample point. The last one means that the number of samples is $K$, which ensures that the $K$-AP method generates $K$ clusters.
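To make the integer program concrete, the sketch below evaluates the objective (15) for one candidate assignment and checks the three constraints. It does not solve the optimization, and it is not the message-passing procedure of K-AP [26]. The function name `kap_objective` and the encoding `exemplar_of[i] = j` (meaning $b_{ij} = 1$) are assumptions made for illustration.

```python
def kap_objective(sim, exemplar_of, k):
    """Value of objective (15) for a candidate assignment.

    sim:         n x n nested list, sim[i][j] = s(y_i, y_j)
    exemplar_of: list where exemplar_of[i] is the index of the sample
                 (exemplar) chosen by point i, i.e. b_ij = 1
    k:           required number of samples
    """
    n = len(sim)
    exemplars = set(exemplar_of)
    # Constraint (2) holds by construction: each point names exactly one sample.
    # Constraint (1): a point chosen as a sample must choose itself.
    if any(exemplar_of[e] != e for e in exemplars):
        raise ValueError("a chosen sample does not choose itself")
    # Constraint (3): exactly k samples.
    if len(exemplars) != k:
        raise ValueError("assignment does not use exactly k samples")
    return sum(sim[i][exemplar_of[i]] for i in range(n))
```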

(3) Hybrid Learning of Subspace Projection and Clustering on Adaptive Subspace. The class information updated in the subspace clustering process can be utilized as a priori knowledge in the next round of subspace projection, and after several iterations, once the process converges, we can obtain the global optimal clustering result. The iteration processing is as follows:

\[
\begin{aligned}
& X_0 \rightarrow K\text{-AP} \rightarrow L_0 \rightarrow \text{SubSpace} \rightarrow W_1,\ \text{Score}_1 \\
& Y_1 = W_1^{T} X_0 \rightarrow K\text{-AP} \rightarrow L_1 \rightarrow \text{SubSpace} \rightarrow W_2,\ \text{Score}_2 \\
& \cdots \rightarrow \cdots \rightarrow \cdots \rightarrow \cdots \rightarrow \cdots \\
& Y_t = W_t^{T} X_0 \rightarrow K\text{-AP} \rightarrow L_t \rightarrow \text{SubSpace} \rightarrow W_{t+1},\ \text{Score}_{t+1}
\end{aligned}
\]

The convergence function value must be computed in each iteration:

\[
\text{Score}_{t+1} = \operatorname{tr}\left[W_{t+1}^{T}\left(P(L_t) - Q(L_t)\right) W_{t+1}\right], \tag{16}
\]

where $P(L_t)$ and $Q(L_t)$ denote the average distances between classes and within class, which are calculated by (11) according to the class membership indicator matrix $L_t$. The iteration finishes when it meets the convergence condition $\text{Score}_{t+1} - \text{Score}_t \le \epsilon$ or reaches the maximum number of iterations. The parameters of our clustering method are the number of points $\eta$ which are the nearest in class and the number of points $\zeta$ which are the nearest out of class. We used cross-fold validation to tune these parameters, and we found that selecting $\zeta = \eta = 13$ for all classes per 1,000 documents performs better.
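The control flow of this alternating procedure can be sketched as below. It is only a skeleton under stated assumptions: `cluster`, `fit_projection`, and `apply_projection` are hypothetical callables standing in for the K-AP step, the subspace step (which returns the projection and its score from (16)), and the projection $Y = W^{T} X_0$; none of them is implemented here, and `max_iter` is assumed to be at least 1.

```python
def hybrid_learning(x0, cluster, fit_projection, apply_projection,
                    eps=1e-4, max_iter=50):
    """Alternate clustering and subspace projection until the score converges.

    cluster(data)              -> labels (stand-in for the K-AP step)
    fit_projection(x0, labels) -> (w, score) (stand-in for the subspace step)
    apply_projection(w, x0)    -> projected data (stand-in for W^T X_0)
    All three callables are placeholders supplied by the caller.
    """
    data = x0
    prev_score = None
    for t in range(max_iter):
        labels = cluster(data)                  # L_t
        w, score = fit_projection(x0, labels)   # W_{t+1}, Score_{t+1}
        if prev_score is not None and score - prev_score <= eps:
            break                               # convergence condition met
        prev_score = score
        data = apply_projection(w, x0)          # Y_{t+1}
    return labels, w, t + 1
```

Passing the three steps as callables keeps the loop independent of any particular clustering or projection implementation.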

Discussion. The motivation of this module (classification and clustering) is to find the user's preference (at the subclass level) and to track the hotness of newly published news in a given subclass.

5. Subclass Popularity Prediction Module

In today's explosion of information, the fast pace of life makes people focus their attention on popular subclasses rather than spend much time searching for and selecting information. Sometimes even users themselves have no idea what they really want. Therefore, hot subclass prediction technology with a recommendation function has become very important. News subclass popularity prediction can improve the performance of a news recommender system. Besides, it can also automatically improve the display of popular news modules on a website, reduce the workload of website editors, and improve the users' browsing experience.

In our study of historical statistical data on news subclasses, we found that some subclasses are popular periodically. For instance, the subclass college entrance examination becomes highly popular around June every year in China, and a lot of news articles and comments focus on this subclass at that time, as shown in Figures 4(b) and 4(a), which show the data of the college entrance examination subclass. In this paper, we define the degree of concern of a news subclass according to the number of news articles and their comments, as shown in the following:

\[
H_k = \lambda H_{ne}^{(k)} + (1 - \lambda) H_{re}^{(k)}
= \frac{N_{re}^{(k)}}{N_{ne} + N_{re}} \cdot \frac{N_{ne}^{(k)}}{N_{ne}}
+ \frac{N_{ne}^{(k)}}{N_{ne} + N_{re}} \cdot \frac{N_{re}^{(k)}}{N_{re}}, \tag{17}
\]

where $H_{ne}^{(k)}$ denotes the popular degree of news articles on the $k$th subclass; $H_{re}^{(k)}$ denotes the popular degree of comments on the $k$th subclass; $\lambda$ is the weight of the popular degree of news articles, and its value is $N_{re}^{(k)} / (N_{ne} + N_{re})$; $N_{re}^{(k)}$ denotes the number of reviews on the $k$th subclass; $N_{re}$ denotes the number of reviews in the whole corpus; $N_{ne}^{(k)}$ denotes the number of news articles on the $k$th subclass; and $N_{ne}$ denotes the number of news articles in the whole corpus. According to the experiments of time series analysis on our corpus, we found that most subclasses are suitable for the spectral analysis method [27].
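As a small worked illustration, the function below evaluates the expanded right-hand side of (17) from the four counts. The function and argument names are hypothetical, and the code follows the two product terms as printed rather than the $\lambda$ form, so it should be read as a sketch of one reading of the formula.

```python
def subclass_popularity(n_ne_k, n_re_k, n_ne, n_re):
    """Degree of concern H_k of subclass k from article and review counts.

    n_ne_k, n_re_k: news articles / reviews in subclass k
    n_ne, n_re:     news articles / reviews in the whole corpus
    """
    total = n_ne + n_re
    # First product term of (17): review-based weight times article share.
    term_news = (n_re_k / total) * (n_ne_k / n_ne)
    # Second product term of (17): article-based weight times review share.
    term_reviews = (n_ne_k / total) * (n_re_k / n_re)
    return term_news + term_reviews
```

For example, with 10 of 100 articles and 40 of 400 reviews in the subclass, the two terms are 0.008 and 0.002.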

Any stationary sequence can be modeled as a combination of many cosine waves with different frequencies, amplitudes, and phases. This analysis method is called the frequency-domain analysis method. The linear combination of $m$ cosines with arbitrary amplitudes, frequencies, and phases is shown in the following:

\[
Y_t = A_0 + \sum_{j=1}^{m} \left[A_j \cos(2\pi f_j t) + B_j \sin(2\pi f_j t)\right]. \tag{18}
\]

Figure 4: Periodic subclass news distribution: (a) popularity degree of graduate entrance examination subclass; (b) popularity degree of college entrance examination subclass (x-axis: days; y-axis: proportion). Remark: x-axis denotes the date from August 1, 2009 to May 3, 2012; y-axis denotes the value of $H_k$ in (17).

The values of $A$ and $B$ can be obtained by ordinary least squares regression. When the frequencies have a special form, the calculation becomes very simple. If $n$ is an odd number, which can be expressed as $n = 2k + 1$, then the frequencies of the form $1/n, 2/n, \ldots, k/n$ are called Fourier frequencies. The estimated parameters are as follows:

\[
\begin{aligned}
A_0 &= \bar{Y}, \\
A_j &= \frac{2}{n} \sum_{t=1}^{n} Y_t \cos\left(\frac{2\pi t j}{n}\right), \\
B_j &= \frac{2}{n} \sum_{t=1}^{n} Y_t \sin\left(\frac{2\pi t j}{n}\right).
\end{aligned} \tag{19}
\]

If the sample size is even, which can be expressed as $n = 2k$, (19) still holds. But the equations change to the following when $f_k = k/n = 1/2$:

\[
A_k = \frac{1}{n} \sum_{t=1}^{n} (-1)^{t} Y_t, \qquad B_k = 0. \tag{20}
\]

Definition 1. When the sample size is odd, namely, $n = 2k + 1$, we define the cycle diagram (periodogram) $I$ at frequency $f = j/n$ ($j = 1, 2, \ldots, k$) as shown in the following equation:

\[
I\left(\frac{j}{n}\right) = \frac{n}{2}\left(A_j^2 + B_j^2\right). \tag{21}
\]

If the sample size is even, (19) still gives the $A$ and $B$ values, and the cycle diagram is the same as in the odd case. But in the extreme frequency case, that is, when $f = k/n = 1/2$, the cycle diagram is shown in the following equation:

\[
I\left(\frac{j}{n}\right) = n \left(A_j\right)^2. \tag{22}
\]

The periodogram at frequency $f = j/n$ is proportional to the sum of the squares of the corresponding regression coefficients. Therefore, the peaks of the periodogram show the relative intensity of the sine-cosine pairs at different frequencies, as shown in Figure 5.
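The estimates at the Fourier frequencies and the resulting periodogram can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the authors' code: the function names are hypothetical, the observations are indexed $t = 1, \ldots, n$, and the sine form of $B_j$ from the standard Fourier-frequency estimates is used.

```python
import math


def fourier_coefficients(y, j):
    """Estimates (A_j, B_j) at the Fourier frequency j/n for a series y."""
    n = len(y)
    if n % 2 == 0 and 2 * j == n:
        # Extreme frequency f = 1/2 for an even sample size.
        a = sum((-1) ** t * y_t for t, y_t in enumerate(y, start=1)) / n
        return a, 0.0
    a = 2.0 / n * sum(y_t * math.cos(2 * math.pi * t * j / n)
                      for t, y_t in enumerate(y, start=1))
    b = 2.0 / n * sum(y_t * math.sin(2 * math.pi * t * j / n)
                      for t, y_t in enumerate(y, start=1))
    return a, b


def periodogram(y):
    """Map each Fourier frequency j/n (j = 1..floor(n/2)) to its intensity."""
    n = len(y)
    result = {}
    for j in range(1, n // 2 + 1):
        a, b = fourier_coefficients(y, j)
        if n % 2 == 0 and 2 * j == n:
            result[j / n] = n * a * a
        else:
            result[j / n] = n / 2.0 * (a * a + b * b)
    return result
```

The frequencies with the largest values in the returned mapping are the candidates for the strong periods of a subclass.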

In Figure 5, the periodogram has two peaks, at 0.004970179 and 0.002982107; namely, the subcycle $T = 1/f$ may be 201 days and 335 days. The other peaks are so low that they can be neglected. The two frequencies are selected for building the model, which means that the model has two sine-cosine pairs in it, as shown in the following:

\[
Y_t = \beta + \beta_1 \cos(2\pi f_1 t) + \beta_2 \sin(2\pi f_1 t) + \beta_3 \cos(2\pi f_2 t) + \beta_4 \sin(2\pi f_2 t) + e_t. \tag{23}
\]

Using the spectral analysis method for prediction involves several steps. First, we use the periodogram to obtain the values and the number of strong frequencies. Second, a model is generated from the values and the number of strong frequencies. Finally, we predict future data values according to the model, which only requires a time parameter.
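The three steps above can be sketched as follows. This is a simplified illustration with hypothetical function names: it restricts the candidate frequencies to the Fourier frequencies below 1/2, where the closed-form estimates coincide with the least squares fit, whereas the frequencies fitted in the paper are not necessarily restricted this way.

```python
import math


def fit_spectral_model(y, m=2):
    """Keep the m strongest Fourier frequencies of y (indexed t = 1..n)."""
    n = len(y)
    mean = sum(y) / n
    candidates = []
    for j in range(1, (n - 1) // 2 + 1):  # Fourier frequencies below 1/2
        a = 2.0 / n * sum(y_t * math.cos(2 * math.pi * t * j / n)
                          for t, y_t in enumerate(y, start=1))
        b = 2.0 / n * sum(y_t * math.sin(2 * math.pi * t * j / n)
                          for t, y_t in enumerate(y, start=1))
        intensity = n / 2.0 * (a * a + b * b)  # periodogram value
        candidates.append((intensity, j / n, a, b))
    candidates.sort(reverse=True)  # strongest peaks first
    return mean, [(f, a, b) for _, f, a, b in candidates[:m]]


def predict(model, t):
    """Evaluate the fitted model at time t (the only input it needs)."""
    mean, terms = model
    return mean + sum(a * math.cos(2 * math.pi * f * t)
                      + b * math.sin(2 * math.pi * f * t)
                      for f, a, b in terms)
```

With `m=2` the model has the same shape as a two-pair sine-cosine model, and `predict(model, t)` can be called for any future day `t`.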

Discussion. The motivation of this module is to obtain the hotness of each subclass. Some recent studies also take into account the popularity of newly published news articles. For example, SCENE [3] used the popular degree, which is computed as the ratio of the number of users accessing the article to the size of the users' pool. However, for the newest popular news article $n_i$, its number of clicks would be less than that of news articles published several hours or days before.

6. User Profile Module

In order to capture a user's reading interest in news items, a personalized news recommendation system generally needs to construct the user's profile. Traditionally, the user profile can be captured by tracking the user's reading history. A survey of various user profile construction techniques is provided in [28, 29]. In this paper, we use the microblog to construct the user's profile. The reason is that a user who is interested in some subclasses will tend to tweet or retweet microblogs on these subclasses. For instance, if a user tweets or retweets many messages about basketball games, we can deduce that this user may like reading basketball news reports (e.g., NBA, CBA, etc.). Besides, many readers tend to glance at news articles and are interested in some key persons' names. Moreover, people from different areas tend to read the news from their living city or their hometown. Based on the above analysis, we propose to construct users' profiles by exploring the factors discussed above: microblog content, place names, and preferred key persons. In order to reduce the computational complexity, preference is also taken into account in our model, which can be represented by a vector $U_{pf} = \{\tau, \rho, \kappa\}$. Consider the following.

Figure 5: Periodogram of the popularity degree of college entrance examination subclass: (a) the complete periodogram (frequency axis from 0.0 to 0.5); (b) local amplification of the periodogram (frequency axis from 0.00 to 0.05). Remark: x-axis denotes the possible frequency of the popularity degree; y-axis represents the energy of the corresponding frequency.

(1) $\tau$ represents the distribution of key index words in the microblogs that the user tweeted or retweeted in the past; it can be expressed as a vector $\langle t_1, w_1\rangle, \langle t_2, w_2\rangle, \ldots$, where each element consists of an index word and its corresponding weight.

(2) $\rho$ represents the place names that appeared in the microblogs of a specific user; it can be expressed as $\langle p_1, w_1\rangle, \langle p_2, w_2\rangle, \ldots, \langle p_i, w_i\rangle, \ldots$, where $p_i$ denotes a place name and $w_i$ denotes the number of times this place appears in the tweets of the given user. We collect all the city and province names in China. Some place names are subordinate to others; for instance, GuangZhou city is subordinate to GuangDong province. In this case, the system adds weight to GuangDong using $w_{\text{GuangDong}} \mathrel{+}= w_{\text{GuangZhou}}$ when GuangZhou appears.

(3) $\kappa$ represents the list of key persons' names extracted from the user's microblogs, $\langle k_1, w_1\rangle, \langle k_2, w_2\rangle, \ldots$, where the name list is constructed from the NanFang Daily training corpus, in which the key persons' names have been tagged in each news article.
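The three profile components above can be sketched in a few lines. This is only an illustration under stated assumptions: messages are assumed to be already tokenized, raw counts stand in for the index-word weights, the `PARENT` map holds two hypothetical entries instead of the full list of Chinese cities and provinces, and the province update is read as one unit of province weight per city mention.

```python
from collections import Counter

# Hypothetical city -> province map (illustrative entries only).
PARENT = {"GuangZhou": "GuangDong", "ShenZhen": "GuangDong"}
PLACES = set(PARENT) | set(PARENT.values())


def place_weights(tokenized_messages):
    """Count place mentions; a city mention also adds weight to its province."""
    weights = Counter()
    for tokens in tokenized_messages:
        for tok in tokens:
            if tok in PLACES:
                weights[tok] += 1
                parent = PARENT.get(tok)
                if parent is not None:
                    weights[parent] += 1  # propagate the mention upward
    return weights


def build_profile(tokenized_messages, key_persons):
    """Return U_pf as a dict holding the components tau, rho, and kappa."""
    words = Counter(t for msg in tokenized_messages for t in msg)
    kappa = Counter({t: c for t, c in words.items() if t in key_persons})
    rho = place_weights(tokenized_messages)
    # Assumption: raw term counts are used in place of the paper's weights.
    tau = Counter({t: c for t, c in words.items()
                   if t not in key_persons and t not in PLACES})
    return {"tau": tau, "rho": rho, "kappa": kappa}
```

With this reading, a user who mentions GuangZhou twice and GuangDong once ends up with a GuangDong weight of 3.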

7. Personalized News Recommendation Module

The recommendation module can be divided into two steps: Rough Selection (see Section 7.1) and Precise Selection (see Section 7.3). In the first step, some subclasses are matched according to the user's preference. Then we select news articles from these subclasses using our selection strategy.

7.1. Rough Selection: Subclass Selection for a User. Once we obtain the subclasses and the user's profile, we can calculate the similarity between each subclass and a given user. We use TF-IDF weights to represent the vector of a given subclass, $\tau_s = \langle t_1, w_1\rangle, \langle t_2, w_2\rangle, \ldots$. The similarity between a subclass and a user (represented as $\tau_u = \langle t_1, w_1\rangle, \langle t_2, w_2\rangle, \ldots$ in $U_{pf}$; see Section 6) is computed by cosine similarity. In general, users tend to prefer some particular subclasses; that is, they are not interested in all subclasses. Therefore, we can roughly select some subclasses with a similarity threshold. This threshold is set to the 30% point of the ranking of all similarity scores with respect to a given user.
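The rough selection step can be sketched as follows. The sketch assumes sparse vectors stored as `{term: weight}` dictionaries and interprets the threshold as keeping the top-ranked share of subclasses; `keep_ratio` and the function names are illustrative, not taken from the paper.

```python
import math


def cosine(a, b):
    """Cosine similarity of two sparse vectors given as {term: weight} dicts."""
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rough_select(user_tau, subclass_vectors, keep_ratio=0.3):
    """Rank subclasses by similarity to the user and keep the top share.

    Assumption: the threshold is read as "keep the best keep_ratio of the
    ranked subclasses", with at least one subclass always kept.
    """
    ranked = sorted(subclass_vectors,
                    key=lambda name: cosine(user_tau, subclass_vectors[name]),
                    reverse=True)
    keep = max(1, math.ceil(keep_ratio * len(ranked)))
    return ranked[:keep]
```

A fixed similarity cutoff would be an equally plausible reading; the rank-based form has the advantage of always returning some subclasses for users with sparse profiles.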

7.2. News Profile Construction. After obtaining the news clusters that the user might be interested in, the next step is to select specific news articles for the given user. As for users, we initially maintain a news profile for each news article, and then model the recommendation as a budgeted maximum coverage problem and solve it by a greedy selection algorithm. A news profile contains many similar factors, for example, key person, place, the cluster it belongs to, recency, popularity, and so forth. For the popularity, as discussed above, we use $H_k$ to represent the popularity degree of cluster $k$. The recency score is given by

$$\mathrm{Rec}(i) = \frac{i_c - i_p}{24 \times 60}, \tag{24}$$

where $\mathrm{Rec}(i)$ returns the recency score of news article $i$, and $i_c$ and $i_p$ denote the current time and the published time, respectively.

In this paper, news profiles help to evaluate how well a news article can satisfy the user. Given a news profile $N_{pf} = \{\rho, \kappa, \upsilon\}$ and a user's profile $U_{pf} = \{\rho, \kappa, \upsilon\}$, the similarity between $N_{pf}$ and $U_{pf}$ is computed as

$$\mathrm{sim}(N_{pf}, U_{pf}) = \gamma_1\,\mathrm{sim}(\rho_n, \rho_u) + \gamma_2\,\mathrm{sim}(\kappa_n, \kappa_u) + \gamma_3\,\mathrm{sim}(\upsilon_n, \upsilon_u), \tag{25}$$

where $\gamma_1$, $\gamma_2$, and $\gamma_3$ are parameters that control how much we trust or weigh the corresponding components and are set to 1 in our system. Each component is calculated by cosine similarity.
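Equations (24) and (25) translate directly into code. The sketch below assumes that timestamps are expressed in minutes (so that dividing by 24 × 60 yields days) and that each profile is a dictionary with the hypothetical keys `rho`, `kappa`, and `upsilon`, each mapping to a sparse `{term: weight}` vector.

```python
import math


def cosine(a, b):
    """Cosine similarity of two sparse vectors given as {term: weight} dicts."""
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def recency(current_minutes, published_minutes):
    """Equation (24): article age, (i_c - i_p) / (24 * 60).

    Assumption: both timestamps are in minutes, so the result is in days.
    """
    return (current_minutes - published_minutes) / (24 * 60)


def profile_similarity(news_pf, user_pf, gammas=(1.0, 1.0, 1.0)):
    """Equation (25): weighted sum of per-component cosine similarities."""
    keys = ("rho", "kappa", "upsilon")  # hypothetical component names
    return sum(g * cosine(news_pf.get(k, {}), user_pf.get(k, {}))
               for g, k in zip(gammas, keys))
```

A missing component contributes zero here, which is a design choice of the sketch rather than something the paper specifies.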

Let $E$ be a finite set and $f$ a real-valued nondecreasing function defined on the subsets of $E$ that satisfies

$$f(T \cup \{\zeta\}) - f(T) \le f(S \cup \{\zeta\}) - f(S), \tag{26}$$

where $S \subseteq T$, $S$ and $T$ are two subsets of $E$, and $\zeta \in E \setminus T$. Such a function $f$ is called a submodular function [30]. By adding an element to a larger set $T$, the value increment of $f$ cannot be larger than that obtained by adding the element to a smaller set $S$. The budgeted maximum coverage problem can be described as follows: given a set of elements $E$, in which each element is associated with an influence and a cost defined over a domain of these elements, and a budget $B$, the goal is to find a subset of $E$ that has the largest influence while the total cost does not exceed the budget $B$. This problem is NP-hard [31]. However, [31] proposed a greedy algorithm that sequentially picks the element that yields the largest possible increase in influence within the cost limit. Submodularity resides in each pick-up step. By the result of [32], submodular functions are closed under nonnegative linear combinations.
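To make the greedy idea concrete, here is a minimal ratio-greedy sketch for a toy coverage instance. It is not claimed to reproduce the exact procedure or approximation guarantee of [31]; the data layout (`{name: (cost, covered_items)}`) and the gain-per-cost rule are assumptions chosen for illustration, and costs are assumed to be positive.

```python
def greedy_budgeted_cover(candidates, budget):
    """Pick elements by best marginal coverage per unit cost within a budget.

    candidates: {name: (cost, set_of_covered_items)}, with cost > 0.
    Returns the chosen names in pick order and the set of covered items.
    """
    chosen, covered, spent = [], set(), 0.0
    remaining = dict(candidates)
    while remaining:
        best, best_ratio = None, 0.0
        for name, (cost, items) in remaining.items():
            if spent + cost > budget:
                continue  # would exceed the budget
            ratio = len(items - covered) / cost  # marginal gain per cost
            if ratio > best_ratio:
                best, best_ratio = name, ratio
        if best is None:
            break  # nothing affordable adds coverage
        cost, items = remaining.pop(best)
        chosen.append(best)
        covered |= items
        spent += cost
    return chosen, covered
```

The marginal gain `len(items - covered)` shrinks as `covered` grows, which is exactly the diminishing-returns property stated in (26).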

7.3. Precise Selection: News Selection for Recommendation. In a given news subclass, we observe that most news articles concentrate on a similar topic, with minor differences in the major aspects of that topic. Typically, a reader is interested in some aspects of the given subclass but not all of them. Based on this intuition, our news selection strategy can be described as follows.

Assume that $\mathcal{C}$ denotes the set of newly published news, $\mathcal{S}$ represents the set of selected news, and $\zeta$ denotes the news article being selected. After selecting a piece of news $\zeta$, we must ensure that

(i) the topic diversity should not deviate much in $\mathcal{S}$;
(ii) $\mathcal{S}$ should give more satisfaction to the given user;
(iii) $\mathcal{S}$ should be similar to the general topic in $\mathcal{C} \setminus \mathcal{S}$.

For each of the above strategies, similar to [3], we define a quality function $q(\mathcal{S})$ to evaluate the value of the currently selected news set $\mathcal{S}$ as follows:

$$q(\mathcal{S}) = \frac{1}{\binom{|\mathcal{S}|}{2}} \sum_{n_1, n_2 \in \mathcal{S},\, n_1 \neq n_2} -\mathrm{sim}(n_1, n_2) + \frac{1}{|\mathcal{S}|} \sum_{n_1 \in \mathcal{S}} \mathrm{sim}(u, n_1) + \frac{1}{|\mathcal{C} \setminus \mathcal{S}| \cdot |\mathcal{S}|} \sum_{n_1 \in \mathcal{C} \setminus \mathcal{S}} \sum_{n_2 \in \mathcal{S}} \mathrm{sim}(n_1, n_2), \tag{27}$$

where $n_1$ and $n_2$ denote news items, $u$ denotes the given user, and $\mathrm{sim}(\cdot, \cdot)$ returns the similarity of its two arguments. Equation (27) contains three components corresponding to the news selection strategies listed above; $q(\mathcal{S})$ balances the contribution of the different components. Suppose $\zeta$ is the candidate news document; the quality increase can be represented as

$$I(\zeta) = q(\mathcal{S} \cup \{\zeta\}) - q(\mathcal{S}). \tag{28}$$
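Equations (27) and (28) can be sketched as a small greedy loop. The sketch makes several assumptions that the text does not spell out: the first sum runs over unordered pairs, any term whose normalizer would be zero is treated as 0, the budget is the maximum list length, and `sim` is a caller-supplied similarity function that accepts both users and news items.

```python
from itertools import combinations


def quality(selected, candidates, user, sim):
    """Equation (27) for a selected list S drawn from the candidate list C."""
    rest = [n for n in candidates if n not in selected]
    score = 0.0
    if len(selected) >= 2:
        pairs = list(combinations(selected, 2))  # C(|S|, 2) unordered pairs
        score -= sum(sim(a, b) for a, b in pairs) / len(pairs)
    if selected:
        score += sum(sim(user, n) for n in selected) / len(selected)
    if rest and selected:
        total = sum(sim(r, s) for r in rest for s in selected)
        score += total / (len(rest) * len(selected))
    return score


def greedy_select(candidates, user, sim, budget):
    """Repeatedly add the article with the largest increase I (equation (28))."""
    selected = []
    while len(selected) < min(budget, len(candidates)):
        base = quality(selected, candidates, user, sim)
        pool = [n for n in candidates if n not in selected]

        def gain(z):
            return quality(selected + [z], candidates, user, sim) - base

        selected.append(max(pool, key=gain))
    return selected
```

In a toy case with two near-duplicate articles that match the user and one unrelated article, the second pick goes to the unrelated article, because the pairwise penalty in the first term outweighs the extra user match.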

The goal is to select a list of recommended news documents that provides the largest possible value within the budget (i.e., the budget can be regarded as the maximum number of articles in the recommended list). We can obtain a list of news documents for each subclass by adopting the greedy selection algorithm. Taking into account other characteristics of news documents, for example, popularity and recency, the ranking of the selected news articles needs to be adjusted in order to make the recommended results more reasonable. Formally, given a news article $n$, the popularity and the recency can be combined as

$$n_\phi = \frac{H_{k_n} - H_{\min}}{H_{\max} - H_{\min}} - \frac{\mathrm{Rec}(n) - \mathrm{Rec}_{\min}}{\mathrm{Rec}_{\max} - \mathrm{Rec}_{\min}}, \tag{29}$$

where $H_{k_n}$ denotes the popularity degree of the subclass to which news $n$ belongs, and $\mathrm{Rec}(n)$ can be obtained from (24). From the equation above, we note that the smaller the recency score is, the higher the article ranks; likewise, the greater the popularity is, the higher the article ranks. After computing the $n_\phi$ value for the list of recommended articles, we sort these articles by $n_\phi$ using quicksort. With this adjustment, the generated ranking emphasizes popularity and freshness while still concentrating on news documents that satisfy the user's preference.
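The re-ranking in (29) can be sketched as below. The dictionary keys (`id`, `popularity`, `recency`) are hypothetical, the min and max are taken over the list being re-ranked, a degenerate range is mapped to 0, and Python's built-in `sorted` is used in place of the quicksort mentioned in the text.

```python
def min_max(value, lo, hi):
    """Min-max normalisation; returns 0.0 when all values are equal."""
    return (value - lo) / (hi - lo) if hi > lo else 0.0


def rerank(articles):
    """Sort articles by n_phi (equation (29)), highest first.

    articles: list of dicts with 'id', 'popularity' (H of the article's
    subclass), and 'recency' (Rec from equation (24)).
    """
    pops = [a["popularity"] for a in articles]
    recs = [a["recency"] for a in articles]

    def n_phi(a):
        return (min_max(a["popularity"], min(pops), max(pops))
                - min_max(a["recency"], min(recs), max(recs)))

    return sorted(articles, key=n_phi, reverse=True)
```

An article from a popular subclass that was published recently therefore rises to the top, while an equally popular but older article is pushed down.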

8. Experimental Evaluation

In this section, we provide a comprehensive experimental evaluation to show the efficacy of our proposed news recommendation system. We start with an introduction to a real-world collection obtained from a news and microblog service website, SINA. After that, we describe the experimental design and show the results based on the recommendation framework introduced in this paper.

Table 2: Recommendation Micro-F1 (Top30) of different time periods for different classification-based systems.

| Range (YM) | Number of news articles | NB | Cheng | ZGuo | NEMAH |
|---|---|---|---|---|---|
| 0908-0908 | 4239 | 0.204 | 0.242 | 0.270 | 0.351 |
| 0910–0912 | 37910 | 0.206 | 0.254 | 0.268 | 0.364 |
| 1001–1006 | 75047 | 0.227 | 0.289 | 0.297 | 0.403 |
| 1007–1107 | 151995 | 0.198 | 0.271 | 0.274 | 0.371 |
| 0908–1208 | 280737 | 0.210 | 0.273 | 0.284 | 0.383 |

Remark: the second column gives the number of news articles. Time range 0908 denotes August 2009.

8.1. Real-World Data Set. For the experiments, we gather news data from SINA (http://news.sina.com.cn), where the data collection ranges from August 1, 2009, to August 31, 2012. We also gather the users who comment on these articles and their microblogs from SINA (http://weibo.com), and preprocess the data by removing microblog messages that are too short (i.e., less than 3 words) and nonactive users (i.e., users who tweeted or retweeted less than 10 messages) in order to verify our recommendation performance. After preprocessing, 5,127 users are stored with 124,301 messages and 280,737 news articles.
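The two filtering rules can be sketched as follows. The input layout (`{user_id: [token lists]}`) is hypothetical, and the sketch assumes that user activity is counted after short messages have been removed; the text does not say which order was used.

```python
def preprocess(user_messages, min_words=3, min_messages=10):
    """Drop short messages, then drop users left with too few messages.

    user_messages: {user_id: [list of token lists]}.
    Assumption: the activity threshold is applied to the messages that
    survive the length filter.
    """
    kept = {}
    for user, messages in user_messages.items():
        long_enough = [m for m in messages if len(m) >= min_words]
        if len(long_enough) >= min_messages:
            kept[user] = long_enough
    return kept
```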

8.2. Experiments. Our system has four major components: (1) a module responsible for classifying and clustering news articles; (2) a component for constructing and updating user profiles; (3) hot news subclass prediction based on time-series analysis; and (4) a recommendation component using news clusters and user profiles, accompanied by the subclass popularity factor and recency. From the experimental perspective, we first verify our components. Then we verify our system against state-of-the-art approaches and design a user study.

8.2.1. Classification and Clustering Evaluation. In order to evaluate the performance of the classification and clustering component, we design two experiments.

(1) Classification Comparison. Many classification methods have been proposed in the past decade in the field of text processing. We implement the following three classification methods: the method of Cheng et al. [33], the method of Guo et al. [34], and the Naïve Bayesian (NB) method. Cheng proposed a text classification method based on refining the concept index, and Guo employed a genetic algorithm for classification. Before using the classification module, we must set $\alpha$ in (8) and decide the threshold of feature selection through an offline experiment, as shown in Figure 6, where T-10 denotes that threshold = 10 in feature selection and the F-score is Micro-F1. The performance is roughly best when $\alpha = 0.2$. From the result, we also observe that the thresholds 20, 30, and 40 produce similar results, so we use $T = 20$ in our processing.

Table 2 lists the recommendation evaluation results for the different classification methods. Based on the comparison, we see that our proposed method outperforms the classical Naïve Bayesian method and the methods of Cheng and Guo in terms of the F1 measure. A straightforward explanation for the improvement is that our method represents news articles with fewer features, selected by the method we proposed, and implements a series of two-class classifications to alleviate the similarity problem between different classes; the most important reason may be that we incorporate the key persons, which are classified manually, into the method.

Figure 6: $\alpha$ parameter selection via classification (curves for T-10, T-20, T-30, and T-40; $x$-axis: $\alpha$ from 0 to 0.7). Remark: the $y$-axis is the F1 measure score of our classification using different $\alpha$.

(2) Clustering Comparison. In reality, we need to cluster the news articles into subclasses every day, or even every hour. From our spider software, we know that thousands of news articles arrive per day. $K$-means and hierarchical clustering are the most common clustering algorithms. In order to verify our proposed method, we design the experiment as follows: (1) use 500, 1000, and 1500 as the number of newly published articles for processing; (2) for each dataset scale, perform classification on the data; (3) perform $K$-means, hierarchical clustering, and our proposed clustering method on the data; (4) perform Top30 news recommendation; and (5) compute the F1 score for the different clustering-based systems. The comparison of recommendation with different subclass clustering methods is shown in Figure 7.

From the experimental results, we make the following observations: (1) NEMAH achieves a better result than the other methods in terms of F1 score; (2) NEMAH is more stable than the other methods. A straightforward explanation might be that $K$-means clustering needs an initial clustering center for each cluster. Besides, with fewer parameters, our proposed method has stronger generalization and learning ability, without placing requirements on the size and distribution of the text corpus.

Figure 7: Recommendation performance (F1 score) at different data scales for different clustering-based systems ($K$-means, hierarchical, and NEMAH). Remark: $n$ denotes the number of news articles for clustering ($n = 500$, $1000$, $1500$).

8.2.2. User Profile and Subclass Popularity Prediction Evaluation. The user profile is an important factor in a recommendation system and can affect the recommendation result significantly. Our user profile construction includes the following factors: content, place name, and key person. Prior approaches often use the content or similar access patterns to construct the user profile. SCENE [3] used the content, similar access patterns, and entities extracted by GATE [35]. In reality, entities such as place names and key person names are relatively stable over a period. Figure 8 shows the results of using different user profile building methods and subclass popularity prediction methods.

Figure 8: Recommendation F1 score (at T5, T10, T20, and T30) of different profile factors and subclass popularity prediction methods (C; C + GATE; C + P + K; C + P + K + E; C + P + K + S). Remark: C: content; P: place name; K: key person name; GATE: entity names extracted using the GATE tool; E: popularity prediction using three-time exponential smoothing; S: popularity prediction using spectral analysis (employed by NEMAH).

From this result, we observe the following. (1) Our method performs better than using GATE. (2) Recommendation using content only cannot perform well, because microblog messages do not contain much content. (3) The Spectral Analysis employed in subclass popularity prediction can be better than the Three-Time Exponential Smoothing method. Although the average performance of Spectral Analysis is better than that of Three-Time Exponential Smoothing, in our other work on time series analysis we found that the cycle diagrams of some subclasses have weaker frequency signals, which tend to cause overfitting with a large number of sine-cosine pairs and lead to worse results for these subclasses. SCENE [3] also used a popularity degree, computed as the ratio of the number of users accessing the article to the size of the users' pool. However, this method conflicts with freshness. The straightforward reason is that the freshest news may have received only a few clicks.

Table 3: Diversity evaluation on different recommendation lists.

| Methods | T10 | T20 | T30 |
|---|---|---|---|
| Goo | 0.5104 | 0.4320 | 0.1215 |
| ClickB | 0.5231 | 0.4457 | 0.1587 |
| Bilinear | 0.5024 | 0.3547 | 0.1478 |
| Bandit | 0.6112 | 0.3874 | 0.2674 |
| SCENE | 0.6821 | 0.5747 | 0.5687 |
| NEMAH | 0.7425 | 0.6941 | 0.6637 |

Remark: T at $n$: recommended result with top-$n$.

8.2.3. Diversity Evaluation. The recommended news list of NEMAH shows great diversity at both the class and the subclass level. Let $R(u)$ be the news list recommended to a user $u$; the diversity for $u$ is defined as

$$\mathrm{Diversity} = 1 - \frac{\sum_{i, j \in R(u),\, i \neq j} \mathrm{sim}(i, j)}{\frac{1}{2}\,|R(u)|\,(|R(u)| - 1)}, \tag{30}$$

where $i$ and $j$ are two different news articles in the recommendation list for user $u$, and $\mathrm{sim}(i, j)$ denotes the news profile similarity between news items $i$ and $j$. For this metric, we choose Goo [11] (a collaborative filtering based method), ClickB [4] (a content-based method), Bilinear [14], Bandit [15], and SCENE [3] (a hybrid method using LSH for clustering and a greedy algorithm for news selection) as the comparison baselines. Table 3 shows the diversity results with $|R(u)| = 10$, $|R(u)| = 20$, and $|R(u)| = 30$, in which we use T10 to represent $|R(u)| = 10$.
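The diversity metric in (30) amounts to one minus the mean pairwise similarity of the list. The sketch below assumes that the sum runs over unordered pairs, which matches the normalizer $\frac{1}{2}|R(u)|(|R(u)| - 1)$, and returns 0 for lists with fewer than two items, a case the definition leaves open; `sim` is a caller-supplied similarity function.

```python
from itertools import combinations


def diversity(rec_list, sim):
    """Equation (30): one minus the mean pairwise similarity of the list."""
    n = len(rec_list)
    if n < 2:
        return 0.0  # assumption: undefined case mapped to 0
    total = sum(sim(i, j) for i, j in combinations(rec_list, 2))
    return 1.0 - total / (0.5 * n * (n - 1))
```

A list of identical items scores 0, and a list of mutually dissimilar items scores 1.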

From Table 3, we can see that our system outperforms the others significantly. The straightforward reason is that we diversify the news not only according to the preference of the user but also according to the distribution of the candidate news articles. As the recommendation list grows, the diversity of the baseline methods decreases significantly, because they rely too much on the preference of the user.

8.2.4. System Accuracy Evaluation. In order to verify the effectiveness of our proposed NEMAH, we implement a recommender system that models the recommendation as a contextual bandit problem [15]. We also implement the SCENE [3] prototype system, which employs LSH (Locality Sensitive Hashing) for news clustering and uses greedy selection for user recommendation. For each method, we select 50 users and provide news recommendation results for them. Figure 9 shows the comparison results for the Top 10, Top 20, and Top 30 news items for each user.

Figure 9: Precision-recall plot of different recommendations: (a) Top 10, (b) Top 20, (c) Top 30 (axes: precision and recall). Remark: ∘ (green) denotes the bandit-based recommender, ◻ (blue) denotes the SCENE recommender, and ∗ (red) represents NEMAH.

In the above experiments, we can observe that, besides its higher accuracy, the distribution of our system's results is more stable than that of the other approaches.
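For reference, precision, recall, and F1 for a Top-N list are usually computed per user as in the sketch below. The paper does not restate its exact definitions in this section, so the function and its arguments are illustrative assumptions rather than the evaluation code used in the experiments.

```python
def precision_recall_f1(recommended, relevant, top_n):
    """Per-user precision, recall, and F1 for the first top_n recommendations.

    recommended: ranked list of article ids.
    relevant: set of article ids the user actually read.
    """
    top = recommended[:top_n]
    hits = sum(1 for item in top if item in relevant)
    precision = hits / len(top) if top else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1
```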

In reality, if users read only a few news articles every day, many news recommendation systems cannot produce good results for these users. Our system can address this problem thanks to the microblog-based user profile building. Figure 10 shows the comparison results for different user groups over all users (5,127 users). Suppose a reader reads $N$ news articles per day. From this figure, we can see that our proposed system achieves a reasonable result for nonactive users. SCENE also achieves a fairly good result. The reason is that NEMAH and SCENE consider the named entities referred to by users. Besides, NEMAH takes into account the popularity degree of a news article.

Figure 10: Comparison of F1 score of different approaches (Goo, ClickB, Bandit, SCENE, and NEMAH) for different user groups ($N < 10$, $10 \le N < 30$, $30 \le N < 50$, $N \ge 50$). Remark: $N$ denotes the number of news articles read by a user per day.

8.2.5. A User Study on NEMAH. In order to obtain further evaluation metrics to verify our proposed news recommendation system, we develop a prototype system of NEMAH and design a questionnaire that includes the following questions: (1) satisfaction with the news content; (2) ordering of the recommendation list; (3) preference for the news subclasses; (4) popularity of the news articles; and (5) novelty of the recommendation list. For each question, we define 5 levels for selection: 1: So Bad; 2: Not Very Well; 3: Average; 4: Good but Needs to Improve; and 5: Excellent. We crawl news articles of the latest three days from several famous news websites as a candidate set for recommendation. Finally, we recruit 50 volunteers, who are required to have a microblog account on the SINA website, to help us complete the questionnaire. We send them the same questionnaire with different recommendation lists every week, three times in total. The average result of this user study is shown in Figure 11. From the result of the user study, we can see that NEMAH can satisfy the requirements represented by our questions for most people.

Figure 11: User study on different metrics (average score of ClickB and NEMAH on satisfaction, ordering, preference, popularity, and novelty).

9. Conclusion

In this paper, we proposed the NEMAH system architecture to tackle personalized news recommendation based on microblog and subclass popularity prediction. We explore the intrarelations among microblog content and news items, and synthetically consider the subclass popularity factor, similarity among users, and place and key person factors. Our system supports effective classification and subclass clustering of newly published news articles, along with a small amount of historical corpus data. High-quality classification and clustering can construct a better data structure for recommendation. Experimental results compared with some state-of-the-art algorithms have demonstrated the better performance of NEMAH. Besides, our work in Sections 4 and 5 can be utilized for automatic module layout and channel ranking.

For future work, to process massive numbers of online news articles, we plan to deploy some components (e.g., classification, clustering, and subclass popularity analysis) onto the Map-Reduce framework on our distributed system. Moreover, we also plan to integrate the subclass popularity prediction module into our news search engine, given its effectiveness in our work. Another notable point is the evolution of users' interests (e.g., with time, place, and other factors), which can provide insights into news reading behaviors.

Disclosure

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee, provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers, or to redistribute to lists requires prior specific permission and/or a fee.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

References

[1] O. Phelan, K. McCarthy, and B. Smyth, "Using twitter to recommend real-time topical news," in Proceedings of the 3rd ACM Conference on Recommender Systems, pp. 385–388, ACM, October 2009.

[2] G. de Francisci Morales, A. Gionis, and C. Lucchese, "From chatter to headlines: Harnessing the real-time web for personalized news recommendation," in Proceedings of the 5th ACM International Conference on Web Search and Data Mining (WSDM '12), pp. 153–162, ACM, February 2012.

[3] L. Li, D. Wang, T. Li, D. Knox, and B. Padmanabhan, "SCENE: A scalable two-stage personalized news recommendation system," in Proceedings of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '11), pp. 125–134, July 2011.

[4] J. Liu, P. Dolan, and E. R. Pedersen, "Personalized news recommendation based on click behavior," in Proceedings of the 15th ACM International Conference on Intelligent User Interfaces (IUI '10), pp. 31–40, ACM, February 2010.

[5] J. Schafer, J. Konstan, and J. Riedi, "Recommender systems in e-commerce," in Proceedings of the 1st ACM Conference on Electronic Commerce, pp. 158–166, ACM, 1999.

[6] D. Billsus and M. J. Pazzani, "Personal news agent that talks, learns and explains," in Proceedings of the 3rd Annual Conference on Autonomous Agents, pp. 268–275, ACM, May 1999.

[7] M. Pazzani and D. Billsus, "The identification of interesting web sites," Machine Learning, vol. 27, no. 3, pp. 313–331, 1997.

[8] U. Shardanand and P. Maes, "Social information filtering: algorithms for automating 'word of mouth'," in Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pp. 210–217, ACM, May 1995.

[9] R. G. Cota, A. A. Ferreira, C. Nascimento, M. A. Goncalves, and A. H. F. Laender, "An unsupervised heuristic-based hierarchical method for name disambiguation in bibliographic citations," Journal of the American Society for Information Science and Technology, vol. 61, no. 9, pp. 1853–1870, 2010.

[10] J. Breese, D. Heckerman, and C. Kadie, "Empirical analysis of predictive algorithms for collaborative filtering," in Proceedings of the 14th Conference on Uncertainty in Artificial Intelligence, pp. 43–52, Morgan Kaufmann Publishers, 1998.

[11] A. S. Das, M. Datar, A. Garg, and S. Rajaram, "Google news personalization: Scalable online collaborative filtering," in Proceedings of the 16th International World Wide Web Conference (WWW '07), pp. 271–280, ACM, May 2007.

[12] M. Balabanovic and Y. Shoham, "Content-Based Collaborative Recommendation," Communications of the ACM, vol. 40, no. 3, pp. 66–72, 1997.

[13] M. Claypool, A. Gokhale, T. Miranda, P. Murnikov, D. Netes, and M. Sartin, "Combining content-based and collaborative filters in an online newspaper," in Proceedings of the ACM SIGIR Workshop on Recommender Systems, vol. 60, Citeseer, 1999.

[14] W. Chu and S. T. Park, "Personalized recommendation on dynamic content using predictive bilinear models," in Proceedings of the 18th International Conference on World Wide Web, pp. 691–700, ACM, 2009.

[15] L. Li, W. Chu, J. Langford, and R. E. Schapire, "A contextual-bandit approach to personalized news article recommendation," in Proceedings of the 19th International Conference World Wide Web (WWW '10), pp. 661–670, ACM, April 2010.

[16] C. Best, E. van der Goot, M. de Paola, T. Garcia, and D. Horby, "Europe media monitor - emm," JRC Technical Note no. I 2, 2002.

[17] E. Gabrilovich, S. Dumais, and E. Horvitz, "Newsjunkie: providing personalized newsfeeds via analysis of information novelty," in Proceedings of the 13th International Conference World Wide Web (WWW '04), pp. 482–490, ACM, May 2004.

[18] G. Wang, F. H. Lochovsky, and Q. Yang, "Feature selection with conditional mutual information MaxiMin in text categorization," in Proceedings of the 13th ACM Conference on Information and Knowledge Management (CIKM '04), pp. 342–349, ACM, November 2004.

[19] C. Lee and G. G. Lee, "Information gain and divergence-based feature selection for machine learning-based text categorization," Information Processing and Management, vol. 42, no. 1, pp. 155–165, 2006.

[20] A. M. A. Mesleh, "Chi square feature extraction based svms arabic language text categorization system," Journal of Computer Science, vol. 3, no. 6, pp. 430–435, 2007.

[21] Z. Wei, D. Miao, J. H. Chauchat, and C. Zhong, "Feature selection on chinese text classification using character N-grams," in Rough Sets and Knowledge Technology, vol. 5009 of Lecture Notes in Computer Science, pp. 500–507, Springer, Berlin, Germany, 2008.

[22] Y. S. Lai and C. H. Wu, "Meaningful term extraction and discriminative term selection in text categorization via unknown-word methodology," ACM Transactions on Asian Language Information Processing (TALIP), vol. 1, no. 1, pp. 34–64, 2002.

[23] R. Rifkin and A. Klautau, "In defense of one-vs-all classification," The Journal of Machine Learning Research, vol. 5, pp. 101–141, 2004.

[24] A. E. Hoerl and R. W. Kennard, "Ridge regression: Biased estimation for nonorthogonal problems," Technometrics, vol. 42, no. 1, pp. 80–86, 2000.

[25] F. Wang and C. Zhang, "Feature extraction by maximizing the average neighborhood margin," in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR '07), June 2007.

[26] X. Zhang, W. Wang, K. Nørvag, and M. Sebag, "K-AP: generating specified K clusters by efficient affinity propagation," in Proceedings 10th IEEE International Conference on Data Mining (ICDM '10), pp. 1187–1192, IEEE, December 2010.

[27] J. D. Hamilton, Time Series Analysis, vol. 2, Cambridge University Press, Cambridge, UK, 1994.

[28] F. Abel, Q. Gao, G.-J. Houben, and K. Tao, "Analyzing user modeling on Twitter for personalized news recommendations," in User Modeling, Adaption and Personalization, vol. 6787 of Lecture Notes in Computer Science, pp. 1–12, Springer, Berlin, Germany, 2011.

[29] S. Gauch, M. Speretta, A. Chandramouli, and A. Micarelli, "User profiles for personalized information access," The Adaptive Web, Springer, Berlin, Germany, vol. 4321, pp. 54–89, 2007.

[30] G. L. Nemhauser, L. A. Wolsey, and M. L. Fisher, "An analysis of approximations for maximizing submodular set functions-I," Mathematical Programming, vol. 14, no. 1, pp. 265–294, 1978.

[31] S. Khuller, A. Moss, and J. Naor, "The budgeted maximum coverage problem," Information Processing Letters, vol. 70, no. 1, pp. 39–45, 1999.

[32] J. Leskovec, A. Krause, C. Guestrin, C. Faloutsos, J. Vanbriesen, and N. Glance, "Cost-effective outbreak detection in networks," in Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 420–429, ACM, August 2007.

[33] X. Cheng, S. Tan, and L. Tang, "Using dragpushing to refine concept index for text categorization," Journal of Computer Science and Technology, vol. 21, no. 4, pp. 592–596, 2006.

[34] Z. Guo, L. Lu, S. Xi, and F. Sun, "An effective dimension reduction approach to chinese document classification using genetic algorithm," in Advances in Neural Networks - ISNN 2009, vol. 5552 of Lecture Notes in Computer Science, pp. 480–489, Springer, Berlin, Germany, 2009.

[35] H. Cunningham, D. Maynard, K. Bontcheva, and V. Tablan, "Gate: an architecture for development of robust hlt applications," in Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, pp. 168–175, Association for Computational Linguistics, 2002.


Figure 4: Periodic subclass news distribution. (a) Popularity degree of the graduate entrance examination subclass. (b) Popularity degree of the college entrance examination subclass. Remark: the $x$-axis (days, 0 to 1000) denotes the date from August 1, 2009, to May 3, 2012; the $y$-axis (proportion, 0 to 1) denotes the value of $H_k$ in (17).

The values of $A$ and $B$ can be obtained by ordinary least squares regression. When the frequencies take a special form, the calculation becomes very simple. If $n$ is an odd number, so that $n = 2k + 1$, the frequencies of the form $1/n, 2/n, \ldots, k/n$ are called Fourier frequencies. The estimated parameters are as follows:

$$A_0 = \bar{Y}, \qquad A_j = \frac{2}{n} \sum_{t=1}^{n} Y_t \cos\left(\frac{2\pi t j}{n}\right), \qquad B_j = \frac{2}{n} \sum_{t=1}^{n} Y_t \sin\left(\frac{2\pi t j}{n}\right). \tag{19}$$

If the sample size is even, so that $n = 2k$, (19) still holds. At the frequency $f_k = k/n = 1/2$, however, the estimates change to

$$A_k = \frac{1}{n} \sum_{t=1}^{n} (-1)^t Y_t, \qquad B_k = 0. \tag{20}$$

Definition 1. When the sample size is odd, namely, $n = 2k + 1$, we define the cycle diagram (periodogram) $I$ at the frequency $f = j/n$ ($j = 1, 2, \ldots, k$) as shown in the following equation:

$$I\left(\frac{j}{n}\right) = \frac{n}{2}\left(A_j^2 + B_j^2\right). \tag{21}$$

If the sample size is even, (19) still gives the values of $A$ and $B$, and the cycle diagram is the same as in the odd case. In the extreme frequency case, that is, when $f = k/n = 1/2$ (so $j = k$), the cycle diagram is given by the following equation:

$$I\left(\frac{j}{n}\right) = n\left(A_j\right)^2. \tag{22}$$

By (21), the periodogram at frequency $f = j/n$ is proportional to the sum of the squares of the corresponding regression coefficients. Therefore, the peaks of the periodogram show the relative intensity of the sine-cosine pairs at different frequencies, as shown in Figure 5.

In Figure 5, the periodogram has two peaks, at frequencies 0.004970179 and 0.002982107; that is, the subcycle $T = 1/f$ may be 201 days and 335 days. The other peaks are so low that they can be neglected. These two frequencies are selected for building the model, which means that the model contains two sine-cosine pairs, as shown in the following:

$$Y_t = \beta_0 + \beta_1 \cos(2\pi f_1 t) + \beta_2 \sin(2\pi f_1 t) + \beta_3 \cos(2\pi f_2 t) + \beta_4 \sin(2\pi f_2 t) + e_t. \tag{23}$$

Prediction with the spectral analysis method takes several steps. First, we use the periodogram to determine the number and values of the strong frequencies. Second, the model is generated from these strong frequencies. Finally, we predict future data values according to the model, which only requires a time parameter.
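The three steps above can be illustrated with a minimal sketch, assuming an odd sample size so that (19) and (21) apply directly. The function names (`fourier_coefficients`, `periodogram`, `fit_harmonic_model`, `predict`) are hypothetical and are not taken from the NEMAH implementation. The sketch relies on the fact that, at Fourier frequencies, the least squares coefficients of (23) coincide with $A_j$ and $B_j$ of (19), with the intercept equal to the sample mean.

```python
import math


def fourier_coefficients(y, j):
    """Return (A_j, B_j) of (19) for the Fourier frequency j/n, with t = 1..n."""
    n = len(y)
    a = 2.0 / n * sum(y[t - 1] * math.cos(2 * math.pi * t * j / n) for t in range(1, n + 1))
    b = 2.0 / n * sum(y[t - 1] * math.sin(2 * math.pi * t * j / n) for t in range(1, n + 1))
    return a, b


def periodogram(y):
    """Return {j: I(j/n)} of (21) for j = 1..k, assuming n = 2k + 1 is odd."""
    n = len(y)
    if n % 2 == 0:
        raise ValueError("this sketch only covers the odd sample size case")
    spectrum = {}
    for j in range(1, (n - 1) // 2 + 1):
        a, b = fourier_coefficients(y, j)
        spectrum[j] = n / 2.0 * (a * a + b * b)
    return spectrum


def fit_harmonic_model(y, num_freqs=2):
    """Steps 1 and 2: keep the strongest frequencies and build the model of (23)."""
    n = len(y)
    spectrum = periodogram(y)
    strongest = sorted(spectrum, key=spectrum.get, reverse=True)[:num_freqs]
    mean = sum(y) / n  # intercept of the model
    terms = [(j / n,) + fourier_coefficients(y, j) for j in strongest]
    return mean, terms


def predict(model, t):
    """Step 3: evaluate the fitted model at time t (t may lie beyond the sample)."""
    mean, terms = model
    return mean + sum(
        a * math.cos(2 * math.pi * f * t) + b * math.sin(2 * math.pi * f * t)
        for f, a, b in terms
    )
```

Calling `fit_harmonic_model(series, num_freqs=2)` on a daily popularity series and then `predict(model, t)` for a future day `t` mirrors the two-frequency model of (23). The choice of two frequencies follows the example in the text and would need to be revisited for other subclasses.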

Discussion. The motivation of this module is to obtain the hotness of each subclass. Some recent studies also take into account the popularity of newly published news articles. For example, SCENE [3] used a popularity degree computed as the ratio of the number of users accessing the article to the size of the users' pool. However, for the newest popular news article $n_i$, the number of clicks would be smaller than that of a news article published several hours or days before.

6. User Profile Module

Figure 5: Periodogram of the popularity degree of the college entrance examination subclass. (a) The complete periodogram (frequency axis ticks from 0.0 to 0.5). (b) Locally amplified periodogram (frequency axis ticks from 0.00 to 0.05). Remark: the $x$-axis denotes the possible frequency of the popularity degree; the $y$-axis represents the energy of the corresponding frequency.

In order to capture a user's reading interest in news items, a personalized news recommendation system generally needs to construct the user's profile. Traditionally, the user profile is captured by tracking the user's reading history. A survey of various user profile construction techniques is provided in [28, 29]. In this paper, we use the microblog to construct the user's profile. The reason is that a user who is interested in some subclasses will tend to tweet or retweet microblogs on these subclasses. For instance, if a user tweets or retweets many messages about basketball games, we can deduce that this user may like reading basketball news reports (i.e., NBA, CBA, etc.). Besides, many readers tend to glance at news articles and are interested in some key persons' names. Moreover, people from different areas tend to read news from the city they live in or from their hometown. Based on the above analysis, we propose to construct users' profiles by exploring the factors discussed above: microblog content, place names, and preferred key persons. In order to reduce the computational complexity, the user's preference in our model is represented by a vector $U_{pf} = \{\tau, \rho, \kappa\}$. Consider the following.

(1) $\tau$ represents the distribution of key index words in the microblogs which the user tweeted or retweeted in the past, and it can be expressed as a vector $\{\langle t_1, w_1\rangle, \langle t_2, w_2\rangle, \ldots\}$, where each element consists of an index word and its corresponding weight.

(2) $\rho$ represents the place names which appeared in the microblogs of a specific user, and it can be expressed as $\{\langle p_1, w_1\rangle, \langle p_2, w_2\rangle, \ldots, \langle p_i, w_i\rangle\}$, where $p_i$ denotes a place name and $w_i$ denotes the number of times this place appears in the tweets of the given user. We collect the names of all cities and provinces in China. Some place names are subordinate to others; for instance, GuangZhou city is subordinate to GuangDong province. In this case, the system adds weight to GuangDong using $w_{\mathrm{GuangDong}} \mathrel{+}= w_{\mathrm{GuangZhou}}$ when GuangZhou appears.

(3) $\kappa$ represents the list of key persons' names extracted from the user's microblogs, $\{\langle k_1, w_1\rangle, \langle k_2, w_2\rangle, \ldots\}$, where the name list is constructed from the NanFang Daily training corpus, in which the key persons' names have been tagged in each news article.
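The place-name component $\rho$ described in item (2) can be sketched as follows. This is only an illustration under stated assumptions: the lookup table `PARENT_PLACE` and the function `place_weights` are hypothetical, place names are matched by plain substring counting, and the rule $w_{\mathrm{GuangDong}} \mathrel{+}= w_{\mathrm{GuangZhou}}$ is read as "every occurrence of a city also counts once for its province".

```python
from collections import Counter

# Hypothetical city -> province table; the paper collects all Chinese
# cities and provinces, only a tiny sample is listed here.
PARENT_PLACE = {"GuangZhou": "GuangDong", "ShenZhen": "GuangDong"}
KNOWN_PLACES = set(PARENT_PLACE) | set(PARENT_PLACE.values())


def place_weights(tweets):
    """Build the rho component: place name -> number of appearances.

    An occurrence of a subordinate place (a city) also adds the same
    count to its parent place (the province).
    """
    weights = Counter()
    for tweet in tweets:
        for place in KNOWN_PLACES:
            count = tweet.count(place)  # naive substring matching (assumption)
            if count == 0:
                continue
            weights[place] += count
            parent = PARENT_PLACE.get(place)
            if parent is not None:
                weights[parent] += count
    return dict(weights)
```

The same counting pattern could be applied to the key person list $\kappa$, with the tagged names of the training corpus in place of the place table.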

7. Personalized News Recommendation Module

The recommendation module can be divided into two steps: Rough Selection (see Section 7.1) and Precise Selection (see Section 7.3). In the first step, some subclasses are matched according to the user's preference. We then select the news articles from these subclasses by our selection strategy.

7.1. Rough Selection: Subclass Selection for a User. Once we obtain the subclasses and the user's profile, we can calculate the similarity between each subclass and a given user. We use TF-IDF weights to represent the vector of a given subtopic, $\tau_s = \{\langle t_1, w_1\rangle, \langle t_2, w_2\rangle, \ldots\}$. The similarity between a subclass and a user (represented as $\tau_u = \{\langle t_1, w_1\rangle, \langle t_2, w_2\rangle, \ldots\}$ in $U_{pf}$; see Section 6) is computed by cosine similarity. In general, users tend to prefer some special subclasses; that is, they are not interested in all subclasses. Therefore, we can roughly select some subclasses with a similarity threshold. This threshold is set to the top 30% of the ranked similarity scores with respect to a given user.
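A minimal sketch of the rough selection step is given here, assuming that term vectors are stored as dictionaries from index word to weight and that the threshold keeps the top 30% of subclasses ranked by cosine similarity (rounded up). The names `cosine` and `rough_selection` and the parameter `keep_percent` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as {term: weight}."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rough_selection(user_terms, subclass_terms, keep_percent=30):
    """Return the names of the best-matching subclasses for one user.

    user_terms:     tau_u, {term: weight} built from the user's microblogs
    subclass_terms: {subclass name: tau_s}, TF-IDF vector of each subclass
    keep_percent:   share of subclasses kept (assumed reading of the threshold)
    """
    scores = {name: cosine(user_terms, vec) for name, vec in subclass_terms.items()}
    ranked = sorted(scores, key=scores.get, reverse=True)
    keep = max(1, -(-keep_percent * len(ranked) // 100))  # integer ceiling
    return ranked[:keep]
```

Keeping at least one subclass even for very small inputs is a design choice of this sketch, not something stated in the text.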

7.2. News Profile Construction. After obtaining the news clusters that the user might be interested in, the next step is to select specific news articles for the given user. As with users, we initially maintain a news profile for each news article, then model the recommendation as a budgeted maximum coverage problem and solve it with a greedy selection algorithm.

A news profile contains many similar factors, for example, key person, place, the cluster it belongs to, recency, popularity, and so forth. For the popularity, as discussed above, we use $H_k$ to represent the popularity degree of cluster $k$. For the recency, the score is given by

$$\mathrm{Rec}(i) = \frac{i_c - i_p}{24 \times 60}, \tag{24}$$

where the function $\mathrm{Rec}(i)$ returns the recency score of news article $i$, and $i_c$ and $i_p$ denote the current time and the published time, respectively.

In this paper, news profiles help to evaluate how well a news article can satisfy the user. Given a news profile $N_{pf} = \{\rho, \kappa, \upsilon\}$ and a user's profile $U_{pf} = \{\rho, \kappa, \upsilon\}$, the similarity between $N_{pf}$ and $U_{pf}$ is computed as

$$\mathrm{sim}(N_{pf}, U_{pf}) = \gamma_1\, \mathrm{sim}(\rho_n, \rho_u) + \gamma_2\, \mathrm{sim}(\kappa_n, \kappa_u) + \gamma_3\, \mathrm{sim}(\upsilon_n, \upsilon_u), \tag{25}$$

where $\gamma_1$, $\gamma_2$, and $\gamma_3$ are parameters that control how much we trust or weigh the corresponding components; they are set to 1 in our system. Each component is calculated by the cosine similarity.
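Equations (24) and (25) can be sketched as below. Two assumptions are made and should be checked against the actual system: timestamps are taken to be in minutes, so that dividing by $24 \times 60$ yields the article age in days, and each profile is stored as a dictionary with the hypothetical keys `"rho"`, `"kappa"`, and `"upsilon"`, each holding a sparse `{item: weight}` vector.

```python
import math


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as {item: weight}."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def recency(current_time, published_time):
    """Rec(i) of (24); with timestamps in minutes (assumption) this is the age in days."""
    return (current_time - published_time) / (24 * 60)


def profile_similarity(news_pf, user_pf, gammas=(1.0, 1.0, 1.0)):
    """sim(N_pf, U_pf) of (25): weighted sum of component-wise cosine similarities."""
    components = ("rho", "kappa", "upsilon")  # hypothetical key names
    return sum(
        gamma * cosine(news_pf[name], user_pf[name])
        for gamma, name in zip(gammas, components)
    )
```

With all three weights set to 1, as in the text, the score ranges from 0 (no shared places, persons, or remaining component) to 3 (identical directions in every component).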

Let $E$ be a finite set and $f$ a real-valued nondecreasing function defined on the subsets of $E$ that satisfies

$$f(T \cup \{\varsigma\}) - f(T) \le f(S \cup \{\varsigma\}) - f(S), \tag{26}$$

where $S \subseteq T$, $S$ and $T$ are two subsets of $E$, and $\varsigma \in E \setminus T$. Such a function $f$ is called a submodular function [30]. In other words, adding an element to the larger set $T$ cannot increase the value of $f$ by more than adding the same element to the smaller set $S$. The budgeted maximum coverage problem can be described as follows: given a set of elements $E$, in which each element is associated with an influence and a cost defined over a domain of these elements, and a budget $B$, the goal is to find a subset of $E$ which has the largest influence while the total cost does not exceed the budget $B$. This problem is NP-hard [31]. However, [31] proposed a greedy algorithm which sequentially picks the element that yields the largest possible increase in influence within the cost limit. Submodularity resides in each pick-up step. By the result of [32], submodular functions are closed under nonnegative linear combinations.

7.3. Precise Selection: News Selection for Recommendation. In a given news subclass, we observe that most news articles concentrate on a similar topic, with minor differences in the major aspects of the corresponding topic. Typically, a reader is interested in some aspects of the given subclass but not all of them. Based on this intuition, our news selection strategy can be described as follows.

Assume that $\mathcal{C}$ denotes the newly published news set, $\mathcal{S}$ represents the selected news set, and $\varsigma$ denotes the news article being selected. After selecting a piece of news $\varsigma$, we must ensure that

(i) the topic diversity should not deviate much in $\mathcal{S}$;
(ii) $\mathcal{S}$ should give more satisfaction to the given user;
(iii) $\mathcal{S}$ should be similar to the general topic in $\mathcal{C} \setminus \mathcal{S}$.

For each of the above strategies, similar to [3], we define a quality function $q(\mathcal{S})$ to evaluate the value of the currently selected news set $\mathcal{S}$ as follows:

$$q(\mathcal{S}) = \frac{1}{\binom{|\mathcal{S}|}{2}} \sum_{n_1, n_2 \in \mathcal{S},\, n_1 \ne n_2} -\mathrm{sim}(n_1, n_2) + \frac{1}{|\mathcal{S}|} \sum_{n_1 \in \mathcal{S}} \mathrm{sim}(u, n_1) + \frac{1}{|\mathcal{C} \setminus \mathcal{S}| \cdot |\mathcal{S}|} \sum_{n_1 \in \mathcal{C} \setminus \mathcal{S}} \sum_{n_2 \in \mathcal{S}} \mathrm{sim}(n_1, n_2), \tag{27}$$

where $n_1$ and $n_2$ denote news items, $u$ denotes the given user, and the function $\mathrm{sim}(\cdot, \cdot)$ returns the similarity of its two parameters. Equation (27) contains three components, corresponding to the three news selection strategies listed above; $q(\mathcal{S})$ balances the contributions of the different components. Suppose $\varsigma$ is the candidate news document; the quality increase can be represented as

$$I(\varsigma) = q(\mathcal{S} \cup \{\varsigma\}) - q(\mathcal{S}). \tag{28}$$
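The quality function (27) and the quality increase (28) can be combined into a simple greedy loop, sketched here under several assumptions: every article has unit cost, so the budget is just the list length; the first sum of (27) runs over unordered pairs; and a term whose normalizer is zero (an empty $\mathcal{S}$, a single selected article, or an empty $\mathcal{C} \setminus \mathcal{S}$) is taken to be 0. The function names and the callables `user_sim` and `news_sim` are hypothetical.

```python
from itertools import combinations


def quality(selected, candidates, user_sim, news_sim):
    """q(S) of (27) for the selected list S drawn from the candidate list C."""
    if not selected:
        return 0.0
    rest = [n for n in candidates if n not in selected]  # C \ S
    pairs = list(combinations(selected, 2))
    # (i) diversity: negative average pairwise similarity inside S
    diversity = -sum(news_sim(a, b) for a, b in pairs) / len(pairs) if pairs else 0.0
    # (ii) satisfaction: average similarity between the user and S
    satisfaction = sum(user_sim(n) for n in selected) / len(selected)
    # (iii) coverage: average similarity between S and the unselected news
    if rest:
        coverage = sum(news_sim(a, b) for a in rest for b in selected) / (len(rest) * len(selected))
    else:
        coverage = 0.0
    return diversity + satisfaction + coverage


def greedy_select(candidates, user_sim, news_sim, budget):
    """Repeatedly add the article with the largest quality increase I of (28)."""
    selected = []
    while len(selected) < budget:
        remaining = [n for n in candidates if n not in selected]
        if not remaining:
            break
        base = quality(selected, candidates, user_sim, news_sim)
        best = max(
            remaining,
            key=lambda n: quality(selected + [n], candidates, user_sim, news_sim) - base,
        )
        selected.append(best)
    return selected
```

In this sketch the loop always fills the budget, even when every remaining candidate lowers $q(\mathcal{S})$; stopping early on a negative increase would be an equally plausible variant.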

The goal is to select a list of recommended news documents which provides the largest possible value within the budget (i.e., the budget can be regarded as the maximum number of articles in the recommended list). We can obtain a list of news documents for each subclass by adopting the greedy selection algorithm. Taking into account the other characteristics of news documents, for example, the popularity and the recency, the ranking of the selected news articles needs to be adjusted in order to make the recommended results more reasonable. Formally, given a news article $n$, the popularity and the recency can be combined as

$$n_\phi = \frac{H_{k_n} - H_{\min}}{H_{\max} - H_{\min}} - \frac{\mathrm{Rec}(n) - \mathrm{Rec}_{\min}}{\mathrm{Rec}_{\max} - \mathrm{Rec}_{\min}}, \tag{29}$$

where $H_{k_n}$ denotes the popularity degree of the subclass to which the news $n$ belongs and $\mathrm{Rec}(n)$ can be obtained from (24). From the equation above, we note that the smaller the recency score is, the higher the article ranks. Besides, the greater the popularity is, the higher the article ranks. After computing the $n_\phi$ value of each article in the recommended list, we sort these articles by $n_\phi$ with a quicksort algorithm. With this adjustment, the generated ranking emphasizes popularity and freshness while still concentrating on news documents that satisfy the user's preference.
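The ranking adjustment of (29) can be sketched as follows. The minima and maxima are assumed to be taken over the recommended list itself, which the text does not state explicitly, and Python's built-in `sorted` is used in place of a hand-written quicksort; the resulting order is the same. The names `min_max` and `rank_articles` are hypothetical.

```python
def min_max(value, lowest, highest):
    """Min-max normalization; returns 0.0 when all values are equal."""
    if highest == lowest:
        return 0.0
    return (value - lowest) / (highest - lowest)


def rank_articles(articles):
    """Order articles by n_phi of (29), highest first.

    articles: list of (article_id, popularity, recency) tuples, where
    popularity is H_k of the article's subclass and recency is Rec(n).
    """
    pops = [pop for _, pop, _ in articles]
    recs = [rec for _, _, rec in articles]
    scored = [
        (min_max(pop, min(pops), max(pops)) - min_max(rec, min(recs), max(recs)), article_id)
        for article_id, pop, rec in articles
    ]
    # Sort by the n_phi score only, so ties keep their original relative order.
    return [article_id for _, article_id in sorted(scored, key=lambda s: s[0], reverse=True)]
```

For example, a fresh article from a popular subclass receives a score close to 1, while an old article from an unpopular subclass receives a score close to -1.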

8. Experimental Evaluation

In this section, we provide a comprehensive experimental evaluation to show the efficacy of our proposed news recommendation system. We start with an introduction to a real-world collection obtained from SINA, a news and microblog service website. After that, we describe the experimental design and show the results based on the recommendation framework introduced in this paper.


Table 2: Recommendation Micro-F1 (Top 30) of different time periods for different classification-based systems.

Range (YM)    #         NB      Cheng   ZGuo    NEMAH
0908-0908     4239      0.204   0.242   0.270   0.351
0910–0912     37910     0.206   0.254   0.268   0.364
1001–1006     75047     0.227   0.289   0.297   0.403
1007–1107     151995    0.198   0.271   0.274   0.371
0908–1208     280737    0.210   0.273   0.284   0.383

Remark: # denotes the number of news articles. Time range 0908 denotes August 2009.

8.1. Real-World Data Set. For the experiments, we gather news data from SINA (http://news.sina.com.cn), where the data collection ranges from August 1, 2009 to August 31, 2012. We also gather the users who comment on these articles, together with their microblogs from SINA (http://weibo.com). To verify our recommendation performance, we preprocess the data by removing microblog messages that are too short (i.e., fewer than 3 words) and nonactive users (i.e., users who tweeted or retweeted fewer than 10 messages). After preprocessing, 5,127 users are stored with 124,301 messages and 280,737 news articles.

8.2. Experiments. Our system has four major components: (1) a module responsible for the classification and clustering of news articles, (2) a component for constructing and updating user profiles, (3) hot news subclass prediction based on time-series analysis, and (4) a recommendation component using news clusters and user profiles, accompanied by the subclass popularity factor and recency. From the experimental perspective, we first verify our components. We then compare our system with state-of-the-art approaches and design a user study.

8.2.1. Classification and Clustering Evaluation. In order to evaluate the performance of the classification and clustering component, we design two experiments.

(1) Classification Comparison. Many classification methods have been proposed in the field of text processing over the past decade. We implement the following three classification methods: the method of Cheng et al. [33], the method of Guo et al. [34], and the Naïve Bayesian (NB) method. Cheng proposed a text classification method based on refining the concept index, and Guo employed a genetic algorithm for classification. Before using the classification module, we must set $\alpha$ in (8) and decide the threshold of feature selection through an offline experiment, as shown in Figure 6, where T-10 denotes that threshold = 10 in feature selection and the F-score is Micro-F1. The performance is roughly best when $\alpha = 0.2$. From the result, we also observe that the thresholds 20, 30, and 40 produce similar results, so we use $T = 20$ in our processing.

Table 2 lists the recommendation evaluation results for the different classification methods. Based on the comparison, our proposed method outperforms the classical Naïve Bayesian method and the methods of Cheng and Guo in terms of the F1 measure. A straightforward explanation for the improvement is that our method represents news articles with fewer features, selected by the method we proposed, and implements a series of two-class classifications to alleviate the similarity problem between different classes. The most important reason may be that we incorporate manually classified key persons into the method.

Figure 6: $\alpha$ parameter selection via classification. Curves: T-10, T-20, T-30, and T-40. Remark: the $x$-axis is $\alpha$; the $y$-axis is the F1 measure score of our classification using different $\alpha$.

(2) Clustering Comparison. In reality, we need to cluster the news articles into subclasses every day, or even every hour. From our spider software, we know that thousands of news articles arrive per day. $K$-means and hierarchical clustering are the most common clustering algorithms. In order to verify our proposed method, we design the experiment as follows: (1) use 500, 1000, and 1500 as the number of newly published articles for processing; (2) for each scale of dataset, perform classification on these data; (3) perform $K$-means, hierarchical clustering, and our proposed clustering method on these data; (4) perform Top 30 news recommendation; and (5) compute the F1 score for the different clustering-based systems. The comparison of recommendation with different subclass clustering methods is shown in Figure 7.

From the experimental result, we have the following observations: (1) NEMAH achieves a better result than the other methods in terms of F1 score; (2) NEMAH is more stable than the other methods. A straightforward explanation might be that $K$-means clustering needs an initial clustering center for each cluster. Besides, with fewer parameters, our proposed method has stronger generalization and learning ability and does not require knowledge of the size and distribution of the text corpus.

8.2.2. User Profile and Subclass Popularity Prediction Evaluation. The user profile is an important factor in a recommendation system and can affect the recommendation result significantly. Our user profile construction includes the following factors: content, place name, and key person. Prior approaches often use the content or similar access patterns to construct the user profile. SCENE [3] used the content, similar access patterns, and entities extracted by GATE [35]. In reality, entities such as place names and key person names are relatively stable over a period. Figure 8 shows the results of using different user profile building methods and subclass popularity prediction methods.

Figure 7: Recommendation performance (F1 score) at different data scales for different clustering-based systems ($K$-means, Hierarchical, and NEMAH). Remark: $n$ denotes the number of news articles for clustering ($n = 500$, $n = 1000$, and $n = 1500$).

Figure 8: Recommendation F1 score (at T5, T10, T20, and T30) of different profile factors and subclass popularity prediction methods (C, C + GATE, C + P + K, C + P + K + E, and C + P + K + S). Remark: C: content; P: place name; K: key person name; GATE: entity names extracted using the GATE tool; E: popularity prediction using three-time exponential smoothing; S: popularity prediction using spectral analysis (employed by NEMAH).

From this result, we observe the following. (1) Our method performs better than using GATE. (2) Recommendation using content only cannot perform well, because microblog messages do not contain much content. (3) The Spectral Analysis employed in subclass popularity prediction performs better than the Three-Time Exponential Smoothing method. Although the average performance of Spectral Analysis is better than that of Three-Time Exponential Smoothing, in our other work on time series analysis we found that the cycle diagrams of some subclasses have weaker frequency signals, which tends to cause overfitting with a large number of sine-cosine pairs and yields worse results in these subclasses. SCENE [3] also used the popularity degree, which is computed as the ratio of the number of users accessing the article to the size of the users' pool. However, this method conflicts with freshness. The straightforward reason is that the freshest news may have received only a few clicks.

Table 3: Diversity evaluation on different recommendation lists.

Methods     T10       T20       T30
Goo         0.5104    0.4320    0.1215
ClickB      0.5231    0.4457    0.1587
Bilinear    0.5024    0.3547    0.1478
Bandit      0.6112    0.3874    0.2674
SCENE       0.6821    0.5747    0.5687
NEMAH       0.7425    0.6941    0.6637

Remark: T$n$ denotes the recommended result with top-$n$.

8.2.3. Diversity Evaluation. The recommended news list of NEMAH exhibits great diversity at both the class and subclass levels. Let $R(u)$ be the recommended news list of a user $u$; the diversity for $u$ can be defined as follows:

$$\mathrm{Diversity} = 1 - \frac{\sum_{i, j \in R(u),\, i \ne j} \mathrm{sim}(i, j)}{\frac{1}{2}\,|R(u)|\,(|R(u)| - 1)}, \tag{30}$$

where $i$ and $j$ are two different news articles in the recommendation list for user $u$ and $\mathrm{sim}(i, j)$ denotes the news profile similarity between the news items $i$ and $j$. For this metric, we choose Goo [11] (a collaborative filtering based method), ClickB [4] (a content-based method), Bilinear [14], Bandit [15], and SCENE [3] (a hybrid method using LSH for clustering and a greedy algorithm for news selection) as the comparison baselines. Table 3 shows the diversity results with $|R(u)| = 10$, $|R(u)| = 20$, and $|R(u)| = 30$, in which we use T10 to represent $|R(u)| = 10$.
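The diversity metric (30) translates into a few lines of code. The sketch assumes that the sum runs over unordered pairs, which matches the normalizer $\frac{1}{2}|R(u)|(|R(u)| - 1)$, and that `news_sim` is a symmetric similarity function supplied by the caller; both names are hypothetical.

```python
from itertools import combinations


def diversity(recommended, news_sim):
    """Diversity of (30): one minus the average pairwise similarity in R(u)."""
    pairs = list(combinations(recommended, 2))  # |R|(|R| - 1) / 2 unordered pairs
    if not pairs:
        raise ValueError("diversity needs at least two recommended articles")
    return 1.0 - sum(news_sim(i, j) for i, j in pairs) / len(pairs)
```

A list of near-duplicate articles therefore scores close to 0, and a list of mutually dissimilar articles scores close to 1.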

From Table 3, we can see that our system outperforms the others significantly. The straightforward reason is that we diversify the news not only according to the preference of the user but also according to the distribution of the candidate news articles. As the recommendation list grows, the diversity of the baseline methods decreases significantly, because they rely too much on the preference of the user.

8.2.4. System Accuracy Evaluation. In order to verify the effectiveness of our proposed NEMAH, we implement a recommender system that models the recommendation as a contextual bandit problem [15]. We also implement the SCENE [3] prototype system, which employs LSH (Locality Sensitive Hashing) for news clustering and greedy selection for user recommendation. For each method, we select 50 users and provide news recommendation results for them. Figure 9 shows the comparison results for the Top 10, Top 20, and Top 30 news items for each user.

Figure 9: Precision-recall plot of different recommendations. (a) Top 10. (b) Top 20. (c) Top 30. Remark: ∘ (green) denotes the bandit-based recommender, ◻ (blue) denotes the SCENE recommender, and ∗ (red) represents NEMAH.

In the above experiments, we can observe that, besides the higher accuracy, the distribution of our system is more stable than that of the other approaches.

In reality, if users read only a few news articles every day, many news recommendation systems cannot produce good results for these users. Our system can address this problem owing to the microblog-based user profile building. Figure 10 shows the comparison results for different user groups over all users (5,127 users). Suppose a reader reads $N$ news articles per day. From this figure, we can see that our proposed system produces reasonable results for nonactive users. SCENE also produces fairly good results. The reason is that NEMAH and SCENE consider the named entities referred to by users. Besides, NEMAH takes into account the popularity degree of a news article.

8.2.5. A User Study on NEMAH. In order to obtain other evaluation metrics for our proposed news recommendation system, we develop a prototype system of NEMAH and design a questionnaire which includes the following questions: (1) satisfaction with the news content, (2) ordering of the recommendation list, (3) preference for the news subclasses, (4) popularity of the news articles, and (5) novelty of the recommendation list. For each question, we define five rating levels: 1 (So Bad), 2 (Not Very Well), 3 (Average), 4 (Good but Needs to Improve), and 5 (Excellent). We crawl the news articles of the latest three days from several famous news websites as the candidate set for recommendation. Finally, we recruit 50 volunteers, who are required to have a microblog account on the SINA website, to complete the questionnaire. We send them the same questionnaire with different recommendation lists every week, three times in total. The average result of this user study is shown in Figure 11. From the result of the user study, we can see that NEMAH satisfies the requirements represented by our questions for most people.

Figure 10: Comparison of the F1 score of different approaches (Goo, ClickB, Bandit, SCENE, and NEMAH) for different user groups ($N < 10$, $10 \le N < 30$, $30 \le N < 50$, and $N \ge 50$). Remark: $N$ denotes the number of news articles read by a user per day.

Figure 11: User study on different metrics (Satisfaction, Ordering, Preference, Popular, and Novelty); average score of ClickB and NEMAH.

9. Conclusion

In this paper, we proposed the NEMAH system architecture to tackle personalized news recommendation based on microblogs and subclass popularity prediction. We explore the intrarelations among microblog content and news items and jointly consider the subclass popularity factor, the similarity among users, and the place and key person factors. Our system supports effective classification and subclass clustering of newly published news articles using only a small history corpus. High-quality classification and clustering provide a better data structure for recommendation. Experimental results compared with some state-of-the-art algorithms have demonstrated the better performance of NEMAH. Besides, our work in Sections 4 and 5 can be utilized for automatic module layout and channel ranking.

For future work, in order to process mass network news articles, we plan to deploy some components (e.g., classification, clustering, and subclass popularity analysis) onto the Map-Reduce framework on our distributed system. Moreover, we also plan to integrate the subclass popularity prediction module into our news search engine, given its effectiveness in our work. Another notable point is the interest evolution of users (e.g., over time, place, and other factors), which can provide insights into news reading behaviors.

Disclosure

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee, provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers, or to redistribute to lists requires prior specific permission and/or a fee.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

References

[1] O. Phelan, K. McCarthy, and B. Smyth, “Using twitter to recommend real-time topical news,” in Proceedings of the 3rd ACM Conference on Recommender Systems, pp. 385–388, ACM, October 2009.

[2] G. de Francisci Morales, A. Gionis, and C. Lucchese, “From chatter to headlines: Harnessing the real-time web for personalized news recommendation,” in Proceedings of the 5th ACM International Conference on Web Search and Data Mining (WSDM ’12), pp. 153–162, ACM, February 2012.

[3] L. Li, D. Wang, T. Li, D. Knox, and B. Padmanabhan, “SCENE: A scalable two-stage personalized news recommendation system,” in Proceedings of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR ’11), pp. 125–134, July 2011.

[4] J. Liu, P. Dolan, and E. R. Pedersen, “Personalized news recommendation based on click behavior,” in Proceedings of the 15th ACM International Conference on Intelligent User Interfaces (IUI ’10), pp. 31–40, ACM, February 2010.

[5] J. Schafer, J. Konstan, and J. Riedl, “Recommender systems in e-commerce,” in Proceedings of the 1st ACM Conference on Electronic Commerce, pp. 158–166, ACM, 1999.

[6] D. Billsus and M. J. Pazzani, “Personal news agent that talks, learns and explains,” in Proceedings of the 3rd Annual Conference on Autonomous Agents, pp. 268–275, ACM, May 1999.

[7] M. Pazzani and D. Billsus, “The identification of interesting web sites,” Machine Learning, vol. 27, no. 3, pp. 313–331, 1997.

[8] U. Shardanand and P. Maes, “Social information filtering: algorithms for automating ‘word of mouth’,” in Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pp. 210–217, ACM, May 1995.

[9] R. G. Cota, A. A. Ferreira, C. Nascimento, M. A. Goncalves, and A. H. F. Laender, “An unsupervised heuristic-based hierarchical method for name disambiguation in bibliographic citations,” Journal of the American Society for Information Science and Technology, vol. 61, no. 9, pp. 1853–1870, 2010.

[10] J. Breese, D. Heckerman, and C. Kadie, “Empirical analysis of predictive algorithms for collaborative filtering,” in Proceedings of the 14th Conference on Uncertainty in Artificial Intelligence, pp. 43–52, Morgan Kaufmann Publishers, 1998.

[11] A. S. Das, M. Datar, A. Garg, and S. Rajaram, “Google news personalization: Scalable online collaborative filtering,” in Proceedings of the 16th International World Wide Web Conference (WWW ’07), pp. 271–280, ACM, May 2007.

[12] M. Balabanovic and Y. Shoham, “Content-Based, Collaborative Recommendation,” Communications of the ACM, vol. 40, no. 3, pp. 66–72, 1997.

[13] M. Claypool, A. Gokhale, T. Miranda, P. Murnikov, D. Netes, and M. Sartin, “Combining content-based and collaborative filters in an online newspaper,” in Proceedings of the ACM SIGIR Workshop on Recommender Systems, vol. 60, Citeseer, 1999.

[14] W. Chu and S. T. Park, “Personalized recommendation on dynamic content using predictive bilinear models,” in Proceedings of the 18th International Conference on World Wide Web, pp. 691–700, ACM, 2009.

[15] L. Li, W. Chu, J. Langford, and R. E. Schapire, “A contextual-bandit approach to personalized news article recommendation,” in Proceedings of the 19th International Conference World Wide Web (WWW ’10), pp. 661–670, ACM, April 2010.

[16] C. Best, E. van der Goot, M. de Paola, T. Garcia, and D. Horby, “Europe media monitor - emm,” JRC Technical Note, no. I 2, 2002.

[17] E. Gabrilovich, S. Dumais, and E. Horvitz, “Newsjunkie: providing personalized newsfeeds via analysis of information novelty,” in Proceedings of the 13th International Conference World Wide Web (WWW ’04), pp. 482–490, ACM, May 2004.

[18] G. Wang, F. H. Lochovsky, and Q. Yang, “Feature selection with conditional mutual information MaxiMin in text categorization,” in Proceedings of the 13th ACM Conference on Information and Knowledge Management (CIKM ’04), pp. 342–349, ACM, November 2004.

[19] C. Lee and G. G. Lee, “Information gain and divergence-based feature selection for machine learning-based text categorization,” Information Processing and Management, vol. 42, no. 1, pp. 155–165, 2006.

[20] A. M. A. Mesleh, “Chi square feature extraction based svms arabic language text categorization system,” Journal of Computer Science, vol. 3, no. 6, pp. 430–435, 2007.

[21] Z. Wei, D. Miao, J. H. Chauchat, and C. Zhong, “Feature selection on chinese text classification using character N-grams,” in Rough Sets and Knowledge Technology, vol. 5009 of Lecture Notes in Computer Science, pp. 500–507, Springer, Berlin, Germany, 2008.

[22] Y. S. Lai and C. H. Wu, “Meaningful term extraction and discriminative term selection in text categorization via unknown-word methodology,” ACM Transactions on Asian Language Information Processing (TALIP), vol. 1, no. 1, pp. 34–64, 2002.

[23] R. Rifkin and A. Klautau, “In defense of one-vs-all classification,” The Journal of Machine Learning Research, vol. 5, pp. 101–141, 2004.

[24] A. E. Hoerl and R. W. Kennard, “Ridge regression: Biased estimation for nonorthogonal problems,” Technometrics, vol. 42, no. 1, pp. 80–86, 2000.

[25] F. Wang and C. Zhang, “Feature extraction by maximizing the average neighborhood margin,” in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR ’07), June 2007.

[26] X. Zhang, W. Wang, K. Nørvag, and M. Sebag, “K-AP: generating specified K clusters by efficient affinity propagation,” in Proceedings 10th IEEE International Conference on Data Mining (ICDM ’10), pp. 1187–1192, IEEE, December 2010.

[27] J. D. Hamilton, Time Series Analysis, vol. 2, Cambridge University Press, Cambridge, UK, 1994.

[28] F. Abel, Q. Gao, G.-J. Houben, and K. Tao, “Analyzing user modeling on Twitter for personalized news recommendations,” in User Modeling, Adaption and Personalization, vol. 6787 of Lecture Notes in Computer Science, pp. 1–12, Springer, Berlin, Germany, 2011.

[29] S. Gauch, M. Speretta, A. Chandramouli, and A. Micarelli, “User profiles for personalized information access,” The Adaptive Web, Springer, Berlin, Germany, vol. 4321, pp. 54–89, 2007.

[30] G. L. Nemhauser, L. A. Wolsey, and M. L. Fisher, “An analysis of approximations for maximizing submodular set functions-I,” Mathematical Programming, vol. 14, no. 1, pp. 265–294, 1978.

[31] S. Khuller, A. Moss, and J. Naor, “The budgeted maximum coverage problem,” Information Processing Letters, vol. 70, no. 1, pp. 39–45, 1999.

[32] J. Leskovec, A. Krause, C. Guestrin, C. Faloutsos, J. Vanbriesen, and N. Glance, “Cost-effective outbreak detection in networks,” in Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 420–429, ACM, August 2007.

[33] X Cheng S Tan and L Tang ldquoUsing dragpushing to refineconcept index for text categorizationrdquo Journal of ComputerScience and Technology vol 21 no 4 pp 592ndash596 2006

[34] Z Guo L Lu S Xi and F Sun ldquoAn effective dimension reduc-tion approach to chinese document classification using geneticalgorithmrdquo in Advances in Neural NetworksmdashISNN 2009 vol5552 of Lecture Notes in Computer Science pp 480ndash489Springer Berlin Germany 2009

[35] H Cunningham D Maynard K Bontcheva and V TablanldquoGate an architecture for development of robust hlt applica-tionsrdquo in Proceedings of the 40th Annual Meeting on Associationfor Computational Linguistics pp 168ndash175 Association forComputational Linguistics 2002


8 The Scientific World Journal

Figure 5: Periodogram of the popularity degree of the college entrance examination subclass. (a) The complete periodogram (frequency axis ticks from 0.0 to 0.5). (b) Locally amplified periodogram (frequency axis ticks from 0.00 to 0.05). In both panels the energy axis runs from 0.0 to 0.8. Remark: the x-axis denotes the possible frequency of the popularity degree; the y-axis represents the energy of the corresponding frequency.

can be captured by tracking the user's reading history. A survey of various user profile construction techniques is provided in [28, 29]. In this paper, we use the microblog to construct the user's profile. The reason is that a user who is interested in some subclasses will tend to tweet or retweet microblog messages on these subclasses. For instance, if a user tweets or retweets many messages about basketball games, we can deduce that this user may like reading basketball news reports (i.e., NBA, CBA, etc.). Besides, many readers tend to glance at news articles and are interested in some key persons' names. Moreover, people from different areas tend to read the news from the city they live in or from their hometown. Based on the above analysis, we propose to construct users' profiles by exploring the factors discussed above: microblog content, place names, and preferred key persons. In order to reduce the computational complexity, preference is also taken into account in our model, which can be represented by a vector $U_{pf} = \{\tau, \rho, \kappa\}$. Consider the following.

(1) $\tau$ represents the distribution of key index words in the microblogs which the user tweeted or retweeted in the past, and it can be expressed as a vector $\{\langle t_1, w_1\rangle, \langle t_2, w_2\rangle, \ldots\}$, where each element consists of an index word and its corresponding weight.

(2) $\rho$ represents the place names which appeared in the microblog of a specific user, and it can be expressed as $\{\langle p_1, w_1\rangle, \langle p_2, w_2\rangle, \ldots, \langle p_i, w_i\rangle, \ldots\}$, where $p_i$ denotes a place name and $w_i$ denotes the number of times this place appears in the tweets of the given user. We collect all the city and province names in China. Some place names are subordinate to others; for instance, GuangZhou city is subordinate to GuangDong province. In this case, the system adds weight to GuangDong using $w_{\mathrm{GuangDong}} \mathrel{+}= w_{\mathrm{GuangZhou}}$ when GuangZhou appears.

(3) $\kappa$ represents the list of key persons' names extracted from the user's microblog, $\{\langle k_1, w_1\rangle, \langle k_2, w_2\rangle, \ldots\}$, where the name list is constructed from the NanFang Daily training corpus, in which the key persons' names have been tagged in each news article.
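The three profile components can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the lookup tables `CITY_TO_PROVINCE`, `PROVINCES`, and `KEY_PERSONS` are hypothetical stand-ins for the full gazetteer and the NanFang Daily name list, tweets are assumed to be already tokenized, and raw counts are used as weights for all three components because the text only fixes the weighting for place names.

```python
from collections import Counter

# Hypothetical lookup tables (assumptions for illustration only).
CITY_TO_PROVINCE = {"GuangZhou": "GuangDong", "ShenZhen": "GuangDong"}
PROVINCES = {"GuangDong"}
KEY_PERSONS = {"Yao Ming"}


def build_user_profile(tweets):
    """Return (tau, rho, kappa) as Counters built from tokenized tweets."""
    tau, rho, kappa = Counter(), Counter(), Counter()
    for tokens in tweets:
        for tok in tokens:
            if tok in CITY_TO_PROVINCE:
                rho[tok] += 1
                # A city mention also adds weight to its parent province.
                rho[CITY_TO_PROVINCE[tok]] += 1
            elif tok in PROVINCES:
                rho[tok] += 1
            elif tok in KEY_PERSONS:
                kappa[tok] += 1
            else:
                # Everything else counts as an index word of the content.
                tau[tok] += 1
    return tau, rho, kappa
```

With this reading of the weight rule, a province accumulates its own mentions plus the mentions of every subordinate city.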

7. Personalized News Recommendation Module

The recommendation module can be divided into two steps: Rough Selection (see Section 7.1) and Precise Selection (see Section 7.3). In the first step, some subclasses are matched according to the user's preference. Then we select the news articles from these subclasses by our selection strategy.

7.1. Rough Selection: Subclass Selection for a User. Once we obtain the subclasses and the user's profile, we can calculate the similarity between each subclass and a given user. We can use TF-IDF weights to represent the vector of a given subtopic, $\tau_s = \{\langle t_1, w_1\rangle, \langle t_2, w_2\rangle, \ldots\}$. The similarity between a subclass and a user (represented as $\tau_u = \{\langle t_1, w_1\rangle, \langle t_2, w_2\rangle, \ldots\}$ in $U_{pf}$; see Section 6) is computed by cosine similarity. In general, users tend to prefer some particular subclasses; that is, they are not interested in all subclasses. Therefore, we can roughly select some subclasses with a similarity threshold. This threshold is set equal to the 30% position in the ranking of all similarity scores with respect to a given user.
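A minimal sketch of the rough selection step follows, assuming that sparse TF-IDF vectors are stored as dictionaries and that the threshold means keeping the top fraction of the similarity ranking. The function names and the `keep_fraction` parameter are hypothetical and are not taken from the paper.

```python
import math


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rough_select(user_vec, subclass_vecs, keep_fraction=0.3):
    """Keep the best-matching fraction of subclasses for one user."""
    ranked = sorted(
        subclass_vecs,
        key=lambda name: cosine(user_vec, subclass_vecs[name]),
        reverse=True,
    )
    # Always keep at least one subclass, even for very small inputs.
    k = max(1, math.ceil(keep_fraction * len(ranked)))
    return ranked[:k]
```

Ranking first and cutting afterwards keeps the threshold relative to each user, so a user with uniformly low scores still receives some candidate subclasses.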

7.2. News Profile Construction. After obtaining the news clusters that the user might be interested in, the next step is to select specific news articles for the given user. As for users, we initially maintain a news profile for each news article, and then we model the recommendation as a budgeted maximum coverage problem and solve it by a greedy selection algorithm.

A news profile contains many similar factors, for example, key person, place, the cluster it belongs to, recency, popularity, and so forth. For the popularity, as we discussed above, we use $H_k$ to represent the popularity degree of cluster $k$. For the recency, the score is represented as follows:

$$\mathrm{Rec}(i) = \frac{i_c - i_p}{24 \times 60}, \tag{24}$$

where the function $\mathrm{Rec}(i)$ returns the recency score of news article $i$, and $i_c$ and $i_p$ denote the current time and the published time, respectively.

In this paper, news profiles are helpful for evaluating how well a news article can satisfy the user. Given a news profile $N_{pf} = \{\rho, \kappa, \upsilon\}$ and a user's profile $U_{pf} = \{\rho, \kappa, \upsilon\}$, the similarity between $N_{pf}$ and $U_{pf}$ is computed as

$$\mathrm{sim}(N_{pf}, U_{pf}) = \gamma_1\,\mathrm{sim}(\rho_n, \rho_u) + \gamma_2\,\mathrm{sim}(\kappa_n, \kappa_u) + \gamma_3\,\mathrm{sim}(\upsilon_n, \upsilon_u), \tag{25}$$

where $\gamma_1$, $\gamma_2$, and $\gamma_3$ are parameters that control how much we trust or weigh the corresponding components, and they are set to 1 in our system. Each component is calculated by the cosine similarity.
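Equations (24) and (25) translate directly into code. The sketch below assumes that both timestamps are expressed in minutes, which is what the divisor 24 × 60 suggests, so that the recency score is an age in days. The dictionary keys `rho`, `kappa`, and `upsilon` and the function names are hypothetical.

```python
import math

COMPONENTS = ("rho", "kappa", "upsilon")


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def recency(current_minutes, published_minutes):
    """Equation (24), assuming both times are measured in minutes."""
    return (current_minutes - published_minutes) / (24 * 60)


def profile_similarity(news_profile, user_profile, gammas=(1.0, 1.0, 1.0)):
    """Equation (25): weighted sum of per-component cosine similarities."""
    return sum(
        g * cosine(news_profile[c], user_profile[c])
        for g, c in zip(gammas, COMPONENTS)
    )
```

With all three weights set to 1, as in the paper, the combined similarity lies between 0 and 3 for nonnegative component vectors.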

Let $E$ be a finite set and $f$ a real-valued nondecreasing function defined on the subsets of $E$ that satisfies

$$f(T \cup \{\zeta\}) - f(T) \le f(S \cup \{\zeta\}) - f(S), \tag{26}$$

where $S \subseteq T$, $S$ and $T$ are two subsets of $E$, and $\zeta \in E \setminus T$. Such a function $f$ is called a submodular function [30]. When an element is added to a larger set $T$, the increment in the value of $f$ cannot be larger than the increment obtained by adding the same element to a smaller set $S$. The budgeted maximum coverage problem can be described as follows: given a set of elements $E$, in which each element is associated with an influence and a cost defined over a domain of these elements, and a budget $B$, the goal is to find a subset of $E$ which has the largest influence while the total cost does not exceed the budget $B$. This problem is NP-hard [31]. However, [31] proposed a greedy algorithm which sequentially picks the element that yields the largest possible increase in influence within the cost limit. Submodularity resides in each pick-up step. As noted in [32], submodular functions are closed under nonnegative linear combinations.
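To make inequality (26) concrete, the following sketch checks it by brute force on a very small ground set. It is an illustration only: the helper names are hypothetical, the check is exponential in the size of the ground set, and it tests only the diminishing-returns condition, not monotonicity.

```python
from itertools import chain, combinations


def subsets(items):
    """All subsets of `items`, as tuples."""
    items = list(items)
    return chain.from_iterable(
        combinations(items, r) for r in range(len(items) + 1)
    )


def is_submodular(f, ground):
    """Check inequality (26) for every S within T and every z outside T."""
    ground = frozenset(ground)
    for t in map(frozenset, subsets(ground)):
        for s in map(frozenset, subsets(t)):
            for z in ground - t:
                gain_large = f(t | {z}) - f(t)
                gain_small = f(s | {z}) - f(s)
                if gain_large > gain_small:
                    return False
    return True
```

A coverage function, which counts the distinct items covered by the chosen sets, passes this check, while a function such as the squared set size does not.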

7.3. Precise Selection: News Selection for Recommendation. In a given news subclass, we observe that most news articles concentrate on a similar topic, with minor differences on major aspects of the corresponding topic. Typically, a reader is interested in some aspects of the given subclass, but not all of them. Based on this intuition, our news selection strategy can be described as follows.

Assume that $\mathcal{C}$ denotes the newly published news set, $\mathcal{S}$ represents the selected news set, and $\zeta$ denotes the news article being selected. After selecting a piece of news $\zeta$, we must ensure that

(i) the topic diversity should not deviate much in $\mathcal{S}$;
(ii) $\mathcal{S}$ should give more satisfaction to the given user;
(iii) $\mathcal{S}$ should be similar to the general topic in $\mathcal{C} \setminus \mathcal{S}$.

For each of the above strategies, similar to [3], we define a quality function $q(\mathcal{S})$ to evaluate the value of the currently selected news set $\mathcal{S}$ as follows:

$$q(\mathcal{S}) = \frac{1}{\binom{|\mathcal{S}|}{2}} \sum_{n_1, n_2 \in \mathcal{S},\, n_1 \ne n_2} -\mathrm{sim}(n_1, n_2) + \frac{1}{|\mathcal{S}|} \sum_{n_1 \in \mathcal{S}} \mathrm{sim}(u, n_1) + \frac{1}{|\mathcal{C} \setminus \mathcal{S}| \cdot |\mathcal{S}|} \sum_{n_1 \in \mathcal{C} \setminus \mathcal{S}} \sum_{n_2 \in \mathcal{S}} \mathrm{sim}(n_1, n_2), \tag{27}$$

where $n_1$ and $n_2$ denote news items, $u$ denotes the given user, and the function $\mathrm{sim}(\cdot, \cdot)$ returns the similarity of its two arguments. Equation (27) contains three components corresponding to the news selection strategy listed above; $q(\mathcal{S})$ balances the contributions of the different components. Suppose $\zeta$ is the candidate news document; the quality increase can be represented as

$$I(\zeta) = q(\mathcal{S} \cup \{\zeta\}) - q(\mathcal{S}). \tag{28}$$
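A greedy selection driven by (27) and (28) might look as follows. This is a sketch under several assumptions rather than the authors' implementation: every article has unit cost, so the budget is simply the list length; the first sum in (27) runs over unordered pairs; $q$ of the empty set is taken to be 0; and terms whose normalizer would be zero are treated as 0. The names `quality` and `greedy_select` are hypothetical, and `sim` is any similarity function supplied by the caller.

```python
from itertools import combinations


def quality(selected, candidates, user, sim):
    """Equation (27); returns 0.0 for an empty selection (assumption)."""
    if not selected:
        return 0.0
    rest = [c for c in candidates if c not in selected]
    pairs = list(combinations(selected, 2))
    # Component (i): penalize redundancy inside the selected set.
    diversity = -sum(sim(a, b) for a, b in pairs) / len(pairs) if pairs else 0.0
    # Component (ii): reward similarity to the user.
    satisfaction = sum(sim(user, n) for n in selected) / len(selected)
    # Component (iii): reward similarity to the unselected candidates.
    coverage = 0.0
    if rest:
        total = sum(sim(a, b) for a in rest for b in selected)
        coverage = total / (len(rest) * len(selected))
    return diversity + satisfaction + coverage


def greedy_select(candidates, user, sim, budget):
    """Repeatedly add the article with the largest increase I from (28)."""
    selected = []
    while len(selected) < budget:
        remaining = [c for c in candidates if c not in selected]
        if not remaining:
            break
        base = quality(selected, candidates, user, sim)
        best = max(
            remaining,
            key=lambda z: quality(selected + [z], candidates, user, sim) - base,
        )
        selected.append(best)
    return selected
```

Recomputing $q$ from scratch for every candidate is quadratic in the selection size; an incremental update would be preferable for large candidate sets, but the direct form keeps the correspondence with (27) easy to see.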

The goal is to select a list of recommended news documents which provides the largest possible value within the budget (i.e., the budget can be regarded as the maximum number of articles in the recommended list). We can obtain a list of news documents for each subclass by adopting the greedy selection algorithm. Taking into account the other characteristics of news documents, for example, the popularity and the recency, the ranking of the selected news articles needs to be adjusted in order to make the recommended results more reasonable. Formally, given a news article $n$, the popularity and the recency can be combined as

$$n_\phi = \frac{H_{k_n} - H_{\min}}{H_{\max} - H_{\min}} - \frac{\mathrm{Rec}(n) - \mathrm{Rec}_{\min}}{\mathrm{Rec}_{\max} - \mathrm{Rec}_{\min}}, \tag{29}$$

where $H_{k_n}$ denotes the popularity degree of the subclass to which the news article $n$ belongs and $\mathrm{Rec}(n)$ can be obtained from (24). From the equation above, we note that the smaller the recency score is, the higher the article is ranked. Besides, the greater the popularity is, the higher the article is ranked. After computing the $n_\phi$ value for the list of recommended articles, we run a quicksort algorithm on these articles according to $n_\phi$. With this adjustment, the generated ranking places more emphasis on popularity and freshness while still concentrating on news documents that satisfy the user's preference.
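The rank adjustment of (29) can be sketched as a min-max normalization followed by a sort. Two points are assumptions of this sketch: a term whose range is zero is treated as 0, and Python's built-in `sorted` (Timsort) is used in place of the quicksort mentioned above, which yields the same ordering up to ties. The tuple layout and function names are hypothetical.

```python
def _scaled(value, low, high):
    """Min-max scaling; returns 0.0 when the range is empty (assumption)."""
    return (value - low) / (high - low) if high > low else 0.0


def rerank(articles):
    """Sort (article_id, popularity, recency) tuples by n_phi, best first."""
    pops = [h for _, h, _ in articles]
    recs = [r for _, _, r in articles]
    h_min, h_max = min(pops), max(pops)
    r_min, r_max = min(recs), max(recs)

    def n_phi(article):
        _, h, r = article
        # Equation (29): scaled popularity minus scaled recency (age).
        return _scaled(h, h_min, h_max) - _scaled(r, r_min, r_max)

    return sorted(articles, key=n_phi, reverse=True)
```

Here `popularity` stands for the popularity degree of the subclass the article belongs to, and `recency` for the score from (24), so fresh articles from popular subclasses move to the top.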

8. Experimental Evaluation

In this section, we provide a comprehensive experimental evaluation to show the efficacy of our proposed news recommendation system. We start with an introduction to a real-world collection obtained from SINA, a news and microblog service website. After that, we describe the experimental design and show the results based on the recommendation framework introduced in this paper.


Table 2: Recommendation Micro-F1 (Top30) of different time periods for different classification-based systems.

Range (YM)    #         NB      Cheng   ZGuo    NEMAH
0908-0908     4239      0.204   0.242   0.270   0.351
0910–0912     37910     0.206   0.254   0.268   0.364
1001–1006     75047     0.227   0.289   0.297   0.403
1007–1107     151995    0.198   0.271   0.274   0.371
0908–1208     280737    0.210   0.273   0.284   0.383

Remark: # denotes the number of news articles. Time range 0908 denotes August 2009.

8.1. Real-World Data Set. For the experiments, we gather the news data from SINA (http://news.sina.com.cn), where the data collection ranges from August 1, 2009 to August 31, 2012. We also gather the users who commented on these articles and their microblogs from SINA (http://weibo.com), and we preprocess the data by removing microblog messages that are too short (i.e., fewer than 3 words) and the nonactive users (i.e., the users who tweeted or retweeted fewer than 10 messages) for verifying our recommendation performance. After preprocessing, 5,127 users are stored with 124,301 messages and 280,737 news articles.
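The two preprocessing filters can be sketched as follows. Several details are assumptions of this sketch and are not stated in the paper: short messages are removed before the activity count is taken, and words are counted by whitespace splitting, which would have to be replaced by a word segmenter for Chinese microblog text. The function and parameter names are hypothetical.

```python
def preprocess(messages_by_user, min_words=3, min_messages=10):
    """Drop short messages, then drop users left with too few messages."""
    kept = {}
    for user, messages in messages_by_user.items():
        # Whitespace splitting is a stand-in for real word segmentation.
        long_enough = [m for m in messages if len(m.split()) >= min_words]
        if len(long_enough) >= min_messages:
            kept[user] = long_enough
    return kept
```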

8.2. Experiments. Our system has four major components: (1) a module responsible for classifying and clustering news articles, (2) a component for constructing and updating user profiles, (3) hot news subclass prediction based on time-series analysis, and (4) a recommendation component using news clusters and user profiles, accompanied by the subclass popularity factor and recency. From the experimental perspective, we first verify our components. Then we compare our system with the state-of-the-art approaches and design a user study.

8.2.1. Classification and Clustering Evaluation. In order to evaluate the performance of the classification and clustering component, we design two experiments.

(1) Classification Comparison. Many classification methods have been proposed in the field of text processing in the past decade. We implement the following three classification methods: the method of Cheng et al. [33], the method of Guo et al. [34], and the Naïve Bayesian (NB) method. Cheng proposed a text classification method based on refining a concept index, and Guo employed a genetic algorithm for classification. Before using the classification module, we must set the $\alpha$ in (8) and decide the threshold of feature selection through an offline experiment, as shown in Figure 6, where T-10 denotes that threshold = 10 in feature selection and the F-score is Micro-F1. The performance is roughly best when $\alpha = 0.2$. From the result, we also observe that the thresholds 20, 30, and 40 produce similar results, so we use $T = 20$ in our processing.

Table 2 lists the recommendation evaluation results for the different classification methods. Based on the comparison, we see that our proposed method outperforms the classical Naïve Bayesian method and the methods of Cheng and Guo in terms of the F1 measure. A straightforward explanation for the improvement is that our method uses fewer features, selected by the method we proposed, to represent news articles and implements a series of two-class classifications to alleviate the similarity problem between different classes. The most important reason may be that we incorporate the key persons, which are classified manually, into the method.

Figure 6: $\alpha$ parameter selection via classification, with $\alpha$ from 0 to 0.7 on the x-axis and curves for T-10, T-20, T-30, and T-40. Remark: the y-axis is the F1 measure score of our classification using different $\alpha$.

(2) Clustering Comparison. In reality, we need to cluster the news articles into subclasses every day, even every hour. From our spider software, we know that thousands of news articles arrive per day. $K$-means and hierarchical clustering methods are the most common clustering algorithms. In order to verify our proposed method, we design the experiment as follows: (1) use 500, 1000, and 1500 as the number of newly published articles for processing; (2) for each scale of dataset, perform classification on these data; (3) perform $K$-means, hierarchical clustering, and our proposed clustering method on these data; (4) perform Top30 news recommendation; and (5) compute the F1 score for the different clustering-based systems. The comparison of recommendation with different subclass clustering methods is shown in Figure 7.

From the experimental result, we have the following observations: (1) NEMAH achieves a better result than the other methods in terms of F1 score; (2) NEMAH is more stable than the other methods. A straightforward explanation might be that $K$-means clustering needs an initial clustering center for each cluster. Besides, with fewer parameters, our proposed method has stronger generalization and learning ability, without requiring knowledge of the size and distribution of the text corpus.

8.2.2. User Profile and Subclass Popularity Prediction Evaluation. The user profile is an important factor in a recommendation system and can affect the recommendation result significantly. Our user profile construction includes the following factors: content, place name, and key person. Prior approaches often use the content or similar access patterns to construct the user profile. SCENE [3] used the content, similar access patterns, and entities extracted by GATE [35]. In reality, entities such as place names and key person names are relatively stable for a period. Figure 8 shows the results of using different user profile building methods and subclass popularity prediction methods.

Figure 7: Recommendation performance (F1 score) at different data scales for different clustering-based systems (K-means, hierarchical, and NEMAH). Remark: $n$ denotes the number of news articles for clustering ($n$ = 500, 1000, and 1500).

Figure 8: Recommendation F1 score of different profile factors and subclass popularity prediction methods, at T5, T10, T20, and T30. Remark: C: content; P: place name; K: key person name; GATE: entity names extracted using the GATE tool; E: popularity prediction using three-time exponential smoothing; S: popularity prediction using spectral analysis (employed by NEMAH). The compared configurations are C, C + GATE, C + P + K, C + P + K + E, and C + P + K + S.

From this result, we observe the following. (1) Our method performs better than using GATE. (2) Recommendation using content only cannot perform well, because microblog messages do not contain much content. (3) The Spectral Analysis employed in subclass popularity prediction performs better than the Three-Time Exponential Smoothing method. Although the average performance of Spectral Analysis is better than that of Three-Time Exponential Smoothing, in our other work on time series analysis we found that some subclasses' cycle diagrams have weaker frequency signals, which tends to cause overfitting with a large number of sine-cosine pairs and leads to worse results in these subclasses. SCENE [3] also used the popularity degree, which is computed as the ratio of the number of users accessing the article to the size of the user pool. However, this method conflicts with freshness. The straightforward reason is that the freshest news may have received few clicks.

Table 3: Diversity evaluation on different recommendation lists.

Methods     T10      T20      T30
Goo         0.5104   0.4320   0.1215
ClickB      0.5231   0.4457   0.1587
Bilinear    0.5024   0.3547   0.1478
Bandit      0.6112   0.3874   0.2674
SCENE       0.6821   0.5747   0.5687
NEMAH       0.7425   0.6941   0.6637

Remark: T$n$ denotes the recommended result with top-$n$.

8.2.3. Diversity Evaluation. The recommended news list of NEMAH shows great diversity at both the class and subclass levels. Let $R(u)$ be the recommended news list of a user $u$; the diversity for $u$ can be defined as follows:

$$\mathrm{Diversity} = 1 - \frac{\sum_{i, j \in R(u),\, i \ne j} \mathrm{sim}(i, j)}{(1/2)\,|R(u)|\,(|R(u)| - 1)}, \tag{30}$$

where $i$ and $j$ are two different news articles in the recommendation list for user $u$ and $\mathrm{sim}(i, j)$ denotes the news profile similarity between news items $i$ and $j$. For this metric, we choose Goo [11] (a collaborative filtering based method), ClickB [4] (a content-based method), Bilinear [14], Bandit [15], and SCENE [3] (a hybrid method using LSH for clustering and a greedy algorithm for news selection) as the comparison baselines. Table 3 shows the diversity results with $|R(u)| = 10$, $|R(u)| = 20$, and $|R(u)| = 30$, in which we use T10 to represent $|R(u)| = 10$.
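The diversity metric in (30) is straightforward to compute. The sketch below assumes that the sum runs over unordered pairs, which matches the normalizer $|R(u)|(|R(u)| - 1)/2$, and it rejects lists with fewer than two articles, for which the metric is undefined. The function name is hypothetical, and `sim` is any pairwise similarity supplied by the caller.

```python
from itertools import combinations


def diversity(rec_list, sim):
    """Equation (30): one minus the mean pairwise similarity of the list."""
    n = len(rec_list)
    if n < 2:
        raise ValueError("diversity needs at least two recommended articles")
    total = sum(sim(i, j) for i, j in combinations(rec_list, 2))
    return 1.0 - total / (0.5 * n * (n - 1))
```

A list of mutually identical articles scores 0, and a list of mutually dissimilar articles scores 1.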

From Table 3, we can see that our system outperforms the others significantly. The straightforward reason is that we diversify the news not only according to the preference of the user but also according to the distribution of candidate news articles. As the recommendation list grows, the diversity decreases significantly for the baseline methods, because they rely too much on the preference of the user.

8.2.4. System Accuracy Evaluation. In order to verify the effectiveness of our proposed NEMAH, we implement a recommender system that models the recommendation as a contextual bandit problem [15]. We also implement the SCENE [3] prototype system, which employs LSH (Locality Sensitive Hashing) for news clustering and uses greedy selection for user recommendation. For each method, we select 50 users and provide news recommendation results for them. Figure 9 shows the comparison results for the Top 10, Top 20, and Top 30 news items for each user.

Figure 9: Precision-recall plot of different recommendations, with precision on the x-axis and recall on the y-axis: (a) Top 10, (b) Top 20, and (c) Top 30. Remark: ∘ (green) denotes the bandit-based recommender, ◻ (blue) denotes the SCENE recommender, and ∗ (red) represents NEMAH.

In the above experiments, we can observe that, besides the higher accuracy, the distribution of our system's results is more stable than that of the other approaches.

In reality, if users read only a few news articles every day, many news recommendation systems cannot produce good results for these users. Our system can address this problem thanks to the microblog user profile building. Figure 10 shows the comparison results for different user groups over all users (5,127 users). Suppose a reader reads $N$ news articles per day. From this figure, we can see that our proposed system produces reasonable results for nonactive users. SCENE also produces fairly good results. The reason is that NEMAH and SCENE consider the named entities referred to by users. Besides, NEMAH takes into account the popularity degree of a news article.

8.2.5. A User Study on NEMAH. In order to obtain further evaluation metrics to verify our proposed news recommendation system, we develop a prototype system of our proposed NEMAH and design a questionnaire which includes the following questions: (1) satisfaction with the news content, (2) ordering of the recommendation list, (3) preference for the news subclasses, (4) popularity of the news articles, and (5) novelty of the recommendation list. For each question, we define 5 scores for selection, where 1: So Bad; 2: Not Very Well; 3: Average; 4: Good but Needs to Improve; and 5: Excellent. We crawl news articles of the latest three days from several famous news websites as a candidate set for recommendation. Finally, we recruit 50 volunteers, who are required to have a microblog account on the SINA website, to help us complete the questionnaire. We send them the same questionnaire with different recommendation lists every week, three times in total. The average result of this user study is shown in Figure 11. From the result of the user study, we can see that NEMAH satisfies the requirements represented by our questions for most people.

Figure 10: Comparison of F1 score of different approaches (Goo, ClickB, Bandit, SCENE, and NEMAH) for different user groups ($N < 10$, $10 \le N < 30$, $30 \le N < 50$, and $N \ge 50$). Remark: $N$ denotes the number of news articles read by a user per day.

Figure 11: User study on different metrics (satisfaction, ordering, preference, popularity, and novelty), showing the average score of ClickB and NEMAH on a scale from 0 to 5.

9. Conclusion

In this paper, we proposed the NEMAH system architecture to tackle personalized news recommendation based on microblog and subclass popularity prediction. We explore the intrarelations between microblog content and news items and jointly consider the subclass popularity factor, the similarity among users, and the place and key person factors. Our system supports effective classification and subclass clustering of newly published news articles along with a small historical corpus. High-quality classification and clustering yield a better data structure for recommendation. Experimental results compared with some state-of-the-art algorithms have demonstrated the better performance of NEMAH. Besides, our work in Sections 4 and 5 can be utilized for automatic module layout and channel ranking.

For future work, to process massive numbers of network news articles, we plan to deploy some components (e.g., classification, clustering, and subclass popularity analysis) onto the Map-Reduce framework on our distributed system. Moreover, we also plan to integrate the subclass popularity prediction module into our news search engine, given its effectiveness in our work. Another point worth noting is the interest evolution of users (e.g., time, place, and other factors), which can provide insights into news reading behaviors.

Disclosure

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee, provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers, or to redistribute to lists requires prior specific permission and/or a fee.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

References

[1] O. Phelan, K. McCarthy, and B. Smyth, “Using twitter to recommend real-time topical news,” in Proceedings of the 3rd ACM Conference on Recommender Systems, pp. 385–388, ACM, October 2009.

[2] G. de Francisci Morales, A. Gionis, and C. Lucchese, “From chatter to headlines: harnessing the real-time web for personalized news recommendation,” in Proceedings of the 5th ACM International Conference on Web Search and Data Mining (WSDM ’12), pp. 153–162, ACM, February 2012.

[3] L. Li, D. Wang, T. Li, D. Knox, and B. Padmanabhan, “SCENE: a scalable two-stage personalized news recommendation system,” in Proceedings of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR ’11), pp. 125–134, July 2011.

[4] J. Liu, P. Dolan, and E. R. Pedersen, “Personalized news recommendation based on click behavior,” in Proceedings of the 15th ACM International Conference on Intelligent User Interfaces (IUI ’10), pp. 31–40, ACM, February 2010.

[5] J. Schafer, J. Konstan, and J. Riedi, “Recommender systems in e-commerce,” in Proceedings of the 1st ACM Conference on Electronic Commerce, pp. 158–166, ACM, 1999.

[6] D. Billsus and M. J. Pazzani, “Personal news agent that talks, learns and explains,” in Proceedings of the 3rd Annual Conference on Autonomous Agents, pp. 268–275, ACM, May 1999.

[7] M. Pazzani and D. Billsus, “The identification of interesting web sites,” Machine Learning, vol. 27, no. 3, pp. 313–331, 1997.

[8] U. Shardanand and P. Maes, “Social information filtering: algorithms for automating ‘word of mouth’,” in Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pp. 210–217, ACM, May 1995.

[9] R. G. Cota, A. A. Ferreira, C. Nascimento, M. A. Goncalves, and A. H. F. Laender, “An unsupervised heuristic-based hierarchical method for name disambiguation in bibliographic citations,” Journal of the American Society for Information Science and Technology, vol. 61, no. 9, pp. 1853–1870, 2010.

[10] J. Breese, D. Heckerman, and C. Kadie, “Empirical analysis of predictive algorithms for collaborative filtering,” in Proceedings of the 14th Conference on Uncertainty in Artificial Intelligence, pp. 43–52, Morgan Kaufmann Publishers, 1998.

[11] A. S. Das, M. Datar, A. Garg, and S. Rajaram, “Google news personalization: scalable online collaborative filtering,” in Proceedings of the 16th International World Wide Web Conference (WWW ’07), pp. 271–280, ACM, May 2007.

[12] M Balabanovic and Y Shoham ldquoContent-Based CollaborativeRecommendationrdquo Communications of the ACM vol 40 no 3pp 66ndash72 1997

[13] M Claypool A Gokhale T Miranda P Murnikov D Netesand M Sartin ldquoCombining content-based and collaborativefilters in an online newspaperrdquo in Proceedings of the ACM SIGIRWorkshop on Recommender Systems vol 60 Citeseer 1999

[14] W Chu and S T Park ldquoPersonalized recommendation ondynamic content using predictive bilinear modelsrdquo in Proceed-ings of the 18th International Conference onWorldWideWeb pp691ndash700 ACM 2009

[15] L Li W Chu J Langford and R E Schapire ldquoA contextual-bandit approach to personalized news article recommendationrdquoin Proceedings of the 19th International Conference World WideWeb (WWW rsquo10) pp 661ndash670 ACM April 2010

[16] C Best E van der Goot M de Paola T Garcia and D HorbyldquoEurope media monitormdashemmrdquo JRC Technical Note no I 22002

[17] EGabrilovich SDumais andEHorvitz ldquoNewsjunkie provid-ing personalized newsfeeds via analysis of information noveltyrdquoin Proceedings of the 13th International conference World WideWeb (WWW rsquo04) pp 482ndash490 ACM May 2004

[18] G Wang F H Lochovsky and Q Yang ldquoFeature selection withconditional mutual information MaxiMin in text categoriza-tionrdquo in Proceedings of the 13th ACMConference on Informationand Knowledge Management (CIKM rsquo04) pp 342ndash349 ACMNovember 2004

[19] C Lee and G G Lee ldquoInformation gain and divergence-basedfeature selection for machine learning-based text categoriza-tionrdquo Information Processing andManagement vol 42 no 1 pp155ndash165 2006

[20] A M A Mesleh ldquoChi square feature extraction based svmsarabic language text categorization systemrdquo Journal of ComputerScience vol 3 no 6 pp 430ndash435 2007

[21] Z Wei D Miao J H Chauchat and C Zhong ldquoFeature sel-ection on chinese text classification using character N-gramsrdquoin Rough Sets and Knowledge Technology vol 5009 of LectureNotes in Computer Science pp 500ndash507 Springer BerlinGermany 2008

[22] Y S Lai and C H Wu ldquoMeaningful term extraction and dis-criminative term selection in text categorization via unknown-word methodologyrdquo ACM Transactions on Asian LanguageInformation Processing (TALIP) vol 1 no 1 pp 34ndash64 2002

[23] R Rifkin and A Klautau ldquoIn defense of one-vs-all classifica-tionrdquoThe Journal of Machine Learning Research vol 5 pp 101ndash141 2004

[24] A E Hoerl and R W Kennard ldquoRidge regression Biased esti-mation for nonorthogonal problemsrdquo Technometrics vol 42no 1 pp 80ndash86 2000

[25] F Wang and C Zhang ldquoFeature extraction by maximizing theaverage neighborhoodmarginrdquo in Proceedings of the IEEECom-puter Society Conference on Computer Vision and PatternRecognition (CVPR rsquo07) June 2007

[26] X Zhang W Wang K Noslashrvag and M Sebag ldquoK-AP gener-ating specified K clusters by efficient affinity propagationrdquo inProceedings 10th IEEE International Conference on Data Mining(ICDM rsquo10) pp 1187ndash1192 IEEE December 2010

[27] J D Hamilton Time Series Analysis vol 2 Cambridge Univer-sity Press Cambridge UK 1994

[28] F Abel Q Gao G-J Houben andK Tao ldquoAnalyzing usermod-eling on Twitter for personalized news recommendationsrdquoin User Modeling Adaption and Personalization vol 6787 ofLecture Notes in Computer Science pp 1ndash12 Springer BerlinGermany 2011

[29] SGauchM Speretta A Chandramouli andAMicarelli ldquoUserprofiles for personalized information accessrdquoTheAdaptiveWebSpringer Berlin Germany vol 4321 pp 54ndash89 2007

[30] G L Nemhauser L A Wolsey and M L Fisher ldquoAn analysisof approximations for maximizing submodular set functions-IrdquoMathematical Programming vol 14 no 1 pp 265ndash294 1978

[31] S Khuller A Moss and J Naor ldquoThe budgeted maximum cov-erage problemrdquo Information Processing Letters vol 70 no 1 pp39ndash45 1999

[32] J Leskovec A Krause C Guestrin C Faloutsos J Vanbriesenand N Glance ldquoCost-effective outbreak detection in networksrdquoin Proceedings of the 13th ACM SIGKDD International Confer-ence on Knowledge Discovery and Data Mining pp 420ndash429ACM August 2007

[33] X Cheng S Tan and L Tang ldquoUsing dragpushing to refineconcept index for text categorizationrdquo Journal of ComputerScience and Technology vol 21 no 4 pp 592ndash596 2006

[34] Z Guo L Lu S Xi and F Sun ldquoAn effective dimension reduc-tion approach to chinese document classification using geneticalgorithmrdquo in Advances in Neural NetworksmdashISNN 2009 vol5552 of Lecture Notes in Computer Science pp 480ndash489Springer Berlin Germany 2009

[35] H Cunningham D Maynard K Bontcheva and V TablanldquoGate an architecture for development of robust hlt applica-tionsrdquo in Proceedings of the 40th Annual Meeting on Associationfor Computational Linguistics pp 168ndash175 Association forComputational Linguistics 2002



A news profile contains many similar factors, for example, key person, place, the cluster it belongs to, recency, popularity, and so forth. For the popularity, as we discussed above, we use $H_k$ to represent the popularity degree of cluster $k$. For the recency, the score is represented as follows:

$$\mathrm{Rec}(i) = \frac{i_c - i_p}{24 \ast 60}, \tag{24}$$

where the function $\mathrm{Rec}(i)$ returns the recency score of news article $i$, and $i_c$ and $i_p$ denote the current time and the published time, respectively.

In this paper, news profiles are helpful to evaluate how well a news article can satisfy the user. Given a news profile $N_{pf} = \{\rho, \kappa, \upsilon\}$ and a user's profile $U_{pf} = \{\rho, \kappa, \upsilon\}$, the similarity between $N_{pf}$ and $U_{pf}$ is computed as

$$\mathrm{sim}(N_{pf}, U_{pf}) = \gamma_1\, \mathrm{sim}(\rho_n, \rho_u) + \gamma_2\, \mathrm{sim}(\kappa_n, \kappa_u) + \gamma_3\, \mathrm{sim}(\upsilon_n, \upsilon_u), \tag{25}$$

where the subscripts $n$ and $u$ mark the components of the news profile and the user profile, and $\gamma_1$, $\gamma_2$, and $\gamma_3$ are parameters that control how much we trust or weigh the corresponding components; they are all set to 1 in our system. Each component is calculated by the cosine similarity.
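As an illustration only, Equations (24) and (25) can be sketched in a few lines of Python. The sketch assumes that timestamps are given in minutes (so that dividing by 24 * 60 yields days) and that each profile component is a sparse term-weight dictionary; the dictionary keys `rho`, `kappa`, and `upsilon` and all function names are hypothetical and not taken from the NEMAH implementation.

```python
import math

def recency(current_minutes, published_minutes):
    """Eq. (24): elapsed time divided by 24 * 60 (days if inputs are minutes)."""
    return (current_minutes - published_minutes) / (24 * 60)

def cosine(a, b):
    """Cosine similarity between two sparse term-weight dictionaries."""
    dot = sum(w * b.get(term, 0.0) for term, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def profile_similarity(news, user, gammas=(1.0, 1.0, 1.0)):
    """Eq. (25): weighted sum of the three per-component cosine similarities."""
    components = ("rho", "kappa", "upsilon")  # hypothetical component keys
    return sum(g * cosine(news[c], user[c]) for g, c in zip(gammas, components))

news = {"rho": {"x": 1.0}, "kappa": {"y": 2.0}, "upsilon": {"z": 3.0}}
other = {"rho": {"q": 1.0}, "kappa": {"r": 1.0}, "upsilon": {"s": 1.0}}
print(recency(2880, 0))                 # two days between publication and now
print(profile_similarity(news, news))   # identical profiles
print(profile_similarity(news, other))  # profiles with no shared terms
```

With all weights equal to 1, as in the paper, the score ranges from 0 (no overlap in any component) to 3 (identical direction in every component) for nonnegative term weights.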

Let $E$ be a finite set and $f$ a real-valued nondecreasing function defined on the subsets of $E$ that satisfies

$$f(T \cup \{\varsigma\}) - f(T) \le f(S \cup \{\varsigma\}) - f(S), \tag{26}$$

where $S \subseteq T$, $S$ and $T$ are two subsets of $E$, and $\varsigma \in E \setminus T$. Such a function $f$ is called a submodular function [30]. Adding an element to the larger set $T$ cannot increase the value of $f$ by more than adding the same element to the smaller set $S$. The budgeted maximum coverage problem can be described as follows: given a set of elements $E$, in which each element is associated with an influence and a cost defined over a domain of these elements, and a budget $B$, the goal is to find a subset of $E$ which has the largest influence while the total cost does not exceed the budget $B$. This problem is NP-hard [31]. However, [31] proposed a greedy algorithm which sequentially picks the element that yields the largest possible increase in influence within the cost limit. Submodularity resides in each pick-up step. According to the result of [32], submodular functions are closed under nonnegative linear combinations.
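To make the diminishing-returns property (26) and the greedy pick-up step concrete, the following sketch uses plain set coverage as the influence function. It is a simplified illustration of the greedy step as described in the text, not the full algorithm of [31]; the toy sets, costs, and function names are assumptions made for this example.

```python
def coverage(sets, chosen):
    """Influence of a selection: number of distinct items covered."""
    covered = set()
    for name in chosen:
        covered |= sets[name]
    return len(covered)

def greedy_budgeted(sets, cost, budget):
    """Repeatedly pick the affordable element with the largest marginal gain."""
    chosen, spent = [], 0
    while True:
        base = coverage(sets, chosen)
        best, best_gain = None, 0
        for name in sorted(sets):
            if name in chosen or spent + cost[name] > budget:
                continue  # already picked, or would exceed the budget
            gain = coverage(sets, chosen + [name]) - base
            if gain > best_gain:
                best, best_gain = name, gain
        if best is None:  # nothing affordable improves the coverage
            return chosen
        chosen.append(best)
        spent += cost[best]

# Toy instance (hypothetical data).
sets = {"a": {1, 2, 3}, "b": {3, 4}, "c": {5}, "d": {1, 2, 3, 4}}
cost = {"a": 2, "b": 1, "c": 1, "d": 3}

# Property (26): adding "b" to the larger set ["a"] gains no more than
# adding it to the smaller (empty) set.
gain_large = coverage(sets, ["a", "b"]) - coverage(sets, ["a"])
gain_small = coverage(sets, ["b"]) - coverage(sets, [])
print(gain_large, gain_small)          # 1 2
print(greedy_budgeted(sets, cost, 3))  # ['d']
print(greedy_budgeted(sets, cost, 4))  # ['d', 'c']
```

Set coverage is a standard example of a nondecreasing submodular function, which is why it is used here in place of the news-specific influence.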

7.3. Precise Selection: News Selection for Recommendation. In a given news subclass, we observe that most news articles concentrate on a similar topic, with minor differences in the major aspects of that topic. Typically, a reader is interested in some aspects of the given subclass but not all of them. Based on this intuition, our news selection strategy can be described as follows.

Assume that $\mathcal{C}$ denotes the newly published news set, $\mathcal{S}$ represents the selected news set, and $\varsigma$ denotes the news article being selected. After selecting a piece of news $\varsigma$, we must ensure that

(i) the topic diversity should not deviate much in $\mathcal{S}$;
(ii) $\mathcal{S}$ should give more satisfaction to the given user;
(iii) $\mathcal{S}$ should be similar to the general topic in $\mathcal{C} \setminus \mathcal{S}$.

For each of the above strategies, similar to [3], we define a quality function $q(\mathcal{S})$ to evaluate the value of the currently selected news set $\mathcal{S}$ as follows:

$$q(\mathcal{S}) = \frac{1}{\binom{|\mathcal{S}|}{2}} \sum_{n_1, n_2 \in \mathcal{S},\, n_1 \neq n_2} -\mathrm{sim}(n_1, n_2) + \frac{1}{|\mathcal{S}|} \sum_{n_1 \in \mathcal{S}} \mathrm{sim}(u, n_1) + \frac{1}{|\mathcal{C} \setminus \mathcal{S}| \cdot |\mathcal{S}|} \sum_{n_1 \in \mathcal{C} \setminus \mathcal{S}} \sum_{n_2 \in \mathcal{S}} \mathrm{sim}(n_1, n_2), \tag{27}$$

where $n_1$ and $n_2$ denote news items, $u$ denotes the given user, and the function $\mathrm{sim}(\cdot, \cdot)$ returns the similarity of its two parameters. Equation (27) contains three components corresponding to the news selection strategy we list above; $q(\mathcal{S})$ balances the contribution of the different components. Suppose $\varsigma$ is the candidate news document; the quality increase can be represented as

$$I(\varsigma) = q(\mathcal{S} \cup \{\varsigma\}) - q(\mathcal{S}). \tag{28}$$
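The quality function (27) and the quality increase (28) can be sketched as follows. This is a minimal illustration under several assumptions: the first sum is taken over unordered pairs (matching the binomial normalizer), $q$ of the empty set is taken to be 0, empty sums contribute 0, and the similarity is a dot product over hypothetical two-dimensional unit vectors. None of the names come from the original system.

```python
from itertools import combinations

def quality(selected, candidates, user, sim):
    """Eq. (27): diversity + user satisfaction + coverage of the unselected news."""
    if not selected:
        return 0.0  # assumption: q of the empty set is 0
    rest = [n for n in candidates if n not in selected]
    pairs = list(combinations(selected, 2))
    diversity = sum(-sim(a, b) for a, b in pairs) / len(pairs) if pairs else 0.0
    satisfaction = sum(sim(user, n) for n in selected) / len(selected)
    cover = 0.0
    if rest:
        total = sum(sim(a, b) for a in rest for b in selected)
        cover = total / (len(rest) * len(selected))
    return diversity + satisfaction + cover

def greedy_select(candidates, user, sim, budget):
    """Pick up to `budget` items, each time maximizing I = q(S + [x]) - q(S), Eq. (28)."""
    selected = []
    while len(selected) < budget:
        remaining = [n for n in candidates if n not in selected]
        if not remaining:
            break
        base = quality(selected, candidates, user, sim)
        best = max(remaining,
                   key=lambda n: quality(selected + [n], candidates, user, sim) - base)
        selected.append(best)
    return selected

# Hypothetical toy vectors: "a" and "b" are duplicates that match the user,
# "c" is unrelated to both.
vectors = {"u": (1, 0), "a": (1, 0), "b": (1, 0), "c": (0, 1)}

def dot_sim(x, y):
    return sum(p * q for p, q in zip(vectors[x], vectors[y]))

print(quality(["a"], ["a", "b", "c"], "u", dot_sim))     # 1.5
print(greedy_select(["a", "b", "c"], "u", dot_sim, 2))   # ['a', 'c']
```

In the toy run, the second pick is "c" rather than the duplicate "b": adding "b" would lower the diversity term by more than it adds to user satisfaction.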

The goal is to select a list of recommended news documents which provides the largest possible value within the budget (i.e., the budget can be regarded as the maximum number of articles in the recommended list). We can obtain a list of news documents for each subclass by adopting the greedy selection algorithm. Taking into account the other characteristics of news documents, for example, the popularity and the recency, the ranking of the selected news articles needs to be adjusted in order to make the recommended results more reasonable. Formally, given a news article $n$, the popularity and the recency can be combined as

$$n_\phi = \frac{H_{k_n} - H_{\min}}{H_{\max} - H_{\min}} - \frac{\mathrm{Rec}(n) - \mathrm{Rec}_{\min}}{\mathrm{Rec}_{\max} - \mathrm{Rec}_{\min}}, \tag{29}$$

where $H_{k_n}$ denotes the popularity degree of the subclass to which the news article $n$ belongs, and $\mathrm{Rec}(n)$ can be obtained from (24). From the equation above, we note that the smaller the recency is, the higher the article is ranked. Besides, the greater the popularity is, the higher the article is ranked. After computing the $n_\phi$ values of the recommended articles, we apply a quicksort algorithm to these articles according to $n_\phi$. With this adjustment, the generated ranking emphasizes popularity and freshness while still concentrating on news documents that satisfy the user's preference.
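A small sketch of the re-ranking step in Equation (29) is given below. It assumes that the minimum and maximum values are taken over the list being ranked, which the text does not state explicitly, and it uses Python's built-in `sorted` instead of a hand-written quicksort; the article tuples are hypothetical.

```python
def rank_by_phi(articles):
    """Rank (article_id, popularity, recency) tuples by n_phi of Eq. (29), best first."""
    pops = [p for _, p, _ in articles]
    recs = [r for _, _, r in articles]

    def norm(x, lo, hi):
        # Min-max normalization; a constant column maps to 0 (assumption).
        return (x - lo) / (hi - lo) if hi > lo else 0.0

    scored = [
        (norm(p, min(pops), max(pops)) - norm(r, min(recs), max(recs)), aid)
        for aid, p, r in articles
    ]
    ordered = sorted(scored, key=lambda t: t[0], reverse=True)
    return [aid for _, aid in ordered]

articles = [
    ("stale_minor", 0, 4.0),
    ("fresh_minor", 0, 1.0),
    ("stale_popular", 10, 4.0),
    ("fresh_popular", 10, 0.0),
]
print(rank_by_phi(articles))
# ['fresh_popular', 'stale_popular', 'fresh_minor', 'stale_minor']
```

The scores in this example are 1, 0, -0.25, and -1, so a popular subclass and a recent publication time both push an article up the list.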

8. Experimental Evaluation

In this section, we provide a comprehensive experimental evaluation to show the efficacy of our proposed news recommendation system. We start with an introduction to a real-world collection obtained from SINA, a news and microblog service website. After that, we describe the experimental design and show the results based on the recommendation framework introduced in this paper.


Table 2: Recommendation Micro-F1 (Top30) of different time periods for different classification-based systems.

| Range (Y/M) | No. of articles | NB | Cheng | Z. Guo | NEMAH |
|---|---|---|---|---|---|
| 0908–0908 | 4239 | 0.204 | 0.242 | 0.270 | 0.351 |
| 0910–0912 | 37910 | 0.206 | 0.254 | 0.268 | 0.364 |
| 1001–1006 | 75047 | 0.227 | 0.289 | 0.297 | 0.403 |
| 1007–1107 | 151995 | 0.198 | 0.271 | 0.274 | 0.371 |
| 0908–1208 | 280737 | 0.210 | 0.273 | 0.284 | 0.383 |

Remark: the second column gives the number of news articles. Time range 0908 denotes August 2009.

8.1. Real-World Data Set. For experiments, we gather the news data from SINA (http://news.sina.com.cn), where the data collection ranges from August 1, 2009 to August 31, 2012. We also gather the users who comment on these articles and their microblogs from SINA (http://weibo.com) and preprocess the data by removing microblog messages that are too short (i.e., fewer than 3 words) and the nonactive users (i.e., the users who tweeted or retweeted fewer than 10 messages) for verifying our recommendation performance. After preprocessing, 5,127 users are stored with 124,301 messages and 280,737 news articles.
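The two filtering rules above can be sketched as follows. The sketch assumes that messages are already word-segmented and separated by whitespace, and that user activity is counted before short messages are removed; the text does not specify either point, and the data and function name are hypothetical.

```python
def preprocess(messages_by_user, min_words=3, min_messages=10):
    """Drop nonactive users, then drop messages shorter than `min_words` words."""
    kept = {}
    for user, messages in messages_by_user.items():
        if len(messages) < min_messages:
            continue  # nonactive user: fewer than 10 tweets or retweets
        kept[user] = [m for m in messages if len(m.split()) >= min_words]
    return kept

data = {
    "active": ["a b c"] * 9 + ["hi"],  # 10 messages, one of them too short
    "quiet": ["a b c d"] * 9,          # only 9 messages
}
print(preprocess(data))  # keeps "active" with its 9 long messages
```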

8.2. Experiments. Our system has four major components: (1) a module responsible for classifying and clustering news articles; (2) a component for constructing and updating user profiles; (3) hot news subclass prediction based on time-series analysis; and (4) a recommendation component using news clusters and user profiles, accompanied by the subclass popularity factor and recency. From the experimental perspective, we first verify our components. We then compare our system with the state-of-the-art approaches and design a user study.

8.2.1. Classification and Clustering Evaluation. In order to evaluate the performance of the classification and clustering component, we design two experiments.

(1) Classification Comparison. Many classification methods have been proposed in the field of text processing over the past decade. We implement the following three: the method of Cheng et al. [33], the method of Guo et al. [34], and the Naïve Bayesian (NB) method. Cheng et al. proposed a text classification method based on refining the concept index, and Guo et al. employed a genetic algorithm for classification. Before using the classification module, we must set $\alpha$ in (8) and decide the threshold of feature selection through an offline experiment, as shown in Figure 6, where T-10 denotes that threshold = 10 in feature selection and the F-score is Micro-F1. The performance is best at roughly $\alpha = 0.2$. From the result, we also observe that the thresholds 20, 30, and 40 produce similar results, so we use $T = 20$ in our processing.

Table 2 lists the recommendation evaluation results from different classifications. Based on the comparison, our proposed method outperforms the classical Naïve Bayesian method and the methods of Cheng et al. and Guo et al. in terms of the F1 measure. A straightforward explanation for the improvement is that our method represents news articles with fewer features, selected by the method we proposed, and implements a series of two-class classifications to alleviate the similarity problem between different classes. The most important reason may be that we incorporate the key persons, which are classified manually, into the method.

Figure 6: $\alpha$ parameter selection via classification (curves T-10, T-20, T-30, and T-40; x-axis: $\alpha$; y-axis: F1 score). Remark: the $y$-axis is the F1 measure score of our classification using different $\alpha$.

(2) Clustering Comparison. In reality, we need to cluster the news articles into subclasses every day, even every hour. From our spider software, we know that thousands of news articles arrive per day. $K$-means and hierarchical clustering are the most common clustering algorithms. In order to verify our proposed method, we design the experiment as follows: (1) use 500, 1000, and 1500 as the number of newly published articles for processing; (2) for each scale of dataset, implement classification on these data; (3) perform $K$-means, hierarchical clustering, and our proposed clustering method on these data; (4) perform Top30 news recommendation; and (5) compute the F1 score for the different clustering-based systems. The comparison of recommendation on different subclass clustering methods is shown in Figure 7.

From the experimental result, we have the following observations: (1) NEMAH achieves a better result than the other methods in terms of F1 score; (2) NEMAH is more stable than the other methods. A straightforward explanation might be that $K$-means clustering needs an initial clustering center for each cluster. Besides, with fewer parameters, our proposed method has stronger generalization and learning ability, without requirements on the size and distribution of the text corpus.

8.2.2. User Profile and Subclass Popularity Prediction Evaluation. The user profile is an important factor in a recommendation system and can affect the recommendation result significantly. Our user profile construction includes the following factors: content, place name, and key person. Prior approaches often use the content or similar access patterns to construct the user profile. SCENE [3] used the content, similar access patterns, and entities, which are extracted by GATE [35]. In reality, entities such as place names and key person names are relatively stable for a period. Figure 8 shows the results of using different user profile building methods and subclass popularity prediction methods.

Figure 7: Recommendation performance of different data scales for different clustering-based systems (K-means, Hierarchical, and NEMAH; y-axis: F1 score). Remark: $n$ denotes the number of news articles for clustering ($n$ = 500, 1000, and 1500).

Figure 8: Recommendation F1 score of different profile factors and subclass popularity prediction methods (C, C + GATE, C + P + K, C + P + K + E, and C + P + K + S at T5, T10, T20, and T30; y-axis: F1 score). Remark: C: content; P: place name; K: key person name; GATE: entity names using the GATE tool; E: popularity prediction using three-time exponential smoothing; and S: popularity prediction using spectral analysis (employed by NEMAH).

From this result, we observe the following. (1) Our method performs better than using GATE. (2) Recommendation using content only cannot perform well because microblog messages do not contain much content. (3) The spectral analysis employed in subclass popularity prediction can be better than the three-time exponential smoothing method. Although the average performance of spectral analysis is better than that of three-time exponential smoothing, in our other work on time series analysis we found that some subclasses' cycle diagrams have weaker frequency signals, which tends to cause overfitting with a large number of sine-cosine pairs and yields worse results in these subclasses. SCENE [3] also used the popular degree, which is computed as the ratio of the number of users accessing the article to the size of the users' pool. However, this method contradicts freshness: the freshest news may have received only a few clicks.

Table 3: Diversity evaluation on different recommendation lists.

| Methods | T10 | T20 | T30 |
|---|---|---|---|
| Goo | 0.5104 | 0.4320 | 0.1215 |
| ClickB | 0.5231 | 0.4457 | 0.1587 |
| Bilinear | 0.5024 | 0.3547 | 0.1478 |
| Bandit | 0.6112 | 0.3874 | 0.2674 |
| SCENE | 0.6821 | 0.5747 | 0.5687 |
| NEMAH | 0.7425 | 0.6941 | 0.6637 |

Remark: T$n$ denotes the recommended result with top-$n$.

8.2.3. Diversity Evaluation. The recommended news list of NEMAH shows great diversity on both class and subclass aspects. Let $R(u)$ be the recommended news list of a user $u$; the diversity of $u$ can be defined as follows:

$$\mathrm{Diversity} = 1 - \frac{\sum_{i, j \in R(u),\, i \neq j} \mathrm{sim}(i, j)}{(1/2)\, |R(u)|\, (|R(u)| - 1)}, \tag{30}$$

where $i$ and $j$ are two different news articles in the recommendation list for user $u$, and $\mathrm{sim}(i, j)$ denotes the news profile similarity between news items $i$ and $j$. For this metric evaluation, we choose Goo [11] (a collaborative filtering based method), ClickB [4] (a content-based method), Bilinear [14], Bandit [15], and SCENE [3] (a hybrid method using LSH for clustering and a greedy algorithm for news selection) as the comparison baselines. Table 3 shows the diversity result with $|R(u)| = 10$, $|R(u)| = 20$, and $|R(u)| = 30$, in which we use T10 to represent $|R(u)| = 10$.
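For illustration, the diversity metric (30) can be computed as in the sketch below. It assumes that the sum runs over unordered pairs, so that the denominator $(1/2)|R(u)|(|R(u)|-1)$ is exactly the number of pairs, and it returns 0 for lists with fewer than two items; both choices are assumptions, and the similarity table is hypothetical.

```python
from itertools import combinations

def diversity(rec_list, sim):
    """Eq. (30): one minus the average pairwise similarity of the list."""
    n = len(rec_list)
    if n < 2:
        return 0.0  # assumption: undefined for fewer than two items
    total = sum(sim(a, b) for a, b in combinations(rec_list, 2))
    return 1.0 - total / (0.5 * n * (n - 1))

# Hypothetical pairwise similarities: "x" and "y" are near-duplicates.
table = {frozenset(("x", "y")): 1.0}

def pair_sim(a, b):
    return table.get(frozenset((a, b)), 0.0)

print(diversity(["x", "y"], pair_sim))       # 0.0, the two items are identical
print(diversity(["x", "y", "z"], pair_sim))  # one similar pair out of three
```

A list of mutually dissimilar articles scores 1, and a list of identical articles scores 0, which matches the reading of Table 3 where higher values mean more diverse recommendations.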

From Table 3, we can see that our system outperforms the others significantly. The straightforward reason is that we diversify the news not only according to the preference of the user but also according to the distribution of candidate news articles. As the recommendation list is enlarged, the diversity of the baseline methods decreases significantly because they rely too much on the preference of the user.

8.2.4. System Accuracy Evaluation. In order to verify the effectiveness of our proposed NEMAH, we implement a recommender system that models the recommendation as a contextual bandit problem [15]. We also implement the SCENE [3] prototype system, which employs LSH (Locality Sensitive Hashing) for news clustering and greedy selection for user recommendation. For each method, we select 50 users and provide news recommendation results for them. Figure 9 shows the comparison results for the Top 10, Top 20, and Top 30 news items for each user.

Figure 9: Precision-recall plot of different recommendations: (a) Top 10, (b) Top 20, and (c) Top 30. Remark: ∘ (green) denotes the bandit-based recommender, ◻ (blue) denotes the SCENE recommender, and ∗ (red) represents NEMAH.

In the above experiments, we can observe that, besides the higher accuracy, the distribution of our system is more stable than that of the other approaches.

In reality, if users read only a few news articles every day, many news recommendation systems cannot produce good results for these users. Our system can address this problem thanks to the microblog user profile building. Figure 10 shows the comparison results for different user groups over all users (5,127 users). Suppose a reader reads $N$ news articles per day. From this figure, we can see that our proposed system produces a reasonable result for nonactive users. SCENE also produces a fairly good result. The reason is that NEMAH and SCENE consider the named entities referred to by users. Besides, NEMAH takes into account the popular degree of a news article.

8.2.5. A User Study on NEMAH. In order to obtain further evaluation metrics to verify our proposed news recommendation system, we develop a prototype system of NEMAH and design a questionnaire which includes the following questions: (1) satisfaction with news content; (2) ordering of the recommendation list; (3) preference of the news subclasses; (4) popularity of news articles; and (5) novelty of the recommendation list. For each question, we define 5 indexes for selection, where 1: So Bad; 2: Not Very Well; 3: Average; 4: Good but Needs to Improve; and 5: Excellent. We crawl news articles of the latest three days from several famous news websites as a candidate set for recommendation. Finally, we hire 50 volunteers, who are required to have a microblog account on the SINA website, to help us complete the questionnaire. We send them the same questionnaire with different recommendation lists every week, three times in total. The average result of this user study is shown in Figure 11. From the result of the user study, we can see that NEMAH satisfies the requirements represented by our questions for most people.

Figure 10: Comparison of F1 score of different approaches (Goo, ClickB, Bandit, SCENE, and NEMAH) for different user groups ($N < 10$, $10 \le N < 30$, $30 \le N < 50$, and $N \ge 50$; y-axis: F1 score). Remark: $N$ denotes the number of news articles per day which is read by a user.

Figure 11: User study on different metrics (Satisfaction, Ordering, Preference, Popular, and Novelty; y-axis: average number of score; series: ClickB and NEMAH).

9. Conclusion

In this paper, we proposed the NEMAH system architecture to tackle personalized news recommendation based on microblog and subclass popularity prediction. We explore the intrarelations among microblog content and news items and consider the subclass popularity factor, similarity among users, place, and key person factors synthetically. Our system supports effective classification and subclass clustering on newly published news articles along with a small historical corpus. High-quality classification and clustering can construct a better data structure for recommending. Experimental results compared with some state-of-the-art algorithms have demonstrated the better performance of NEMAH. Besides, our work in Sections 4 and 5 can be utilized for automatic module layout and channel ranking.

For future work, to process mass network news articles, we plan to deploy some components (e.g., classification, clustering, and subclass popularity analysis) onto the Map-Reduce framework on our distributed system. Moreover, we also plan to integrate the subclass popularity prediction module into our news search engine due to its effectiveness in our work. Another remarkable point is the interest evolution of users (e.g., time, place, and other factors), which is able to provide insights on the exploration of news reading behaviors.

Disclosure

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee, provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers, or to redistribute to lists requires prior specific permission and/or a fee.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

References

[1] O. Phelan, K. McCarthy, and B. Smyth, “Using twitter to recommend real-time topical news,” in Proceedings of the 3rd ACM Conference on Recommender Systems, pp. 385–388, ACM, October 2009.

[2] G. de Francisci Morales, A. Gionis, and C. Lucchese, “From chatter to headlines: harnessing the real-time web for personalized news recommendation,” in Proceedings of the 5th ACM International Conference on Web Search and Data Mining (WSDM ’12), pp. 153–162, ACM, February 2012.

[3] L. Li, D. Wang, T. Li, D. Knox, and B. Padmanabhan, “SCENE: a scalable two-stage personalized news recommendation system,” in Proceedings of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR ’11), pp. 125–134, July 2011.

[4] J. Liu, P. Dolan, and E. R. Pedersen, “Personalized news recommendation based on click behavior,” in Proceedings of the 15th ACM International Conference on Intelligent User Interfaces (IUI ’10), pp. 31–40, ACM, February 2010.

[5] J. Schafer, J. Konstan, and J. Riedi, “Recommender systems in e-commerce,” in Proceedings of the 1st ACM Conference on Electronic Commerce, pp. 158–166, ACM, 1999.


[6] D. Billsus and M. J. Pazzani, “Personal news agent that talks, learns and explains,” in Proceedings of the 3rd Annual Conference on Autonomous Agents, pp. 268–275, ACM, May 1999.

[7] M. Pazzani and D. Billsus, “The identification of interesting web sites,” Machine Learning, vol. 27, no. 3, pp. 313–331, 1997.

[8] U. Shardanand and P. Maes, “Social information filtering: algorithms for automating ‘word of mouth’,” in Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pp. 210–217, ACM, May 1995.

[9] R. G. Cota, A. A. Ferreira, C. Nascimento, M. A. Goncalves, and A. H. F. Laender, “An unsupervised heuristic-based hierarchical method for name disambiguation in bibliographic citations,” Journal of the American Society for Information Science and Technology, vol. 61, no. 9, pp. 1853–1870, 2010.

[10] J. Breese, D. Heckerman, and C. Kadie, “Empirical analysis of predictive algorithms for collaborative filtering,” in Proceedings of the 14th Conference on Uncertainty in Artificial Intelligence, pp. 43–52, Morgan Kaufmann Publishers, 1998.

[11] A. S. Das, M. Datar, A. Garg, and S. Rajaram, “Google news personalization: scalable online collaborative filtering,” in Proceedings of the 16th International World Wide Web Conference (WWW ’07), pp. 271–280, ACM, May 2007.

[12] M. Balabanovic and Y. Shoham, “Content-Based Collaborative Recommendation,” Communications of the ACM, vol. 40, no. 3, pp. 66–72, 1997.

[13] M. Claypool, A. Gokhale, T. Miranda, P. Murnikov, D. Netes, and M. Sartin, “Combining content-based and collaborative filters in an online newspaper,” in Proceedings of the ACM SIGIR Workshop on Recommender Systems, vol. 60, Citeseer, 1999.

[14] W. Chu and S. T. Park, “Personalized recommendation on dynamic content using predictive bilinear models,” in Proceedings of the 18th International Conference on World Wide Web, pp. 691–700, ACM, 2009.

[15] L. Li, W. Chu, J. Langford, and R. E. Schapire, “A contextual-bandit approach to personalized news article recommendation,” in Proceedings of the 19th International Conference World Wide Web (WWW ’10), pp. 661–670, ACM, April 2010.

[16] C. Best, E. van der Goot, M. de Paola, T. Garcia, and D. Horby, “Europe media monitor - emm,” JRC Technical Note no. I 2, 2002.

[17] E. Gabrilovich, S. Dumais, and E. Horvitz, “Newsjunkie: providing personalized newsfeeds via analysis of information novelty,” in Proceedings of the 13th International Conference World Wide Web (WWW ’04), pp. 482–490, ACM, May 2004.

[18] G. Wang, F. H. Lochovsky, and Q. Yang, “Feature selection with conditional mutual information MaxiMin in text categorization,” in Proceedings of the 13th ACM Conference on Information and Knowledge Management (CIKM ’04), pp. 342–349, ACM, November 2004.

[19] C. Lee and G. G. Lee, “Information gain and divergence-based feature selection for machine learning-based text categorization,” Information Processing and Management, vol. 42, no. 1, pp. 155–165, 2006.

[20] A. M. A. Mesleh, “Chi square feature extraction based svms arabic language text categorization system,” Journal of Computer Science, vol. 3, no. 6, pp. 430–435, 2007.

[21] Z. Wei, D. Miao, J. H. Chauchat, and C. Zhong, “Feature selection on chinese text classification using character N-grams,” in Rough Sets and Knowledge Technology, vol. 5009 of Lecture Notes in Computer Science, pp. 500–507, Springer, Berlin, Germany, 2008.

[22] Y. S. Lai and C. H. Wu, “Meaningful term extraction and discriminative term selection in text categorization via unknown-word methodology,” ACM Transactions on Asian Language Information Processing (TALIP), vol. 1, no. 1, pp. 34–64, 2002.

[23] R. Rifkin and A. Klautau, “In defense of one-vs-all classification,” The Journal of Machine Learning Research, vol. 5, pp. 101–141, 2004.

[24] A. E. Hoerl and R. W. Kennard, “Ridge regression: biased estimation for nonorthogonal problems,” Technometrics, vol. 42, no. 1, pp. 80–86, 2000.

[25] F. Wang and C. Zhang, “Feature extraction by maximizing the average neighborhood margin,” in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR ’07), June 2007.

[26] X. Zhang, W. Wang, K. Nørvag, and M. Sebag, “K-AP: generating specified K clusters by efficient affinity propagation,” in Proceedings of the 10th IEEE International Conference on Data Mining (ICDM ’10), pp. 1187–1192, IEEE, December 2010.

[27] J. D. Hamilton, Time Series Analysis, vol. 2, Cambridge University Press, Cambridge, UK, 1994.

[28] F. Abel, Q. Gao, G.-J. Houben, and K. Tao, “Analyzing user modeling on Twitter for personalized news recommendations,” in User Modeling, Adaption and Personalization, vol. 6787 of Lecture Notes in Computer Science, pp. 1–12, Springer, Berlin, Germany, 2011.

[29] S. Gauch, M. Speretta, A. Chandramouli, and A. Micarelli, “User profiles for personalized information access,” The Adaptive Web, Springer, Berlin, Germany, vol. 4321, pp. 54–89, 2007.

[30] G. L. Nemhauser, L. A. Wolsey, and M. L. Fisher, “An analysis of approximations for maximizing submodular set functions-I,” Mathematical Programming, vol. 14, no. 1, pp. 265–294, 1978.

[31] S. Khuller, A. Moss, and J. Naor, “The budgeted maximum coverage problem,” Information Processing Letters, vol. 70, no. 1, pp. 39–45, 1999.

[32] J. Leskovec, A. Krause, C. Guestrin, C. Faloutsos, J. Vanbriesen, and N. Glance, “Cost-effective outbreak detection in networks,” in Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 420–429, ACM, August 2007.

[33] X. Cheng, S. Tan, and L. Tang, “Using dragpushing to refine concept index for text categorization,” Journal of Computer Science and Technology, vol. 21, no. 4, pp. 592–596, 2006.

[34] Z. Guo, L. Lu, S. Xi, and F. Sun, “An effective dimension reduction approach to chinese document classification using genetic algorithm,” in Advances in Neural Networks - ISNN 2009, vol. 5552 of Lecture Notes in Computer Science, pp. 480–489, Springer, Berlin, Germany, 2009.

[35] H. Cunningham, D. Maynard, K. Bontcheva, and V. Tablan, “Gate: an architecture for development of robust hlt applications,” in Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, pp. 168–175, Association for Computational Linguistics, 2002.


10 The Scientific World Journal

Table 2: Recommendation Micro-F1 (Top 30) of different time periods for different classification-based systems.

Range (YM)    #         NB      Cheng   Z. Guo  NEMAH
0908-0908     4239      0.204   0.242   0.270   0.351
0910-0912     37910     0.206   0.254   0.268   0.364
1001-1006     75047     0.227   0.289   0.297   0.403
1007-1107     151995    0.198   0.271   0.274   0.371
0908-1208     280737    0.210   0.273   0.284   0.383

Remark: # denotes the number of news articles. Time range 0908 denotes August 2009.

8.1. Real-World Data Set. For experiments, we gather the news data from SINA (http://news.sina.com.cn), where the data collection ranges from August 1, 2009 to August 31, 2012. We also gather the users who comment on these articles and their microblogs from SINA (http://weibo.com) and preprocess the data by removing microblog messages that are too short (i.e., fewer than 3 words) and the nonactive users (i.e., users who tweeted or retweeted fewer than 10 messages) for verifying our recommendation performance. After preprocessing, 5,127 users are stored with 124,301 messages and 280,737 news articles.
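The two preprocessing filters described above can be sketched as follows. This is only a minimal illustration under stated assumptions: the function name `preprocess`, the `(user_id, text)` input layout, whitespace tokenization (Chinese text would need a word segmenter), and counting a user's activity over the raw, unfiltered messages are all choices made here for illustration and are not specified in the paper.

```python
from collections import Counter

MIN_WORDS = 3      # messages with fewer words are treated as too short
MIN_MESSAGES = 10  # users with fewer messages are treated as nonactive


def preprocess(messages):
    """Filter a list of (user_id, text) pairs.

    Assumption: user activity is counted over the raw messages, and words
    are obtained by whitespace splitting (a stand-in for real segmentation).
    """
    activity = Counter(user for user, _ in messages)
    return [
        (user, text)
        for user, text in messages
        if activity[user] >= MIN_MESSAGES and len(text.split()) >= MIN_WORDS
    ]
```

For example, a user with ten three-word messages and one single-word message keeps the ten longer messages, while a user with only three messages is dropped entirely.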

8.2. Experiments. Our system has four major components: (1) a module responsible for classifying and clustering news articles; (2) a component for constructing and updating user profiles; (3) hot news subclass prediction based on time-series analysis; and (4) a recommendation component using news clusters and user profiles, accompanied by the subclass popularity factor and recency. From the experimental perspective, we first verify our components individually. We then compare our system with state-of-the-art approaches and conduct a user study.

8.2.1. Classification and Clustering Evaluation. In order to evaluate the performance of the classification and clustering component, we design two experiments.

(1) Classification Comparison. Many classification methods have been proposed over the past decade in the field of text processing. We implement the following three classification methods: the method of Cheng et al. [33], the method of Guo et al. [34], and the Naïve Bayesian (NB) method. Cheng et al. proposed a text classification method based on refining the concept index, and Guo et al. employed a genetic algorithm for classification. Before using the classification module, we must set α in (8) and decide the threshold of feature selection through an offline experiment, as shown in Figure 6, where T-10 denotes that threshold = 10 in feature selection and F-score is Micro-F1. The performance is best at roughly α = 0.2. From the result, we also observe that thresholds of 20, 30, and 40 produce similar results, so we use T = 20 in our processing.

Figure 6: α parameter selection via classification (x-axis: α; y-axis: F1 score; series: T-10, T-20, T-30, and T-40). Remark: the y-axis is the F1 measure score of our classification using different α.

Table 2 lists the recommendation evaluation results for the different classification methods. The comparison shows that our proposed method outperforms the classical Naïve Bayesian method and the methods of Cheng et al. and Guo et al. in terms of F1 measure. A straightforward explanation for the improvement is that our method represents news articles with fewer features, selected by our proposed feature selection method, and applies a series of two-class classifications to mitigate the similarity problem between different classes. The most important reason may be that we incorporate manually classified key persons into the method.
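For readers who want to reproduce a score of the kind reported in Table 2, the following is a minimal sketch of Micro-F1 for top-N recommendation lists. It assumes the standard micro-averaged definition, in which hits and misses are pooled over all users before precision and recall are computed; the paper does not spell out its exact computation, and the function name `micro_f1` and the dictionary layout are illustrative.

```python
def micro_f1(recommended, relevant):
    """Micro-averaged F1 over users.

    recommended, relevant: dicts mapping a user id to a collection of
    article ids (the top-N list and the articles the user actually read).
    """
    tp = fp = fn = 0
    for user in set(recommended) | set(relevant):
        rec = set(recommended.get(user, ()))
        rel = set(relevant.get(user, ()))
        tp += len(rec & rel)   # recommended and read
        fp += len(rec - rel)   # recommended but not read
        fn += len(rel - rec)   # read but not recommended
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

Pooling the counts means that users with longer reading histories weigh more heavily than they would in a per-user (macro) average.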

(2) Clustering Comparison. In reality, we need to cluster the news articles into subclasses every day, or even every hour. From our spider software, we know that thousands of news articles arrive per day. K-means and hierarchical clustering are the most common clustering algorithms. In order to verify our proposed method, we design the experiment as follows: (1) use 500, 1000, and 1500 as the number of newly published articles for processing; (2) for each scale of dataset, perform classification on these data; (3) perform K-means, hierarchical clustering, and our proposed clustering method on these data; (4) perform Top 30 news recommendation; and (5) compute the F1 score for the different clustering-based systems. The comparison of recommendation performance for the different subclass clustering methods is shown in Figure 7.

From the experimental result, we have the following observations: (1) NEMAH achieves a better F1 score than the other methods; (2) NEMAH is more stable than the other methods. A straightforward explanation might be that K-means clustering needs an initial clustering center for each cluster. Besides, with fewer parameters, our proposed method has stronger generalization and learning ability and does not require prior knowledge of the size and distribution of the text corpus.
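The dependence of K-means on its initial centers, offered above as an explanation for its instability, can be seen on a toy example. The sketch below is not the experiment from the paper: it runs plain Lloyd iterations on hypothetical one-dimensional data, and the function name `kmeans_1d` and the data points are chosen only for illustration.

```python
def kmeans_1d(points, centers, max_iter=100):
    """Plain Lloyd iterations on 1-D data; returns (centers, inertia)."""
    centers = list(centers)
    for _ in range(max_iter):
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda k: abs(p - centers[k]))
            clusters[nearest].append(p)
        # Keep the old center when a cluster ends up empty.
        new_centers = [
            sum(c) / len(c) if c else centers[k] for k, c in enumerate(clusters)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    # Inertia: sum of squared distances to the nearest center.
    inertia = sum(min((p - c) ** 2 for c in centers) for p in points)
    return centers, inertia


points = [0, 1, 2, 10, 11, 12, 20, 21, 22]
spread_init = kmeans_1d(points, [1, 11, 21])   # one seed per natural group
crowded_init = kmeans_1d(points, [0, 1, 2])    # all seeds in the first group
```

With the spread-out seeds the three natural groups are recovered, whereas the crowded seeds converge to a much worse local optimum in which two of the groups are merged, so the outcome depends on the initial centers rather than only on the data.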

Figure 7: Recommendation performance of different data scales for different clustering-based systems (x-axis: K-means, Hierarchical, NEMAH; y-axis: F1 score; series: n = 500, n = 1000, n = 1500). Remark: n denotes the number of news articles for clustering.

8.2.2. User Profile and Subclass Popularity Prediction Evaluation. The user profile is an important factor in a recommendation system and can affect the recommendation result significantly. Our user profile construction includes the following factors: content, place name, and key person. Prior approaches often use the content or similar access patterns to construct the user profile. SCENE [3] used the content, similar access patterns, and entities extracted by GATE [35]. In reality, entities such as place names and key person names are relatively stable over a period. Figure 8 shows the results of using different user profile building methods and subclass popularity prediction methods.

Figure 8: Recommendation F1 score of different profile factors and subclass popularity prediction methods (x-axis: T5, T10, T20, T30; y-axis: F1 score; series: C, C + GATE, C + P + K, C + P + K + E, C + P + K + S). Remark: C: content; P: place name; K: key person name; GATE: entity names extracted using the GATE tool; E: popularity prediction using three-time exponential smoothing; and S: popularity prediction using spectral analysis (employed by NEMAH).

From this result, we observe the following. (1) Our method performs better than using GATE. (2) Recommendation using content only cannot perform well, because microblog messages do not contain much content. (3) The spectral analysis employed in subclass popularity prediction performs better than the three-time exponential smoothing method. Although the average performance of spectral analysis is better than that of three-time exponential smoothing, in our other work on time series analysis we found that the cycle diagrams of some subclasses have weak frequency signals, which tend to cause overfitting with a large number of sine-cosine pairs and yield worse results for these subclasses. SCENE [3] also used a popularity degree, computed as the ratio of the number of users accessing the article to the size of the user pool. However, this method conflicts with freshness: the freshest news may have received only a few clicks.

Table 3: Diversity evaluation on different recommendation lists.

Methods     T10      T20      T30
Goo         0.5104   0.4320   0.1215
ClickB      0.5231   0.4457   0.1587
Bilinear    0.5024   0.3547   0.1478
Bandit      0.6112   0.3874   0.2674
SCENE       0.6821   0.5747   0.5687
NEMAH       0.7425   0.6941   0.6637

Remark: T at n denotes the recommended result with top-n.
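The sine-cosine pairs mentioned in observation (3) can be illustrated with a generic harmonic fit. The sketch below is not NEMAH's implementation: it estimates Fourier coefficients of a popularity series by direct projection, keeps the `num_pairs` strongest sine-cosine pairs, and extrapolates them periodically. The names `fit_harmonics` and `predict`, and the choice to rank pairs by squared amplitude, are assumptions made for illustration.

```python
import math


def fit_harmonics(series, num_pairs):
    """Keep the num_pairs strongest sine-cosine pairs of a series."""
    n = len(series)
    mean = sum(series) / n
    candidates = []
    # Frequencies k = 1 .. (n - 1) // 2 (the Nyquist term is skipped).
    for k in range(1, (n - 1) // 2 + 1):
        a = 2.0 / n * sum(
            y * math.cos(2 * math.pi * k * t / n) for t, y in enumerate(series)
        )
        b = 2.0 / n * sum(
            y * math.sin(2 * math.pi * k * t / n) for t, y in enumerate(series)
        )
        candidates.append((a * a + b * b, k, a, b))
    candidates.sort(reverse=True)  # strongest amplitude first
    return n, mean, [(k, a, b) for _, k, a, b in candidates[:num_pairs]]


def predict(model, t):
    """Evaluate the fitted harmonics at time index t (t >= n extrapolates)."""
    n, mean, pairs = model
    return mean + sum(
        a * math.cos(2 * math.pi * k * t / n)
        + b * math.sin(2 * math.pi * k * t / n)
        for k, a, b in pairs
    )
```

Keeping more pairs fits the observed series more closely, which is consistent with the overfitting risk noted above for subclasses whose frequency signal is weak.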

8.2.3. Diversity Evaluation. The news recommendation list of NEMAH exhibits great diversity in both class and subclass aspects. Let R(u) be the recommended news list of a user u; the diversity for u can be defined as follows:

$$\text{Diversity} = 1 - \frac{\sum_{i, j \in R(u),\, i \neq j} \operatorname{sim}(i, j)}{\frac{1}{2}\,|R(u)|\,(|R(u)| - 1)}, \tag{30}$$

where i and j are two different news articles in the recommendation list for user u and sim(i, j) denotes the news profile similarity between news items i and j. For this metric evaluation, we choose Goo [11] (a collaborative filtering based method), ClickB [4] (a content-based method), Bilinear [14], Bandit [15], and SCENE [3] (a hybrid method using LSH for clustering and a greedy algorithm for news selection) as the comparison baselines. Table 3 shows the diversity results with |R(u)| = 10, |R(u)| = 20, and |R(u)| = 30, in which we use T10 to represent |R(u)| = 10.
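The diversity measure in (30) can be computed as in the following minimal sketch. Two points are assumptions rather than statements from the paper: the sum is taken over unordered pairs, which matches the normalizer (1/2)|R(u)|(|R(u)| − 1), and cosine similarity over sparse term-weight dictionaries stands in for the news profile similarity sim(i, j). The function names are illustrative.

```python
import math
from itertools import combinations


def cosine(p, q):
    """Cosine similarity of two sparse profiles (dicts: term -> weight)."""
    dot = sum(w * q.get(term, 0.0) for term, w in p.items())
    norm_p = math.sqrt(sum(w * w for w in p.values()))
    norm_q = math.sqrt(sum(w * w for w in q.values()))
    if norm_p == 0 or norm_q == 0:
        return 0.0
    return dot / (norm_p * norm_q)


def diversity(profiles, sim=cosine):
    """Diversity of one recommendation list, following Eq. (30).

    profiles: list of news profiles in R(u). Lists with fewer than two
    items have no pairs; returning 0.0 for them is a convention chosen here.
    """
    n = len(profiles)
    if n < 2:
        return 0.0
    total = sum(sim(a, b) for a, b in combinations(profiles, 2))
    return 1.0 - total / (0.5 * n * (n - 1))
```

A list of identical articles gives a diversity of 0, and a list of mutually dissimilar articles gives a diversity of 1.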

From Table 3, we can see that our system outperforms the others significantly. The straightforward reason is that we diversify the news not only according to the preference of the user but also according to the distribution of candidate news articles. As the recommendation list is enlarged, the diversity of the baseline methods decreases significantly because they rely too much on the preference of the user.

8.2.4. System Accuracy Evaluation. In order to verify the effectiveness of our proposed NEMAH, we implement a recommender system that models recommendation as a contextual bandit problem [15]. We also implement the SCENE [3] prototype system, which employs LSH (Locality Sensitive Hashing) for news clustering and greedy selection for user recommendation. For each method, we select 50 users and provide news recommendation results for them. Figure 9 shows the comparison results for the Top 10, Top 20, and Top 30 news items for each user.

Figure 9: Precision-recall plots of different recommendations (axes: precision and recall): (a) Top 10, (b) Top 20, and (c) Top 30. Remark: ∘ (green) denotes the bandit-based recommender, ◻ (blue) denotes the SCENE recommender, and ∗ (red) represents NEMAH.

In the above experiments, we can observe that, besides achieving higher accuracy, our system has a more stable distribution of results than the other approaches.

In reality, if users read only a few news articles every day, many news recommendation systems cannot produce good results for these users. Our system can address this problem thanks to the microblog user profile building. Figure 10 shows the comparison results for different user groups over all users (5,127 users). Suppose a reader reads N news articles per day. From this figure, we can see that our proposed system achieves reasonable results even for nonactive users. SCENE also performs reasonably well. The reason is that NEMAH and SCENE consider the named entities referred to by users. Besides, NEMAH takes into account the popularity degree of a news article.
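The grouping of users by daily reading volume can be sketched as follows. The group boundaries (N < 10, 10 ≤ N < 30, 30 ≤ N < 50, N ≥ 50) are those shown on the axis of Figure 10; averaging per-user F1 scores within each group is an assumption, since the paper does not state how the group score is aggregated, and the function names are illustrative.

```python
from collections import defaultdict


def activity_group(n_per_day):
    """Map a user's daily reading count N to its Figure 10 group label."""
    if n_per_day < 10:
        return "N < 10"
    if n_per_day < 30:
        return "10 <= N < 30"
    if n_per_day < 50:
        return "30 <= N < 50"
    return "N >= 50"


def mean_f1_by_group(users):
    """users: iterable of (articles_read_per_day, f1_score) pairs."""
    totals = defaultdict(lambda: [0.0, 0])  # group -> [sum of F1, user count]
    for n_per_day, f1 in users:
        entry = totals[activity_group(n_per_day)]
        entry[0] += f1
        entry[1] += 1
    return {group: total / count for group, (total, count) in totals.items()}
```

Groups that contain no users are simply absent from the returned dictionary.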

Figure 10: Comparison of F1 score of different approaches for different user groups (x-axis groups: N < 10, 10 ≤ N < 30, 30 ≤ N < 50, N ≥ 50; y-axis: F1 score; series: Goo, ClickB, Bandit, SCENE, NEMAH). Remark: N denotes the number of news articles read by a user per day.

8.2.5. A User Study on NEMAH. In order to obtain additional evaluation metrics to verify our proposed news recommendation system, we develop a prototype system of NEMAH and design a questionnaire that includes the following questions: (1) satisfaction with news content; (2) ordering of the recommendation list; (3) preference for the news subclasses; (4) popularity of news articles; and (5) novelty of the recommendation list. For each question, we define 5 rating levels for selection, where 1: So Bad; 2: Not Very Well; 3: Average; 4: Good but Needs to Improve; and 5: Excellent. We crawl news articles of the latest three days from several famous news websites as a candidate set for recommendation. Finally, we recruit 50 volunteers, who are required to have a microblog account on the SINA website, to help us complete the questionnaire. We send them the same questionnaire with different recommendation lists every week, three times in total. The average result of this user study is shown in Figure 11. From the result of the user study, we can see that NEMAH can satisfy the requirements represented by our questions for most people.

Figure 11: User study on different metrics (x-axis: Satisfaction, Ordering, Preference, Popular, Novelty; y-axis: average score; series: ClickB, NEMAH).

9. Conclusion

In this paper, we proposed the NEMAH system architecture to tackle personalized news recommendation based on microblogs and subclass popularity prediction. We explore the intrarelations among microblog content and news items, synthetically considering the subclass popularity factor, similarity among users, and place and key person factors. Our system supports effective classification and subclass clustering of newly published news articles, using only a small historical corpus. High-quality classification and clustering construct a better data structure for recommendation. Experimental results compared with some state-of-the-art algorithms have demonstrated the better performance of NEMAH. Besides, our work in Sections 4 and 5 can be utilized for automatic module layout and channel ranking.

For future work, to process massive numbers of network news articles, we plan to deploy some components (e.g., classification, clustering, and subclass popularity analysis) onto the Map-Reduce framework on our distributed system. Moreover, we also plan to integrate the subclass popularity prediction module into our news search engine, given its effectiveness in our work. Another notable point is the evolution of user interests (e.g., with time, place, and other factors), which can provide insights into news reading behaviors.

Disclosure

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee, provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers, or to redistribute to lists requires prior specific permission and/or a fee.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

References

[1] O. Phelan, K. McCarthy, and B. Smyth, “Using twitter to recommend real-time topical news,” in Proceedings of the 3rd ACM Conference on Recommender Systems, pp. 385–388, ACM, October 2009.

[2] G. de Francisci Morales, A. Gionis, and C. Lucchese, “From chatter to headlines: harnessing the real-time web for personalized news recommendation,” in Proceedings of the 5th ACM International Conference on Web Search and Data Mining (WSDM ’12), pp. 153–162, ACM, February 2012.

[3] L. Li, D. Wang, T. Li, D. Knox, and B. Padmanabhan, “SCENE: a scalable two-stage personalized news recommendation system,” in Proceedings of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR ’11), pp. 125–134, July 2011.

[4] J. Liu, P. Dolan, and E. R. Pedersen, “Personalized news recommendation based on click behavior,” in Proceedings of the 15th ACM International Conference on Intelligent User Interfaces (IUI ’10), pp. 31–40, ACM, February 2010.

[5] J. Schafer, J. Konstan, and J. Riedl, “Recommender systems in e-commerce,” in Proceedings of the 1st ACM Conference on Electronic Commerce, pp. 158–166, ACM, 1999.

[6] D. Billsus and M. J. Pazzani, “Personal news agent that talks, learns and explains,” in Proceedings of the 3rd Annual Conference on Autonomous Agents, pp. 268–275, ACM, May 1999.

[7] M. Pazzani and D. Billsus, “The identification of interesting web sites,” Machine Learning, vol. 27, no. 3, pp. 313–331, 1997.

[8] U. Shardanand and P. Maes, “Social information filtering: algorithms for automating ‘word of mouth’,” in Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pp. 210–217, ACM, May 1995.

[9] R. G. Cota, A. A. Ferreira, C. Nascimento, M. A. Goncalves, and A. H. F. Laender, “An unsupervised heuristic-based hierarchical method for name disambiguation in bibliographic citations,” Journal of the American Society for Information Science and Technology, vol. 61, no. 9, pp. 1853–1870, 2010.

[10] J. Breese, D. Heckerman, and C. Kadie, “Empirical analysis of predictive algorithms for collaborative filtering,” in Proceedings of the 14th Conference on Uncertainty in Artificial Intelligence, pp. 43–52, Morgan Kaufmann Publishers, 1998.

[11] A. S. Das, M. Datar, A. Garg, and S. Rajaram, “Google news personalization: scalable online collaborative filtering,” in Proceedings of the 16th International World Wide Web Conference (WWW ’07), pp. 271–280, ACM, May 2007.

[12] M. Balabanovic and Y. Shoham, “Content-based collaborative recommendation,” Communications of the ACM, vol. 40, no. 3, pp. 66–72, 1997.

[13] M. Claypool, A. Gokhale, T. Miranda, P. Murnikov, D. Netes, and M. Sartin, “Combining content-based and collaborative filters in an online newspaper,” in Proceedings of the ACM SIGIR Workshop on Recommender Systems, vol. 60, Citeseer, 1999.

[14] W. Chu and S. T. Park, “Personalized recommendation on dynamic content using predictive bilinear models,” in Proceedings of the 18th International Conference on World Wide Web, pp. 691–700, ACM, 2009.

[15] L. Li, W. Chu, J. Langford, and R. E. Schapire, “A contextual-bandit approach to personalized news article recommendation,” in Proceedings of the 19th International Conference on World Wide Web (WWW ’10), pp. 661–670, ACM, April 2010.

[16] C. Best, E. van der Goot, M. de Paola, T. Garcia, and D. Horby, “Europe media monitor - EMM,” JRC Technical Note no. I 22002.

[17] E. Gabrilovich, S. Dumais, and E. Horvitz, “Newsjunkie: providing personalized newsfeeds via analysis of information novelty,” in Proceedings of the 13th International Conference on World Wide Web (WWW ’04), pp. 482–490, ACM, May 2004.

[18] G. Wang, F. H. Lochovsky, and Q. Yang, “Feature selection with conditional mutual information MaxiMin in text categorization,” in Proceedings of the 13th ACM Conference on Information and Knowledge Management (CIKM ’04), pp. 342–349, ACM, November 2004.

[19] C. Lee and G. G. Lee, “Information gain and divergence-based feature selection for machine learning-based text categorization,” Information Processing and Management, vol. 42, no. 1, pp. 155–165, 2006.

[20] A. M. A. Mesleh, “Chi square feature extraction based SVMs Arabic language text categorization system,” Journal of Computer Science, vol. 3, no. 6, pp. 430–435, 2007.

[21] Z. Wei, D. Miao, J. H. Chauchat, and C. Zhong, “Feature selection on Chinese text classification using character N-grams,” in Rough Sets and Knowledge Technology, vol. 5009 of Lecture Notes in Computer Science, pp. 500–507, Springer, Berlin, Germany, 2008.

[22] Y. S. Lai and C. H. Wu, “Meaningful term extraction and discriminative term selection in text categorization via unknown-word methodology,” ACM Transactions on Asian Language Information Processing (TALIP), vol. 1, no. 1, pp. 34–64, 2002.

[23] R. Rifkin and A. Klautau, “In defense of one-vs-all classification,” The Journal of Machine Learning Research, vol. 5, pp. 101–141, 2004.

[24] A. E. Hoerl and R. W. Kennard, “Ridge regression: biased estimation for nonorthogonal problems,” Technometrics, vol. 42, no. 1, pp. 80–86, 2000.

[25] F. Wang and C. Zhang, “Feature extraction by maximizing the average neighborhood margin,” in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR ’07), June 2007.

[26] X. Zhang, W. Wang, K. Nørvag, and M. Sebag, “K-AP: generating specified K clusters by efficient affinity propagation,” in Proceedings of the 10th IEEE International Conference on Data Mining (ICDM ’10), pp. 1187–1192, IEEE, December 2010.

[27] J. D. Hamilton, Time Series Analysis, vol. 2, Cambridge University Press, Cambridge, UK, 1994.

[28] F. Abel, Q. Gao, G.-J. Houben, and K. Tao, “Analyzing user modeling on Twitter for personalized news recommendations,” in User Modeling, Adaption and Personalization, vol. 6787 of Lecture Notes in Computer Science, pp. 1–12, Springer, Berlin, Germany, 2011.

[29] S. Gauch, M. Speretta, A. Chandramouli, and A. Micarelli, “User profiles for personalized information access,” in The Adaptive Web, vol. 4321, pp. 54–89, Springer, Berlin, Germany, 2007.

[30] G. L. Nemhauser, L. A. Wolsey, and M. L. Fisher, “An analysis of approximations for maximizing submodular set functions-I,” Mathematical Programming, vol. 14, no. 1, pp. 265–294, 1978.

[31] S. Khuller, A. Moss, and J. Naor, “The budgeted maximum coverage problem,” Information Processing Letters, vol. 70, no. 1, pp. 39–45, 1999.

[32] J. Leskovec, A. Krause, C. Guestrin, C. Faloutsos, J. Vanbriesen, and N. Glance, “Cost-effective outbreak detection in networks,” in Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 420–429, ACM, August 2007.

[33] X. Cheng, S. Tan, and L. Tang, “Using dragpushing to refine concept index for text categorization,” Journal of Computer Science and Technology, vol. 21, no. 4, pp. 592–596, 2006.

[34] Z. Guo, L. Lu, S. Xi, and F. Sun, “An effective dimension reduction approach to Chinese document classification using genetic algorithm,” in Advances in Neural Networks - ISNN 2009, vol. 5552 of Lecture Notes in Computer Science, pp. 480–489, Springer, Berlin, Germany, 2009.

[35] H. Cunningham, D. Maynard, K. Bontcheva, and V. Tablan, “GATE: an architecture for development of robust HLT applications,” in Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, pp. 168–175, Association for Computational Linguistics, 2002.


The Scientific World Journal 11

K-means Hierarchical NEMAH0

005

01

015

02

025

03

035

04

F1 sc

ore

n = 500

n = 1000

n = 1500

Figure 7 Recommendation performance of different data scales fordifferent clustering based systems Remark 119899 denotes the number ofnews articles for clustering

T5 T10 T20 T300

005

01

015

02

025

03

035

04

F1 sc

ore

CC + GATEC + P + K

C + P + K + EC + P + K + S

Figure 8 Recommendation F1 score of different profile factors andsubclass popularity predictionmethodsRemarkC content P placename K key person name GATE entities name using GATE toolE popularity prediction using three-time exponential smoothingand S popularity prediction using spectral analysis (employed byNEMAH)

the user profile SCENE [3] used the content similar accesspattern and entities which are extracted by GATE [35] Inreality the entities such as place names and key person namesare stable for a period relatively Figure 8 shows the resultsof using different user profile building methods and subclasspopularity prediction methods

From this result we observe the following (1) Ourmethod performs better performance than using GATE (2)Recommendation using content only cannot perform well

Table 3 Diversity evaluation on different recommendation lists

Methods T10 T20 T30Goo 05104 04320 01215ClickB 05231 04457 01587Bilinear 05024 03547 01478Bandit 06112 03874 02674SCENE 06821 05747 05687NEMAH 07425 06941 06637Remark T at 119899-recommended result with top-119899

becausemicroblog has not had a lot of content in itsmessages(3) The Spectral Analysis employed in subclass popularityprediction can be better than the Three-Time ExponentialSmoothing method Although the average performance ofSpectral Analysis is better than Three-Time ExponentialSmoothing in our other work about time series analysis wefound that some subclassesrsquo cycle diagrams have less strongsignal of frequencies which would tend to overfitting with alarge number of sine-cosine pairs and obtain worse resultsin these subclasses SCENE [3] also used the popular degreewhich is computed as the ratio of the number of users access-ing the article and the size of the usersrsquo pool However thismethod is contradicting to the freshnessThe straightforwardreason is that the freshest news may get few of clicked

823 Diversity Evaluation The recommendation news list ofNEMAHperforms a great diversity on both class and subclassaspects Let 119877(119906) be a news recommended list of a user 119906 andthe diversity of 119906 can be defined as follows

Diversity = 1 minus

sum119894119895isin119877(119906)119894 = 119895

sim (119894 119895)

(12) |119877 (119906)| (|119877 (119906)| minus 1) (30)

where 119894 and 119895 are two different news articles in recommen-dation list for user 119906 and sim(119894 119895) denotes the news profilesimilarity between the news item 119894 and 119895 For this metricevaluation we chooseGoo [11] (a collaborative filtering basedmethod) ClickB [4] (a content-based method) Bilinear [14]Bandit [15] and SCENE [3] (a hybrid method using LSH forclustering and greedy algorithm for news selection) as thecomparison baselines Table 3 shows the result of the diversityresult with |119877(119906)| = 10 |119877(119906)| = 20 and |119877(119906)| = 30 in whichwe use 11987910 to represent |119877(119906)| = 10

From Table 3 we can see that our system outperformsthe others significantly and the straightforward reason is thatwe diverse the news not only according to the preference ofuser but also according to the distribution of candidate newsarticlesWith the recommendation list enlarged the diversitydecreases significantly on the baseline methods because theyrely on the preference of user too much

824 System Accuracy Evaluation In order to verify theeffectiveness of our proposed NEMAH we implement arecommender system that models the recommendation asa contextual bandit problem [15] Also we implement theSCENE [3] prototype system which employed LSH (LocalitySensitive Hashing) for news clustering and used greedy

12 The Scientific World Journal

01 015 02 025 0302

025

03

035

04

Precision

Reca

llTop 10

(a)

01 02 0302

025

03

035

04

045

Precision

Reca

ll

Top 20

(b)

02 025 03 035 04025

03

035

04

045

05

Precision

Reca

ll

Top 30

(c)

Figure 9 Precision-recall plot of different recommendations Remark ∘ (green) denotes the bandit-based recommender ◻ (blue) denotesthe SCENE recommender and lowast (red) represents NEMAH

selection for user recommendation For each method weselect 50 users to provide news recommendation results forthem Figure 9 shows the comparison results as Top 10 Top20 and Top 30 news items for each user

In the above experiments we can observe that besides thehigher accuracy the distribution of our system is more stablethan other approaches

In reality, if users read only a few news articles every day, many news recommendation systems cannot achieve good results for these users. Our system can address this problem thanks to the microblog user profile building. Figure 10 shows the comparison results for different user groups over all users (5,127 users). Suppose a reader reads N news articles per day. From this figure, we can see that our proposed system achieves reasonable results for nonactive users. SCENE also achieves fairly good results. The reason is that NEMAH and SCENE consider the named entities referred to by users. In addition, NEMAH takes into account the popularity of a news article.
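The grouping behind this comparison can be sketched as follows: each user is assigned to an activity group by the number N of articles read per day, and the F1 score (the harmonic mean of precision and recall) is averaged within each group. The group boundaries follow the labels of Figure 10; the function names and the input tuples are assumptions made for the example.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0 when both are 0)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def activity_group(n: int) -> str:
    """Map articles read per day to the user groups used in Figure 10."""
    if n < 10:
        return "N < 10"
    if n < 30:
        return "10 <= N < 30"
    if n < 50:
        return "30 <= N < 50"
    return "N >= 50"


def mean_f1_by_group(
        users: Iterable[Tuple[int, float, float]]) -> Dict[str, float]:
    """users: (articles_per_day, precision, recall) for each user."""
    buckets: Dict[str, List[float]] = defaultdict(list)
    for n, precision, recall in users:
        buckets[activity_group(n)].append(f1_score(precision, recall))
    return {group: sum(v) / len(v) for group, v in buckets.items()}


# Hypothetical per-user results for one recommender.
sample = [(5, 0.5, 0.5), (8, 0.0, 0.0), (60, 1.0, 1.0)]
print(mean_f1_by_group(sample))
```

Running the same aggregation for each method gives one bar per method in each user group.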

8.2.5. A User Study on NEMAH. In order to obtain further evaluation metrics to verify our proposed news recommendation system, we develop a prototype system of NEMAH and design a questionnaire which includes the following questions: (1) satisfaction with news content, (2) ordering of the recommendation list, (3) preference of the news subclasses, (4) popularity of news articles, and (5) novelty of the recommendation list. For each question, we define 5 indexes for selection, where 1: So Bad, 2: Not Very Well, 3: Average, 4: Good but Needs to Improve, and 5: Excellent. We crawl news articles of the latest three days from several famous news websites as a candidate set for recommendation. Finally, we recruit 50 volunteers, who are required to have a microblog account on the SINA website, to help us complete the questionnaire. We send them the same questionnaire with different recommendation lists every week, three times in total. The average result of this user study is shown in Figure 11. From the result of the user study, we can see that NEMAH can satisfy the requirements represented by our questions for most people.

[Figure 10 plot omitted: F1 score (vertical axis, 0 to 0.4) of Goo, ClickB, Bandit, SCENE, and NEMAH for the user groups N < 10, 10 ≤ N < 30, 30 ≤ N < 50, and N ≥ 50.]

Figure 10: Comparison of F1 score of different approaches for different user groups. Remark: N denotes the number of news articles read by a user per day.

[Figure 11 plot omitted: average score (vertical axis, 0 to 4.5) of ClickB and NEMAH on Satisfaction, Ordering, Preference, Popular, and Novelty.]

Figure 11: User study on different metrics.
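The averaging step of the user study can be sketched in a few lines: every completed questionnaire contributes one score from 1 to 5 per question, and the scores are averaged per question over all volunteers and rounds. The question labels follow Figure 11; the function name `average_scores` and the sample responses are made up for this example and are not data from the study.

```python
from statistics import mean
from typing import Dict, List

QUESTIONS = ("Satisfaction", "Ordering", "Preference", "Popular", "Novelty")


def average_scores(responses: List[Dict[str, int]]) -> Dict[str, float]:
    """Average the 1-5 score of each question over all questionnaires."""
    for response in responses:
        for question in QUESTIONS:
            if not 1 <= response[question] <= 5:
                raise ValueError(f"score out of range for {question}")
    return {q: mean(r[q] for r in responses) for q in QUESTIONS}


# Two hypothetical questionnaires (e.g., one volunteer in two rounds).
responses = [
    {"Satisfaction": 4, "Ordering": 3, "Preference": 4, "Popular": 5, "Novelty": 3},
    {"Satisfaction": 5, "Ordering": 4, "Preference": 4, "Popular": 4, "Novelty": 3},
]
averages = average_scores(responses)
print(averages)
```

With 50 volunteers and three rounds, the input would hold 150 questionnaires per system, and the resulting five averages correspond to one series of bars in a chart like Figure 11.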

9. Conclusion

In this paper, we proposed the NEMAH system architecture to tackle personalized news recommendation based on microblog and subclass popularity prediction. We explore the intrarelations among microblog content and news items and consider the subclass popularity factor, similarity among users, and place and key person factors synthetically. Our system supports effective classification and subclass clustering on newly published news articles along with a small history corpus. High-quality classification and clustering can construct a better data structure for recommendation. Experimental results compared with some state-of-the-art algorithms have demonstrated the better performance of NEMAH. Besides, our work in Sections 4 and 5 can be utilized for automatic module layout and channel ranking.

For future work, to process massive numbers of network news articles, we plan to deploy some components (e.g., classification, clustering, and subclass popularity analysis) onto the Map-Reduce framework on our distributed system. Moreover, we also plan to integrate the subclass popularity prediction module into our news search engine, given its effectiveness in our work. Another notable point is the interest evolution of users (e.g., with time, place, and other factors), which can provide insights for the exploration of news reading behaviors.

Disclosure

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers, or to redistribute to lists requires prior specific permission and/or a fee.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

References

[1] O. Phelan, K. McCarthy, and B. Smyth, “Using twitter to recommend real-time topical news,” in Proceedings of the 3rd ACM Conference on Recommender Systems, pp. 385–388, ACM, October 2009.

[2] G. de Francisci Morales, A. Gionis, and C. Lucchese, “From chatter to headlines: Harnessing the real-time web for personalized news recommendation,” in Proceedings of the 5th ACM International Conference on Web Search and Data Mining (WSDM ’12), pp. 153–162, ACM, February 2012.

[3] L. Li, D. Wang, T. Li, D. Knox, and B. Padmanabhan, “SCENE: A scalable two-stage personalized news recommendation system,” in Proceedings of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR ’11), pp. 125–134, July 2011.

[4] J. Liu, P. Dolan, and E. R. Pedersen, “Personalized news recommendation based on click behavior,” in Proceedings of the 15th ACM International Conference on Intelligent User Interfaces (IUI ’10), pp. 31–40, ACM, February 2010.

[5] J. Schafer, J. Konstan, and J. Riedl, “Recommender systems in e-commerce,” in Proceedings of the 1st ACM Conference on Electronic Commerce, pp. 158–166, ACM, 1999.


[6] D. Billsus and M. J. Pazzani, “Personal news agent that talks, learns and explains,” in Proceedings of the 3rd Annual Conference on Autonomous Agents, pp. 268–275, ACM, May 1999.

[7] M. Pazzani and D. Billsus, “The identification of interesting web sites,” Machine Learning, vol. 27, no. 3, pp. 313–331, 1997.

[8] U. Shardanand and P. Maes, “Social information filtering: algorithms for automating ‘word of mouth’,” in Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pp. 210–217, ACM, May 1995.

[9] R. G. Cota, A. A. Ferreira, C. Nascimento, M. A. Gonçalves, and A. H. F. Laender, “An unsupervised heuristic-based hierarchical method for name disambiguation in bibliographic citations,” Journal of the American Society for Information Science and Technology, vol. 61, no. 9, pp. 1853–1870, 2010.

[10] J. Breese, D. Heckerman, and C. Kadie, “Empirical analysis of predictive algorithms for collaborative filtering,” in Proceedings of the 14th Conference on Uncertainty in Artificial Intelligence, pp. 43–52, Morgan Kaufmann Publishers, 1998.

[11] A. S. Das, M. Datar, A. Garg, and S. Rajaram, “Google news personalization: Scalable online collaborative filtering,” in Proceedings of the 16th International World Wide Web Conference (WWW ’07), pp. 271–280, ACM, May 2007.

[12] M. Balabanovic and Y. Shoham, “Content-Based, Collaborative Recommendation,” Communications of the ACM, vol. 40, no. 3, pp. 66–72, 1997.

[13] M. Claypool, A. Gokhale, T. Miranda, P. Murnikov, D. Netes, and M. Sartin, “Combining content-based and collaborative filters in an online newspaper,” in Proceedings of the ACM SIGIR Workshop on Recommender Systems, vol. 60, Citeseer, 1999.

[14] W. Chu and S. T. Park, “Personalized recommendation on dynamic content using predictive bilinear models,” in Proceedings of the 18th International Conference on World Wide Web, pp. 691–700, ACM, 2009.

[15] L. Li, W. Chu, J. Langford, and R. E. Schapire, “A contextual-bandit approach to personalized news article recommendation,” in Proceedings of the 19th International Conference on World Wide Web (WWW ’10), pp. 661–670, ACM, April 2010.

[16] C. Best, E. van der Goot, M. de Paola, T. Garcia, and D. Horby, “Europe media monitor – emm,” JRC Technical Note, no. I 2, 2002.

[17] E. Gabrilovich, S. Dumais, and E. Horvitz, “Newsjunkie: providing personalized newsfeeds via analysis of information novelty,” in Proceedings of the 13th International Conference on World Wide Web (WWW ’04), pp. 482–490, ACM, May 2004.

[18] G. Wang, F. H. Lochovsky, and Q. Yang, “Feature selection with conditional mutual information MaxiMin in text categorization,” in Proceedings of the 13th ACM Conference on Information and Knowledge Management (CIKM ’04), pp. 342–349, ACM, November 2004.

[19] C. Lee and G. G. Lee, “Information gain and divergence-based feature selection for machine learning-based text categorization,” Information Processing and Management, vol. 42, no. 1, pp. 155–165, 2006.

[20] A. M. A. Mesleh, “Chi square feature extraction based svms arabic language text categorization system,” Journal of Computer Science, vol. 3, no. 6, pp. 430–435, 2007.

[21] Z. Wei, D. Miao, J. H. Chauchat, and C. Zhong, “Feature selection on chinese text classification using character N-grams,” in Rough Sets and Knowledge Technology, vol. 5009 of Lecture Notes in Computer Science, pp. 500–507, Springer, Berlin, Germany, 2008.

[22] Y. S. Lai and C. H. Wu, “Meaningful term extraction and discriminative term selection in text categorization via unknown-word methodology,” ACM Transactions on Asian Language Information Processing (TALIP), vol. 1, no. 1, pp. 34–64, 2002.

[23] R. Rifkin and A. Klautau, “In defense of one-vs-all classification,” The Journal of Machine Learning Research, vol. 5, pp. 101–141, 2004.

[24] A. E. Hoerl and R. W. Kennard, “Ridge regression: Biased estimation for nonorthogonal problems,” Technometrics, vol. 42, no. 1, pp. 80–86, 2000.

[25] F. Wang and C. Zhang, “Feature extraction by maximizing the average neighborhood margin,” in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR ’07), June 2007.

[26] X. Zhang, W. Wang, K. Nørvåg, and M. Sebag, “K-AP: generating specified K clusters by efficient affinity propagation,” in Proceedings of the 10th IEEE International Conference on Data Mining (ICDM ’10), pp. 1187–1192, IEEE, December 2010.

[27] J. D. Hamilton, Time Series Analysis, vol. 2, Cambridge University Press, Cambridge, UK, 1994.

[28] F. Abel, Q. Gao, G.-J. Houben, and K. Tao, “Analyzing user modeling on Twitter for personalized news recommendations,” in User Modeling, Adaption and Personalization, vol. 6787 of Lecture Notes in Computer Science, pp. 1–12, Springer, Berlin, Germany, 2011.

[29] S. Gauch, M. Speretta, A. Chandramouli, and A. Micarelli, “User profiles for personalized information access,” in The Adaptive Web, vol. 4321, pp. 54–89, Springer, Berlin, Germany, 2007.

[30] G. L. Nemhauser, L. A. Wolsey, and M. L. Fisher, “An analysis of approximations for maximizing submodular set functions-I,” Mathematical Programming, vol. 14, no. 1, pp. 265–294, 1978.

[31] S. Khuller, A. Moss, and J. Naor, “The budgeted maximum coverage problem,” Information Processing Letters, vol. 70, no. 1, pp. 39–45, 1999.

[32] J. Leskovec, A. Krause, C. Guestrin, C. Faloutsos, J. Vanbriesen, and N. Glance, “Cost-effective outbreak detection in networks,” in Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 420–429, ACM, August 2007.

[33] X. Cheng, S. Tan, and L. Tang, “Using dragpushing to refine concept index for text categorization,” Journal of Computer Science and Technology, vol. 21, no. 4, pp. 592–596, 2006.

[34] Z. Guo, L. Lu, S. Xi, and F. Sun, “An effective dimension reduction approach to chinese document classification using genetic algorithm,” in Advances in Neural Networks – ISNN 2009, vol. 5552 of Lecture Notes in Computer Science, pp. 480–489, Springer, Berlin, Germany, 2009.

[35] H. Cunningham, D. Maynard, K. Bontcheva, and V. Tablan, “Gate: an architecture for development of robust hlt applications,” in Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, pp. 168–175, Association for Computational Linguistics, 2002.


Page 12: Research Article An Effective News Recommendation Method for … · 2019. 7. 31. · Research Article An Effective News Recommendation Method for Microblog User WanrongGu,ShoubinDong,ZhizhaoZeng,andJinchaoHe

12 The Scientific World Journal

01 015 02 025 0302

025

03

035

04

Precision

Reca

llTop 10

(a)

01 02 0302

025

03

035

04

045

Precision

Reca

ll

Top 20

(b)

02 025 03 035 04025

03

035

04

045

05

Precision

Reca

ll

Top 30

(c)

Figure 9 Precision-recall plot of different recommendations Remark ∘ (green) denotes the bandit-based recommender ◻ (blue) denotesthe SCENE recommender and lowast (red) represents NEMAH

selection for user recommendation For each method weselect 50 users to provide news recommendation results forthem Figure 9 shows the comparison results as Top 10 Top20 and Top 30 news items for each user

In the above experiments we can observe that besides thehigher accuracy the distribution of our system is more stablethan other approaches

In reality if users read a few of news articles every daymany news recommendation systems could not outperformgood result for these users Our system can address thisproblem due to the microblog user profile building Figure 10shows the comparison results for different users groups forall users (5 127 users) Suppose a reader reads 119873 newsarticles per day From this figure we can know that our

proposed system can outperform a reasonable result when itis subject tononactiveusers SCENEalso outperformsnot badresult The reason is that NEMAH and SCENE consider thenamed entities refereed by users Besides NEMAH takes intoaccount the popular degree on a news article

825 A User Study on NEMAH In order to get the otherevaluated metrics to verify our proposed news recommenda-tion system we develop a prototype system of our proposedNEMAH and design a questionnaire which includes the fol-lowing questions (1) satisfaction of news content (2) order-ing of the recommendation list (3) preference of the newssubclasses (4) popularity of news article and (5) noveltyof the recommendation list For each question we define 5

The Scientific World Journal 13

0

005

01

015

02

025

03

035

04

F1 sc

ore

GooClickBBandit

SCENENEMAH

N lt 10 10 le N lt 30 30 le N lt 50 N ge 50

Figure 10 Comparison of F1 score of different approaches fordifferent user groups Remark 119873 denotes the number of newsarticles per day which is read by a user

Satisfaction Ordering Preference Popular Novelty0

051

152

253

354

45

Aver

age n

umbe

r of s

core

ClickBNEMAH

Figure 11 User study on different metrics

indexes for selection where 1 So Bad 2 Not Very Well3 Average 4 Good but Needs to Improve and 5 ExcellentWe crawl news articles of the latest three days from severalfamous newswebsites as a candidate set for recommendationAt last we hire 50 volunteers who are required to havemicroblog account in SINA website to help us complete thequestionnaire We send them the same questionnaire withdifferent recommendation lists every week for three timesThe average result of this user study is shown in Figure 11From the result of user study we can see that NEMAH cansatisfy the requirements represented by our questions ofmostof people

9 Conclusion

In this paper we proposed NEMAH system architectureto tackle the personalized news recommendation based onmicroblog and subclass popularity prediction We explore

the intrarelations among microblog content and news itemsand considering the subclass popularity factor similarityamong users place and key person factors synthetically Oursystem supports effective classification and subclass cluster-ing on newly published news articles along with a few ofhistory corpus High quality of classification and clusteringcan construct a better data structure for recommendingExperimental results compared with some state-of-the-artalgorithms have demonstrated the better performance ofNEMAH Besides our work in Sections 4 and 5 can be uti-lized for automatic module layout and channel ranking

For future work to process mass network news articleswe plan to deploy some components (eg classification clust-ering and subclass popularity analysis) onto theMap-Reduceframework on our distributed systemMoreover we also planto integrate the subclass popularity prediction module intoour news search engine due to the effectiveness in our workAnother remarkable point is the interest evolution of users(eg time place and other factors) which is able to provideinsights on the exploration of news reading behaviors

Disclosure

Permission to make digital or hard copies of all or part ofthis work for personal or classroom use is granted withoutfee provided that copies are not made or distributed for profitor commercial advantage and that copies bear this noticeand the full citation on the first page To copy otherwise torepublish to post on servers or to redistribute to lists requiresprior specific permission andor a fee

Conflict of Interests

The authors declare that there is no conflict of interestsregarding the publication of this paper

References

[1] O Phelan KMcCarthy and B Smyth ldquoUsing twitter to recom-mend real-time topical newsrdquo in Proceedings of the 3rd ACMConference on Recommender Systems pp 385ndash388 ACMOcto-ber 2009

[2] G de Francisci Morales A Gionis and C Lucchese ldquoFromchatter to headlines Harnessing the real-time web for person-alized news recommendationrdquo in Proceedings of the 5th ACMInternational Conference on Web Search and Data Mining(WSDM rsquo12) pp 153ndash162 ACM February 2012

[3] L Li DWang T Li D Knox and B Padmanabhan ldquoSCENE Ascalable two-stage personalized news recommendation systemrdquoin Proceedings of the 34th International ACM SIGIR Conferenceon Research and Development in Information Retrieval (SIGIRrsquo11) pp 125ndash134 July 2011

[4] J Liu P Dolan and E R Pedersen ldquoPersonalized news recom-mendation based on click behaviorrdquo in Proceedings of the 15thACM International Conference on Intelligent User Interfaces (IUIrsquo10) pp 31ndash40 ACM February 2010

[5] J Schafer J Konstan and J Riedi ldquoRecommender systems ine-commercerdquo in Proceedings of the 1st ACM conference on Ele-ctronic commerce pp 158ndash166 ACM 1999

14 The Scientific World Journal

[6] D Billsus and M J Pazzani ldquoPersonal news agent that talkslearns and explainsrdquo inProceedings of the 3rdAnnual Conferenceon Autonomous Agents pp 268ndash275 ACM May 1999

[7] M Pazzani andD Billsus ldquoThe identification of interesting websitesrdquoMachine Learning vol 27 no 3 pp 313ndash331 1997

[8] U Shardanand and P Maes ldquoSocial information filtering algo-rithms for automating lsquoword of mouthrsquordquo in Proceedings of theSIGCHI Conference on Human Factors in Computing Systemspp 210ndash217 ACM May 1995

[9] R G Cota A A Ferreira C NascimentoM A Goncalves andA H F Laender ldquoAn unsupervised heuristic-based hierarchicalmethod for name disambiguation in bibliographic citationsrdquoJournal of the American Society for Information Science andTechnology vol 61 no 9 pp 1853ndash1870 2010

[10] J Breese D Heckerman and C Kadie ldquoEmpirical analysis ofpredictive algorithms for collaborative filteringrdquo in Proceedingsof the 14th conference on Uncertainty in Artificial Intelligence pp43ndash52 Morgan Kaufmann Publishers 1998

[11] A S DasM Datar A Garg and S Rajaram ldquoGoogle news per-sonalization Scalable online collaborative filteringrdquo in Proceed-ings of the 16th International World Wide Web Conference(WWW rsquo07) pp 271ndash280 ACM May 2007

[12] M Balabanovic and Y Shoham ldquoContent-Based CollaborativeRecommendationrdquo Communications of the ACM vol 40 no 3pp 66ndash72 1997

[13] M Claypool A Gokhale T Miranda P Murnikov D Netesand M Sartin ldquoCombining content-based and collaborativefilters in an online newspaperrdquo in Proceedings of the ACM SIGIRWorkshop on Recommender Systems vol 60 Citeseer 1999

[14] W Chu and S T Park ldquoPersonalized recommendation ondynamic content using predictive bilinear modelsrdquo in Proceed-ings of the 18th International Conference onWorldWideWeb pp691ndash700 ACM 2009

[15] L Li W Chu J Langford and R E Schapire ldquoA contextual-bandit approach to personalized news article recommendationrdquoin Proceedings of the 19th International Conference World WideWeb (WWW rsquo10) pp 661ndash670 ACM April 2010

[16] C Best E van der Goot M de Paola T Garcia and D HorbyldquoEurope media monitormdashemmrdquo JRC Technical Note no I 22002

[17] EGabrilovich SDumais andEHorvitz ldquoNewsjunkie provid-ing personalized newsfeeds via analysis of information noveltyrdquoin Proceedings of the 13th International conference World WideWeb (WWW rsquo04) pp 482ndash490 ACM May 2004

[18] G Wang F H Lochovsky and Q Yang ldquoFeature selection withconditional mutual information MaxiMin in text categoriza-tionrdquo in Proceedings of the 13th ACMConference on Informationand Knowledge Management (CIKM rsquo04) pp 342ndash349 ACMNovember 2004

[19] C Lee and G G Lee ldquoInformation gain and divergence-basedfeature selection for machine learning-based text categoriza-tionrdquo Information Processing andManagement vol 42 no 1 pp155ndash165 2006

[20] A M A Mesleh ldquoChi square feature extraction based svmsarabic language text categorization systemrdquo Journal of ComputerScience vol 3 no 6 pp 430ndash435 2007

[21] Z Wei D Miao J H Chauchat and C Zhong ldquoFeature sel-ection on chinese text classification using character N-gramsrdquoin Rough Sets and Knowledge Technology vol 5009 of LectureNotes in Computer Science pp 500ndash507 Springer BerlinGermany 2008

[22] Y S Lai and C H Wu ldquoMeaningful term extraction and dis-criminative term selection in text categorization via unknown-word methodologyrdquo ACM Transactions on Asian LanguageInformation Processing (TALIP) vol 1 no 1 pp 34ndash64 2002

[23] R Rifkin and A Klautau ldquoIn defense of one-vs-all classifica-tionrdquoThe Journal of Machine Learning Research vol 5 pp 101ndash141 2004

[24] A E Hoerl and R W Kennard ldquoRidge regression Biased esti-mation for nonorthogonal problemsrdquo Technometrics vol 42no 1 pp 80ndash86 2000

[25] F Wang and C Zhang ldquoFeature extraction by maximizing theaverage neighborhoodmarginrdquo in Proceedings of the IEEECom-puter Society Conference on Computer Vision and PatternRecognition (CVPR rsquo07) June 2007

[26] X Zhang W Wang K Noslashrvag and M Sebag ldquoK-AP gener-ating specified K clusters by efficient affinity propagationrdquo inProceedings 10th IEEE International Conference on Data Mining(ICDM rsquo10) pp 1187ndash1192 IEEE December 2010

[27] J D Hamilton Time Series Analysis vol 2 Cambridge Univer-sity Press Cambridge UK 1994

[28] F Abel Q Gao G-J Houben andK Tao ldquoAnalyzing usermod-eling on Twitter for personalized news recommendationsrdquoin User Modeling Adaption and Personalization vol 6787 ofLecture Notes in Computer Science pp 1ndash12 Springer BerlinGermany 2011

[29] SGauchM Speretta A Chandramouli andAMicarelli ldquoUserprofiles for personalized information accessrdquoTheAdaptiveWebSpringer Berlin Germany vol 4321 pp 54ndash89 2007

[30] G L Nemhauser L A Wolsey and M L Fisher ldquoAn analysisof approximations for maximizing submodular set functions-IrdquoMathematical Programming vol 14 no 1 pp 265ndash294 1978

[31] S Khuller A Moss and J Naor ldquoThe budgeted maximum cov-erage problemrdquo Information Processing Letters vol 70 no 1 pp39ndash45 1999

[32] J Leskovec A Krause C Guestrin C Faloutsos J Vanbriesenand N Glance ldquoCost-effective outbreak detection in networksrdquoin Proceedings of the 13th ACM SIGKDD International Confer-ence on Knowledge Discovery and Data Mining pp 420ndash429ACM August 2007

[33] X Cheng S Tan and L Tang ldquoUsing dragpushing to refineconcept index for text categorizationrdquo Journal of ComputerScience and Technology vol 21 no 4 pp 592ndash596 2006

[34] Z Guo L Lu S Xi and F Sun ldquoAn effective dimension reduc-tion approach to chinese document classification using geneticalgorithmrdquo in Advances in Neural NetworksmdashISNN 2009 vol5552 of Lecture Notes in Computer Science pp 480ndash489Springer Berlin Germany 2009

[35] H Cunningham D Maynard K Bontcheva and V TablanldquoGate an architecture for development of robust hlt applica-tionsrdquo in Proceedings of the 40th Annual Meeting on Associationfor Computational Linguistics pp 168ndash175 Association forComputational Linguistics 2002

Submit your manuscripts athttpwwwhindawicom

Computer Games Technology

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Distributed Sensor Networks

International Journal of

Advances in

FuzzySystems

Hindawi Publishing Corporationhttpwwwhindawicom

Volume 2014

International Journal of

ReconfigurableComputing

Hindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Applied Computational Intelligence and Soft Computing

thinspAdvancesthinspinthinsp

Artificial Intelligence

HindawithinspPublishingthinspCorporationhttpwwwhindawicom Volumethinsp2014

Advances inSoftware EngineeringHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Electrical and Computer Engineering

Journal of

Journal of

Computer Networks and Communications

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation

httpwwwhindawicom Volume 2014

Advances in

Multimedia

International Journal of

Biomedical Imaging

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

ArtificialNeural Systems

Advances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

RoboticsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Computational Intelligence and Neuroscience

Industrial EngineeringJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Modelling amp Simulation in EngineeringHindawi Publishing Corporation httpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Human-ComputerInteraction

Advances in

Computer EngineeringAdvances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Page 13: Research Article An Effective News Recommendation Method for … · 2019. 7. 31. · Research Article An Effective News Recommendation Method for Microblog User WanrongGu,ShoubinDong,ZhizhaoZeng,andJinchaoHe

The Scientific World Journal 13

0

005

01

015

02

025

03

035

04

F1 sc

ore

GooClickBBandit

SCENENEMAH

N lt 10 10 le N lt 30 30 le N lt 50 N ge 50

Figure 10 Comparison of F1 score of different approaches fordifferent user groups Remark 119873 denotes the number of newsarticles per day which is read by a user

Satisfaction Ordering Preference Popular Novelty0

051

152

253

354

45

Aver

age n

umbe

r of s

core

ClickBNEMAH

Figure 11 User study on different metrics

indexes for selection where 1 So Bad 2 Not Very Well3 Average 4 Good but Needs to Improve and 5 ExcellentWe crawl news articles of the latest three days from severalfamous newswebsites as a candidate set for recommendationAt last we hire 50 volunteers who are required to havemicroblog account in SINA website to help us complete thequestionnaire We send them the same questionnaire withdifferent recommendation lists every week for three timesThe average result of this user study is shown in Figure 11From the result of user study we can see that NEMAH cansatisfy the requirements represented by our questions ofmostof people

9 Conclusion

In this paper we proposed NEMAH system architectureto tackle the personalized news recommendation based onmicroblog and subclass popularity prediction We explore

the intrarelations among microblog content and news itemsand considering the subclass popularity factor similarityamong users place and key person factors synthetically Oursystem supports effective classification and subclass cluster-ing on newly published news articles along with a few ofhistory corpus High quality of classification and clusteringcan construct a better data structure for recommendingExperimental results compared with some state-of-the-artalgorithms have demonstrated the better performance ofNEMAH Besides our work in Sections 4 and 5 can be uti-lized for automatic module layout and channel ranking

For future work to process mass network news articleswe plan to deploy some components (eg classification clust-ering and subclass popularity analysis) onto theMap-Reduceframework on our distributed systemMoreover we also planto integrate the subclass popularity prediction module intoour news search engine due to the effectiveness in our workAnother remarkable point is the interest evolution of users(eg time place and other factors) which is able to provideinsights on the exploration of news reading behaviors

Disclosure

Permission to make digital or hard copies of all or part ofthis work for personal or classroom use is granted withoutfee provided that copies are not made or distributed for profitor commercial advantage and that copies bear this noticeand the full citation on the first page To copy otherwise torepublish to post on servers or to redistribute to lists requiresprior specific permission andor a fee

Conflict of Interests

The authors declare that there is no conflict of interestsregarding the publication of this paper

References

[1] O Phelan KMcCarthy and B Smyth ldquoUsing twitter to recom-mend real-time topical newsrdquo in Proceedings of the 3rd ACMConference on Recommender Systems pp 385ndash388 ACMOcto-ber 2009

[2] G de Francisci Morales A Gionis and C Lucchese ldquoFromchatter to headlines Harnessing the real-time web for person-alized news recommendationrdquo in Proceedings of the 5th ACMInternational Conference on Web Search and Data Mining(WSDM rsquo12) pp 153ndash162 ACM February 2012

[3] L Li DWang T Li D Knox and B Padmanabhan ldquoSCENE Ascalable two-stage personalized news recommendation systemrdquoin Proceedings of the 34th International ACM SIGIR Conferenceon Research and Development in Information Retrieval (SIGIRrsquo11) pp 125ndash134 July 2011

[4] J Liu P Dolan and E R Pedersen ldquoPersonalized news recom-mendation based on click behaviorrdquo in Proceedings of the 15thACM International Conference on Intelligent User Interfaces (IUIrsquo10) pp 31ndash40 ACM February 2010

[5] J Schafer J Konstan and J Riedi ldquoRecommender systems ine-commercerdquo in Proceedings of the 1st ACM conference on Ele-ctronic commerce pp 158ndash166 ACM 1999

14 The Scientific World Journal

[6] D Billsus and M J Pazzani ldquoPersonal news agent that talkslearns and explainsrdquo inProceedings of the 3rdAnnual Conferenceon Autonomous Agents pp 268ndash275 ACM May 1999

[7] M Pazzani andD Billsus ldquoThe identification of interesting websitesrdquoMachine Learning vol 27 no 3 pp 313ndash331 1997

[8] U Shardanand and P Maes ldquoSocial information filtering algo-rithms for automating lsquoword of mouthrsquordquo in Proceedings of theSIGCHI Conference on Human Factors in Computing Systemspp 210ndash217 ACM May 1995

[9] R G Cota A A Ferreira C NascimentoM A Goncalves andA H F Laender ldquoAn unsupervised heuristic-based hierarchicalmethod for name disambiguation in bibliographic citationsrdquoJournal of the American Society for Information Science andTechnology vol 61 no 9 pp 1853ndash1870 2010

[10] J Breese D Heckerman and C Kadie ldquoEmpirical analysis ofpredictive algorithms for collaborative filteringrdquo in Proceedingsof the 14th conference on Uncertainty in Artificial Intelligence pp43ndash52 Morgan Kaufmann Publishers 1998

[11] A S DasM Datar A Garg and S Rajaram ldquoGoogle news per-sonalization Scalable online collaborative filteringrdquo in Proceed-ings of the 16th International World Wide Web Conference(WWW rsquo07) pp 271ndash280 ACM May 2007

[12] M Balabanovic and Y Shoham ldquoContent-Based CollaborativeRecommendationrdquo Communications of the ACM vol 40 no 3pp 66ndash72 1997

[13] M Claypool A Gokhale T Miranda P Murnikov D Netesand M Sartin ldquoCombining content-based and collaborativefilters in an online newspaperrdquo in Proceedings of the ACM SIGIRWorkshop on Recommender Systems vol 60 Citeseer 1999

[14] W Chu and S T Park ldquoPersonalized recommendation ondynamic content using predictive bilinear modelsrdquo in Proceed-ings of the 18th International Conference onWorldWideWeb pp691ndash700 ACM 2009

[15] L Li W Chu J Langford and R E Schapire ldquoA contextual-bandit approach to personalized news article recommendationrdquoin Proceedings of the 19th International Conference World WideWeb (WWW rsquo10) pp 661ndash670 ACM April 2010

[16] C Best E van der Goot M de Paola T Garcia and D HorbyldquoEurope media monitormdashemmrdquo JRC Technical Note no I 22002

[17] EGabrilovich SDumais andEHorvitz ldquoNewsjunkie provid-ing personalized newsfeeds via analysis of information noveltyrdquoin Proceedings of the 13th International conference World WideWeb (WWW rsquo04) pp 482ndash490 ACM May 2004

[18] G Wang F H Lochovsky and Q Yang ldquoFeature selection withconditional mutual information MaxiMin in text categoriza-tionrdquo in Proceedings of the 13th ACMConference on Informationand Knowledge Management (CIKM rsquo04) pp 342ndash349 ACMNovember 2004

[19] C Lee and G G Lee ldquoInformation gain and divergence-basedfeature selection for machine learning-based text categoriza-tionrdquo Information Processing andManagement vol 42 no 1 pp155ndash165 2006

[20] A M A Mesleh ldquoChi square feature extraction based svmsarabic language text categorization systemrdquo Journal of ComputerScience vol 3 no 6 pp 430ndash435 2007

[21] Z Wei D Miao J H Chauchat and C Zhong ldquoFeature sel-ection on chinese text classification using character N-gramsrdquoin Rough Sets and Knowledge Technology vol 5009 of LectureNotes in Computer Science pp 500ndash507 Springer BerlinGermany 2008

[22] Y S Lai and C H Wu ldquoMeaningful term extraction and dis-criminative term selection in text categorization via unknown-word methodologyrdquo ACM Transactions on Asian LanguageInformation Processing (TALIP) vol 1 no 1 pp 34ndash64 2002

[23] R Rifkin and A Klautau ldquoIn defense of one-vs-all classifica-tionrdquoThe Journal of Machine Learning Research vol 5 pp 101ndash141 2004

[24] A E Hoerl and R W Kennard ldquoRidge regression Biased esti-mation for nonorthogonal problemsrdquo Technometrics vol 42no 1 pp 80ndash86 2000

[25] F Wang and C Zhang ldquoFeature extraction by maximizing theaverage neighborhoodmarginrdquo in Proceedings of the IEEECom-puter Society Conference on Computer Vision and PatternRecognition (CVPR rsquo07) June 2007

[26] X Zhang W Wang K Noslashrvag and M Sebag ldquoK-AP gener-ating specified K clusters by efficient affinity propagationrdquo inProceedings 10th IEEE International Conference on Data Mining(ICDM rsquo10) pp 1187ndash1192 IEEE December 2010

[27] J D Hamilton Time Series Analysis vol 2 Cambridge Univer-sity Press Cambridge UK 1994

[28] F Abel Q Gao G-J Houben andK Tao ldquoAnalyzing usermod-eling on Twitter for personalized news recommendationsrdquoin User Modeling Adaption and Personalization vol 6787 ofLecture Notes in Computer Science pp 1ndash12 Springer BerlinGermany 2011

[29] SGauchM Speretta A Chandramouli andAMicarelli ldquoUserprofiles for personalized information accessrdquoTheAdaptiveWebSpringer Berlin Germany vol 4321 pp 54ndash89 2007

[30] G L Nemhauser L A Wolsey and M L Fisher ldquoAn analysisof approximations for maximizing submodular set functions-IrdquoMathematical Programming vol 14 no 1 pp 265ndash294 1978

[31] S Khuller A Moss and J Naor ldquoThe budgeted maximum cov-erage problemrdquo Information Processing Letters vol 70 no 1 pp39ndash45 1999

[32] J Leskovec A Krause C Guestrin C Faloutsos J Vanbriesenand N Glance ldquoCost-effective outbreak detection in networksrdquoin Proceedings of the 13th ACM SIGKDD International Confer-ence on Knowledge Discovery and Data Mining pp 420ndash429ACM August 2007

[33] X Cheng S Tan and L Tang ldquoUsing dragpushing to refineconcept index for text categorizationrdquo Journal of ComputerScience and Technology vol 21 no 4 pp 592ndash596 2006

[34] Z Guo L Lu S Xi and F Sun ldquoAn effective dimension reduc-tion approach to chinese document classification using geneticalgorithmrdquo in Advances in Neural NetworksmdashISNN 2009 vol5552 of Lecture Notes in Computer Science pp 480ndash489Springer Berlin Germany 2009

[35] H Cunningham D Maynard K Bontcheva and V TablanldquoGate an architecture for development of robust hlt applica-tionsrdquo in Proceedings of the 40th Annual Meeting on Associationfor Computational Linguistics pp 168ndash175 Association forComputational Linguistics 2002
