Social Priors to Estimate Relevance of a Resource
DESCRIPTION
In this paper, we propose an approach that exploits the social data associated with a Web resource to measure its a priori relevance. We show how the interaction traces that users leave on resources, in the form of social signals such as the number of likes and shares, can be exploited to quantify social properties such as popularity and reputation. We propose to model these properties as a priori probabilities that we integrate into a language model. We evaluated the effectiveness of our approach on an IMDb dataset containing 167438 resources and their social signals collected from several social networks. Our experimental results are statistically significant and show the benefit of integrating social properties into a search model to enhance information retrieval.
TRANSCRIPT
Ismaïl BADACHE, Mohand BOUGHANEM
IRIT, Toulouse University, France
{badache, boughanem}@irit.fr
Presentation Plan
1. Introduction
2. Related Work
3. Approach of Social Information Retrieval
4. Experimental Results
5. Conclusion
1.1 Emergence of social Web
[Chart: number of active users (2013) and number of Internet users, years 2011 to 2014; plotted values 1.2, 1.4, 1.7, 2.4]
Social content per 1 minute: 41000 publications, 1.8 million Likes, ~350 GB of data
Source: blogdumoderateur.com, quantcast.com, semiocast.com
1. Introduction 2. Related Work
5. Conclusion
3. Approach of SIR
4. Experimental Results
Fig 1. Global presentation of our work: users interact (Like/+1, Motion/Vote, Share/Recommend, Comment, Bookmark) on social networks with Web resources (Web page, photo, video); the resulting social signals (source of evidence) feed the extraction and quantification of social properties (popularity, reputation, freshness), which are integrated into the information retrieval model (ranking) that returns the query results.
1.2 Example of Social Signals
1.3 Research Issues
1. How can social signals be translated into social properties?
2. What are the most useful signals and properties to evaluate the a priori relevance (importance) of a resource?
3. Which theoretical model can combine the a priori relevance of a resource with its topical relevance?
4. What is the impact of social properties on IR system performance?
5. Which signals and properties are the most favored by attribute selection algorithms, and which are the most correlated with document relevance?
2.1 Related Work
Sources of evidence (social features) | Properties | Models | Authors
Number of clicks, votes, records and recommendations | Popularity, Importance | Linear combination | (Karweg et al., 2011)
Number of likes, dislikes and comments on YouTube; playcount (number of times a user listens to a track on lastfm); presence of a URL in a tweet | Importance | Machine learning and linear combination | (Chelaru et al., 2012), (Khodaei et al., 2012), (Alonso et al., 2010)
Number of retweets; number of annotations (tags) | Popularity | Machine learning | (Yang et al., 2012), (Hong et al., 2011), (Pantel et al., 2012)
Social approval votes | Importance | Machine learning | (Kazai and Milic-Frayling, 2009)
• Our IR approach exploits various heterogeneous social signals from different social networks and takes them into account in the retrieval model. In addition, instead of considering social features separately as in previous work, we propose to combine them to measure specific social properties, namely the popularity and the reputation of a resource. We also evaluate the impact of signal freshness on performance. In our work, we use a language model, which provides a theoretically founded way to take into account the a priori probabilities of a document.
3.1 A Modular Approach for Social IR
• We assume that a resource $D$ can be represented both by a set of textual keywords $D_w = \{w_1, w_2, \dots, w_n\}$ and by a set of social actions (signals) performed on this resource, $D_a = \{a_1, a_2, \dots, a_m\}$.
• We consider a set $X = \{\text{Popularity}, \text{Reputation}, \text{Freshness}\}$ of 3 social properties that characterize a resource $D$. Each property is quantified by a specific group of actions. These properties are considered as a priori knowledge about a resource.
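As an illustration of this representation, here is a minimal sketch of a resource holding its keywords $D_w$ and its action counts $D_a$. The class name, field names and signal keys are hypothetical; the grouping of signals by property follows the one given later in Section 4.3, and Freshness is left out because it relies on dates rather than counts.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class Resource:
    """Hypothetical container for a resource D (names are illustrative)."""
    doc_id: str
    keywords: List[str] = field(default_factory=list)      # D_w
    actions: Dict[str, int] = field(default_factory=dict)  # D_a: signal -> count

# Grouping of signals by property, following Section 4.3 of this deck.
PROPERTY_SIGNALS = {
    "popularity": ["facebook_comment", "tweet", "linkedin_share", "facebook_share"],
    "reputation": ["facebook_like", "google_plus_one", "delicious_bookmark"],
}

def signals_for(resource: Resource, prop: str) -> Dict[str, int]:
    """Counts of the actions that quantify one social property (0 if unseen)."""
    return {a: resource.actions.get(a, 0) for a in PROPERTY_SIGNALS[prop]}

# Made-up document and counts, for illustration only.
film = Resource("doc-1", ["drama", "war"],
                {"facebook_like": 120, "facebook_share": 30, "tweet": 12})
print(signals_for(film, "popularity"))
```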
3.2 Social Signals and Social Properties
[Diagram: a Web resource is described by textual keywords and social signals (Like, +1, Share, Comment, dates of actions), which are used to quantify Reputation, Popularity and Freshness.]
3.1 Proposed Approach
3.3 Query Likelihood and Document Priors
• The language modelling approach computes the probability $P(D \mid Q)$ of a document $D$ given a query $Q$ by using Bayes' theorem:
$$\mathrm{Score}(Q, D) = P(D \mid Q) = \frac{P(D) \cdot P(Q \mid D)}{P(Q)} \tag{1}$$
• $P(D)$ is the document prior probability. It is useful for incorporating other sources of information into the retrieval process.
• $P(Q)$ can be ignored because it does not affect the ranking of documents:
$$\mathrm{Score}(Q, D) = P(D \mid Q) \propto \underbrace{P(D)}_{\text{document prior probability}} \cdot \underbrace{P(Q \mid D)}_{\text{query-likelihood score}} \tag{2}$$
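To make the role of the prior concrete, the following sketch ranks three documents by $\log P(D) + \log P(Q \mid D)$, which orders documents the same way as Eq. (2). The probabilities are made-up toy values and the function name is hypothetical; the query-likelihood values are assumed to come from some language model that is not shown here.

```python
import math

def score(prior: float, query_likelihood: float) -> float:
    """log P(D) + log P(Q|D); P(Q) is dropped because it is constant per query."""
    return math.log(prior) + math.log(query_likelihood)

# Toy values (made up): document -> (P(D), P(Q|D))
docs = {"d1": (0.2, 0.010), "d2": (0.5, 0.008), "d3": (0.3, 0.002)}

with_prior = sorted(docs, key=lambda d: score(*docs[d]), reverse=True)
without_prior = sorted(docs, key=lambda d: docs[d][1], reverse=True)

print(with_prior)     # ranking when the prior is taken into account
print(without_prior)  # ranking by query likelihood alone
```

In this toy case the prior moves d2 above d1, although d1 has the higher query likelihood.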
3.4 Estimating Priors: Popularity and Reputation
• Popularity P: the popularity of a resource can be estimated from the rate at which it is shared on social networks.
• Reputation R: the reputation of a resource can be estimated from social activities that have a positive meaning, such as the Facebook like. Indeed, the reputation of a resource depends on the degree of users' appreciation on social networks.
The general formula is the following:
$$P_x(D) = \prod_{a_i^x \in A} P_x(a_i^x) \tag{3}$$
Where:
$$P_x(a_i^x) = \frac{\mathrm{Count}(a_i^x, D)}{\mathrm{Count}(a_{\cdot}^x, D)} \tag{4}$$
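A minimal sketch of this counting-based estimate, assuming the per-action estimates of Eq. (4) are multiplied over the actions of one property in Eq. (3). The function names and the counts are hypothetical.

```python
from math import prod
from typing import Dict

def action_prob(counts: Dict[str, int], action: str) -> float:
    """Eq. (4): Count(a_i, D) / Count(a_., D) within one property."""
    total = sum(counts.values())
    return counts[action] / total if total else 0.0

def property_prior(counts: Dict[str, int]) -> float:
    """Eq. (3), assuming the per-action estimates are multiplied."""
    return prod(action_prob(counts, a) for a in counts)

# Made-up popularity counts for one document.
popularity_counts = {"share": 30, "comment": 10, "tweet": 10}
print(property_prior(popularity_counts))

# A single action with no occurrence drives the whole estimate to zero.
print(property_prior({"share": 5, "comment": 0}))
```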
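3.5 Estimating Priors: Popularity P and Reputation R
• To avoid zero probabilities, we smooth $P_x(a_i^x)$ with the collection $C$ using Dirichlet smoothing, where $\mu$ is the smoothing parameter. The formula becomes:
$$P_x(D) = \prod_{a_i^x \in A} \frac{\mathrm{Count}(a_i^x, D) + \mu \cdot P(a_i^x \mid C)}{\mathrm{Count}(a_{\cdot}^x, D) + \mu} \tag{5}$$
Where:
$$P(a_i^x \mid C) = \frac{\mathrm{Count}(a_i^x, C)}{\mathrm{Count}(a_{\cdot}^x, C)} \tag{6}$$
• $\mathrm{Count}(a_i^x, D)$ is the number of occurrences of the specific action $a_i^x$ performed on a resource.
• $a_i^x$ denotes the action $a_i$ used to measure a property $x$. $a_{\cdot}^x$ stands for all the social signals associated with property $x$, so $\mathrm{Count}(a_{\cdot}^x, \cdot)$ is their total number in document $D$ or in collection $C$.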
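The smoothed estimate can be sketched as follows, again assuming the smoothed factors of Eq. (5) are multiplied over the actions of one property. The function names, the counts and the value of $\mu$ are made up for illustration; the slides do not state which $\mu$ was used.

```python
from math import prod
from typing import Dict

def collection_prob(collection_counts: Dict[str, int], action: str) -> float:
    """Eq. (6): Count(a_i, C) / Count(a_., C)."""
    return collection_counts[action] / sum(collection_counts.values())

def smoothed_prior(doc_counts: Dict[str, int],
                   collection_counts: Dict[str, int],
                   mu: float) -> float:
    """Eq. (5), assuming the smoothed per-action factors are multiplied."""
    total = sum(doc_counts.get(a, 0) for a in collection_counts)
    return prod(
        (doc_counts.get(a, 0) + mu * collection_prob(collection_counts, a)) / (total + mu)
        for a in collection_counts
    )

# Made-up collection-level and document-level counts for one property.
collection = {"share": 600, "comment": 300, "tweet": 100}
doc = {"share": 5, "comment": 0, "tweet": 0}

p = smoothed_prior(doc, collection, mu=10.0)  # mu is an arbitrary choice here
print(p)
```

With these toy numbers the three factors are 11/15, 3/15 and 1/15, so the prior stays strictly positive even though two of the document counts are zero.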
3.6 Estimating Priors Considering Freshness F
• In addition to the simple counting of social actions, we propose to consider the time associated with each signal. We assume that a resource associated with fresh signals should be promoted compared with those associated with old signals. Therefore, instead of counting each occurrence of a given signal, we bias this count, denoted $\mathrm{Count}_B$, by the date of occurrence of the signal. The corresponding formula is as follows:
$$\mathrm{Count}_B(t_{j,a_i^x}, D) = \sum_{j=1}^{k} f_F(t_{j,a_i^x}, D) = \sum_{j=1}^{k} \exp\left(-\frac{\lVert t_{current} - t_{j,a_i^x} \rVert^2}{2\sigma^2}\right) \tag{7}$$
• $T_{a_i} = \{t_{1,a_i}, t_{2,a_i}, \dots, t_{k,a_i}\}$ is the set of $k$ datetimes at which the action $a_i$ was performed.
• $f_F(t_{j,a_i^x}, D)$ is the freshness function, estimated with a Gaussian kernel; it computes a distance between the current time $t_{current}$ and the action time $t_{j,a_i^x}$.
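A minimal sketch of the time-biased count of Eq. (7). The time unit (days), the value of $\sigma$, the reference date and the function names are assumptions made for this example; the slides do not specify them.

```python
import math
from datetime import datetime, timedelta
from typing import Iterable

def freshness(t_action: datetime, t_current: datetime, sigma_days: float) -> float:
    """Gaussian kernel on the age of one action, measured in days (assumed unit)."""
    age = (t_current - t_action).total_seconds() / 86400.0
    return math.exp(-(age ** 2) / (2.0 * sigma_days ** 2))

def biased_count(timestamps: Iterable[datetime], t_current: datetime,
                 sigma_days: float) -> float:
    """Eq. (7): sum of kernel weights over the k timestamps of one action."""
    return sum(freshness(t, t_current, sigma_days) for t in timestamps)

now = datetime(2014, 8, 26)  # arbitrary reference date
recent = [now - timedelta(days=d) for d in (0, 1, 2)]
old = [now - timedelta(days=d) for d in (300, 330, 365)]

print(biased_count(recent, now, sigma_days=30.0))  # close to the plain count of 3
print(biased_count(old, now, sigma_days=30.0))     # close to 0
```

Both lists contain three actions, so a plain count cannot tell them apart, whereas the biased count favours the resource with recent activity.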
3.7 Combining Priors
• In our case, various sources of social information influence the a priori probability of relevance. This probability is calculated by combining two main social properties (popularity and reputation). The problem can be formalized as follows:
$$P_{P \oplus R}(D) = P_P(D) \cdot P_R(D) \tag{8}$$
• $P_P(D)$ and $P_R(D)$ are the a priori probabilities related to popularity P and reputation R, which include the freshness function.
• $P_{P \oplus R}(D)$ is the probability resulting from the combination of the priors.
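As a sketch, the combination of Eq. (8) and its use in the final score can be written in log space, which avoids numerical underflow when many small probabilities are multiplied. Function names and values are hypothetical; multiplying the two priors amounts to treating popularity and reputation as independent.

```python
import math

def combined_prior(p_popularity: float, p_reputation: float) -> float:
    """Eq. (8): product of the popularity and reputation priors."""
    return p_popularity * p_reputation

def final_score(p_popularity: float, p_reputation: float,
                query_likelihood: float) -> float:
    """log P_{P+R}(D) + log P(Q|D), rank-equivalent to prior times likelihood."""
    return (math.log(p_popularity) + math.log(p_reputation)
            + math.log(query_likelihood))

# Made-up values for one document and one query.
print(combined_prior(0.2, 0.5))
print(final_score(0.2, 0.5, 0.01))
```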
4.1 Experimental Evaluation
• Objectives
1. First, to evaluate whether social signals taken from different social networks improve the search.
2. Second, to evaluate the impact of each signal taken separately and of signals grouped to represent a certain property.
3. Finally, to measure the impact of freshness.
• Evaluation challenges
1. Absence of a standard evaluation framework for social IR.
2. Collecting social signals from 5 social networks and setting up the experiments.
4.2 Description of DataSet
• Textual Content: 167438 documents from INEX IMDb.
Field | Description | Status
ID | Identifier of the film (document) | -
Title | Film's title | indexed
Year | Year of the film release | indexed
Rated | Film classification by content type | -
Released | Release date of the film | indexed
Runtime | Length of the film | indexed
Genre | Film genre (Action, Drama, etc.) | indexed
Director | Director of the film project | indexed
Writer | Writers of the film | indexed
Actors | Main actors of the film | indexed
Plot | Text summary of the film | indexed
Poster | URL of the poster link | -
url | URL of the Web source document | -
UGC | Social data retrieved | -
• Social Content: 8 social signals from 5 social networks.
- Facebook: Like, Share, Comment, Date of last action
- Twitter: Tweet
- Google+: +1
- LinkedIn: Share
- Delicious: Bookmark
• Query and Relevance Judgment: from INEX IMDb
- 30 queries (topics) and their Qrels from the set of INEX IMDb.
- Top 1000 documents returned for each topic.
4.3 Quantifying Social Properties
• Each social property is quantified based on social signals according to their nature and meaning.
Social Properties | Social Signals | Social Networks
Popularity P | Number of « Comment » (C1) | Facebook
Popularity P | Number of « Tweet » (C2) | Twitter
Popularity P | Number of « Share » (C3) | LinkedIn
Popularity P | Number of « Share » (C4) | Facebook
Reputation R | Number of « Like » (C5) | Facebook
Reputation R | Number of « +1 » (C6) | Google+
Reputation R | Number of « Bookmark » (C7) | Delicious
Freshness F | Dates of last actions (C8) | Facebook
4.4 Results: Single Priors and Combination Priors
[Chart: results of individual integration of social signals: Like, Share, Comment (Facebook signals), Tweet, Mention +1, Bookmark, Share (LIn)]
[Chart: different combinations of social signals (social properties): Popularity, Reputation, All Criteria, All Properties]
[Chart: baselines (topical models): Lucene Solr, ML.Hiemstra]
Measures reported in each chart: P@10, P@20, nDCG, MAP
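For reference, here is a minimal sketch of the four measures named in these charts, assuming binary relevance judgments (a document is either in the Qrels of a topic or not). The function names and toy data are hypothetical, and the exact nDCG variant and cutoff used in the experiments are not stated in the slides.

```python
import math
from typing import List, Sequence, Set, Tuple

def precision_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top-k results that are relevant."""
    return sum(1 for d in ranked[:k] if d in relevant) / k

def average_precision(ranked: Sequence[str], relevant: Set[str]) -> float:
    """Mean of the precision values at the rank of each relevant document."""
    hits, total = 0, 0.0
    for i, d in enumerate(ranked, start=1):
        if d in relevant:
            hits += 1
            total += hits / i
    return total / len(relevant) if relevant else 0.0

def ndcg_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Binary-gain nDCG with a log2 discount (one common variant)."""
    dcg = sum(1.0 / math.log2(i + 1)
              for i, d in enumerate(ranked[:k], start=1) if d in relevant)
    ideal = sum(1.0 / math.log2(i + 1)
                for i in range(1, min(k, len(relevant)) + 1))
    return dcg / ideal if ideal else 0.0

def mean_average_precision(runs: List[Tuple[Sequence[str], Set[str]]]) -> float:
    """MAP: average precision averaged over all topics."""
    return sum(average_precision(r, q) for r, q in runs) / len(runs)

# Toy ranking and judgments (made up).
ranked = ["d1", "d2", "d3", "d4"]
relevant = {"d1", "d3"}
print(precision_at_k(ranked, relevant, 2))
print(average_precision(ranked, relevant))
print(ndcg_at_k(ranked, relevant, 4))
```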
4.4 Results: Impact of the Freshness
[Chart: baselines (topical models): Lucene Solr, ML.Hiemstra]
[Chart: without integration of freshness: Share, Comment, Share+Comment, Popularity, All Criteria, All Properties]
[Chart: with integration of freshness (F): Share, Comment, Share+Comment, Popularity, All Criteria, All Properties]
Measures reported in each chart: P@10, P@20, nDCG, MAP
4.5 Results: Feature Selection Algorithms Study
Table 1. Selected Social Signals With Attribute Selection Algorithms
Legend: highly selected, moderately selected, less favored
4.6 Results: Ranking Correlation Analysis
Fig 1. Spearman correlation between social signals and relevance
Fig 2. Spearman correlation between social properties and relevance
Fig 3. Spearman's Rho correlation values for the social signals pairs
The social signal pairs (Tweet, Share (LIn)), (Bookmark, Tweet) and (Mention +1, Bookmark) are highly correlated, i.e., the similarity scores of these pairs are higher than 0.70.
Bookmark and Share (LIn) are the least important criteria, followed by Mention +1.
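Spearman's rho, used in this correlation analysis, is the Pearson correlation computed on ranks. A self-contained sketch follows, with hypothetical function names and made-up signal counts; it assumes neither input is constant, since the correlation is undefined in that case.

```python
from typing import List, Sequence

def ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result

def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; raises ZeroDivisionError if an input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman's rho: Pearson correlation of the two rank vectors."""
    return pearson(ranks(x), ranks(y))

# Made-up per-document counts for two signals.
tweets = [30, 2, 9, 1, 14]
bookmarks = [12, 0, 5, 1, 3]
print(spearman(tweets, bookmarks))
```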
5. Conclusion
• Social Information Retrieval based on Language Model
- Topical relevance (retrieval model based on content only).
- Social relevance (retrieval model based on content and social features).
• Experimental Evaluation
- Superiority of the proposed approach compared to textual models (baselines).
- Positive ranking correlation between social signals and relevance.
- Attribute selection algorithms.
• Perspectives
- Integration of other social features.
- Further study of the impact of the temporal property.
- Comparison of the proposed models with other social models.
- Experimental evaluation on other types of datasets.
http://www.irit.fr/~Ismail.Badache/
Thank you @IIiX2014 for travel support