[ieee 2012 international conference on information technology and e-services (icites) - sousse,...

Building a social network, based on collaborative tagging, to enhance social information retrieval

Amel Benna, Hakima Mellah, Karima Hadjari DSISM department

CERIST Algiers, Algeria

[email protected], [email protected]

Abstract—Web 2.0 technologies put the user at the center of data production and introduce strong social collaboration. As a result, the techniques used in traditional information retrieval systems do not meet the requirements of users who want their social preferences taken into account. This paper proposes to include not only the social context of the user but also that of the resource. In the social network that we consider, the user social context captures the user's interests, which are derived from a collaborative tagging system, while the resource social context is related to clusters of tags, obtained by a classification method, and to users' opinions according to their expertise level on the resource. The results of our experimental evaluation on real-world datasets (crawled from the delicious folksonomy) demonstrate significant improvements over traditional retrieval approaches.

Keywords- Social networks; social information retrieval; collaborative tagging.

I. INTRODUCTION

The advent of Web 2.0 technologies (blogs, bookmarks, social networks, collaborative tagging ...) has promoted a new paradigm in which suppliers and users can improve the sharing and reuse of published information. These technologies put users at the center of data production and introduce strong social collaboration. Internet users spend more time on social media and share their opinions on commercial sites, blogs or social networks. Different kinds of resources (documents, videos, images…) can be commented on. For example, in 2011 GlobalWebIndex estimated more than 450 million active users on Facebook and over 150 million users on Google.

The growing popularity of social media (according to Gartner Group, social media grew by over 41% in one year in 2011) has led to the emergence of Social Information Retrieval (SIR) [1]. SIR is defined as the incorporation of information about social networks and relationships into the information retrieval process [3]. It refers to an approach that assists users in obtaining information to meet their information needs by harnessing the knowledge or experience of other users [1]. It is intended to improve the accuracy and quality of results [2].

Thus, the techniques used in Traditional Information Retrieval (TIR) systems do not meet the requirements of users who want their social preferences taken into account. The same query should not return identical results when it is issued in different contexts (timestamp, location, user profile, social preferences, type of resource...).

Contextual information retrieval, which adapts the search process to the context of use, faces context-related issues such as modeling, evolution during sessions, and access model evaluation. In this paper, we address these issues for the social context dimension.

In our work, SIR includes both the user and the resource social context. The user social context reflects the user's interests, captured from a collaborative tagging system. The resource social context, in turn, is related to clusters of tags, obtained by a classification method, and to user approval of the resource according to expertise level. We propose an approach that integrates, in the social network that we define, different relationships and social factors such as user centers of interest, user opinion, user expertise level and clusters of tags. Indeed, incorporating several social factors can maximize the relevance of the results returned [4, 5, 8].

The main contributions of this paper are summarized as follows:

(1) Defining clusters of tags as nodes of the social network, using a classification method, and applying a hierarchical classification to the tags within a cluster.

(2) Including user approval of a resource according to the user's expertise level.

(3) Expanding the user query from the user's social network.

(4) Exploring the social network and user expertise level for ranking results.

The rest of this paper is structured as follows: Section 2 introduces related work on SIR based on collaborative tagging systems, user interests and social networks. Section 3 proposes a SIR model based on social networks and collaborative tagging practices. Section 4 presents the results of our evaluation. We conclude by presenting future directions in Section 5.

978-1-4673-1166-3/12/$31.00 ©2012 IEEE

2012 International Conference on Information Technology and e-Services

II. RELATED WORK

Several approaches to SIR have been proposed. These works can be distinguished according to the social factors and the type of social relationships used.

We briefly review three categories of work that are closely related to the approach that we propose: (A) collaborative tagging, (B) user interests and (C) social networks.

A. Works based on collaborative tagging systems

The study of the structure of collaborative tagging systems in [23] showed regularities in user activity, tag frequencies and kinds of tags used, and a remarkable stability in the relative proportions of tags within a given URL.

To improve web search, [5] explores the use of social annotations. Two algorithms are proposed: the first calculates the similarity between social annotations and web queries, whereas the second captures the popularity of web pages using social annotations.

A probabilistic generative model for the generation of resource content and associated social annotations is proposed in [6]. This model is simplified to a computationally tractable hierarchical Bayesian network. A user opinion is also included. The SIR model presented in [8] is based on social approval votes on documents. It shows that social information on documents can improve search and that the sources of approval provide more detail on user needs, particularly when votes are provided by experts.

To define the user expertise level, [12] proposes a user model and integrates it into the calculation of the weight of a tag. The evaluation is based on the degree of closeness between the user's interests and the resource area, and on expertise and personal assessment for the tags associated with the resource.

However, systems that use collaborative tagging suffer from a number of limitations such as:

- Uncontrolled user vocabulary due to user own terminology and own way of thinking,

- Ambiguity due to the existence of synonyms,

- Variability on writing some tags,

- Absence of semantic links between tags.

These issues reduce the potential of information retrieval while the volume of tagged content grows every day, and they affect both response time and result quality.

B. Works based on similarity between users

In addition to the user expertise level, the user's centers of interest can also be defined from their tags. Several studies [7, 13, 14, 15] have used tags to define a user profile that includes the user's centers of interest. To determine the social relations between users, some studies [10, 11, 38] have considered co-author and/or citation relationships in bibliographic resources, while other works are based on users' centers of interest [4, 7] and on friendship relationships [2].

A similarity between users and resources is included in [4] to rank documents according to their relevance. It is defined by the cosine similarity [16] between user vectors, while two tags are considered similar when the same resource has been tagged with both of them.

The K-Nearest Neighbor algorithm is applied in [7] to determine which users share the most interests with the user who initiates the search. Standard cosine similarity is used to compare the feature vectors of users.

A scoring model is developed in [2] based on the mutual relationships of users, content and tags, where user relationships are based on friendships.

C. Works based on social network analysis

Social network analysis is used to represent social information, as in [3, 8, 9, 10, 11, 29], which show how incorporating the social importance of authors improves over classical search.

Reference [9] introduces an author network based on co-author and citation relationships in federated digital libraries. It measures the closeness and betweenness centrality of actors and their influence within a social structure.

The main limitations of SIR follow from its domain model, i.e. it is only applicable where a social network is present in the domain or can be derived [3].

Building on [3, 9], the evaluation of the model in [10] shows that the Hub measure (the centrality of the authors) is the best measure for assessing the social significance of documents.

The search approach suggested in [29] demonstrates that co-author relationships are valuable information when computing the importance of a document.

The model proposed in [11] applies four centrality measures (degree, PageRank [30], closeness and betweenness) to an evolving co-authorship network. In this work, the centrality measures include the impact of the article, i.e. its citation counts, and the scope of the author.

The SIR works cited above either neglect the user expertise level, the links between tags or the links between users, or are limited to evaluations on collections of scientific publications in which the links between users are restricted to co-author, citation and annotation relations.

Inspired by these works, and in order to consider as many social factors as possible, we modeled and explored a social network for social search, using user tags and including links between tags, links between users, and a new entity: the user expertise level in a center of interest.

III. SOCIAL INFORMATION RETRIEVAL MODEL

In this section, we describe the SIR model based on social networks and collaborative tagging practices. Our approach considers users who have tagged at least one resource. The keywords assigned to a resource by its author are also considered as tags. Consequently, each resource is tagged by at least one user. The structure of the proposed social network is presented in Section A. We describe the social search process on the social network, including the ranking of results, in Section B.

A. Social network structure

Formally, the social network is a graph G = (V, E), where V denotes the set of nodes and E the set of edges, with V = U ∪ C ∪ R, where U, C and R are respectively a set of users, a set of clusters of tags and a set of resources.

Figure 1. Social network structure

Building the social network involves identifying the nature of the nodes and the relationships between them. The nodes of the proposed network (see Figure 1) are users, resources, and clusters of tags. A user of the social network can give an opinion on a resource according to his or her expertise level. Most users use nouns as the grammatical form of their tags [16].

We describe four kinds of relationships in our social network: 1) relationships between tags, represented by clusters of tags, 2) relationships between clusters of tags and resources, 3) relationships between users and clusters of tags, and 4) relationships between users based on shared interests.

1) Relationship between tags: clusters of tags

Users tag resources freely. These tags can be associated with different types of resources (videos, images, bookmarks, articles, and blogs) and require no skill from the user.

In order to use the most significant tags, decrease tag redundancy and ambiguity [17], find semantically similar tags [18], reduce response time and improve the quality of results [19], we do not treat a tag as an entity independent of the other tags; instead, we define clusters of tags. A cluster of tags is a set of semantically linked tags obtained by a classification method. Clustering is also the most common way to gather additional information in collaborative tagging systems [20].

The classification process of our folksonomy tags includes three steps:

a. Creating the tag-tag data matrix and transforming it into a semantic distance matrix,

b. Applying the k-means [20] classification algorithm to generate clusters of tags,

c. Defining the hierarchy of tags in a cluster.

These steps are described in the following.

a) Tag-Tag data matrix

The analysis conducted in [23] shows a remarkable stability in the relative proportions of tags within a given resource. Empirically, once a resource has been tagged over a hundred times, each tag's frequency remains a stable proportion of the total frequency of all tags used for this resource.

Let F = (U, T, R, Y) be the formal structure of a folksonomy [22]. U, T and R are finite sets whose elements are respectively users, tags and resources. Y is a ternary relation between them such that Y ⊆ U × T × R. The weight of the edge between two tags ti and tj is given by the number of posts that contain both ti and tj. A post is a triple (u, Tur, r), where Tur is the set of tags used by user u to tag a resource r.

The tag-tag co-occurrence matrix [24] is built from the co-occurrence of each pair (ti, tj) of tags according to (1).
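As a minimal sketch of this counting step, the following Python fragment derives tag-tag co-occurrence weights from a set of posts. The function name `cooccurrence`, the dictionary layout of `posts` and the toy data are illustrative assumptions, not part of the original system.

```python
from collections import Counter
from itertools import combinations


def cooccurrence(posts):
    """Count, for every unordered tag pair, the posts that contain both tags.

    `posts` maps a (user, resource) pair to the set of tags used in that post.
    """
    weights = Counter()
    for tags in posts.values():
        # Sorting gives each pair a canonical (alphabetical) key.
        for ti, tj in combinations(sorted(tags), 2):
            weights[(ti, tj)] += 1
    return weights


# Hypothetical toy folksonomy: three posts by two users.
posts = {
    ("u1", "r1"): {"java", "programming", "code"},
    ("u2", "r1"): {"java", "programming"},
    ("u2", "r2"): {"java", "music"},
}
w = cooccurrence(posts)
print(w[("java", "programming")])  # 2
```

Pairs are stored once, in alphabetical order, so the matrix is implicitly symmetric; a pair that never co-occurs simply reads as 0.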

Following [26], once the data matrix has been created it is transformed into a cosine matrix by measuring the cosine distance between vectors according to (2). A vector Vtj records the number of times each user Ui uses the tag tj.
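The transformation can be illustrated with a short sketch that computes pairwise cosine values between tag vectors. It assumes each tag is described by its usage counts per user, as stated above; the names `cosine`, `cosine_matrix` and the sample counts are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def cosine_matrix(vectors):
    """Return the sorted tag list and the square matrix of pairwise cosines."""
    tags = sorted(vectors)
    matrix = [[cosine(vectors[a], vectors[b]) for b in tags] for a in tags]
    return tags, matrix


# Hypothetical usage counts: one vector per tag, one component per user.
usage = {
    "java": [3, 1, 0],
    "programming": [6, 2, 0],
    "music": [0, 0, 4],
}
tags, sim = cosine_matrix(usage)
```

In this toy data, "java" and "programming" are used in the same proportions by the same users, so their cosine is 1, while "java" and "music" share no user and score 0.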

b) K-Means classification

After the Tag-Tag data matrix has been converted into a cosine matrix, a classification is performed. We apply the k-means method to the cosine matrix to find the set of k clusters of tags. We build the clusters by minimizing the variation within clusters (between the tags of a cluster) and maximizing the distance between clusters of tags.
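To make the clustering step concrete, here is a small sketch of Lloyd's k-means iteration applied to the rows of a cosine matrix. The seeding strategy (the first k rows), the fixed iteration count and the four-tag matrix are simplifying assumptions chosen to keep the example deterministic; they are not taken from the paper.

```python
import math


def kmeans(points, k, iterations=20):
    """Plain Lloyd's algorithm; the first k points seed the centroids."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Hypothetical cosine-matrix rows for the tags: java, music, programming, mp3.
rows = [
    [1.0, 0.1, 0.9, 0.0],  # java
    [0.1, 1.0, 0.0, 0.8],  # music
    [0.9, 0.0, 1.0, 0.1],  # programming
    [0.0, 0.8, 0.1, 1.0],  # mp3
]
labels = kmeans(rows, k=2)
print(labels)  # [0, 1, 0, 1]
```

Each row describes a tag by its similarity to every other tag, so tags with similar neighbourhoods ("java" and "programming", "music" and "mp3") end up in the same cluster.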

c) Hierarchy of tags in a cluster

To handle spelling variations of tags and compound words in each cluster of tags, we use the Levenshtein distance [27]. Each tag is grouped with its variant spellings into a single concept. According to [24], the tags given by co-occurrence and by the FolkRank measure [25] correspond to higher-level concepts in the hierarchy, whereas the tags given by distributional measures tend to be at the same level in the hierarchy.

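$$w(t_i, t_j) = \left| \{ (u, r) \in U \times R : t_i \in T_{ur} \wedge t_j \in T_{ur} \} \right| \qquad (1)$$

$$\cos(V_{t_i}, V_{t_j}) = \frac{V_{t_i} \cdot V_{t_j}}{\lVert V_{t_i} \rVert \, \lVert V_{t_j} \rVert} \qquad (2)$$

where $T_{ur}$ denotes the set of tags that user u assigned to resource r, and $V_{t_i}$, $V_{t_j}$ are the usage vectors of tags $t_i$ and $t_j$.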

The tag with a high degree of co-occurrence in the resources is chosen as the concept. A hierarchy of tags is built in each cluster by applying the hierarchical classification algorithm proposed in [21]. This classification structures each cluster of tags as a tree, in which tags are nodes and each branch contains a hierarchy of tags.
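The grouping of spelling variants described in this subsection can be sketched as follows. The edit-distance bound `max_distance=1`, the greedy merge order and the use of raw tag frequency as a stand-in for the degree of co-occurrence are all assumptions made for illustration; the hierarchical algorithm of [21] is not reproduced here.

```python
def levenshtein(a, b):
    """Edit distance (insertions, deletions, substitutions) between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def group_variants(tag_counts, max_distance=1):
    """Merge each tag into the first more frequent tag within max_distance."""
    concepts = {}
    for tag in sorted(tag_counts, key=tag_counts.get, reverse=True):
        for concept in concepts:
            if levenshtein(tag, concept) <= max_distance:
                concepts[concept].append(tag)
                break
        else:
            # No close enough concept exists yet: the tag becomes one.
            concepts[tag] = [tag]
    return concepts


# Hypothetical tag frequencies inside one cluster.
counts = {"javascript": 40, "java-script": 5, "javascripts": 3, "python": 25}
groups = group_variants(counts)
print(groups)
```

Processing tags from most to least frequent means the dominant spelling becomes the concept label and rarer variants attach to it.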

2) Relationships between clusters of tags and resources

In this step, the resources already tagged are reassigned to clusters of tags. After classifying the folksonomy F into clusters of tags, we evaluate the degree of membership Dij of a resource ri to the cluster Cj by (3), where occ(tl, ri) denotes the co-occurrence of tag tl with resource ri and tl belongs to cluster Cj.

In order to form clusters that contain similar resources, each resource is associated with the cluster of tags for which its degree of membership is maximal.
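The assignment rule can be illustrated with a brief sketch. Because the exact form of (3) is not reproduced in this text, the membership degree below is an assumed one: the share of a resource's tag occurrences that fall inside the cluster. The names `membership`, `assign` and the sample data are hypothetical.

```python
def membership(resource_tags, cluster):
    """Assumed degree of membership: fraction of tag occurrences in the cluster.

    `resource_tags` maps each tag to occ(tag, resource).
    """
    total = sum(resource_tags.values())
    inside = sum(n for tag, n in resource_tags.items() if tag in cluster)
    return inside / total if total else 0.0


def assign(resource_tags, clusters):
    """Return the name of the cluster with the maximal membership degree."""
    return max(clusters, key=lambda name: membership(resource_tags, clusters[name]))


# Hypothetical clusters and one tagged resource.
clusters = {
    "software": {"java", "programming", "code"},
    "music": {"music", "mp3"},
}
r1 = {"java": 5, "programming": 3, "music": 2}
print(assign(r1, clusters))  # software
```

Any other monotone membership measure could be substituted for `membership` without changing the argmax structure of `assign`.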

3) Relationships between users and clusters of tags

A relationship between a user and a cluster of tags exists when the user has tagged a resource with at least one tag of the cluster.
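A minimal sketch of this rule, assuming posts are stored as a mapping from (user, resource) to a set of tags, derives the user-cluster edges of the graph. The function name and the sample data are illustrative only.

```python
def user_cluster_edges(posts, clusters):
    """Link a user to every cluster that shares at least one tag with a post."""
    edges = set()
    for (user, _resource), tags in posts.items():
        for name, cluster_tags in clusters.items():
            if tags & cluster_tags:  # non-empty intersection
                edges.add((user, name))
    return edges


# Hypothetical clusters and posts.
clusters = {
    "software": {"java", "programming", "code"},
    "music": {"music", "mp3"},
}
posts = {
    ("u1", "r1"): {"java", "code"},
    ("u2", "r2"): {"mp3"},
    ("u2", "r3"): {"java", "music"},
}
edges = user_cluster_edges(posts, clusters)
print(sorted(edges))
```

A single post can link its user to several clusters, as the third post does here.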

4) Relationships between users: sharing common interests

To represent the relationships between users, we rely on [4], where the links between users are based on their common interests.

To define user interests, we were inspired by the model proposed in [7]. The user's interests are deduced from his or her tagged resources and are represented by a vector Ug according to (4).

Each dimension W(gi) defines the importance of a particular center of interest gi for user U. |G| denotes the number of centers of interest. W(gi) is calculated for each user U according to (5).

After the weights of the centers of interest have been assessed, the degree of similarity between two users U and V is calculated using cosine similarity as defined in (6).
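The following sketch shows how such interest vectors and their cosine similarity might be computed. Since the weighting formula (5) is not reproduced in this text, W(gi) is assumed here to be the share of a user's tagged resources that fall under the center of interest gi; the function names and the sample users are hypothetical.

```python
import math
from collections import Counter


def interest_vector(resource_interests, all_interests):
    """Assumed weighting: fraction of the user's tagged resources per interest."""
    counts = Counter(resource_interests)
    total = sum(counts.values())
    return [counts[g] / total if total else 0.0 for g in all_interests]


def cosine(a, b):
    """Cosine similarity between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


interests = ["software", "music", "health"]
# Each list gives the center of interest of every resource the user tagged.
u = interest_vector(["software", "software", "software", "music"], interests)
v = interest_vector(["software", "music"], interests)
w = interest_vector(["health"], interests)

print(u)             # [0.75, 0.25, 0.0]
print(cosine(u, w))  # 0.0
```

Users u and v overlap on two interests and obtain a similarity of about 0.894, whereas u and w have no interest in common and score 0, so no edge would link them.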

B. Social search process

Once the social network is built, the social search process can be initiated. Figure 2 illustrates the overall architecture of the search process. When a user issues a query, the first step is query expansion; the second step is to find resources by matching the tags of the social network against the query tags, which are treated as user keywords. The last step is the ranking of the results returned by the query.

Figure 2. Social search process architecture

1) Query expansion

When a user issues a query, it is disambiguated by detecting spelling variations of its keywords, using the Levenshtein distance with a threshold equal to 0.8. Indeed, [17] noted that most tags are nouns, so lemmatization methods are not recommended.

For example, when a user, through ignorance or misspelling, issues the query "jav", which does not match any tag in the database, the query is replaced by the tag "java".
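The disambiguation step can be sketched as below. The text does not say how the Levenshtein distance is normalized into a 0 to 1 score, so this sketch assumes similarity = 1 - d / max(len(a), len(b)). Under that assumption "jav" and "java" score 0.75, which is why the threshold is exposed as a parameter rather than fixed; the function names and the tag list are hypothetical.

```python
def levenshtein(a, b):
    """Edit distance (insertions, deletions, substitutions) between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def similarity(a, b):
    """Assumed normalization of the edit distance into a score in [0, 1]."""
    longest = max(len(a), len(b))
    return 1.0 - levenshtein(a, b) / longest if longest else 1.0


def disambiguate(keyword, known_tags, threshold=0.8):
    """Replace an unknown keyword by its closest known tag, if close enough."""
    if not known_tags or keyword in known_tags:
        return keyword
    best = max(known_tags, key=lambda tag: similarity(keyword, tag))
    return best if similarity(keyword, best) >= threshold else keyword


tags = ["java", "javascript", "music"]
print(disambiguate("javascrpt", tags))            # javascript
print(disambiguate("jav", tags))                  # jav
print(disambiguate("jav", tags, threshold=0.75))  # java
```

With a different normalization, or a slightly lower threshold, the "jav" to "java" replacement from the example above goes through.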

After query disambiguation, the social network is used to determine the tags of users who share the same interest as the user who initiates the request. The expertise level of these users in the query's center of interest is higher than or equal to that of the user who initiates the request.

For example, when a user query uses the tag "Java" in the center of interest "Music", it is enriched with the tags "music, mp3". The same request is enriched with the tags "search, picture" in the center of interest "Science and Health".
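A possible reading of this enrichment step is sketched below. The profile layout (per-interest expertise levels and tag sets), the function name `expand_query` and the sample users are assumptions made for the illustration; the sketch simply adds the tags of every peer who shares the center of interest with at least the requester's expertise level.

```python
def expand_query(query_tags, requester, interest, users):
    """Add the tags of peers sharing `interest` with expertise >= the requester's.

    `users` maps a name to {"expertise": {interest: level},
                            "tags": {interest: set_of_tags}}.
    """
    level = users[requester]["expertise"].get(interest, 0)
    expanded = set(query_tags)
    for name, profile in users.items():
        if name == requester:
            continue
        peer_level = profile["expertise"].get(interest)
        # Peers without this center of interest are skipped.
        if peer_level is not None and peer_level >= level:
            expanded |= profile["tags"].get(interest, set())
    return expanded


# Hypothetical user profiles.
users = {
    "alice": {"expertise": {"Music": 2}, "tags": {"Music": {"java"}}},
    "bob": {"expertise": {"Music": 3}, "tags": {"Music": {"java", "music", "mp3"}}},
    "carol": {"expertise": {"Music": 1}, "tags": {"Music": {"karaoke"}}},
    "dave": {"expertise": {"Science and Health": 4},
             "tags": {"Science and Health": {"search", "picture"}}},
}
print(sorted(expand_query({"java"}, "alice", "Music", users)))
# ['java', 'mp3', 'music']
```

Only bob contributes here: carol's expertise is lower than alice's, and dave does not share the "Music" center of interest.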

2) Finding resources

After query expansion, a similarity score, Scsimi, is computed according to (7) for each user Uj who shares the same center of interest as, and has a similar expertise level to, the user who issued the request. The user expertise level in a research field may vary according to his or her knowledge. This level can be determined on the basis of the depth of the user's tags in an ontology [12].

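(3)

$$U_g = \big( W(g_1), W(g_2), \ldots, W(g_{|G|}) \big) \qquad (4)$$

(5)

$$\mathrm{sim}(U, V) = \frac{U_g \cdot V_g}{\lVert U_g \rVert \, \lVert V_g \rVert} \qquad (6)$$

(7)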

Where:

Vrc: denotes the tag vector of the user Ur who initiates the request.

Vvc: denotes the vector of the cluster of tags C for user Uv.

m: denotes the number of clusters of tags.

sim(Vrc, Vvc): denotes the semantic similarity between the expanded request tag vector Vrc and the cluster-of-tags vectors Vvc of users who have the same center of interest as the user who initiates the query.

exi: denotes the expertise level of user Ui in the request's center of interest.

3) Ranking resources

The resources returned by a search are ranked according to their degree of relevance by a ranking function R(p) computed over the users who tagged the resource p. This function is based on the similarity between users and between tags, and includes the approval of network users according to their expertise levels.

R(p) is calculated for each query: for each user Ui who tagged the resource p, we calculate a ranking function defined by (8).

Sim(Vrc, Vci) denotes the similarity between the expanded query vector Vrc and the vector Vci, where Vci is the cluster-of-tags vector of user Ui, who tagged the resource p, for cluster C. Sim(Ur, V) denotes the similarity between the user Ur who initiates the request and the user V who tagged the resource p. �i denotes the opinion of user Ui on resource p. exi denotes the expertise level of user Ui in the request's center of interest I.
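To show how the four factors listed above could interact, the sketch below combines them per tagger and sums over taggers. This combination is purely an assumption: the exact form of (8) is not reproduced in this text, and a weighted sum or a normalized variant would be equally plausible. All names and values are hypothetical.

```python
def rank_score(taggers):
    """Assumed score: sum over taggers of the product of the four factors."""
    return sum(
        t["tag_sim"] * t["user_sim"] * t["opinion"] * t["expertise"]
        for t in taggers
    )


def rank(resources):
    """Order resource identifiers by decreasing assumed score."""
    return sorted(resources, key=lambda p: rank_score(resources[p]), reverse=True)


# Hypothetical factors for the users who tagged each resource:
# tag_sim ~ Sim(Vrc, Vci), user_sim ~ Sim(Ur, V), opinion and expertise per user.
resources = {
    "p1": [
        {"tag_sim": 0.9, "user_sim": 0.8, "opinion": 1.0, "expertise": 3},
        {"tag_sim": 0.5, "user_sim": 0.5, "opinion": 0.5, "expertise": 1},
    ],
    "p2": [
        {"tag_sim": 0.9, "user_sim": 0.9, "opinion": 0.2, "expertise": 1},
    ],
}
print(rank(resources))  # ['p1', 'p2']
```

Under this assumed form, a resource approved by a similar and expert user outranks one that matches the query equally well but received a weak opinion from a novice.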

IV. EVALUATION AND RESULTS

To describe our experiment, we present (A) the datasets used, (B) the social network built and (C) the experimental results.

A. Datasets

In the absence of a standardized test collection [2, 3], we evaluated our approach on real-world datasets crawled from the del.icio.us bookmarking system. We extracted data from 'deliciousdata', which contains a set U of 2000 users, a set T of 2000 tags, and a set R of 73 resources, for 3577 annotations.

Analysis of the 2000 tags showed that 200 tags have a high co-occurrence frequency for 1879 users and represent more than 70% of user annotations. We used only the triplets of tags, users and resources. These triplets represent 80% of folksonomy tags [24].

We performed a series of tests for three types of search: (i) TIR, based on the vector model; (ii) SIR, based on the model defined in Section 3, which includes the relationships between users and the relationships between tags but does not take into account the user expertise level; and (iii) Extended Social Information Retrieval (ESIR), which extends SIR by integrating the users' expertise level and their views on the relevance of resources.

B. Building a social network

As a first step, we build our social network, and thus the clusters of tags and the centers of interest. The set T of tags was classified using the k-means method with k = 17. The result is a set C of clusters with an average of 12.3 tags per cluster. The similarity between two resources assigned to the same cluster is greater than 0.65. Resources grouped in the same branch segment of the hierarchy have a similarity of more than 0.9.

The link between users is determined by common interests: two users are connected if they share at least one center of interest.

C. Experimentation results

To evaluate the relevance of our approach, we issued several queries in the three search types: TIR, SIR, and ESIR. The systems were compared using the measures of recall and precision. The degree of relevance was judged by researchers in information technology.

We measured the relevance of the top 10 recommended resources for the query "algorithms" in the center of interest "Software".

The recall-precision curves are plotted in Figure 3. All three curves are decreasing because the two measures vary inversely: precision decreases as recall increases.
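For reference, the points of such a curve can be obtained by computing precision and recall at successive cut-offs of a ranked list, as in the sketch below. The ranked list and the relevance judgments are invented for the illustration and are not the data of this experiment.

```python
def precision_recall_at_k(ranked, relevant, k):
    """Precision and recall over the first k results of a ranked list."""
    hits = sum(1 for r in ranked[:k] if r in relevant)
    precision = hits / k
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical ranking and relevance judgments.
ranked = ["r1", "r2", "r3", "r4", "r5"]
relevant = {"r1", "r3", "r6"}

for k in (1, 3, 5):
    p, r = precision_recall_at_k(ranked, relevant, k)
    print(k, round(p, 3), round(r, 3))
# 1 1.0 0.333
# 3 0.667 0.667
# 5 0.4 0.667
```

As k grows, recall can only stay level or rise while precision tends to fall, which produces the decreasing shape described above.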

The evaluation shows that SIR and ESIR search perform better than TIR search. The curves of SIR and ESIR are similar for the initial results; the ESIR curve then rises above that of SIR, which indicates the importance of the amount of social information used by the system.

Figure 3. Recall and precision curves of social network

(8)

V. CONCLUSION

In this paper, we have presented a SIR model based on a social network of users, clusters of tags and resources, and on users' opinions according to their expertise level. We used the collaborative tagging system to build clusters of tags and relationships between users according to their interests.

To address the problems of collaborative tagging systems, we do not treat a tag in our social network as an independent entity. We apply the k-means clustering algorithm to define clusters of tags, and build a hierarchy in each cluster. Resource relevance is based on user opinions according to their expertise levels. We then define a ranking function to order resources according to their degree of relevance. Our approach also makes it possible to enrich a user request by suggesting tags from the user's social network.

The first conclusions that emerge from the relevance tests are that integrating different social contexts gives conclusive results and that integrating several social factors makes the search results more relevant. Other social context information can also be used: concerning users, trust, friendship and colleague relationships; concerning a resource, its location and the situation in which it was produced.

The evaluation was carried out on the del.icio.us bookmarking system. It remains to evaluate the performance of the solution on a larger sample. According to [28], the scaling issue arises in particular for collaborative access models involving a user base with sparse matrices.

REFERENCES

[1] R. Vuorikari, “Can social information retrieval enhance the discovery and reuse of digital educational content?”, RecSys, ACM, 19-20 October 2007, MN, USA, pp. 207-210

[2] M. Bender, T. Crecelius, M. Kacimi, S. Michel, T. Neumann, J.X. Parreira, R. Schenkel and G.Weikum, “Exploiting social relations for query expansion and result ranking,” in ICDE Workshops 2008, IEEE Computer Society, April 7-12, 2008, Cancún, México, pp. 501-506.

[3] S. Marius Kirsch, M. Gnasa and A. B. Cremers, “Beyond the Web: Retrieval in Social Information Spaces”, Advances in Information Retrieval. London, UK : Springer, Lecture Notes in Computer Science, 2006, Vol. 3936, pp. 84-95.

[4] V. Zanardi and L. Capra, “Social ranking: Uncovering relevant content using Tag-based Recommender Systems,” RecSys'08. ACM, 23-25 October 2008.

[5] S. Bao, G. Xue, X. Wu, Y. Yu, B. Fei and Z. Su, “Optimizing web search using social annotations,” Proceedings of the 16th International Conference on World Wide Web, WWW 2007. ACM 2007, 8-12 May 2007, pp. 501-510.

[6] D. Zhou, J. Bian, S. Zheng, H. Zha and C. Lee Giles, “Exploring social annotations for information retrieval,” Proceedings of the 17th International Conference on World Wide Web, WWW 2008. ACM 2008, 21-25 April 2008, pp. 715-724.

[7] B. Jiang, Y. Ling and J. Wang, “Tag Recommendation based on social comment network,” International Journal of Digital Content Technology and its Applications, Vol. 4, Nov 2010.

[8] G. Kazai and N. M. Frayling, “Effects of social approval votes on search performance,” Sixth International Conference on Information Technology: New Generations, ITNG 2009, 27-29 April 2009, ISBN 978-0-7695-3596-8, pp. 1554-1559.

[9] P. Mutschke, “Enhancing information retrieval in federated bibliographic data sources using author network based stratagems,” ECDL 2001, LNCS 2163, Springer, 4-9 September 2001, Vol. 2163, pp. 287-299.

[10] L. Ben Jabeur, L. Tamine and M. Boughanem, “A social model for literature access: towards a weighted social network of authors,” RIAO 2010, pp. 32-39.

[11] E. Yan and Y. Ding, “Applying centrality measures to impact analysis: A coauthorship network analysis,” 2009.

[12] S. Kichou, H. Mellah, Y. Amghar and F. Dahak, “Tags weighting based on user profile,” Active Media Technology - 7th International Conference, AMT 2011. Lecture Notes in Computer Science 6890 Springer 2011, 7-9 September 2011, ISBN 978-3-642-23619-8, pp. 206-216.

[13] Y. C. Huang, C. C. Hung and J. Y. Hsu, “You are what you Tag,” In Proceedings of AAAI 2008 Spring Symposium, Series on Social Information Processing, 2008.

[14] C. Max, J. Christine and S-D. Chantal, “Collaboration & Social Information & Access: techniques for approached User Modelling,” CHAP: Adaptive User Profiles. s.l.: IGI Global, 2009.

[15] E. Michlmayr, S. Cayzer and P. Shabajee, “Add-a-Tag: Learning adaptive user profiles from Bookmark collections,” ICWSM’2007, 2007.

[16] G. Salton, A. Wong and C. Yang, “A vector space model for automatic indexing,” Communications of the ACM. 1975, Vol. 18, 11, pp. 613-620.

[17] L. Spiteri, “Structure and form of folksonomy tags: The road to the public library catalogue,” Webology, Vol. 4, 2007.

[18] J. Gemmell, A. Shepitsen, B. Mobasher and R. D. Burke, “Personalizing navigation in Folksonomies using hierarchical Tag clustering,” Data Warehousing and Knowledge Discovery, 10th International Conference: DaWaK 2008. Lecture Notes in Computer Science, 2-5 September 2008, pp. 196-205.

[19] G. Begelman, P. Keller and F. Smadja, “Automated tag clustering: Improving search and exploration in the tag space,” Proc. of the Collaborative Web Tagging Workshop at WWW, 22–26 May 2006.

[20] A. K. Jain, “Data Clustering: 50 Years Beyond K-Means,” Lecture Notes in Computer Science. Springer, ECML/PKDD(1), 2008, pp. 3-4.

[21] W.-T. Hsieh, W. S. Lai and S. C. T. Chou, “A collaborative tagging system for learning resources sharing,” In IV International Conference on Multimedia and Information and Communication Technologies in Education (m-ICTE2006), 2006, pp. 1364-1368.

[22] A. Hotho, R. Jäschke and S. C. Stumme, “Information Retrieval in Folksonomies: search and ranking,” Knowledge & Data Engineering Group, Department of Mathematics and Computer Science, University of Kassel, Wilhelmshöher Allee 73, D–34121 Kassel, 2006.

[23] S. A. Golder and B. A. Huberman, “The structure of Collaborative Tagging systems,” CoRR abs/cs/0508082, 18 August 2005.

[24] C. Cattuto, D. Benz, A. Hotho and G. Stumme, “Semantic grounding of tag relatedness in social Bookmarking Systems,” The Semantic Web - ISWC 2008, 7th International Semantic Web Conference, ISWC 2008. 26-30 October 2008, pp. 615-631.

[25] A. Hotho, R. Jäschke, C. Schmitz and G. Stumme, “FolkRank: A ranking algorithm for Folksonomies,” LWA . 9-11 October 2006, pp. 111-114.

[26] C. Cattuto, V. Loreto and L. Pietronero, “Collaborative Tagging and semiotic dynamics,” CoRR, May 2006.

[27] V. I. Levenshtein, “Binary codes capable of correcting deletions, insertions, and reversals,” 1966.

[28] S. S. Anand and B. Mobasher, “Introduction to intelligent techniques for web personalization,” ACM Transactions on Internet Technologies, Vol. 7, 2007.

[29] L. Kirchhoff, K. Stanoevska-Slabeva, T. Nicolai and M. Fleck, “Using social network analysis to enhance information retrieval systems,” Applications of Social Network Analysis, ASNA, 2008.

[30] L. Page, S. Brin, R. Motwani and T. Winograd, “The PageRank citation ranking: Bringing order to the web,” WWW’98, pp. 161–172, Brisbane, Australia, 1998.
